* [gentoo-commits] proj/linux-patches:3.16 commit in: /
@ 2014-07-15 12:18 Mike Pagano
0 siblings, 0 replies; 26+ messages in thread
From: Mike Pagano @ 2014-07-15 12:18 UTC (permalink / raw)
To: gentoo-commits
commit: 9f27167757173dcde5f5673d721e8dd7047df9e1
Author: Mike Pagano <mpagano <AT> gentoo <DOT> org>
AuthorDate: Tue Jul 15 12:18:08 2014 +0000
Commit: Mike Pagano <mpagano <AT> gentoo <DOT> org>
CommitDate: Tue Jul 15 12:18:08 2014 +0000
URL: http://git.overlays.gentoo.org/gitweb/?p=proj/linux-patches.git;a=commit;h=9f271677
Zero copy for infiniband psm userspace driver. ACPI: Disable Windows 8 compatibility for some Lenovo ThinkPads. Ensure that /dev/root doesn't appear in /proc/mounts when booting without an initramfs. Do not lock when UMH is waiting on current thread spawned by linuxrc (bug #481344). Bootsplash ported by Uladzimir Bely (bug #513334). Support for Pogoplug e02 (bug #460350), adjusted to be opt-in by TomWij. Add Gentoo Linux support config settings and defaults.
---
0000_README | 25 +
2400_kcopy-patch-for-infiniband-driver.patch | 731 +++++++++
2700_ThinkPad-30-brightness-control-fix.patch | 67 +
2900_dev-root-proc-mount-fix.patch | 29 +
2905_2disk-resume-image-fix.patch | 24 +
4200_fbcondecor-3.15.patch | 2119 +++++++++++++++++++++++++
4500_support-for-pogoplug-e02.patch | 172 ++
7 files changed, 3167 insertions(+)
diff --git a/0000_README b/0000_README
index 9018993..6276507 100644
--- a/0000_README
+++ b/0000_README
@@ -43,6 +43,31 @@ EXPERIMENTAL
Individual Patch Descriptions:
--------------------------------------------------------------------------
+Patch: 2400_kcopy-patch-for-infiniband-driver.patch
+From: Alexey Shvetsov <alexxy@gentoo.org>
+Desc: Zero copy for infiniband psm userspace driver
+
+Patch: 2700_ThinkPad-30-brightness-control-fix.patch
+From: Seth Forshee <seth.forshee@canonical.com>
+Desc: ACPI: Disable Windows 8 compatibility for some Lenovo ThinkPads
+
+Patch: 2900_dev-root-proc-mount-fix.patch
+From: https://bugs.gentoo.org/show_bug.cgi?id=438380
+Desc: Ensure that /dev/root doesn't appear in /proc/mounts when booting without an initramfs.
+
+Patch: 2905_2disk-resume-image-fix.patch
+From: Al Viro <viro <at> ZenIV.linux.org.uk>
+Desc: Do not lock when UMH is waiting on current thread spawned by linuxrc. (bug #481344)
+
+Patch: 4200_fbcondecor-3.15.patch
+From: http://www.mepiscommunity.org/fbcondecor
+Desc: Bootsplash ported by Uladzimir Bely (bug #513334)
+
+Patch: 4500_support-for-pogoplug-e02.patch
+From: Christoph Junghans <ottxor@gentoo.org>
+Desc: Support for Pogoplug e02 (bug #460350), adjusted to be opt-in by TomWij.
+
Patch: 4567_distro-Gentoo-Kconfig.patch
From: Tom Wijsman <TomWij@gentoo.org>
Desc: Add Gentoo Linux support config settings and defaults.
+
diff --git a/2400_kcopy-patch-for-infiniband-driver.patch b/2400_kcopy-patch-for-infiniband-driver.patch
new file mode 100644
index 0000000..759f451
--- /dev/null
+++ b/2400_kcopy-patch-for-infiniband-driver.patch
@@ -0,0 +1,731 @@
+From 1f52075d672a9bdd0069b3ea68be266ef5c229bd Mon Sep 17 00:00:00 2001
+From: Alexey Shvetsov <alexxy@gentoo.org>
+Date: Tue, 17 Jan 2012 21:08:49 +0400
+Subject: [PATCH] [kcopy] Add kcopy driver
+
+Add kcopy driver from qlogic to implement zero copy for infiniband psm
+userspace driver
+
+Signed-off-by: Alexey Shvetsov <alexxy@gentoo.org>
+---
+ drivers/char/Kconfig | 2 +
+ drivers/char/Makefile | 2 +
+ drivers/char/kcopy/Kconfig | 17 ++
+ drivers/char/kcopy/Makefile | 4 +
+ drivers/char/kcopy/kcopy.c | 646 +++++++++++++++++++++++++++++++++++++++++++
+ 5 files changed, 671 insertions(+)
+ create mode 100644 drivers/char/kcopy/Kconfig
+ create mode 100644 drivers/char/kcopy/Makefile
+ create mode 100644 drivers/char/kcopy/kcopy.c
+
+diff --git a/drivers/char/Kconfig b/drivers/char/Kconfig
+index ee94686..5b81449 100644
+--- a/drivers/char/Kconfig
++++ b/drivers/char/Kconfig
+@@ -6,6 +6,8 @@ menu "Character devices"
+
+ source "drivers/tty/Kconfig"
+
++source "drivers/char/kcopy/Kconfig"
++
+ config DEVKMEM
+ bool "/dev/kmem virtual device support"
+ default y
+diff --git a/drivers/char/Makefile b/drivers/char/Makefile
+index 0dc5d7c..be519d6 100644
+--- a/drivers/char/Makefile
++++ b/drivers/char/Makefile
+@@ -62,3 +62,5 @@
+ js-rtc-y = rtc.o
+
+ obj-$(CONFIG_TILE_SROM) += tile-srom.o
++
++obj-$(CONFIG_KCOPY) += kcopy/
+diff --git a/drivers/char/kcopy/Kconfig b/drivers/char/kcopy/Kconfig
+new file mode 100644
+index 0000000..453ae52
+--- /dev/null
++++ b/drivers/char/kcopy/Kconfig
+@@ -0,0 +1,17 @@
++#
++# KCopy character device configuration
++#
++
++menu "KCopy"
++
++config KCOPY
++ tristate "Memory-to-memory copies using kernel assist"
++ default m
++ ---help---
++ High-performance inter-process memory copies. Can often save a
++ memory copy to shared memory in the application. Useful at least
++ for MPI applications where the point-to-point nature of vmsplice
++ and pipes can be a limiting factor in performance.
++
++endmenu
++
+diff --git a/drivers/char/kcopy/Makefile b/drivers/char/kcopy/Makefile
+new file mode 100644
+index 0000000..9cb269b
+--- /dev/null
++++ b/drivers/char/kcopy/Makefile
+@@ -0,0 +1,4 @@
++#
++# Makefile for the kernel character device drivers.
++#
++obj-$(CONFIG_KCOPY) += kcopy.o
+diff --git a/drivers/char/kcopy/kcopy.c b/drivers/char/kcopy/kcopy.c
+new file mode 100644
+index 0000000..a9f915c
+--- /dev/null
++++ b/drivers/char/kcopy/kcopy.c
+@@ -0,0 +1,646 @@
++#include <linux/module.h>
++#include <linux/fs.h>
++#include <linux/cdev.h>
++#include <linux/device.h>
++#include <linux/mutex.h>
++#include <linux/mman.h>
++#include <linux/highmem.h>
++#include <linux/spinlock.h>
++#include <linux/sched.h>
++#include <linux/rbtree.h>
++#include <linux/rcupdate.h>
++#include <linux/uaccess.h>
++#include <linux/slab.h>
++
++MODULE_LICENSE("GPL");
++MODULE_AUTHOR("Arthur Jones <arthur.jones@qlogic.com>");
++MODULE_DESCRIPTION("QLogic kcopy driver");
++
++#define KCOPY_ABI 1
++#define KCOPY_MAX_MINORS 64
++
++struct kcopy_device {
++ struct cdev cdev;
++ struct class *class;
++ struct device *devp[KCOPY_MAX_MINORS];
++ dev_t dev;
++
++ struct kcopy_file *kf[KCOPY_MAX_MINORS];
++ struct mutex open_lock;
++};
++
++static struct kcopy_device kcopy_dev;
++
++/* per file data / one of these is shared per minor */
++struct kcopy_file {
++ int count;
++
++ /* pid indexed */
++ struct rb_root live_map_tree;
++
++ struct mutex map_lock;
++};
++
++struct kcopy_map_entry {
++ int count;
++ struct task_struct *task;
++ pid_t pid;
++ struct kcopy_file *file; /* file backpointer */
++
++ struct list_head list; /* free map list */
++ struct rb_node node; /* live map tree */
++};
++
++#define KCOPY_GET_SYSCALL 1
++#define KCOPY_PUT_SYSCALL 2
++#define KCOPY_ABI_SYSCALL 3
++
++struct kcopy_syscall {
++ __u32 tag;
++ pid_t pid;
++ __u64 n;
++ __u64 src;
++ __u64 dst;
++};
++
++static const void __user *kcopy_syscall_src(const struct kcopy_syscall *ks)
++{
++ return (const void __user *) (unsigned long) ks->src;
++}
++
++static void __user *kcopy_syscall_dst(const struct kcopy_syscall *ks)
++{
++ return (void __user *) (unsigned long) ks->dst;
++}
++
++static unsigned long kcopy_syscall_n(const struct kcopy_syscall *ks)
++{
++ return (unsigned long) ks->n;
++}
++
++static struct kcopy_map_entry *kcopy_create_entry(struct kcopy_file *file)
++{
++ struct kcopy_map_entry *kme =
++ kmalloc(sizeof(struct kcopy_map_entry), GFP_KERNEL);
++
++ if (!kme)
++ return NULL;
++
++ kme->count = 1;
++ kme->file = file;
++ kme->task = current;
++ kme->pid = current->tgid;
++ INIT_LIST_HEAD(&kme->list);
++
++ return kme;
++}
++
++static struct kcopy_map_entry *
++kcopy_lookup_pid(struct rb_root *root, pid_t pid)
++{
++ struct rb_node *node = root->rb_node;
++
++ while (node) {
++ struct kcopy_map_entry *kme =
++ container_of(node, struct kcopy_map_entry, node);
++
++ if (pid < kme->pid)
++ node = node->rb_left;
++ else if (pid > kme->pid)
++ node = node->rb_right;
++ else
++ return kme;
++ }
++
++ return NULL;
++}
++
++static int kcopy_insert(struct rb_root *root, struct kcopy_map_entry *kme)
++{
++ struct rb_node **new = &(root->rb_node);
++ struct rb_node *parent = NULL;
++
++ while (*new) {
++ struct kcopy_map_entry *tkme =
++ container_of(*new, struct kcopy_map_entry, node);
++
++ parent = *new;
++ if (kme->pid < tkme->pid)
++ new = &((*new)->rb_left);
++ else if (kme->pid > tkme->pid)
++ new = &((*new)->rb_right);
++ else {
++ printk(KERN_INFO "!!! debugging: bad rb tree !!!\n");
++ return -EINVAL;
++ }
++ }
++
++ rb_link_node(&kme->node, parent, new);
++ rb_insert_color(&kme->node, root);
++
++ return 0;
++}
++
++static int kcopy_open(struct inode *inode, struct file *filp)
++{
++ int ret;
++ const int minor = iminor(inode);
++ struct kcopy_file *kf = NULL;
++ struct kcopy_map_entry *kme;
++ struct kcopy_map_entry *okme;
++
++ if (minor < 0 || minor >= KCOPY_MAX_MINORS)
++ return -ENODEV;
++
++ mutex_lock(&kcopy_dev.open_lock);
++
++ if (!kcopy_dev.kf[minor]) {
++ kf = kmalloc(sizeof(struct kcopy_file), GFP_KERNEL);
++
++ if (!kf) {
++ ret = -ENOMEM;
++ goto bail;
++ }
++
++ kf->count = 1;
++ kf->live_map_tree = RB_ROOT;
++ mutex_init(&kf->map_lock);
++ kcopy_dev.kf[minor] = kf;
++ } else {
++ if (filp->f_flags & O_EXCL) {
++ ret = -EBUSY;
++ goto bail;
++ }
++ kcopy_dev.kf[minor]->count++;
++ }
++
++ kme = kcopy_create_entry(kcopy_dev.kf[minor]);
++ if (!kme) {
++ ret = -ENOMEM;
++ goto err_free_kf;
++ }
++
++ kf = kcopy_dev.kf[minor];
++
++ mutex_lock(&kf->map_lock);
++
++ okme = kcopy_lookup_pid(&kf->live_map_tree, kme->pid);
++ if (okme) {
++ /* pid already exists... */
++ okme->count++;
++ kfree(kme);
++ kme = okme;
++ } else
++ ret = kcopy_insert(&kf->live_map_tree, kme);
++
++ mutex_unlock(&kf->map_lock);
++
++ filp->private_data = kme;
++
++ ret = 0;
++ goto bail;
++
++err_free_kf:
++ if (kf) {
++ kcopy_dev.kf[minor] = NULL;
++ kfree(kf);
++ }
++bail:
++ mutex_unlock(&kcopy_dev.open_lock);
++ return ret;
++}
++
++static int kcopy_flush(struct file *filp, fl_owner_t id)
++{
++ struct kcopy_map_entry *kme = filp->private_data;
++ struct kcopy_file *kf = kme->file;
++
++ if (file_count(filp) == 1) {
++ mutex_lock(&kf->map_lock);
++ kme->count--;
++
++ if (!kme->count) {
++ rb_erase(&kme->node, &kf->live_map_tree);
++ kfree(kme);
++ }
++ mutex_unlock(&kf->map_lock);
++ }
++
++ return 0;
++}
++
++static int kcopy_release(struct inode *inode, struct file *filp)
++{
++ const int minor = iminor(inode);
++
++ mutex_lock(&kcopy_dev.open_lock);
++ kcopy_dev.kf[minor]->count--;
++ if (!kcopy_dev.kf[minor]->count) {
++ kfree(kcopy_dev.kf[minor]);
++ kcopy_dev.kf[minor] = NULL;
++ }
++ mutex_unlock(&kcopy_dev.open_lock);
++
++ return 0;
++}
++
++static void kcopy_put_pages(struct page **pages, int npages)
++{
++ int j;
++
++ for (j = 0; j < npages; j++)
++ put_page(pages[j]);
++}
++
++static int kcopy_validate_task(struct task_struct *p)
++{
++ return p && (uid_eq(current_euid(), task_euid(p)) || uid_eq(current_euid(), task_uid(p)));
++}
++
++static int kcopy_get_pages(struct kcopy_file *kf, pid_t pid,
++ struct page **pages, void __user *addr,
++ int write, size_t npages)
++{
++ int err;
++ struct mm_struct *mm;
++ struct kcopy_map_entry *rkme;
++
++ mutex_lock(&kf->map_lock);
++
++ rkme = kcopy_lookup_pid(&kf->live_map_tree, pid);
++ if (!rkme || !kcopy_validate_task(rkme->task)) {
++ err = -EINVAL;
++ goto bail_unlock;
++ }
++
++ mm = get_task_mm(rkme->task);
++ if (unlikely(!mm)) {
++ err = -ENOMEM;
++ goto bail_unlock;
++ }
++
++ down_read(&mm->mmap_sem);
++ err = get_user_pages(rkme->task, mm,
++ (unsigned long) addr, npages, write, 0,
++ pages, NULL);
++
++ if (err < npages && err > 0) {
++ kcopy_put_pages(pages, err);
++ err = -ENOMEM;
++ } else if (err == npages)
++ err = 0;
++
++ up_read(&mm->mmap_sem);
++
++ mmput(mm);
++
++bail_unlock:
++ mutex_unlock(&kf->map_lock);
++
++ return err;
++}
++
++static unsigned long kcopy_copy_pages_from_user(void __user *src,
++ struct page **dpages,
++ unsigned doff,
++ unsigned long n)
++{
++ struct page *dpage = *dpages;
++ char *daddr = kmap(dpage);
++ int ret = 0;
++
++ while (1) {
++ const unsigned long nleft = PAGE_SIZE - doff;
++ const unsigned long nc = (n < nleft) ? n : nleft;
++
++ /* if (copy_from_user(daddr + doff, src, nc)) { */
++ if (__copy_from_user_nocache(daddr + doff, src, nc)) {
++ ret = -EFAULT;
++ goto bail;
++ }
++
++ n -= nc;
++ if (n == 0)
++ break;
++
++ doff += nc;
++ doff &= ~PAGE_MASK;
++ if (doff == 0) {
++ kunmap(dpage);
++ dpages++;
++ dpage = *dpages;
++ daddr = kmap(dpage);
++ }
++
++ src += nc;
++ }
++
++bail:
++ kunmap(dpage);
++
++ return ret;
++}
++
++static unsigned long kcopy_copy_pages_to_user(void __user *dst,
++ struct page **spages,
++ unsigned soff,
++ unsigned long n)
++{
++ struct page *spage = *spages;
++ const char *saddr = kmap(spage);
++ int ret = 0;
++
++ while (1) {
++ const unsigned long nleft = PAGE_SIZE - soff;
++ const unsigned long nc = (n < nleft) ? n : nleft;
++
++ if (copy_to_user(dst, saddr + soff, nc)) {
++ ret = -EFAULT;
++ goto bail;
++ }
++
++ n -= nc;
++ if (n == 0)
++ break;
++
++ soff += nc;
++ soff &= ~PAGE_MASK;
++ if (soff == 0) {
++ kunmap(spage);
++ spages++;
++ spage = *spages;
++ saddr = kmap(spage);
++ }
++
++ dst += nc;
++ }
++
++bail:
++ kunmap(spage);
++
++ return ret;
++}
++
++static unsigned long kcopy_copy_to_user(void __user *dst,
++ struct kcopy_file *kf, pid_t pid,
++ void __user *src,
++ unsigned long n)
++{
++ struct page **pages;
++ const int pages_len = PAGE_SIZE / sizeof(struct page *);
++ int ret = 0;
++
++ pages = (struct page **) __get_free_page(GFP_KERNEL);
++ if (!pages) {
++ ret = -ENOMEM;
++ goto bail;
++ }
++
++ while (n) {
++ const unsigned long soff = (unsigned long) src & ~PAGE_MASK;
++ const unsigned long spages_left =
++ (soff + n + PAGE_SIZE - 1) >> PAGE_SHIFT;
++ const unsigned long spages_cp =
++ min_t(unsigned long, spages_left, pages_len);
++ const unsigned long sbytes =
++ PAGE_SIZE - soff + (spages_cp - 1) * PAGE_SIZE;
++ const unsigned long nbytes = min_t(unsigned long, sbytes, n);
++
++ ret = kcopy_get_pages(kf, pid, pages, src, 0, spages_cp);
++ if (unlikely(ret))
++ goto bail_free;
++
++ ret = kcopy_copy_pages_to_user(dst, pages, soff, nbytes);
++ kcopy_put_pages(pages, spages_cp);
++ if (ret)
++ goto bail_free;
++ dst = (char *) dst + nbytes;
++ src = (char *) src + nbytes;
++
++ n -= nbytes;
++ }
++
++bail_free:
++ free_page((unsigned long) pages);
++bail:
++ return ret;
++}
++
++static unsigned long kcopy_copy_from_user(const void __user *src,
++ struct kcopy_file *kf, pid_t pid,
++ void __user *dst,
++ unsigned long n)
++{
++ struct page **pages;
++ const int pages_len = PAGE_SIZE / sizeof(struct page *);
++ int ret = 0;
++
++ pages = (struct page **) __get_free_page(GFP_KERNEL);
++ if (!pages) {
++ ret = -ENOMEM;
++ goto bail;
++ }
++
++ while (n) {
++ const unsigned long doff = (unsigned long) dst & ~PAGE_MASK;
++ const unsigned long dpages_left =
++ (doff + n + PAGE_SIZE - 1) >> PAGE_SHIFT;
++ const unsigned long dpages_cp =
++ min_t(unsigned long, dpages_left, pages_len);
++ const unsigned long dbytes =
++ PAGE_SIZE - doff + (dpages_cp - 1) * PAGE_SIZE;
++ const unsigned long nbytes = min_t(unsigned long, dbytes, n);
++
++ ret = kcopy_get_pages(kf, pid, pages, dst, 1, dpages_cp);
++ if (unlikely(ret))
++ goto bail_free;
++
++ ret = kcopy_copy_pages_from_user((void __user *) src,
++ pages, doff, nbytes);
++ kcopy_put_pages(pages, dpages_cp);
++ if (ret)
++ goto bail_free;
++
++ dst = (char *) dst + nbytes;
++ src = (char *) src + nbytes;
++
++ n -= nbytes;
++ }
++
++bail_free:
++ free_page((unsigned long) pages);
++bail:
++ return ret;
++}
++
++static int kcopy_do_get(struct kcopy_map_entry *kme, pid_t pid,
++ const void __user *src, void __user *dst,
++ unsigned long n)
++{
++ struct kcopy_file *kf = kme->file;
++ int ret = 0;
++
++ if (n == 0) {
++ ret = -EINVAL;
++ goto bail;
++ }
++
++ ret = kcopy_copy_to_user(dst, kf, pid, (void __user *) src, n);
++
++bail:
++ return ret;
++}
++
++static int kcopy_do_put(struct kcopy_map_entry *kme, const void __user *src,
++ pid_t pid, void __user *dst,
++ unsigned long n)
++{
++ struct kcopy_file *kf = kme->file;
++ int ret = 0;
++
++ if (n == 0) {
++ ret = -EINVAL;
++ goto bail;
++ }
++
++ ret = kcopy_copy_from_user(src, kf, pid, (void __user *) dst, n);
++
++bail:
++ return ret;
++}
++
++static int kcopy_do_abi(u32 __user *dst)
++{
++ u32 val = KCOPY_ABI;
++ int err;
++
++ err = put_user(val, dst);
++ if (err)
++ return -EFAULT;
++
++ return 0;
++}
++
++ssize_t kcopy_write(struct file *filp, const char __user *data, size_t cnt,
++ loff_t *o)
++{
++ struct kcopy_map_entry *kme = filp->private_data;
++ struct kcopy_syscall ks;
++ int err = 0;
++ const void __user *src;
++ void __user *dst;
++ unsigned long n;
++
++ if (cnt != sizeof(struct kcopy_syscall)) {
++ err = -EINVAL;
++ goto bail;
++ }
++
++ err = copy_from_user(&ks, data, cnt);
++ if (unlikely(err))
++ goto bail;
++
++ src = kcopy_syscall_src(&ks);
++ dst = kcopy_syscall_dst(&ks);
++ n = kcopy_syscall_n(&ks);
++ if (ks.tag == KCOPY_GET_SYSCALL)
++ err = kcopy_do_get(kme, ks.pid, src, dst, n);
++ else if (ks.tag == KCOPY_PUT_SYSCALL)
++ err = kcopy_do_put(kme, src, ks.pid, dst, n);
++ else if (ks.tag == KCOPY_ABI_SYSCALL)
++ err = kcopy_do_abi(dst);
++ else
++ err = -EINVAL;
++
++bail:
++ return err ? err : cnt;
++}
++
++static const struct file_operations kcopy_fops = {
++ .owner = THIS_MODULE,
++ .open = kcopy_open,
++ .release = kcopy_release,
++ .flush = kcopy_flush,
++ .write = kcopy_write,
++};
++
++static int __init kcopy_init(void)
++{
++ int ret;
++ const char *name = "kcopy";
++ int i;
++ int ninit = 0;
++
++ mutex_init(&kcopy_dev.open_lock);
++
++ ret = alloc_chrdev_region(&kcopy_dev.dev, 0, KCOPY_MAX_MINORS, name);
++ if (ret)
++ goto bail;
++
++ kcopy_dev.class = class_create(THIS_MODULE, (char *) name);
++
++ if (IS_ERR(kcopy_dev.class)) {
++ ret = PTR_ERR(kcopy_dev.class);
++ printk(KERN_ERR "kcopy: Could not create "
++ "device class (err %d)\n", -ret);
++ goto bail_chrdev;
++ }
++
++ cdev_init(&kcopy_dev.cdev, &kcopy_fops);
++ ret = cdev_add(&kcopy_dev.cdev, kcopy_dev.dev, KCOPY_MAX_MINORS);
++ if (ret < 0) {
++ printk(KERN_ERR "kcopy: Could not add cdev (err %d)\n",
++ -ret);
++ goto bail_class;
++ }
++
++ for (i = 0; i < KCOPY_MAX_MINORS; i++) {
++ char devname[8];
++ const int minor = MINOR(kcopy_dev.dev) + i;
++ const dev_t dev = MKDEV(MAJOR(kcopy_dev.dev), minor);
++
++ snprintf(devname, sizeof(devname), "kcopy%02d", i);
++ kcopy_dev.devp[i] =
++ device_create(kcopy_dev.class, NULL,
++ dev, NULL, devname);
++
++ if (IS_ERR(kcopy_dev.devp[i])) {
++ ret = PTR_ERR(kcopy_dev.devp[i]);
++ printk(KERN_ERR "kcopy: Could not create "
++ "devp %d (err %d)\n", i, -ret);
++ goto bail_cdev_add;
++ }
++
++ ninit++;
++ }
++
++ ret = 0;
++ goto bail;
++
++bail_cdev_add:
++ for (i = 0; i < ninit; i++)
++ device_unregister(kcopy_dev.devp[i]);
++
++ cdev_del(&kcopy_dev.cdev);
++bail_class:
++ class_destroy(kcopy_dev.class);
++bail_chrdev:
++ unregister_chrdev_region(kcopy_dev.dev, KCOPY_MAX_MINORS);
++bail:
++ return ret;
++}
++
++static void __exit kcopy_fini(void)
++{
++ int i;
++
++ for (i = 0; i < KCOPY_MAX_MINORS; i++)
++ device_unregister(kcopy_dev.devp[i]);
++
++ cdev_del(&kcopy_dev.cdev);
++ class_destroy(kcopy_dev.class);
++ unregister_chrdev_region(kcopy_dev.dev, KCOPY_MAX_MINORS);
++}
++
++module_init(kcopy_init);
++module_exit(kcopy_fini);
+--
+1.7.10
+
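[Editorial note, not part of the patch: the kcopy driver above is driven entirely through write() on its character device — kcopy_write() rejects any write whose length differs from sizeof(struct kcopy_syscall). The sketch below shows how a userspace client might build such a request. The struct layout mirrors struct kcopy_syscall from kcopy.c; the fixed-width types and any device path (e.g. /dev/kcopy00) are illustrative assumptions, not part of the driver's ABI guarantees.]

```c
#include <stdint.h>
#include <string.h>
#include <sys/types.h>

/* Tags from kcopy.c */
#define KCOPY_GET_SYSCALL 1
#define KCOPY_PUT_SYSCALL 2
#define KCOPY_ABI_SYSCALL 3

/* Userspace mirror of the driver's struct kcopy_syscall; the kernel side
 * uses __u32/__u64, which these fixed-width types are assumed to match. */
struct kcopy_syscall {
	uint32_t tag;
	pid_t    pid;
	uint64_t n;
	uint64_t src;
	uint64_t dst;
};

/* Build a KCOPY_GET_SYSCALL request: ask the driver to copy n bytes
 * from address peer_src in the peer process into our local_dst buffer.
 * A real client would then write() exactly sizeof(ks) bytes of this
 * struct to the kcopy device it opened. */
static struct kcopy_syscall kcopy_get_request(pid_t peer, const void *peer_src,
					      void *local_dst, uint64_t n)
{
	struct kcopy_syscall ks;

	memset(&ks, 0, sizeof(ks));
	ks.tag = KCOPY_GET_SYSCALL;
	ks.pid = peer;
	ks.n   = n;
	ks.src = (uint64_t)(uintptr_t)peer_src;
	ks.dst = (uint64_t)(uintptr_t)local_dst;
	return ks;
}
```

Note that both processes must have opened the same minor (they share one kcopy_file and its pid-indexed rb-tree), and kcopy_validate_task() restricts copies to tasks with a matching effective or real uid.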
diff --git a/2700_ThinkPad-30-brightness-control-fix.patch b/2700_ThinkPad-30-brightness-control-fix.patch
new file mode 100644
index 0000000..b548c6d
--- /dev/null
+++ b/2700_ThinkPad-30-brightness-control-fix.patch
@@ -0,0 +1,67 @@
+diff --git a/drivers/acpi/blacklist.c b/drivers/acpi/blacklist.c
+index cb96296..6c242ed 100644
+--- a/drivers/acpi/blacklist.c
++++ b/drivers/acpi/blacklist.c
+@@ -269,6 +276,61 @@ static struct dmi_system_id acpi_osi_dmi_table[] __initdata = {
+ },
+
+ /*
++ * The following Lenovo models have a broken workaround in the
++ * acpi_video backlight implementation to meet the Windows 8
++ * requirement of 101 backlight levels. Reverting to pre-Win8
++ * behavior fixes the problem.
++ */
++ {
++ .callback = dmi_disable_osi_win8,
++ .ident = "Lenovo ThinkPad L430",
++ .matches = {
++ DMI_MATCH(DMI_SYS_VENDOR, "LENOVO"),
++ DMI_MATCH(DMI_PRODUCT_VERSION, "ThinkPad L430"),
++ },
++ },
++ {
++ .callback = dmi_disable_osi_win8,
++ .ident = "Lenovo ThinkPad T430s",
++ .matches = {
++ DMI_MATCH(DMI_SYS_VENDOR, "LENOVO"),
++ DMI_MATCH(DMI_PRODUCT_VERSION, "ThinkPad T430s"),
++ },
++ },
++ {
++ .callback = dmi_disable_osi_win8,
++ .ident = "Lenovo ThinkPad T530",
++ .matches = {
++ DMI_MATCH(DMI_SYS_VENDOR, "LENOVO"),
++ DMI_MATCH(DMI_PRODUCT_VERSION, "ThinkPad T530"),
++ },
++ },
++ {
++ .callback = dmi_disable_osi_win8,
++ .ident = "Lenovo ThinkPad W530",
++ .matches = {
++ DMI_MATCH(DMI_SYS_VENDOR, "LENOVO"),
++ DMI_MATCH(DMI_PRODUCT_VERSION, "ThinkPad W530"),
++ },
++ },
++ {
++ .callback = dmi_disable_osi_win8,
++ .ident = "Lenovo ThinkPad X1 Carbon",
++ .matches = {
++ DMI_MATCH(DMI_SYS_VENDOR, "LENOVO"),
++ DMI_MATCH(DMI_PRODUCT_VERSION, "ThinkPad X1 Carbon"),
++ },
++ },
++ {
++ .callback = dmi_disable_osi_win8,
++ .ident = "Lenovo ThinkPad X230",
++ .matches = {
++ DMI_MATCH(DMI_SYS_VENDOR, "LENOVO"),
++ DMI_MATCH(DMI_PRODUCT_VERSION, "ThinkPad X230"),
++ },
++ },
++
++ /*
+ * BIOS invocation of _OSI(Linux) is almost always a BIOS bug.
+ * Linux ignores it, except for the machines enumerated below.
+ */
+
diff --git a/2900_dev-root-proc-mount-fix.patch b/2900_dev-root-proc-mount-fix.patch
new file mode 100644
index 0000000..4c89adf
--- /dev/null
+++ b/2900_dev-root-proc-mount-fix.patch
@@ -0,0 +1,29 @@
+--- a/init/do_mounts.c 2013-01-25 19:11:11.609802424 -0500
++++ b/init/do_mounts.c 2013-01-25 19:14:20.606053568 -0500
+@@ -461,7 +461,10 @@ void __init change_floppy(char *fmt, ...
+ va_start(args, fmt);
+ vsprintf(buf, fmt, args);
+ va_end(args);
+- fd = sys_open("/dev/root", O_RDWR | O_NDELAY, 0);
++ if (saved_root_name[0])
++ fd = sys_open(saved_root_name, O_RDWR | O_NDELAY, 0);
++ else
++ fd = sys_open("/dev/root", O_RDWR | O_NDELAY, 0);
+ if (fd >= 0) {
+ sys_ioctl(fd, FDEJECT, 0);
+ sys_close(fd);
+@@ -505,7 +508,13 @@ void __init mount_root(void)
+ #endif
+ #ifdef CONFIG_BLOCK
+ create_dev("/dev/root", ROOT_DEV);
+- mount_block_root("/dev/root", root_mountflags);
++ if (saved_root_name[0]) {
++ create_dev(saved_root_name, ROOT_DEV);
++ mount_block_root(saved_root_name, root_mountflags);
++ } else {
++ create_dev("/dev/root", ROOT_DEV);
++ mount_block_root("/dev/root", root_mountflags);
++ }
+ #endif
+ }
+
diff --git a/2905_2disk-resume-image-fix.patch b/2905_2disk-resume-image-fix.patch
new file mode 100644
index 0000000..7e95d29
--- /dev/null
+++ b/2905_2disk-resume-image-fix.patch
@@ -0,0 +1,24 @@
+diff --git a/kernel/kmod.c b/kernel/kmod.c
+index fb32636..d968882 100644
+--- a/kernel/kmod.c
++++ b/kernel/kmod.c
+@@ -575,7 +575,8 @@
+ call_usermodehelper_freeinfo(sub_info);
+ return -EINVAL;
+ }
+- helper_lock();
++ if (!(current->flags & PF_FREEZER_SKIP))
++ helper_lock();
+ if (!khelper_wq || usermodehelper_disabled) {
+ retval = -EBUSY;
+ goto out;
+@@ -611,7 +612,8 @@ wait_done:
+ out:
+ call_usermodehelper_freeinfo(sub_info);
+ unlock:
+- helper_unlock();
++ if (!(current->flags & PF_FREEZER_SKIP))
++ helper_unlock();
+ return retval;
+ }
+ EXPORT_SYMBOL(call_usermodehelper_exec);
diff --git a/4200_fbcondecor-3.15.patch b/4200_fbcondecor-3.15.patch
new file mode 100644
index 0000000..c96e5dc
--- /dev/null
+++ b/4200_fbcondecor-3.15.patch
@@ -0,0 +1,2119 @@
+diff --git a/Documentation/fb/00-INDEX b/Documentation/fb/00-INDEX
+index fe85e7c..2230930 100644
+--- a/Documentation/fb/00-INDEX
++++ b/Documentation/fb/00-INDEX
+@@ -23,6 +23,8 @@ ep93xx-fb.txt
+ - info on the driver for EP93xx LCD controller.
+ fbcon.txt
+ - intro to and usage guide for the framebuffer console (fbcon).
++fbcondecor.txt
++ - info on the Framebuffer Console Decoration
+ framebuffer.txt
+ - introduction to frame buffer devices.
+ gxfb.txt
+diff --git a/Documentation/fb/fbcondecor.txt b/Documentation/fb/fbcondecor.txt
+new file mode 100644
+index 0000000..3388c61
+--- /dev/null
++++ b/Documentation/fb/fbcondecor.txt
+@@ -0,0 +1,207 @@
++What is it?
++-----------
++
++The framebuffer decorations are a kernel feature which allows displaying a
++background picture on selected consoles.
++
++What do I need to get it to work?
++---------------------------------
++
++To get fbcondecor up-and-running you will have to:
++ 1) get a copy of splashutils [1] or a similar program
++ 2) get some fbcondecor themes
++ 3) build the kernel helper program
++ 4) build your kernel with the FB_CON_DECOR option enabled.
++
++To get fbcondecor operational right after fbcon initialization is finished, you
++will have to include a theme and the kernel helper into your initramfs image.
++Please refer to splashutils documentation for instructions on how to do that.
++
++[1] The splashutils package can be downloaded from:
++ http://github.com/alanhaggai/fbsplash
++
++The userspace helper
++--------------------
++
++The userspace fbcondecor helper (by default: /sbin/fbcondecor_helper) is called by the
++kernel whenever an important event occurs and the kernel needs some kind of
++job to be carried out. Important events include console switches and video
++mode switches (the kernel requests background images and configuration
++parameters for the current console). The fbcondecor helper must be accessible at
++all times. If it's not, fbcondecor will be switched off automatically.
++
++It's possible to set the path to the fbcondecor helper by writing it to
++/proc/sys/kernel/fbcondecor.
++
++*****************************************************************************
++
++The information below is mostly technical stuff. There's probably no need to
++read it unless you plan to develop a userspace helper.
++
++The fbcondecor protocol
++-----------------------
++
++The fbcondecor protocol defines a communication interface between the kernel and
++the userspace fbcondecor helper.
++
++The kernel side is responsible for:
++
++ * rendering console text, using an image as a background (instead of a
++ standard solid color fbcon uses),
++ * accepting commands from the user via ioctls on the fbcondecor device,
++ * calling the userspace helper to set things up as soon as the fb subsystem
++ is initialized.
++
++The userspace helper is responsible for everything else, including parsing
++configuration files, decompressing the image files whenever the kernel needs
++it, and communicating with the kernel if necessary.
++
++The fbcondecor protocol specifies how communication is done in both ways:
++kernel->userspace and userspace->kernel.
++
++Kernel -> Userspace
++-------------------
++
++The kernel communicates with the userspace helper by calling it and specifying
++the task to be done in a series of arguments.
++
++The arguments follow the pattern:
++<fbcondecor protocol version> <command> <parameters>
++
++All commands defined in fbcondecor protocol v2 have the following parameters:
++ virtual console
++ framebuffer number
++ theme
++
++Fbcondecor protocol v1 specified an additional 'fbcondecor mode' after the
++framebuffer number. Fbcondecor protocol v1 is deprecated and should not be used.
++
++Fbcondecor protocol v2 specifies the following commands:
++
++getpic
++------
++ The kernel issues this command to request image data. It's up to the
++ userspace helper to find a background image appropriate for the specified
++ theme and the current resolution. The userspace helper should respond by
++ issuing the FBIOCONDECOR_SETPIC ioctl.
++
++init
++----
++ The kernel issues this command after the fbcondecor device is created and
++ the fbcondecor interface is initialized. Upon receiving 'init', the userspace
++ helper should parse the kernel command line (/proc/cmdline) or otherwise
++ decide whether fbcondecor is to be activated.
++
++ To activate fbcondecor on the first console the helper should issue the
++ FBIOCONDECOR_SETCFG, FBIOCONDECOR_SETPIC and FBIOCONDECOR_SETSTATE commands,
++ in the above-mentioned order.
++
++ When the userspace helper is called in an early phase of the boot process
++ (right after the initialization of fbcon), no filesystems will be mounted.
++ The helper program should mount sysfs and then create the appropriate
++ framebuffer, fbcondecor and tty0 devices (if they don't already exist) to get
++ current display settings and to be able to communicate with the kernel side.
++ It should probably also mount the procfs to be able to parse the kernel
++ command line parameters.
++
++ Note that the console sem is not held when the kernel calls fbcondecor_helper
++ with the 'init' command. The fbcondecor helper should perform all ioctls with
++ origin set to FBCON_DECOR_IO_ORIG_USER.
++
++modechange
++----------
++ The kernel issues this command on a mode change. The helper's response should
++ be similar to the response to the 'init' command. Note that this time the
++ console sem is held and all ioctls must be performed with origin set to
++ FBCON_DECOR_IO_ORIG_KERNEL.
++
++
++Userspace -> Kernel
++-------------------
++
++Userspace programs can communicate with fbcondecor via ioctls on the
++fbcondecor device. These ioctls are to be used by both the userspace helper
++(called only by the kernel) and userspace configuration tools (run by the users).
++
++The fbcondecor helper should set the origin field to FBCON_DECOR_IO_ORIG_KERNEL
++when doing the appropriate ioctls. All userspace configuration tools should
++use FBCON_DECOR_IO_ORIG_USER. Failure to set the appropriate value in the origin
++field when performing ioctls from the kernel helper will most likely result
++in a console deadlock.
++
++FBCON_DECOR_IO_ORIG_KERNEL instructs fbcondecor not to try to acquire the console
++semaphore. Not surprisingly, FBCON_DECOR_IO_ORIG_USER instructs it to acquire
++the console sem.
++
++The framebuffer console decoration provides the following ioctls (all defined in
++linux/fb.h):
++
++FBIOCONDECOR_SETPIC
++description: loads a background picture for a virtual console
++argument: struct fbcon_decor_iowrapper*; data: struct fb_image*
++notes:
++If called for consoles other than the current foreground one, the picture data
++will be ignored.
++
++If the current virtual console is running in an 8-bpp mode, the cmap substruct
++of fb_image has to be filled appropriately: start should be set to 16 (first
++16 colors are reserved for fbcon), len to a value <= 240 and red, green and
++blue should point to valid cmap data. The transp field is ignored. The fields
++dx, dy, bg_color, fg_color in fb_image are ignored as well.
++
++FBIOCONDECOR_SETCFG
++description: sets the fbcondecor config for a virtual console
++argument: struct fbcon_decor_iowrapper*; data: struct vc_decor*
++notes: The structure has to be filled with valid data.
++
++FBIOCONDECOR_GETCFG
++description: gets the fbcondecor config for a virtual console
++argument: struct fbcon_decor_iowrapper*; data: struct vc_decor*
++
++FBIOCONDECOR_SETSTATE
++description: sets the fbcondecor state for a virtual console
++argument: struct fbcon_decor_iowrapper*; data: unsigned int*
++ values: 0 = disabled, 1 = enabled.
++
++FBIOCONDECOR_GETSTATE
++description: gets the fbcondecor state for a virtual console
++argument: struct fbcon_decor_iowrapper*; data: unsigned int*
++ values: as in FBIOCONDECOR_SETSTATE
++
++Info on used structures:
++
++Definition of struct vc_decor can be found in linux/console_decor.h. It's
++heavily commented. Note that the 'theme' field should point to a string
++no longer than FBCON_DECOR_THEME_LEN. When FBIOCONDECOR_GETCFG call is
++performed, the theme field should point to a char buffer of length
++FBCON_DECOR_THEME_LEN.
++
++Definition of struct fbcon_decor_iowrapper can be found in linux/fb.h.
++The fields in this struct have the following meaning:
++
++vc:
++Virtual console number.
++
++origin:
++Specifies whether the ioctl is performed in response to a kernel request. The
++fbcondecor helper should set this field to FBCON_DECOR_IO_ORIG_KERNEL; userspace
++programs should set it to FBCON_DECOR_IO_ORIG_USER. This field is necessary to
++avoid console semaphore deadlocks.
++
++data:
++Pointer to a data structure appropriate for the performed ioctl. The type of
++the data struct is specified in each ioctl's description.
++
++*****************************************************************************
++
++Credit
++------
++
++Original 'bootsplash' project & implementation by:
++ Volker Poplawski <volker@poplawski.de>, Stefan Reinauer <stepan@suse.de>,
++ Steffen Winterfeldt <snwint@suse.de>, Michael Schroeder <mls@suse.de>,
++ Ken Wimer <wimer@suse.de>.
++
++Fbcondecor, fbcondecor protocol design, current implementation & docs by:
++ Michal Januszewski <michalj+fbcondecor@gmail.com>
++
+diff --git a/drivers/Makefile b/drivers/Makefile
+index 7183b6a..d576148 100644
+--- a/drivers/Makefile
++++ b/drivers/Makefile
+@@ -17,6 +17,10 @@ obj-y += pwm/
+ obj-$(CONFIG_PCI) += pci/
+ obj-$(CONFIG_PARISC) += parisc/
+ obj-$(CONFIG_RAPIDIO) += rapidio/
++# tty/ comes before char/ so that the VT console is the boot-time
++# default.
++obj-y += tty/
++obj-y += char/
+ obj-y += video/
+ obj-y += idle/
+
+@@ -42,11 +46,6 @@ obj-$(CONFIG_REGULATOR) += regulator/
+ # reset controllers early, since gpu drivers might rely on them to initialize
+ obj-$(CONFIG_RESET_CONTROLLER) += reset/
+
+-# tty/ comes before char/ so that the VT console is the boot-time
+-# default.
+-obj-y += tty/
+-obj-y += char/
+-
+ # gpu/ comes after char for AGP vs DRM startup
+ obj-y += gpu/
+
+diff --git a/drivers/video/console/Kconfig b/drivers/video/console/Kconfig
+index fe1cd01..6d2e87a 100644
+--- a/drivers/video/console/Kconfig
++++ b/drivers/video/console/Kconfig
+@@ -126,6 +126,19 @@ config FRAMEBUFFER_CONSOLE_ROTATION
+ such that other users of the framebuffer will remain normally
+ oriented.
+
++config FB_CON_DECOR
++ bool "Support for the Framebuffer Console Decorations"
++ depends on FRAMEBUFFER_CONSOLE=y && !FB_TILEBLITTING
++ default n
++ ---help---
++ This option enables support for framebuffer console decorations which
++ makes it possible to display images in the background of the system
++ consoles. Note that userspace utilities are necessary in order to take
++ advantage of these features. Refer to Documentation/fb/fbcondecor.txt
++ for more information.
++
++ If unsure, say N.
++
+ config STI_CONSOLE
+ bool "STI text console"
+ depends on PARISC
+diff --git a/drivers/video/console/Makefile b/drivers/video/console/Makefile
+index 43bfa48..cc104b6 100644
+--- a/drivers/video/console/Makefile
++++ b/drivers/video/console/Makefile
+@@ -16,4 +16,5 @@ obj-$(CONFIG_FRAMEBUFFER_CONSOLE) += fbcon_rotate.o fbcon_cw.o fbcon_ud.o \
+ fbcon_ccw.o
+ endif
+
++obj-$(CONFIG_FB_CON_DECOR) += fbcondecor.o cfbcondecor.o
+ obj-$(CONFIG_FB_STI) += sticore.o
+diff --git a/drivers/video/console/bitblit.c b/drivers/video/console/bitblit.c
+index 61b182b..984384b 100644
+--- a/drivers/video/console/bitblit.c
++++ b/drivers/video/console/bitblit.c
+@@ -18,6 +18,7 @@
+ #include <linux/console.h>
+ #include <asm/types.h>
+ #include "fbcon.h"
++#include "fbcondecor.h"
+
+ /*
+ * Accelerated handlers.
+@@ -55,6 +56,13 @@ static void bit_bmove(struct vc_data *vc, struct fb_info *info, int sy,
+ area.height = height * vc->vc_font.height;
+ area.width = width * vc->vc_font.width;
+
++ if (fbcon_decor_active(info, vc)) {
++ area.sx += vc->vc_decor.tx;
++ area.sy += vc->vc_decor.ty;
++ area.dx += vc->vc_decor.tx;
++ area.dy += vc->vc_decor.ty;
++ }
++
+ info->fbops->fb_copyarea(info, &area);
+ }
+
+@@ -380,11 +388,15 @@ static void bit_cursor(struct vc_data *vc, struct fb_info *info, int mode,
+ cursor.image.depth = 1;
+ cursor.rop = ROP_XOR;
+
+- if (info->fbops->fb_cursor)
+- err = info->fbops->fb_cursor(info, &cursor);
++ if (fbcon_decor_active(info, vc)) {
++ fbcon_decor_cursor(info, &cursor);
++ } else {
++ if (info->fbops->fb_cursor)
++ err = info->fbops->fb_cursor(info, &cursor);
+
+- if (err)
+- soft_cursor(info, &cursor);
++ if (err)
++ soft_cursor(info, &cursor);
++ }
+
+ ops->cursor_reset = 0;
+ }
+diff --git a/drivers/video/console/cfbcondecor.c b/drivers/video/console/cfbcondecor.c
+new file mode 100644
+index 0000000..a2b4497
+--- /dev/null
++++ b/drivers/video/console/cfbcondecor.c
+@@ -0,0 +1,471 @@
++/*
++ * linux/drivers/video/cfbcon_decor.c -- Framebuffer decor render functions
++ *
++ * Copyright (C) 2004 Michal Januszewski <michalj+fbcondecor@gmail.com>
++ *
++ * Code based upon "Bootsplash" (C) 2001-2003
++ * Volker Poplawski <volker@poplawski.de>,
++ * Stefan Reinauer <stepan@suse.de>,
++ * Steffen Winterfeldt <snwint@suse.de>,
++ * Michael Schroeder <mls@suse.de>,
++ * Ken Wimer <wimer@suse.de>.
++ *
++ * This file is subject to the terms and conditions of the GNU General Public
++ * License. See the file COPYING in the main directory of this archive for
++ * more details.
++ */
++#include <linux/module.h>
++#include <linux/types.h>
++#include <linux/fb.h>
++#include <linux/selection.h>
++#include <linux/slab.h>
++#include <linux/vt_kern.h>
++#include <asm/irq.h>
++
++#include "fbcon.h"
++#include "fbcondecor.h"
++
++#define parse_pixel(shift,bpp,type) \
++ do { \
++ if (d & (0x80 >> (shift))) \
++ dd2[(shift)] = fgx; \
++ else \
++ dd2[(shift)] = transparent ? *(type *)decor_src : bgx; \
++ decor_src += (bpp); \
++ } while (0) \
++
++extern int get_color(struct vc_data *vc, struct fb_info *info,
++ u16 c, int is_fg);
++
++void fbcon_decor_fix_pseudo_pal(struct fb_info *info, struct vc_data *vc)
++{
++ int i, j, k;
++ int minlen = min(min(info->var.red.length, info->var.green.length),
++ info->var.blue.length);
++ u32 col;
++
++ for (j = i = 0; i < 16; i++) {
++ k = color_table[i];
++
++ col = ((vc->vc_palette[j++] >> (8-minlen))
++ << info->var.red.offset);
++ col |= ((vc->vc_palette[j++] >> (8-minlen))
++ << info->var.green.offset);
++ col |= ((vc->vc_palette[j++] >> (8-minlen))
++ << info->var.blue.offset);
++ ((u32 *)info->pseudo_palette)[k] = col;
++ }
++}
++
++void fbcon_decor_renderc(struct fb_info *info, int ypos, int xpos, int height,
++ int width, u8* src, u32 fgx, u32 bgx, u8 transparent)
++{
++ unsigned int x, y;
++ u32 dd;
++ int bytespp = ((info->var.bits_per_pixel + 7) >> 3);
++ unsigned int d = ypos * info->fix.line_length + xpos * bytespp;
++ unsigned int ds = (ypos * info->var.xres + xpos) * bytespp;
++ u16 dd2[4];
++
++ u8* decor_src = (u8 *)(info->bgdecor.data + ds);
++ u8* dst = (u8 *)(info->screen_base + d);
++
++ if ((ypos + height) > info->var.yres || (xpos + width) > info->var.xres)
++ return;
++
++ for (y = 0; y < height; y++) {
++ switch (info->var.bits_per_pixel) {
++
++ case 32:
++ for (x = 0; x < width; x++) {
++
++ if ((x & 7) == 0)
++ d = *src++;
++ if (d & 0x80)
++ dd = fgx;
++ else
++ dd = transparent ?
++ *(u32 *)decor_src : bgx;
++
++ d <<= 1;
++ decor_src += 4;
++ fb_writel(dd, dst);
++ dst += 4;
++ }
++ break;
++ case 24:
++ for (x = 0; x < width; x++) {
++
++ if ((x & 7) == 0)
++ d = *src++;
++ if (d & 0x80)
++ dd = fgx;
++ else
++ dd = transparent ?
++ (*(u32 *)decor_src & 0xffffff) : bgx;
++
++ d <<= 1;
++ decor_src += 3;
++#ifdef __LITTLE_ENDIAN
++ fb_writew(dd & 0xffff, dst);
++ dst += 2;
++ fb_writeb((dd >> 16), dst);
++#else
++ fb_writew(dd >> 8, dst);
++ dst += 2;
++ fb_writeb(dd & 0xff, dst);
++#endif
++ dst++;
++ }
++ break;
++ case 16:
++ for (x = 0; x < width; x += 2) {
++ if ((x & 7) == 0)
++ d = *src++;
++
++ parse_pixel(0, 2, u16);
++ parse_pixel(1, 2, u16);
++#ifdef __LITTLE_ENDIAN
++ dd = dd2[0] | (dd2[1] << 16);
++#else
++ dd = dd2[1] | (dd2[0] << 16);
++#endif
++ d <<= 2;
++ fb_writel(dd, dst);
++ dst += 4;
++ }
++ break;
++
++ case 8:
++ for (x = 0; x < width; x += 4) {
++ if ((x & 7) == 0)
++ d = *src++;
++
++ parse_pixel(0, 1, u8);
++ parse_pixel(1, 1, u8);
++ parse_pixel(2, 1, u8);
++ parse_pixel(3, 1, u8);
++
++#ifdef __LITTLE_ENDIAN
++ dd = dd2[0] | (dd2[1] << 8) | (dd2[2] << 16) | (dd2[3] << 24);
++#else
++ dd = dd2[3] | (dd2[2] << 8) | (dd2[1] << 16) | (dd2[0] << 24);
++#endif
++ d <<= 4;
++ fb_writel(dd, dst);
++ dst += 4;
++ }
++ }
++
++ dst += info->fix.line_length - width * bytespp;
++ decor_src += (info->var.xres - width) * bytespp;
++ }
++}
++
++#define cc2cx(a) \
++ ((info->fix.visual == FB_VISUAL_TRUECOLOR || \
++ info->fix.visual == FB_VISUAL_DIRECTCOLOR) ? \
++ ((u32*)info->pseudo_palette)[a] : a)
++
++void fbcon_decor_putcs(struct vc_data *vc, struct fb_info *info,
++ const unsigned short *s, int count, int yy, int xx)
++{
++ unsigned short charmask = vc->vc_hi_font_mask ? 0x1ff : 0xff;
++ struct fbcon_ops *ops = info->fbcon_par;
++ int fg_color, bg_color, transparent;
++ u8 *src;
++ u32 bgx, fgx;
++ u16 c = scr_readw(s);
++
++ fg_color = get_color(vc, info, c, 1);
++ bg_color = get_color(vc, info, c, 0);
++
++ /* Don't paint the background image if console is blanked */
++ transparent = ops->blank_state ? 0 :
++ (vc->vc_decor.bg_color == bg_color);
++
++ xx = xx * vc->vc_font.width + vc->vc_decor.tx;
++ yy = yy * vc->vc_font.height + vc->vc_decor.ty;
++
++ fgx = cc2cx(fg_color);
++ bgx = cc2cx(bg_color);
++
++ while (count--) {
++ c = scr_readw(s++);
++ src = vc->vc_font.data + (c & charmask) * vc->vc_font.height *
++ ((vc->vc_font.width + 7) >> 3);
++
++ fbcon_decor_renderc(info, yy, xx, vc->vc_font.height,
++ vc->vc_font.width, src, fgx, bgx, transparent);
++ xx += vc->vc_font.width;
++ }
++}
++
++void fbcon_decor_cursor(struct fb_info *info, struct fb_cursor *cursor)
++{
++ int i;
++ unsigned int dsize, s_pitch;
++ struct fbcon_ops *ops = info->fbcon_par;
++ struct vc_data* vc;
++ u8 *src;
++
++ /* we really don't need any cursors while the console is blanked */
++ if (info->state != FBINFO_STATE_RUNNING || ops->blank_state)
++ return;
++
++ vc = vc_cons[ops->currcon].d;
++
++ src = kmalloc(64 + sizeof(struct fb_image), GFP_ATOMIC);
++ if (!src)
++ return;
++
++ s_pitch = (cursor->image.width + 7) >> 3;
++ dsize = s_pitch * cursor->image.height;
++ if (cursor->enable) {
++ switch (cursor->rop) {
++ case ROP_XOR:
++ for (i = 0; i < dsize; i++)
++ src[i] = cursor->image.data[i] ^ cursor->mask[i];
++ break;
++ case ROP_COPY:
++ default:
++ for (i = 0; i < dsize; i++)
++ src[i] = cursor->image.data[i] & cursor->mask[i];
++ break;
++ }
++ } else
++ memcpy(src, cursor->image.data, dsize);
++
++ fbcon_decor_renderc(info,
++ cursor->image.dy + vc->vc_decor.ty,
++ cursor->image.dx + vc->vc_decor.tx,
++ cursor->image.height,
++ cursor->image.width,
++ (u8*)src,
++ cc2cx(cursor->image.fg_color),
++ cc2cx(cursor->image.bg_color),
++ cursor->image.bg_color == vc->vc_decor.bg_color);
++
++ kfree(src);
++}
++
++static void decorset(u8 *dst, int height, int width, int dstbytes,
++ u32 bgx, int bpp)
++{
++ int i;
++
++ if (bpp == 8)
++ bgx |= bgx << 8;
++ if (bpp == 16 || bpp == 8)
++ bgx |= bgx << 16;
++
++ while (height-- > 0) {
++ u8 *p = dst;
++
++ switch (bpp) {
++
++ case 32:
++ for (i=0; i < width; i++) {
++ fb_writel(bgx, p); p += 4;
++ }
++ break;
++ case 24:
++ for (i=0; i < width; i++) {
++#ifdef __LITTLE_ENDIAN
++ fb_writew((bgx & 0xffff),(u16*)p); p += 2;
++ fb_writeb((bgx >> 16),p++);
++#else
++ fb_writew((bgx >> 8),(u16*)p); p += 2;
++ fb_writeb((bgx & 0xff),p++);
++#endif
++ }
++ case 16:
++ for (i=0; i < width/4; i++) {
++ fb_writel(bgx,p); p += 4;
++ fb_writel(bgx,p); p += 4;
++ }
++ if (width & 2) {
++ fb_writel(bgx,p); p += 4;
++ }
++ if (width & 1)
++ fb_writew(bgx,(u16*)p);
++ break;
++ case 8:
++ for (i=0; i < width/4; i++) {
++ fb_writel(bgx,p); p += 4;
++ }
++
++ if (width & 2) {
++ fb_writew(bgx,p); p += 2;
++ }
++ if (width & 1)
++ fb_writeb(bgx,(u8*)p);
++ break;
++
++ }
++ dst += dstbytes;
++ }
++}
++
++void fbcon_decor_copy(u8 *dst, u8 *src, int height, int width, int linebytes,
++ int srclinebytes, int bpp)
++{
++ int i;
++
++ while (height-- > 0) {
++ u32 *p = (u32 *)dst;
++ u32 *q = (u32 *)src;
++
++ switch (bpp) {
++
++ case 32:
++ for (i=0; i < width; i++)
++ fb_writel(*q++, p++);
++ break;
++ case 24:
++ for (i=0; i < (width*3/4); i++)
++ fb_writel(*q++, p++);
++ if ((width*3) % 4) {
++ if (width & 2) {
++ fb_writeb(*(u8*)q, (u8*)p);
++ } else if (width & 1) {
++ fb_writew(*(u16*)q, (u16*)p);
++ fb_writeb(*(u8*)((u16*)q+1),(u8*)((u16*)p+2));
++ }
++ }
++ break;
++ case 16:
++ for (i=0; i < width/4; i++) {
++ fb_writel(*q++, p++);
++ fb_writel(*q++, p++);
++ }
++ if (width & 2)
++ fb_writel(*q++, p++);
++ if (width & 1)
++ fb_writew(*(u16*)q, (u16*)p);
++ break;
++ case 8:
++ for (i=0; i < width/4; i++)
++ fb_writel(*q++, p++);
++
++ if (width & 2) {
++ fb_writew(*(u16*)q, (u16*)p);
++ q = (u32*) ((u16*)q + 1);
++ p = (u32*) ((u16*)p + 1);
++ }
++ if (width & 1)
++ fb_writeb(*(u8*)q, (u8*)p);
++ break;
++ }
++
++ dst += linebytes;
++ src += srclinebytes;
++ }
++}
++
++static void decorfill(struct fb_info *info, int sy, int sx, int height,
++ int width)
++{
++ int bytespp = ((info->var.bits_per_pixel + 7) >> 3);
++ int d = sy * info->fix.line_length + sx * bytespp;
++ int ds = (sy * info->var.xres + sx) * bytespp;
++
++ fbcon_decor_copy((u8 *)(info->screen_base + d), (u8 *)(info->bgdecor.data + ds),
++ height, width, info->fix.line_length, info->var.xres * bytespp,
++ info->var.bits_per_pixel);
++}
++
++void fbcon_decor_clear(struct vc_data *vc, struct fb_info *info, int sy, int sx,
++ int height, int width)
++{
++ int bgshift = (vc->vc_hi_font_mask) ? 13 : 12;
++ struct fbcon_ops *ops = info->fbcon_par;
++ u8 *dst;
++ int transparent, bg_color = attr_bgcol_ec(bgshift, vc, info);
++
++ transparent = (vc->vc_decor.bg_color == bg_color);
++ sy = sy * vc->vc_font.height + vc->vc_decor.ty;
++ sx = sx * vc->vc_font.width + vc->vc_decor.tx;
++ height *= vc->vc_font.height;
++ width *= vc->vc_font.width;
++
++ /* Don't paint the background image if console is blanked */
++ if (transparent && !ops->blank_state) {
++ decorfill(info, sy, sx, height, width);
++ } else {
++ dst = (u8 *)(info->screen_base + sy * info->fix.line_length +
++ sx * ((info->var.bits_per_pixel + 7) >> 3));
++ decorset(dst, height, width, info->fix.line_length, cc2cx(bg_color),
++ info->var.bits_per_pixel);
++ }
++}
++
++void fbcon_decor_clear_margins(struct vc_data *vc, struct fb_info *info,
++ int bottom_only)
++{
++ unsigned int tw = vc->vc_cols*vc->vc_font.width;
++ unsigned int th = vc->vc_rows*vc->vc_font.height;
++
++ if (!bottom_only) {
++ /* top margin */
++ decorfill(info, 0, 0, vc->vc_decor.ty, info->var.xres);
++ /* left margin */
++ decorfill(info, vc->vc_decor.ty, 0, th, vc->vc_decor.tx);
++ /* right margin */
++ decorfill(info, vc->vc_decor.ty, vc->vc_decor.tx + tw, th,
++ info->var.xres - vc->vc_decor.tx - tw);
++ }
++ decorfill(info, vc->vc_decor.ty + th, 0,
++ info->var.yres - vc->vc_decor.ty - th, info->var.xres);
++}
++
++void fbcon_decor_bmove_redraw(struct vc_data *vc, struct fb_info *info, int y,
++ int sx, int dx, int width)
++{
++ u16 *d = (u16 *) (vc->vc_origin + vc->vc_size_row * y + dx * 2);
++ u16 *s = d + (dx - sx);
++ u16 *start = d;
++ u16 *ls = d;
++ u16 *le = d + width;
++ u16 c;
++ int x = dx;
++ u16 attr = 1;
++
++ do {
++ c = scr_readw(d);
++ if (attr != (c & 0xff00)) {
++ attr = c & 0xff00;
++ if (d > start) {
++ fbcon_decor_putcs(vc, info, start, d - start, y, x);
++ x += d - start;
++ start = d;
++ }
++ }
++ if (s >= ls && s < le && c == scr_readw(s)) {
++ if (d > start) {
++ fbcon_decor_putcs(vc, info, start, d - start, y, x);
++ x += d - start + 1;
++ start = d + 1;
++ } else {
++ x++;
++ start++;
++ }
++ }
++ s++;
++ d++;
++ } while (d < le);
++ if (d > start)
++ fbcon_decor_putcs(vc, info, start, d - start, y, x);
++}
++
++void fbcon_decor_blank(struct vc_data *vc, struct fb_info *info, int blank)
++{
++ if (blank) {
++ decorset((u8 *)info->screen_base, info->var.yres, info->var.xres,
++ info->fix.line_length, 0, info->var.bits_per_pixel);
++ } else {
++ update_screen(vc);
++ fbcon_decor_clear_margins(vc, info, 0);
++ }
++}
++
+diff --git a/drivers/video/console/fbcon.c b/drivers/video/console/fbcon.c
+index f447734..1a840c2 100644
+--- a/drivers/video/console/fbcon.c
++++ b/drivers/video/console/fbcon.c
+@@ -79,6 +79,7 @@
+ #include <asm/irq.h>
+
+ #include "fbcon.h"
++#include "fbcondecor.h"
+
+ #ifdef FBCONDEBUG
+ # define DPRINTK(fmt, args...) printk(KERN_DEBUG "%s: " fmt, __func__ , ## args)
+@@ -94,7 +95,7 @@ enum {
+
+ static struct display fb_display[MAX_NR_CONSOLES];
+
+-static signed char con2fb_map[MAX_NR_CONSOLES];
++signed char con2fb_map[MAX_NR_CONSOLES];
+ static signed char con2fb_map_boot[MAX_NR_CONSOLES];
+
+ static int logo_lines;
+@@ -286,7 +287,7 @@ static inline int fbcon_is_inactive(struct vc_data *vc, struct fb_info *info)
+ !vt_force_oops_output(vc);
+ }
+
+-static int get_color(struct vc_data *vc, struct fb_info *info,
++int get_color(struct vc_data *vc, struct fb_info *info,
+ u16 c, int is_fg)
+ {
+ int depth = fb_get_color_depth(&info->var, &info->fix);
+@@ -551,6 +552,9 @@ static int do_fbcon_takeover(int show_logo)
+ info_idx = -1;
+ } else {
+ fbcon_has_console_bind = 1;
++#ifdef CONFIG_FB_CON_DECOR
++ fbcon_decor_init();
++#endif
+ }
+
+ return err;
+@@ -1007,6 +1011,12 @@ static const char *fbcon_startup(void)
+ rows = FBCON_SWAP(ops->rotate, info->var.yres, info->var.xres);
+ cols /= vc->vc_font.width;
+ rows /= vc->vc_font.height;
++
++ if (fbcon_decor_active(info, vc)) {
++ cols = vc->vc_decor.twidth / vc->vc_font.width;
++ rows = vc->vc_decor.theight / vc->vc_font.height;
++ }
++
+ vc_resize(vc, cols, rows);
+
+ DPRINTK("mode: %s\n", info->fix.id);
+@@ -1036,7 +1046,7 @@ static void fbcon_init(struct vc_data *vc, int init)
+ cap = info->flags;
+
+ if (vc != svc || logo_shown == FBCON_LOGO_DONTSHOW ||
+- (info->fix.type == FB_TYPE_TEXT))
++ (info->fix.type == FB_TYPE_TEXT) || fbcon_decor_active(info, vc))
+ logo = 0;
+
+ if (var_to_display(p, &info->var, info))
+@@ -1260,6 +1270,11 @@ static void fbcon_clear(struct vc_data *vc, int sy, int sx, int height,
+ fbcon_clear_margins(vc, 0);
+ }
+
++ if (fbcon_decor_active(info, vc)) {
++ fbcon_decor_clear(vc, info, sy, sx, height, width);
++ return;
++ }
++
+ /* Split blits that cross physical y_wrap boundary */
+
+ y_break = p->vrows - p->yscroll;
+@@ -1279,10 +1294,15 @@ static void fbcon_putcs(struct vc_data *vc, const unsigned short *s,
+ struct display *p = &fb_display[vc->vc_num];
+ struct fbcon_ops *ops = info->fbcon_par;
+
+- if (!fbcon_is_inactive(vc, info))
+- ops->putcs(vc, info, s, count, real_y(p, ypos), xpos,
+- get_color(vc, info, scr_readw(s), 1),
+- get_color(vc, info, scr_readw(s), 0));
++ if (!fbcon_is_inactive(vc, info)) {
++
++ if (fbcon_decor_active(info, vc))
++ fbcon_decor_putcs(vc, info, s, count, ypos, xpos);
++ else
++ ops->putcs(vc, info, s, count, real_y(p, ypos), xpos,
++ get_color(vc, info, scr_readw(s), 1),
++ get_color(vc, info, scr_readw(s), 0));
++ }
+ }
+
+ static void fbcon_putc(struct vc_data *vc, int c, int ypos, int xpos)
+@@ -1298,8 +1318,13 @@ static void fbcon_clear_margins(struct vc_data *vc, int bottom_only)
+ struct fb_info *info = registered_fb[con2fb_map[vc->vc_num]];
+ struct fbcon_ops *ops = info->fbcon_par;
+
+- if (!fbcon_is_inactive(vc, info))
+- ops->clear_margins(vc, info, bottom_only);
++ if (!fbcon_is_inactive(vc, info)) {
++ if (fbcon_decor_active(info, vc)) {
++ fbcon_decor_clear_margins(vc, info, bottom_only);
++ } else {
++ ops->clear_margins(vc, info, bottom_only);
++ }
++ }
+ }
+
+ static void fbcon_cursor(struct vc_data *vc, int mode)
+@@ -1819,7 +1844,7 @@ static int fbcon_scroll(struct vc_data *vc, int t, int b, int dir,
+ count = vc->vc_rows;
+ if (softback_top)
+ fbcon_softback_note(vc, t, count);
+- if (logo_shown >= 0)
++ if (logo_shown >= 0 || fbcon_decor_active(info, vc))
+ goto redraw_up;
+ switch (p->scrollmode) {
+ case SCROLL_MOVE:
+@@ -1912,6 +1937,8 @@ static int fbcon_scroll(struct vc_data *vc, int t, int b, int dir,
+ count = vc->vc_rows;
+ if (logo_shown >= 0)
+ goto redraw_down;
++ if (fbcon_decor_active(info, vc))
++ goto redraw_down;
+ switch (p->scrollmode) {
+ case SCROLL_MOVE:
+ fbcon_redraw_blit(vc, info, p, b - 1, b - t - count,
+@@ -2060,6 +2087,13 @@ static void fbcon_bmove_rec(struct vc_data *vc, struct display *p, int sy, int s
+ }
+ return;
+ }
++
++ if (fbcon_decor_active(info, vc) && sy == dy && height == 1) {
++ /* must use slower redraw bmove to keep background pic intact */
++ fbcon_decor_bmove_redraw(vc, info, sy, sx, dx, width);
++ return;
++ }
++
+ ops->bmove(vc, info, real_y(p, sy), sx, real_y(p, dy), dx,
+ height, width);
+ }
+@@ -2130,8 +2164,8 @@ static int fbcon_resize(struct vc_data *vc, unsigned int width,
+ var.yres = virt_h * virt_fh;
+ x_diff = info->var.xres - var.xres;
+ y_diff = info->var.yres - var.yres;
+- if (x_diff < 0 || x_diff > virt_fw ||
+- y_diff < 0 || y_diff > virt_fh) {
++ if ((x_diff < 0 || x_diff > virt_fw ||
++ y_diff < 0 || y_diff > virt_fh) && !vc->vc_decor.state) {
+ const struct fb_videomode *mode;
+
+ DPRINTK("attempting resize %ix%i\n", var.xres, var.yres);
+@@ -2167,6 +2201,21 @@ static int fbcon_switch(struct vc_data *vc)
+
+ info = registered_fb[con2fb_map[vc->vc_num]];
+ ops = info->fbcon_par;
++ prev_console = ops->currcon;
++ if (prev_console != -1)
++ old_info = registered_fb[con2fb_map[prev_console]];
++
++#ifdef CONFIG_FB_CON_DECOR
++ if (!fbcon_decor_active_vc(vc) && info->fix.visual == FB_VISUAL_DIRECTCOLOR) {
++ struct vc_data *vc_curr = vc_cons[prev_console].d;
++ if (vc_curr && fbcon_decor_active_vc(vc_curr)) {
++ /* Clear the screen to avoid displaying funky colors during
++ * palette updates. */
++ memset((u8*)info->screen_base + info->fix.line_length * info->var.yoffset,
++ 0, info->var.yres * info->fix.line_length);
++ }
++ }
++#endif
+
+ if (softback_top) {
+ if (softback_lines)
+@@ -2185,9 +2234,6 @@ static int fbcon_switch(struct vc_data *vc)
+ logo_shown = FBCON_LOGO_CANSHOW;
+ }
+
+- prev_console = ops->currcon;
+- if (prev_console != -1)
+- old_info = registered_fb[con2fb_map[prev_console]];
+ /*
+ * FIXME: If we have multiple fbdev's loaded, we need to
+ * update all info->currcon. Perhaps, we can place this
+@@ -2231,6 +2277,18 @@ static int fbcon_switch(struct vc_data *vc)
+ fbcon_del_cursor_timer(old_info);
+ }
+
++ if (fbcon_decor_active_vc(vc)) {
++ struct vc_data *vc_curr = vc_cons[prev_console].d;
++
++ if (!vc_curr->vc_decor.theme ||
++ strcmp(vc->vc_decor.theme, vc_curr->vc_decor.theme) ||
++ (fbcon_decor_active_nores(info, vc_curr) &&
++ !fbcon_decor_active(info, vc_curr))) {
++ fbcon_decor_disable(vc, 0);
++ fbcon_decor_call_helper("modechange", vc->vc_num);
++ }
++ }
++
+ if (fbcon_is_inactive(vc, info) ||
+ ops->blank_state != FB_BLANK_UNBLANK)
+ fbcon_del_cursor_timer(info);
+@@ -2339,15 +2397,20 @@ static int fbcon_blank(struct vc_data *vc, int blank, int mode_switch)
+ }
+ }
+
+- if (!fbcon_is_inactive(vc, info)) {
++ if (!fbcon_is_inactive(vc, info)) {
+ if (ops->blank_state != blank) {
+ ops->blank_state = blank;
+ fbcon_cursor(vc, blank ? CM_ERASE : CM_DRAW);
+ ops->cursor_flash = (!blank);
+
+- if (!(info->flags & FBINFO_MISC_USEREVENT))
+- if (fb_blank(info, blank))
+- fbcon_generic_blank(vc, info, blank);
++ if (!(info->flags & FBINFO_MISC_USEREVENT)) {
++ if (fb_blank(info, blank)) {
++ if (fbcon_decor_active(info, vc))
++ fbcon_decor_blank(vc, info, blank);
++ else
++ fbcon_generic_blank(vc, info, blank);
++ }
++ }
+ }
+
+ if (!blank)
+@@ -2522,13 +2585,22 @@ static int fbcon_do_set_font(struct vc_data *vc, int w, int h,
+ }
+
+ if (resize) {
++ /* reset wrap/pan */
+ int cols, rows;
+
+ cols = FBCON_SWAP(ops->rotate, info->var.xres, info->var.yres);
+ rows = FBCON_SWAP(ops->rotate, info->var.yres, info->var.xres);
++
++ if (fbcon_decor_active(info, vc)) {
++ info->var.xoffset = info->var.yoffset = p->yscroll = 0;
++ cols = vc->vc_decor.twidth;
++ rows = vc->vc_decor.theight;
++ }
+ cols /= w;
+ rows /= h;
++
+ vc_resize(vc, cols, rows);
++
+ if (CON_IS_VISIBLE(vc) && softback_buf)
+ fbcon_update_softback(vc);
+ } else if (CON_IS_VISIBLE(vc)
+@@ -2657,7 +2729,11 @@ static int fbcon_set_palette(struct vc_data *vc, unsigned char *table)
+ int i, j, k, depth;
+ u8 val;
+
+- if (fbcon_is_inactive(vc, info))
++ if (fbcon_is_inactive(vc, info)
++#ifdef CONFIG_FB_CON_DECOR
++ || vc->vc_num != fg_console
++#endif
++ )
+ return -EINVAL;
+
+ if (!CON_IS_VISIBLE(vc))
+@@ -2683,14 +2759,56 @@ static int fbcon_set_palette(struct vc_data *vc, unsigned char *table)
+ } else
+ fb_copy_cmap(fb_default_cmap(1 << depth), &palette_cmap);
+
+- return fb_set_cmap(&palette_cmap, info);
++ if (fbcon_decor_active(info, vc_cons[fg_console].d) &&
++ info->fix.visual == FB_VISUAL_DIRECTCOLOR) {
++
++ u16 *red, *green, *blue;
++ int minlen = min(min(info->var.red.length, info->var.green.length),
++ info->var.blue.length);
++ int h;
++
++ struct fb_cmap cmap = {
++ .start = 0,
++ .len = (1 << minlen),
++ .red = NULL,
++ .green = NULL,
++ .blue = NULL,
++ .transp = NULL
++ };
++
++ red = kmalloc(256 * sizeof(u16) * 3, GFP_KERNEL);
++
++ if (!red)
++ goto out;
++
++ green = red + 256;
++ blue = green + 256;
++ cmap.red = red;
++ cmap.green = green;
++ cmap.blue = blue;
++
++ for (i = 0; i < cmap.len; i++) {
++ red[i] = green[i] = blue[i] = (0xffff * i)/(cmap.len-1);
++ }
++
++ h = fb_set_cmap(&cmap, info);
++ fbcon_decor_fix_pseudo_pal(info, vc_cons[fg_console].d);
++ kfree(red);
++
++ return h;
++
++ } else if (fbcon_decor_active(info, vc_cons[fg_console].d) &&
++ info->var.bits_per_pixel == 8 && info->bgdecor.cmap.red != NULL)
++ fb_set_cmap(&info->bgdecor.cmap, info);
++
++out: return fb_set_cmap(&palette_cmap, info);
+ }
+
+ static u16 *fbcon_screen_pos(struct vc_data *vc, int offset)
+ {
+ unsigned long p;
+ int line;
+-
++
+ if (vc->vc_num != fg_console || !softback_lines)
+ return (u16 *) (vc->vc_origin + offset);
+ line = offset / vc->vc_size_row;
+@@ -2909,7 +3027,14 @@ static void fbcon_modechanged(struct fb_info *info)
+ rows = FBCON_SWAP(ops->rotate, info->var.yres, info->var.xres);
+ cols /= vc->vc_font.width;
+ rows /= vc->vc_font.height;
+- vc_resize(vc, cols, rows);
++
++ if (!fbcon_decor_active_nores(info, vc)) {
++ vc_resize(vc, cols, rows);
++ } else {
++ fbcon_decor_disable(vc, 0);
++ fbcon_decor_call_helper("modechange", vc->vc_num);
++ }
++
+ updatescrollmode(p, info, vc);
+ scrollback_max = 0;
+ scrollback_current = 0;
+@@ -2954,7 +3079,9 @@ static void fbcon_set_all_vcs(struct fb_info *info)
+ rows = FBCON_SWAP(ops->rotate, info->var.yres, info->var.xres);
+ cols /= vc->vc_font.width;
+ rows /= vc->vc_font.height;
+- vc_resize(vc, cols, rows);
++ if (!fbcon_decor_active_nores(info, vc)) {
++ vc_resize(vc, cols, rows);
++ }
+ }
+
+ if (fg != -1)
+@@ -3596,6 +3723,7 @@ static void fbcon_exit(void)
+ }
+ }
+
++ fbcon_decor_exit();
+ fbcon_has_exited = 1;
+ }
+
+diff --git a/drivers/video/console/fbcondecor.c b/drivers/video/console/fbcondecor.c
+new file mode 100644
+index 0000000..babc8c5
+--- /dev/null
++++ b/drivers/video/console/fbcondecor.c
+@@ -0,0 +1,555 @@
++/*
++ * linux/drivers/video/console/fbcondecor.c -- Framebuffer console decorations
++ *
++ * Copyright (C) 2004-2009 Michal Januszewski <michalj+fbcondecor@gmail.com>
++ *
++ * Code based upon "Bootsplash" (C) 2001-2003
++ * Volker Poplawski <volker@poplawski.de>,
++ * Stefan Reinauer <stepan@suse.de>,
++ * Steffen Winterfeldt <snwint@suse.de>,
++ * Michael Schroeder <mls@suse.de>,
++ * Ken Wimer <wimer@suse.de>.
++ *
++ * Compat ioctl support by Thorsten Klein <TK@Thorsten-Klein.de>.
++ *
++ * This file is subject to the terms and conditions of the GNU General Public
++ * License. See the file COPYING in the main directory of this archive for
++ * more details.
++ *
++ */
++#include <linux/module.h>
++#include <linux/kernel.h>
++#include <linux/string.h>
++#include <linux/types.h>
++#include <linux/fb.h>
++#include <linux/vt_kern.h>
++#include <linux/vmalloc.h>
++#include <linux/unistd.h>
++#include <linux/syscalls.h>
++#include <linux/init.h>
++#include <linux/proc_fs.h>
++#include <linux/workqueue.h>
++#include <linux/kmod.h>
++#include <linux/miscdevice.h>
++#include <linux/device.h>
++#include <linux/fs.h>
++#include <linux/compat.h>
++#include <linux/console.h>
++
++#include <asm/uaccess.h>
++#include <asm/irq.h>
++
++#include "fbcon.h"
++#include "fbcondecor.h"
++
++extern signed char con2fb_map[];
++static int fbcon_decor_enable(struct vc_data *vc);
++char fbcon_decor_path[KMOD_PATH_LEN] = "/sbin/fbcondecor_helper";
++static int initialized = 0;
++
++int fbcon_decor_call_helper(char* cmd, unsigned short vc)
++{
++ char *envp[] = {
++ "HOME=/",
++ "PATH=/sbin:/bin",
++ NULL
++ };
++
++ char tfb[5];
++ char tcons[5];
++ unsigned char fb = (int) con2fb_map[vc];
++
++ char *argv[] = {
++ fbcon_decor_path,
++ "2",
++ cmd,
++ tcons,
++ tfb,
++ vc_cons[vc].d->vc_decor.theme,
++ NULL
++ };
++
++ snprintf(tfb,5,"%d",fb);
++ snprintf(tcons,5,"%d",vc);
++
++ return call_usermodehelper(fbcon_decor_path, argv, envp, UMH_WAIT_EXEC);
++}
++
++/* Disables fbcondecor on a virtual console; called with console sem held. */
++int fbcon_decor_disable(struct vc_data *vc, unsigned char redraw)
++{
++ struct fb_info* info;
++
++ if (!vc->vc_decor.state)
++ return -EINVAL;
++
++ info = registered_fb[(int) con2fb_map[vc->vc_num]];
++
++ if (info == NULL)
++ return -EINVAL;
++
++ vc->vc_decor.state = 0;
++ vc_resize(vc, info->var.xres / vc->vc_font.width,
++ info->var.yres / vc->vc_font.height);
++
++ if (fg_console == vc->vc_num && redraw) {
++ redraw_screen(vc, 0);
++ update_region(vc, vc->vc_origin +
++ vc->vc_size_row * vc->vc_top,
++ vc->vc_size_row * (vc->vc_bottom - vc->vc_top) / 2);
++ }
++
++ printk(KERN_INFO "fbcondecor: switched decor state to 'off' on console %d\n",
++ vc->vc_num);
++
++ return 0;
++}
++
++/* Enables fbcondecor on a virtual console; called with console sem held. */
++static int fbcon_decor_enable(struct vc_data *vc)
++{
++ struct fb_info* info;
++
++ info = registered_fb[(int) con2fb_map[vc->vc_num]];
++
++ if (vc->vc_decor.twidth == 0 || vc->vc_decor.theight == 0 ||
++ info == NULL || vc->vc_decor.state || (!info->bgdecor.data &&
++ vc->vc_num == fg_console))
++ return -EINVAL;
++
++ vc->vc_decor.state = 1;
++ vc_resize(vc, vc->vc_decor.twidth / vc->vc_font.width,
++ vc->vc_decor.theight / vc->vc_font.height);
++
++ if (fg_console == vc->vc_num) {
++ redraw_screen(vc, 0);
++ update_region(vc, vc->vc_origin +
++ vc->vc_size_row * vc->vc_top,
++ vc->vc_size_row * (vc->vc_bottom - vc->vc_top) / 2);
++ fbcon_decor_clear_margins(vc, info, 0);
++ }
++
++ printk(KERN_INFO "fbcondecor: switched decor state to 'on' on console %d\n",
++ vc->vc_num);
++
++ return 0;
++}
++
++static inline int fbcon_decor_ioctl_dosetstate(struct vc_data *vc, unsigned int state, unsigned char origin)
++{
++ int ret;
++
++// if (origin == FBCON_DECOR_IO_ORIG_USER)
++ console_lock();
++ if (!state)
++ ret = fbcon_decor_disable(vc, 1);
++ else
++ ret = fbcon_decor_enable(vc);
++// if (origin == FBCON_DECOR_IO_ORIG_USER)
++ console_unlock();
++
++ return ret;
++}
++
++static inline void fbcon_decor_ioctl_dogetstate(struct vc_data *vc, unsigned int *state)
++{
++ *state = vc->vc_decor.state;
++}
++
++static int fbcon_decor_ioctl_dosetcfg(struct vc_data *vc, struct vc_decor *cfg, unsigned char origin)
++{
++ struct fb_info *info;
++ int len;
++ char *tmp;
++
++ info = registered_fb[(int) con2fb_map[vc->vc_num]];
++
++ if (info == NULL || !cfg->twidth || !cfg->theight ||
++ cfg->tx + cfg->twidth > info->var.xres ||
++ cfg->ty + cfg->theight > info->var.yres)
++ return -EINVAL;
++
++ len = strlen_user(cfg->theme);
++ if (!len || len > FBCON_DECOR_THEME_LEN)
++ return -EINVAL;
++ tmp = kmalloc(len, GFP_KERNEL);
++ if (!tmp)
++ return -ENOMEM;
++ if (copy_from_user(tmp, (void __user *)cfg->theme, len))
++ return -EFAULT;
++ cfg->theme = tmp;
++ cfg->state = 0;
++
++ /* If this ioctl is a response to a request from the kernel, the console
++ * sem is already held. We also don't need to disable the decor: either the
++ * new config and background picture load successfully and the decor stays
++ * on, or, on failure, it is turned off in fbcon. */
++// if (origin == FBCON_DECOR_IO_ORIG_USER) {
++ console_lock();
++ if (vc->vc_decor.state)
++ fbcon_decor_disable(vc, 1);
++// }
++
++ if (vc->vc_decor.theme)
++ kfree(vc->vc_decor.theme);
++
++ vc->vc_decor = *cfg;
++
++// if (origin == FBCON_DECOR_IO_ORIG_USER)
++ console_unlock();
++
++ printk(KERN_INFO "fbcondecor: console %d using theme '%s'\n",
++ vc->vc_num, vc->vc_decor.theme);
++ return 0;
++}
++
++static int fbcon_decor_ioctl_dogetcfg(struct vc_data *vc, struct vc_decor *decor)
++{
++ char __user *tmp;
++
++ tmp = decor->theme;
++ *decor = vc->vc_decor;
++ decor->theme = tmp;
++
++ if (vc->vc_decor.theme) {
++ if (copy_to_user(tmp, vc->vc_decor.theme, strlen(vc->vc_decor.theme) + 1))
++ return -EFAULT;
++ } else
++ if (put_user(0, tmp))
++ return -EFAULT;
++
++ return 0;
++}
++
++static int fbcon_decor_ioctl_dosetpic(struct vc_data *vc, struct fb_image *img, unsigned char origin)
++{
++ struct fb_info *info;
++ int len;
++ u8 *tmp;
++
++ if (vc->vc_num != fg_console)
++ return -EINVAL;
++
++ info = registered_fb[(int) con2fb_map[vc->vc_num]];
++
++ if (info == NULL)
++ return -EINVAL;
++
++ if (img->width != info->var.xres || img->height != info->var.yres) {
++ printk(KERN_ERR "fbcondecor: picture dimensions mismatch\n");
++ printk(KERN_ERR "%dx%d vs %dx%d\n", img->width, img->height, info->var.xres, info->var.yres);
++ return -EINVAL;
++ }
++
++ if (img->depth != info->var.bits_per_pixel) {
++ printk(KERN_ERR "fbcondecor: picture depth mismatch\n");
++ return -EINVAL;
++ }
++
++ if (img->depth == 8) {
++ if (!img->cmap.len || !img->cmap.red || !img->cmap.green ||
++ !img->cmap.blue)
++ return -EINVAL;
++
++ tmp = vmalloc(img->cmap.len * 3 * 2);
++ if (!tmp)
++ return -ENOMEM;
++
++ if (copy_from_user(tmp,
++ (void __user*)img->cmap.red, (img->cmap.len << 1)) ||
++ copy_from_user(tmp + (img->cmap.len << 1),
++ (void __user*)img->cmap.green, (img->cmap.len << 1)) ||
++ copy_from_user(tmp + (img->cmap.len << 2),
++ (void __user*)img->cmap.blue, (img->cmap.len << 1))) {
++ vfree(tmp);
++ return -EFAULT;
++ }
++
++ img->cmap.transp = NULL;
++ img->cmap.red = (u16*)tmp;
++ img->cmap.green = img->cmap.red + img->cmap.len;
++ img->cmap.blue = img->cmap.green + img->cmap.len;
++ } else {
++ img->cmap.red = NULL;
++ }
++
++ len = ((img->depth + 7) >> 3) * img->width * img->height;
++
++ /*
++ * Allocate an additional byte so that we never go outside of the
++ * buffer boundaries in the rendering functions in a 24 bpp mode.
++ */
++ tmp = vmalloc(len + 1);
++
++ if (!tmp)
++ goto out;
++
++ if (copy_from_user(tmp, (void __user*)img->data, len))
++ goto out;
++
++ img->data = tmp;
++
++ /* If this ioctl is a response to a request from kernel, the console sem
++ * is already held. */
++// if (origin == FBCON_DECOR_IO_ORIG_USER)
++ console_lock();
++
++ if (info->bgdecor.data)
++ vfree((u8*)info->bgdecor.data);
++ if (info->bgdecor.cmap.red)
++ vfree(info->bgdecor.cmap.red);
++
++ info->bgdecor = *img;
++
++ if (fbcon_decor_active_vc(vc) && fg_console == vc->vc_num) {
++ redraw_screen(vc, 0);
++ update_region(vc, vc->vc_origin +
++ vc->vc_size_row * vc->vc_top,
++ vc->vc_size_row * (vc->vc_bottom - vc->vc_top) / 2);
++ fbcon_decor_clear_margins(vc, info, 0);
++ }
++
++// if (origin == FBCON_DECOR_IO_ORIG_USER)
++ console_unlock();
++
++ return 0;
++
++out: if (img->cmap.red)
++ vfree(img->cmap.red);
++
++ if (tmp)
++ vfree(tmp);
++ return -ENOMEM;
++}
++
++static long fbcon_decor_ioctl(struct file *filp, u_int cmd, u_long arg)
++{
++ struct fbcon_decor_iowrapper __user *wrapper = (void __user*) arg;
++ struct vc_data *vc = NULL;
++ unsigned short vc_num = 0;
++ unsigned char origin = 0;
++ void __user *data = NULL;
++
++ if (!access_ok(VERIFY_READ, wrapper,
++ sizeof(struct fbcon_decor_iowrapper)))
++ return -EFAULT;
++
++ __get_user(vc_num, &wrapper->vc);
++ __get_user(origin, &wrapper->origin);
++ __get_user(data, &wrapper->data);
++
++ if (!vc_cons_allocated(vc_num))
++ return -EINVAL;
++
++ vc = vc_cons[vc_num].d;
++
++ switch (cmd) {
++ case FBIOCONDECOR_SETPIC:
++ {
++ struct fb_image img;
++ if (copy_from_user(&img, (struct fb_image __user *)data, sizeof(struct fb_image)))
++ return -EFAULT;
++
++ return fbcon_decor_ioctl_dosetpic(vc, &img, origin);
++ }
++ case FBIOCONDECOR_SETCFG:
++ {
++ struct vc_decor cfg;
++ if (copy_from_user(&cfg, (struct vc_decor __user *)data, sizeof(struct vc_decor)))
++ return -EFAULT;
++
++ return fbcon_decor_ioctl_dosetcfg(vc, &cfg, origin);
++ }
++ case FBIOCONDECOR_GETCFG:
++ {
++ int rval;
++ struct vc_decor cfg;
++
++ if (copy_from_user(&cfg, (struct vc_decor __user *)data, sizeof(struct vc_decor)))
++ return -EFAULT;
++
++ rval = fbcon_decor_ioctl_dogetcfg(vc, &cfg);
++
++ if (copy_to_user(data, &cfg, sizeof(struct vc_decor)))
++ return -EFAULT;
++ return rval;
++ }
++ case FBIOCONDECOR_SETSTATE:
++ {
++ unsigned int state = 0;
++ if (get_user(state, (unsigned int __user *)data))
++ return -EFAULT;
++ return fbcon_decor_ioctl_dosetstate(vc, state, origin);
++ }
++ case FBIOCONDECOR_GETSTATE:
++ {
++ unsigned int state = 0;
++ fbcon_decor_ioctl_dogetstate(vc, &state);
++ return put_user(state, (unsigned int __user *)data);
++ }
++
++ default:
++ return -ENOIOCTLCMD;
++ }
++}
++
++#ifdef CONFIG_COMPAT
++
++static long fbcon_decor_compat_ioctl(struct file *filp, unsigned int cmd, unsigned long arg) {
++
++ struct fbcon_decor_iowrapper32 __user *wrapper = (void __user *)arg;
++ struct vc_data *vc = NULL;
++ unsigned short vc_num = 0;
++ unsigned char origin = 0;
++ compat_uptr_t data_compat = 0;
++ void __user *data = NULL;
++
++ if (!access_ok(VERIFY_READ, wrapper,
++ sizeof(struct fbcon_decor_iowrapper32)))
++ return -EFAULT;
++
++ __get_user(vc_num, &wrapper->vc);
++ __get_user(origin, &wrapper->origin);
++ __get_user(data_compat, &wrapper->data);
++ data = compat_ptr(data_compat);
++
++ if (!vc_cons_allocated(vc_num))
++ return -EINVAL;
++
++ vc = vc_cons[vc_num].d;
++
++ switch (cmd) {
++ case FBIOCONDECOR_SETPIC32:
++ {
++ struct fb_image32 img_compat;
++ struct fb_image img;
++
++ if (copy_from_user(&img_compat, (struct fb_image32 __user *)data, sizeof(struct fb_image32)))
++ return -EFAULT;
++
++ fb_image_from_compat(img, img_compat);
++
++ return fbcon_decor_ioctl_dosetpic(vc, &img, origin);
++ }
++
++ case FBIOCONDECOR_SETCFG32:
++ {
++ struct vc_decor32 cfg_compat;
++ struct vc_decor cfg;
++
++ if (copy_from_user(&cfg_compat, (struct vc_decor32 __user *)data, sizeof(struct vc_decor32)))
++ return -EFAULT;
++
++ vc_decor_from_compat(cfg, cfg_compat);
++
++ return fbcon_decor_ioctl_dosetcfg(vc, &cfg, origin);
++ }
++
++ case FBIOCONDECOR_GETCFG32:
++ {
++ int rval;
++ struct vc_decor32 cfg_compat;
++ struct vc_decor cfg;
++
++ if (copy_from_user(&cfg_compat, (struct vc_decor32 __user *)data, sizeof(struct vc_decor32)))
++ return -EFAULT;
++ cfg.theme = compat_ptr(cfg_compat.theme);
++
++ rval = fbcon_decor_ioctl_dogetcfg(vc, &cfg);
++
++ vc_decor_to_compat(cfg_compat, cfg);
++
++ if (copy_to_user((struct vc_decor32 __user *)data, &cfg_compat, sizeof(struct vc_decor32)))
++ return -EFAULT;
++ return rval;
++ }
++
++ case FBIOCONDECOR_SETSTATE32:
++ {
++ compat_uint_t state_compat = 0;
++ unsigned int state = 0;
++
++ if (get_user(state_compat, (compat_uint_t __user *)data))
++ return -EFAULT;
++
++ state = (unsigned int)state_compat;
++
++ return fbcon_decor_ioctl_dosetstate(vc, state, origin);
++ }
++
++ case FBIOCONDECOR_GETSTATE32:
++ {
++ compat_uint_t state_compat = 0;
++ unsigned int state = 0;
++
++ fbcon_decor_ioctl_dogetstate(vc, &state);
++ state_compat = (compat_uint_t)state;
++
++ return put_user(state_compat, (compat_uint_t __user *)data);
++ }
++
++ default:
++ return -ENOIOCTLCMD;
++ }
++}
++#else
++ #define fbcon_decor_compat_ioctl NULL
++#endif
++
++static struct file_operations fbcon_decor_ops = {
++ .owner = THIS_MODULE,
++ .unlocked_ioctl = fbcon_decor_ioctl,
++ .compat_ioctl = fbcon_decor_compat_ioctl
++};
++
++static struct miscdevice fbcon_decor_dev = {
++ .minor = MISC_DYNAMIC_MINOR,
++ .name = "fbcondecor",
++ .fops = &fbcon_decor_ops
++};
++
++void fbcon_decor_reset(void)
++{
++ int i;
++
++ for (i = 0; i < num_registered_fb; i++) {
++ registered_fb[i]->bgdecor.data = NULL;
++ registered_fb[i]->bgdecor.cmap.red = NULL;
++ }
++
++ for (i = 0; i < MAX_NR_CONSOLES && vc_cons[i].d; i++) {
++ vc_cons[i].d->vc_decor.state = vc_cons[i].d->vc_decor.twidth =
++ vc_cons[i].d->vc_decor.theight = 0;
++ vc_cons[i].d->vc_decor.theme = NULL;
++ }
++
++ return;
++}
++
++int fbcon_decor_init(void)
++{
++ int i;
++
++ fbcon_decor_reset();
++
++ if (initialized)
++ return 0;
++
++ i = misc_register(&fbcon_decor_dev);
++ if (i) {
++ printk(KERN_ERR "fbcondecor: failed to register device\n");
++ return i;
++ }
++
++ fbcon_decor_call_helper("init", 0);
++ initialized = 1;
++ return 0;
++}
++
++int fbcon_decor_exit(void)
++{
++ fbcon_decor_reset();
++ return 0;
++}
++
++EXPORT_SYMBOL(fbcon_decor_path);
+diff --git a/drivers/video/console/fbcondecor.h b/drivers/video/console/fbcondecor.h
+new file mode 100644
+index 0000000..3b3724b
+--- /dev/null
++++ b/drivers/video/console/fbcondecor.h
+@@ -0,0 +1,78 @@
++/*
++ * linux/drivers/video/console/fbcondecor.h -- Framebuffer Console Decoration headers
++ *
++ * Copyright (C) 2004 Michal Januszewski <michalj+fbcondecor@gmail.com>
++ *
++ */
++
++#ifndef __FBCON_DECOR_H
++#define __FBCON_DECOR_H
++
++#ifndef _LINUX_FB_H
++#include <linux/fb.h>
++#endif
++
++/* This is needed for vc_cons in fbcmap.c */
++#include <linux/vt_kern.h>
++
++struct fb_cursor;
++struct fb_info;
++struct vc_data;
++
++#ifdef CONFIG_FB_CON_DECOR
++/* fbcondecor.c */
++int fbcon_decor_init(void);
++int fbcon_decor_exit(void);
++int fbcon_decor_call_helper(char* cmd, unsigned short cons);
++int fbcon_decor_disable(struct vc_data *vc, unsigned char redraw);
++
++/* cfbcondecor.c */
++void fbcon_decor_putcs(struct vc_data *vc, struct fb_info *info, const unsigned short *s, int count, int yy, int xx);
++void fbcon_decor_cursor(struct fb_info *info, struct fb_cursor *cursor);
++void fbcon_decor_clear(struct vc_data *vc, struct fb_info *info, int sy, int sx, int height, int width);
++void fbcon_decor_clear_margins(struct vc_data *vc, struct fb_info *info, int bottom_only);
++void fbcon_decor_blank(struct vc_data *vc, struct fb_info *info, int blank);
++void fbcon_decor_bmove_redraw(struct vc_data *vc, struct fb_info *info, int y, int sx, int dx, int width);
++void fbcon_decor_copy(u8 *dst, u8 *src, int height, int width, int linebytes, int srclinesbytes, int bpp);
++void fbcon_decor_fix_pseudo_pal(struct fb_info *info, struct vc_data *vc);
++
++/* vt.c */
++void acquire_console_sem(void);
++void release_console_sem(void);
++void do_unblank_screen(int entering_gfx);
++
++/* struct vc_data *y */
++#define fbcon_decor_active_vc(y) (y->vc_decor.state && y->vc_decor.theme)
++
++/* struct fb_info *x, struct vc_data *y */
++#define fbcon_decor_active_nores(x,y) (x->bgdecor.data && fbcon_decor_active_vc(y))
++
++/* struct fb_info *x, struct vc_data *y */
++#define fbcon_decor_active(x,y) (fbcon_decor_active_nores(x,y) && \
++ x->bgdecor.width == x->var.xres && \
++ x->bgdecor.height == x->var.yres && \
++ x->bgdecor.depth == x->var.bits_per_pixel)
++
++
++#else /* CONFIG_FB_CON_DECOR */
++
++static inline void fbcon_decor_putcs(struct vc_data *vc, struct fb_info *info, const unsigned short *s, int count, int yy, int xx) {}
++static inline void fbcon_decor_putc(struct vc_data *vc, struct fb_info *info, int c, int ypos, int xpos) {}
++static inline void fbcon_decor_cursor(struct fb_info *info, struct fb_cursor *cursor) {}
++static inline void fbcon_decor_clear(struct vc_data *vc, struct fb_info *info, int sy, int sx, int height, int width) {}
++static inline void fbcon_decor_clear_margins(struct vc_data *vc, struct fb_info *info, int bottom_only) {}
++static inline void fbcon_decor_blank(struct vc_data *vc, struct fb_info *info, int blank) {}
++static inline void fbcon_decor_bmove_redraw(struct vc_data *vc, struct fb_info *info, int y, int sx, int dx, int width) {}
++static inline void fbcon_decor_fix_pseudo_pal(struct fb_info *info, struct vc_data *vc) {}
++static inline int fbcon_decor_call_helper(char* cmd, unsigned short cons) { return 0; }
++static inline int fbcon_decor_init(void) { return 0; }
++static inline int fbcon_decor_exit(void) { return 0; }
++static inline int fbcon_decor_disable(struct vc_data *vc, unsigned char redraw) { return 0; }
++
++#define fbcon_decor_active_vc(y) (0)
++#define fbcon_decor_active_nores(x,y) (0)
++#define fbcon_decor_active(x,y) (0)
++
++#endif /* CONFIG_FB_CON_DECOR */
++
++#endif /* __FBCON_DECOR_H */
+diff --git a/drivers/video/fbdev/Kconfig b/drivers/video/fbdev/Kconfig
+index e1f4727..2952e33 100644
+--- a/drivers/video/fbdev/Kconfig
++++ b/drivers/video/fbdev/Kconfig
+@@ -1204,7 +1204,6 @@ config FB_MATROX
+ select FB_CFB_FILLRECT
+ select FB_CFB_COPYAREA
+ select FB_CFB_IMAGEBLIT
+- select FB_TILEBLITTING
+ select FB_MACMODES if PPC_PMAC
+ ---help---
+ Say Y here if you have a Matrox Millennium, Matrox Millennium II,
+diff --git a/drivers/video/fbdev/core/fbcmap.c b/drivers/video/fbdev/core/fbcmap.c
+index f89245b..05e036c 100644
+--- a/drivers/video/fbdev/core/fbcmap.c
++++ b/drivers/video/fbdev/core/fbcmap.c
+@@ -17,6 +17,8 @@
+ #include <linux/slab.h>
+ #include <linux/uaccess.h>
+
++#include "../../console/fbcondecor.h"
++
+ static u16 red2[] __read_mostly = {
+ 0x0000, 0xaaaa
+ };
+@@ -249,14 +251,17 @@ int fb_set_cmap(struct fb_cmap *cmap, struct fb_info *info)
+ if (transp)
+ htransp = *transp++;
+ if (info->fbops->fb_setcolreg(start++,
+- hred, hgreen, hblue,
++ hred, hgreen, hblue,
+ htransp, info))
+ break;
+ }
+ }
+- if (rc == 0)
++ if (rc == 0) {
+ fb_copy_cmap(cmap, &info->cmap);
+-
++ if (fbcon_decor_active(info, vc_cons[fg_console].d) &&
++ info->fix.visual == FB_VISUAL_DIRECTCOLOR)
++ fbcon_decor_fix_pseudo_pal(info, vc_cons[fg_console].d);
++ }
+ return rc;
+ }
+
+diff --git a/drivers/video/fbdev/core/fbmem.c b/drivers/video/fbdev/core/fbmem.c
+index b6d5008..d6703f2 100644
+--- a/drivers/video/fbdev/core/fbmem.c
++++ b/drivers/video/fbdev/core/fbmem.c
+@@ -1250,15 +1250,6 @@ struct fb_fix_screeninfo32 {
+ u16 reserved[3];
+ };
+
+-struct fb_cmap32 {
+- u32 start;
+- u32 len;
+- compat_caddr_t red;
+- compat_caddr_t green;
+- compat_caddr_t blue;
+- compat_caddr_t transp;
+-};
+-
+ static int fb_getput_cmap(struct fb_info *info, unsigned int cmd,
+ unsigned long arg)
+ {
+diff --git a/include/linux/console_decor.h b/include/linux/console_decor.h
+new file mode 100644
+index 0000000..04b8d80
+--- /dev/null
++++ b/include/linux/console_decor.h
+@@ -0,0 +1,46 @@
++#ifndef _LINUX_CONSOLE_DECOR_H_
++#define _LINUX_CONSOLE_DECOR_H_ 1
++
++/* A structure used by the framebuffer console decorations (drivers/video/console/fbcondecor.c) */
++struct vc_decor {
++ __u8 bg_color; /* The color that is to be treated as transparent */
++ __u8 state; /* Current decor state: 0 = off, 1 = on */
++ __u16 tx, ty; /* Top left corner coordinates of the text field */
++ __u16 twidth, theight; /* Width and height of the text field */
++ char* theme;
++};
++
++#ifdef __KERNEL__
++#ifdef CONFIG_COMPAT
++#include <linux/compat.h>
++
++struct vc_decor32 {
++ __u8 bg_color; /* The color that is to be treated as transparent */
++ __u8 state; /* Current decor state: 0 = off, 1 = on */
++ __u16 tx, ty; /* Top left corner coordinates of the text field */
++ __u16 twidth, theight; /* Width and height of the text field */
++ compat_uptr_t theme;
++};
++
++#define vc_decor_from_compat(to, from) \
++ (to).bg_color = (from).bg_color; \
++ (to).state = (from).state; \
++ (to).tx = (from).tx; \
++ (to).ty = (from).ty; \
++ (to).twidth = (from).twidth; \
++ (to).theight = (from).theight; \
++ (to).theme = compat_ptr((from).theme)
++
++#define vc_decor_to_compat(to, from) \
++ (to).bg_color = (from).bg_color; \
++ (to).state = (from).state; \
++ (to).tx = (from).tx; \
++ (to).ty = (from).ty; \
++ (to).twidth = (from).twidth; \
++ (to).theight = (from).theight; \
++ (to).theme = ptr_to_compat((from).theme)
++
++#endif /* CONFIG_COMPAT */
++#endif /* __KERNEL__ */
++
++#endif
+diff --git a/include/linux/console_struct.h b/include/linux/console_struct.h
+index 7f0c329..98f5d60 100644
+--- a/include/linux/console_struct.h
++++ b/include/linux/console_struct.h
+@@ -19,6 +19,7 @@
+ struct vt_struct;
+
+ #define NPAR 16
++#include <linux/console_decor.h>
+
+ struct vc_data {
+ struct tty_port port; /* Upper level data */
+@@ -107,6 +108,8 @@ struct vc_data {
+ unsigned long vc_uni_pagedir;
+ unsigned long *vc_uni_pagedir_loc; /* [!] Location of uni_pagedir variable for this console */
+ bool vc_panic_force_write; /* when oops/panic this VC can accept forced output/blanking */
++
++ struct vc_decor vc_decor;
+ /* additional information is in vt_kern.h */
+ };
+
+diff --git a/include/linux/fb.h b/include/linux/fb.h
+index fe6ac95..1e36b03 100644
+--- a/include/linux/fb.h
++++ b/include/linux/fb.h
+@@ -219,6 +219,34 @@ struct fb_deferred_io {
+ };
+ #endif
+
++#ifdef __KERNEL__
++#ifdef CONFIG_COMPAT
++struct fb_image32 {
++ __u32 dx; /* Where to place image */
++ __u32 dy;
++ __u32 width; /* Size of image */
++ __u32 height;
++ __u32 fg_color; /* Only used when a mono bitmap */
++ __u32 bg_color;
++ __u8 depth; /* Depth of the image */
++ const compat_uptr_t data; /* Pointer to image data */
++ struct fb_cmap32 cmap; /* color map info */
++};
++
++#define fb_image_from_compat(to, from) \
++ (to).dx = (from).dx; \
++ (to).dy = (from).dy; \
++ (to).width = (from).width; \
++ (to).height = (from).height; \
++ (to).fg_color = (from).fg_color; \
++ (to).bg_color = (from).bg_color; \
++ (to).depth = (from).depth; \
++ (to).data = compat_ptr((from).data); \
++ fb_cmap_from_compat((to).cmap, (from).cmap)
++
++#endif /* CONFIG_COMPAT */
++#endif /* __KERNEL__ */
++
+ /*
+ * Frame buffer operations
+ *
+@@ -489,6 +517,9 @@ struct fb_info {
+ #define FBINFO_STATE_SUSPENDED 1
+ u32 state; /* Hardware state i.e suspend */
+ void *fbcon_par; /* fbcon use-only private area */
++
++ struct fb_image bgdecor;
++
+ /* From here on everything is device dependent */
+ void *par;
+ /* we need the PCI or similar aperture base/size not
+diff --git a/include/uapi/linux/fb.h b/include/uapi/linux/fb.h
+index fb795c3..dc77a03 100644
+--- a/include/uapi/linux/fb.h
++++ b/include/uapi/linux/fb.h
+@@ -8,6 +8,25 @@
+
+ #define FB_MAX 32 /* sufficient for now */
+
++struct fbcon_decor_iowrapper
++{
++ unsigned short vc; /* Virtual console */
++ unsigned char origin; /* Point of origin of the request */
++ void *data;
++};
++
++#ifdef __KERNEL__
++#ifdef CONFIG_COMPAT
++#include <linux/compat.h>
++struct fbcon_decor_iowrapper32
++{
++ unsigned short vc; /* Virtual console */
++ unsigned char origin; /* Point of origin of the request */
++ compat_uptr_t data;
++};
++#endif /* CONFIG_COMPAT */
++#endif /* __KERNEL__ */
++
+ /* ioctls
+ 0x46 is 'F' */
+ #define FBIOGET_VSCREENINFO 0x4600
+@@ -35,6 +54,25 @@
+ #define FBIOGET_DISPINFO 0x4618
+ #define FBIO_WAITFORVSYNC _IOW('F', 0x20, __u32)
+
++#define FBIOCONDECOR_SETCFG _IOWR('F', 0x19, struct fbcon_decor_iowrapper)
++#define FBIOCONDECOR_GETCFG _IOR('F', 0x1A, struct fbcon_decor_iowrapper)
++#define FBIOCONDECOR_SETSTATE _IOWR('F', 0x1B, struct fbcon_decor_iowrapper)
++#define FBIOCONDECOR_GETSTATE _IOR('F', 0x1C, struct fbcon_decor_iowrapper)
++#define FBIOCONDECOR_SETPIC _IOWR('F', 0x1D, struct fbcon_decor_iowrapper)
++#ifdef __KERNEL__
++#ifdef CONFIG_COMPAT
++#define FBIOCONDECOR_SETCFG32 _IOWR('F', 0x19, struct fbcon_decor_iowrapper32)
++#define FBIOCONDECOR_GETCFG32 _IOR('F', 0x1A, struct fbcon_decor_iowrapper32)
++#define FBIOCONDECOR_SETSTATE32 _IOWR('F', 0x1B, struct fbcon_decor_iowrapper32)
++#define FBIOCONDECOR_GETSTATE32 _IOR('F', 0x1C, struct fbcon_decor_iowrapper32)
++#define FBIOCONDECOR_SETPIC32 _IOWR('F', 0x1D, struct fbcon_decor_iowrapper32)
++#endif /* CONFIG_COMPAT */
++#endif /* __KERNEL__ */
++
++#define FBCON_DECOR_THEME_LEN 128 /* Maximum length of a theme name */
++#define FBCON_DECOR_IO_ORIG_KERNEL 0 /* Kernel ioctl origin */
++#define FBCON_DECOR_IO_ORIG_USER 1 /* User ioctl origin */
++
+ #define FB_TYPE_PACKED_PIXELS 0 /* Packed Pixels */
+ #define FB_TYPE_PLANES 1 /* Non interleaved planes */
+ #define FB_TYPE_INTERLEAVED_PLANES 2 /* Interleaved planes */
+@@ -277,6 +315,29 @@ struct fb_var_screeninfo {
+ __u32 reserved[4]; /* Reserved for future compatibility */
+ };
+
++#ifdef __KERNEL__
++#ifdef CONFIG_COMPAT
++struct fb_cmap32 {
++ __u32 start;
++ __u32 len; /* Number of entries */
++ compat_uptr_t red; /* Red values */
++ compat_uptr_t green;
++ compat_uptr_t blue;
++ compat_uptr_t transp; /* transparency, can be NULL */
++};
++
++#define fb_cmap_from_compat(to, from) \
++ (to).start = (from).start; \
++ (to).len = (from).len; \
++ (to).red = compat_ptr((from).red); \
++ (to).green = compat_ptr((from).green); \
++ (to).blue = compat_ptr((from).blue); \
++ (to).transp = compat_ptr((from).transp)
++
++#endif /* CONFIG_COMPAT */
++#endif /* __KERNEL__ */
++
++
+ struct fb_cmap {
+ __u32 start; /* First entry */
+ __u32 len; /* Number of entries */
+diff --git a/kernel/sysctl.c b/kernel/sysctl.c
+index 74f5b58..6386ab0 100644
+--- a/kernel/sysctl.c
++++ b/kernel/sysctl.c
+@@ -146,6 +146,10 @@ static const int cap_last_cap = CAP_LAST_CAP;
+ static unsigned long hung_task_timeout_max = (LONG_MAX/HZ);
+ #endif
+
++#ifdef CONFIG_FB_CON_DECOR
++extern char fbcon_decor_path[];
++#endif
++
+ #ifdef CONFIG_INOTIFY_USER
+ #include <linux/inotify.h>
+ #endif
+@@ -255,6 +259,15 @@ static struct ctl_table sysctl_base_table[] = {
+ .mode = 0555,
+ .child = dev_table,
+ },
++#ifdef CONFIG_FB_CON_DECOR
++ {
++ .procname = "fbcondecor",
++ .data = &fbcon_decor_path,
++ .maxlen = KMOD_PATH_LEN,
++ .mode = 0644,
++ .proc_handler = &proc_dostring,
++ },
++#endif
+ { }
+ };
+
diff --git a/4500_support-for-pogoplug-e02.patch b/4500_support-for-pogoplug-e02.patch
new file mode 100644
index 0000000..9f0becd
--- /dev/null
+++ b/4500_support-for-pogoplug-e02.patch
@@ -0,0 +1,172 @@
+diff --git a/arch/arm/configs/kirkwood_defconfig b/arch/arm/configs/kirkwood_defconfig
+index 0f2aa61..8c3146b 100644
+--- a/arch/arm/configs/kirkwood_defconfig
++++ b/arch/arm/configs/kirkwood_defconfig
+@@ -20,6 +20,7 @@ CONFIG_MACH_NET2BIG_V2=y
+ CONFIG_MACH_D2NET_V2=y
+ CONFIG_MACH_NET2BIG_V2=y
+ CONFIG_MACH_NET5BIG_V2=y
++CONFIG_MACH_POGO_E02=n
+ CONFIG_MACH_OPENRD_BASE=y
+ CONFIG_MACH_OPENRD_CLIENT=y
+ CONFIG_MACH_OPENRD_ULTIMATE=y
+diff --git a/arch/arm/mach-kirkwood/Kconfig b/arch/arm/mach-kirkwood/Kconfig
+index b634f96..cd7f289 100644
+--- a/arch/arm/mach-kirkwood/Kconfig
++++ b/arch/arm/mach-kirkwood/Kconfig
+@@ -62,6 +62,15 @@ config MACH_NETSPACE_V2
+ Say 'Y' here if you want your kernel to support the
+ LaCie Network Space v2 NAS.
+
++config MACH_POGO_E02
++ bool "CE Pogoplug E02"
++ default n
++ help
++ Say 'Y' here if you want your kernel to support the
++ CloudEngines Pogoplug e02. It differs from Marvell's
++ SheevaPlug Reference Board by a few details, but
++ especially in the LED assignments.
++
+ config MACH_OPENRD
+ bool
+
+diff --git a/arch/arm/mach-kirkwood/Makefile b/arch/arm/mach-kirkwood/Makefile
+index ac4cd75..dddbb40 100644
+--- a/arch/arm/mach-kirkwood/Makefile
++++ b/arch/arm/mach-kirkwood/Makefile
+@@ -2,6 +2,7 @@ obj-y += common.o irq.o pcie.o mpp.o
+ obj-$(CONFIG_MACH_D2NET_V2) += d2net_v2-setup.o lacie_v2-common.o
+ obj-$(CONFIG_MACH_NET2BIG_V2) += netxbig_v2-setup.o lacie_v2-common.o
+ obj-$(CONFIG_MACH_NET5BIG_V2) += netxbig_v2-setup.o lacie_v2-common.o
++obj-$(CONFIG_MACH_POGO_E02) += pogo_e02-setup.o
+ obj-$(CONFIG_MACH_OPENRD) += openrd-setup.o
+ obj-$(CONFIG_MACH_RD88F6192_NAS) += rd88f6192-nas-setup.o
+ obj-$(CONFIG_MACH_RD88F6281) += rd88f6281-setup.o
+diff --git a/arch/arm/mach-kirkwood/pogo_e02-setup.c b/arch/arm/mach-kirkwood/pogo_e02-setup.c
+new file mode 100644
+index 0000000..f57e8f7
+--- /dev/null
++++ b/arch/arm/mach-kirkwood/pogo_e02-setup.c
+@@ -0,0 +1,122 @@
++/*
++ * arch/arm/mach-kirkwood/pogo_e02-setup.c
++ *
++ * CloudEngines Pogoplug E02 support
++ *
++ * Copyright (C) 2013 Christoph Junghans <ottxor@gentoo.org>
++ * Based on a patch in Arch Linux for Arm by:
++ * Copyright (C) 2012 Kevin Mihelich <kevin@miheli.ch>
++ * and <pazos@lavabit.com>
++ *
++ * Based on the board file sheevaplug-setup.c
++ *
++ * This file is licensed under the terms of the GNU General Public
++ * License version 2. This program is licensed "as is" without any
++ * warranty of any kind, whether express or implied.
++ */
++
++#include <linux/kernel.h>
++#include <linux/init.h>
++#include <linux/platform_device.h>
++#include <linux/ata_platform.h>
++#include <linux/mtd/partitions.h>
++#include <linux/mv643xx_eth.h>
++#include <linux/gpio.h>
++#include <linux/leds.h>
++#include <asm/mach-types.h>
++#include <asm/mach/arch.h>
++#include <mach/kirkwood.h>
++#include "common.h"
++#include "mpp.h"
++
++static struct mtd_partition pogo_e02_nand_parts[] = {
++ {
++ .name = "u-boot",
++ .offset = 0,
++ .size = SZ_1M
++ }, {
++ .name = "uImage",
++ .offset = MTDPART_OFS_NXTBLK,
++ .size = SZ_4M
++ }, {
++ .name = "pogoplug",
++ .offset = MTDPART_OFS_NXTBLK,
++ .size = SZ_32M
++ }, {
++ .name = "root",
++ .offset = MTDPART_OFS_NXTBLK,
++ .size = MTDPART_SIZ_FULL
++ },
++};
++
++static struct mv643xx_eth_platform_data pogo_e02_ge00_data = {
++ .phy_addr = MV643XX_ETH_PHY_ADDR(0),
++};
++
++static struct gpio_led pogo_e02_led_pins[] = {
++ {
++ .name = "status:green:health",
++ .default_trigger = "default-on",
++ .gpio = 48,
++ .active_low = 1,
++ },
++ {
++ .name = "status:orange:fault",
++ .default_trigger = "none",
++ .gpio = 49,
++ .active_low = 1,
++ }
++};
++
++static struct gpio_led_platform_data pogo_e02_led_data = {
++ .leds = pogo_e02_led_pins,
++ .num_leds = ARRAY_SIZE(pogo_e02_led_pins),
++};
++
++static struct platform_device pogo_e02_leds = {
++ .name = "leds-gpio",
++ .id = -1,
++ .dev = {
++ .platform_data = &pogo_e02_led_data,
++ }
++};
++
++static unsigned int pogo_e02_mpp_config[] __initdata = {
++ MPP29_GPIO, /* USB Power Enable */
++ MPP48_GPIO, /* LED Green */
++ MPP49_GPIO, /* LED Orange */
++ 0
++};
++
++static void __init pogo_e02_init(void)
++{
++ /*
++ * Basic setup. Needs to be called early.
++ */
++ kirkwood_init();
++
++ /* setup gpio pin select */
++ kirkwood_mpp_conf(pogo_e02_mpp_config);
++
++ kirkwood_uart0_init();
++ kirkwood_nand_init(ARRAY_AND_SIZE(pogo_e02_nand_parts), 25);
++
++ if (gpio_request(29, "USB Power Enable") != 0 ||
++ gpio_direction_output(29, 1) != 0)
++ pr_err("can't set up GPIO 29 (USB Power Enable)\n");
++ kirkwood_ehci_init();
++
++ kirkwood_ge00_init(&pogo_e02_ge00_data);
++
++ platform_device_register(&pogo_e02_leds);
++}
++
++MACHINE_START(POGO_E02, "Pogoplug E02")
++ .atag_offset = 0x100,
++ .init_machine = pogo_e02_init,
++ .map_io = kirkwood_map_io,
++ .init_early = kirkwood_init_early,
++ .init_irq = kirkwood_init_irq,
++ .timer = &kirkwood_timer,
++ .restart = kirkwood_restart,
++MACHINE_END
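As a quick userspace sanity check (not part of the patches above), the FBIOCONDECOR request numbers that 4200_fbcondecor-3.15.patch adds to include/uapi/linux/fb.h can be decoded with the standard Linux `_IOC` helper macros. This is a sketch: the struct below mirrors the `fbcon_decor_iowrapper` definition from the patch rather than including the patched header, and it only inspects the request encoding — it does not talk to the /dev/fbcondecor device.

```c
/* Sketch: decode the fbcondecor ioctl request numbers in userspace.
 * The struct duplicates struct fbcon_decor_iowrapper as added to
 * include/uapi/linux/fb.h by the fbcondecor patch (an assumption:
 * we re-declare it here instead of including the patched header). */
#include <sys/ioctl.h>

struct fbcon_decor_iowrapper {
	unsigned short vc;    /* Virtual console */
	unsigned char origin; /* Point of origin of the request */
	void *data;
};

/* Same encoding as the patch: 'F' group, command 0x1C. */
#define FBIOCONDECOR_GETSTATE _IOR('F', 0x1C, struct fbcon_decor_iowrapper)

/* Extract the ioctl group byte ('F' = frame buffer ioctls). */
static inline unsigned int fbcondecor_ioc_type(unsigned long req)
{
	return _IOC_TYPE(req);
}

/* Extract the command number within the group. */
static inline unsigned int fbcondecor_ioc_nr(unsigned long req)
{
	return _IOC_NR(req);
}
```

Decoding the request this way confirms the patch keeps all decor ioctls in the existing frame-buffer 'F' group, directly after FBIO_WAITFORVSYNC's neighbors, with the argument size baked into the number as usual for `_IOR`/`_IOWR`.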
* [gentoo-commits] proj/linux-patches:3.16 commit in: /
@ 2014-07-15 12:23 Mike Pagano
0 siblings, 0 replies; 26+ messages in thread
From: Mike Pagano @ 2014-07-15 12:23 UTC (permalink / raw
To: gentoo-commits
commit: 3fe9f8aab7f5e1262afd9d1f45be1e3d0afe8ce9
Author: Mike Pagano <mpagano <AT> gentoo <DOT> org>
AuthorDate: Tue Jul 15 12:22:59 2014 +0000
Commit: Mike Pagano <mpagano <AT> gentoo <DOT> org>
CommitDate: Tue Jul 15 12:22:59 2014 +0000
URL: http://git.overlays.gentoo.org/gitweb/?p=proj/linux-patches.git;a=commit;h=3fe9f8aa
Kernel patch enables gcc optimizations for additional CPUs.
---
0000_README | 4 +
...able-additional-cpu-optimizations-for-gcc.patch | 327 +++++++++++++++++++++
2 files changed, 331 insertions(+)
diff --git a/0000_README b/0000_README
index 6276507..da7da0d 100644
--- a/0000_README
+++ b/0000_README
@@ -71,3 +71,7 @@ Patch: 4567_distro-Gentoo-Kconfig.patch
From: Tom Wijsman <TomWij@gentoo.org>
Desc: Add Gentoo Linux support config settings and defaults.
+Patch: 5000_enable-additional-cpu-optimizations-for-gcc.patch
+From: https://github.com/graysky2/kernel_gcc_patch/
+Desc: Kernel patch enables gcc optimizations for additional CPUs.
+
diff --git a/5000_enable-additional-cpu-optimizations-for-gcc.patch b/5000_enable-additional-cpu-optimizations-for-gcc.patch
new file mode 100644
index 0000000..f7ab6f0
--- /dev/null
+++ b/5000_enable-additional-cpu-optimizations-for-gcc.patch
@@ -0,0 +1,327 @@
+This patch has been tested on and known to work with kernel versions from 3.2
+up to the latest git version (pulled on 12/14/2013).
+
+This patch will expand the number of microarchitectures to include new
+processors including: AMD K10-family, AMD Family 10h (Barcelona), AMD Family
+14h (Bobcat), AMD Family 15h (Bulldozer), AMD Family 15h (Piledriver), AMD
+Family 16h (Jaguar), Intel 1st Gen Core i3/i5/i7 (Nehalem), Intel 2nd Gen Core
+i3/i5/i7 (Sandybridge), Intel 3rd Gen Core i3/i5/i7 (Ivybridge), and Intel 4th
+Gen Core i3/i5/i7 (Haswell). It also offers the compiler the 'native' flag.
+
+Small but real speed increases are measurable using a make endpoint comparing
+a generic kernel to one built with one of the respective microarchs.
+
+See the following experimental evidence supporting this statement:
+https://github.com/graysky2/kernel_gcc_patch
+
+REQUIREMENTS
+linux version >=3.15
+gcc version <4.9
+
+---
+diff -uprN a/arch/x86/include/asm/module.h b/arch/x86/include/asm/module.h
+--- a/arch/x86/include/asm/module.h 2013-11-03 18:41:51.000000000 -0500
++++ b/arch/x86/include/asm/module.h 2013-12-15 06:21:24.351122516 -0500
+@@ -15,6 +15,16 @@
+ #define MODULE_PROC_FAMILY "586MMX "
+ #elif defined CONFIG_MCORE2
+ #define MODULE_PROC_FAMILY "CORE2 "
++#elif defined CONFIG_MNATIVE
++#define MODULE_PROC_FAMILY "NATIVE "
++#elif defined CONFIG_MCOREI7
++#define MODULE_PROC_FAMILY "COREI7 "
++#elif defined CONFIG_MCOREI7AVX
++#define MODULE_PROC_FAMILY "COREI7AVX "
++#elif defined CONFIG_MCOREAVXI
++#define MODULE_PROC_FAMILY "COREAVXI "
++#elif defined CONFIG_MCOREAVX2
++#define MODULE_PROC_FAMILY "COREAVX2 "
+ #elif defined CONFIG_MATOM
+ #define MODULE_PROC_FAMILY "ATOM "
+ #elif defined CONFIG_M686
+@@ -33,6 +43,18 @@
+ #define MODULE_PROC_FAMILY "K7 "
+ #elif defined CONFIG_MK8
+ #define MODULE_PROC_FAMILY "K8 "
++#elif defined CONFIG_MK10
++#define MODULE_PROC_FAMILY "K10 "
++#elif defined CONFIG_MBARCELONA
++#define MODULE_PROC_FAMILY "BARCELONA "
++#elif defined CONFIG_MBOBCAT
++#define MODULE_PROC_FAMILY "BOBCAT "
++#elif defined CONFIG_MBULLDOZER
++#define MODULE_PROC_FAMILY "BULLDOZER "
++#elif defined CONFIG_MPILEDRIVER
++#define MODULE_PROC_FAMILY "PILEDRIVER "
++#elif defined CONFIG_MJAGUAR
++#define MODULE_PROC_FAMILY "JAGUAR "
+ #elif defined CONFIG_MELAN
+ #define MODULE_PROC_FAMILY "ELAN "
+ #elif defined CONFIG_MCRUSOE
+diff -uprN a/arch/x86/Kconfig.cpu b/arch/x86/Kconfig.cpu
+--- a/arch/x86/Kconfig.cpu 2013-11-03 18:41:51.000000000 -0500
++++ b/arch/x86/Kconfig.cpu 2013-12-15 06:21:24.351122516 -0500
+@@ -139,7 +139,7 @@ config MPENTIUM4
+
+
+ config MK6
+- bool "K6/K6-II/K6-III"
++ bool "AMD K6/K6-II/K6-III"
+ depends on X86_32
+ ---help---
+ Select this for an AMD K6-family processor. Enables use of
+@@ -147,7 +147,7 @@ config MK6
+ flags to GCC.
+
+ config MK7
+- bool "Athlon/Duron/K7"
++ bool "AMD Athlon/Duron/K7"
+ depends on X86_32
+ ---help---
+ Select this for an AMD Athlon K7-family processor. Enables use of
+@@ -155,12 +155,55 @@ config MK7
+ flags to GCC.
+
+ config MK8
+- bool "Opteron/Athlon64/Hammer/K8"
++ bool "AMD Opteron/Athlon64/Hammer/K8"
+ ---help---
+ Select this for an AMD Opteron or Athlon64 Hammer-family processor.
+ Enables use of some extended instructions, and passes appropriate
+ optimization flags to GCC.
+
++config MK10
++ bool "AMD 61xx/7x50/PhenomX3/X4/II/K10"
++ ---help---
++ Select this for an AMD 61xx Eight-Core Magny-Cours, Athlon X2 7x50,
++ Phenom X3/X4/II, Athlon II X2/X3/X4, or Turion II-family processor.
++ Enables use of some extended instructions, and passes appropriate
++ optimization flags to GCC.
++
++config MBARCELONA
++ bool "AMD Barcelona"
++ ---help---
++ Select this for AMD Barcelona and newer processors.
++
++ Enables -march=barcelona
++
++config MBOBCAT
++ bool "AMD Bobcat"
++ ---help---
++ Select this for AMD Bobcat processors.
++
++ Enables -march=btver1
++
++config MBULLDOZER
++ bool "AMD Bulldozer"
++ ---help---
++ Select this for AMD Bulldozer processors.
++
++ Enables -march=bdver1
++
++config MPILEDRIVER
++ bool "AMD Piledriver"
++ ---help---
++ Select this for AMD Piledriver processors.
++
++ Enables -march=bdver2
++
++config MJAGUAR
++ bool "AMD Jaguar"
++ ---help---
++ Select this for AMD Jaguar processors.
++
++ Enables -march=btver2
++
+ config MCRUSOE
+ bool "Crusoe"
+ depends on X86_32
+@@ -251,8 +294,17 @@ config MPSC
+ using the cpu family field
+ in /proc/cpuinfo. Family 15 is an older Xeon, Family 6 a newer one.
+
++config MATOM
++ bool "Intel Atom"
++ ---help---
++
++ Select this for the Intel Atom platform. Intel Atom CPUs have an
++ in-order pipelining architecture and thus can benefit from
++ accordingly optimized code. Use a recent GCC with specific Atom
++ support in order to fully benefit from selecting this option.
++
+ config MCORE2
+- bool "Core 2/newer Xeon"
++ bool "Intel Core 2"
+ ---help---
+
+ Select this for Intel Core 2 and newer Core 2 Xeons (Xeon 51xx and
+@@ -260,14 +312,40 @@ config MCORE2
+ family in /proc/cpuinfo. Newer ones have 6 and older ones 15
+ (not a typo)
+
+-config MATOM
+- bool "Intel Atom"
++ Enables -march=core2
++
++config MCOREI7
++ bool "Intel Core i7"
+ ---help---
+
+- Select this for the Intel Atom platform. Intel Atom CPUs have an
+- in-order pipelining architecture and thus can benefit from
+- accordingly optimized code. Use a recent GCC with specific Atom
+- support in order to fully benefit from selecting this option.
++ Select this for the Intel Nehalem platform. Intel Nehalem processors
++ include Core i3, i5, i7, Xeon: 34xx, 35xx, 55xx, 56xx, 75xx processors.
++
++ Enables -march=corei7
++
++config MCOREI7AVX
++ bool "Intel Core 2nd Gen AVX"
++ ---help---
++
++ Select this for 2nd Gen Core processors including Sandy Bridge.
++
++ Enables -march=corei7-avx
++
++config MCOREAVXI
++ bool "Intel Core 3rd Gen AVX"
++ ---help---
++
++ Select this for 3rd Gen Core processors including Ivy Bridge.
++
++ Enables -march=core-avx-i
++
++config MCOREAVX2
++ bool "Intel Core AVX2"
++ ---help---
++
++ Select this for AVX2 enabled processors including Haswell.
++
++ Enables -march=core-avx2
+
+ config GENERIC_CPU
+ bool "Generic-x86-64"
+@@ -276,6 +354,19 @@ config GENERIC_CPU
+ Generic x86-64 CPU.
+ Run equally well on all x86-64 CPUs.
+
++config MNATIVE
++ bool "Native optimizations autodetected by GCC"
++ ---help---
++
++ GCC 4.2 and above support -march=native, which automatically detects
++ the optimum settings to use based on your processor. -march=native
++ also detects and applies additional settings beyond -march specific
++ to your CPU (e.g. -msse4). Unless you have a specific reason not to
++ (e.g. distcc cross-compiling), you should probably be using
++ -march=native rather than anything listed below.
++
++ Enables -march=native
++
+ endchoice
+
+ config X86_GENERIC
+@@ -300,7 +391,7 @@ config X86_INTERNODE_CACHE_SHIFT
+ config X86_L1_CACHE_SHIFT
+ int
+ default "7" if MPENTIUM4 || MPSC
+- default "6" if MK7 || MK8 || MPENTIUMM || MCORE2 || MATOM || MVIAC7 || X86_GENERIC || GENERIC_CPU
++ default "6" if MK7 || MK8 || MK10 || MBARCELONA || MBOBCAT || MBULLDOZER || MPILEDRIVER || MJAGUAR || MPENTIUMM || MCORE2 || MCOREI7 || MCOREI7AVX || MCOREAVXI || MCOREAVX2 || MATOM || MVIAC7 || X86_GENERIC || MNATIVE || GENERIC_CPU
+ default "4" if MELAN || M486 || MGEODEGX1
+ default "5" if MWINCHIP3D || MWINCHIPC6 || MCRUSOE || MEFFICEON || MCYRIXIII || MK6 || MPENTIUMIII || MPENTIUMII || M686 || M586MMX || M586TSC || M586 || MVIAC3_2 || MGEODE_LX
+
+@@ -331,11 +422,11 @@ config X86_ALIGNMENT_16
+
+ config X86_INTEL_USERCOPY
+ def_bool y
+- depends on MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M586MMX || X86_GENERIC || MK8 || MK7 || MEFFICEON || MCORE2
++ depends on MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M586MMX || MNATIVE || X86_GENERIC || MK8 || MK7 || MK10 || MBARCELONA || MEFFICEON || MCORE2 || MCOREI7 || MCOREI7AVX || MCOREAVXI || MCOREAVX2
+
+ config X86_USE_PPRO_CHECKSUM
+ def_bool y
+- depends on MWINCHIP3D || MWINCHIPC6 || MCYRIXIII || MK7 || MK6 || MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M686 || MK8 || MVIAC3_2 || MVIAC7 || MEFFICEON || MGEODE_LX || MCORE2 || MATOM
++ depends on MWINCHIP3D || MWINCHIPC6 || MCYRIXIII || MK7 || MK6 || MK10 || MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M686 || MK8 || MVIAC3_2 || MVIAC7 || MEFFICEON || MGEODE_LX || MCORE2 || MCOREI7 || MCOREI7AVX || MCOREAVXI || MCOREAVX2 || MATOM || MNATIVE
+
+ config X86_USE_3DNOW
+ def_bool y
+@@ -363,17 +454,17 @@ config X86_P6_NOP
+
+ config X86_TSC
+ def_bool y
+- depends on (MWINCHIP3D || MCRUSOE || MEFFICEON || MCYRIXIII || MK7 || MK6 || MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M686 || M586MMX || M586TSC || MK8 || MVIAC3_2 || MVIAC7 || MGEODEGX1 || MGEODE_LX || MCORE2 || MATOM) || X86_64
++ depends on (MWINCHIP3D || MCRUSOE || MEFFICEON || MCYRIXIII || MK7 || MK6 || MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M686 || M586MMX || M586TSC || MK8 || MK10 || MBARCELONA || MBOBCAT || MBULLDOZER || MPILEDRIVER || MJAGUAR || MVIAC3_2 || MVIAC7 || MGEODEGX1 || MGEODE_LX || MCORE2 || MCOREI7 || MCOREI7AVX || MATOM) || X86_64 || MNATIVE
+
+ config X86_CMPXCHG64
+ def_bool y
+- depends on X86_PAE || X86_64 || MCORE2 || MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M686 || MATOM
++ depends on X86_PAE || X86_64 || MCORE2 || MCOREI7 || MCOREI7AVX || MCOREAVXI || MCOREAVX2 || MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M686 || MATOM || MNATIVE
+
+ # this should be set for all -march=.. options where the compiler
+ # generates cmov.
+ config X86_CMOV
+ def_bool y
+- depends on (MK8 || MK7 || MCORE2 || MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M686 || MVIAC3_2 || MVIAC7 || MCRUSOE || MEFFICEON || X86_64 || MATOM || MGEODE_LX)
++ depends on (MK8 || MK10 || MBARCELONA || MBOBCAT || MBULLDOZER || MPILEDRIVER || MJAGUAR || MK7 || MCORE2 || MCOREI7 || MCOREI7AVX || MCOREAVXI || MCOREAVX2 || MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M686 || MVIAC3_2 || MVIAC7 || MCRUSOE || MEFFICEON || X86_64 || MNATIVE || MATOM || MGEODE_LX)
+
+ config X86_MINIMUM_CPU_FAMILY
+ int
+diff -uprN a/arch/x86/Makefile b/arch/x86/Makefile
+--- a/arch/x86/Makefile 2013-11-03 18:41:51.000000000 -0500
++++ b/arch/x86/Makefile 2013-12-15 06:21:24.354455723 -0500
+@@ -61,11 +61,26 @@ else
+ KBUILD_CFLAGS += $(call cc-option,-mno-sse -mpreferred-stack-boundary=3)
+
+ # FIXME - should be integrated in Makefile.cpu (Makefile_32.cpu)
++ cflags-$(CONFIG_MNATIVE) += $(call cc-option,-march=native)
+ cflags-$(CONFIG_MK8) += $(call cc-option,-march=k8)
++ cflags-$(CONFIG_MK10) += $(call cc-option,-march=amdfam10)
++ cflags-$(CONFIG_MBARCELONA) += $(call cc-option,-march=barcelona)
++ cflags-$(CONFIG_MBOBCAT) += $(call cc-option,-march=btver1)
++ cflags-$(CONFIG_MBULLDOZER) += $(call cc-option,-march=bdver1)
++ cflags-$(CONFIG_MPILEDRIVER) += $(call cc-option,-march=bdver2)
++ cflags-$(CONFIG_MJAGUAR) += $(call cc-option,-march=btver2)
+ cflags-$(CONFIG_MPSC) += $(call cc-option,-march=nocona)
+
+ cflags-$(CONFIG_MCORE2) += \
+- $(call cc-option,-march=core2,$(call cc-option,-mtune=generic))
++ $(call cc-option,-march=core2,$(call cc-option,-mtune=core2))
++ cflags-$(CONFIG_MCOREI7) += \
++ $(call cc-option,-march=corei7,$(call cc-option,-mtune=corei7))
++ cflags-$(CONFIG_MCOREI7AVX) += \
++ $(call cc-option,-march=corei7-avx,$(call cc-option,-mtune=corei7-avx))
++ cflags-$(CONFIG_MCOREAVXI) += \
++ $(call cc-option,-march=core-avx-i,$(call cc-option,-mtune=core-avx-i))
++ cflags-$(CONFIG_MCOREAVX2) += \
++ $(call cc-option,-march=core-avx2,$(call cc-option,-mtune=core-avx2))
+ cflags-$(CONFIG_MATOM) += $(call cc-option,-march=atom) \
+ $(call cc-option,-mtune=atom,$(call cc-option,-mtune=generic))
+ cflags-$(CONFIG_GENERIC_CPU) += $(call cc-option,-mtune=generic)
+diff -uprN a/arch/x86/Makefile_32.cpu b/arch/x86/Makefile_32.cpu
+--- a/arch/x86/Makefile_32.cpu 2013-11-03 18:41:51.000000000 -0500
++++ b/arch/x86/Makefile_32.cpu 2013-12-15 06:21:24.354455723 -0500
+@@ -23,7 +23,14 @@ cflags-$(CONFIG_MK6) += -march=k6
+ # Please note, that patches that add -march=athlon-xp and friends are pointless.
+ # They make zero difference whatsosever to performance at this time.
+ cflags-$(CONFIG_MK7) += -march=athlon
++cflags-$(CONFIG_MNATIVE) += $(call cc-option,-march=native)
+ cflags-$(CONFIG_MK8) += $(call cc-option,-march=k8,-march=athlon)
++cflags-$(CONFIG_MK10) += $(call cc-option,-march=amdfam10,-march=athlon)
++cflags-$(CONFIG_MBARCELONA) += $(call cc-option,-march=barcelona,-march=athlon)
++cflags-$(CONFIG_MBOBCAT) += $(call cc-option,-march=btver1,-march=athlon)
++cflags-$(CONFIG_MBULLDOZER) += $(call cc-option,-march=bdver1,-march=athlon)
++cflags-$(CONFIG_MPILEDRIVER) += $(call cc-option,-march=bdver2,-march=athlon)
++cflags-$(CONFIG_MJAGUAR) += $(call cc-option,-march=btver2,-march=athlon)
+ cflags-$(CONFIG_MCRUSOE) += -march=i686 $(align)-functions=0 $(align)-jumps=0 $(align)-loops=0
+ cflags-$(CONFIG_MEFFICEON) += -march=i686 $(call tune,pentium3) $(align)-functions=0 $(align)-jumps=0 $(align)-loops=0
+ cflags-$(CONFIG_MWINCHIPC6) += $(call cc-option,-march=winchip-c6,-march=i586)
+@@ -32,6 +39,10 @@ cflags-$(CONFIG_MCYRIXIII) += $(call cc-
+ cflags-$(CONFIG_MVIAC3_2) += $(call cc-option,-march=c3-2,-march=i686)
+ cflags-$(CONFIG_MVIAC7) += -march=i686
+ cflags-$(CONFIG_MCORE2) += -march=i686 $(call tune,core2)
++cflags-$(CONFIG_MCOREI7) += -march=i686 $(call tune,corei7)
++cflags-$(CONFIG_MCOREI7AVX) += -march=i686 $(call tune,corei7-avx)
++cflags-$(CONFIG_MCOREAVXI) += -march=i686 $(call tune,core-avx-i)
++cflags-$(CONFIG_MCOREAVX2) += -march=i686 $(call tune,core-avx2)
+ cflags-$(CONFIG_MATOM) += $(call cc-option,-march=atom,$(call cc-option,-march=core2,-march=i686)) \
+ $(call cc-option,-mtune=atom,$(call cc-option,-mtune=generic))
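The `$(call cc-option,<flag>,<fallback>)` pattern used throughout the hunks above lets Kbuild fall back gracefully when the installed compiler rejects a given `-march` value. A minimal stand-alone sketch of that probe (the `cc_option` function name and the `$CC` handling here are illustrative, not Kbuild's actual implementation):

```shell
# Probe whether the compiler accepts a flag; print it if so, else print
# the fallback (which may be empty). Mirrors the Kbuild cc-option idiom.
cc_option() {
    # $1 = flag to try, $2 = fallback
    if ${CC:-cc} "$1" -E -x c /dev/null >/dev/null 2>&1; then
        printf '%s\n' "$1"
    else
        printf '%s\n' "$2"
    fi
}

# e.g. prefer -march=btver2 (Jaguar), fall back to -march=athlon
# on compilers too old to know btver2:
MARCH=$(cc_option -march=btver2 -march=athlon)
```

Probing with a preprocessor-only run (`-E`) keeps the test cheap: no object file is produced, only the flag parsing is exercised.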
* [gentoo-commits] proj/linux-patches:3.16 commit in: /
@ 2014-08-08 19:48 Mike Pagano
2014-08-19 11:44 ` Mike Pagano
0 siblings, 1 reply; 26+ messages in thread
From: Mike Pagano @ 2014-08-08 19:48 UTC (permalink / raw
To: gentoo-commits
commit: 9df8c18cd85acf5655794c6de5da3a0690675965
Author: Mike Pagano <mpagano <AT> gentoo <DOT> org>
AuthorDate: Fri Aug 8 19:48:09 2014 +0000
Commit: Mike Pagano <mpagano <AT> gentoo <DOT> org>
CommitDate: Fri Aug 8 19:48:09 2014 +0000
URL: http://git.overlays.gentoo.org/gitweb/?p=proj/linux-patches.git;a=commit;h=9df8c18c
BFQ patch for 3.16
---
0000_README | 11 +
...-cgroups-kconfig-build-bits-for-v7r5-3.16.patch | 104 +
...ck-introduce-the-v7r5-I-O-sched-for-3.16.patch1 | 6635 ++++++++++++++++++++
...add-Early-Queue-Merge-EQM-v7r5-for-3.16.0.patch | 1188 ++++
4 files changed, 7938 insertions(+)
diff --git a/0000_README b/0000_README
index da7da0d..a6ec2e6 100644
--- a/0000_README
+++ b/0000_README
@@ -75,3 +75,14 @@ Patch: 5000_enable-additional-cpu-optimizations-for-gcc.patch
From: https://github.com/graysky2/kernel_gcc_patch/
Desc: Kernel patch enables gcc optimizations for additional CPUs.
+Patch: 5001_BFQ-1-block-cgroups-kconfig-build-bits-for-v7r5-3.16.patch
+From: http://algo.ing.unimo.it/people/paolo/disk_sched/
+Desc: BFQ v7r5 patch 1 for 3.16: Build, cgroups and kconfig bits
+
+Patch: 5002_BFQ-2-block-introduce-the-v7r5-I-O-sched-for-3.16.patch1
+From: http://algo.ing.unimo.it/people/paolo/disk_sched/
+Desc: BFQ v7r5 patch 2 for 3.16: BFQ Scheduler
+
+Patch: 5003_BFQ-3-block-add-Early-Queue-Merge-EQM-v7r5-for-3.16.0.patch
+From: http://algo.ing.unimo.it/people/paolo/disk_sched/
+Desc: BFQ v7r5 patch 3 for 3.16: Early Queue Merge (EQM)
diff --git a/5001_BFQ-1-block-cgroups-kconfig-build-bits-for-v7r5-3.16.patch b/5001_BFQ-1-block-cgroups-kconfig-build-bits-for-v7r5-3.16.patch
new file mode 100644
index 0000000..088bd05
--- /dev/null
+++ b/5001_BFQ-1-block-cgroups-kconfig-build-bits-for-v7r5-3.16.patch
@@ -0,0 +1,104 @@
+From 6519e5beef1063a86d3fc917cff2592cb599e824 Mon Sep 17 00:00:00 2001
+From: Paolo Valente <paolo.valente@unimore.it>
+Date: Thu, 22 May 2014 11:59:35 +0200
+Subject: [PATCH 1/3] block: cgroups, kconfig, build bits for BFQ-v7r5-3.16
+
+Update Kconfig.iosched and do the related Makefile changes to include
+kernel configuration options for BFQ. Also add the bfqio controller
+to the cgroups subsystem.
+
+Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
+Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
+---
+ block/Kconfig.iosched | 32 ++++++++++++++++++++++++++++++++
+ block/Makefile | 1 +
+ include/linux/cgroup_subsys.h | 4 ++++
+ 3 files changed, 37 insertions(+)
+
+diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
+index 421bef9..0ee5f0f 100644
+--- a/block/Kconfig.iosched
++++ b/block/Kconfig.iosched
+@@ -39,6 +39,27 @@ config CFQ_GROUP_IOSCHED
+ ---help---
+ Enable group IO scheduling in CFQ.
+
++config IOSCHED_BFQ
++ tristate "BFQ I/O scheduler"
++ default n
++ ---help---
++ The BFQ I/O scheduler tries to distribute bandwidth among
++ all processes according to their weights.
++ It aims at distributing the bandwidth as desired, independently of
++ the disk parameters and with any workload. It also tries to
++ guarantee low latency to interactive and soft real-time
++ applications. If compiled built-in (saying Y here), BFQ can
++ be configured to support hierarchical scheduling.
++
++config CGROUP_BFQIO
++ bool "BFQ hierarchical scheduling support"
++ depends on CGROUPS && IOSCHED_BFQ=y
++ default n
++ ---help---
++ Enable hierarchical scheduling in BFQ, using the cgroups
++ filesystem interface. The name of the subsystem will be
++ bfqio.
++
+ choice
+ prompt "Default I/O scheduler"
+ default DEFAULT_CFQ
+@@ -52,6 +73,16 @@ choice
+ config DEFAULT_CFQ
+ bool "CFQ" if IOSCHED_CFQ=y
+
++ config DEFAULT_BFQ
++ bool "BFQ" if IOSCHED_BFQ=y
++ help
++ Selects BFQ as the default I/O scheduler, used for all
++ block devices.
++ The BFQ I/O scheduler aims at distributing the bandwidth
++ as desired, independently of the disk parameters and with
++ any workload. It also tries to guarantee low latency to
++ interactive and soft real-time applications.
++
+ config DEFAULT_NOOP
+ bool "No-op"
+
+@@ -61,6 +92,7 @@ config DEFAULT_IOSCHED
+ string
+ default "deadline" if DEFAULT_DEADLINE
+ default "cfq" if DEFAULT_CFQ
++ default "bfq" if DEFAULT_BFQ
+ default "noop" if DEFAULT_NOOP
+
+ endmenu
+diff --git a/block/Makefile b/block/Makefile
+index a2ce6ac..a0fc06a 100644
+--- a/block/Makefile
++++ b/block/Makefile
+@@ -18,6 +18,7 @@ obj-$(CONFIG_BLK_DEV_THROTTLING) += blk-throttle.o
+ obj-$(CONFIG_IOSCHED_NOOP) += noop-iosched.o
+ obj-$(CONFIG_IOSCHED_DEADLINE) += deadline-iosched.o
+ obj-$(CONFIG_IOSCHED_CFQ) += cfq-iosched.o
++obj-$(CONFIG_IOSCHED_BFQ) += bfq-iosched.o
+
+ obj-$(CONFIG_BLOCK_COMPAT) += compat_ioctl.o
+ obj-$(CONFIG_BLK_DEV_INTEGRITY) += blk-integrity.o
+diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
+index 98c4f9b..13b010d 100644
+--- a/include/linux/cgroup_subsys.h
++++ b/include/linux/cgroup_subsys.h
+@@ -35,6 +35,10 @@ SUBSYS(net_cls)
+ SUBSYS(blkio)
+ #endif
+
++#if IS_ENABLED(CONFIG_CGROUP_BFQIO)
++SUBSYS(bfqio)
++#endif
++
+ #if IS_ENABLED(CONFIG_CGROUP_PERF)
+ SUBSYS(perf_event)
+ #endif
+--
+2.0.3
+
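With the Kconfig and Makefile bits above in place, the resulting scheduler can also be inspected and switched per device at runtime through sysfs. A hedged sketch (the `sda` device name is illustrative, and `$SYSFS` is overridable here purely so the functions can be exercised outside a real `/sys`; the `/sys/block/<dev>/queue/scheduler` path itself is the standard interface):

```shell
# Per-device I/O scheduler selection via the standard sysfs interface.
# SYSFS defaults to /sys; overriding it is only for illustration/testing.
SYSFS=${SYSFS:-/sys}

get_iosched() {
    # $1 = block device name; prints e.g. "noop deadline [cfq] bfq"
    cat "$SYSFS/block/$1/queue/scheduler"
}

set_iosched() {
    # $1 = block device name, $2 = scheduler (e.g. bfq, cfq, deadline, noop)
    echo "$2" > "$SYSFS/block/$1/queue/scheduler"
}

# e.g. (as root):  set_iosched sda bfq
```

Reading the file back shows the active scheduler in square brackets; writing a name only succeeds if the corresponding `IOSCHED_*` option was built in or loaded as a module.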
diff --git a/5002_BFQ-2-block-introduce-the-v7r5-I-O-sched-for-3.16.patch1 b/5002_BFQ-2-block-introduce-the-v7r5-I-O-sched-for-3.16.patch1
new file mode 100644
index 0000000..6f630ba
--- /dev/null
+++ b/5002_BFQ-2-block-introduce-the-v7r5-I-O-sched-for-3.16.patch1
@@ -0,0 +1,6635 @@
+From c56e6c5db41f7137d3e0b38063ef0c944eec1898 Mon Sep 17 00:00:00 2001
+From: Paolo Valente <paolo.valente@unimore.it>
+Date: Thu, 9 May 2013 19:10:02 +0200
+Subject: [PATCH 2/3] block: introduce the BFQ-v7r5 I/O sched for 3.16
+
+Add the BFQ-v7r5 I/O scheduler to 3.16.
+The general structure is borrowed from CFQ, as is much of the code for
+handling I/O contexts. Over time, several useful features have been
+ported from CFQ as well (details in the changelog in README.BFQ). A
+(bfq_)queue is associated to each task doing I/O on a device, and each
+time a scheduling decision has to be made a queue is selected and served
+until it expires.
+
+ - Slices are given in the service domain: tasks are assigned
+ budgets, measured in number of sectors. Once it is granted the disk,
+ a task must however consume its assigned budget within a configurable
+ maximum time (by default, the maximum possible budget value is
+ computed automatically to comply with this timeout).
+ This allows the desired latency vs "throughput boosting" tradeoff
+ to be set.
+
+ - Budgets are scheduled according to a variant of WF2Q+, implemented
+ using an augmented rb-tree to take eligibility into account while
+ preserving an O(log N) overall complexity.
+
+ - A low-latency tunable is provided; if enabled, both interactive
+ and soft real-time applications are guaranteed a very low latency.
+
+ - Latency guarantees are preserved also in the presence of NCQ.
+
+ - Also with flash-based devices, a high throughput is achieved
+ while still preserving latency guarantees.
+
+ - BFQ features Early Queue Merge (EQM), a sort of fusion of the
+ cooperating-queue-merging and the preemption mechanisms present
+ in CFQ. EQM is in fact a unified mechanism that tries to get a
+ sequential read pattern, and hence a high throughput, with any
+ set of processes performing interleaved I/O over a contiguous
+ sequence of sectors.
+
+ - BFQ supports full hierarchical scheduling, exporting a cgroups
+ interface. Since each node has a full scheduler, each group can
+ be assigned its own weight.
+
+ - If the cgroups interface is not used, only I/O priorities can be
+ assigned to processes, with ioprio values mapped to weights
+ with the relation weight = IOPRIO_BE_NR - ioprio.
+
+ - ioprio classes are served in strict priority order, i.e., lower
+ priority queues are not served as long as there are higher
+ priority queues. Among queues in the same class the bandwidth is
+ distributed in proportion to the weight of each queue. A very
+ thin extra bandwidth is however guaranteed to the Idle class, to
+ prevent it from starving.
+
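The ioprio-to-weight relation stated above (weight = IOPRIO_BE_NR - ioprio, where IOPRIO_BE_NR is 8 in the kernel's ioprio definitions) can be sketched as a toy illustration; this is not BFQ's in-kernel helper, just the arithmetic from the commit message:

```shell
# Toy illustration of the mapping described in the commit message.
# IOPRIO_BE_NR is 8; a lower ioprio value means higher priority,
# which maps to a larger BFQ weight.
IOPRIO_BE_NR=8

ioprio_to_weight() {
    # $1 = best-effort ioprio (0..7)
    echo $((IOPRIO_BE_NR - $1))
}

ioprio_to_weight 0   # highest best-effort priority -> weight 8
ioprio_to_weight 7   # lowest  best-effort priority -> weight 1
```

So when the cgroups interface is not used, the eight best-effort ioprio levels collapse onto weights 8 down to 1, and bandwidth within a class is shared in proportion to those weights.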
+Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
+Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
+---
+ block/bfq-cgroup.c | 930 +++++++++++++
+ block/bfq-ioc.c | 36 +
+ block/bfq-iosched.c | 3617 +++++++++++++++++++++++++++++++++++++++++++++++++++
+ block/bfq-sched.c | 1207 +++++++++++++++++
+ block/bfq.h | 742 +++++++++++
+ 5 files changed, 6532 insertions(+)
+ create mode 100644 block/bfq-cgroup.c
+ create mode 100644 block/bfq-ioc.c
+ create mode 100644 block/bfq-iosched.c
+ create mode 100644 block/bfq-sched.c
+ create mode 100644 block/bfq.h
+
+diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c
+new file mode 100644
+index 0000000..f742806
+--- /dev/null
++++ b/block/bfq-cgroup.c
+@@ -0,0 +1,930 @@
++/*
++ * BFQ: CGROUPS support.
++ *
++ * Based on ideas and code from CFQ:
++ * Copyright (C) 2003 Jens Axboe <axboe@kernel.dk>
++ *
++ * Copyright (C) 2008 Fabio Checconi <fabio@gandalf.sssup.it>
++ * Paolo Valente <paolo.valente@unimore.it>
++ *
++ * Copyright (C) 2010 Paolo Valente <paolo.valente@unimore.it>
++ *
++ * Licensed under the GPL-2 as detailed in the accompanying COPYING.BFQ
++ * file.
++ */
++
++#ifdef CONFIG_CGROUP_BFQIO
++
++static DEFINE_MUTEX(bfqio_mutex);
++
++static bool bfqio_is_removed(struct bfqio_cgroup *bgrp)
++{
++ return bgrp ? !bgrp->online : false;
++}
++
++static struct bfqio_cgroup bfqio_root_cgroup = {
++ .weight = BFQ_DEFAULT_GRP_WEIGHT,
++ .ioprio = BFQ_DEFAULT_GRP_IOPRIO,
++ .ioprio_class = BFQ_DEFAULT_GRP_CLASS,
++};
++
++static inline void bfq_init_entity(struct bfq_entity *entity,
++ struct bfq_group *bfqg)
++{
++ entity->weight = entity->new_weight;
++ entity->orig_weight = entity->new_weight;
++ entity->ioprio = entity->new_ioprio;
++ entity->ioprio_class = entity->new_ioprio_class;
++ entity->parent = bfqg->my_entity;
++ entity->sched_data = &bfqg->sched_data;
++}
++
++static struct bfqio_cgroup *css_to_bfqio(struct cgroup_subsys_state *css)
++{
++ return css ? container_of(css, struct bfqio_cgroup, css) : NULL;
++}
++
++/*
++ * Search bgrp's hash table (for now just a list) for the bfq_group
++ * associated with bfqd. Must be called under rcu_read_lock().
++ */
++static struct bfq_group *bfqio_lookup_group(struct bfqio_cgroup *bgrp,
++ struct bfq_data *bfqd)
++{
++ struct bfq_group *bfqg;
++ void *key;
++
++ hlist_for_each_entry_rcu(bfqg, &bgrp->group_data, group_node) {
++ key = rcu_dereference(bfqg->bfqd);
++ if (key == bfqd)
++ return bfqg;
++ }
++
++ return NULL;
++}
++
++static inline void bfq_group_init_entity(struct bfqio_cgroup *bgrp,
++ struct bfq_group *bfqg)
++{
++ struct bfq_entity *entity = &bfqg->entity;
++
++ /*
++ * If the weight of the entity has never been set via the sysfs
++ * interface, then bgrp->weight == 0. In this case we initialize
++ * the weight from the current ioprio value. Otherwise, the group
++ * weight, if set, has priority over the ioprio value.
++ */
++ if (bgrp->weight == 0) {
++ entity->new_weight = bfq_ioprio_to_weight(bgrp->ioprio);
++ entity->new_ioprio = bgrp->ioprio;
++ } else {
++ entity->new_weight = bgrp->weight;
++ entity->new_ioprio = bfq_weight_to_ioprio(bgrp->weight);
++ }
++ entity->orig_weight = entity->weight = entity->new_weight;
++ entity->ioprio = entity->new_ioprio;
++ entity->ioprio_class = entity->new_ioprio_class = bgrp->ioprio_class;
++ entity->my_sched_data = &bfqg->sched_data;
++ bfqg->active_entities = 0;
++}
++
++static inline void bfq_group_set_parent(struct bfq_group *bfqg,
++ struct bfq_group *parent)
++{
++ struct bfq_entity *entity;
++
++ BUG_ON(parent == NULL);
++ BUG_ON(bfqg == NULL);
++
++ entity = &bfqg->entity;
++ entity->parent = parent->my_entity;
++ entity->sched_data = &parent->sched_data;
++}
++
++/**
++ * bfq_group_chain_alloc - allocate a chain of groups.
++ * @bfqd: queue descriptor.
++ * @css: the leaf cgroup_subsys_state this chain starts from.
++ *
++ * Allocate a chain of groups starting from the one belonging to
++ * @cgroup up to the root cgroup. Stop if a cgroup on the chain
++ * to the root already has an allocated group on @bfqd.
++ */
++static struct bfq_group *bfq_group_chain_alloc(struct bfq_data *bfqd,
++ struct cgroup_subsys_state *css)
++{
++ struct bfqio_cgroup *bgrp;
++ struct bfq_group *bfqg, *prev = NULL, *leaf = NULL;
++
++ for (; css != NULL; css = css->parent) {
++ bgrp = css_to_bfqio(css);
++
++ bfqg = bfqio_lookup_group(bgrp, bfqd);
++ if (bfqg != NULL) {
++ /*
++ * All the cgroups in the path from there to the
++ * root must have a bfq_group for bfqd, so we don't
++ * need any more allocations.
++ */
++ break;
++ }
++
++ bfqg = kzalloc(sizeof(*bfqg), GFP_ATOMIC);
++ if (bfqg == NULL)
++ goto cleanup;
++
++ bfq_group_init_entity(bgrp, bfqg);
++ bfqg->my_entity = &bfqg->entity;
++
++ if (leaf == NULL) {
++ leaf = bfqg;
++ prev = leaf;
++ } else {
++ bfq_group_set_parent(prev, bfqg);
++ /*
++ * Build a list of allocated nodes using the bfqd
++ * field, which is still unused and will be
++ * initialized only after the node is
++ * connected.
++ */
++ prev->bfqd = bfqg;
++ prev = bfqg;
++ }
++ }
++
++ return leaf;
++
++cleanup:
++ while (leaf != NULL) {
++ prev = leaf;
++ leaf = leaf->bfqd;
++ kfree(prev);
++ }
++
++ return NULL;
++}
++
++/**
++ * bfq_group_chain_link - link an allocated group chain to a cgroup
++ * hierarchy.
++ * @bfqd: the queue descriptor.
++ * @css: the leaf cgroup_subsys_state to start from.
++ * @leaf: the leaf group (to be associated to @cgroup).
++ *
++ * Try to link a chain of groups to a cgroup hierarchy, connecting the
++ * nodes bottom-up, so we can be sure that when we find a cgroup in the
++ * hierarchy that already has a group associated to @bfqd, all the nodes
++ * in the path to the root cgroup have one too.
++ *
++ * On locking: the queue lock protects the hierarchy (there is a hierarchy
++ * per device) while the bfqio_cgroup lock protects the list of groups
++ * belonging to the same cgroup.
++ */
++static void bfq_group_chain_link(struct bfq_data *bfqd,
++ struct cgroup_subsys_state *css,
++ struct bfq_group *leaf)
++{
++ struct bfqio_cgroup *bgrp;
++ struct bfq_group *bfqg, *next, *prev = NULL;
++ unsigned long flags;
++
++ assert_spin_locked(bfqd->queue->queue_lock);
++
++ for (; css != NULL && leaf != NULL; css = css->parent) {
++ bgrp = css_to_bfqio(css);
++ next = leaf->bfqd;
++
++ bfqg = bfqio_lookup_group(bgrp, bfqd);
++ BUG_ON(bfqg != NULL);
++
++ spin_lock_irqsave(&bgrp->lock, flags);
++
++ rcu_assign_pointer(leaf->bfqd, bfqd);
++ hlist_add_head_rcu(&leaf->group_node, &bgrp->group_data);
++ hlist_add_head(&leaf->bfqd_node, &bfqd->group_list);
++
++ spin_unlock_irqrestore(&bgrp->lock, flags);
++
++ prev = leaf;
++ leaf = next;
++ }
++
++ BUG_ON(css == NULL && leaf != NULL);
++ if (css != NULL && prev != NULL) {
++ bgrp = css_to_bfqio(css);
++ bfqg = bfqio_lookup_group(bgrp, bfqd);
++ bfq_group_set_parent(prev, bfqg);
++ }
++}
++
++/**
++ * bfq_find_alloc_group - return the group associated to @bfqd in @cgroup.
++ * @bfqd: queue descriptor.
++ * @cgroup: cgroup being searched for.
++ *
++ * Return a group associated to @bfqd in @cgroup, allocating one if
++ * necessary. When a group is returned all the cgroups in the path
++ * to the root have a group associated to @bfqd.
++ *
++ * If the allocation fails, return the root group: this breaks guarantees
++ * but is a safe fallback. If this loss becomes a problem it can be
++ * mitigated using the equivalent weight (given by the product of the
++ * weights of the groups in the path from @group to the root) in the
++ * root scheduler.
++ *
++ * We allocate all the missing nodes in the path from the leaf cgroup
++ * to the root and we connect the nodes only after all the allocations
++ * have been successful.
++ */
++static struct bfq_group *bfq_find_alloc_group(struct bfq_data *bfqd,
++ struct cgroup_subsys_state *css)
++{
++ struct bfqio_cgroup *bgrp = css_to_bfqio(css);
++ struct bfq_group *bfqg;
++
++ bfqg = bfqio_lookup_group(bgrp, bfqd);
++ if (bfqg != NULL)
++ return bfqg;
++
++ bfqg = bfq_group_chain_alloc(bfqd, css);
++ if (bfqg != NULL)
++ bfq_group_chain_link(bfqd, css, bfqg);
++ else
++ bfqg = bfqd->root_group;
++
++ return bfqg;
++}
++
++/**
++ * bfq_bfqq_move - migrate @bfqq to @bfqg.
++ * @bfqd: queue descriptor.
++ * @bfqq: the queue to move.
++ * @entity: @bfqq's entity.
++ * @bfqg: the group to move to.
++ *
++ * Move @bfqq to @bfqg, deactivating it from its old group and reactivating
++ * it on the new one. Avoid putting the entity on the old group idle tree.
++ *
++ * Must be called under the queue lock; the cgroup owning @bfqg must
++ * not disappear (by now this just means that we are called under
++ * rcu_read_lock()).
++ */
++static void bfq_bfqq_move(struct bfq_data *bfqd, struct bfq_queue *bfqq,
++ struct bfq_entity *entity, struct bfq_group *bfqg)
++{
++ int busy, resume;
++
++ busy = bfq_bfqq_busy(bfqq);
++ resume = !RB_EMPTY_ROOT(&bfqq->sort_list);
++
++ BUG_ON(resume && !entity->on_st);
++ BUG_ON(busy && !resume && entity->on_st &&
++ bfqq != bfqd->in_service_queue);
++
++ if (busy) {
++ BUG_ON(atomic_read(&bfqq->ref) < 2);
++
++ if (!resume)
++ bfq_del_bfqq_busy(bfqd, bfqq, 0);
++ else
++ bfq_deactivate_bfqq(bfqd, bfqq, 0);
++ } else if (entity->on_st)
++ bfq_put_idle_entity(bfq_entity_service_tree(entity), entity);
++
++ /*
++ * Here we use a reference to bfqg. We don't need a refcounter
++ * as the cgroup reference will not be dropped, so that its
++ * destroy() callback will not be invoked.
++ */
++ entity->parent = bfqg->my_entity;
++ entity->sched_data = &bfqg->sched_data;
++
++ if (busy && resume)
++ bfq_activate_bfqq(bfqd, bfqq);
++
++ if (bfqd->in_service_queue == NULL && !bfqd->rq_in_driver)
++ bfq_schedule_dispatch(bfqd);
++}
++
++/**
++ * __bfq_bic_change_cgroup - move @bic to @cgroup.
++ * @bfqd: the queue descriptor.
++ * @bic: the bic to move.
++ * @cgroup: the cgroup to move to.
++ *
++ * Move bic to cgroup, assuming that bfqd->queue is locked; the caller
++ * has to make sure that the reference to cgroup is valid across the call.
++ *
++ * NOTE: an alternative approach might have been to store the current
++ * cgroup in bfqq and getting a reference to it, reducing the lookup
++ * time here, at the price of slightly more complex code.
++ */
++static struct bfq_group *__bfq_bic_change_cgroup(struct bfq_data *bfqd,
++ struct bfq_io_cq *bic,
++ struct cgroup_subsys_state *css)
++{
++ struct bfq_queue *async_bfqq = bic_to_bfqq(bic, 0);
++ struct bfq_queue *sync_bfqq = bic_to_bfqq(bic, 1);
++ struct bfq_entity *entity;
++ struct bfq_group *bfqg;
++ struct bfqio_cgroup *bgrp;
++
++ bgrp = css_to_bfqio(css);
++
++ bfqg = bfq_find_alloc_group(bfqd, css);
++ if (async_bfqq != NULL) {
++ entity = &async_bfqq->entity;
++
++ if (entity->sched_data != &bfqg->sched_data) {
++ bic_set_bfqq(bic, NULL, 0);
++ bfq_log_bfqq(bfqd, async_bfqq,
++ "bic_change_group: %p %d",
++ async_bfqq, atomic_read(&async_bfqq->ref));
++ bfq_put_queue(async_bfqq);
++ }
++ }
++
++ if (sync_bfqq != NULL) {
++ entity = &sync_bfqq->entity;
++ if (entity->sched_data != &bfqg->sched_data)
++ bfq_bfqq_move(bfqd, sync_bfqq, entity, bfqg);
++ }
++
++ return bfqg;
++}
++
++/**
++ * bfq_bic_change_cgroup - move @bic to @cgroup.
++ * @bic: the bic being migrated.
++ * @cgroup: the destination cgroup.
++ *
++ * When the task owning @bic is moved to @cgroup, @bic is immediately
++ * moved into its new parent group.
++ */
++static void bfq_bic_change_cgroup(struct bfq_io_cq *bic,
++ struct cgroup_subsys_state *css)
++{
++ struct bfq_data *bfqd;
++ unsigned long uninitialized_var(flags);
++
++ bfqd = bfq_get_bfqd_locked(&(bic->icq.q->elevator->elevator_data),
++ &flags);
++ if (bfqd != NULL) {
++ __bfq_bic_change_cgroup(bfqd, bic, css);
++ bfq_put_bfqd_unlock(bfqd, &flags);
++ }
++}
++
++/**
++ * bfq_bic_update_cgroup - update the cgroup of @bic.
++ * @bic: the @bic to update.
++ *
++ * Make sure that @bic is enqueued in the cgroup of the current task.
++ * We need this in addition to moving bics during the cgroup attach
++ * phase because the task owning @bic could be at its first disk
++ * access or we may end up in the root cgroup as the result of a
++ * memory allocation failure and here we try to move to the right
++ * group.
++ *
++ * Must be called under the queue lock. It is safe to use the returned
++ * value even after the rcu_read_unlock() as the migration/destruction
++ * paths act under the queue lock too. IOW it is impossible to race with
++ * group migration/destruction and end up with an invalid group as:
++ * a) here cgroup has not yet been destroyed, nor its destroy callback
++ * has started execution, as current holds a reference to it,
++ * b) if it is destroyed after rcu_read_unlock() [after current is
++ * migrated to a different cgroup] its attach() callback will have
++ * taken care of removing all the references to the old cgroup data.
++ */
++static struct bfq_group *bfq_bic_update_cgroup(struct bfq_io_cq *bic)
++{
++ struct bfq_data *bfqd = bic_to_bfqd(bic);
++ struct bfq_group *bfqg;
++ struct cgroup_subsys_state *css;
++
++ BUG_ON(bfqd == NULL);
++
++ rcu_read_lock();
++ css = task_css(current, bfqio_cgrp_id);
++ bfqg = __bfq_bic_change_cgroup(bfqd, bic, css);
++ rcu_read_unlock();
++
++ return bfqg;
++}
++
++/**
++ * bfq_flush_idle_tree - deactivate any entity on the idle tree of @st.
++ * @st: the service tree being flushed.
++ */
++static inline void bfq_flush_idle_tree(struct bfq_service_tree *st)
++{
++ struct bfq_entity *entity = st->first_idle;
++
++ for (; entity != NULL; entity = st->first_idle)
++ __bfq_deactivate_entity(entity, 0);
++}
++
++/**
++ * bfq_reparent_leaf_entity - move leaf entity to the root_group.
++ * @bfqd: the device data structure with the root group.
++ * @entity: the entity to move.
++ */
++static inline void bfq_reparent_leaf_entity(struct bfq_data *bfqd,
++ struct bfq_entity *entity)
++{
++ struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
++
++ BUG_ON(bfqq == NULL);
++ bfq_bfqq_move(bfqd, bfqq, entity, bfqd->root_group);
++ return;
++}
++
++/**
++ * bfq_reparent_active_entities - move all active entities to the
++ * root group.
++ * @bfqd: the device data structure with the root group.
++ * @bfqg: the group to move from.
++ * @st: the service tree with the entities.
++ *
++ * Needs queue_lock to be taken and reference to be valid over the call.
++ */
++static inline void bfq_reparent_active_entities(struct bfq_data *bfqd,
++ struct bfq_group *bfqg,
++ struct bfq_service_tree *st)
++{
++ struct rb_root *active = &st->active;
++ struct bfq_entity *entity = NULL;
++
++ if (!RB_EMPTY_ROOT(&st->active))
++ entity = bfq_entity_of(rb_first(active));
++
++ for (; entity != NULL; entity = bfq_entity_of(rb_first(active)))
++ bfq_reparent_leaf_entity(bfqd, entity);
++
++ if (bfqg->sched_data.in_service_entity != NULL)
++ bfq_reparent_leaf_entity(bfqd,
++ bfqg->sched_data.in_service_entity);
++
++ return;
++}
++
++/**
++ * bfq_destroy_group - destroy @bfqg.
++ * @bgrp: the bfqio_cgroup containing @bfqg.
++ * @bfqg: the group being destroyed.
++ *
++ * Destroy @bfqg, making sure that it is not referenced from its parent.
++ */
++static void bfq_destroy_group(struct bfqio_cgroup *bgrp, struct bfq_group *bfqg)
++{
++ struct bfq_data *bfqd;
++ struct bfq_service_tree *st;
++ struct bfq_entity *entity = bfqg->my_entity;
++ unsigned long uninitialized_var(flags);
++ int i;
++
++ hlist_del(&bfqg->group_node);
++
++ /*
++ * Empty all service_trees belonging to this group before
++ * deactivating the group itself.
++ */
++ for (i = 0; i < BFQ_IOPRIO_CLASSES; i++) {
++ st = bfqg->sched_data.service_tree + i;
++
++ /*
++ * The idle tree may still contain bfq_queues belonging
++ * to exited tasks because they never migrated to a different
++ * cgroup from the one being destroyed now. No one else
++ * can access them so it's safe to act without any lock.
++ */
++ bfq_flush_idle_tree(st);
++
++ /*
++ * It may happen that some queues are still active
++ * (busy) upon group destruction (if the corresponding
++ * processes have been forced to terminate). We move
++ * all the leaf entities corresponding to these queues
++ * to the root_group.
++ * Also, it may happen that the group has an entity
++ * in service, which is disconnected from the active
++ * tree: it must be moved, too.
++ * There is no need to put the sync queues, as the
++ * scheduler has taken no reference.
++ */
++ bfqd = bfq_get_bfqd_locked(&bfqg->bfqd, &flags);
++ if (bfqd != NULL) {
++ bfq_reparent_active_entities(bfqd, bfqg, st);
++ bfq_put_bfqd_unlock(bfqd, &flags);
++ }
++ BUG_ON(!RB_EMPTY_ROOT(&st->active));
++ BUG_ON(!RB_EMPTY_ROOT(&st->idle));
++ }
++ BUG_ON(bfqg->sched_data.next_in_service != NULL);
++ BUG_ON(bfqg->sched_data.in_service_entity != NULL);
++
++ /*
++ * We may race with device destruction, take extra care when
++ * dereferencing bfqg->bfqd.
++ */
++ bfqd = bfq_get_bfqd_locked(&bfqg->bfqd, &flags);
++ if (bfqd != NULL) {
++ hlist_del(&bfqg->bfqd_node);
++ __bfq_deactivate_entity(entity, 0);
++ bfq_put_async_queues(bfqd, bfqg);
++ bfq_put_bfqd_unlock(bfqd, &flags);
++ }
++ BUG_ON(entity->tree != NULL);
++
++ /*
++ * No need to defer the kfree() to the end of the RCU grace
++ * period: we are called from the destroy() callback of our
++ * cgroup, so we can be sure that no one is a) still using
++ * this cgroup or b) doing lookups in it.
++ */
++ kfree(bfqg);
++}
++
++static void bfq_end_wr_async(struct bfq_data *bfqd)
++{
++ struct hlist_node *tmp;
++ struct bfq_group *bfqg;
++
++ hlist_for_each_entry_safe(bfqg, tmp, &bfqd->group_list, bfqd_node)
++ bfq_end_wr_async_queues(bfqd, bfqg);
++ bfq_end_wr_async_queues(bfqd, bfqd->root_group);
++}
++
++/**
++ * bfq_disconnect_groups - disconnect @bfqd from all its groups.
++ * @bfqd: the device descriptor being exited.
++ *
++ * When the device exits we just make sure that no lookup can return
++ * the now unused group structures. They will be deallocated on cgroup
++ * destruction.
++ */
++static void bfq_disconnect_groups(struct bfq_data *bfqd)
++{
++ struct hlist_node *tmp;
++ struct bfq_group *bfqg;
++
++ bfq_log(bfqd, "disconnect_groups beginning");
++ hlist_for_each_entry_safe(bfqg, tmp, &bfqd->group_list, bfqd_node) {
++ hlist_del(&bfqg->bfqd_node);
++
++ __bfq_deactivate_entity(bfqg->my_entity, 0);
++
++ /*
++ * Don't remove from the group hash, just set an
++ * invalid key. No lookups can race with the
++ * assignment as bfqd is being destroyed; this
++ * implies also that new elements cannot be added
++ * to the list.
++ */
++ rcu_assign_pointer(bfqg->bfqd, NULL);
++
++ bfq_log(bfqd, "disconnect_groups: put async for group %p",
++ bfqg);
++ bfq_put_async_queues(bfqd, bfqg);
++ }
++}
++
++static inline void bfq_free_root_group(struct bfq_data *bfqd)
++{
++ struct bfqio_cgroup *bgrp = &bfqio_root_cgroup;
++ struct bfq_group *bfqg = bfqd->root_group;
++
++ bfq_put_async_queues(bfqd, bfqg);
++
++ spin_lock_irq(&bgrp->lock);
++ hlist_del_rcu(&bfqg->group_node);
++ spin_unlock_irq(&bgrp->lock);
++
++ /*
++ * No need to synchronize_rcu() here: since the device is gone
++ * there cannot be any read-side access to its root_group.
++ */
++ kfree(bfqg);
++}
++
++static struct bfq_group *bfq_alloc_root_group(struct bfq_data *bfqd, int node)
++{
++ struct bfq_group *bfqg;
++ struct bfqio_cgroup *bgrp;
++ int i;
++
++ bfqg = kzalloc_node(sizeof(*bfqg), GFP_KERNEL, node);
++ if (bfqg == NULL)
++ return NULL;
++
++ bfqg->entity.parent = NULL;
++ for (i = 0; i < BFQ_IOPRIO_CLASSES; i++)
++ bfqg->sched_data.service_tree[i] = BFQ_SERVICE_TREE_INIT;
++
++ bgrp = &bfqio_root_cgroup;
++ spin_lock_irq(&bgrp->lock);
++ rcu_assign_pointer(bfqg->bfqd, bfqd);
++ hlist_add_head_rcu(&bfqg->group_node, &bgrp->group_data);
++ spin_unlock_irq(&bgrp->lock);
++
++ return bfqg;
++}
++
++#define SHOW_FUNCTION(__VAR) \
++static u64 bfqio_cgroup_##__VAR##_read(struct cgroup_subsys_state *css, \
++ struct cftype *cftype) \
++{ \
++ struct bfqio_cgroup *bgrp = css_to_bfqio(css); \
++ u64 ret = -ENODEV; \
++ \
++ mutex_lock(&bfqio_mutex); \
++ if (bfqio_is_removed(bgrp)) \
++ goto out_unlock; \
++ \
++ spin_lock_irq(&bgrp->lock); \
++ ret = bgrp->__VAR; \
++ spin_unlock_irq(&bgrp->lock); \
++ \
++out_unlock: \
++ mutex_unlock(&bfqio_mutex); \
++ return ret; \
++}
++
++SHOW_FUNCTION(weight);
++SHOW_FUNCTION(ioprio);
++SHOW_FUNCTION(ioprio_class);
++#undef SHOW_FUNCTION
++
++#define STORE_FUNCTION(__VAR, __MIN, __MAX) \
++static int bfqio_cgroup_##__VAR##_write(struct cgroup_subsys_state *css,\
++ struct cftype *cftype, \
++ u64 val) \
++{ \
++ struct bfqio_cgroup *bgrp = css_to_bfqio(css); \
++ struct bfq_group *bfqg; \
++ int ret = -EINVAL; \
++ \
++ if (val < (__MIN) || val > (__MAX)) \
++ return ret; \
++ \
++ ret = -ENODEV; \
++ mutex_lock(&bfqio_mutex); \
++ if (bfqio_is_removed(bgrp)) \
++ goto out_unlock; \
++ ret = 0; \
++ \
++ spin_lock_irq(&bgrp->lock); \
++ bgrp->__VAR = (unsigned short)val; \
++ hlist_for_each_entry(bfqg, &bgrp->group_data, group_node) { \
++ /* \
++ * Setting the ioprio_changed flag of the entity \
++ * to 1 with new_##__VAR == ##__VAR would re-set \
++ * the value of the weight to its ioprio mapping. \
++ * Set the flag only if necessary. \
++ */ \
++ if ((unsigned short)val != bfqg->entity.new_##__VAR) { \
++ bfqg->entity.new_##__VAR = (unsigned short)val; \
++ /* \
++ * Make sure that the above new value has been \
++ * stored in bfqg->entity.new_##__VAR before \
++ * setting the ioprio_changed flag. In fact, \
++ * this flag may be read asynchronously (in \
++ * critical sections protected by a different \
++ * lock than that held here), and finding this \
++ * flag set may cause the execution of the code \
++ * for updating parameters whose value may \
++ * depend also on bfqg->entity.new_##__VAR (in \
++ * __bfq_entity_update_weight_prio). \
++ * This barrier makes sure that the new value \
++ * of bfqg->entity.new_##__VAR is correctly \
++ * seen in that code. \
++ */ \
++ smp_wmb(); \
++ bfqg->entity.ioprio_changed = 1; \
++ } \
++ } \
++ spin_unlock_irq(&bgrp->lock); \
++ \
++out_unlock: \
++ mutex_unlock(&bfqio_mutex); \
++ return ret; \
++}
++
++STORE_FUNCTION(weight, BFQ_MIN_WEIGHT, BFQ_MAX_WEIGHT);
++STORE_FUNCTION(ioprio, 0, IOPRIO_BE_NR - 1);
++STORE_FUNCTION(ioprio_class, IOPRIO_CLASS_RT, IOPRIO_CLASS_IDLE);
++#undef STORE_FUNCTION
++
++static struct cftype bfqio_files[] = {
++ {
++ .name = "weight",
++ .read_u64 = bfqio_cgroup_weight_read,
++ .write_u64 = bfqio_cgroup_weight_write,
++ },
++ {
++ .name = "ioprio",
++ .read_u64 = bfqio_cgroup_ioprio_read,
++ .write_u64 = bfqio_cgroup_ioprio_write,
++ },
++ {
++ .name = "ioprio_class",
++ .read_u64 = bfqio_cgroup_ioprio_class_read,
++ .write_u64 = bfqio_cgroup_ioprio_class_write,
++ },
++ { }, /* terminate */
++};
++
++static struct cgroup_subsys_state *bfqio_create(struct cgroup_subsys_state
++ *parent_css)
++{
++ struct bfqio_cgroup *bgrp;
++
++ if (parent_css != NULL) {
++ bgrp = kzalloc(sizeof(*bgrp), GFP_KERNEL);
++ if (bgrp == NULL)
++ return ERR_PTR(-ENOMEM);
++ } else
++ bgrp = &bfqio_root_cgroup;
++
++ spin_lock_init(&bgrp->lock);
++ INIT_HLIST_HEAD(&bgrp->group_data);
++ bgrp->ioprio = BFQ_DEFAULT_GRP_IOPRIO;
++ bgrp->ioprio_class = BFQ_DEFAULT_GRP_CLASS;
++
++ return &bgrp->css;
++}
++
++/*
++ * We cannot support shared io contexts, as we have no means to support
++ * two tasks with the same ioc in two different groups without major rework
++ * of the main bic/bfqq data structures. For now, we allow a task to change
++ * its cgroup only if it is the only owner of its ioc; the drawback of this
++ * behavior is that a group containing a task that forked using CLONE_IO
++ * will not be destroyed until the tasks sharing the ioc die.
++ */
++static int bfqio_can_attach(struct cgroup_subsys_state *css,
++ struct cgroup_taskset *tset)
++{
++ struct task_struct *task;
++ struct io_context *ioc;
++ int ret = 0;
++
++ cgroup_taskset_for_each(task, tset) {
++ /*
++ * task_lock() is needed to avoid races with
++ * exit_io_context()
++ */
++ task_lock(task);
++ ioc = task->io_context;
++ if (ioc != NULL && atomic_read(&ioc->nr_tasks) > 1)
++ /*
++ * ioc == NULL means that the task is either too
++ * young or exiting: if it still has no ioc, the
++ * ioc can't be shared; if the task is exiting,
++ * the attach will fail anyway, no matter what
++ * we return here.
++ */
++ ret = -EINVAL;
++ task_unlock(task);
++ if (ret)
++ break;
++ }
++
++ return ret;
++}
++
++static void bfqio_attach(struct cgroup_subsys_state *css,
++ struct cgroup_taskset *tset)
++{
++ struct task_struct *task;
++ struct io_context *ioc;
++ struct io_cq *icq;
++
++ /*
++ * IMPORTANT NOTE: The move of more than one process at a time to a
++ * new group has not yet been tested.
++ */
++ cgroup_taskset_for_each(task, tset) {
++ ioc = get_task_io_context(task, GFP_ATOMIC, NUMA_NO_NODE);
++ if (ioc) {
++ /*
++ * Handle cgroup change here.
++ */
++ rcu_read_lock();
++ hlist_for_each_entry_rcu(icq, &ioc->icq_list, ioc_node)
++ if (!strncmp(
++ icq->q->elevator->type->elevator_name,
++ "bfq", ELV_NAME_MAX))
++ bfq_bic_change_cgroup(icq_to_bic(icq),
++ css);
++ rcu_read_unlock();
++ put_io_context(ioc);
++ }
++ }
++}
++
++static void bfqio_destroy(struct cgroup_subsys_state *css)
++{
++ struct bfqio_cgroup *bgrp = css_to_bfqio(css);
++ struct hlist_node *tmp;
++ struct bfq_group *bfqg;
++
++ /*
++ * Since we are destroying the cgroup, there are no more tasks
++ * referencing it, and all the RCU grace periods that may have
++ * referenced it are ended (as the destruction of the parent
++ * cgroup is RCU-safe); bgrp->group_data will not be accessed by
++ * anything else and we don't need any synchronization.
++ */
++ hlist_for_each_entry_safe(bfqg, tmp, &bgrp->group_data, group_node)
++ bfq_destroy_group(bgrp, bfqg);
++
++ BUG_ON(!hlist_empty(&bgrp->group_data));
++
++ kfree(bgrp);
++}
++
++static int bfqio_css_online(struct cgroup_subsys_state *css)
++{
++ struct bfqio_cgroup *bgrp = css_to_bfqio(css);
++
++ mutex_lock(&bfqio_mutex);
++ bgrp->online = true;
++ mutex_unlock(&bfqio_mutex);
++
++ return 0;
++}
++
++static void bfqio_css_offline(struct cgroup_subsys_state *css)
++{
++ struct bfqio_cgroup *bgrp = css_to_bfqio(css);
++
++ mutex_lock(&bfqio_mutex);
++ bgrp->online = false;
++ mutex_unlock(&bfqio_mutex);
++}
++
++struct cgroup_subsys bfqio_cgrp_subsys = {
++ .css_alloc = bfqio_create,
++ .css_online = bfqio_css_online,
++ .css_offline = bfqio_css_offline,
++ .can_attach = bfqio_can_attach,
++ .attach = bfqio_attach,
++ .css_free = bfqio_destroy,
++ .base_cftypes = bfqio_files,
++};
++#else
++static inline void bfq_init_entity(struct bfq_entity *entity,
++ struct bfq_group *bfqg)
++{
++ entity->weight = entity->new_weight;
++ entity->orig_weight = entity->new_weight;
++ entity->ioprio = entity->new_ioprio;
++ entity->ioprio_class = entity->new_ioprio_class;
++ entity->sched_data = &bfqg->sched_data;
++}
++
++static inline struct bfq_group *
++bfq_bic_update_cgroup(struct bfq_io_cq *bic)
++{
++ struct bfq_data *bfqd = bic_to_bfqd(bic);
++ return bfqd->root_group;
++}
++
++static inline void bfq_bfqq_move(struct bfq_data *bfqd,
++ struct bfq_queue *bfqq,
++ struct bfq_entity *entity,
++ struct bfq_group *bfqg)
++{
++}
++
++static void bfq_end_wr_async(struct bfq_data *bfqd)
++{
++ bfq_end_wr_async_queues(bfqd, bfqd->root_group);
++}
++
++static inline void bfq_disconnect_groups(struct bfq_data *bfqd)
++{
++ bfq_put_async_queues(bfqd, bfqd->root_group);
++}
++
++static inline void bfq_free_root_group(struct bfq_data *bfqd)
++{
++ kfree(bfqd->root_group);
++}
++
++static struct bfq_group *bfq_alloc_root_group(struct bfq_data *bfqd, int node)
++{
++ struct bfq_group *bfqg;
++ int i;
++
++ bfqg = kmalloc_node(sizeof(*bfqg), GFP_KERNEL | __GFP_ZERO, node);
++ if (bfqg == NULL)
++ return NULL;
++
++ for (i = 0; i < BFQ_IOPRIO_CLASSES; i++)
++ bfqg->sched_data.service_tree[i] = BFQ_SERVICE_TREE_INIT;
++
++ return bfqg;
++}
++#endif
+diff --git a/block/bfq-ioc.c b/block/bfq-ioc.c
+new file mode 100644
+index 0000000..7f6b000
+--- /dev/null
++++ b/block/bfq-ioc.c
+@@ -0,0 +1,36 @@
++/*
++ * BFQ: I/O context handling.
++ *
++ * Based on ideas and code from CFQ:
++ * Copyright (C) 2003 Jens Axboe <axboe@kernel.dk>
++ *
++ * Copyright (C) 2008 Fabio Checconi <fabio@gandalf.sssup.it>
++ * Paolo Valente <paolo.valente@unimore.it>
++ *
++ * Copyright (C) 2010 Paolo Valente <paolo.valente@unimore.it>
++ */
++
++/**
++ * icq_to_bic - convert iocontext queue structure to bfq_io_cq.
++ * @icq: the iocontext queue.
++ */
++static inline struct bfq_io_cq *icq_to_bic(struct io_cq *icq)
++{
++ /* bic->icq is the first member, %NULL will convert to %NULL */
++ return container_of(icq, struct bfq_io_cq, icq);
++}
++
++/**
++ * bfq_bic_lookup - search into @ioc a bic associated to @bfqd.
++ * @bfqd: the lookup key.
++ * @ioc: the io_context of the process doing I/O.
++ *
++ * Queue lock must be held.
++ */
++static inline struct bfq_io_cq *bfq_bic_lookup(struct bfq_data *bfqd,
++ struct io_context *ioc)
++{
++ if (ioc)
++ return icq_to_bic(ioc_lookup_icq(ioc, bfqd->queue));
++ return NULL;
++}
+diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
+new file mode 100644
+index 0000000..0a0891b
+--- /dev/null
++++ b/block/bfq-iosched.c
+@@ -0,0 +1,3617 @@
++/*
++ * Budget Fair Queueing (BFQ) disk scheduler.
++ *
++ * Based on ideas and code from CFQ:
++ * Copyright (C) 2003 Jens Axboe <axboe@kernel.dk>
++ *
++ * Copyright (C) 2008 Fabio Checconi <fabio@gandalf.sssup.it>
++ * Paolo Valente <paolo.valente@unimore.it>
++ *
++ * Copyright (C) 2010 Paolo Valente <paolo.valente@unimore.it>
++ *
++ * Licensed under the GPL-2 as detailed in the accompanying COPYING.BFQ
++ * file.
++ *
++ * BFQ is a proportional-share storage-I/O scheduling algorithm based on
++ * the slice-by-slice service scheme of CFQ. But BFQ assigns budgets,
++ * measured in number of sectors, to processes instead of time slices. The
++ * device is not granted to the in-service process for a given time slice,
++ * but until it has exhausted its assigned budget. This change from the time
++ * to the service domain allows BFQ to distribute the device throughput
++ * among processes as desired, without any distortion due to ZBR, workload
++ * fluctuations or other factors. BFQ uses an ad hoc internal scheduler,
++ * called B-WF2Q+, to schedule processes according to their budgets. More
++ * precisely, BFQ schedules queues associated to processes. Thanks to the
++ * accurate policy of B-WF2Q+, BFQ can afford to assign high budgets to
++ * I/O-bound processes issuing sequential requests (to boost the
++ * throughput), and yet guarantee a low latency to interactive and soft
++ * real-time applications.
++ *
++ * BFQ is described in [1], which also contains a reference to the
++ * initial, more theoretical paper on BFQ. The interested reader can find
++ * in the latter paper full details on the main algorithm, as well as
++ * formulas of the guarantees and formal proofs of all the properties.
++ * With respect to the version of BFQ presented in these papers, this
++ * implementation adds a few more heuristics, such as the one that
++ * guarantees a low latency to soft real-time applications, and a
++ * hierarchical extension based on H-WF2Q+.
++ *
++ * B-WF2Q+ is based on WF2Q+, that is described in [2], together with
++ * H-WF2Q+, while the augmented tree used to implement B-WF2Q+ with O(log N)
++ * complexity derives from the one introduced with EEVDF in [3].
++ *
++ * [1] P. Valente and M. Andreolini, ``Improving Application Responsiveness
++ * with the BFQ Disk I/O Scheduler'',
++ * Proceedings of the 5th Annual International Systems and Storage
++ * Conference (SYSTOR '12), June 2012.
++ *
++ * http://algogroup.unimo.it/people/paolo/disk_sched/bf1-v1-suite-results.pdf
++ *
++ * [2] Jon C.R. Bennett and H. Zhang, ``Hierarchical Packet Fair Queueing
++ * Algorithms,'' IEEE/ACM Transactions on Networking, 5(5):675-689,
++ * Oct 1997.
++ *
++ * http://www.cs.cmu.edu/~hzhang/papers/TON-97-Oct.ps.gz
++ *
++ * [3] I. Stoica and H. Abdel-Wahab, ``Earliest Eligible Virtual Deadline
++ * First: A Flexible and Accurate Mechanism for Proportional Share
++ * Resource Allocation,'' technical report.
++ *
++ * http://www.cs.berkeley.edu/~istoica/papers/eevdf-tr-95.pdf
++ */
++#include <linux/module.h>
++#include <linux/slab.h>
++#include <linux/blkdev.h>
++#include <linux/cgroup.h>
++#include <linux/elevator.h>
++#include <linux/jiffies.h>
++#include <linux/rbtree.h>
++#include <linux/ioprio.h>
++#include "bfq.h"
++#include "blk.h"
++
++/* Max number of dispatches in one round of service. */
++static const int bfq_quantum = 4;
++
++/* Expiration time of sync (0) and async (1) requests, in jiffies. */
++static const int bfq_fifo_expire[2] = { HZ / 4, HZ / 8 };
++
++/* Maximum backwards seek, in KiB. */
++static const int bfq_back_max = 16 * 1024;
++
++/* Penalty of a backwards seek, in number of sectors. */
++static const int bfq_back_penalty = 2;
++
++/* Idling period duration, in jiffies. */
++static int bfq_slice_idle = HZ / 125;
++
++/* Default maximum budget values, in sectors and number of requests. */
++static const int bfq_default_max_budget = 16 * 1024;
++static const int bfq_max_budget_async_rq = 4;
++
++/*
++ * Async to sync throughput distribution is controlled as follows:
++ * when an async request is served, the entity is charged the number
++ * of sectors of the request, multiplied by the factor below
++ */
++static const int bfq_async_charge_factor = 10;
++
++/* Default timeout values, in jiffies, approximating CFQ defaults. */
++static const int bfq_timeout_sync = HZ / 8;
++static int bfq_timeout_async = HZ / 25;
++
++struct kmem_cache *bfq_pool;
++
++/* Below this threshold (in ms), we consider thinktime immediate. */
++#define BFQ_MIN_TT 2
++
++/* hw_tag detection: parallel requests threshold and min samples needed. */
++#define BFQ_HW_QUEUE_THRESHOLD 4
++#define BFQ_HW_QUEUE_SAMPLES 32
++
++#define BFQQ_SEEK_THR (sector_t)(8 * 1024)
++#define BFQQ_SEEKY(bfqq) ((bfqq)->seek_mean > BFQQ_SEEK_THR)
++
++/* Min samples used for peak rate estimation (for autotuning). */
++#define BFQ_PEAK_RATE_SAMPLES 32
++
++/* Shift used for peak rate fixed precision calculations. */
++#define BFQ_RATE_SHIFT 16
++
++/*
++ * By default, BFQ computes the duration of the weight raising for
++ * interactive applications automatically, using the following formula:
++ * duration = (R / r) * T, where r is the peak rate of the device, and
++ * R and T are two reference parameters.
++ * In particular, R is the peak rate of the reference device (see below),
++ * and T is a reference time: given the systems that are likely to be
++ * installed on the reference device according to its speed class, T is
++ * about the maximum time needed, under BFQ and while reading two files in
++ * parallel, to load typical large applications on these systems.
++ * In practice, the slower/faster the device at hand is, the more/less it
++ * takes to load applications with respect to the reference device.
++ * Accordingly, the longer/shorter BFQ grants weight raising to interactive
++ * applications.
++ *
++ * BFQ uses four different reference pairs (R, T), depending on:
++ * . whether the device is rotational or non-rotational;
++ * . whether the device is slow, such as old or portable HDDs, as well as
++ * SD cards, or fast, such as newer HDDs and SSDs.
++ *
++ * The device's speed class is dynamically (re)detected in
++ * bfq_update_peak_rate() every time the estimated peak rate is updated.
++ *
++ * In the following definitions, R_slow[0]/R_fast[0] and T_slow[0]/T_fast[0]
++ * are the reference values for a slow/fast rotational device, whereas
++ * R_slow[1]/R_fast[1] and T_slow[1]/T_fast[1] are the reference values for
++ * a slow/fast non-rotational device. Finally, device_speed_thresh are the
++ * thresholds used to switch between speed classes.
++ * Both the reference peak rates and the thresholds are measured in
++ * sectors/usec, left-shifted by BFQ_RATE_SHIFT.
++ */
++static int R_slow[2] = {1536, 10752};
++static int R_fast[2] = {17415, 34791};
++/*
++ * To improve readability, a conversion function is used to initialize the
++ * following arrays, which entails that they can be initialized only in a
++ * function.
++ */
++static int T_slow[2];
++static int T_fast[2];
++static int device_speed_thresh[2];
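The weight-raising rule described in the comment above, duration = (R / r) * T, can be modeled in a few lines of user-space C. This is only an illustrative sketch of the arithmetic, not the kernel's implementation; the function name and the sample values are hypothetical.

```c
#include <assert.h>

/*
 * Illustrative model of the weight-raising duration formula:
 * duration = (R / r) * T, where r is the device's measured peak rate
 * and (R, T) are the reference pair for its speed class. Computed as
 * (R * T) / r to stay in integer arithmetic.
 */
static unsigned long wr_duration(unsigned long r_peak,
				 unsigned long R_ref,
				 unsigned long T_ref)
{
	/* A slower device (smaller r_peak) gets a longer raising period. */
	return (R_ref * T_ref) / r_peak;
}
```

A device running at half the reference rate is weight-raised for twice the reference time, matching the "slower device, longer raising" behavior the comment describes.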
++
++#define BFQ_SERVICE_TREE_INIT ((struct bfq_service_tree) \
++ { RB_ROOT, RB_ROOT, NULL, NULL, 0, 0 })
++
++#define RQ_BIC(rq) ((struct bfq_io_cq *) (rq)->elv.priv[0])
++#define RQ_BFQQ(rq) ((rq)->elv.priv[1])
++
++static inline void bfq_schedule_dispatch(struct bfq_data *bfqd);
++
++#include "bfq-ioc.c"
++#include "bfq-sched.c"
++#include "bfq-cgroup.c"
++
++#define bfq_class_idle(bfqq) ((bfqq)->entity.ioprio_class ==\
++ IOPRIO_CLASS_IDLE)
++#define bfq_class_rt(bfqq) ((bfqq)->entity.ioprio_class ==\
++ IOPRIO_CLASS_RT)
++
++#define bfq_sample_valid(samples) ((samples) > 80)
++
++/*
++ * We regard a request as SYNC if it is either a read or has the SYNC
++ * bit set (in which case it could also be a direct WRITE).
++ */
++static inline int bfq_bio_sync(struct bio *bio)
++{
++ if (bio_data_dir(bio) == READ || (bio->bi_rw & REQ_SYNC))
++ return 1;
++
++ return 0;
++}
++
++/*
++ * Schedule a run of the queue if there are requests pending and no one
++ * in the driver will restart queueing.
++ */
++static inline void bfq_schedule_dispatch(struct bfq_data *bfqd)
++{
++ if (bfqd->queued != 0) {
++ bfq_log(bfqd, "schedule dispatch");
++ kblockd_schedule_work(&bfqd->unplug_work);
++ }
++}
++
++/*
++ * Lifted from AS - choose which of rq1 and rq2 is best served now.
++ * We choose the request that is closest to the head right now. Distance
++ * behind the head is penalized and only allowed to a certain extent.
++ */
++static struct request *bfq_choose_req(struct bfq_data *bfqd,
++ struct request *rq1,
++ struct request *rq2,
++ sector_t last)
++{
++ sector_t s1, s2, d1 = 0, d2 = 0;
++ unsigned long back_max;
++#define BFQ_RQ1_WRAP 0x01 /* request 1 wraps */
++#define BFQ_RQ2_WRAP 0x02 /* request 2 wraps */
++ unsigned wrap = 0; /* bit mask: requests behind the disk head? */
++
++ if (rq1 == NULL || rq1 == rq2)
++ return rq2;
++ if (rq2 == NULL)
++ return rq1;
++
++ if (rq_is_sync(rq1) && !rq_is_sync(rq2))
++ return rq1;
++ else if (rq_is_sync(rq2) && !rq_is_sync(rq1))
++ return rq2;
++ if ((rq1->cmd_flags & REQ_META) && !(rq2->cmd_flags & REQ_META))
++ return rq1;
++ else if ((rq2->cmd_flags & REQ_META) && !(rq1->cmd_flags & REQ_META))
++ return rq2;
++
++ s1 = blk_rq_pos(rq1);
++ s2 = blk_rq_pos(rq2);
++
++ /*
++ * By definition, 1KiB is 2 sectors.
++ */
++ back_max = bfqd->bfq_back_max * 2;
++
++ /*
++ * Strict one way elevator _except_ in the case where we allow
++ * short backward seeks which are biased as twice the cost of a
++ * similar forward seek.
++ */
++ if (s1 >= last)
++ d1 = s1 - last;
++ else if (s1 + back_max >= last)
++ d1 = (last - s1) * bfqd->bfq_back_penalty;
++ else
++ wrap |= BFQ_RQ1_WRAP;
++
++ if (s2 >= last)
++ d2 = s2 - last;
++ else if (s2 + back_max >= last)
++ d2 = (last - s2) * bfqd->bfq_back_penalty;
++ else
++ wrap |= BFQ_RQ2_WRAP;
++
++ /* Found required data */
++
++ /*
++ * By doing switch() on the bit mask "wrap" we avoid having to
++ * check two variables for all permutations: --> faster!
++ */
++ switch (wrap) {
++ case 0: /* common case for CFQ: rq1 and rq2 not wrapped */
++ if (d1 < d2)
++ return rq1;
++ else if (d2 < d1)
++ return rq2;
++ else {
++ if (s1 >= s2)
++ return rq1;
++ else
++ return rq2;
++ }
++
++ case BFQ_RQ2_WRAP:
++ return rq1;
++ case BFQ_RQ1_WRAP:
++ return rq2;
++ case (BFQ_RQ1_WRAP|BFQ_RQ2_WRAP): /* both rqs wrapped */
++ default:
++ /*
++ * Since both rqs are wrapped,
++ * start with the one that's further behind head
++ * (--> only *one* back seek required),
++ * since back seek takes more time than forward.
++ */
++ if (s1 <= s2)
++ return rq1;
++ else
++ return rq2;
++ }
++}
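The distance rule that bfq_choose_req() applies to each request can be isolated into a small user-space sketch: a request in front of the head costs its forward distance, a request at most back_max behind the head costs the backward distance times the penalty, and anything further behind "wraps" and loses to any non-wrapped request. The helper below is illustrative only; WRAP stands in for the bitmask handling in the real code.

```c
#include <assert.h>

/* Sentinel for a request too far behind the head (a "wrapped" request). */
#define WRAP ((unsigned long)-1)

/*
 * Sketch of the per-request seek cost used above: forward seeks cost
 * their distance, short backward seeks are penalized, long backward
 * seeks wrap.
 */
static unsigned long seek_cost(unsigned long s, unsigned long head,
			       unsigned long back_max,
			       unsigned long penalty)
{
	if (s >= head)
		return s - head;		/* forward seek */
	if (s + back_max >= head)
		return (head - s) * penalty;	/* short, penalized back seek */
	return WRAP;				/* too far behind the head */
}
```

With penalty = 2, a request 5 sectors behind the head costs the same as one 10 sectors ahead, which is exactly the "biased as twice the cost of a similar forward seek" behavior described above.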
++
++static struct bfq_queue *
++bfq_rq_pos_tree_lookup(struct bfq_data *bfqd, struct rb_root *root,
++ sector_t sector, struct rb_node **ret_parent,
++ struct rb_node ***rb_link)
++{
++ struct rb_node **p, *parent;
++ struct bfq_queue *bfqq = NULL;
++
++ parent = NULL;
++ p = &root->rb_node;
++ while (*p) {
++ struct rb_node **n;
++
++ parent = *p;
++ bfqq = rb_entry(parent, struct bfq_queue, pos_node);
++
++ /*
++ * Sort strictly based on sector. Smallest to the left,
++ * largest to the right.
++ */
++ if (sector > blk_rq_pos(bfqq->next_rq))
++ n = &(*p)->rb_right;
++ else if (sector < blk_rq_pos(bfqq->next_rq))
++ n = &(*p)->rb_left;
++ else
++ break;
++ p = n;
++ bfqq = NULL;
++ }
++
++ *ret_parent = parent;
++ if (rb_link)
++ *rb_link = p;
++
++ bfq_log(bfqd, "rq_pos_tree_lookup %llu: returning %d",
++ (unsigned long long)sector,
++ bfqq != NULL ? bfqq->pid : 0);
++
++ return bfqq;
++}
++
++static void bfq_rq_pos_tree_add(struct bfq_data *bfqd, struct bfq_queue *bfqq)
++{
++ struct rb_node **p, *parent;
++ struct bfq_queue *__bfqq;
++
++ if (bfqq->pos_root != NULL) {
++ rb_erase(&bfqq->pos_node, bfqq->pos_root);
++ bfqq->pos_root = NULL;
++ }
++
++ if (bfq_class_idle(bfqq))
++ return;
++ if (!bfqq->next_rq)
++ return;
++
++ bfqq->pos_root = &bfqd->rq_pos_tree;
++ __bfqq = bfq_rq_pos_tree_lookup(bfqd, bfqq->pos_root,
++ blk_rq_pos(bfqq->next_rq), &parent, &p);
++ if (__bfqq == NULL) {
++ rb_link_node(&bfqq->pos_node, parent, p);
++ rb_insert_color(&bfqq->pos_node, bfqq->pos_root);
++ } else
++ bfqq->pos_root = NULL;
++}
++
++/*
++ * Tell whether there are active queues or groups with differentiated weights.
++ */
++static inline bool bfq_differentiated_weights(struct bfq_data *bfqd)
++{
++ BUG_ON(!bfqd->hw_tag);
++ /*
++ * For weights to differ, at least one of the trees must contain
++ * at least two nodes.
++ */
++ return (!RB_EMPTY_ROOT(&bfqd->queue_weights_tree) &&
++ (bfqd->queue_weights_tree.rb_node->rb_left ||
++ bfqd->queue_weights_tree.rb_node->rb_right)
++#ifdef CONFIG_CGROUP_BFQIO
++ ) ||
++ (!RB_EMPTY_ROOT(&bfqd->group_weights_tree) &&
++ (bfqd->group_weights_tree.rb_node->rb_left ||
++ bfqd->group_weights_tree.rb_node->rb_right)
++#endif
++ );
++}
++
++/*
++ * If the weight-counter tree passed as input contains no counter for
++ * the weight of the input entity, then add that counter; otherwise just
++ * increment the existing counter.
++ *
++ * Note that weight-counter trees contain few nodes in mostly symmetric
++ * scenarios. For example, if all queues have the same weight, then the
++ * weight-counter tree for the queues may contain at most one node.
++ * This holds even if low_latency is on, because weight-raised queues
++ * are not inserted in the tree.
++ * In most scenarios, the rate at which nodes are created/destroyed
++ * should be low too.
++ */
++static void bfq_weights_tree_add(struct bfq_data *bfqd,
++ struct bfq_entity *entity,
++ struct rb_root *root)
++{
++ struct rb_node **new = &(root->rb_node), *parent = NULL;
++
++ /*
++ * Do not insert if:
++ * - the device does not support queueing;
++ * - the entity is already associated with a counter, which happens if:
++ * 1) the entity is associated with a queue, 2) a request arrival
++ * has caused the queue to become both non-weight-raised, and hence
++ * change its weight, and backlogged; in this respect, each
++ * of the two events causes an invocation of this function,
++ * 3) this is the invocation of this function caused by the second
++ * event. This second invocation is actually useless, and we handle
++ * this fact by exiting immediately. More efficient or clearer
++ * solutions might possibly be adopted.
++ */
++ if (!bfqd->hw_tag || entity->weight_counter)
++ return;
++
++ while (*new) {
++ struct bfq_weight_counter *__counter = container_of(*new,
++ struct bfq_weight_counter,
++ weights_node);
++ parent = *new;
++
++ if (entity->weight == __counter->weight) {
++ entity->weight_counter = __counter;
++ goto inc_counter;
++ }
++ if (entity->weight < __counter->weight)
++ new = &((*new)->rb_left);
++ else
++ new = &((*new)->rb_right);
++ }
++
++ entity->weight_counter = kzalloc(sizeof(struct bfq_weight_counter),
++ GFP_ATOMIC);
++ if (entity->weight_counter == NULL)
++ return;
++ entity->weight_counter->weight = entity->weight;
++ rb_link_node(&entity->weight_counter->weights_node, parent, new);
++ rb_insert_color(&entity->weight_counter->weights_node, root);
++
++inc_counter:
++ entity->weight_counter->num_active++;
++}
++
++/*
++ * Decrement the weight counter associated with the entity, and, if the
++ * counter reaches 0, remove the counter from the tree.
++ * See the comments to the function bfq_weights_tree_add() for considerations
++ * about overhead.
++ */
++static void bfq_weights_tree_remove(struct bfq_data *bfqd,
++ struct bfq_entity *entity,
++ struct rb_root *root)
++{
++ /*
++ * Check whether the entity is actually associated with a counter.
++ * In fact, the device may not be considered NCQ-capable for a while,
++ * which implies that no insertion in the weight trees is performed,
++ * after which the device may start to be deemed NCQ-capable, and hence
++ * this function may start to be invoked. This may cause the function
++ * to be invoked for entities that are not associated with any counter.
++ */
++ if (!entity->weight_counter)
++ return;
++
++ BUG_ON(RB_EMPTY_ROOT(root));
++ BUG_ON(entity->weight_counter->weight != entity->weight);
++
++ BUG_ON(!entity->weight_counter->num_active);
++ entity->weight_counter->num_active--;
++ if (entity->weight_counter->num_active > 0)
++ goto reset_entity_pointer;
++
++ rb_erase(&entity->weight_counter->weights_node, root);
++ kfree(entity->weight_counter);
++
++reset_entity_pointer:
++ entity->weight_counter = NULL;
++}
++
++static struct request *bfq_find_next_rq(struct bfq_data *bfqd,
++ struct bfq_queue *bfqq,
++ struct request *last)
++{
++ struct rb_node *rbnext = rb_next(&last->rb_node);
++ struct rb_node *rbprev = rb_prev(&last->rb_node);
++ struct request *next = NULL, *prev = NULL;
++
++ BUG_ON(RB_EMPTY_NODE(&last->rb_node));
++
++ if (rbprev != NULL)
++ prev = rb_entry_rq(rbprev);
++
++ if (rbnext != NULL)
++ next = rb_entry_rq(rbnext);
++ else {
++ rbnext = rb_first(&bfqq->sort_list);
++ if (rbnext && rbnext != &last->rb_node)
++ next = rb_entry_rq(rbnext);
++ }
++
++ return bfq_choose_req(bfqd, next, prev, blk_rq_pos(last));
++}
++
++/* see the definition of bfq_async_charge_factor for details */
++static inline unsigned long bfq_serv_to_charge(struct request *rq,
++ struct bfq_queue *bfqq)
++{
++ return blk_rq_sectors(rq) *
++ (1 + ((!bfq_bfqq_sync(bfqq)) * (bfqq->wr_coeff == 1) *
++ bfq_async_charge_factor));
++}
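As the comment above notes (see the definition of bfq_async_charge_factor), async queues whose weight is not being raised (wr_coeff == 1) are charged more service than the sectors they actually transfer. A minimal user-space sketch of that arithmetic, with illustrative names rather than the kernel's:

```c
#include <assert.h>

/* Illustrative sketch of the service charge: sync or weight-raised
 * queues are charged exactly the request size in sectors; async,
 * non-raised queues are charged (1 + factor) times as much. */
static unsigned long serv_to_charge(unsigned long sectors, int sync,
                                    unsigned int wr_coeff,
                                    unsigned int async_charge_factor)
{
    return sectors *
        (1 + ((!sync) * (wr_coeff == 1) * async_charge_factor));
}
```

With a factor of 10, an 8-sector async request from a non-raised queue is charged 88 sectors of service, pushing its timestamps forward and deprioritizing it.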
++
++/**
++ * bfq_updated_next_req - update the queue after a new next_rq selection.
++ * @bfqd: the device data the queue belongs to.
++ * @bfqq: the queue to update.
++ *
++ * If the first request of a queue changes we make sure that the queue
++ * has enough budget to serve at least its first request (if the
++ * request has grown). We do this because if the queue has not enough
++ * budget for its first request, it has to go through two dispatch
++ * rounds to actually get it dispatched.
++ */
++static void bfq_updated_next_req(struct bfq_data *bfqd,
++ struct bfq_queue *bfqq)
++{
++ struct bfq_entity *entity = &bfqq->entity;
++ struct bfq_service_tree *st = bfq_entity_service_tree(entity);
++ struct request *next_rq = bfqq->next_rq;
++ unsigned long new_budget;
++
++ if (next_rq == NULL)
++ return;
++
++ if (bfqq == bfqd->in_service_queue)
++ /*
++ * In order not to break guarantees, budgets cannot be
++ * changed after an entity has been selected.
++ */
++ return;
++
++ BUG_ON(entity->tree != &st->active);
++ BUG_ON(entity == entity->sched_data->in_service_entity);
++
++ new_budget = max_t(unsigned long, bfqq->max_budget,
++ bfq_serv_to_charge(next_rq, bfqq));
++ if (entity->budget != new_budget) {
++ entity->budget = new_budget;
++ bfq_log_bfqq(bfqd, bfqq, "updated next rq: new budget %lu",
++ new_budget);
++ bfq_activate_bfqq(bfqd, bfqq);
++ }
++}
++
++static inline unsigned int bfq_wr_duration(struct bfq_data *bfqd)
++{
++ u64 dur;
++
++ if (bfqd->bfq_wr_max_time > 0)
++ return bfqd->bfq_wr_max_time;
++
++ dur = bfqd->RT_prod;
++ do_div(dur, bfqd->peak_rate);
++
++ return dur;
++}
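The weight-raising duration computed above is either a user-fixed maximum or the precomputed R*T product scaled down by the estimated peak rate. A hedged user-space sketch, with illustrative names and the kernel's do_div() replaced by plain integer division:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative sketch of the weight-raising duration: honor a
 * user-configured maximum if set, otherwise derive the duration
 * from the rate*time product and the measured peak rate. */
static uint64_t wr_duration(uint64_t rt_prod, uint64_t peak_rate,
                            uint64_t user_max_time)
{
    if (user_max_time > 0)
        return user_max_time;
    return rt_prod / peak_rate;
}
```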
++
++static void bfq_add_request(struct request *rq)
++{
++ struct bfq_queue *bfqq = RQ_BFQQ(rq);
++ struct bfq_entity *entity = &bfqq->entity;
++ struct bfq_data *bfqd = bfqq->bfqd;
++ struct request *next_rq, *prev;
++ unsigned long old_wr_coeff = bfqq->wr_coeff;
++ int idle_for_long_time = 0;
++
++ bfq_log_bfqq(bfqd, bfqq, "add_request %d", rq_is_sync(rq));
++ bfqq->queued[rq_is_sync(rq)]++;
++ bfqd->queued++;
++
++ elv_rb_add(&bfqq->sort_list, rq);
++
++ /*
++ * Check if this request is a better next-serve candidate.
++ */
++ prev = bfqq->next_rq;
++ next_rq = bfq_choose_req(bfqd, bfqq->next_rq, rq, bfqd->last_position);
++ BUG_ON(next_rq == NULL);
++ bfqq->next_rq = next_rq;
++
++ /*
++ * Adjust priority tree position, if next_rq changes.
++ */
++ if (prev != bfqq->next_rq)
++ bfq_rq_pos_tree_add(bfqd, bfqq);
++
++ if (!bfq_bfqq_busy(bfqq)) {
++ int soft_rt = bfqd->bfq_wr_max_softrt_rate > 0 &&
++ time_is_before_jiffies(bfqq->soft_rt_next_start);
++ idle_for_long_time = time_is_before_jiffies(
++ bfqq->budget_timeout +
++ bfqd->bfq_wr_min_idle_time);
++ entity->budget = max_t(unsigned long, bfqq->max_budget,
++ bfq_serv_to_charge(next_rq, bfqq));
++
++ if (!bfq_bfqq_IO_bound(bfqq)) {
++ if (time_before(jiffies,
++ RQ_BIC(rq)->ttime.last_end_request +
++ bfqd->bfq_slice_idle)) {
++ bfqq->requests_within_timer++;
++ if (bfqq->requests_within_timer >=
++ bfqd->bfq_requests_within_timer)
++ bfq_mark_bfqq_IO_bound(bfqq);
++ } else
++ bfqq->requests_within_timer = 0;
++ }
++
++ if (!bfqd->low_latency)
++ goto add_bfqq_busy;
++
++ /*
++ * If the queue is not being boosted and has been idle
++ * for enough time, start a weight-raising period
++ */
++ if (old_wr_coeff == 1 && (idle_for_long_time || soft_rt)) {
++ bfqq->wr_coeff = bfqd->bfq_wr_coeff;
++ if (idle_for_long_time)
++ bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
++ else
++ bfqq->wr_cur_max_time =
++ bfqd->bfq_wr_rt_max_time;
++ bfq_log_bfqq(bfqd, bfqq,
++ "wrais starting at %lu, rais_max_time %u",
++ jiffies,
++ jiffies_to_msecs(bfqq->wr_cur_max_time));
++ } else if (old_wr_coeff > 1) {
++ if (idle_for_long_time)
++ bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
++ else if (bfqq->wr_cur_max_time ==
++ bfqd->bfq_wr_rt_max_time &&
++ !soft_rt) {
++ bfqq->wr_coeff = 1;
++ bfq_log_bfqq(bfqd, bfqq,
++ "wrais ending at %lu, rais_max_time %u",
++ jiffies,
++ jiffies_to_msecs(bfqq->
++ wr_cur_max_time));
++ } else if (time_before(
++ bfqq->last_wr_start_finish +
++ bfqq->wr_cur_max_time,
++ jiffies +
++ bfqd->bfq_wr_rt_max_time) &&
++ soft_rt) {
++ /*
++ * The remaining weight-raising time is lower
++ * than bfqd->bfq_wr_rt_max_time, which
++ * means that the application is enjoying
++ * weight raising either because deemed soft-
++ * rt in the near past, or because deemed
++ * interactive long ago. In both cases,
++ * resetting now the current remaining weight-
++ * raising time for the application to the
++ * weight-raising duration for soft rt
++ * applications would not cause any latency
++ * increase for the application (as the new
++ * duration would be higher than the remaining
++ * time).
++ *
++ * In addition, the application is now meeting
++ * the requirements for being deemed soft rt.
++ * In the end we can correctly and safely
++ * (re)charge the weight-raising duration for
++ * the application with the weight-raising
++ * duration for soft rt applications.
++ *
++ * In particular, doing this recharge now, i.e.,
++ * before the weight-raising period for the
++ * application finishes, reduces the probability
++ * of the following negative scenario:
++ * 1) the weight of a soft rt application is
++ * raised at startup (as for any newly
++ * created application),
++ * 2) since the application is not interactive,
++ * at a certain time weight-raising is
++ * stopped for the application,
++ * 3) at that time the application happens to
++ * still have pending requests, and hence
++ * is destined to not have a chance to be
++ * deemed soft rt before these requests are
++ * completed (see the comments to the
++ * function bfq_bfqq_softrt_next_start()
++ * for details on soft rt detection),
++ * 4) these pending requests experience a high
++ * latency because the application is not
++ * weight-raised while they are pending.
++ */
++ bfqq->last_wr_start_finish = jiffies;
++ bfqq->wr_cur_max_time =
++ bfqd->bfq_wr_rt_max_time;
++ }
++ }
++ if (old_wr_coeff != bfqq->wr_coeff)
++ entity->ioprio_changed = 1;
++add_bfqq_busy:
++ bfqq->last_idle_bklogged = jiffies;
++ bfqq->service_from_backlogged = 0;
++ bfq_clear_bfqq_softrt_update(bfqq);
++ bfq_add_bfqq_busy(bfqd, bfqq);
++ } else {
++ if (bfqd->low_latency && old_wr_coeff == 1 && !rq_is_sync(rq) &&
++ time_is_before_jiffies(
++ bfqq->last_wr_start_finish +
++ bfqd->bfq_wr_min_inter_arr_async)) {
++ bfqq->wr_coeff = bfqd->bfq_wr_coeff;
++ bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
++
++ bfqd->wr_busy_queues++;
++ entity->ioprio_changed = 1;
++ bfq_log_bfqq(bfqd, bfqq,
++ "non-idle wrais starting at %lu, rais_max_time %u",
++ jiffies,
++ jiffies_to_msecs(bfqq->wr_cur_max_time));
++ }
++ if (prev != bfqq->next_rq)
++ bfq_updated_next_req(bfqd, bfqq);
++ }
++
++ if (bfqd->low_latency &&
++ (old_wr_coeff == 1 || bfqq->wr_coeff == 1 ||
++ idle_for_long_time))
++ bfqq->last_wr_start_finish = jiffies;
++}
++
++static struct request *bfq_find_rq_fmerge(struct bfq_data *bfqd,
++ struct bio *bio)
++{
++ struct task_struct *tsk = current;
++ struct bfq_io_cq *bic;
++ struct bfq_queue *bfqq;
++
++ bic = bfq_bic_lookup(bfqd, tsk->io_context);
++ if (bic == NULL)
++ return NULL;
++
++ bfqq = bic_to_bfqq(bic, bfq_bio_sync(bio));
++ if (bfqq != NULL)
++ return elv_rb_find(&bfqq->sort_list, bio_end_sector(bio));
++
++ return NULL;
++}
++
++static void bfq_activate_request(struct request_queue *q, struct request *rq)
++{
++ struct bfq_data *bfqd = q->elevator->elevator_data;
++
++ bfqd->rq_in_driver++;
++ bfqd->last_position = blk_rq_pos(rq) + blk_rq_sectors(rq);
++ bfq_log(bfqd, "activate_request: new bfqd->last_position %llu",
++ (long long unsigned)bfqd->last_position);
++}
++
++static inline void bfq_deactivate_request(struct request_queue *q,
++ struct request *rq)
++{
++ struct bfq_data *bfqd = q->elevator->elevator_data;
++
++ BUG_ON(bfqd->rq_in_driver == 0);
++ bfqd->rq_in_driver--;
++}
++
++static void bfq_remove_request(struct request *rq)
++{
++ struct bfq_queue *bfqq = RQ_BFQQ(rq);
++ struct bfq_data *bfqd = bfqq->bfqd;
++ const int sync = rq_is_sync(rq);
++
++ if (bfqq->next_rq == rq) {
++ bfqq->next_rq = bfq_find_next_rq(bfqd, bfqq, rq);
++ bfq_updated_next_req(bfqd, bfqq);
++ }
++
++ list_del_init(&rq->queuelist);
++ BUG_ON(bfqq->queued[sync] == 0);
++ bfqq->queued[sync]--;
++ bfqd->queued--;
++ elv_rb_del(&bfqq->sort_list, rq);
++
++ if (RB_EMPTY_ROOT(&bfqq->sort_list)) {
++ if (bfq_bfqq_busy(bfqq) && bfqq != bfqd->in_service_queue)
++ bfq_del_bfqq_busy(bfqd, bfqq, 1);
++ /*
++ * Remove queue from request-position tree as it is empty.
++ */
++ if (bfqq->pos_root != NULL) {
++ rb_erase(&bfqq->pos_node, bfqq->pos_root);
++ bfqq->pos_root = NULL;
++ }
++ }
++
++ if (rq->cmd_flags & REQ_META) {
++ BUG_ON(bfqq->meta_pending == 0);
++ bfqq->meta_pending--;
++ }
++}
++
++static int bfq_merge(struct request_queue *q, struct request **req,
++ struct bio *bio)
++{
++ struct bfq_data *bfqd = q->elevator->elevator_data;
++ struct request *__rq;
++
++ __rq = bfq_find_rq_fmerge(bfqd, bio);
++ if (__rq != NULL && elv_rq_merge_ok(__rq, bio)) {
++ *req = __rq;
++ return ELEVATOR_FRONT_MERGE;
++ }
++
++ return ELEVATOR_NO_MERGE;
++}
++
++static void bfq_merged_request(struct request_queue *q, struct request *req,
++ int type)
++{
++ if (type == ELEVATOR_FRONT_MERGE &&
++ rb_prev(&req->rb_node) &&
++ blk_rq_pos(req) <
++ blk_rq_pos(container_of(rb_prev(&req->rb_node),
++ struct request, rb_node))) {
++ struct bfq_queue *bfqq = RQ_BFQQ(req);
++ struct bfq_data *bfqd = bfqq->bfqd;
++ struct request *prev, *next_rq;
++
++ /* Reposition request in its sort_list */
++ elv_rb_del(&bfqq->sort_list, req);
++ elv_rb_add(&bfqq->sort_list, req);
++ /* Choose next request to be served for bfqq */
++ prev = bfqq->next_rq;
++ next_rq = bfq_choose_req(bfqd, bfqq->next_rq, req,
++ bfqd->last_position);
++ BUG_ON(next_rq == NULL);
++ bfqq->next_rq = next_rq;
++ /*
++ * If next_rq changes, update both the queue's budget to
++ * fit the new request and the queue's position in its
++ * rq_pos_tree.
++ */
++ if (prev != bfqq->next_rq) {
++ bfq_updated_next_req(bfqd, bfqq);
++ bfq_rq_pos_tree_add(bfqd, bfqq);
++ }
++ }
++}
++
++static void bfq_merged_requests(struct request_queue *q, struct request *rq,
++ struct request *next)
++{
++ struct bfq_queue *bfqq = RQ_BFQQ(rq);
++
++ /*
++ * Reposition in fifo if next is older than rq.
++ */
++ if (!list_empty(&rq->queuelist) && !list_empty(&next->queuelist) &&
++ time_before(next->fifo_time, rq->fifo_time)) {
++ list_move(&rq->queuelist, &next->queuelist);
++ rq->fifo_time = next->fifo_time;
++ }
++
++ if (bfqq->next_rq == next)
++ bfqq->next_rq = rq;
++
++ bfq_remove_request(next);
++}
++
++/* Must be called with bfqq != NULL */
++static inline void bfq_bfqq_end_wr(struct bfq_queue *bfqq)
++{
++ BUG_ON(bfqq == NULL);
++ if (bfq_bfqq_busy(bfqq))
++ bfqq->bfqd->wr_busy_queues--;
++ bfqq->wr_coeff = 1;
++ bfqq->wr_cur_max_time = 0;
++ /* Trigger a weight change on the next activation of the queue */
++ bfqq->entity.ioprio_changed = 1;
++}
++
++static void bfq_end_wr_async_queues(struct bfq_data *bfqd,
++ struct bfq_group *bfqg)
++{
++ int i, j;
++
++ for (i = 0; i < 2; i++)
++ for (j = 0; j < IOPRIO_BE_NR; j++)
++ if (bfqg->async_bfqq[i][j] != NULL)
++ bfq_bfqq_end_wr(bfqg->async_bfqq[i][j]);
++ if (bfqg->async_idle_bfqq != NULL)
++ bfq_bfqq_end_wr(bfqg->async_idle_bfqq);
++}
++
++static void bfq_end_wr(struct bfq_data *bfqd)
++{
++ struct bfq_queue *bfqq;
++
++ spin_lock_irq(bfqd->queue->queue_lock);
++
++ list_for_each_entry(bfqq, &bfqd->active_list, bfqq_list)
++ bfq_bfqq_end_wr(bfqq);
++ list_for_each_entry(bfqq, &bfqd->idle_list, bfqq_list)
++ bfq_bfqq_end_wr(bfqq);
++ bfq_end_wr_async(bfqd);
++
++ spin_unlock_irq(bfqd->queue->queue_lock);
++}
++
++static int bfq_allow_merge(struct request_queue *q, struct request *rq,
++ struct bio *bio)
++{
++ struct bfq_data *bfqd = q->elevator->elevator_data;
++ struct bfq_io_cq *bic;
++ struct bfq_queue *bfqq;
++
++ /*
++ * Disallow merge of a sync bio into an async request.
++ */
++ if (bfq_bio_sync(bio) && !rq_is_sync(rq))
++ return 0;
++
++ /*
++ * Lookup the bfqq that this bio will be queued with. Allow
++ * merge only if rq is queued there.
++ * Queue lock is held here.
++ */
++ bic = bfq_bic_lookup(bfqd, current->io_context);
++ if (bic == NULL)
++ return 0;
++
++ bfqq = bic_to_bfqq(bic, bfq_bio_sync(bio));
++ return bfqq == RQ_BFQQ(rq);
++}
++
++static void __bfq_set_in_service_queue(struct bfq_data *bfqd,
++ struct bfq_queue *bfqq)
++{
++ if (bfqq != NULL) {
++ bfq_mark_bfqq_must_alloc(bfqq);
++ bfq_mark_bfqq_budget_new(bfqq);
++ bfq_clear_bfqq_fifo_expire(bfqq);
++
++ bfqd->budgets_assigned = (bfqd->budgets_assigned*7 + 256) / 8;
++
++ bfq_log_bfqq(bfqd, bfqq,
++ "set_in_service_queue, cur-budget = %lu",
++ bfqq->entity.budget);
++ }
++
++ bfqd->in_service_queue = bfqq;
++}
++
++/*
++ * Get and set a new queue for service.
++ */
++static struct bfq_queue *bfq_set_in_service_queue(struct bfq_data *bfqd,
++ struct bfq_queue *bfqq)
++{
++ if (!bfqq)
++ bfqq = bfq_get_next_queue(bfqd);
++ else
++ bfq_get_next_queue_forced(bfqd, bfqq);
++
++ __bfq_set_in_service_queue(bfqd, bfqq);
++ return bfqq;
++}
++
++static inline sector_t bfq_dist_from_last(struct bfq_data *bfqd,
++ struct request *rq)
++{
++ if (blk_rq_pos(rq) >= bfqd->last_position)
++ return blk_rq_pos(rq) - bfqd->last_position;
++ else
++ return bfqd->last_position - blk_rq_pos(rq);
++}
++
++/*
++ * Return true if rq is close enough to bfqd->last_position, i.e.,
++ * within BFQQ_SEEK_THR sectors of it.
++ */
++static inline int bfq_rq_close(struct bfq_data *bfqd, struct request *rq)
++{
++ return bfq_dist_from_last(bfqd, rq) <= BFQQ_SEEK_THR;
++}
++
++static struct bfq_queue *bfqq_close(struct bfq_data *bfqd)
++{
++ struct rb_root *root = &bfqd->rq_pos_tree;
++ struct rb_node *parent, *node;
++ struct bfq_queue *__bfqq;
++ sector_t sector = bfqd->last_position;
++
++ if (RB_EMPTY_ROOT(root))
++ return NULL;
++
++ /*
++ * First, if we find a request starting at the end of the last
++ * request, choose it.
++ */
++ __bfqq = bfq_rq_pos_tree_lookup(bfqd, root, sector, &parent, NULL);
++ if (__bfqq != NULL)
++ return __bfqq;
++
++ /*
++ * If the exact sector wasn't found, the parent of the NULL leaf
++ * will contain the closest sector (rq_pos_tree sorted by
++ * next_request position).
++ */
++ __bfqq = rb_entry(parent, struct bfq_queue, pos_node);
++ if (bfq_rq_close(bfqd, __bfqq->next_rq))
++ return __bfqq;
++
++ if (blk_rq_pos(__bfqq->next_rq) < sector)
++ node = rb_next(&__bfqq->pos_node);
++ else
++ node = rb_prev(&__bfqq->pos_node);
++ if (node == NULL)
++ return NULL;
++
++ __bfqq = rb_entry(node, struct bfq_queue, pos_node);
++ if (bfq_rq_close(bfqd, __bfqq->next_rq))
++ return __bfqq;
++
++ return NULL;
++}
++
++/*
++ * bfqd - obvious
++ * cur_bfqq - passed in so that we don't decide that the current queue
++ * is closely cooperating with itself.
++ *
++ * We are assuming that cur_bfqq has dispatched at least one request,
++ * and that bfqd->last_position reflects a position on the disk associated
++ * with the I/O issued by cur_bfqq.
++ */
++static struct bfq_queue *bfq_close_cooperator(struct bfq_data *bfqd,
++ struct bfq_queue *cur_bfqq)
++{
++ struct bfq_queue *bfqq;
++
++ if (bfq_class_idle(cur_bfqq))
++ return NULL;
++ if (!bfq_bfqq_sync(cur_bfqq))
++ return NULL;
++ if (BFQQ_SEEKY(cur_bfqq))
++ return NULL;
++
++ /* If device has only one backlogged bfq_queue, don't search. */
++ if (bfqd->busy_queues == 1)
++ return NULL;
++
++ /*
++ * We should notice if some of the queues are cooperating, e.g.
++ * working closely on the same area of the disk. In that case,
++ * we can group them together and don't waste time idling.
++ */
++ bfqq = bfqq_close(bfqd);
++ if (bfqq == NULL || bfqq == cur_bfqq)
++ return NULL;
++
++ /*
++ * Do not merge queues from different bfq_groups.
++ */
++ if (bfqq->entity.parent != cur_bfqq->entity.parent)
++ return NULL;
++
++ /*
++ * It only makes sense to merge sync queues.
++ */
++ if (!bfq_bfqq_sync(bfqq))
++ return NULL;
++ if (BFQQ_SEEKY(bfqq))
++ return NULL;
++
++ /*
++ * Do not merge queues of different priority classes.
++ */
++ if (bfq_class_rt(bfqq) != bfq_class_rt(cur_bfqq))
++ return NULL;
++
++ return bfqq;
++}
++
++/*
++ * If enough samples have been computed, return the current max budget
++ * stored in bfqd, which is dynamically updated according to the
++ * estimated disk peak rate; otherwise return the default max budget
++ */
++static inline unsigned long bfq_max_budget(struct bfq_data *bfqd)
++{
++ if (bfqd->budgets_assigned < 194)
++ return bfq_default_max_budget;
++ else
++ return bfqd->bfq_max_budget;
++}
++
++/*
++ * Return min budget, which is a fraction of the current or default
++ * max budget (trying with 1/32)
++ */
++static inline unsigned long bfq_min_budget(struct bfq_data *bfqd)
++{
++ if (bfqd->budgets_assigned < 194)
++ return bfq_default_max_budget / 32;
++ else
++ return bfqd->bfq_max_budget / 32;
++}
++
++static void bfq_arm_slice_timer(struct bfq_data *bfqd)
++{
++ struct bfq_queue *bfqq = bfqd->in_service_queue;
++ struct bfq_io_cq *bic;
++ unsigned long sl;
++
++ BUG_ON(!RB_EMPTY_ROOT(&bfqq->sort_list));
++
++ /* Processes have exited, don't wait. */
++ bic = bfqd->in_service_bic;
++ if (bic == NULL || atomic_read(&bic->icq.ioc->active_ref) == 0)
++ return;
++
++ bfq_mark_bfqq_wait_request(bfqq);
++
++ /*
++ * We don't want to idle for seeks, but we do want to allow
++ * fair distribution of slice time for a process doing back-to-back
++ * seeks. So allow a little bit of time for it to submit a new rq.
++ *
++ * To prevent processes with (partly) seeky workloads from
++ * being too ill-treated, grant them a small fraction of the
++ * assigned budget before reducing the waiting time to
++ * BFQ_MIN_TT. In practice this helps reduce latency.
++ */
++ sl = bfqd->bfq_slice_idle;
++ /*
++ * Unless the queue is being weight-raised, grant only minimum idle
++ * time if the queue either has been seeky for long enough or has
++ * already proved to be constantly seeky.
++ */
++ if (bfq_sample_valid(bfqq->seek_samples) &&
++ ((BFQQ_SEEKY(bfqq) && bfqq->entity.service >
++ bfq_max_budget(bfqq->bfqd) / 8) ||
++ bfq_bfqq_constantly_seeky(bfqq)) && bfqq->wr_coeff == 1)
++ sl = min(sl, msecs_to_jiffies(BFQ_MIN_TT));
++ else if (bfqq->wr_coeff > 1)
++ sl = sl * 3;
++ bfqd->last_idling_start = ktime_get();
++ mod_timer(&bfqd->idle_slice_timer, jiffies + sl);
++ bfq_log(bfqd, "arm idle: %u/%u ms",
++ jiffies_to_msecs(sl), jiffies_to_msecs(bfqd->bfq_slice_idle));
++}
++
++/*
++ * Set the maximum time for the in-service queue to consume its
++ * budget. This prevents seeky processes from lowering the disk
++ * throughput (always guaranteed with a time slice scheme as in CFQ).
++ */
++static void bfq_set_budget_timeout(struct bfq_data *bfqd)
++{
++ struct bfq_queue *bfqq = bfqd->in_service_queue;
++ unsigned int timeout_coeff;
++ if (bfqq->wr_cur_max_time == bfqd->bfq_wr_rt_max_time)
++ timeout_coeff = 1;
++ else
++ timeout_coeff = bfqq->entity.weight / bfqq->entity.orig_weight;
++
++ bfqd->last_budget_start = ktime_get();
++
++ bfq_clear_bfqq_budget_new(bfqq);
++ bfqq->budget_timeout = jiffies +
++ bfqd->bfq_timeout[bfq_bfqq_sync(bfqq)] * timeout_coeff;
++
++ bfq_log_bfqq(bfqd, bfqq, "set budget_timeout %u",
++ jiffies_to_msecs(bfqd->bfq_timeout[bfq_bfqq_sync(bfqq)] *
++ timeout_coeff));
++}
++
++/*
++ * Move request from internal lists to the request queue dispatch list.
++ */
++static void bfq_dispatch_insert(struct request_queue *q, struct request *rq)
++{
++ struct bfq_data *bfqd = q->elevator->elevator_data;
++ struct bfq_queue *bfqq = RQ_BFQQ(rq);
++
++ /*
++ * For consistency, the next instruction should have been executed
++ * after removing the request from the queue and dispatching it.
++ * We execute instead this instruction before bfq_remove_request()
++ * (and hence introduce a temporary inconsistency), for efficiency.
++ * In fact, in a forced_dispatch, this prevents two counters related
++ * to bfqq->dispatched from being uselessly decremented if bfqq
++ * is not in service, and then incremented again after
++ * incrementing bfqq->dispatched.
++ */
++ bfqq->dispatched++;
++ bfq_remove_request(rq);
++ elv_dispatch_sort(q, rq);
++
++ if (bfq_bfqq_sync(bfqq))
++ bfqd->sync_flight++;
++}
++
++/*
++ * Return expired entry, or NULL to just start from scratch in rbtree.
++ */
++static struct request *bfq_check_fifo(struct bfq_queue *bfqq)
++{
++ struct request *rq = NULL;
++
++ if (bfq_bfqq_fifo_expire(bfqq))
++ return NULL;
++
++ bfq_mark_bfqq_fifo_expire(bfqq);
++
++ if (list_empty(&bfqq->fifo))
++ return NULL;
++
++ rq = rq_entry_fifo(bfqq->fifo.next);
++
++ if (time_before(jiffies, rq->fifo_time))
++ return NULL;
++
++ return rq;
++}
++
++/*
++ * Must be called with the queue_lock held.
++ */
++static int bfqq_process_refs(struct bfq_queue *bfqq)
++{
++ int process_refs, io_refs;
++
++ io_refs = bfqq->allocated[READ] + bfqq->allocated[WRITE];
++ process_refs = atomic_read(&bfqq->ref) - io_refs - bfqq->entity.on_st;
++ BUG_ON(process_refs < 0);
++ return process_refs;
++}
++
++static void bfq_setup_merge(struct bfq_queue *bfqq, struct bfq_queue *new_bfqq)
++{
++ int process_refs, new_process_refs;
++ struct bfq_queue *__bfqq;
++
++ /*
++ * If there are no process references on the new_bfqq, then it is
++ * unsafe to follow the ->new_bfqq chain as other bfqq's in the chain
++ * may have dropped their last reference (not just their last process
++ * reference).
++ */
++ if (!bfqq_process_refs(new_bfqq))
++ return;
++
++ /* Avoid a circular list and skip interim queue merges. */
++ while ((__bfqq = new_bfqq->new_bfqq)) {
++ if (__bfqq == bfqq)
++ return;
++ new_bfqq = __bfqq;
++ }
++
++ process_refs = bfqq_process_refs(bfqq);
++ new_process_refs = bfqq_process_refs(new_bfqq);
++ /*
++ * If the process for the bfqq has gone away, there is no
++ * sense in merging the queues.
++ */
++ if (process_refs == 0 || new_process_refs == 0)
++ return;
++
++ /*
++ * Merge in the direction of the lesser amount of work.
++ */
++ if (new_process_refs >= process_refs) {
++ bfqq->new_bfqq = new_bfqq;
++ atomic_add(process_refs, &new_bfqq->ref);
++ } else {
++ new_bfqq->new_bfqq = bfqq;
++ atomic_add(new_process_refs, &bfqq->ref);
++ }
++ bfq_log_bfqq(bfqq->bfqd, bfqq, "scheduling merge with queue %d",
++ new_bfqq->pid);
++}
++
++static inline unsigned long bfq_bfqq_budget_left(struct bfq_queue *bfqq)
++{
++ struct bfq_entity *entity = &bfqq->entity;
++ return entity->budget - entity->service;
++}
++
++static void __bfq_bfqq_expire(struct bfq_data *bfqd, struct bfq_queue *bfqq)
++{
++ BUG_ON(bfqq != bfqd->in_service_queue);
++
++ __bfq_bfqd_reset_in_service(bfqd);
++
++ /*
++ * If this bfqq is shared between multiple processes, check
++ * to make sure that those processes are still issuing I/Os
++ * within the mean seek distance. If not, it may be time to
++ * break the queues apart again.
++ */
++ if (bfq_bfqq_coop(bfqq) && BFQQ_SEEKY(bfqq))
++ bfq_mark_bfqq_split_coop(bfqq);
++
++ if (RB_EMPTY_ROOT(&bfqq->sort_list)) {
++ /*
++ * Overloading budget_timeout field to store the time
++ * at which the queue remains with no backlog; used by
++ * the weight-raising mechanism.
++ */
++ bfqq->budget_timeout = jiffies;
++ bfq_del_bfqq_busy(bfqd, bfqq, 1);
++ } else {
++ bfq_activate_bfqq(bfqd, bfqq);
++ /*
++ * Resort priority tree of potential close cooperators.
++ */
++ bfq_rq_pos_tree_add(bfqd, bfqq);
++ }
++}
++
++/**
++ * __bfq_bfqq_recalc_budget - try to adapt the budget to the @bfqq behavior.
++ * @bfqd: device data.
++ * @bfqq: queue to update.
++ * @reason: reason for expiration.
++ *
++ * Handle the feedback on @bfqq budget. See the body for detailed
++ * comments.
++ */
++static void __bfq_bfqq_recalc_budget(struct bfq_data *bfqd,
++ struct bfq_queue *bfqq,
++ enum bfqq_expiration reason)
++{
++ struct request *next_rq;
++ unsigned long budget, min_budget;
++
++ budget = bfqq->max_budget;
++ min_budget = bfq_min_budget(bfqd);
++
++ BUG_ON(bfqq != bfqd->in_service_queue);
++
++ bfq_log_bfqq(bfqd, bfqq, "recalc_budg: last budg %lu, budg left %lu",
++ bfqq->entity.budget, bfq_bfqq_budget_left(bfqq));
++ bfq_log_bfqq(bfqd, bfqq, "recalc_budg: last max_budg %lu, min budg %lu",
++ budget, bfq_min_budget(bfqd));
++ bfq_log_bfqq(bfqd, bfqq, "recalc_budg: sync %d, seeky %d",
++ bfq_bfqq_sync(bfqq), BFQQ_SEEKY(bfqd->in_service_queue));
++
++ if (bfq_bfqq_sync(bfqq)) {
++ switch (reason) {
++ /*
++ * Caveat: in all the following cases we trade latency
++ * for throughput.
++ */
++ case BFQ_BFQQ_TOO_IDLE:
++ /*
++ * This is the only case where we may reduce
++ * the budget: if there is no request of the
++ * process still waiting for completion, then
++ * we assume (tentatively) that the timer has
++ * expired because the batch of requests of
++ * the process could have been served with a
++ * smaller budget. Hence, betting that
++ * process will behave in the same way when it
++ * becomes backlogged again, we reduce its
++ * next budget. As long as we guess right,
++ * this budget cut reduces the latency
++ * experienced by the process.
++ *
++ * However, if there are still outstanding
++ * requests, then the process may have not yet
++ * issued its next request just because it is
++ * still waiting for the completion of some of
++ * the still outstanding ones. So in this
++ * subcase we do not reduce its budget, on the
++ * contrary we increase it to possibly boost
++ * the throughput, as discussed in the
++ * comments to the BUDGET_TIMEOUT case.
++ */
++ if (bfqq->dispatched > 0) /* still outstanding reqs */
++ budget = min(budget * 2, bfqd->bfq_max_budget);
++ else {
++ if (budget > 5 * min_budget)
++ budget -= 4 * min_budget;
++ else
++ budget = min_budget;
++ }
++ break;
++ case BFQ_BFQQ_BUDGET_TIMEOUT:
++ /*
++ * We double the budget here because: 1) it
++ * gives the chance to boost the throughput if
++ * this is not a seeky process (which may have
++ * bumped into this timeout because of, e.g.,
++ * ZBR), 2) together with charge_full_budget
++ * it helps give seeky processes higher
++ * timestamps, and hence be served less
++ * frequently.
++ */
++ budget = min(budget * 2, bfqd->bfq_max_budget);
++ break;
++ case BFQ_BFQQ_BUDGET_EXHAUSTED:
++ /*
++ * The process still has backlog, and did not
++ * let either the budget timeout or the disk
++ * idling timeout expire. Hence it is not
++ * seeky, has a short thinktime and may be
++ * happy with a higher budget too. So
++ * definitely increase the budget of this good
++ * candidate to boost the disk throughput.
++ */
++ budget = min(budget * 4, bfqd->bfq_max_budget);
++ break;
++ case BFQ_BFQQ_NO_MORE_REQUESTS:
++ /*
++ * Leave the budget unchanged.
++ */
++ default:
++ return;
++ }
++ } else /* async queue */
++ /* async queues always get the maximum possible budget
++ * (their ability to dispatch is limited by
++ * @bfqd->bfq_max_budget_async_rq).
++ */
++ budget = bfqd->bfq_max_budget;
++
++ bfqq->max_budget = budget;
++
++ if (bfqd->budgets_assigned >= 194 && bfqd->bfq_user_max_budget == 0 &&
++ bfqq->max_budget > bfqd->bfq_max_budget)
++ bfqq->max_budget = bfqd->bfq_max_budget;
++
++ /*
++ * Make sure that we have enough budget for the next request.
++ * Since the finish time of the bfqq must be kept in sync with
++ * the budget, be sure to call __bfq_bfqq_expire() after the
++ * update.
++ */
++ next_rq = bfqq->next_rq;
++ if (next_rq != NULL)
++ bfqq->entity.budget = max_t(unsigned long, bfqq->max_budget,
++ bfq_serv_to_charge(next_rq, bfqq));
++ else
++ bfqq->entity.budget = bfqq->max_budget;
++
++ bfq_log_bfqq(bfqd, bfqq, "head sect: %u, new budget %lu",
++ next_rq != NULL ? blk_rq_sectors(next_rq) : 0,
++ bfqq->entity.budget);
++}
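The budget feedback for sync queues above boils down to a few multiplicative rules. A minimal user-space sketch of those rules (illustrative names; the kernel clamps with min() against bfqd->bfq_max_budget, which the ternaries below emulate):

```c
#include <assert.h>

/* Illustrative expiration reasons, mirroring the cases handled above. */
enum reason { TOO_IDLE, BUDGET_TIMEOUT, BUDGET_EXHAUSTED };

/* Sketch of the sync-queue budget feedback: shrink after an idle
 * timeout with no outstanding requests, double on a budget timeout
 * (or when requests are still in flight), quadruple when the budget
 * was exhausted; always clamp to the device-wide maximum. */
static unsigned long feedback(unsigned long budget, unsigned long max_budget,
                              unsigned long min_budget, int dispatched,
                              enum reason r)
{
    switch (r) {
    case TOO_IDLE:
        if (dispatched > 0)            /* still outstanding requests */
            return budget * 2 > max_budget ? max_budget : budget * 2;
        return budget > 5 * min_budget ? budget - 4 * min_budget
                                       : min_budget;
    case BUDGET_TIMEOUT:
        return budget * 2 > max_budget ? max_budget : budget * 2;
    case BUDGET_EXHAUSTED:
        return budget * 4 > max_budget ? max_budget : budget * 4;
    }
    return budget;
}
```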
++
++static unsigned long bfq_calc_max_budget(u64 peak_rate, u64 timeout)
++{
++ unsigned long max_budget;
++
++ /*
++ * The max_budget calculated when autotuning is equal to the
++ * number of sectors transferred in timeout_sync at the
++ * estimated peak rate.
++ */
++ max_budget = (unsigned long)(peak_rate * 1000 *
++ timeout >> BFQ_RATE_SHIFT);
++
++ return max_budget;
++}
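The autotuned budget above is the number of sectors the device can transfer, at the estimated fixed-point peak rate, within the (millisecond) timeout. A sketch under the assumption that RATE_SHIFT plays the role of BFQ_RATE_SHIFT and that the peak rate is stored in sectors per microsecond, fixed point:

```c
#include <assert.h>
#include <stdint.h>

#define RATE_SHIFT 16  /* assumption: stands in for BFQ_RATE_SHIFT */

/* Illustrative sketch: sectors = rate (fixed point, sectors/us)
 * * timeout_ms * 1000 (ms -> us), shifted back out of fixed point. */
static unsigned long calc_max_budget(uint64_t peak_rate_fp,
                                     uint64_t timeout_ms)
{
    return (unsigned long)((peak_rate_fp * 1000 * timeout_ms)
                           >> RATE_SHIFT);
}
```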
++
++/*
++ * In addition to updating the peak rate, checks whether the process
++ * is "slow", and returns 1 if so. This slow flag is used, in addition
++ * to the budget timeout, to reduce the amount of service provided to
++ * seeky processes, and hence reduce their chances to lower the
++ * throughput. See the code for more details.
++ */
++static int bfq_update_peak_rate(struct bfq_data *bfqd, struct bfq_queue *bfqq,
++ int compensate, enum bfqq_expiration reason)
++{
++ u64 bw, usecs, expected, timeout;
++ ktime_t delta;
++ int update = 0;
++
++ if (!bfq_bfqq_sync(bfqq) || bfq_bfqq_budget_new(bfqq))
++ return 0;
++
++ if (compensate)
++ delta = bfqd->last_idling_start;
++ else
++ delta = ktime_get();
++ delta = ktime_sub(delta, bfqd->last_budget_start);
++ usecs = ktime_to_us(delta);
++
++ /* Don't trust short/unrealistic values. */
++ if (usecs < 100 || usecs >= LONG_MAX)
++ return 0;
++
++ /*
++ * Calculate the bandwidth for the last slice. We use a 64 bit
++ * value to store the peak rate, in sectors per usec in fixed
++ * point math. We do so to have enough precision in the estimate
++ * and to avoid overflows.
++ */
++ bw = (u64)bfqq->entity.service << BFQ_RATE_SHIFT;
++ do_div(bw, (unsigned long)usecs);
++
++ timeout = jiffies_to_msecs(bfqd->bfq_timeout[BLK_RW_SYNC]);
++
++ /*
++ * Use only long (> 20ms) intervals to filter out spikes for
++ * the peak rate estimation.
++ */
++ if (usecs > 20000) {
++ if (bw > bfqd->peak_rate ||
++ (!BFQQ_SEEKY(bfqq) &&
++ reason == BFQ_BFQQ_BUDGET_TIMEOUT)) {
++ bfq_log(bfqd, "measured bw =%llu", bw);
++ /*
++ * To smooth oscillations use a low-pass filter with
++ * alpha=7/8, i.e.,
++ * new_rate = (7/8) * old_rate + (1/8) * bw
++ */
++ do_div(bw, 8);
++ if (bw == 0)
++ return 0;
++ bfqd->peak_rate *= 7;
++ do_div(bfqd->peak_rate, 8);
++ bfqd->peak_rate += bw;
++ update = 1;
++ bfq_log(bfqd, "new peak_rate=%llu", bfqd->peak_rate);
++ }
++
++ update |= bfqd->peak_rate_samples == BFQ_PEAK_RATE_SAMPLES - 1;
++
++ if (bfqd->peak_rate_samples < BFQ_PEAK_RATE_SAMPLES)
++ bfqd->peak_rate_samples++;
++
++ if (bfqd->peak_rate_samples == BFQ_PEAK_RATE_SAMPLES &&
++ update) {
++ int dev_type = blk_queue_nonrot(bfqd->queue);
++ if (bfqd->bfq_user_max_budget == 0) {
++ bfqd->bfq_max_budget =
++ bfq_calc_max_budget(bfqd->peak_rate,
++ timeout);
++ bfq_log(bfqd, "new max_budget=%lu",
++ bfqd->bfq_max_budget);
++ }
++ if (bfqd->device_speed == BFQ_BFQD_FAST &&
++ bfqd->peak_rate < device_speed_thresh[dev_type]) {
++ bfqd->device_speed = BFQ_BFQD_SLOW;
++ bfqd->RT_prod = R_slow[dev_type] *
++ T_slow[dev_type];
++ } else if (bfqd->device_speed == BFQ_BFQD_SLOW &&
++ bfqd->peak_rate > device_speed_thresh[dev_type]) {
++ bfqd->device_speed = BFQ_BFQD_FAST;
++ bfqd->RT_prod = R_fast[dev_type] *
++ T_fast[dev_type];
++ }
++ }
++ }
++
++ /*
++	 * If the process has been served for too short a time
++	 * interval to let its possible sequential accesses prevail
++	 * over the initial seek time needed to move the disk head to
++	 * the first sector it requested, then give the process a
++	 * chance and, for the moment, return false.
++ */
++ if (bfqq->entity.budget <= bfq_max_budget(bfqd) / 8)
++ return 0;
++
++ /*
++ * A process is considered ``slow'' (i.e., seeky, so that we
++ * cannot treat it fairly in the service domain, as it would
++ * slow down too much the other processes) if, when a slice
++ * ends for whatever reason, it has received service at a
++ * rate that would not be high enough to complete the budget
++ * before the budget timeout expiration.
++ */
++ expected = bw * 1000 * timeout >> BFQ_RATE_SHIFT;
++
++ /*
++ * Caveat: processes doing IO in the slower disk zones will
++ * tend to be slow(er) even if not seeky. And the estimated
++ * peak rate will actually be an average over the disk
++ * surface. Hence, to not be too harsh with unlucky processes,
++ * we keep a budget/3 margin of safety before declaring a
++ * process slow.
++ */
++ return expected > (4 * bfqq->entity.budget) / 3;
++}
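As an aside from the patch itself, the low-pass filter described in the comments above can be modelled in a few lines of plain C. This is an illustrative sketch, not the kernel code: the function name is an assumption, and the fixed-point unit (sectors per usec, shifted by BFQ_RATE_SHIFT) is only implied by the patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustrative model of the peak-rate low-pass filter above:
 *   new_rate = (7/8) * old_rate + (1/8) * sample
 * using the same integer divisions as the patch (sample / 8,
 * then old_rate * 7 / 8, then the sum).
 */
static uint64_t filter_peak_rate(uint64_t old_rate, uint64_t sample)
{
	return old_rate * 7 / 8 + sample / 8;
}
```

A steady stream of identical samples is a fixed point of the filter, while a single spike moves the estimate by only one eighth of its amplitude, which is what smooths out oscillations.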
++
++/*
++ * To be deemed as soft real-time, an application must meet two
++ * requirements. First, the application must not require an average
++ * bandwidth higher than the approximate bandwidth required to play back
++ * or record a compressed high-definition video.
++ * The next function is invoked on the completion of the last request of a
++ * batch, to compute the next-start time instant, soft_rt_next_start, such
++ * that, if the next request of the application does not arrive before
++ * soft_rt_next_start, then the above requirement on the bandwidth is met.
++ *
++ * The second requirement is that the request pattern of the application is
++ * isochronous, i.e., that, after issuing a request or a batch of requests,
++ * the application stops issuing new requests until all its pending requests
++ * have been completed. After that, the application may issue a new batch,
++ * and so on.
++ * For this reason the next function is invoked to compute
++ * soft_rt_next_start only for applications that meet this requirement,
++ * whereas soft_rt_next_start is set to infinity for applications that do
++ * not.
++ *
++ * Unfortunately, even a greedy application may happen to behave in an
++ * isochronous way if the CPU load is high. In fact, the application may
++ * stop issuing requests while the CPUs are busy serving other processes,
++ * then restart, then stop again for a while, and so on. In addition, if
++ * the disk achieves a low enough throughput with the request pattern
++ * issued by the application (e.g., because the request pattern is random
++ * and/or the device is slow), then the application may meet the above
++ * bandwidth requirement too. To prevent such a greedy application from
++ * being deemed soft real-time, a further rule is used in the computation of
++ * soft_rt_next_start: soft_rt_next_start must be higher than the current
++ * time plus the maximum time for which the arrival of a request is waited
++ * for when a sync queue becomes idle, namely bfqd->bfq_slice_idle.
++ * This filters out greedy applications, as the latter issue instead their
++ * next request as soon as possible after the last one has been completed
++ * (in contrast, when a batch of requests is completed, a soft real-time
++ * application spends some time processing data).
++ *
++ * Unfortunately, the last filter may easily generate false positives if
++ * only bfqd->bfq_slice_idle is used as a reference time interval and one
++ * or both the following cases occur:
++ * 1) HZ is so low that the duration of a jiffy is comparable to or higher
++ * than bfqd->bfq_slice_idle. This happens, e.g., on slow devices with
++ * HZ=100.
++ * 2) jiffies, instead of increasing at a constant rate, may stop increasing
++ * for a while, then suddenly 'jump' by several units to recover the lost
++ * increments. This seems to happen, e.g., inside virtual machines.
++ * To address this issue, we do not use as a reference time interval just
++ * bfqd->bfq_slice_idle, but bfqd->bfq_slice_idle plus a few jiffies. In
++ * particular we add the minimum number of jiffies for which the filter
++ * seems to be quite precise also in embedded systems and KVM/QEMU virtual
++ * machines.
++ */
++static inline unsigned long bfq_bfqq_softrt_next_start(struct bfq_data *bfqd,
++ struct bfq_queue *bfqq)
++{
++ return max(bfqq->last_idle_bklogged +
++ HZ * bfqq->service_from_backlogged /
++ bfqd->bfq_wr_max_softrt_rate,
++ jiffies + bfqq->bfqd->bfq_slice_idle + 4);
++}
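Stepping outside the patch for a moment, the two bounds that bfq_bfqq_softrt_next_start() combines can be modelled with ordinary integers. The parameter names below are illustrative stand-ins for the bfqd/bfqq fields, and HZ is passed in explicitly so the sketch is self-contained.

```c
#include <assert.h>

/*
 * next_start is the later of:
 *  - the bandwidth bound: the instant at which the service received
 *    while backlogged, replayed at the soft real-time rate
 *    (sectors/sec), would complete;
 *  - the greediness bound: now + slice_idle + a few jiffies.
 */
static unsigned long softrt_next_start(unsigned long last_idle_bklogged,
				       unsigned long service_from_backlogged,
				       unsigned long max_softrt_rate,
				       unsigned long now,
				       unsigned long slice_idle,
				       unsigned long hz)
{
	unsigned long bw_bound = last_idle_bklogged +
		hz * service_from_backlogged / max_softrt_rate;
	unsigned long greedy_bound = now + slice_idle + 4;

	return bw_bound > greedy_bound ? bw_bound : greedy_bound;
}
```

Whichever bound is later wins: a fast application is held back by the bandwidth bound, while a greedy one that stops issuing requests only because the CPU is busy is filtered by the slice_idle-based bound.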
++
++/*
++ * Return the largest-possible time instant such that, for as long as possible,
++ * the current time will be lower than this time instant according to the macro
++ * time_is_before_jiffies().
++ */
++static inline unsigned long bfq_infinity_from_now(unsigned long now)
++{
++ return now + ULONG_MAX / 2;
++}
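To see why now + ULONG_MAX / 2 behaves as infinity for time_is_before_jiffies(), recall that jiffies comparisons are done in wraparound-safe signed arithmetic. Here is a minimal model of that idiom; the real macros live in the kernel's <linux/jiffies.h>, and the names below are only for this sketch.

```c
#include <assert.h>
#include <limits.h>

/* Wraparound-safe "a is before b", like the kernel's time_before(). */
static int model_time_before(unsigned long a, unsigned long b)
{
	return (long)(a - b) < 0;
}

static unsigned long infinity_from_now(unsigned long now)
{
	return now + ULONG_MAX / 2;
}
```

Every instant up to ULONG_MAX / 2 - 1 ticks after now compares as before this pseudo-infinity, which is the widest window the signed comparison allows before it would wrap and invert.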
++
++/**
++ * bfq_bfqq_expire - expire a queue.
++ * @bfqd: device owning the queue.
++ * @bfqq: the queue to expire.
++ * @compensate: if true, compensate for the time spent idling.
++ * @reason: the reason causing the expiration.
++ *
++ * If the process associated to the queue is slow (i.e., seeky), or in
++ * case of budget timeout, or, finally, if it is async, we
++ * artificially charge it an entire budget (independently of the
++ * actual service it received). As a consequence, the queue will get
++ * higher timestamps than the correct ones upon reactivation, and
++ * hence it will be rescheduled as if it had received more service
++ * than what it actually received. In the end, this class of processes
++ * will receive less service in proportion to how slowly they consume
++ * their budgets (and hence how seriously they tend to lower the
++ * throughput).
++ *
++ * In contrast, when a queue expires because it has been idling for
++ * too long or because it has exhausted its budget, we do not touch the
++ * amount of service it has received. Hence when the queue will be
++ * reactivated and its timestamps updated, the latter will be in sync
++ * with the actual service received by the queue until expiration.
++ *
++ * Charging a full budget to the first type of queues and the exact
++ * service to the others has the effect of using the WF2Q+ policy to
++ * schedule the former on a timeslice basis, without violating the
++ * service domain guarantees of the latter.
++ */
++static void bfq_bfqq_expire(struct bfq_data *bfqd,
++ struct bfq_queue *bfqq,
++ int compensate,
++ enum bfqq_expiration reason)
++{
++ int slow;
++ BUG_ON(bfqq != bfqd->in_service_queue);
++
++ /* Update disk peak rate for autotuning and check whether the
++ * process is slow (see bfq_update_peak_rate).
++ */
++ slow = bfq_update_peak_rate(bfqd, bfqq, compensate, reason);
++
++ /*
++ * As above explained, 'punish' slow (i.e., seeky), timed-out
++ * and async queues, to favor sequential sync workloads.
++ *
++ * Processes doing I/O in the slower disk zones will tend to be
++ * slow(er) even if not seeky. Hence, since the estimated peak
++ * rate is actually an average over the disk surface, these
++ * processes may timeout just for bad luck. To avoid punishing
++ * them we do not charge a full budget to a process that
++ * succeeded in consuming at least 2/3 of its budget.
++ */
++ if (slow || (reason == BFQ_BFQQ_BUDGET_TIMEOUT &&
++ bfq_bfqq_budget_left(bfqq) >= bfqq->entity.budget / 3))
++ bfq_bfqq_charge_full_budget(bfqq);
++
++ bfqq->service_from_backlogged += bfqq->entity.service;
++
++ if (BFQQ_SEEKY(bfqq) && reason == BFQ_BFQQ_BUDGET_TIMEOUT &&
++ !bfq_bfqq_constantly_seeky(bfqq)) {
++ bfq_mark_bfqq_constantly_seeky(bfqq);
++ if (!blk_queue_nonrot(bfqd->queue))
++ bfqd->const_seeky_busy_in_flight_queues++;
++ }
++
++ if (reason == BFQ_BFQQ_TOO_IDLE &&
++ bfqq->entity.service <= 2 * bfqq->entity.budget / 10)
++ bfq_clear_bfqq_IO_bound(bfqq);
++
++ if (bfqd->low_latency && bfqq->wr_coeff == 1)
++ bfqq->last_wr_start_finish = jiffies;
++
++ if (bfqd->low_latency && bfqd->bfq_wr_max_softrt_rate > 0 &&
++ RB_EMPTY_ROOT(&bfqq->sort_list)) {
++ /*
++ * If we get here, and there are no outstanding requests,
++ * then the request pattern is isochronous (see the comments
++ * to the function bfq_bfqq_softrt_next_start()). Hence we
++ * can compute soft_rt_next_start. If, instead, the queue
++ * still has outstanding requests, then we have to wait
++ * for the completion of all the outstanding requests to
++ * discover whether the request pattern is actually
++ * isochronous.
++ */
++ if (bfqq->dispatched == 0)
++ bfqq->soft_rt_next_start =
++ bfq_bfqq_softrt_next_start(bfqd, bfqq);
++ else {
++ /*
++ * The application is still waiting for the
++ * completion of one or more requests:
++ * prevent it from possibly being incorrectly
++ * deemed as soft real-time by setting its
++ * soft_rt_next_start to infinity. In fact,
++ * without this assignment, the application
++ * would be incorrectly deemed as soft
++ * real-time if:
++ * 1) it issued a new request before the
++ * completion of all its in-flight
++ * requests, and
++ * 2) at that time, its soft_rt_next_start
++ * happened to be in the past.
++ */
++ bfqq->soft_rt_next_start =
++ bfq_infinity_from_now(jiffies);
++ /*
++ * Schedule an update of soft_rt_next_start to when
++ * the task may be discovered to be isochronous.
++ */
++ bfq_mark_bfqq_softrt_update(bfqq);
++ }
++ }
++
++ bfq_log_bfqq(bfqd, bfqq,
++ "expire (%d, slow %d, num_disp %d, idle_win %d)", reason,
++ slow, bfqq->dispatched, bfq_bfqq_idle_window(bfqq));
++
++ /*
++ * Increase, decrease or leave budget unchanged according to
++ * reason.
++ */
++ __bfq_bfqq_recalc_budget(bfqd, bfqq, reason);
++ __bfq_bfqq_expire(bfqd, bfqq);
++}
++
++/*
++ * Budget timeout is not implemented through a dedicated timer, but
++ * just checked on request arrivals and completions, as well as on
++ * idle timer expirations.
++ */
++static int bfq_bfqq_budget_timeout(struct bfq_queue *bfqq)
++{
++ if (bfq_bfqq_budget_new(bfqq) ||
++ time_before(jiffies, bfqq->budget_timeout))
++ return 0;
++ return 1;
++}
++
++/*
++ * If we expire a queue that is waiting for the arrival of a new
++ * request, we may prevent the fictitious timestamp back-shifting that
++ * allows the guarantees of the queue to be preserved (see [1] for
++ * this tricky aspect). Hence we return true only if this condition
++ * does not hold, or if the queue is slow enough to deserve only to be
++ * kicked off for preserving a high throughput.
++ */
++static inline int bfq_may_expire_for_budg_timeout(struct bfq_queue *bfqq)
++{
++ bfq_log_bfqq(bfqq->bfqd, bfqq,
++ "may_budget_timeout: wait_request %d left %d timeout %d",
++ bfq_bfqq_wait_request(bfqq),
++ bfq_bfqq_budget_left(bfqq) >= bfqq->entity.budget / 3,
++ bfq_bfqq_budget_timeout(bfqq));
++
++ return (!bfq_bfqq_wait_request(bfqq) ||
++ bfq_bfqq_budget_left(bfqq) >= bfqq->entity.budget / 3)
++ &&
++ bfq_bfqq_budget_timeout(bfqq);
++}
++
++/*
++ * Device idling is allowed only for the queues for which this function
++ * returns true. For this reason, the return value of this function plays a
++ * critical role for both throughput boosting and service guarantees. The
++ * return value is computed through a logical expression. In this rather
++ * long comment, we try to briefly describe all the details and motivations
++ * behind the components of this logical expression.
++ *
++ * First, the expression may be true only for sync queues. Besides, if
++ * bfqq is also being weight-raised, then the expression always evaluates
++ * to true, as device idling is instrumental for preserving low-latency
++ * guarantees (see [1]). Otherwise, the expression evaluates to true only
++ * if bfqq has a non-null idle window and at least one of the following
++ * two conditions holds. The first condition is that the device is not
++ * performing NCQ, because idling the device most certainly boosts the
++ * throughput if this condition holds and bfqq has been granted a non-null
++ * idle window. The second compound condition is made of the logical AND of
++ * two components.
++ *
++ * The first component is true only if there is no weight-raised busy
++ * queue. This guarantees that the device is not idled for a sync non-
++ * weight-raised queue when there are busy weight-raised queues. The former
++ * is then expired immediately if empty. Combined with the timestamping
++ * rules of BFQ (see [1] for details), this causes sync non-weight-raised
++ * queues to get a lower number of requests served, and hence to ask for a
++ * lower number of requests from the request pool, before the busy weight-
++ * raised queues get served again.
++ *
++ * This is beneficial for the processes associated with weight-raised
++ * queues, when the request pool is saturated (e.g., in the presence of
++ * write hogs). In fact, if the processes associated with the other queues
++ * ask for requests at a lower rate, then weight-raised processes have a
++ * higher probability to get a request from the pool immediately (or at
++ * least soon) when they need one. Hence they have a higher probability to
++ * actually get a fraction of the disk throughput proportional to their
++ * high weight. This is especially true with NCQ-capable drives, which
++ * enqueue several requests in advance and further reorder internally-
++ * queued requests.
++ *
++ * In the end, mistreating non-weight-raised queues when there are busy
++ * weight-raised queues seems to mitigate starvation problems in the
++ * presence of heavy write workloads and NCQ, and hence to guarantee a
++ * higher application and system responsiveness in these hostile scenarios.
++ *
++ * If the first component of the compound condition is instead true, i.e.,
++ * there is no weight-raised busy queue, then the second component of the
++ * compound condition takes into account service-guarantee and throughput
++ * issues related to NCQ (recall that the compound condition is evaluated
++ * only if the device is detected as supporting NCQ).
++ *
++ * As for service guarantees, allowing the drive to enqueue more than one
++ * request at a time, and hence delegating de facto final scheduling
++ * decisions to the drive's internal scheduler, causes loss of control on
++ * the actual request service order. In this respect, when the drive is
++ * allowed to enqueue more than one request at a time, the service
++ * distribution enforced by the drive's internal scheduler is likely to
++ * coincide with the desired device-throughput distribution only in the
++ * following, perfectly symmetric, scenario:
++ * 1) all active queues have the same weight,
++ * 2) all active groups at the same level in the groups tree have the same
++ * weight,
++ * 3) all active groups at the same level in the groups tree have the same
++ * number of children.
++ *
++ * Even in such a scenario, sequential I/O may still receive a preferential
++ * treatment, but this is not likely to be a big issue with flash-based
++ * devices, because of their non-dramatic loss of throughput with random
++ * I/O. Things do differ with HDDs, for which additional care is taken, as
++ * explained after completing the discussion for flash-based devices.
++ *
++ * Unfortunately, keeping the necessary state for evaluating exactly the
++ * above symmetry conditions would be quite complex and time-consuming.
++ * Therefore BFQ evaluates instead the following stronger sub-conditions,
++ * for which it is much easier to maintain the needed state:
++ * 1) all active queues have the same weight,
++ * 2) all active groups have the same weight,
++ * 3) all active groups have at most one active child each.
++ * In particular, the last two conditions are always true if hierarchical
++ * support and the cgroups interface are not enabled, hence no state needs
++ * to be maintained in this case.
++ *
++ * According to the above considerations, the second component of the
++ * compound condition evaluates to true if any of the above symmetry
++ * sub-conditions does not hold, or the device is not flash-based. Therefore,
++ * if also the first component is true, then idling is allowed for a sync
++ * queue. These are the only sub-conditions considered if the device is
++ * flash-based, as, for such a device, it is sensible to force idling only
++ * for service-guarantee issues. In fact, as for throughput, idling
++ * NCQ-capable flash-based devices would not boost the throughput even
++ * with sequential I/O; rather it would lower the throughput in proportion
++ * to how fast the device is. In the end, (only) if all the three
++ * sub-conditions hold and the device is flash-based, the compound
++ * condition evaluates to false and therefore no idling is performed.
++ *
++ * As already said, things change with a rotational device, where idling
++ * boosts the throughput with sequential I/O (even with NCQ). Hence, for
++ * such a device the second component of the compound condition evaluates
++ * to true also if the following additional sub-condition does not hold:
++ * the queue is constantly seeky. Unfortunately, this different behavior
++ * with respect to flash-based devices causes an additional asymmetry: if
++ * some sync queues enjoy idling and some other sync queues do not, then
++ * the latter get a low share of the device throughput, simply because the
++ * former get many requests served after being set as in service, whereas
++ * the latter do not. As a consequence, to guarantee the desired throughput
++ * distribution, on HDDs the compound expression evaluates to true (and
++ * hence device idling is performed) also if the following last symmetry
++ * condition does not hold: no other queue is benefiting from idling. Also
++ * this last condition is actually replaced with a simpler-to-maintain and
++ * stronger condition: there is no busy queue which is not constantly seeky
++ * (and hence may also benefit from idling).
++ *
++ * To sum up, when all the required symmetry and throughput-boosting
++ * sub-conditions hold, the second component of the compound condition
++ * evaluates to false, and hence no idling is performed. This helps to
++ * keep the drives' internal queues full on NCQ-capable devices, and hence
++ * to boost the throughput, without causing 'almost' any loss of service
++ * guarantees. The 'almost' follows from the fact that, if the internal
++ * queue of one such device is filled while all the sub-conditions hold,
++ * but at some point in time some sub-condition stops to hold, then it may
++ * become impossible to let requests be served in the new desired order
++ * until all the requests already queued in the device have been served.
++ */
++static inline bool bfq_bfqq_must_not_expire(struct bfq_queue *bfqq)
++{
++ struct bfq_data *bfqd = bfqq->bfqd;
++#ifdef CONFIG_CGROUP_BFQIO
++#define symmetric_scenario (!bfqd->active_numerous_groups && \
++ !bfq_differentiated_weights(bfqd))
++#else
++#define symmetric_scenario (!bfq_differentiated_weights(bfqd))
++#endif
++#define cond_for_seeky_on_ncq_hdd (bfq_bfqq_constantly_seeky(bfqq) && \
++ bfqd->busy_in_flight_queues == \
++ bfqd->const_seeky_busy_in_flight_queues)
++/*
++ * Condition for expiring a non-weight-raised queue (and hence not idling
++ * the device).
++ */
++#define cond_for_expiring_non_wr (bfqd->hw_tag && \
++ (bfqd->wr_busy_queues > 0 || \
++ (symmetric_scenario && \
++ (blk_queue_nonrot(bfqd->queue) || \
++ cond_for_seeky_on_ncq_hdd))))
++
++ return bfq_bfqq_sync(bfqq) &&
++ (bfq_bfqq_IO_bound(bfqq) || bfqq->wr_coeff > 1) &&
++ (bfqq->wr_coeff > 1 ||
++ (bfq_bfqq_idle_window(bfqq) &&
++ !cond_for_expiring_non_wr)
++ );
++}
++
++/*
++ * If the in-service queue is empty but sync, and the function
++ * bfq_bfqq_must_not_expire returns true, then:
++ * 1) the queue must remain in service and cannot be expired, and
++ * 2) the disk must be idled to wait for the possible arrival of a new
++ * request for the queue.
++ * See the comments to the function bfq_bfqq_must_not_expire for the reasons
++ * why performing device idling is the best choice to boost the throughput
++ * and preserve service guarantees when bfq_bfqq_must_not_expire itself
++ * returns true.
++ */
++static inline bool bfq_bfqq_must_idle(struct bfq_queue *bfqq)
++{
++ struct bfq_data *bfqd = bfqq->bfqd;
++
++ return RB_EMPTY_ROOT(&bfqq->sort_list) && bfqd->bfq_slice_idle != 0 &&
++ bfq_bfqq_must_not_expire(bfqq);
++}
++
++/*
++ * Select a queue for service. If we have a current queue in service,
++ * check whether to continue servicing it, or retrieve and set a new one.
++ */
++static struct bfq_queue *bfq_select_queue(struct bfq_data *bfqd)
++{
++ struct bfq_queue *bfqq, *new_bfqq = NULL;
++ struct request *next_rq;
++ enum bfqq_expiration reason = BFQ_BFQQ_BUDGET_TIMEOUT;
++
++ bfqq = bfqd->in_service_queue;
++ if (bfqq == NULL)
++ goto new_queue;
++
++ bfq_log_bfqq(bfqd, bfqq, "select_queue: already in-service queue");
++
++ /*
++ * If another queue has a request waiting within our mean seek
++ * distance, let it run. The expire code will check for close
++ * cooperators and put the close queue at the front of the
++ * service tree. If possible, merge the expiring queue with the
++ * new bfqq.
++ */
++ new_bfqq = bfq_close_cooperator(bfqd, bfqq);
++ if (new_bfqq != NULL && bfqq->new_bfqq == NULL)
++ bfq_setup_merge(bfqq, new_bfqq);
++
++ if (bfq_may_expire_for_budg_timeout(bfqq) &&
++ !timer_pending(&bfqd->idle_slice_timer) &&
++ !bfq_bfqq_must_idle(bfqq))
++ goto expire;
++
++ next_rq = bfqq->next_rq;
++ /*
++ * If bfqq has requests queued and it has enough budget left to
++ * serve them, keep the queue, otherwise expire it.
++ */
++ if (next_rq != NULL) {
++ if (bfq_serv_to_charge(next_rq, bfqq) >
++ bfq_bfqq_budget_left(bfqq)) {
++ reason = BFQ_BFQQ_BUDGET_EXHAUSTED;
++ goto expire;
++ } else {
++ /*
++ * The idle timer may be pending because we may
++ * not disable disk idling even when a new request
++ * arrives.
++ */
++ if (timer_pending(&bfqd->idle_slice_timer)) {
++ /*
++ * If we get here: 1) at least a new request
++ * has arrived but we have not disabled the
++ * timer because the request was too small,
++ * 2) then the block layer has unplugged
++ * the device, causing the dispatch to be
++ * invoked.
++ *
++ * Since the device is unplugged, now the
++ * requests are probably large enough to
++ * provide a reasonable throughput.
++ * So we disable idling.
++ */
++ bfq_clear_bfqq_wait_request(bfqq);
++ del_timer(&bfqd->idle_slice_timer);
++ }
++ if (new_bfqq == NULL)
++ goto keep_queue;
++ else
++ goto expire;
++ }
++ }
++
++ /*
++ * No requests pending. If the in-service queue still has requests
++ * in flight (possibly waiting for a completion) or is idling for a
++ * new request, then keep it.
++ */
++ if (new_bfqq == NULL && (timer_pending(&bfqd->idle_slice_timer) ||
++ (bfqq->dispatched != 0 && bfq_bfqq_must_not_expire(bfqq)))) {
++ bfqq = NULL;
++ goto keep_queue;
++ } else if (new_bfqq != NULL && timer_pending(&bfqd->idle_slice_timer)) {
++ /*
++ * We are expiring the queue because there is a close
++ * cooperator: cancel the idle timer.
++ */
++ bfq_clear_bfqq_wait_request(bfqq);
++ del_timer(&bfqd->idle_slice_timer);
++ }
++
++ reason = BFQ_BFQQ_NO_MORE_REQUESTS;
++expire:
++ bfq_bfqq_expire(bfqd, bfqq, 0, reason);
++new_queue:
++ bfqq = bfq_set_in_service_queue(bfqd, new_bfqq);
++ bfq_log(bfqd, "select_queue: new queue %d returned",
++ bfqq != NULL ? bfqq->pid : 0);
++keep_queue:
++ return bfqq;
++}
++
++static void bfq_update_wr_data(struct bfq_data *bfqd,
++ struct bfq_queue *bfqq)
++{
++ if (bfqq->wr_coeff > 1) { /* queue is being boosted */
++ struct bfq_entity *entity = &bfqq->entity;
++
++ bfq_log_bfqq(bfqd, bfqq,
++ "raising period dur %u/%u msec, old coeff %u, w %d(%d)",
++ jiffies_to_msecs(jiffies -
++ bfqq->last_wr_start_finish),
++ jiffies_to_msecs(bfqq->wr_cur_max_time),
++ bfqq->wr_coeff,
++ bfqq->entity.weight, bfqq->entity.orig_weight);
++
++ BUG_ON(bfqq != bfqd->in_service_queue && entity->weight !=
++ entity->orig_weight * bfqq->wr_coeff);
++ if (entity->ioprio_changed)
++ bfq_log_bfqq(bfqd, bfqq, "WARN: pending prio change");
++ /*
++ * If too much time has elapsed from the beginning
++ * of this weight-raising, stop it.
++ */
++ if (time_is_before_jiffies(bfqq->last_wr_start_finish +
++ bfqq->wr_cur_max_time)) {
++ bfqq->last_wr_start_finish = jiffies;
++ bfq_log_bfqq(bfqd, bfqq,
++ "wrais ending at %lu, rais_max_time %u",
++ bfqq->last_wr_start_finish,
++ jiffies_to_msecs(bfqq->wr_cur_max_time));
++ bfq_bfqq_end_wr(bfqq);
++ __bfq_entity_update_weight_prio(
++ bfq_entity_service_tree(entity),
++ entity);
++ }
++ }
++}
++
++/*
++ * Dispatch one request from bfqq, moving it to the request queue
++ * dispatch list.
++ */
++static int bfq_dispatch_request(struct bfq_data *bfqd,
++ struct bfq_queue *bfqq)
++{
++ int dispatched = 0;
++ struct request *rq;
++ unsigned long service_to_charge;
++
++ BUG_ON(RB_EMPTY_ROOT(&bfqq->sort_list));
++
++ /* Follow expired path, else get first next available. */
++ rq = bfq_check_fifo(bfqq);
++ if (rq == NULL)
++ rq = bfqq->next_rq;
++ service_to_charge = bfq_serv_to_charge(rq, bfqq);
++
++ if (service_to_charge > bfq_bfqq_budget_left(bfqq)) {
++ /*
++ * This may happen if the next rq is chosen in fifo order
++ * instead of sector order. The budget is properly
++ * dimensioned to be always sufficient to serve the next
++ * request only if it is chosen in sector order. The reason
++ * is that it would be quite inefficient, and of little use,
++ * to always make sure that the budget is large enough to
++ * serve even the possible next rq in fifo order.
++ * In fact, requests are seldom served in fifo order.
++ *
++ * Expire the queue for budget exhaustion, and make sure
++ * that the next act_budget is enough to serve the next
++ * request, even if it comes from the fifo expired path.
++ */
++ bfqq->next_rq = rq;
++ /*
++ * Since this dispatch failed, make sure that a new
++ * one will be performed.
++ */
++ if (!bfqd->rq_in_driver)
++ bfq_schedule_dispatch(bfqd);
++ goto expire;
++ }
++
++ /* Finally, insert request into driver dispatch list. */
++ bfq_bfqq_served(bfqq, service_to_charge);
++ bfq_dispatch_insert(bfqd->queue, rq);
++
++ bfq_update_wr_data(bfqd, bfqq);
++
++ bfq_log_bfqq(bfqd, bfqq,
++ "dispatched %u sec req (%llu), budg left %lu",
++ blk_rq_sectors(rq),
++ (unsigned long long)blk_rq_pos(rq),
++ bfq_bfqq_budget_left(bfqq));
++
++ dispatched++;
++
++ if (bfqd->in_service_bic == NULL) {
++ atomic_long_inc(&RQ_BIC(rq)->icq.ioc->refcount);
++ bfqd->in_service_bic = RQ_BIC(rq);
++ }
++
++ if (bfqd->busy_queues > 1 && ((!bfq_bfqq_sync(bfqq) &&
++ dispatched >= bfqd->bfq_max_budget_async_rq) ||
++ bfq_class_idle(bfqq)))
++ goto expire;
++
++ return dispatched;
++
++expire:
++ bfq_bfqq_expire(bfqd, bfqq, 0, BFQ_BFQQ_BUDGET_EXHAUSTED);
++ return dispatched;
++}
++
++static int __bfq_forced_dispatch_bfqq(struct bfq_queue *bfqq)
++{
++ int dispatched = 0;
++
++ while (bfqq->next_rq != NULL) {
++ bfq_dispatch_insert(bfqq->bfqd->queue, bfqq->next_rq);
++ dispatched++;
++ }
++
++ BUG_ON(!list_empty(&bfqq->fifo));
++ return dispatched;
++}
++
++/*
++ * Drain our current requests.
++ * Used for barriers and when switching io schedulers on-the-fly.
++ */
++static int bfq_forced_dispatch(struct bfq_data *bfqd)
++{
++ struct bfq_queue *bfqq, *n;
++ struct bfq_service_tree *st;
++ int dispatched = 0;
++
++ bfqq = bfqd->in_service_queue;
++ if (bfqq != NULL)
++ __bfq_bfqq_expire(bfqd, bfqq);
++
++ /*
++ * Loop through classes, and be careful to leave the scheduler
++ * in a consistent state, as feedback mechanisms and vtime
++ * updates cannot be disabled during the process.
++ */
++ list_for_each_entry_safe(bfqq, n, &bfqd->active_list, bfqq_list) {
++ st = bfq_entity_service_tree(&bfqq->entity);
++
++ dispatched += __bfq_forced_dispatch_bfqq(bfqq);
++ bfqq->max_budget = bfq_max_budget(bfqd);
++
++ bfq_forget_idle(st);
++ }
++
++ BUG_ON(bfqd->busy_queues != 0);
++
++ return dispatched;
++}
++
++static int bfq_dispatch_requests(struct request_queue *q, int force)
++{
++ struct bfq_data *bfqd = q->elevator->elevator_data;
++ struct bfq_queue *bfqq;
++ int max_dispatch;
++
++ bfq_log(bfqd, "dispatch requests: %d busy queues", bfqd->busy_queues);
++ if (bfqd->busy_queues == 0)
++ return 0;
++
++ if (unlikely(force))
++ return bfq_forced_dispatch(bfqd);
++
++ bfqq = bfq_select_queue(bfqd);
++ if (bfqq == NULL)
++ return 0;
++
++ max_dispatch = bfqd->bfq_quantum;
++ if (bfq_class_idle(bfqq))
++ max_dispatch = 1;
++
++ if (!bfq_bfqq_sync(bfqq))
++ max_dispatch = bfqd->bfq_max_budget_async_rq;
++
++ if (bfqq->dispatched >= max_dispatch) {
++ if (bfqd->busy_queues > 1)
++ return 0;
++ if (bfqq->dispatched >= 4 * max_dispatch)
++ return 0;
++ }
++
++ if (bfqd->sync_flight != 0 && !bfq_bfqq_sync(bfqq))
++ return 0;
++
++ bfq_clear_bfqq_wait_request(bfqq);
++ BUG_ON(timer_pending(&bfqd->idle_slice_timer));
++
++ if (!bfq_dispatch_request(bfqd, bfqq))
++ return 0;
++
++ bfq_log_bfqq(bfqd, bfqq, "dispatched one request of %d (max_disp %d)",
++ bfqq->pid, max_dispatch);
++
++ return 1;
++}
++
++/*
++ * Task holds one reference to the queue, dropped when task exits. Each rq
++ * in-flight on this queue also holds a reference, dropped when rq is freed.
++ *
++ * Queue lock must be held here.
++ */
++static void bfq_put_queue(struct bfq_queue *bfqq)
++{
++ struct bfq_data *bfqd = bfqq->bfqd;
++
++ BUG_ON(atomic_read(&bfqq->ref) <= 0);
++
++ bfq_log_bfqq(bfqd, bfqq, "put_queue: %p %d", bfqq,
++ atomic_read(&bfqq->ref));
++ if (!atomic_dec_and_test(&bfqq->ref))
++ return;
++
++ BUG_ON(rb_first(&bfqq->sort_list) != NULL);
++ BUG_ON(bfqq->allocated[READ] + bfqq->allocated[WRITE] != 0);
++ BUG_ON(bfqq->entity.tree != NULL);
++ BUG_ON(bfq_bfqq_busy(bfqq));
++ BUG_ON(bfqd->in_service_queue == bfqq);
++
++ bfq_log_bfqq(bfqd, bfqq, "put_queue: %p freed", bfqq);
++
++ kmem_cache_free(bfq_pool, bfqq);
++}
++
++static void bfq_put_cooperator(struct bfq_queue *bfqq)
++{
++ struct bfq_queue *__bfqq, *next;
++
++ /*
++ * If this queue was scheduled to merge with another queue, be
++ * sure to drop the reference taken on that queue (and others in
++ * the merge chain). See bfq_setup_merge and bfq_merge_bfqqs.
++ */
++ __bfqq = bfqq->new_bfqq;
++ while (__bfqq) {
++ if (__bfqq == bfqq)
++ break;
++ next = __bfqq->new_bfqq;
++ bfq_put_queue(__bfqq);
++ __bfqq = next;
++ }
++}
++
++static void bfq_exit_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq)
++{
++ if (bfqq == bfqd->in_service_queue) {
++ __bfq_bfqq_expire(bfqd, bfqq);
++ bfq_schedule_dispatch(bfqd);
++ }
++
++ bfq_log_bfqq(bfqd, bfqq, "exit_bfqq: %p, %d", bfqq,
++ atomic_read(&bfqq->ref));
++
++ bfq_put_cooperator(bfqq);
++
++ bfq_put_queue(bfqq);
++}
++
++static inline void bfq_init_icq(struct io_cq *icq)
++{
++ struct bfq_io_cq *bic = icq_to_bic(icq);
++
++ bic->ttime.last_end_request = jiffies;
++}
++
++static void bfq_exit_icq(struct io_cq *icq)
++{
++ struct bfq_io_cq *bic = icq_to_bic(icq);
++ struct bfq_data *bfqd = bic_to_bfqd(bic);
++
++ if (bic->bfqq[BLK_RW_ASYNC]) {
++ bfq_exit_bfqq(bfqd, bic->bfqq[BLK_RW_ASYNC]);
++ bic->bfqq[BLK_RW_ASYNC] = NULL;
++ }
++
++ if (bic->bfqq[BLK_RW_SYNC]) {
++ bfq_exit_bfqq(bfqd, bic->bfqq[BLK_RW_SYNC]);
++ bic->bfqq[BLK_RW_SYNC] = NULL;
++ }
++}
++
++/*
++ * Update the entity prio values; note that the new values will not
++ * be used until the next (re)activation.
++ */
++static void bfq_init_prio_data(struct bfq_queue *bfqq, struct bfq_io_cq *bic)
++{
++ struct task_struct *tsk = current;
++ int ioprio_class;
++
++ if (!bfq_bfqq_prio_changed(bfqq))
++ return;
++
++ ioprio_class = IOPRIO_PRIO_CLASS(bic->ioprio);
++ switch (ioprio_class) {
++ default:
++ dev_err(bfqq->bfqd->queue->backing_dev_info.dev,
++ "bfq: bad prio %x\n", ioprio_class);
++ case IOPRIO_CLASS_NONE:
++ /*
++ * No prio set, inherit CPU scheduling settings.
++ */
++ bfqq->entity.new_ioprio = task_nice_ioprio(tsk);
++ bfqq->entity.new_ioprio_class = task_nice_ioclass(tsk);
++ break;
++ case IOPRIO_CLASS_RT:
++ bfqq->entity.new_ioprio = IOPRIO_PRIO_DATA(bic->ioprio);
++ bfqq->entity.new_ioprio_class = IOPRIO_CLASS_RT;
++ break;
++ case IOPRIO_CLASS_BE:
++ bfqq->entity.new_ioprio = IOPRIO_PRIO_DATA(bic->ioprio);
++ bfqq->entity.new_ioprio_class = IOPRIO_CLASS_BE;
++ break;
++ case IOPRIO_CLASS_IDLE:
++ bfqq->entity.new_ioprio_class = IOPRIO_CLASS_IDLE;
++ bfqq->entity.new_ioprio = 7;
++ bfq_clear_bfqq_idle_window(bfqq);
++ break;
++ }
++
++ bfqq->entity.ioprio_changed = 1;
++
++ bfq_clear_bfqq_prio_changed(bfqq);
++}
++
++static void bfq_changed_ioprio(struct bfq_io_cq *bic)
++{
++ struct bfq_data *bfqd;
++ struct bfq_queue *bfqq, *new_bfqq;
++ struct bfq_group *bfqg;
++ unsigned long uninitialized_var(flags);
++ int ioprio = bic->icq.ioc->ioprio;
++
++ bfqd = bfq_get_bfqd_locked(&(bic->icq.q->elevator->elevator_data),
++ &flags);
++ /*
++ * This condition may trigger on a newly created bic, be sure to
++ * drop the lock before returning.
++ */
++ if (unlikely(bfqd == NULL) || likely(bic->ioprio == ioprio))
++ goto out;
++
++ bfqq = bic->bfqq[BLK_RW_ASYNC];
++ if (bfqq != NULL) {
++ bfqg = container_of(bfqq->entity.sched_data, struct bfq_group,
++ sched_data);
++ new_bfqq = bfq_get_queue(bfqd, bfqg, BLK_RW_ASYNC, bic,
++ GFP_ATOMIC);
++ if (new_bfqq != NULL) {
++ bic->bfqq[BLK_RW_ASYNC] = new_bfqq;
++ bfq_log_bfqq(bfqd, bfqq,
++ "changed_ioprio: bfqq %p %d",
++ bfqq, atomic_read(&bfqq->ref));
++ bfq_put_queue(bfqq);
++ }
++ }
++
++ bfqq = bic->bfqq[BLK_RW_SYNC];
++ if (bfqq != NULL)
++ bfq_mark_bfqq_prio_changed(bfqq);
++
++ bic->ioprio = ioprio;
++
++out:
++ bfq_put_bfqd_unlock(bfqd, &flags);
++}
++
++static void bfq_init_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq,
++ pid_t pid, int is_sync)
++{
++ RB_CLEAR_NODE(&bfqq->entity.rb_node);
++ INIT_LIST_HEAD(&bfqq->fifo);
++
++ atomic_set(&bfqq->ref, 0);
++ bfqq->bfqd = bfqd;
++
++ bfq_mark_bfqq_prio_changed(bfqq);
++
++ if (is_sync) {
++ if (!bfq_class_idle(bfqq))
++ bfq_mark_bfqq_idle_window(bfqq);
++ bfq_mark_bfqq_sync(bfqq);
++ }
++ bfq_mark_bfqq_IO_bound(bfqq);
++
++ /* Tentative initial value to trade off between thr and lat */
++ bfqq->max_budget = (2 * bfq_max_budget(bfqd)) / 3;
++ bfqq->pid = pid;
++
++ bfqq->wr_coeff = 1;
++ bfqq->last_wr_start_finish = 0;
++ /*
++ * Set to the value for which bfqq will not be deemed as
++ * soft rt when it becomes backlogged.
++ */
++ bfqq->soft_rt_next_start = bfq_infinity_from_now(jiffies);
++}
++
++static struct bfq_queue *bfq_find_alloc_queue(struct bfq_data *bfqd,
++ struct bfq_group *bfqg,
++ int is_sync,
++ struct bfq_io_cq *bic,
++ gfp_t gfp_mask)
++{
++ struct bfq_queue *bfqq, *new_bfqq = NULL;
++
++retry:
++ /* bic always exists here */
++ bfqq = bic_to_bfqq(bic, is_sync);
++
++ /*
++ * Always try a new alloc if we fall back to the OOM bfqq
++ * originally, since it should just be a temporary situation.
++ */
++ if (bfqq == NULL || bfqq == &bfqd->oom_bfqq) {
++ bfqq = NULL;
++ if (new_bfqq != NULL) {
++ bfqq = new_bfqq;
++ new_bfqq = NULL;
++ } else if (gfp_mask & __GFP_WAIT) {
++ spin_unlock_irq(bfqd->queue->queue_lock);
++ new_bfqq = kmem_cache_alloc_node(bfq_pool,
++ gfp_mask | __GFP_ZERO,
++ bfqd->queue->node);
++ spin_lock_irq(bfqd->queue->queue_lock);
++ if (new_bfqq != NULL)
++ goto retry;
++ } else {
++ bfqq = kmem_cache_alloc_node(bfq_pool,
++ gfp_mask | __GFP_ZERO,
++ bfqd->queue->node);
++ }
++
++ if (bfqq != NULL) {
++ bfq_init_bfqq(bfqd, bfqq, current->pid, is_sync);
++ bfq_log_bfqq(bfqd, bfqq, "allocated");
++ } else {
++ bfqq = &bfqd->oom_bfqq;
++ bfq_log_bfqq(bfqd, bfqq, "using oom bfqq");
++ }
++
++ bfq_init_prio_data(bfqq, bic);
++ bfq_init_entity(&bfqq->entity, bfqg);
++ }
++
++ if (new_bfqq != NULL)
++ kmem_cache_free(bfq_pool, new_bfqq);
++
++ return bfqq;
++}
++
++static struct bfq_queue **bfq_async_queue_prio(struct bfq_data *bfqd,
++ struct bfq_group *bfqg,
++ int ioprio_class, int ioprio)
++{
++ switch (ioprio_class) {
++ case IOPRIO_CLASS_RT:
++ return &bfqg->async_bfqq[0][ioprio];
++ case IOPRIO_CLASS_NONE:
++ ioprio = IOPRIO_NORM;
++ /* fall through */
++ case IOPRIO_CLASS_BE:
++ return &bfqg->async_bfqq[1][ioprio];
++ case IOPRIO_CLASS_IDLE:
++ return &bfqg->async_idle_bfqq;
++ default:
++ BUG();
++ }
++}
++
++static struct bfq_queue *bfq_get_queue(struct bfq_data *bfqd,
++ struct bfq_group *bfqg, int is_sync,
++ struct bfq_io_cq *bic, gfp_t gfp_mask)
++{
++ const int ioprio = IOPRIO_PRIO_DATA(bic->ioprio);
++ const int ioprio_class = IOPRIO_PRIO_CLASS(bic->ioprio);
++ struct bfq_queue **async_bfqq = NULL;
++ struct bfq_queue *bfqq = NULL;
++
++ if (!is_sync) {
++ async_bfqq = bfq_async_queue_prio(bfqd, bfqg, ioprio_class,
++ ioprio);
++ bfqq = *async_bfqq;
++ }
++
++ if (bfqq == NULL)
++ bfqq = bfq_find_alloc_queue(bfqd, bfqg, is_sync, bic, gfp_mask);
++
++ /*
++ * Pin the queue now that it's allocated, scheduler exit will
++ * prune it.
++ */
++ if (!is_sync && *async_bfqq == NULL) {
++ atomic_inc(&bfqq->ref);
++ bfq_log_bfqq(bfqd, bfqq, "get_queue, bfqq not in async: %p, %d",
++ bfqq, atomic_read(&bfqq->ref));
++ *async_bfqq = bfqq;
++ }
++
++ atomic_inc(&bfqq->ref);
++ bfq_log_bfqq(bfqd, bfqq, "get_queue, at end: %p, %d", bfqq,
++ atomic_read(&bfqq->ref));
++ return bfqq;
++}
++
++static void bfq_update_io_thinktime(struct bfq_data *bfqd,
++ struct bfq_io_cq *bic)
++{
++ unsigned long elapsed = jiffies - bic->ttime.last_end_request;
++ unsigned long ttime = min(elapsed, 2UL * bfqd->bfq_slice_idle);
++
++ bic->ttime.ttime_samples = (7*bic->ttime.ttime_samples + 256) / 8;
++ bic->ttime.ttime_total = (7*bic->ttime.ttime_total + 256*ttime) / 8;
++ bic->ttime.ttime_mean = (bic->ttime.ttime_total + 128) /
++ bic->ttime.ttime_samples;
++}
++
++static void bfq_update_io_seektime(struct bfq_data *bfqd,
++ struct bfq_queue *bfqq,
++ struct request *rq)
++{
++ sector_t sdist;
++ u64 total;
++
++ if (bfqq->last_request_pos < blk_rq_pos(rq))
++ sdist = blk_rq_pos(rq) - bfqq->last_request_pos;
++ else
++ sdist = bfqq->last_request_pos - blk_rq_pos(rq);
++
++ /*
++ * Don't allow the seek distance to get too large from the
++ * odd fragment, pagein, etc.
++ */
++ if (bfqq->seek_samples == 0) /* first request, not really a seek */
++ sdist = 0;
++ else if (bfqq->seek_samples <= 60) /* second & third seek */
++ sdist = min(sdist, (bfqq->seek_mean * 4) + 2*1024*1024);
++ else
++ sdist = min(sdist, (bfqq->seek_mean * 4) + 2*1024*64);
++
++ bfqq->seek_samples = (7*bfqq->seek_samples + 256) / 8;
++ bfqq->seek_total = (7*bfqq->seek_total + (u64)256*sdist) / 8;
++ total = bfqq->seek_total + (bfqq->seek_samples/2);
++ do_div(total, bfqq->seek_samples);
++ bfqq->seek_mean = (sector_t)total;
++
++ bfq_log_bfqq(bfqd, bfqq, "dist=%llu mean=%llu", (u64)sdist,
++ (u64)bfqq->seek_mean);
++}
++
++/*
++ * Disable idle window if the process thinks too long or seeks so much that
++ * it doesn't matter.
++ */
++static void bfq_update_idle_window(struct bfq_data *bfqd,
++ struct bfq_queue *bfqq,
++ struct bfq_io_cq *bic)
++{
++ int enable_idle;
++
++ /* Don't idle for async or idle io prio class. */
++ if (!bfq_bfqq_sync(bfqq) || bfq_class_idle(bfqq))
++ return;
++
++ enable_idle = bfq_bfqq_idle_window(bfqq);
++
++ if (atomic_read(&bic->icq.ioc->active_ref) == 0 ||
++ bfqd->bfq_slice_idle == 0 ||
++ (bfqd->hw_tag && BFQQ_SEEKY(bfqq) &&
++ bfqq->wr_coeff == 1))
++ enable_idle = 0;
++ else if (bfq_sample_valid(bic->ttime.ttime_samples)) {
++ if (bic->ttime.ttime_mean > bfqd->bfq_slice_idle &&
++ bfqq->wr_coeff == 1)
++ enable_idle = 0;
++ else
++ enable_idle = 1;
++ }
++ bfq_log_bfqq(bfqd, bfqq, "update_idle_window: enable_idle %d",
++ enable_idle);
++
++ if (enable_idle)
++ bfq_mark_bfqq_idle_window(bfqq);
++ else
++ bfq_clear_bfqq_idle_window(bfqq);
++}
++
++/*
++ * Called when a new fs request (rq) is added to bfqq. Check if there's
++ * something we should do about it.
++ */
++static void bfq_rq_enqueued(struct bfq_data *bfqd, struct bfq_queue *bfqq,
++ struct request *rq)
++{
++ struct bfq_io_cq *bic = RQ_BIC(rq);
++
++ if (rq->cmd_flags & REQ_META)
++ bfqq->meta_pending++;
++
++ bfq_update_io_thinktime(bfqd, bic);
++ bfq_update_io_seektime(bfqd, bfqq, rq);
++ if (!BFQQ_SEEKY(bfqq) && bfq_bfqq_constantly_seeky(bfqq)) {
++ bfq_clear_bfqq_constantly_seeky(bfqq);
++ if (!blk_queue_nonrot(bfqd->queue)) {
++ BUG_ON(!bfqd->const_seeky_busy_in_flight_queues);
++ bfqd->const_seeky_busy_in_flight_queues--;
++ }
++ }
++ if (bfqq->entity.service > bfq_max_budget(bfqd) / 8 ||
++ !BFQQ_SEEKY(bfqq))
++ bfq_update_idle_window(bfqd, bfqq, bic);
++
++ bfq_log_bfqq(bfqd, bfqq,
++ "rq_enqueued: idle_window=%d (seeky %d, mean %llu)",
++ bfq_bfqq_idle_window(bfqq), BFQQ_SEEKY(bfqq),
++ (long long unsigned)bfqq->seek_mean);
++
++ bfqq->last_request_pos = blk_rq_pos(rq) + blk_rq_sectors(rq);
++
++ if (bfqq == bfqd->in_service_queue && bfq_bfqq_wait_request(bfqq)) {
++ int small_req = bfqq->queued[rq_is_sync(rq)] == 1 &&
++ blk_rq_sectors(rq) < 32;
++ int budget_timeout = bfq_bfqq_budget_timeout(bfqq);
++
++ /*
++ * There is just this request queued: if the request
++ * is small and the queue is not to be expired, then
++ * just exit.
++ *
++ * In this way, if the disk is being idled to wait for
++ * a new request from the in-service queue, we avoid
++ * unplugging the device and committing the disk to serve
++ * just a small request. On the contrary, we wait for
++ * the block layer to decide when to unplug the device:
++ * hopefully, new requests will be merged to this one
++ * quickly, then the device will be unplugged and
++ * larger requests will be dispatched.
++ */
++ if (small_req && !budget_timeout)
++ return;
++
++ /*
++ * A large enough request arrived, or the queue is to
++ * be expired: in both cases disk idling is to be
++ * stopped, so clear wait_request flag and reset
++ * timer.
++ */
++ bfq_clear_bfqq_wait_request(bfqq);
++ del_timer(&bfqd->idle_slice_timer);
++
++ /*
++ * The queue is not empty, because a new request just
++ * arrived. Hence we can safely expire the queue, in
++ * case of budget timeout, without risking that the
++ * timestamps of the queue are not updated correctly.
++ * See [1] for more details.
++ */
++ if (budget_timeout)
++ bfq_bfqq_expire(bfqd, bfqq, 0, BFQ_BFQQ_BUDGET_TIMEOUT);
++
++ /*
++ * Let the request rip immediately, or let a new queue be
++ * selected if bfqq has just been expired.
++ */
++ __blk_run_queue(bfqd->queue);
++ }
++}
++
++static void bfq_insert_request(struct request_queue *q, struct request *rq)
++{
++ struct bfq_data *bfqd = q->elevator->elevator_data;
++ struct bfq_queue *bfqq = RQ_BFQQ(rq);
++
++ assert_spin_locked(bfqd->queue->queue_lock);
++ bfq_init_prio_data(bfqq, RQ_BIC(rq));
++
++ bfq_add_request(rq);
++
++ rq->fifo_time = jiffies + bfqd->bfq_fifo_expire[rq_is_sync(rq)];
++ list_add_tail(&rq->queuelist, &bfqq->fifo);
++
++ bfq_rq_enqueued(bfqd, bfqq, rq);
++}
++
++static void bfq_update_hw_tag(struct bfq_data *bfqd)
++{
++ bfqd->max_rq_in_driver = max(bfqd->max_rq_in_driver,
++ bfqd->rq_in_driver);
++
++ if (bfqd->hw_tag == 1)
++ return;
++
++ /*
++ * This sample is valid if the number of outstanding requests
++ * is large enough to allow a queueing behavior. Note that the
++ * sum is not exact, as it's not taking into account deactivated
++ * requests.
++ */
++ if (bfqd->rq_in_driver + bfqd->queued < BFQ_HW_QUEUE_THRESHOLD)
++ return;
++
++ if (bfqd->hw_tag_samples++ < BFQ_HW_QUEUE_SAMPLES)
++ return;
++
++ bfqd->hw_tag = bfqd->max_rq_in_driver > BFQ_HW_QUEUE_THRESHOLD;
++ bfqd->max_rq_in_driver = 0;
++ bfqd->hw_tag_samples = 0;
++}
++
++static void bfq_completed_request(struct request_queue *q, struct request *rq)
++{
++ struct bfq_queue *bfqq = RQ_BFQQ(rq);
++ struct bfq_data *bfqd = bfqq->bfqd;
++ bool sync = bfq_bfqq_sync(bfqq);
++
++ bfq_log_bfqq(bfqd, bfqq, "completed one req with %u sects left (%d)",
++ blk_rq_sectors(rq), sync);
++
++ bfq_update_hw_tag(bfqd);
++
++ BUG_ON(!bfqd->rq_in_driver);
++ BUG_ON(!bfqq->dispatched);
++ bfqd->rq_in_driver--;
++ bfqq->dispatched--;
++
++ if (!bfqq->dispatched && !bfq_bfqq_busy(bfqq)) {
++ bfq_weights_tree_remove(bfqd, &bfqq->entity,
++ &bfqd->queue_weights_tree);
++ if (!blk_queue_nonrot(bfqd->queue)) {
++ BUG_ON(!bfqd->busy_in_flight_queues);
++ bfqd->busy_in_flight_queues--;
++ if (bfq_bfqq_constantly_seeky(bfqq)) {
++ BUG_ON(!bfqd->
++ const_seeky_busy_in_flight_queues);
++ bfqd->const_seeky_busy_in_flight_queues--;
++ }
++ }
++ }
++
++ if (sync) {
++ bfqd->sync_flight--;
++ RQ_BIC(rq)->ttime.last_end_request = jiffies;
++ }
++
++ /*
++ * If we are waiting to discover whether the request pattern of the
++ * task associated with the queue is actually isochronous, and
++ * both requisites for this condition to hold are satisfied, then
++ * compute soft_rt_next_start (see the comments to the function
++ * bfq_bfqq_softrt_next_start()).
++ */
++ if (bfq_bfqq_softrt_update(bfqq) && bfqq->dispatched == 0 &&
++ RB_EMPTY_ROOT(&bfqq->sort_list))
++ bfqq->soft_rt_next_start =
++ bfq_bfqq_softrt_next_start(bfqd, bfqq);
++
++ /*
++ * If this is the in-service queue, check if it needs to be expired,
++ * or if we want to idle in case it has no pending requests.
++ */
++ if (bfqd->in_service_queue == bfqq) {
++ if (bfq_bfqq_budget_new(bfqq))
++ bfq_set_budget_timeout(bfqd);
++
++ if (bfq_bfqq_must_idle(bfqq)) {
++ bfq_arm_slice_timer(bfqd);
++ goto out;
++ } else if (bfq_may_expire_for_budg_timeout(bfqq))
++ bfq_bfqq_expire(bfqd, bfqq, 0, BFQ_BFQQ_BUDGET_TIMEOUT);
++ else if (RB_EMPTY_ROOT(&bfqq->sort_list) &&
++ (bfqq->dispatched == 0 ||
++ !bfq_bfqq_must_not_expire(bfqq)))
++ bfq_bfqq_expire(bfqd, bfqq, 0,
++ BFQ_BFQQ_NO_MORE_REQUESTS);
++ }
++
++ if (!bfqd->rq_in_driver)
++ bfq_schedule_dispatch(bfqd);
++
++out:
++ return;
++}
++
++static inline int __bfq_may_queue(struct bfq_queue *bfqq)
++{
++ if (bfq_bfqq_wait_request(bfqq) && bfq_bfqq_must_alloc(bfqq)) {
++ bfq_clear_bfqq_must_alloc(bfqq);
++ return ELV_MQUEUE_MUST;
++ }
++
++ return ELV_MQUEUE_MAY;
++}
++
++static int bfq_may_queue(struct request_queue *q, int rw)
++{
++ struct bfq_data *bfqd = q->elevator->elevator_data;
++ struct task_struct *tsk = current;
++ struct bfq_io_cq *bic;
++ struct bfq_queue *bfqq;
++
++ /*
++ * Don't force setup of a queue from here, as a call to may_queue
++ * does not necessarily imply that a request actually will be
++ * queued. So just lookup a possibly existing queue, or return
++ * 'may queue' if that fails.
++ */
++ bic = bfq_bic_lookup(bfqd, tsk->io_context);
++ if (bic == NULL)
++ return ELV_MQUEUE_MAY;
++
++ bfqq = bic_to_bfqq(bic, rw_is_sync(rw));
++ if (bfqq != NULL) {
++ bfq_init_prio_data(bfqq, bic);
++
++ return __bfq_may_queue(bfqq);
++ }
++
++ return ELV_MQUEUE_MAY;
++}
++
++/*
++ * Queue lock held here.
++ */
++static void bfq_put_request(struct request *rq)
++{
++ struct bfq_queue *bfqq = RQ_BFQQ(rq);
++
++ if (bfqq != NULL) {
++ const int rw = rq_data_dir(rq);
++
++ BUG_ON(!bfqq->allocated[rw]);
++ bfqq->allocated[rw]--;
++
++ rq->elv.priv[0] = NULL;
++ rq->elv.priv[1] = NULL;
++
++ bfq_log_bfqq(bfqq->bfqd, bfqq, "put_request %p, %d",
++ bfqq, atomic_read(&bfqq->ref));
++ bfq_put_queue(bfqq);
++ }
++}
++
++static struct bfq_queue *
++bfq_merge_bfqqs(struct bfq_data *bfqd, struct bfq_io_cq *bic,
++ struct bfq_queue *bfqq)
++{
++ bfq_log_bfqq(bfqd, bfqq, "merging with queue %lu",
++ (long unsigned)bfqq->new_bfqq->pid);
++ bic_set_bfqq(bic, bfqq->new_bfqq, 1);
++ bfq_mark_bfqq_coop(bfqq->new_bfqq);
++ bfq_put_queue(bfqq);
++ return bic_to_bfqq(bic, 1);
++}
++
++/*
++ * Returns NULL if a new bfqq should be allocated, or the old bfqq if this
++ * was the last process referring to said bfqq.
++ */
++static struct bfq_queue *
++bfq_split_bfqq(struct bfq_io_cq *bic, struct bfq_queue *bfqq)
++{
++ bfq_log_bfqq(bfqq->bfqd, bfqq, "splitting queue");
++ if (bfqq_process_refs(bfqq) == 1) {
++ bfqq->pid = current->pid;
++ bfq_clear_bfqq_coop(bfqq);
++ bfq_clear_bfqq_split_coop(bfqq);
++ return bfqq;
++ }
++
++ bic_set_bfqq(bic, NULL, 1);
++
++ bfq_put_cooperator(bfqq);
++
++ bfq_put_queue(bfqq);
++ return NULL;
++}
++
++/*
++ * Allocate bfq data structures associated with this request.
++ */
++static int bfq_set_request(struct request_queue *q, struct request *rq,
++ struct bio *bio, gfp_t gfp_mask)
++{
++ struct bfq_data *bfqd = q->elevator->elevator_data;
++ struct bfq_io_cq *bic = icq_to_bic(rq->elv.icq);
++ const int rw = rq_data_dir(rq);
++ const int is_sync = rq_is_sync(rq);
++ struct bfq_queue *bfqq;
++ struct bfq_group *bfqg;
++ unsigned long flags;
++
++ might_sleep_if(gfp_mask & __GFP_WAIT);
++
++ bfq_changed_ioprio(bic);
++
++ spin_lock_irqsave(q->queue_lock, flags);
++
++ if (bic == NULL)
++ goto queue_fail;
++
++ bfqg = bfq_bic_update_cgroup(bic);
++
++new_queue:
++ bfqq = bic_to_bfqq(bic, is_sync);
++ if (bfqq == NULL || bfqq == &bfqd->oom_bfqq) {
++ bfqq = bfq_get_queue(bfqd, bfqg, is_sync, bic, gfp_mask);
++ bic_set_bfqq(bic, bfqq, is_sync);
++ } else {
++ /*
++ * If the queue was seeky for too long, break it apart.
++ */
++ if (bfq_bfqq_coop(bfqq) && bfq_bfqq_split_coop(bfqq)) {
++ bfq_log_bfqq(bfqd, bfqq, "breaking apart bfqq");
++ bfqq = bfq_split_bfqq(bic, bfqq);
++ if (!bfqq)
++ goto new_queue;
++ }
++
++ /*
++ * Check to see if this queue is scheduled to merge with
++ * another closely cooperating queue. The merging of queues
++ * happens here as it must be done in process context.
++ * The reference on new_bfqq was taken in merge_bfqqs.
++ */
++ if (bfqq->new_bfqq != NULL)
++ bfqq = bfq_merge_bfqqs(bfqd, bic, bfqq);
++ }
++
++ bfqq->allocated[rw]++;
++ atomic_inc(&bfqq->ref);
++ bfq_log_bfqq(bfqd, bfqq, "set_request: bfqq %p, %d", bfqq,
++ atomic_read(&bfqq->ref));
++
++ rq->elv.priv[0] = bic;
++ rq->elv.priv[1] = bfqq;
++
++ spin_unlock_irqrestore(q->queue_lock, flags);
++
++ return 0;
++
++queue_fail:
++ bfq_schedule_dispatch(bfqd);
++ spin_unlock_irqrestore(q->queue_lock, flags);
++
++ return 1;
++}
++
++static void bfq_kick_queue(struct work_struct *work)
++{
++ struct bfq_data *bfqd =
++ container_of(work, struct bfq_data, unplug_work);
++ struct request_queue *q = bfqd->queue;
++
++ spin_lock_irq(q->queue_lock);
++ __blk_run_queue(q);
++ spin_unlock_irq(q->queue_lock);
++}
++
++/*
++ * Handler of the expiration of the timer running if the in-service queue
++ * is idling inside its time slice.
++ */
++static void bfq_idle_slice_timer(unsigned long data)
++{
++ struct bfq_data *bfqd = (struct bfq_data *)data;
++ struct bfq_queue *bfqq;
++ unsigned long flags;
++ enum bfqq_expiration reason;
++
++ spin_lock_irqsave(bfqd->queue->queue_lock, flags);
++
++ bfqq = bfqd->in_service_queue;
++ /*
++ * Theoretical race here: the in-service queue can be NULL or
++ * different from the queue that was idling if the timer handler
++ * spins on the queue_lock and a new request arrives for the
++ * current queue and there is a full dispatch cycle that changes
++ * the in-service queue. This can hardly happen, but in the worst
++ * case we just expire a queue too early.
++ */
++ if (bfqq != NULL) {
++ bfq_log_bfqq(bfqd, bfqq, "slice_timer expired");
++ if (bfq_bfqq_budget_timeout(bfqq))
++ /*
++ * Also here the queue can be safely expired
++ * for budget timeout without wasting
++ * guarantees
++ */
++ reason = BFQ_BFQQ_BUDGET_TIMEOUT;
++ else if (bfqq->queued[0] == 0 && bfqq->queued[1] == 0)
++ /*
++ * The queue may not be empty upon timer expiration,
++ * because we may not disable the timer when the
++ * first request of the in-service queue arrives
++ * during disk idling.
++ */
++ reason = BFQ_BFQQ_TOO_IDLE;
++ else
++ goto schedule_dispatch;
++
++ bfq_bfqq_expire(bfqd, bfqq, 1, reason);
++ }
++
++schedule_dispatch:
++ bfq_schedule_dispatch(bfqd);
++
++ spin_unlock_irqrestore(bfqd->queue->queue_lock, flags);
++}
++
++static void bfq_shutdown_timer_wq(struct bfq_data *bfqd)
++{
++ del_timer_sync(&bfqd->idle_slice_timer);
++ cancel_work_sync(&bfqd->unplug_work);
++}
++
++static inline void __bfq_put_async_bfqq(struct bfq_data *bfqd,
++ struct bfq_queue **bfqq_ptr)
++{
++ struct bfq_group *root_group = bfqd->root_group;
++ struct bfq_queue *bfqq = *bfqq_ptr;
++
++ bfq_log(bfqd, "put_async_bfqq: %p", bfqq);
++ if (bfqq != NULL) {
++ bfq_bfqq_move(bfqd, bfqq, &bfqq->entity, root_group);
++ bfq_log_bfqq(bfqd, bfqq, "put_async_bfqq: putting %p, %d",
++ bfqq, atomic_read(&bfqq->ref));
++ bfq_put_queue(bfqq);
++ *bfqq_ptr = NULL;
++ }
++}
++
++/*
++ * Release all the bfqg references to its async queues. If we are
++ * deallocating the group these queues may still contain requests, so
++ * we reparent them to the root cgroup (i.e., the only one that will
++ * exist for sure until all the requests on a device are gone).
++ */
++static void bfq_put_async_queues(struct bfq_data *bfqd, struct bfq_group *bfqg)
++{
++ int i, j;
++
++ for (i = 0; i < 2; i++)
++ for (j = 0; j < IOPRIO_BE_NR; j++)
++ __bfq_put_async_bfqq(bfqd, &bfqg->async_bfqq[i][j]);
++
++ __bfq_put_async_bfqq(bfqd, &bfqg->async_idle_bfqq);
++}
++
++static void bfq_exit_queue(struct elevator_queue *e)
++{
++ struct bfq_data *bfqd = e->elevator_data;
++ struct request_queue *q = bfqd->queue;
++ struct bfq_queue *bfqq, *n;
++
++ bfq_shutdown_timer_wq(bfqd);
++
++ spin_lock_irq(q->queue_lock);
++
++ BUG_ON(bfqd->in_service_queue != NULL);
++ list_for_each_entry_safe(bfqq, n, &bfqd->idle_list, bfqq_list)
++ bfq_deactivate_bfqq(bfqd, bfqq, 0);
++
++ bfq_disconnect_groups(bfqd);
++ spin_unlock_irq(q->queue_lock);
++
++ bfq_shutdown_timer_wq(bfqd);
++
++ synchronize_rcu();
++
++ BUG_ON(timer_pending(&bfqd->idle_slice_timer));
++
++ bfq_free_root_group(bfqd);
++ kfree(bfqd);
++}
++
++static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
++{
++ struct bfq_group *bfqg;
++ struct bfq_data *bfqd;
++ struct elevator_queue *eq;
++
++ eq = elevator_alloc(q, e);
++ if (eq == NULL)
++ return -ENOMEM;
++
++ bfqd = kzalloc_node(sizeof(*bfqd), GFP_KERNEL, q->node);
++ if (bfqd == NULL) {
++ kobject_put(&eq->kobj);
++ return -ENOMEM;
++ }
++ eq->elevator_data = bfqd;
++
++ /*
++ * Our fallback bfqq if bfq_find_alloc_queue() runs into OOM issues.
++ * Grab a permanent reference to it, so that the normal code flow
++ * will not attempt to free it.
++ */
++ bfq_init_bfqq(bfqd, &bfqd->oom_bfqq, 1, 0);
++ atomic_inc(&bfqd->oom_bfqq.ref);
++
++ bfqd->queue = q;
++
++ spin_lock_irq(q->queue_lock);
++ q->elevator = eq;
++ spin_unlock_irq(q->queue_lock);
++
++ bfqg = bfq_alloc_root_group(bfqd, q->node);
++ if (bfqg == NULL) {
++ kfree(bfqd);
++ kobject_put(&eq->kobj);
++ return -ENOMEM;
++ }
++
++ bfqd->root_group = bfqg;
++#ifdef CONFIG_CGROUP_BFQIO
++ bfqd->active_numerous_groups = 0;
++#endif
++
++ init_timer(&bfqd->idle_slice_timer);
++ bfqd->idle_slice_timer.function = bfq_idle_slice_timer;
++ bfqd->idle_slice_timer.data = (unsigned long)bfqd;
++
++ bfqd->rq_pos_tree = RB_ROOT;
++ bfqd->queue_weights_tree = RB_ROOT;
++ bfqd->group_weights_tree = RB_ROOT;
++
++ INIT_WORK(&bfqd->unplug_work, bfq_kick_queue);
++
++ INIT_LIST_HEAD(&bfqd->active_list);
++ INIT_LIST_HEAD(&bfqd->idle_list);
++
++ bfqd->hw_tag = -1;
++
++ bfqd->bfq_max_budget = bfq_default_max_budget;
++
++ bfqd->bfq_quantum = bfq_quantum;
++ bfqd->bfq_fifo_expire[0] = bfq_fifo_expire[0];
++ bfqd->bfq_fifo_expire[1] = bfq_fifo_expire[1];
++ bfqd->bfq_back_max = bfq_back_max;
++ bfqd->bfq_back_penalty = bfq_back_penalty;
++ bfqd->bfq_slice_idle = bfq_slice_idle;
++ bfqd->bfq_class_idle_last_service = 0;
++ bfqd->bfq_max_budget_async_rq = bfq_max_budget_async_rq;
++ bfqd->bfq_timeout[BLK_RW_ASYNC] = bfq_timeout_async;
++ bfqd->bfq_timeout[BLK_RW_SYNC] = bfq_timeout_sync;
++
++ bfqd->bfq_coop_thresh = 2;
++ bfqd->bfq_failed_cooperations = 7000;
++ bfqd->bfq_requests_within_timer = 120;
++
++ bfqd->low_latency = true;
++
++ bfqd->bfq_wr_coeff = 20;
++ bfqd->bfq_wr_rt_max_time = msecs_to_jiffies(300);
++ bfqd->bfq_wr_max_time = 0;
++ bfqd->bfq_wr_min_idle_time = msecs_to_jiffies(2000);
++ bfqd->bfq_wr_min_inter_arr_async = msecs_to_jiffies(500);
++ bfqd->bfq_wr_max_softrt_rate = 7000; /*
++ * Approximate rate required
++ * to playback or record a
++ * high-definition compressed
++ * video.
++ */
++ bfqd->wr_busy_queues = 0;
++ bfqd->busy_in_flight_queues = 0;
++ bfqd->const_seeky_busy_in_flight_queues = 0;
++
++ /*
++ * Begin by assuming, optimistically, that the device peak rate is
++ * equal to the highest reference rate.
++ */
++ bfqd->RT_prod = R_fast[blk_queue_nonrot(bfqd->queue)] *
++ T_fast[blk_queue_nonrot(bfqd->queue)];
++ bfqd->peak_rate = R_fast[blk_queue_nonrot(bfqd->queue)];
++ bfqd->device_speed = BFQ_BFQD_FAST;
++
++ return 0;
++}
++
++static void bfq_slab_kill(void)
++{
++ if (bfq_pool != NULL)
++ kmem_cache_destroy(bfq_pool);
++}
++
++static int __init bfq_slab_setup(void)
++{
++ bfq_pool = KMEM_CACHE(bfq_queue, 0);
++ if (bfq_pool == NULL)
++ return -ENOMEM;
++ return 0;
++}
++
++static ssize_t bfq_var_show(unsigned int var, char *page)
++{
++ return sprintf(page, "%d\n", var);
++}
++
++static ssize_t bfq_var_store(unsigned long *var, const char *page,
++ size_t count)
++{
++ unsigned long new_val;
++ int ret = kstrtoul(page, 10, &new_val);
++
++ if (ret == 0)
++ *var = new_val;
++
++ return count;
++}
++
++static ssize_t bfq_wr_max_time_show(struct elevator_queue *e, char *page)
++{
++ struct bfq_data *bfqd = e->elevator_data;
++ return sprintf(page, "%d\n", bfqd->bfq_wr_max_time > 0 ?
++ jiffies_to_msecs(bfqd->bfq_wr_max_time) :
++ jiffies_to_msecs(bfq_wr_duration(bfqd)));
++}
++
++static ssize_t bfq_weights_show(struct elevator_queue *e, char *page)
++{
++ struct bfq_queue *bfqq;
++ struct bfq_data *bfqd = e->elevator_data;
++ ssize_t num_char = 0;
++
++ num_char += sprintf(page + num_char, "Tot reqs queued %d\n\n",
++ bfqd->queued);
++
++ spin_lock_irq(bfqd->queue->queue_lock);
++
++ num_char += sprintf(page + num_char, "Active:\n");
++ list_for_each_entry(bfqq, &bfqd->active_list, bfqq_list) {
++ num_char += sprintf(page + num_char,
++ "pid%d: weight %hu, nr_queued %d %d, dur %d/%u\n",
++ bfqq->pid,
++ bfqq->entity.weight,
++ bfqq->queued[0],
++ bfqq->queued[1],
++ jiffies_to_msecs(jiffies - bfqq->last_wr_start_finish),
++ jiffies_to_msecs(bfqq->wr_cur_max_time));
++ }
++
++ num_char += sprintf(page + num_char, "Idle:\n");
++ list_for_each_entry(bfqq, &bfqd->idle_list, bfqq_list) {
++ num_char += sprintf(page + num_char,
++ "pid%d: weight %hu, dur %d/%u\n",
++ bfqq->pid,
++ bfqq->entity.weight,
++ jiffies_to_msecs(jiffies -
++ bfqq->last_wr_start_finish),
++ jiffies_to_msecs(bfqq->wr_cur_max_time));
++ }
++
++ spin_unlock_irq(bfqd->queue->queue_lock);
++
++ return num_char;
++}
++
++#define SHOW_FUNCTION(__FUNC, __VAR, __CONV) \
++static ssize_t __FUNC(struct elevator_queue *e, char *page) \
++{ \
++ struct bfq_data *bfqd = e->elevator_data; \
++ unsigned int __data = __VAR; \
++ if (__CONV) \
++ __data = jiffies_to_msecs(__data); \
++ return bfq_var_show(__data, (page)); \
++}
++SHOW_FUNCTION(bfq_quantum_show, bfqd->bfq_quantum, 0);
++SHOW_FUNCTION(bfq_fifo_expire_sync_show, bfqd->bfq_fifo_expire[1], 1);
++SHOW_FUNCTION(bfq_fifo_expire_async_show, bfqd->bfq_fifo_expire[0], 1);
++SHOW_FUNCTION(bfq_back_seek_max_show, bfqd->bfq_back_max, 0);
++SHOW_FUNCTION(bfq_back_seek_penalty_show, bfqd->bfq_back_penalty, 0);
++SHOW_FUNCTION(bfq_slice_idle_show, bfqd->bfq_slice_idle, 1);
++SHOW_FUNCTION(bfq_max_budget_show, bfqd->bfq_user_max_budget, 0);
++SHOW_FUNCTION(bfq_max_budget_async_rq_show,
++ bfqd->bfq_max_budget_async_rq, 0);
++SHOW_FUNCTION(bfq_timeout_sync_show, bfqd->bfq_timeout[BLK_RW_SYNC], 1);
++SHOW_FUNCTION(bfq_timeout_async_show, bfqd->bfq_timeout[BLK_RW_ASYNC], 1);
++SHOW_FUNCTION(bfq_low_latency_show, bfqd->low_latency, 0);
++SHOW_FUNCTION(bfq_wr_coeff_show, bfqd->bfq_wr_coeff, 0);
++SHOW_FUNCTION(bfq_wr_rt_max_time_show, bfqd->bfq_wr_rt_max_time, 1);
++SHOW_FUNCTION(bfq_wr_min_idle_time_show, bfqd->bfq_wr_min_idle_time, 1);
++SHOW_FUNCTION(bfq_wr_min_inter_arr_async_show, bfqd->bfq_wr_min_inter_arr_async,
++ 1);
++SHOW_FUNCTION(bfq_wr_max_softrt_rate_show, bfqd->bfq_wr_max_softrt_rate, 0);
++#undef SHOW_FUNCTION
++
++#define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV) \
++static ssize_t \
++__FUNC(struct elevator_queue *e, const char *page, size_t count) \
++{ \
++ struct bfq_data *bfqd = e->elevator_data; \
++ unsigned long uninitialized_var(__data); \
++ int ret = bfq_var_store(&__data, (page), count); \
++ if (__data < (MIN)) \
++ __data = (MIN); \
++ else if (__data > (MAX)) \
++ __data = (MAX); \
++ if (__CONV) \
++ *(__PTR) = msecs_to_jiffies(__data); \
++ else \
++ *(__PTR) = __data; \
++ return ret; \
++}
++STORE_FUNCTION(bfq_quantum_store, &bfqd->bfq_quantum, 1, INT_MAX, 0);
++STORE_FUNCTION(bfq_fifo_expire_sync_store, &bfqd->bfq_fifo_expire[1], 1,
++ INT_MAX, 1);
++STORE_FUNCTION(bfq_fifo_expire_async_store, &bfqd->bfq_fifo_expire[0], 1,
++ INT_MAX, 1);
++STORE_FUNCTION(bfq_back_seek_max_store, &bfqd->bfq_back_max, 0, INT_MAX, 0);
++STORE_FUNCTION(bfq_back_seek_penalty_store, &bfqd->bfq_back_penalty, 1,
++ INT_MAX, 0);
++STORE_FUNCTION(bfq_slice_idle_store, &bfqd->bfq_slice_idle, 0, INT_MAX, 1);
++STORE_FUNCTION(bfq_max_budget_async_rq_store, &bfqd->bfq_max_budget_async_rq,
++ 1, INT_MAX, 0);
++STORE_FUNCTION(bfq_timeout_async_store, &bfqd->bfq_timeout[BLK_RW_ASYNC], 0,
++ INT_MAX, 1);
++STORE_FUNCTION(bfq_wr_coeff_store, &bfqd->bfq_wr_coeff, 1, INT_MAX, 0);
++STORE_FUNCTION(bfq_wr_max_time_store, &bfqd->bfq_wr_max_time, 0, INT_MAX, 1);
++STORE_FUNCTION(bfq_wr_rt_max_time_store, &bfqd->bfq_wr_rt_max_time, 0, INT_MAX,
++ 1);
++STORE_FUNCTION(bfq_wr_min_idle_time_store, &bfqd->bfq_wr_min_idle_time, 0,
++ INT_MAX, 1);
++STORE_FUNCTION(bfq_wr_min_inter_arr_async_store,
++ &bfqd->bfq_wr_min_inter_arr_async, 0, INT_MAX, 1);
++STORE_FUNCTION(bfq_wr_max_softrt_rate_store, &bfqd->bfq_wr_max_softrt_rate, 0,
++ INT_MAX, 0);
++#undef STORE_FUNCTION
++
++/* do nothing for the moment */
++static ssize_t bfq_weights_store(struct elevator_queue *e,
++ const char *page, size_t count)
++{
++ return count;
++}
++
++static inline unsigned long bfq_estimated_max_budget(struct bfq_data *bfqd)
++{
++ u64 timeout = jiffies_to_msecs(bfqd->bfq_timeout[BLK_RW_SYNC]);
++
++ if (bfqd->peak_rate_samples >= BFQ_PEAK_RATE_SAMPLES)
++ return bfq_calc_max_budget(bfqd->peak_rate, timeout);
++ else
++ return bfq_default_max_budget;
++}
++
++static ssize_t bfq_max_budget_store(struct elevator_queue *e,
++ const char *page, size_t count)
++{
++ struct bfq_data *bfqd = e->elevator_data;
++ unsigned long uninitialized_var(__data);
++ int ret = bfq_var_store(&__data, (page), count);
++
++ if (__data == 0)
++ bfqd->bfq_max_budget = bfq_estimated_max_budget(bfqd);
++ else {
++ if (__data > INT_MAX)
++ __data = INT_MAX;
++ bfqd->bfq_max_budget = __data;
++ }
++
++ bfqd->bfq_user_max_budget = __data;
++
++ return ret;
++}
++
++static ssize_t bfq_timeout_sync_store(struct elevator_queue *e,
++ const char *page, size_t count)
++{
++ struct bfq_data *bfqd = e->elevator_data;
++ unsigned long uninitialized_var(__data);
++ int ret = bfq_var_store(&__data, (page), count);
++
++ if (__data < 1)
++ __data = 1;
++ else if (__data > INT_MAX)
++ __data = INT_MAX;
++
++ bfqd->bfq_timeout[BLK_RW_SYNC] = msecs_to_jiffies(__data);
++ if (bfqd->bfq_user_max_budget == 0)
++ bfqd->bfq_max_budget = bfq_estimated_max_budget(bfqd);
++
++ return ret;
++}
++
++static ssize_t bfq_low_latency_store(struct elevator_queue *e,
++ const char *page, size_t count)
++{
++ struct bfq_data *bfqd = e->elevator_data;
++ unsigned long uninitialized_var(__data);
++ int ret = bfq_var_store(&__data, (page), count);
++
++ if (__data > 1)
++ __data = 1;
++ if (__data == 0 && bfqd->low_latency != 0)
++ bfq_end_wr(bfqd);
++ bfqd->low_latency = __data;
++
++ return ret;
++}
++
++#define BFQ_ATTR(name) \
++ __ATTR(name, S_IRUGO|S_IWUSR, bfq_##name##_show, bfq_##name##_store)
++
++static struct elv_fs_entry bfq_attrs[] = {
++ BFQ_ATTR(quantum),
++ BFQ_ATTR(fifo_expire_sync),
++ BFQ_ATTR(fifo_expire_async),
++ BFQ_ATTR(back_seek_max),
++ BFQ_ATTR(back_seek_penalty),
++ BFQ_ATTR(slice_idle),
++ BFQ_ATTR(max_budget),
++ BFQ_ATTR(max_budget_async_rq),
++ BFQ_ATTR(timeout_sync),
++ BFQ_ATTR(timeout_async),
++ BFQ_ATTR(low_latency),
++ BFQ_ATTR(wr_coeff),
++ BFQ_ATTR(wr_max_time),
++ BFQ_ATTR(wr_rt_max_time),
++ BFQ_ATTR(wr_min_idle_time),
++ BFQ_ATTR(wr_min_inter_arr_async),
++ BFQ_ATTR(wr_max_softrt_rate),
++ BFQ_ATTR(weights),
++ __ATTR_NULL
++};
++
++static struct elevator_type iosched_bfq = {
++ .ops = {
++ .elevator_merge_fn = bfq_merge,
++ .elevator_merged_fn = bfq_merged_request,
++ .elevator_merge_req_fn = bfq_merged_requests,
++ .elevator_allow_merge_fn = bfq_allow_merge,
++ .elevator_dispatch_fn = bfq_dispatch_requests,
++ .elevator_add_req_fn = bfq_insert_request,
++ .elevator_activate_req_fn = bfq_activate_request,
++ .elevator_deactivate_req_fn = bfq_deactivate_request,
++ .elevator_completed_req_fn = bfq_completed_request,
++ .elevator_former_req_fn = elv_rb_former_request,
++ .elevator_latter_req_fn = elv_rb_latter_request,
++ .elevator_init_icq_fn = bfq_init_icq,
++ .elevator_exit_icq_fn = bfq_exit_icq,
++ .elevator_set_req_fn = bfq_set_request,
++ .elevator_put_req_fn = bfq_put_request,
++ .elevator_may_queue_fn = bfq_may_queue,
++ .elevator_init_fn = bfq_init_queue,
++ .elevator_exit_fn = bfq_exit_queue,
++ },
++ .icq_size = sizeof(struct bfq_io_cq),
++ .icq_align = __alignof__(struct bfq_io_cq),
++ .elevator_attrs = bfq_attrs,
++ .elevator_name = "bfq",
++ .elevator_owner = THIS_MODULE,
++};
++
++static int __init bfq_init(void)
++{
++ /*
++ * Can be 0 on HZ < 1000 setups.
++ */
++ if (bfq_slice_idle == 0)
++ bfq_slice_idle = 1;
++
++ if (bfq_timeout_async == 0)
++ bfq_timeout_async = 1;
++
++ if (bfq_slab_setup())
++ return -ENOMEM;
++
++ /*
++ * Times to load large popular applications for the typical systems
++ * installed on the reference devices (see the comments before the
++ * definitions of the two arrays).
++ */
++ T_slow[0] = msecs_to_jiffies(2600);
++ T_slow[1] = msecs_to_jiffies(1000);
++ T_fast[0] = msecs_to_jiffies(5500);
++ T_fast[1] = msecs_to_jiffies(2000);
++
++ /*
++ * Thresholds that determine the switch between speed classes (see
++ * the comments before the definition of the array).
++ */
++ device_speed_thresh[0] = (R_fast[0] + R_slow[0]) / 2;
++ device_speed_thresh[1] = (R_fast[1] + R_slow[1]) / 2;
++
++ elv_register(&iosched_bfq);
++	pr_info("BFQ I/O-scheduler version: v7r5\n");
++
++ return 0;
++}
++
++static void __exit bfq_exit(void)
++{
++ elv_unregister(&iosched_bfq);
++ bfq_slab_kill();
++}
++
++module_init(bfq_init);
++module_exit(bfq_exit);
++
++MODULE_AUTHOR("Fabio Checconi, Paolo Valente");
++MODULE_LICENSE("GPL");
+diff --git a/block/bfq-sched.c b/block/bfq-sched.c
+new file mode 100644
+index 0000000..c4831b7
+--- /dev/null
++++ b/block/bfq-sched.c
+@@ -0,0 +1,1207 @@
++/*
++ * BFQ: Hierarchical B-WF2Q+ scheduler.
++ *
++ * Based on ideas and code from CFQ:
++ * Copyright (C) 2003 Jens Axboe <axboe@kernel.dk>
++ *
++ * Copyright (C) 2008 Fabio Checconi <fabio@gandalf.sssup.it>
++ * Paolo Valente <paolo.valente@unimore.it>
++ *
++ * Copyright (C) 2010 Paolo Valente <paolo.valente@unimore.it>
++ */
++
++#ifdef CONFIG_CGROUP_BFQIO
++#define for_each_entity(entity) \
++ for (; entity != NULL; entity = entity->parent)
++
++#define for_each_entity_safe(entity, parent) \
++ for (; entity && ({ parent = entity->parent; 1; }); entity = parent)
++
++static struct bfq_entity *bfq_lookup_next_entity(struct bfq_sched_data *sd,
++ int extract,
++ struct bfq_data *bfqd);
++
++static inline void bfq_update_budget(struct bfq_entity *next_in_service)
++{
++ struct bfq_entity *bfqg_entity;
++ struct bfq_group *bfqg;
++ struct bfq_sched_data *group_sd;
++
++ BUG_ON(next_in_service == NULL);
++
++ group_sd = next_in_service->sched_data;
++
++ bfqg = container_of(group_sd, struct bfq_group, sched_data);
++ /*
++ * bfq_group's my_entity field is not NULL only if the group
++ * is not the root group. We must not touch the root entity
++ * as it must never become an in-service entity.
++ */
++ bfqg_entity = bfqg->my_entity;
++ if (bfqg_entity != NULL)
++ bfqg_entity->budget = next_in_service->budget;
++}
++
++static int bfq_update_next_in_service(struct bfq_sched_data *sd)
++{
++ struct bfq_entity *next_in_service;
++
++ if (sd->in_service_entity != NULL)
++ /* will update/requeue at the end of service */
++ return 0;
++
++ /*
++ * NOTE: this can be improved in many ways, such as returning
++ * 1 (and thus propagating upwards the update) only when the
++ * budget changes, or caching the bfqq that will be scheduled
++	 * next from this subtree. For now we worry more about
++ * correctness than about performance...
++ */
++ next_in_service = bfq_lookup_next_entity(sd, 0, NULL);
++ sd->next_in_service = next_in_service;
++
++ if (next_in_service != NULL)
++ bfq_update_budget(next_in_service);
++
++ return 1;
++}
++
++static inline void bfq_check_next_in_service(struct bfq_sched_data *sd,
++ struct bfq_entity *entity)
++{
++ BUG_ON(sd->next_in_service != entity);
++}
++#else
++#define for_each_entity(entity) \
++ for (; entity != NULL; entity = NULL)
++
++#define for_each_entity_safe(entity, parent) \
++ for (parent = NULL; entity != NULL; entity = parent)
++
++static inline int bfq_update_next_in_service(struct bfq_sched_data *sd)
++{
++ return 0;
++}
++
++static inline void bfq_check_next_in_service(struct bfq_sched_data *sd,
++ struct bfq_entity *entity)
++{
++}
++
++static inline void bfq_update_budget(struct bfq_entity *next_in_service)
++{
++}
++#endif
++
++/*
++ * Shift for timestamp calculations. This actually limits the maximum
++ * service allowed in one timestamp delta (small shift values increase it),
++ * the maximum total weight that can be used for the queues in the system
++ * (big shift values increase it), and the period of virtual time
++ * wraparounds.
++ */
++#define WFQ_SERVICE_SHIFT 22
++
++/**
++ * bfq_gt - compare two timestamps.
++ * @a: first ts.
++ * @b: second ts.
++ *
++ * Return @a > @b, dealing with wrapping correctly.
++ */
++static inline int bfq_gt(u64 a, u64 b)
++{
++ return (s64)(a - b) > 0;
++}
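Outside the kernel tree, the wrap-safe comparison above can be sketched as a standalone C function (a minimal sketch; `uint64_t`/`int64_t` stand in for the kernel's `u64`/`s64`):

```c
#include <assert.h>
#include <stdint.h>

/* Wrap-safe "a > b" for free-running 64-bit timestamps: the unsigned
 * subtraction is well defined modulo 2^64, and reinterpreting the
 * difference as signed yields the right answer as long as the two
 * timestamps are less than 2^63 apart. */
static int bfq_gt(uint64_t a, uint64_t b)
{
	return (int64_t)(a - b) > 0;
}
```

Note that just past a wraparound, 0 correctly compares as later than `UINT64_MAX`.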
++
++static inline struct bfq_queue *bfq_entity_to_bfqq(struct bfq_entity *entity)
++{
++ struct bfq_queue *bfqq = NULL;
++
++ BUG_ON(entity == NULL);
++
++ if (entity->my_sched_data == NULL)
++ bfqq = container_of(entity, struct bfq_queue, entity);
++
++ return bfqq;
++}
++
++
++/**
++ * bfq_delta - map service into the virtual time domain.
++ * @service: amount of service.
++ * @weight: scale factor (weight of an entity or weight sum).
++ */
++static inline u64 bfq_delta(unsigned long service,
++ unsigned long weight)
++{
++ u64 d = (u64)service << WFQ_SERVICE_SHIFT;
++
++ do_div(d, weight);
++ return d;
++}
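As a userspace sketch (plain division standing in for the kernel's `do_div()`), the service-to-virtual-time mapping above is:

```c
#include <assert.h>
#include <stdint.h>

#define WFQ_SERVICE_SHIFT 22

/* Map an amount of service into the virtual-time domain: scale the
 * service up by the timestamp shift, then divide by the weight, so a
 * heavier entity accumulates virtual time more slowly for the same
 * amount of service. */
static uint64_t bfq_delta(unsigned long service, unsigned long weight)
{
	return ((uint64_t)service << WFQ_SERVICE_SHIFT) / weight;
}
```

Doubling the weight halves the virtual-time delta charged for a given amount of service.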
++
++/**
++ * bfq_calc_finish - assign the finish time to an entity.
++ * @entity: the entity to act upon.
++ * @service: the service to be charged to the entity.
++ */
++static inline void bfq_calc_finish(struct bfq_entity *entity,
++ unsigned long service)
++{
++ struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
++
++ BUG_ON(entity->weight == 0);
++
++ entity->finish = entity->start +
++ bfq_delta(service, entity->weight);
++
++ if (bfqq != NULL) {
++ bfq_log_bfqq(bfqq->bfqd, bfqq,
++ "calc_finish: serv %lu, w %d",
++ service, entity->weight);
++ bfq_log_bfqq(bfqq->bfqd, bfqq,
++ "calc_finish: start %llu, finish %llu, delta %llu",
++ entity->start, entity->finish,
++ bfq_delta(service, entity->weight));
++ }
++}
++
++/**
++ * bfq_entity_of - get an entity from a node.
++ * @node: the node field of the entity.
++ *
++ * Convert a node pointer to the corresponding entity. This is used only
++ * to simplify the logic of some functions and not as the generic
++ * conversion mechanism because, e.g., in the tree walking functions,
++ * the check for a %NULL value would be redundant.
++ */
++static inline struct bfq_entity *bfq_entity_of(struct rb_node *node)
++{
++ struct bfq_entity *entity = NULL;
++
++ if (node != NULL)
++ entity = rb_entry(node, struct bfq_entity, rb_node);
++
++ return entity;
++}
++
++/**
++ * bfq_extract - remove an entity from a tree.
++ * @root: the tree root.
++ * @entity: the entity to remove.
++ */
++static inline void bfq_extract(struct rb_root *root,
++ struct bfq_entity *entity)
++{
++ BUG_ON(entity->tree != root);
++
++ entity->tree = NULL;
++ rb_erase(&entity->rb_node, root);
++}
++
++/**
++ * bfq_idle_extract - extract an entity from the idle tree.
++ * @st: the service tree of the owning @entity.
++ * @entity: the entity being removed.
++ */
++static void bfq_idle_extract(struct bfq_service_tree *st,
++ struct bfq_entity *entity)
++{
++ struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
++ struct rb_node *next;
++
++ BUG_ON(entity->tree != &st->idle);
++
++ if (entity == st->first_idle) {
++ next = rb_next(&entity->rb_node);
++ st->first_idle = bfq_entity_of(next);
++ }
++
++ if (entity == st->last_idle) {
++ next = rb_prev(&entity->rb_node);
++ st->last_idle = bfq_entity_of(next);
++ }
++
++ bfq_extract(&st->idle, entity);
++
++ if (bfqq != NULL)
++ list_del(&bfqq->bfqq_list);
++}
++
++/**
++ * bfq_insert - generic tree insertion.
++ * @root: tree root.
++ * @entity: entity to insert.
++ *
++ * This is used for the idle and the active tree, since they are both
++ * ordered by finish time.
++ */
++static void bfq_insert(struct rb_root *root, struct bfq_entity *entity)
++{
++ struct bfq_entity *entry;
++ struct rb_node **node = &root->rb_node;
++ struct rb_node *parent = NULL;
++
++ BUG_ON(entity->tree != NULL);
++
++ while (*node != NULL) {
++ parent = *node;
++ entry = rb_entry(parent, struct bfq_entity, rb_node);
++
++ if (bfq_gt(entry->finish, entity->finish))
++ node = &parent->rb_left;
++ else
++ node = &parent->rb_right;
++ }
++
++ rb_link_node(&entity->rb_node, parent, node);
++ rb_insert_color(&entity->rb_node, root);
++
++ entity->tree = root;
++}
++
++/**
++ * bfq_update_min - update the min_start field of an entity.
++ * @entity: the entity to update.
++ * @node: one of its children.
++ *
++ * This function is called when @entity may store an invalid value for
++ * min_start due to updates to the active tree. The function assumes
++ * that the subtree rooted at @node (which may be its left or its right
++ * child) has a valid min_start value.
++ */
++static inline void bfq_update_min(struct bfq_entity *entity,
++ struct rb_node *node)
++{
++ struct bfq_entity *child;
++
++ if (node != NULL) {
++ child = rb_entry(node, struct bfq_entity, rb_node);
++ if (bfq_gt(entity->min_start, child->min_start))
++ entity->min_start = child->min_start;
++ }
++}
++
++/**
++ * bfq_update_active_node - recalculate min_start.
++ * @node: the node to update.
++ *
++ * @node may have changed position or one of its children may have moved;
++ * this function updates its min_start value. The left and right subtrees
++ * are assumed to hold a correct min_start value.
++ */
++static inline void bfq_update_active_node(struct rb_node *node)
++{
++ struct bfq_entity *entity = rb_entry(node, struct bfq_entity, rb_node);
++
++ entity->min_start = entity->start;
++ bfq_update_min(entity, node->rb_right);
++ bfq_update_min(entity, node->rb_left);
++}
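The min_start augmentation can be illustrated with a minimal standalone sketch, using a hypothetical three-field node instead of the kernel's `rb_node` and a plain `<` comparison in place of the wrap-safe `bfq_gt()`:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Each node caches the minimum start time over itself and its subtrees;
 * this cached value is what lets the scheduler search for an eligible
 * entity in O(log n). */
struct toy_entity {
	uint64_t start, min_start;
	struct toy_entity *left, *right;
};

static void toy_update_min(struct toy_entity *e, struct toy_entity *child)
{
	if (child != NULL && child->min_start < e->min_start)
		e->min_start = child->min_start;
}

static void toy_update_active_node(struct toy_entity *e)
{
	e->min_start = e->start;
	toy_update_min(e, e->right);
	toy_update_min(e, e->left);
}
```

After an update, a parent's `min_start` is the minimum of its own start time and the cached minima of its children, mirroring what `bfq_update_active_node()` maintains on the real rb-tree.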
++
++/**
++ * bfq_update_active_tree - update min_start for the whole active tree.
++ * @node: the starting node.
++ *
++ * @node must be the deepest modified node after an update. This function
++ * updates its min_start using the values held by its children, assuming
++ * that they did not change, and then updates all the nodes that may have
++ * changed in the path to the root. The only nodes that may have changed
++ * are the ones in the path or their siblings.
++ */
++static void bfq_update_active_tree(struct rb_node *node)
++{
++ struct rb_node *parent;
++
++up:
++ bfq_update_active_node(node);
++
++ parent = rb_parent(node);
++ if (parent == NULL)
++ return;
++
++ if (node == parent->rb_left && parent->rb_right != NULL)
++ bfq_update_active_node(parent->rb_right);
++ else if (parent->rb_left != NULL)
++ bfq_update_active_node(parent->rb_left);
++
++ node = parent;
++ goto up;
++}
++
++static void bfq_weights_tree_add(struct bfq_data *bfqd,
++ struct bfq_entity *entity,
++ struct rb_root *root);
++
++static void bfq_weights_tree_remove(struct bfq_data *bfqd,
++ struct bfq_entity *entity,
++ struct rb_root *root);
++
++
++/**
++ * bfq_active_insert - insert an entity in the active tree of its
++ * group/device.
++ * @st: the service tree of the entity.
++ * @entity: the entity being inserted.
++ *
++ * The active tree is ordered by finish time, but an extra key is kept
++ * for each node, containing the minimum value for the start times of
++ * its children (and the node itself), so it's possible to search for
++ * the eligible node with the lowest finish time in logarithmic time.
++ */
++static void bfq_active_insert(struct bfq_service_tree *st,
++ struct bfq_entity *entity)
++{
++ struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
++ struct rb_node *node = &entity->rb_node;
++#ifdef CONFIG_CGROUP_BFQIO
++ struct bfq_sched_data *sd = NULL;
++ struct bfq_group *bfqg = NULL;
++ struct bfq_data *bfqd = NULL;
++#endif
++
++ bfq_insert(&st->active, entity);
++
++ if (node->rb_left != NULL)
++ node = node->rb_left;
++ else if (node->rb_right != NULL)
++ node = node->rb_right;
++
++ bfq_update_active_tree(node);
++
++#ifdef CONFIG_CGROUP_BFQIO
++ sd = entity->sched_data;
++ bfqg = container_of(sd, struct bfq_group, sched_data);
++ BUG_ON(!bfqg);
++ bfqd = (struct bfq_data *)bfqg->bfqd;
++#endif
++ if (bfqq != NULL)
++ list_add(&bfqq->bfqq_list, &bfqq->bfqd->active_list);
++#ifdef CONFIG_CGROUP_BFQIO
++ else { /* bfq_group */
++ BUG_ON(!bfqd);
++ bfq_weights_tree_add(bfqd, entity, &bfqd->group_weights_tree);
++ }
++ if (bfqg != bfqd->root_group) {
++ BUG_ON(!bfqg);
++ BUG_ON(!bfqd);
++ bfqg->active_entities++;
++ if (bfqg->active_entities == 2)
++ bfqd->active_numerous_groups++;
++ }
++#endif
++}
++
++/**
++ * bfq_ioprio_to_weight - calc a weight from an ioprio.
++ * @ioprio: the ioprio value to convert.
++ */
++static inline unsigned short bfq_ioprio_to_weight(int ioprio)
++{
++ BUG_ON(ioprio < 0 || ioprio >= IOPRIO_BE_NR);
++ return IOPRIO_BE_NR - ioprio;
++}
++
++/**
++ * bfq_weight_to_ioprio - calc an ioprio from a weight.
++ * @weight: the weight value to convert.
++ *
++ * To preserve as much as possible the old only-ioprio user interface,
++ * 0 is used as an escape ioprio value for weights (numerically) equal to
++ * or larger than IOPRIO_BE_NR.
++ */
++static inline unsigned short bfq_weight_to_ioprio(int weight)
++{
++ BUG_ON(weight < BFQ_MIN_WEIGHT || weight > BFQ_MAX_WEIGHT);
++ return IOPRIO_BE_NR - weight < 0 ? 0 : IOPRIO_BE_NR - weight;
++}
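The two conversions above can be exercised in a standalone sketch, assuming the kernel's usual `IOPRIO_BE_NR` of 8 best-effort priority levels (the `toy_` names are hypothetical, not part of the patch):

```c
#include <assert.h>

/* Assumed value of the kernel constant: 8 best-effort ioprio levels. */
#define IOPRIO_BE_NR 8

/* A lower ioprio number means higher priority, so it maps to a larger
 * weight. */
static unsigned short toy_ioprio_to_weight(int ioprio)
{
	return IOPRIO_BE_NR - ioprio;
}

/* Weights at or above IOPRIO_BE_NR have no ioprio equivalent and fall
 * back to the escape value 0. */
static unsigned short toy_weight_to_ioprio(int weight)
{
	return IOPRIO_BE_NR - weight < 0 ? 0 : IOPRIO_BE_NR - weight;
}
```

Round-tripping an in-range ioprio through both conversions returns the original value.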
++
++static inline void bfq_get_entity(struct bfq_entity *entity)
++{
++ struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
++
++ if (bfqq != NULL) {
++ atomic_inc(&bfqq->ref);
++ bfq_log_bfqq(bfqq->bfqd, bfqq, "get_entity: %p %d",
++ bfqq, atomic_read(&bfqq->ref));
++ }
++}
++
++/**
++ * bfq_find_deepest - find the deepest node that an extraction can modify.
++ * @node: the node being removed.
++ *
++ * Do the first step of an extraction in an rb tree, looking for the
++ * node that will replace @node, and returning the deepest node that
++ * the following modifications to the tree can touch. If @node is the
++ * last node in the tree return %NULL.
++ */
++static struct rb_node *bfq_find_deepest(struct rb_node *node)
++{
++ struct rb_node *deepest;
++
++ if (node->rb_right == NULL && node->rb_left == NULL)
++ deepest = rb_parent(node);
++ else if (node->rb_right == NULL)
++ deepest = node->rb_left;
++ else if (node->rb_left == NULL)
++ deepest = node->rb_right;
++ else {
++ deepest = rb_next(node);
++ if (deepest->rb_right != NULL)
++ deepest = deepest->rb_right;
++ else if (rb_parent(deepest) != node)
++ deepest = rb_parent(deepest);
++ }
++
++ return deepest;
++}
++
++/**
++ * bfq_active_extract - remove an entity from the active tree.
++ * @st: the service_tree containing the tree.
++ * @entity: the entity being removed.
++ */
++static void bfq_active_extract(struct bfq_service_tree *st,
++ struct bfq_entity *entity)
++{
++ struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
++ struct rb_node *node;
++#ifdef CONFIG_CGROUP_BFQIO
++ struct bfq_sched_data *sd = NULL;
++ struct bfq_group *bfqg = NULL;
++ struct bfq_data *bfqd = NULL;
++#endif
++
++ node = bfq_find_deepest(&entity->rb_node);
++ bfq_extract(&st->active, entity);
++
++ if (node != NULL)
++ bfq_update_active_tree(node);
++
++#ifdef CONFIG_CGROUP_BFQIO
++ sd = entity->sched_data;
++ bfqg = container_of(sd, struct bfq_group, sched_data);
++ BUG_ON(!bfqg);
++ bfqd = (struct bfq_data *)bfqg->bfqd;
++#endif
++ if (bfqq != NULL)
++ list_del(&bfqq->bfqq_list);
++#ifdef CONFIG_CGROUP_BFQIO
++ else { /* bfq_group */
++ BUG_ON(!bfqd);
++ bfq_weights_tree_remove(bfqd, entity,
++ &bfqd->group_weights_tree);
++ }
++ if (bfqg != bfqd->root_group) {
++ BUG_ON(!bfqg);
++ BUG_ON(!bfqd);
++ BUG_ON(!bfqg->active_entities);
++ bfqg->active_entities--;
++ if (bfqg->active_entities == 1) {
++ BUG_ON(!bfqd->active_numerous_groups);
++ bfqd->active_numerous_groups--;
++ }
++ }
++#endif
++}
++
++/**
++ * bfq_idle_insert - insert an entity into the idle tree.
++ * @st: the service tree containing the tree.
++ * @entity: the entity to insert.
++ */
++static void bfq_idle_insert(struct bfq_service_tree *st,
++ struct bfq_entity *entity)
++{
++ struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
++ struct bfq_entity *first_idle = st->first_idle;
++ struct bfq_entity *last_idle = st->last_idle;
++
++ if (first_idle == NULL || bfq_gt(first_idle->finish, entity->finish))
++ st->first_idle = entity;
++ if (last_idle == NULL || bfq_gt(entity->finish, last_idle->finish))
++ st->last_idle = entity;
++
++ bfq_insert(&st->idle, entity);
++
++ if (bfqq != NULL)
++ list_add(&bfqq->bfqq_list, &bfqq->bfqd->idle_list);
++}
++
++/**
++ * bfq_forget_entity - remove an entity from the wfq trees.
++ * @st: the service tree.
++ * @entity: the entity being removed.
++ *
++ * Update the device status and forget everything about @entity, putting
++ * the device reference to it, if it is a queue. Entities belonging to
++ * groups are not refcounted.
++ */
++static void bfq_forget_entity(struct bfq_service_tree *st,
++ struct bfq_entity *entity)
++{
++ struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
++ struct bfq_sched_data *sd;
++
++ BUG_ON(!entity->on_st);
++
++ entity->on_st = 0;
++ st->wsum -= entity->weight;
++ if (bfqq != NULL) {
++ sd = entity->sched_data;
++ bfq_log_bfqq(bfqq->bfqd, bfqq, "forget_entity: %p %d",
++ bfqq, atomic_read(&bfqq->ref));
++ bfq_put_queue(bfqq);
++ }
++}
++
++/**
++ * bfq_put_idle_entity - release the idle tree ref of an entity.
++ * @st: service tree for the entity.
++ * @entity: the entity being released.
++ */
++static void bfq_put_idle_entity(struct bfq_service_tree *st,
++ struct bfq_entity *entity)
++{
++ bfq_idle_extract(st, entity);
++ bfq_forget_entity(st, entity);
++}
++
++/**
++ * bfq_forget_idle - update the idle tree if necessary.
++ * @st: the service tree to act upon.
++ *
++ * To preserve the global O(log N) complexity we only remove one entry here;
++ * as the idle tree will not grow indefinitely this can be done safely.
++ */
++static void bfq_forget_idle(struct bfq_service_tree *st)
++{
++ struct bfq_entity *first_idle = st->first_idle;
++ struct bfq_entity *last_idle = st->last_idle;
++
++ if (RB_EMPTY_ROOT(&st->active) && last_idle != NULL &&
++ !bfq_gt(last_idle->finish, st->vtime)) {
++ /*
++ * Forget the whole idle tree, increasing the vtime past
++ * the last finish time of idle entities.
++ */
++ st->vtime = last_idle->finish;
++ }
++
++ if (first_idle != NULL && !bfq_gt(first_idle->finish, st->vtime))
++ bfq_put_idle_entity(st, first_idle);
++}
++
++static struct bfq_service_tree *
++__bfq_entity_update_weight_prio(struct bfq_service_tree *old_st,
++ struct bfq_entity *entity)
++{
++ struct bfq_service_tree *new_st = old_st;
++
++ if (entity->ioprio_changed) {
++ struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
++ unsigned short prev_weight, new_weight;
++ struct bfq_data *bfqd = NULL;
++ struct rb_root *root;
++#ifdef CONFIG_CGROUP_BFQIO
++ struct bfq_sched_data *sd;
++ struct bfq_group *bfqg;
++#endif
++
++ if (bfqq != NULL)
++ bfqd = bfqq->bfqd;
++#ifdef CONFIG_CGROUP_BFQIO
++ else {
++ sd = entity->my_sched_data;
++ bfqg = container_of(sd, struct bfq_group, sched_data);
++ BUG_ON(!bfqg);
++ bfqd = (struct bfq_data *)bfqg->bfqd;
++ BUG_ON(!bfqd);
++ }
++#endif
++
++ BUG_ON(old_st->wsum < entity->weight);
++ old_st->wsum -= entity->weight;
++
++ if (entity->new_weight != entity->orig_weight) {
++ entity->orig_weight = entity->new_weight;
++ entity->ioprio =
++ bfq_weight_to_ioprio(entity->orig_weight);
++ } else if (entity->new_ioprio != entity->ioprio) {
++ entity->ioprio = entity->new_ioprio;
++ entity->orig_weight =
++ bfq_ioprio_to_weight(entity->ioprio);
++ } else
++ entity->new_weight = entity->orig_weight =
++ bfq_ioprio_to_weight(entity->ioprio);
++
++ entity->ioprio_class = entity->new_ioprio_class;
++ entity->ioprio_changed = 0;
++
++ /*
++ * NOTE: here we may be changing the weight too early,
++ * this will cause unfairness. The correct approach
++ * would have required additional complexity to defer
++ * weight changes to the proper time instants (i.e.,
++ * when entity->finish <= old_st->vtime).
++ */
++ new_st = bfq_entity_service_tree(entity);
++
++ prev_weight = entity->weight;
++ new_weight = entity->orig_weight *
++ (bfqq != NULL ? bfqq->wr_coeff : 1);
++ /*
++ * If the weight of the entity changes, remove the entity
++ * from its old weight counter (if there is a counter
++ * associated with the entity), and add it to the counter
++ * associated with its new weight.
++ */
++ if (prev_weight != new_weight) {
++ root = bfqq ? &bfqd->queue_weights_tree :
++ &bfqd->group_weights_tree;
++ bfq_weights_tree_remove(bfqd, entity, root);
++ }
++ entity->weight = new_weight;
++ /*
++ * Add the entity to its weights tree only if it is
++ * not associated with a weight-raised queue.
++ */
++ if (prev_weight != new_weight &&
++ (bfqq ? bfqq->wr_coeff == 1 : 1))
++ /* If we get here, root has been initialized. */
++ bfq_weights_tree_add(bfqd, entity, root);
++
++ new_st->wsum += entity->weight;
++
++ if (new_st != old_st)
++ entity->start = new_st->vtime;
++ }
++
++ return new_st;
++}
++
++/**
++ * bfq_bfqq_served - update the scheduler status after selection for
++ * service.
++ * @bfqq: the queue being served.
++ * @served: bytes to transfer.
++ *
++ * NOTE: this can be optimized, as the timestamps of upper level entities
++ * are synchronized every time a new bfqq is selected for service. For now,
++ * we keep it to better check consistency.
++ */
++static void bfq_bfqq_served(struct bfq_queue *bfqq, unsigned long served)
++{
++ struct bfq_entity *entity = &bfqq->entity;
++ struct bfq_service_tree *st;
++
++ for_each_entity(entity) {
++ st = bfq_entity_service_tree(entity);
++
++ entity->service += served;
++ BUG_ON(entity->service > entity->budget);
++ BUG_ON(st->wsum == 0);
++
++ st->vtime += bfq_delta(served, st->wsum);
++ bfq_forget_idle(st);
++ }
++ bfq_log_bfqq(bfqq->bfqd, bfqq, "bfqq_served %lu secs", served);
++}
++
++/**
++ * bfq_bfqq_charge_full_budget - set the service to the entity budget.
++ * @bfqq: the queue that needs a service update.
++ *
++ * When it's not possible to be fair in the service domain, because
++ * a queue is not consuming its budget fast enough (the meaning of
++ * fast depends on the timeout parameter), we charge it a full
++ * budget. In this way we should obtain a sort of time-domain
++ * fairness among all the seeky/slow queues.
++ */
++static inline void bfq_bfqq_charge_full_budget(struct bfq_queue *bfqq)
++{
++ struct bfq_entity *entity = &bfqq->entity;
++
++ bfq_log_bfqq(bfqq->bfqd, bfqq, "charge_full_budget");
++
++ bfq_bfqq_served(bfqq, entity->budget - entity->service);
++}
++
++/**
++ * __bfq_activate_entity - activate an entity.
++ * @entity: the entity being activated.
++ *
++ * Called whenever an entity is activated, i.e., it is not active and one
++ * of its children receives a new request, or has to be reactivated due to
++ * budget exhaustion. It uses the current budget of the entity (and the
++ * service already received, if @entity is active) to calculate its
++ * timestamps.
++ */
++static void __bfq_activate_entity(struct bfq_entity *entity)
++{
++ struct bfq_sched_data *sd = entity->sched_data;
++ struct bfq_service_tree *st = bfq_entity_service_tree(entity);
++
++ if (entity == sd->in_service_entity) {
++ BUG_ON(entity->tree != NULL);
++ /*
++	 * If we are requeueing the current entity, we have
++	 * to take care not to charge it for service it has
++	 * not received.
++ */
++ bfq_calc_finish(entity, entity->service);
++ entity->start = entity->finish;
++ sd->in_service_entity = NULL;
++ } else if (entity->tree == &st->active) {
++ /*
++ * Requeueing an entity due to a change of some
++ * next_in_service entity below it. We reuse the
++ * old start time.
++ */
++ bfq_active_extract(st, entity);
++ } else if (entity->tree == &st->idle) {
++ /*
++ * Must be on the idle tree, bfq_idle_extract() will
++ * check for that.
++ */
++ bfq_idle_extract(st, entity);
++ entity->start = bfq_gt(st->vtime, entity->finish) ?
++ st->vtime : entity->finish;
++ } else {
++ /*
++ * The finish time of the entity may be invalid, and
++ * it is in the past for sure, otherwise the queue
++ * would have been on the idle tree.
++ */
++ entity->start = st->vtime;
++ st->wsum += entity->weight;
++ bfq_get_entity(entity);
++
++ BUG_ON(entity->on_st);
++ entity->on_st = 1;
++ }
++
++ st = __bfq_entity_update_weight_prio(st, entity);
++ bfq_calc_finish(entity, entity->budget);
++ bfq_active_insert(st, entity);
++}
++
++/**
++ * bfq_activate_entity - activate an entity and its ancestors if necessary.
++ * @entity: the entity to activate.
++ *
++ * Activate @entity and all the entities on the path from it to the root.
++ */
++static void bfq_activate_entity(struct bfq_entity *entity)
++{
++ struct bfq_sched_data *sd;
++
++ for_each_entity(entity) {
++ __bfq_activate_entity(entity);
++
++ sd = entity->sched_data;
++ if (!bfq_update_next_in_service(sd))
++ /*
++ * No need to propagate the activation to the
++ * upper entities, as they will be updated when
++ * the in-service entity is rescheduled.
++ */
++ break;
++ }
++}
++
++/**
++ * __bfq_deactivate_entity - deactivate an entity from its service tree.
++ * @entity: the entity to deactivate.
++ * @requeue: if false, the entity will not be put into the idle tree.
++ *
++ * Deactivate an entity, independently from its previous state. If the
++ * entity was not on a service tree, just return; otherwise, if it is on
++ * any scheduler tree, extract it from that tree and, if necessary and
++ * if the caller specified @requeue, put it on the idle tree.
++ *
++ * Return %1 if the caller should update the entity hierarchy, i.e.,
++ * if the entity was in service or if it was the next_in_service for
++ * its sched_data; return %0 otherwise.
++ */
++static int __bfq_deactivate_entity(struct bfq_entity *entity, int requeue)
++{
++ struct bfq_sched_data *sd = entity->sched_data;
++ struct bfq_service_tree *st = bfq_entity_service_tree(entity);
++ int was_in_service = entity == sd->in_service_entity;
++ int ret = 0;
++
++ if (!entity->on_st)
++ return 0;
++
++ BUG_ON(was_in_service && entity->tree != NULL);
++
++ if (was_in_service) {
++ bfq_calc_finish(entity, entity->service);
++ sd->in_service_entity = NULL;
++ } else if (entity->tree == &st->active)
++ bfq_active_extract(st, entity);
++ else if (entity->tree == &st->idle)
++ bfq_idle_extract(st, entity);
++ else if (entity->tree != NULL)
++ BUG();
++
++ if (was_in_service || sd->next_in_service == entity)
++ ret = bfq_update_next_in_service(sd);
++
++ if (!requeue || !bfq_gt(entity->finish, st->vtime))
++ bfq_forget_entity(st, entity);
++ else
++ bfq_idle_insert(st, entity);
++
++ BUG_ON(sd->in_service_entity == entity);
++ BUG_ON(sd->next_in_service == entity);
++
++ return ret;
++}
++
++/**
++ * bfq_deactivate_entity - deactivate an entity.
++ * @entity: the entity to deactivate.
++ * @requeue: true if the entity can be put on the idle tree
++ */
++static void bfq_deactivate_entity(struct bfq_entity *entity, int requeue)
++{
++ struct bfq_sched_data *sd;
++ struct bfq_entity *parent;
++
++ for_each_entity_safe(entity, parent) {
++ sd = entity->sched_data;
++
++ if (!__bfq_deactivate_entity(entity, requeue))
++ /*
++ * The parent entity is still backlogged, and
++ * we don't need to update it as it is still
++ * in service.
++ */
++ break;
++
++ if (sd->next_in_service != NULL)
++ /*
++ * The parent entity is still backlogged and
++ * the budgets on the path towards the root
++ * need to be updated.
++ */
++ goto update;
++
++ /*
++	 * If we reach this point, the parent is no longer backlogged and
++ * we want to propagate the dequeue upwards.
++ */
++ requeue = 1;
++ }
++
++ return;
++
++update:
++ entity = parent;
++ for_each_entity(entity) {
++ __bfq_activate_entity(entity);
++
++ sd = entity->sched_data;
++ if (!bfq_update_next_in_service(sd))
++ break;
++ }
++}
++
++/**
++ * bfq_update_vtime - update vtime if necessary.
++ * @st: the service tree to act upon.
++ *
++ * If necessary update the service tree vtime to have at least one
++ * eligible entity, skipping to its start time. Assumes that the
++ * active tree of the device is not empty.
++ *
++ * NOTE: this hierarchical implementation updates vtimes quite often;
++ * we may end up with reactivated processes getting timestamps after a
++ * vtime skip performed because we needed a ->first_active entity on some
++ * intermediate node.
++ */
++static void bfq_update_vtime(struct bfq_service_tree *st)
++{
++ struct bfq_entity *entry;
++ struct rb_node *node = st->active.rb_node;
++
++ entry = rb_entry(node, struct bfq_entity, rb_node);
++ if (bfq_gt(entry->min_start, st->vtime)) {
++ st->vtime = entry->min_start;
++ bfq_forget_idle(st);
++ }
++}
++
++/**
++ * bfq_first_active_entity - find the eligible entity with
++ * the smallest finish time
++ * @st: the service tree to select from.
++ *
++ * This function searches for the first schedulable entity, starting from
++ * the root of the tree and going left whenever the left subtree contains
++ * at least one eligible (start >= vtime) entity. The path on
++ * the right is followed only if a) the left subtree contains no eligible
++ * entities and b) no eligible entity has been found yet.
++ */
++static struct bfq_entity *bfq_first_active_entity(struct bfq_service_tree *st)
++{
++ struct bfq_entity *entry, *first = NULL;
++ struct rb_node *node = st->active.rb_node;
++
++ while (node != NULL) {
++ entry = rb_entry(node, struct bfq_entity, rb_node);
++left:
++ if (!bfq_gt(entry->start, st->vtime))
++ first = entry;
++
++ BUG_ON(bfq_gt(entry->min_start, st->vtime));
++
++ if (node->rb_left != NULL) {
++ entry = rb_entry(node->rb_left,
++ struct bfq_entity, rb_node);
++ if (!bfq_gt(entry->min_start, st->vtime)) {
++ node = node->rb_left;
++ goto left;
++ }
++ }
++ if (first != NULL)
++ break;
++ node = node->rb_right;
++ }
++
++ BUG_ON(first == NULL && !RB_EMPTY_ROOT(&st->active));
++ return first;
++}
++
++/**
++ * __bfq_lookup_next_entity - return the first eligible entity in @st.
++ * @st: the service tree.
++ *
++ * Update the virtual time in @st and return the first eligible entity
++ * it contains.
++ */
++static struct bfq_entity *__bfq_lookup_next_entity(struct bfq_service_tree *st,
++ bool force)
++{
++ struct bfq_entity *entity, *new_next_in_service = NULL;
++
++ if (RB_EMPTY_ROOT(&st->active))
++ return NULL;
++
++ bfq_update_vtime(st);
++ entity = bfq_first_active_entity(st);
++ BUG_ON(bfq_gt(entity->start, st->vtime));
++
++ /*
++ * If the chosen entity does not match with the sched_data's
++ * next_in_service and we are forcibly serving the IDLE priority
++ * class tree, bubble up budget update.
++ */
++ if (unlikely(force && entity != entity->sched_data->next_in_service)) {
++ new_next_in_service = entity;
++ for_each_entity(new_next_in_service)
++ bfq_update_budget(new_next_in_service);
++ }
++
++ return entity;
++}
++
++/**
++ * bfq_lookup_next_entity - return the first eligible entity in @sd.
++ * @sd: the sched_data.
++ * @extract: if true the returned entity will also be extracted from @sd.
++ *
++ * NOTE: since we cache the next_in_service entity at each level of the
++ * hierarchy, the complexity of the lookup can be decreased with
++ * absolutely no effort, just returning the cached next_in_service value;
++ * we prefer to do full lookups to test the consistency of the data
++ * structures.
++ */
++static struct bfq_entity *bfq_lookup_next_entity(struct bfq_sched_data *sd,
++ int extract,
++ struct bfq_data *bfqd)
++{
++ struct bfq_service_tree *st = sd->service_tree;
++ struct bfq_entity *entity;
++ int i = 0;
++
++ BUG_ON(sd->in_service_entity != NULL);
++
++ if (bfqd != NULL &&
++ jiffies - bfqd->bfq_class_idle_last_service > BFQ_CL_IDLE_TIMEOUT) {
++ entity = __bfq_lookup_next_entity(st + BFQ_IOPRIO_CLASSES - 1,
++ true);
++ if (entity != NULL) {
++ i = BFQ_IOPRIO_CLASSES - 1;
++ bfqd->bfq_class_idle_last_service = jiffies;
++ sd->next_in_service = entity;
++ }
++ }
++ for (; i < BFQ_IOPRIO_CLASSES; i++) {
++ entity = __bfq_lookup_next_entity(st + i, false);
++ if (entity != NULL) {
++ if (extract) {
++ bfq_check_next_in_service(sd, entity);
++ bfq_active_extract(st + i, entity);
++ sd->in_service_entity = entity;
++ sd->next_in_service = NULL;
++ }
++ break;
++ }
++ }
++
++ return entity;
++}
++
++/*
++ * Get next queue for service.
++ */
++static struct bfq_queue *bfq_get_next_queue(struct bfq_data *bfqd)
++{
++ struct bfq_entity *entity = NULL;
++ struct bfq_sched_data *sd;
++ struct bfq_queue *bfqq;
++
++ BUG_ON(bfqd->in_service_queue != NULL);
++
++ if (bfqd->busy_queues == 0)
++ return NULL;
++
++ sd = &bfqd->root_group->sched_data;
++ for (; sd != NULL; sd = entity->my_sched_data) {
++ entity = bfq_lookup_next_entity(sd, 1, bfqd);
++ BUG_ON(entity == NULL);
++ entity->service = 0;
++ }
++
++ bfqq = bfq_entity_to_bfqq(entity);
++ BUG_ON(bfqq == NULL);
++
++ return bfqq;
++}
++
++/*
++ * Forced extraction of the given queue.
++ */
++static void bfq_get_next_queue_forced(struct bfq_data *bfqd,
++ struct bfq_queue *bfqq)
++{
++ struct bfq_entity *entity;
++ struct bfq_sched_data *sd;
++
++ BUG_ON(bfqd->in_service_queue != NULL);
++
++ entity = &bfqq->entity;
++ /*
++ * Bubble up extraction/update from the leaf to the root.
++ */
++ for_each_entity(entity) {
++ sd = entity->sched_data;
++ bfq_update_budget(entity);
++ bfq_update_vtime(bfq_entity_service_tree(entity));
++ bfq_active_extract(bfq_entity_service_tree(entity), entity);
++ sd->in_service_entity = entity;
++ sd->next_in_service = NULL;
++ entity->service = 0;
++ }
++
++ return;
++}
++
++static void __bfq_bfqd_reset_in_service(struct bfq_data *bfqd)
++{
++ if (bfqd->in_service_bic != NULL) {
++ put_io_context(bfqd->in_service_bic->icq.ioc);
++ bfqd->in_service_bic = NULL;
++ }
++
++ bfqd->in_service_queue = NULL;
++ del_timer(&bfqd->idle_slice_timer);
++}
++
++static void bfq_deactivate_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq,
++ int requeue)
++{
++ struct bfq_entity *entity = &bfqq->entity;
++
++ if (bfqq == bfqd->in_service_queue)
++ __bfq_bfqd_reset_in_service(bfqd);
++
++ bfq_deactivate_entity(entity, requeue);
++}
++
++static void bfq_activate_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq)
++{
++ struct bfq_entity *entity = &bfqq->entity;
++
++ bfq_activate_entity(entity);
++}
++
++/*
++ * Called when the bfqq no longer has requests pending, remove it from
++ * the service tree.
++ */
++static void bfq_del_bfqq_busy(struct bfq_data *bfqd, struct bfq_queue *bfqq,
++ int requeue)
++{
++ BUG_ON(!bfq_bfqq_busy(bfqq));
++ BUG_ON(!RB_EMPTY_ROOT(&bfqq->sort_list));
++
++ bfq_log_bfqq(bfqd, bfqq, "del from busy");
++
++ bfq_clear_bfqq_busy(bfqq);
++
++ BUG_ON(bfqd->busy_queues == 0);
++ bfqd->busy_queues--;
++
++ if (!bfqq->dispatched) {
++ bfq_weights_tree_remove(bfqd, &bfqq->entity,
++ &bfqd->queue_weights_tree);
++ if (!blk_queue_nonrot(bfqd->queue)) {
++ BUG_ON(!bfqd->busy_in_flight_queues);
++ bfqd->busy_in_flight_queues--;
++ if (bfq_bfqq_constantly_seeky(bfqq)) {
++ BUG_ON(!bfqd->
++ const_seeky_busy_in_flight_queues);
++ bfqd->const_seeky_busy_in_flight_queues--;
++ }
++ }
++ }
++ if (bfqq->wr_coeff > 1)
++ bfqd->wr_busy_queues--;
++
++ bfq_deactivate_bfqq(bfqd, bfqq, requeue);
++}
++
++/*
++ * Called when an inactive queue receives a new request.
++ */
++static void bfq_add_bfqq_busy(struct bfq_data *bfqd, struct bfq_queue *bfqq)
++{
++ BUG_ON(bfq_bfqq_busy(bfqq));
++ BUG_ON(bfqq == bfqd->in_service_queue);
++
++ bfq_log_bfqq(bfqd, bfqq, "add to busy");
++
++ bfq_activate_bfqq(bfqd, bfqq);
++
++ bfq_mark_bfqq_busy(bfqq);
++ bfqd->busy_queues++;
++
++ if (!bfqq->dispatched) {
++ if (bfqq->wr_coeff == 1)
++ bfq_weights_tree_add(bfqd, &bfqq->entity,
++ &bfqd->queue_weights_tree);
++ if (!blk_queue_nonrot(bfqd->queue)) {
++ bfqd->busy_in_flight_queues++;
++ if (bfq_bfqq_constantly_seeky(bfqq))
++ bfqd->const_seeky_busy_in_flight_queues++;
++ }
++ }
++ if (bfqq->wr_coeff > 1)
++ bfqd->wr_busy_queues++;
++}
+diff --git a/block/bfq.h b/block/bfq.h
+new file mode 100644
+index 0000000..a83e69d
+--- /dev/null
++++ b/block/bfq.h
+@@ -0,0 +1,742 @@
++/*
++ * BFQ-v7r5 for 3.16.0: data structures and common function prototypes.
++ *
++ * Based on ideas and code from CFQ:
++ * Copyright (C) 2003 Jens Axboe <axboe@kernel.dk>
++ *
++ * Copyright (C) 2008 Fabio Checconi <fabio@gandalf.sssup.it>
++ * Paolo Valente <paolo.valente@unimore.it>
++ *
++ * Copyright (C) 2010 Paolo Valente <paolo.valente@unimore.it>
++ */
++
++#ifndef _BFQ_H
++#define _BFQ_H
++
++#include <linux/blktrace_api.h>
++#include <linux/hrtimer.h>
++#include <linux/ioprio.h>
++#include <linux/rbtree.h>
++
++#define BFQ_IOPRIO_CLASSES 3
++#define BFQ_CL_IDLE_TIMEOUT (HZ/5)
++
++#define BFQ_MIN_WEIGHT 1
++#define BFQ_MAX_WEIGHT 1000
++
++#define BFQ_DEFAULT_GRP_WEIGHT 10
++#define BFQ_DEFAULT_GRP_IOPRIO 0
++#define BFQ_DEFAULT_GRP_CLASS IOPRIO_CLASS_BE
++
++struct bfq_entity;
++
++/**
++ * struct bfq_service_tree - per ioprio_class service tree.
++ * @active: tree for active entities (i.e., those backlogged).
++ * @idle: tree for idle entities (i.e., those not backlogged, with V <= F_i).
++ * @first_idle: idle entity with minimum F_i.
++ * @last_idle: idle entity with maximum F_i.
++ * @vtime: scheduler virtual time.
++ * @wsum: scheduler weight sum; active and idle entities contribute to it.
++ *
++ * Each service tree represents a B-WF2Q+ scheduler on its own. Each
++ * ioprio_class has its own independent scheduler, and so its own
++ * bfq_service_tree. All the fields are protected by the queue lock
++ * of the containing bfqd.
++ */
++struct bfq_service_tree {
++ struct rb_root active;
++ struct rb_root idle;
++
++ struct bfq_entity *first_idle;
++ struct bfq_entity *last_idle;
++
++ u64 vtime;
++ unsigned long wsum;
++};
++
++/**
++ * struct bfq_sched_data - multi-class scheduler.
++ * @in_service_entity: entity in service.
++ * @next_in_service: head-of-the-line entity in the scheduler.
++ * @service_tree: array of service trees, one per ioprio_class.
++ *
++ * bfq_sched_data is the basic scheduler queue. It supports three
++ * ioprio_classes, and can be used either as a toplevel queue or as
++ * an intermediate queue on a hierarchical setup.
++ * @next_in_service points to the active entity of the sched_data
++ * service trees that will be scheduled next.
++ *
++ * The supported ioprio_classes are the same as in CFQ, in descending
++ * priority order, IOPRIO_CLASS_RT, IOPRIO_CLASS_BE, IOPRIO_CLASS_IDLE.
++ * Requests from higher priority queues are served before all the
++ * requests from lower priority queues; among requests of the same
++ * queue requests are served according to B-WF2Q+.
++ * All the fields are protected by the queue lock of the containing bfqd.
++ */
++struct bfq_sched_data {
++ struct bfq_entity *in_service_entity;
++ struct bfq_entity *next_in_service;
++ struct bfq_service_tree service_tree[BFQ_IOPRIO_CLASSES];
++};
++
++/**
++ * struct bfq_weight_counter - counter of the number of all active entities
++ * with a given weight.
++ * @weight: weight of the entities that this counter refers to.
++ * @num_active: number of active entities with this weight.
++ * @weights_node: weights tree member (see bfq_data's @queue_weights_tree
++ * and @group_weights_tree).
++ */
++struct bfq_weight_counter {
++ short int weight;
++ unsigned int num_active;
++ struct rb_node weights_node;
++};
++
++/**
++ * struct bfq_entity - schedulable entity.
++ * @rb_node: service_tree member.
++ * @weight_counter: pointer to the weight counter associated with this entity.
++ * @on_st: flag, true if the entity is on a tree (either the active or
++ * the idle one of its service_tree).
++ * @finish: B-WF2Q+ finish timestamp (aka F_i).
++ * @start: B-WF2Q+ start timestamp (aka S_i).
++ * @tree: tree the entity is enqueued into; %NULL if not on a tree.
++ * @min_start: minimum start time of the (active) subtree rooted at
++ * this entity; used for O(log N) lookups into active trees.
++ * @service: service received during the last round of service.
++ * @budget: budget used to calculate F_i; F_i = S_i + @budget / @weight.
++ * @weight: weight of the queue
++ * @parent: parent entity, for hierarchical scheduling.
++ * @my_sched_data: for non-leaf nodes in the cgroup hierarchy, the
++ * associated scheduler queue, %NULL on leaf nodes.
++ * @sched_data: the scheduler queue this entity belongs to.
++ * @ioprio: the ioprio in use.
++ * @new_weight: when a weight change is requested, the new weight value.
++ * @orig_weight: original weight, used to implement weight boosting
++ * @new_ioprio: when an ioprio change is requested, the new ioprio value.
++ * @ioprio_class: the ioprio_class in use.
++ * @new_ioprio_class: when an ioprio_class change is requested, the new
++ * ioprio_class value.
++ * @ioprio_changed: flag, true when the user requested a weight, ioprio or
++ * ioprio_class change.
++ *
++ * A bfq_entity is used to represent either a bfq_queue (leaf node in the
++ * cgroup hierarchy) or a bfq_group into the upper level scheduler. Each
++ * entity belongs to the sched_data of the parent group in the cgroup
++ * hierarchy. Non-leaf entities have also their own sched_data, stored
++ * in @my_sched_data.
++ *
++ * Each entity stores independently its priority values; this would
++ * allow different weights on different devices, but this
++ * functionality is not exported to userspace for now. Priorities and
++ * weights are updated lazily, first storing the new values into the
++ * new_* fields, then setting the @ioprio_changed flag. As soon as
++ * there is a transition in the entity state that allows the priority
++ * update to take place the effective and the requested priority
++ * values are synchronized.
++ *
++ * Unless cgroups are used, the weight value is calculated from the
++ * ioprio to export the same interface as CFQ. When dealing with
++ * ``well-behaved'' queues (i.e., queues that do not spend too much
++ * time to consume their budget and have true sequential behavior, and
++ * when there are no external factors breaking anticipation) the
++ * relative weights at each level of the cgroups hierarchy should be
++ * guaranteed. All the fields are protected by the queue lock of the
++ * containing bfqd.
++ */
++struct bfq_entity {
++ struct rb_node rb_node;
++ struct bfq_weight_counter *weight_counter;
++
++ int on_st;
++
++ u64 finish;
++ u64 start;
++
++ struct rb_root *tree;
++
++ u64 min_start;
++
++ unsigned long service, budget;
++ unsigned short weight, new_weight;
++ unsigned short orig_weight;
++
++ struct bfq_entity *parent;
++
++ struct bfq_sched_data *my_sched_data;
++ struct bfq_sched_data *sched_data;
++
++ unsigned short ioprio, new_ioprio;
++ unsigned short ioprio_class, new_ioprio_class;
++
++ int ioprio_changed;
++};
++
++struct bfq_group;
++
++/**
++ * struct bfq_queue - leaf schedulable entity.
++ * @ref: reference counter.
++ * @bfqd: parent bfq_data.
++ * @new_bfqq: shared bfq_queue if queue is cooperating with
++ * one or more other queues.
++ * @pos_node: request-position tree member (see bfq_data's @rq_pos_tree).
++ * @pos_root: request-position tree root (see bfq_data's @rq_pos_tree).
++ * @sort_list: sorted list of pending requests.
++ * @next_rq: if fifo isn't expired, next request to serve.
++ * @queued: nr of requests queued in @sort_list.
++ * @allocated: currently allocated requests.
++ * @meta_pending: pending metadata requests.
++ * @fifo: fifo list of requests in sort_list.
++ * @entity: entity representing this queue in the scheduler.
++ * @max_budget: maximum budget allowed from the feedback mechanism.
++ * @budget_timeout: budget expiration (in jiffies).
++ * @dispatched: number of requests on the dispatch list or inside driver.
++ * @flags: status flags.
++ * @bfqq_list: node for active/idle bfqq list inside our bfqd.
++ * @seek_samples: number of seeks sampled
++ * @seek_total: sum of the distances of the seeks sampled
++ * @seek_mean: mean seek distance
++ * @last_request_pos: position of the last request enqueued
++ * @requests_within_timer: number of consecutive pairs of request completion
++ * and arrival, such that the queue becomes idle
++ * after the completion, but the next request arrives
++ * within an idle time slice; used only if the queue's
++ * IO_bound has been cleared.
++ * @pid: pid of the process owning the queue, used for logging purposes.
++ * @last_wr_start_finish: start time of the current weight-raising period if
++ * the @bfq-queue is being weight-raised, otherwise
++ * finish time of the last weight-raising period
++ * @wr_cur_max_time: current max raising time for this queue
++ * @soft_rt_next_start: minimum time instant such that, only if a new
++ * request is enqueued after this time instant in an
++ * idle @bfq_queue with no outstanding requests, then
++ * the task associated with the queue it is deemed as
++ * soft real-time (see the comments to the function
++ * bfq_bfqq_softrt_next_start()).
++ * @last_idle_bklogged: time of the last transition of the @bfq_queue from
++ * idle to backlogged
++ * @service_from_backlogged: cumulative service received from the @bfq_queue
++ * since the last transition from idle to
++ * backlogged
++ *
++ * A bfq_queue is a leaf request queue; it can be associated with one or
++ * more io_contexts if it is async or shared between cooperating processes. @cgroup
++ * holds a reference to the cgroup, to be sure that it does not disappear while
++ * a bfqq still references it (mostly to avoid races between request issuing and
++ * task migration followed by cgroup destruction).
++ * All the fields are protected by the queue lock of the containing bfqd.
++ */
++struct bfq_queue {
++ atomic_t ref;
++ struct bfq_data *bfqd;
++
++ /* fields for cooperating queues handling */
++ struct bfq_queue *new_bfqq;
++ struct rb_node pos_node;
++ struct rb_root *pos_root;
++
++ struct rb_root sort_list;
++ struct request *next_rq;
++ int queued[2];
++ int allocated[2];
++ int meta_pending;
++ struct list_head fifo;
++
++ struct bfq_entity entity;
++
++ unsigned long max_budget;
++ unsigned long budget_timeout;
++
++ int dispatched;
++
++ unsigned int flags;
++
++ struct list_head bfqq_list;
++
++ unsigned int seek_samples;
++ u64 seek_total;
++ sector_t seek_mean;
++ sector_t last_request_pos;
++
++ unsigned int requests_within_timer;
++
++ pid_t pid;
++
++ /* weight-raising fields */
++ unsigned long wr_cur_max_time;
++ unsigned long soft_rt_next_start;
++ unsigned long last_wr_start_finish;
++ unsigned int wr_coeff;
++ unsigned long last_idle_bklogged;
++ unsigned long service_from_backlogged;
++};
++
++/**
++ * struct bfq_ttime - per process thinktime stats.
++ * @ttime_total: total process thinktime
++ * @ttime_samples: number of thinktime samples
++ * @ttime_mean: average process thinktime
++ */
++struct bfq_ttime {
++ unsigned long last_end_request;
++
++ unsigned long ttime_total;
++ unsigned long ttime_samples;
++ unsigned long ttime_mean;
++};
++
++/**
++ * struct bfq_io_cq - per (request_queue, io_context) structure.
++ * @icq: associated io_cq structure
++ * @bfqq: array of two process queues, the sync and the async
++ * @ttime: associated @bfq_ttime struct
++ */
++struct bfq_io_cq {
++ struct io_cq icq; /* must be the first member */
++ struct bfq_queue *bfqq[2];
++ struct bfq_ttime ttime;
++ int ioprio;
++};
++
++enum bfq_device_speed {
++ BFQ_BFQD_FAST,
++ BFQ_BFQD_SLOW,
++};
++
++/**
++ * struct bfq_data - per device data structure.
++ * @queue: request queue for the managed device.
++ * @root_group: root bfq_group for the device.
++ * @rq_pos_tree: rbtree sorted by next_request position, used when
++ * determining if two or more queues have interleaving
++ * requests (see bfq_close_cooperator()).
++ * @active_numerous_groups: number of bfq_groups containing more than one
++ * active @bfq_entity.
++ * @queue_weights_tree: rbtree of weight counters of @bfq_queues, sorted by
++ * weight. Used to keep track of whether all @bfq_queues
++ * have the same weight. The tree contains one counter
++ * for each distinct weight associated to some active
++ * and not weight-raised @bfq_queue (see the comments to
++ * the functions bfq_weights_tree_[add|remove] for
++ * further details).
++ * @group_weights_tree: rbtree of non-queue @bfq_entity weight counters, sorted
++ * by weight. Used to keep track of whether all
++ * @bfq_groups have the same weight. The tree contains
++ * one counter for each distinct weight associated to
++ * some active @bfq_group (see the comments to the
++ * functions bfq_weights_tree_[add|remove] for further
++ * details).
++ * @busy_queues: number of bfq_queues containing requests (including the
++ * queue in service, even if it is idling).
++ * @busy_in_flight_queues: number of @bfq_queues containing pending or
++ * in-flight requests, plus the @bfq_queue in
++ * service, even if idle but waiting for the
++ * possible arrival of its next sync request. This
++ * field is updated only if the device is rotational,
++ * but used only if the device is also NCQ-capable.
++ * The reason why the field is updated also for non-
++ * NCQ-capable rotational devices is related to the
++ * fact that the value of @hw_tag may be set also
++ * later than when busy_in_flight_queues may need to
++ * be incremented for the first time(s). Taking also
++ * this possibility into account, to avoid unbalanced
++ * increments/decrements, would imply more overhead
++ * than just updating busy_in_flight_queues
++ * regardless of the value of @hw_tag.
++ * @const_seeky_busy_in_flight_queues: number of constantly-seeky @bfq_queues
++ * (that is, seeky queues that expired
++ * for budget timeout at least once)
++ * containing pending or in-flight
++ * requests, including the in-service
++ * @bfq_queue if constantly seeky. This
++ * field is updated only if the device
++ * is rotational, but used only if the
++ * device is also NCQ-capable (see the
++ * comments to @busy_in_flight_queues).
++ * @wr_busy_queues: number of weight-raised busy @bfq_queues.
++ * @queued: number of queued requests.
++ * @rq_in_driver: number of requests dispatched and waiting for completion.
++ * @sync_flight: number of sync requests in the driver.
++ * @max_rq_in_driver: max number of reqs in driver in the last
++ * @hw_tag_samples completed requests.
++ * @hw_tag_samples: nr of samples used to calculate hw_tag.
++ * @hw_tag: flag set to one if the driver is showing a queueing behavior.
++ * @budgets_assigned: number of budgets assigned.
++ * @idle_slice_timer: timer set when idling for the next sequential request
++ * from the queue in service.
++ * @unplug_work: delayed work to restart dispatching on the request queue.
++ * @in_service_queue: bfq_queue in service.
++ * @in_service_bic: bfq_io_cq (bic) associated with the @in_service_queue.
++ * @last_position: on-disk position of the last served request.
++ * @last_budget_start: beginning of the last budget.
++ * @last_idling_start: beginning of the last idle slice.
++ * @peak_rate: peak transfer rate observed for a budget.
++ * @peak_rate_samples: number of samples used to calculate @peak_rate.
++ * @bfq_max_budget: maximum budget allotted to a bfq_queue before
++ * rescheduling.
++ * @group_list: list of all the bfq_groups active on the device.
++ * @active_list: list of all the bfq_queues active on the device.
++ * @idle_list: list of all the bfq_queues idle on the device.
++ * @bfq_quantum: max number of requests dispatched per dispatch round.
++ * @bfq_fifo_expire: timeout for async/sync requests; when it expires
++ * requests are served in fifo order.
++ * @bfq_back_penalty: weight of backward seeks wrt forward ones.
++ * @bfq_back_max: maximum allowed backward seek.
++ * @bfq_slice_idle: maximum idling time.
++ * @bfq_user_max_budget: user-configured max budget value
++ * (0 for auto-tuning).
++ * @bfq_max_budget_async_rq: maximum budget (in nr of requests) allotted to
++ * async queues.
++ * @bfq_timeout: timeout for bfq_queues to consume their budget; used
++ * to prevent seeky queues from imposing long latencies on
++ * well-behaved ones (this also implies that seeky queues cannot
++ * receive guarantees in the service domain; after a timeout
++ * they are charged for the whole allocated budget, to try
++ * to preserve a behavior reasonably fair among them, but
++ * without service-domain guarantees).
++ * @bfq_coop_thresh: number of queue merges after which a @bfq_queue is
++ * no more granted any weight-raising.
++ * @bfq_failed_cooperations: number of consecutive failed cooperation
++ * chances after which weight-raising is restored
++ * to a queue subject to more than bfq_coop_thresh
++ * queue merges.
++ * @bfq_requests_within_timer: number of consecutive requests that must be
++ * issued within the idle time slice to set
++ * again idling to a queue which was marked as
++ * non-I/O-bound (see the definition of the
++ * IO_bound flag for further details).
++ * @bfq_wr_coeff: Maximum factor by which the weight of a weight-raised
++ * queue is multiplied
++ * @bfq_wr_max_time: maximum duration of a weight-raising period (jiffies)
++ * @bfq_wr_rt_max_time: maximum duration for soft real-time processes
++ * @bfq_wr_min_idle_time: minimum idle period after which weight-raising
++ * may be reactivated for a queue (in jiffies)
++ * @bfq_wr_min_inter_arr_async: minimum period between request arrivals
++ * after which weight-raising may be
++ * reactivated for an already busy queue
++ * (in jiffies)
++ * @bfq_wr_max_softrt_rate: max service-rate for a soft real-time queue,
++ * sectors per seconds
++ * @RT_prod: cached value of the product R*T used for computing the maximum
++ * duration of the weight raising automatically
++ * @device_speed: device-speed class for the low-latency heuristic
++ * @oom_bfqq: fallback dummy bfqq for extreme OOM conditions
++ *
++ * All the fields are protected by the @queue lock.
++ */
++struct bfq_data {
++ struct request_queue *queue;
++
++ struct bfq_group *root_group;
++ struct rb_root rq_pos_tree;
++
++#ifdef CONFIG_CGROUP_BFQIO
++ int active_numerous_groups;
++#endif
++
++ struct rb_root queue_weights_tree;
++ struct rb_root group_weights_tree;
++
++ int busy_queues;
++ int busy_in_flight_queues;
++ int const_seeky_busy_in_flight_queues;
++ int wr_busy_queues;
++ int queued;
++ int rq_in_driver;
++ int sync_flight;
++
++ int max_rq_in_driver;
++ int hw_tag_samples;
++ int hw_tag;
++
++ int budgets_assigned;
++
++ struct timer_list idle_slice_timer;
++ struct work_struct unplug_work;
++
++ struct bfq_queue *in_service_queue;
++ struct bfq_io_cq *in_service_bic;
++
++ sector_t last_position;
++
++ ktime_t last_budget_start;
++ ktime_t last_idling_start;
++ int peak_rate_samples;
++ u64 peak_rate;
++ unsigned long bfq_max_budget;
++
++ struct hlist_head group_list;
++ struct list_head active_list;
++ struct list_head idle_list;
++
++ unsigned int bfq_quantum;
++ unsigned int bfq_fifo_expire[2];
++ unsigned int bfq_back_penalty;
++ unsigned int bfq_back_max;
++ unsigned int bfq_slice_idle;
++ u64 bfq_class_idle_last_service;
++
++ unsigned int bfq_user_max_budget;
++ unsigned int bfq_max_budget_async_rq;
++ unsigned int bfq_timeout[2];
++
++ unsigned int bfq_coop_thresh;
++ unsigned int bfq_failed_cooperations;
++ unsigned int bfq_requests_within_timer;
++
++ bool low_latency;
++
++ /* parameters of the low_latency heuristics */
++ unsigned int bfq_wr_coeff;
++ unsigned int bfq_wr_max_time;
++ unsigned int bfq_wr_rt_max_time;
++ unsigned int bfq_wr_min_idle_time;
++ unsigned long bfq_wr_min_inter_arr_async;
++ unsigned int bfq_wr_max_softrt_rate;
++ u64 RT_prod;
++ enum bfq_device_speed device_speed;
++
++ struct bfq_queue oom_bfqq;
++};
++
++enum bfqq_state_flags {
++ BFQ_BFQQ_FLAG_busy = 0, /* has requests or is in service */
++ BFQ_BFQQ_FLAG_wait_request, /* waiting for a request */
++ BFQ_BFQQ_FLAG_must_alloc, /* must be allowed rq alloc */
++ BFQ_BFQQ_FLAG_fifo_expire, /* FIFO checked in this slice */
++ BFQ_BFQQ_FLAG_idle_window, /* slice idling enabled */
++ BFQ_BFQQ_FLAG_prio_changed, /* task priority has changed */
++ BFQ_BFQQ_FLAG_sync, /* synchronous queue */
++ BFQ_BFQQ_FLAG_budget_new, /* no completion with this budget */
++ BFQ_BFQQ_FLAG_IO_bound, /*
++ * bfqq has timed-out at least once
++ * having consumed at most 2/10 of
++ * its budget
++ */
++ BFQ_BFQQ_FLAG_constantly_seeky, /*
++ * bfqq has proved to be slow and
++ * seeky until budget timeout
++ */
++ BFQ_BFQQ_FLAG_softrt_update, /*
++ * may need softrt-next-start
++ * update
++ */
++ BFQ_BFQQ_FLAG_coop, /* bfqq is shared */
++ BFQ_BFQQ_FLAG_split_coop, /* shared bfqq will be split */
++};
++
++#define BFQ_BFQQ_FNS(name) \
++static inline void bfq_mark_bfqq_##name(struct bfq_queue *bfqq) \
++{ \
++ (bfqq)->flags |= (1 << BFQ_BFQQ_FLAG_##name); \
++} \
++static inline void bfq_clear_bfqq_##name(struct bfq_queue *bfqq) \
++{ \
++ (bfqq)->flags &= ~(1 << BFQ_BFQQ_FLAG_##name); \
++} \
++static inline int bfq_bfqq_##name(const struct bfq_queue *bfqq) \
++{ \
++ return ((bfqq)->flags & (1 << BFQ_BFQQ_FLAG_##name)) != 0; \
++}
++
++BFQ_BFQQ_FNS(busy);
++BFQ_BFQQ_FNS(wait_request);
++BFQ_BFQQ_FNS(must_alloc);
++BFQ_BFQQ_FNS(fifo_expire);
++BFQ_BFQQ_FNS(idle_window);
++BFQ_BFQQ_FNS(prio_changed);
++BFQ_BFQQ_FNS(sync);
++BFQ_BFQQ_FNS(budget_new);
++BFQ_BFQQ_FNS(IO_bound);
++BFQ_BFQQ_FNS(constantly_seeky);
++BFQ_BFQQ_FNS(coop);
++BFQ_BFQQ_FNS(split_coop);
++BFQ_BFQQ_FNS(softrt_update);
++#undef BFQ_BFQQ_FNS
++
++/* Logging facilities. */
++#define bfq_log_bfqq(bfqd, bfqq, fmt, args...) \
++ blk_add_trace_msg((bfqd)->queue, "bfq%d " fmt, (bfqq)->pid, ##args)
++
++#define bfq_log(bfqd, fmt, args...) \
++ blk_add_trace_msg((bfqd)->queue, "bfq " fmt, ##args)
++
++/* Expiration reasons. */
++enum bfqq_expiration {
++ BFQ_BFQQ_TOO_IDLE = 0, /*
++ * queue has been idling for
++ * too long
++ */
++ BFQ_BFQQ_BUDGET_TIMEOUT, /* budget took too long to be used */
++ BFQ_BFQQ_BUDGET_EXHAUSTED, /* budget consumed */
++ BFQ_BFQQ_NO_MORE_REQUESTS, /* the queue has no more requests */
++};
++
++#ifdef CONFIG_CGROUP_BFQIO
++/**
++ * struct bfq_group - per (device, cgroup) data structure.
++ * @entity: schedulable entity to insert into the parent group sched_data.
++ * @sched_data: own sched_data, to contain child entities (they may be
++ * both bfq_queues and bfq_groups).
++ * @group_node: node to be inserted into the bfqio_cgroup->group_data
++ * list of the containing cgroup's bfqio_cgroup.
++ * @bfqd_node: node to be inserted into the @bfqd->group_list list
++ * of the groups active on the same device; used for cleanup.
++ * @bfqd: the bfq_data for the device this group acts upon.
++ * @async_bfqq: array of async queues for all the tasks belonging to
++ * the group, one queue per ioprio value per ioprio_class,
++ * except for the idle class that has only one queue.
++ * @async_idle_bfqq: async queue for the idle class (ioprio is ignored).
++ * @my_entity: pointer to @entity, %NULL for the toplevel group; used
++ * to avoid too many special cases during group creation/
++ * migration.
++ * @active_entities: number of active entities belonging to the group;
++ * unused for the root group. Used to know whether there
++ * are groups with more than one active @bfq_entity
++ * (see the comments to the function
++ * bfq_bfqq_must_not_expire()).
++ *
++ * Each (device, cgroup) pair has its own bfq_group, i.e., for each cgroup
++ * there is a set of bfq_groups, each one collecting the lower-level
++ * entities belonging to the group that are acting on the same device.
++ *
++ * Locking works as follows:
++ * o @group_node is protected by the bfqio_cgroup lock, and is accessed
++ * via RCU from its readers.
++ * o @bfqd is protected by the queue lock, RCU is used to access it
++ * from the readers.
++ * o All the other fields are protected by the @bfqd queue lock.
++ */
++struct bfq_group {
++ struct bfq_entity entity;
++ struct bfq_sched_data sched_data;
++
++ struct hlist_node group_node;
++ struct hlist_node bfqd_node;
++
++ void *bfqd;
++
++ struct bfq_queue *async_bfqq[2][IOPRIO_BE_NR];
++ struct bfq_queue *async_idle_bfqq;
++
++ struct bfq_entity *my_entity;
++
++ int active_entities;
++};
++
++/**
++ * struct bfqio_cgroup - bfq cgroup data structure.
++ * @css: subsystem state for bfq in the containing cgroup.
++ * @online: flag marked when the subsystem is inserted.
++ * @weight: cgroup weight.
++ * @ioprio: cgroup ioprio.
++ * @ioprio_class: cgroup ioprio_class.
++ * @lock: spinlock that protects @ioprio, @ioprio_class and @group_data.
++ * @group_data: list containing the bfq_group belonging to this cgroup.
++ *
++ * @group_data is accessed using RCU, with @lock protecting the updates,
++ * @ioprio and @ioprio_class are protected by @lock.
++ */
++struct bfqio_cgroup {
++ struct cgroup_subsys_state css;
++ bool online;
++
++ unsigned short weight, ioprio, ioprio_class;
++
++ spinlock_t lock;
++ struct hlist_head group_data;
++};
++#else
++struct bfq_group {
++ struct bfq_sched_data sched_data;
++
++ struct bfq_queue *async_bfqq[2][IOPRIO_BE_NR];
++ struct bfq_queue *async_idle_bfqq;
++};
++#endif
++
++static inline struct bfq_service_tree *
++bfq_entity_service_tree(struct bfq_entity *entity)
++{
++ struct bfq_sched_data *sched_data = entity->sched_data;
++ unsigned int idx = entity->ioprio_class - 1;
++
++ BUG_ON(idx >= BFQ_IOPRIO_CLASSES);
++ BUG_ON(sched_data == NULL);
++
++ return sched_data->service_tree + idx;
++}
++
++static inline struct bfq_queue *bic_to_bfqq(struct bfq_io_cq *bic,
++ int is_sync)
++{
++ return bic->bfqq[!!is_sync];
++}
++
++static inline void bic_set_bfqq(struct bfq_io_cq *bic,
++ struct bfq_queue *bfqq, int is_sync)
++{
++ bic->bfqq[!!is_sync] = bfqq;
++}
++
++static inline struct bfq_data *bic_to_bfqd(struct bfq_io_cq *bic)
++{
++ return bic->icq.q->elevator->elevator_data;
++}
++
++/**
++ * bfq_get_bfqd_locked - get a lock to a bfqd using a RCU protected pointer.
++ * @ptr: a pointer to a bfqd.
++ * @flags: storage for the flags to be saved.
++ *
++ * This function allows bfqg->bfqd to be protected by the
++ * queue lock of the bfqd they reference; the pointer is dereferenced
++ * under RCU, so the storage for bfqd is assured to be safe as long
++ * as the RCU read side critical section does not end. After the
++ * bfqd->queue->queue_lock is taken the pointer is rechecked, to be
++ * sure that no other writer accessed it. If we raced with a writer,
++ * the function returns NULL, with the queue unlocked, otherwise it
++ * returns the dereferenced pointer, with the queue locked.
++ */
++static inline struct bfq_data *bfq_get_bfqd_locked(void **ptr,
++ unsigned long *flags)
++{
++ struct bfq_data *bfqd;
++
++ rcu_read_lock();
++ bfqd = rcu_dereference(*(struct bfq_data **)ptr);
++
++ if (bfqd != NULL) {
++ spin_lock_irqsave(bfqd->queue->queue_lock, *flags);
++ if (*ptr == bfqd)
++ goto out;
++ spin_unlock_irqrestore(bfqd->queue->queue_lock, *flags);
++ }
++
++ bfqd = NULL;
++out:
++ rcu_read_unlock();
++ return bfqd;
++}
++
++static inline void bfq_put_bfqd_unlock(struct bfq_data *bfqd,
++ unsigned long *flags)
++{
++ spin_unlock_irqrestore(bfqd->queue->queue_lock, *flags);
++}
++
++static void bfq_changed_ioprio(struct bfq_io_cq *bic);
++static void bfq_put_queue(struct bfq_queue *bfqq);
++static void bfq_dispatch_insert(struct request_queue *q, struct request *rq);
++static struct bfq_queue *bfq_get_queue(struct bfq_data *bfqd,
++ struct bfq_group *bfqg, int is_sync,
++ struct bfq_io_cq *bic, gfp_t gfp_mask);
++static void bfq_end_wr_async_queues(struct bfq_data *bfqd,
++ struct bfq_group *bfqg);
++static void bfq_put_async_queues(struct bfq_data *bfqd, struct bfq_group *bfqg);
++static void bfq_exit_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq);
++
++#endif /* _BFQ_H */
+--
+2.0.3
+
diff --git a/5003_BFQ-3-block-add-Early-Queue-Merge-EQM-v7r5-for-3.16.0.patch b/5003_BFQ-3-block-add-Early-Queue-Merge-EQM-v7r5-for-3.16.0.patch
new file mode 100644
index 0000000..e606f5d
--- /dev/null
+++ b/5003_BFQ-3-block-add-Early-Queue-Merge-EQM-v7r5-for-3.16.0.patch
@@ -0,0 +1,1188 @@
+From 5b290be286aa74051b4b77a216032b771ceadd23 Mon Sep 17 00:00:00 2001
+From: Mauro Andreolini <mauro.andreolini@unimore.it>
+Date: Wed, 18 Jun 2014 17:38:07 +0200
+Subject: [PATCH 3/3] block, bfq: add Early Queue Merge (EQM) to BFQ-v7r5 for
+ 3.16.0
+
+A set of processes may happen to perform interleaved reads, i.e., requests
+whose union would give rise to a sequential read pattern. There are two
+typical cases: in the first case, processes read fixed-size chunks of
+data at a fixed distance from each other, while in the second case processes
+may read variable-size chunks at variable distances. The latter case occurs
+for example with QEMU, which splits the I/O generated by the guest into
+multiple chunks, and lets these chunks be served by a pool of cooperating
+processes, iteratively assigning the next chunk of I/O to the first
+available process. CFQ uses actual queue merging for the first type of
+processes, whereas it uses preemption to get a sequential read pattern out
+of the read requests performed by the second type of processes. In the end
+it uses two different mechanisms to achieve the same goal: boosting the
+throughput with interleaved I/O.
+
+This patch introduces Early Queue Merge (EQM), a unified mechanism to get a
+sequential read pattern with both types of processes. The main idea is
+checking newly arrived requests against the next request of the active queue
+both in case of actual request insert and in case of request merge. By doing
+so, both the types of processes can be handled by just merging their queues.
+EQM is then simpler and more compact than the pair of mechanisms used in
+CFQ.
+
+Finally, EQM also preserves the typical low-latency properties of BFQ, by
+properly restoring the weight-raising state of a queue when it gets back to
+a non-merged state.
+
+Signed-off-by: Mauro Andreolini <mauro.andreolini@unimore.it>
+Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
+Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
+---
+ block/bfq-iosched.c | 736 ++++++++++++++++++++++++++++++++++++----------------
+ block/bfq-sched.c | 28 --
+ block/bfq.h | 46 +++-
+ 3 files changed, 556 insertions(+), 254 deletions(-)
+
+diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
+index 0a0891b..d1d8e67 100644
+--- a/block/bfq-iosched.c
++++ b/block/bfq-iosched.c
+@@ -571,6 +571,57 @@ static inline unsigned int bfq_wr_duration(struct bfq_data *bfqd)
+ return dur;
+ }
+
++static inline unsigned
++bfq_bfqq_cooperations(struct bfq_queue *bfqq)
++{
++ return bfqq->bic ? bfqq->bic->cooperations : 0;
++}
++
++static inline void
++bfq_bfqq_resume_state(struct bfq_queue *bfqq, struct bfq_io_cq *bic)
++{
++ if (bic->saved_idle_window)
++ bfq_mark_bfqq_idle_window(bfqq);
++ else
++ bfq_clear_bfqq_idle_window(bfqq);
++ if (bic->saved_IO_bound)
++ bfq_mark_bfqq_IO_bound(bfqq);
++ else
++ bfq_clear_bfqq_IO_bound(bfqq);
++ if (bic->wr_time_left && bfqq->bfqd->low_latency &&
++ bic->cooperations < bfqq->bfqd->bfq_coop_thresh) {
++ /*
++ * Start a weight raising period with the duration given by
++ * the wr_time_left snapshot.
++ */
++ if (bfq_bfqq_busy(bfqq))
++ bfqq->bfqd->wr_busy_queues++;
++ bfqq->wr_coeff = bfqq->bfqd->bfq_wr_coeff;
++ bfqq->wr_cur_max_time = bic->wr_time_left;
++ bfqq->last_wr_start_finish = jiffies;
++ bfqq->entity.ioprio_changed = 1;
++ }
++ /*
++ * Clear wr_time_left to prevent bfq_bfqq_save_state() from
++ * getting confused about the queue's need of a weight-raising
++ * period.
++ */
++ bic->wr_time_left = 0;
++}
++
++/*
++ * Must be called with the queue_lock held.
++ */
++static int bfqq_process_refs(struct bfq_queue *bfqq)
++{
++ int process_refs, io_refs;
++
++ io_refs = bfqq->allocated[READ] + bfqq->allocated[WRITE];
++ process_refs = atomic_read(&bfqq->ref) - io_refs - bfqq->entity.on_st;
++ BUG_ON(process_refs < 0);
++ return process_refs;
++}
++
+ static void bfq_add_request(struct request *rq)
+ {
+ struct bfq_queue *bfqq = RQ_BFQQ(rq);
+@@ -602,8 +653,11 @@ static void bfq_add_request(struct request *rq)
+
+ if (!bfq_bfqq_busy(bfqq)) {
+ int soft_rt = bfqd->bfq_wr_max_softrt_rate > 0 &&
++ bfq_bfqq_cooperations(bfqq) < bfqd->bfq_coop_thresh &&
+ time_is_before_jiffies(bfqq->soft_rt_next_start);
+- idle_for_long_time = time_is_before_jiffies(
++ idle_for_long_time = bfq_bfqq_cooperations(bfqq) <
++ bfqd->bfq_coop_thresh &&
++ time_is_before_jiffies(
+ bfqq->budget_timeout +
+ bfqd->bfq_wr_min_idle_time);
+ entity->budget = max_t(unsigned long, bfqq->max_budget,
+@@ -624,11 +678,20 @@ static void bfq_add_request(struct request *rq)
+ if (!bfqd->low_latency)
+ goto add_bfqq_busy;
+
++ if (bfq_bfqq_just_split(bfqq))
++ goto set_ioprio_changed;
++
+ /*
+- * If the queue is not being boosted and has been idle
+- * for enough time, start a weight-raising period
++ * If the queue:
++ * - is not being boosted,
++ * - has been idle for enough time,
++ * - is not a sync queue or is linked to a bfq_io_cq (it is
++ * shared "for its nature" or it is not shared and its
++ * requests have not been redirected to a shared queue)
++ * start a weight-raising period.
+ */
+- if (old_wr_coeff == 1 && (idle_for_long_time || soft_rt)) {
++ if (old_wr_coeff == 1 && (idle_for_long_time || soft_rt) &&
++ (!bfq_bfqq_sync(bfqq) || bfqq->bic != NULL)) {
+ bfqq->wr_coeff = bfqd->bfq_wr_coeff;
+ if (idle_for_long_time)
+ bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
+@@ -642,9 +705,11 @@ static void bfq_add_request(struct request *rq)
+ } else if (old_wr_coeff > 1) {
+ if (idle_for_long_time)
+ bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
+- else if (bfqq->wr_cur_max_time ==
+- bfqd->bfq_wr_rt_max_time &&
+- !soft_rt) {
++ else if (bfq_bfqq_cooperations(bfqq) >=
++ bfqd->bfq_coop_thresh ||
++ (bfqq->wr_cur_max_time ==
++ bfqd->bfq_wr_rt_max_time &&
++ !soft_rt)) {
+ bfqq->wr_coeff = 1;
+ bfq_log_bfqq(bfqd, bfqq,
+ "wrais ending at %lu, rais_max_time %u",
+@@ -660,18 +725,18 @@ static void bfq_add_request(struct request *rq)
+ /*
+ *
+ * The remaining weight-raising time is lower
+- * than bfqd->bfq_wr_rt_max_time, which
+- * means that the application is enjoying
+- * weight raising either because deemed soft-
+- * rt in the near past, or because deemed
+- * interactive a long ago. In both cases,
+- * resetting now the current remaining weight-
+- * raising time for the application to the
+- * weight-raising duration for soft rt
+- * applications would not cause any latency
+- * increase for the application (as the new
+- * duration would be higher than the remaining
+- * time).
++ * than bfqd->bfq_wr_rt_max_time, which means
++ * that the application is enjoying weight
++ * raising either because deemed soft-rt in
++ * the near past, or because deemed interactive
++ * long ago.
++ * In both cases, resetting now the current
++ * remaining weight-raising time for the
++ * application to the weight-raising duration
++ * for soft rt applications would not cause any
++ * latency increase for the application (as the
++ * new duration would be higher than the
++ * remaining time).
+ *
+ * In addition, the application is now meeting
+ * the requirements for being deemed soft rt.
+@@ -706,6 +771,7 @@ static void bfq_add_request(struct request *rq)
+ bfqd->bfq_wr_rt_max_time;
+ }
+ }
++set_ioprio_changed:
+ if (old_wr_coeff != bfqq->wr_coeff)
+ entity->ioprio_changed = 1;
+ add_bfqq_busy:
+@@ -918,90 +984,35 @@ static void bfq_end_wr(struct bfq_data *bfqd)
+ spin_unlock_irq(bfqd->queue->queue_lock);
+ }
+
+-static int bfq_allow_merge(struct request_queue *q, struct request *rq,
+- struct bio *bio)
++static inline sector_t bfq_io_struct_pos(void *io_struct, bool request)
+ {
+- struct bfq_data *bfqd = q->elevator->elevator_data;
+- struct bfq_io_cq *bic;
+- struct bfq_queue *bfqq;
+-
+- /*
+- * Disallow merge of a sync bio into an async request.
+- */
+- if (bfq_bio_sync(bio) && !rq_is_sync(rq))
+- return 0;
+-
+- /*
+- * Lookup the bfqq that this bio will be queued with. Allow
+- * merge only if rq is queued there.
+- * Queue lock is held here.
+- */
+- bic = bfq_bic_lookup(bfqd, current->io_context);
+- if (bic == NULL)
+- return 0;
+-
+- bfqq = bic_to_bfqq(bic, bfq_bio_sync(bio));
+- return bfqq == RQ_BFQQ(rq);
+-}
+-
+-static void __bfq_set_in_service_queue(struct bfq_data *bfqd,
+- struct bfq_queue *bfqq)
+-{
+- if (bfqq != NULL) {
+- bfq_mark_bfqq_must_alloc(bfqq);
+- bfq_mark_bfqq_budget_new(bfqq);
+- bfq_clear_bfqq_fifo_expire(bfqq);
+-
+- bfqd->budgets_assigned = (bfqd->budgets_assigned*7 + 256) / 8;
+-
+- bfq_log_bfqq(bfqd, bfqq,
+- "set_in_service_queue, cur-budget = %lu",
+- bfqq->entity.budget);
+- }
+-
+- bfqd->in_service_queue = bfqq;
+-}
+-
+-/*
+- * Get and set a new queue for service.
+- */
+-static struct bfq_queue *bfq_set_in_service_queue(struct bfq_data *bfqd,
+- struct bfq_queue *bfqq)
+-{
+- if (!bfqq)
+- bfqq = bfq_get_next_queue(bfqd);
++ if (request)
++ return blk_rq_pos(io_struct);
+ else
+- bfq_get_next_queue_forced(bfqd, bfqq);
+-
+- __bfq_set_in_service_queue(bfqd, bfqq);
+- return bfqq;
++ return ((struct bio *)io_struct)->bi_iter.bi_sector;
+ }
+
+-static inline sector_t bfq_dist_from_last(struct bfq_data *bfqd,
+- struct request *rq)
++static inline sector_t bfq_dist_from(sector_t pos1,
++ sector_t pos2)
+ {
+- if (blk_rq_pos(rq) >= bfqd->last_position)
+- return blk_rq_pos(rq) - bfqd->last_position;
++ if (pos1 >= pos2)
++ return pos1 - pos2;
+ else
+- return bfqd->last_position - blk_rq_pos(rq);
++ return pos2 - pos1;
+ }
+
+-/*
+- * Return true if bfqq has no request pending and rq is close enough to
+- * bfqd->last_position, or if rq is closer to bfqd->last_position than
+- * bfqq->next_rq
+- */
+-static inline int bfq_rq_close(struct bfq_data *bfqd, struct request *rq)
++static inline int bfq_rq_close_to_sector(void *io_struct, bool request,
++ sector_t sector)
+ {
+- return bfq_dist_from_last(bfqd, rq) <= BFQQ_SEEK_THR;
++ return bfq_dist_from(bfq_io_struct_pos(io_struct, request), sector) <=
++ BFQQ_SEEK_THR;
+ }
+
+-static struct bfq_queue *bfqq_close(struct bfq_data *bfqd)
++static struct bfq_queue *bfqq_close(struct bfq_data *bfqd, sector_t sector)
+ {
+ struct rb_root *root = &bfqd->rq_pos_tree;
+ struct rb_node *parent, *node;
+ struct bfq_queue *__bfqq;
+- sector_t sector = bfqd->last_position;
+
+ if (RB_EMPTY_ROOT(root))
+ return NULL;
+@@ -1020,7 +1031,7 @@ static struct bfq_queue *bfqq_close(struct bfq_data *bfqd)
+ * next_request position).
+ */
+ __bfqq = rb_entry(parent, struct bfq_queue, pos_node);
+- if (bfq_rq_close(bfqd, __bfqq->next_rq))
++ if (bfq_rq_close_to_sector(__bfqq->next_rq, true, sector))
+ return __bfqq;
+
+ if (blk_rq_pos(__bfqq->next_rq) < sector)
+@@ -1031,7 +1042,7 @@ static struct bfq_queue *bfqq_close(struct bfq_data *bfqd)
+ return NULL;
+
+ __bfqq = rb_entry(node, struct bfq_queue, pos_node);
+- if (bfq_rq_close(bfqd, __bfqq->next_rq))
++ if (bfq_rq_close_to_sector(__bfqq->next_rq, true, sector))
+ return __bfqq;
+
+ return NULL;
+@@ -1040,14 +1051,12 @@ static struct bfq_queue *bfqq_close(struct bfq_data *bfqd)
+ /*
+ * bfqd - obvious
+ * cur_bfqq - passed in so that we don't decide that the current queue
+- * is closely cooperating with itself.
+- *
+- * We are assuming that cur_bfqq has dispatched at least one request,
+- * and that bfqd->last_position reflects a position on the disk associated
+- * with the I/O issued by cur_bfqq.
++ * is closely cooperating with itself
++ * sector - used as a reference point to search for a close queue
+ */
+ static struct bfq_queue *bfq_close_cooperator(struct bfq_data *bfqd,
+- struct bfq_queue *cur_bfqq)
++ struct bfq_queue *cur_bfqq,
++ sector_t sector)
+ {
+ struct bfq_queue *bfqq;
+
+@@ -1067,7 +1076,7 @@ static struct bfq_queue *bfq_close_cooperator(struct bfq_data *bfqd,
+ * working closely on the same area of the disk. In that case,
+ * we can group them together and don't waste time idling.
+ */
+- bfqq = bfqq_close(bfqd);
++ bfqq = bfqq_close(bfqd, sector);
+ if (bfqq == NULL || bfqq == cur_bfqq)
+ return NULL;
+
+@@ -1094,6 +1103,305 @@ static struct bfq_queue *bfq_close_cooperator(struct bfq_data *bfqd,
+ return bfqq;
+ }
+
++static struct bfq_queue *
++bfq_setup_merge(struct bfq_queue *bfqq, struct bfq_queue *new_bfqq)
++{
++ int process_refs, new_process_refs;
++ struct bfq_queue *__bfqq;
++
++ /*
++ * If there are no process references on the new_bfqq, then it is
++ * unsafe to follow the ->new_bfqq chain as other bfqq's in the chain
++ * may have dropped their last reference (not just their last process
++ * reference).
++ */
++ if (!bfqq_process_refs(new_bfqq))
++ return NULL;
++
++ /* Avoid a circular list and skip interim queue merges. */
++ while ((__bfqq = new_bfqq->new_bfqq)) {
++ if (__bfqq == bfqq)
++ return NULL;
++ new_bfqq = __bfqq;
++ }
++
++ process_refs = bfqq_process_refs(bfqq);
++ new_process_refs = bfqq_process_refs(new_bfqq);
++ /*
++ * If the process for the bfqq has gone away, there is no
++ * sense in merging the queues.
++ */
++ if (process_refs == 0 || new_process_refs == 0)
++ return NULL;
++
++ bfq_log_bfqq(bfqq->bfqd, bfqq, "scheduling merge with queue %d",
++ new_bfqq->pid);
++
++ /*
++ * Merging is just a redirection: the requests of the process
++ * owning one of the two queues are redirected to the other queue.
++ * The latter queue, in its turn, is set as shared if this is the
++ * first time that the requests of some process are redirected to
++ * it.
++ *
++ * We redirect bfqq to new_bfqq and not the opposite, because we
++ * are in the context of the process owning bfqq, hence we have
++ * the io_cq of this process. So we can immediately configure this
++ * io_cq to redirect the requests of the process to new_bfqq.
++ *
++ * NOTE, even if new_bfqq coincides with the in-service queue, the
++ * io_cq of new_bfqq is not available, because, if the in-service
++ * queue is shared, bfqd->in_service_bic may not point to the
++ * io_cq of the in-service queue.
++ * Redirecting the requests of the process owning bfqq to the
++ * currently in-service queue is in any case the best option, as
++ * we feed the in-service queue with new requests close to the
++ * last request served and, by doing so, hopefully increase the
++ * throughput.
++ */
++ bfqq->new_bfqq = new_bfqq;
++ atomic_add(process_refs, &new_bfqq->ref);
++ return new_bfqq;
++}
++
++/*
++ * Attempt to schedule a merge of bfqq with the currently in-service queue
++ * or with a close queue among the scheduled queues.
++ * Return NULL if no merge was scheduled, a pointer to the shared bfq_queue
++ * structure otherwise.
++ */
++static struct bfq_queue *
++bfq_setup_cooperator(struct bfq_data *bfqd, struct bfq_queue *bfqq,
++ void *io_struct, bool request)
++{
++ struct bfq_queue *in_service_bfqq, *new_bfqq;
++
++ if (bfqq->new_bfqq)
++ return bfqq->new_bfqq;
++
++ if (!io_struct)
++ return NULL;
++
++ in_service_bfqq = bfqd->in_service_queue;
++
++ if (in_service_bfqq == NULL || in_service_bfqq == bfqq ||
++ !bfqd->in_service_bic)
++ goto check_scheduled;
++
++ if (bfq_class_idle(in_service_bfqq) || bfq_class_idle(bfqq))
++ goto check_scheduled;
++
++ if (bfq_class_rt(in_service_bfqq) != bfq_class_rt(bfqq))
++ goto check_scheduled;
++
++ if (in_service_bfqq->entity.parent != bfqq->entity.parent)
++ goto check_scheduled;
++
++ if (bfq_rq_close_to_sector(io_struct, request, bfqd->last_position) &&
++ bfq_bfqq_sync(in_service_bfqq) && bfq_bfqq_sync(bfqq)) {
++ new_bfqq = bfq_setup_merge(bfqq, in_service_bfqq);
++ if (new_bfqq != NULL)
++ return new_bfqq; /* Merge with in-service queue */
++ }
++
++ /*
++ * Check whether there is a cooperator among currently scheduled
++ * queues. The only thing we need is that the bio/request is not
++ * NULL, as we need it to establish whether a cooperator exists.
++ */
++check_scheduled:
++ new_bfqq = bfq_close_cooperator(bfqd, bfqq,
++ bfq_io_struct_pos(io_struct, request));
++ if (new_bfqq)
++ return bfq_setup_merge(bfqq, new_bfqq);
++
++ return NULL;
++}
++
++static inline void
++bfq_bfqq_save_state(struct bfq_queue *bfqq)
++{
++ /*
++ * If bfqq->bic == NULL, the queue is already shared or its requests
++ * have already been redirected to a shared queue; both idle window
++ * and weight raising state have already been saved. Do nothing.
++ */
++ if (bfqq->bic == NULL)
++ return;
++ if (bfqq->bic->wr_time_left)
++ /*
++ * This is the queue of a just-started process, and would
++ * deserve weight raising: we set wr_time_left to the full
++ * weight-raising duration to trigger weight-raising when
++ * and if the queue is split and the first request of the
++ * queue is enqueued.
++ */
++ bfqq->bic->wr_time_left = bfq_wr_duration(bfqq->bfqd);
++ else if (bfqq->wr_coeff > 1) {
++ unsigned long wr_duration =
++ jiffies - bfqq->last_wr_start_finish;
++ /*
++ * It may happen that a queue's weight raising period lasts
++ * longer than its wr_cur_max_time, as weight raising is
++ * handled only when a request is enqueued or dispatched (it
++ * does not use any timer). If the weight raising period is
++ * about to end, don't save it.
++ */
++ if (bfqq->wr_cur_max_time <= wr_duration)
++ bfqq->bic->wr_time_left = 0;
++ else
++ bfqq->bic->wr_time_left =
++ bfqq->wr_cur_max_time - wr_duration;
++ /*
++ * The bfq_queue is becoming shared or the requests of the
++ * process owning the queue are being redirected to a shared
++ * queue. Stop the weight raising period of the queue, as in
++ * both cases it should not be owned by an interactive or
++ * soft real-time application.
++ */
++ bfq_bfqq_end_wr(bfqq);
++ } else
++ bfqq->bic->wr_time_left = 0;
++ bfqq->bic->saved_idle_window = bfq_bfqq_idle_window(bfqq);
++ bfqq->bic->saved_IO_bound = bfq_bfqq_IO_bound(bfqq);
++ bfqq->bic->cooperations++;
++ bfqq->bic->failed_cooperations = 0;
++}
++
++static inline void
++bfq_get_bic_reference(struct bfq_queue *bfqq)
++{
++ /*
++ * If bfqq->bic has a non-NULL value, the bic to which it belongs
++ * is about to begin using a shared bfq_queue.
++ */
++ if (bfqq->bic)
++ atomic_long_inc(&bfqq->bic->icq.ioc->refcount);
++}
++
++static void
++bfq_merge_bfqqs(struct bfq_data *bfqd, struct bfq_io_cq *bic,
++ struct bfq_queue *bfqq, struct bfq_queue *new_bfqq)
++{
++ bfq_log_bfqq(bfqd, bfqq, "merging with queue %lu",
++ (long unsigned)new_bfqq->pid);
++ /* Save weight raising and idle window of the merged queues */
++ bfq_bfqq_save_state(bfqq);
++ bfq_bfqq_save_state(new_bfqq);
++ if (bfq_bfqq_IO_bound(bfqq))
++ bfq_mark_bfqq_IO_bound(new_bfqq);
++ bfq_clear_bfqq_IO_bound(bfqq);
++ /*
++ * Grab a reference to the bic, to prevent it from being destroyed
++ * before being possibly touched by a bfq_split_bfqq().
++ */
++ bfq_get_bic_reference(bfqq);
++ bfq_get_bic_reference(new_bfqq);
++ /*
++ * Merge queues (that is, let bic redirect its requests to new_bfqq)
++ */
++ bic_set_bfqq(bic, new_bfqq, 1);
++ bfq_mark_bfqq_coop(new_bfqq);
++ /*
++ * new_bfqq now belongs to at least two bics (it is a shared queue):
++ * set new_bfqq->bic to NULL. bfqq either:
++ * - does not belong to any bic any more, and hence bfqq->bic must
++ * be set to NULL, or
++ * - is a queue whose owning bics have already been redirected to a
++ * different queue, hence the queue is destined to not belong to
++ * any bic soon and bfqq->bic is already NULL (therefore the next
++ * assignment causes no harm).
++ */
++ new_bfqq->bic = NULL;
++ bfqq->bic = NULL;
++ bfq_put_queue(bfqq);
++}
++
++static inline void bfq_bfqq_increase_failed_cooperations(struct bfq_queue *bfqq)
++{
++ struct bfq_io_cq *bic = bfqq->bic;
++ struct bfq_data *bfqd = bfqq->bfqd;
++
++ if (bic && bfq_bfqq_cooperations(bfqq) >= bfqd->bfq_coop_thresh) {
++ bic->failed_cooperations++;
++ if (bic->failed_cooperations >= bfqd->bfq_failed_cooperations)
++ bic->cooperations = 0;
++ }
++}
++
++static int bfq_allow_merge(struct request_queue *q, struct request *rq,
++ struct bio *bio)
++{
++ struct bfq_data *bfqd = q->elevator->elevator_data;
++ struct bfq_io_cq *bic;
++ struct bfq_queue *bfqq, *new_bfqq;
++
++ /*
++ * Disallow merge of a sync bio into an async request.
++ */
++ if (bfq_bio_sync(bio) && !rq_is_sync(rq))
++ return 0;
++
++ /*
++ * Lookup the bfqq that this bio will be queued with. Allow
++ * merge only if rq is queued there.
++ * Queue lock is held here.
++ */
++ bic = bfq_bic_lookup(bfqd, current->io_context);
++ if (bic == NULL)
++ return 0;
++
++ bfqq = bic_to_bfqq(bic, bfq_bio_sync(bio));
++ /*
++ * We take advantage of this function to perform an early merge
++ * of the queues of possible cooperating processes.
++ */
++ if (bfqq != NULL) {
++ new_bfqq = bfq_setup_cooperator(bfqd, bfqq, bio, false);
++ if (new_bfqq != NULL) {
++ bfq_merge_bfqqs(bfqd, bic, bfqq, new_bfqq);
++ /*
++ * If we get here, the bio will be queued in the
++ * shared queue, i.e., new_bfqq, so use new_bfqq
++ * to decide whether bio and rq can be merged.
++ */
++ bfqq = new_bfqq;
++ } else
++ bfq_bfqq_increase_failed_cooperations(bfqq);
++ }
++
++ return bfqq == RQ_BFQQ(rq);
++}
++
++static void __bfq_set_in_service_queue(struct bfq_data *bfqd,
++ struct bfq_queue *bfqq)
++{
++ if (bfqq != NULL) {
++ bfq_mark_bfqq_must_alloc(bfqq);
++ bfq_mark_bfqq_budget_new(bfqq);
++ bfq_clear_bfqq_fifo_expire(bfqq);
++
++ bfqd->budgets_assigned = (bfqd->budgets_assigned*7 + 256) / 8;
++
++ bfq_log_bfqq(bfqd, bfqq,
++ "set_in_service_queue, cur-budget = %lu",
++ bfqq->entity.budget);
++ }
++
++ bfqd->in_service_queue = bfqq;
++}
++
++/*
++ * Get and set a new queue for service.
++ */
++static struct bfq_queue *bfq_set_in_service_queue(struct bfq_data *bfqd)
++{
++ struct bfq_queue *bfqq = bfq_get_next_queue(bfqd);
++
++ __bfq_set_in_service_queue(bfqd, bfqq);
++ return bfqq;
++}
++
+ /*
+ * If enough samples have been computed, return the current max budget
+ * stored in bfqd, which is dynamically updated according to the
+@@ -1237,63 +1545,6 @@ static struct request *bfq_check_fifo(struct bfq_queue *bfqq)
+ return rq;
+ }
+
+-/*
+- * Must be called with the queue_lock held.
+- */
+-static int bfqq_process_refs(struct bfq_queue *bfqq)
+-{
+- int process_refs, io_refs;
+-
+- io_refs = bfqq->allocated[READ] + bfqq->allocated[WRITE];
+- process_refs = atomic_read(&bfqq->ref) - io_refs - bfqq->entity.on_st;
+- BUG_ON(process_refs < 0);
+- return process_refs;
+-}
+-
+-static void bfq_setup_merge(struct bfq_queue *bfqq, struct bfq_queue *new_bfqq)
+-{
+- int process_refs, new_process_refs;
+- struct bfq_queue *__bfqq;
+-
+- /*
+- * If there are no process references on the new_bfqq, then it is
+- * unsafe to follow the ->new_bfqq chain as other bfqq's in the chain
+- * may have dropped their last reference (not just their last process
+- * reference).
+- */
+- if (!bfqq_process_refs(new_bfqq))
+- return;
+-
+- /* Avoid a circular list and skip interim queue merges. */
+- while ((__bfqq = new_bfqq->new_bfqq)) {
+- if (__bfqq == bfqq)
+- return;
+- new_bfqq = __bfqq;
+- }
+-
+- process_refs = bfqq_process_refs(bfqq);
+- new_process_refs = bfqq_process_refs(new_bfqq);
+- /*
+- * If the process for the bfqq has gone away, there is no
+- * sense in merging the queues.
+- */
+- if (process_refs == 0 || new_process_refs == 0)
+- return;
+-
+- /*
+- * Merge in the direction of the lesser amount of work.
+- */
+- if (new_process_refs >= process_refs) {
+- bfqq->new_bfqq = new_bfqq;
+- atomic_add(process_refs, &new_bfqq->ref);
+- } else {
+- new_bfqq->new_bfqq = bfqq;
+- atomic_add(new_process_refs, &bfqq->ref);
+- }
+- bfq_log_bfqq(bfqq->bfqd, bfqq, "scheduling merge with queue %d",
+- new_bfqq->pid);
+-}
+-
+ static inline unsigned long bfq_bfqq_budget_left(struct bfq_queue *bfqq)
+ {
+ struct bfq_entity *entity = &bfqq->entity;
+@@ -2011,7 +2262,7 @@ static inline bool bfq_bfqq_must_idle(struct bfq_queue *bfqq)
+ */
+ static struct bfq_queue *bfq_select_queue(struct bfq_data *bfqd)
+ {
+- struct bfq_queue *bfqq, *new_bfqq = NULL;
++ struct bfq_queue *bfqq;
+ struct request *next_rq;
+ enum bfqq_expiration reason = BFQ_BFQQ_BUDGET_TIMEOUT;
+
+@@ -2021,17 +2272,6 @@ static struct bfq_queue *bfq_select_queue(struct bfq_data *bfqd)
+
+ bfq_log_bfqq(bfqd, bfqq, "select_queue: already in-service queue");
+
+- /*
+- * If another queue has a request waiting within our mean seek
+- * distance, let it run. The expire code will check for close
+- * cooperators and put the close queue at the front of the
+- * service tree. If possible, merge the expiring queue with the
+- * new bfqq.
+- */
+- new_bfqq = bfq_close_cooperator(bfqd, bfqq);
+- if (new_bfqq != NULL && bfqq->new_bfqq == NULL)
+- bfq_setup_merge(bfqq, new_bfqq);
+-
+ if (bfq_may_expire_for_budg_timeout(bfqq) &&
+ !timer_pending(&bfqd->idle_slice_timer) &&
+ !bfq_bfqq_must_idle(bfqq))
+@@ -2070,10 +2310,7 @@ static struct bfq_queue *bfq_select_queue(struct bfq_data *bfqd)
+ bfq_clear_bfqq_wait_request(bfqq);
+ del_timer(&bfqd->idle_slice_timer);
+ }
+- if (new_bfqq == NULL)
+- goto keep_queue;
+- else
+- goto expire;
++ goto keep_queue;
+ }
+ }
+
+@@ -2082,40 +2319,30 @@ static struct bfq_queue *bfq_select_queue(struct bfq_data *bfqd)
+ * in flight (possibly waiting for a completion) or is idling for a
+ * new request, then keep it.
+ */
+- if (new_bfqq == NULL && (timer_pending(&bfqd->idle_slice_timer) ||
+- (bfqq->dispatched != 0 && bfq_bfqq_must_not_expire(bfqq)))) {
++ if (timer_pending(&bfqd->idle_slice_timer) ||
++ (bfqq->dispatched != 0 && bfq_bfqq_must_not_expire(bfqq))) {
+ bfqq = NULL;
+ goto keep_queue;
+- } else if (new_bfqq != NULL && timer_pending(&bfqd->idle_slice_timer)) {
+- /*
+- * Expiring the queue because there is a close cooperator,
+- * cancel timer.
+- */
+- bfq_clear_bfqq_wait_request(bfqq);
+- del_timer(&bfqd->idle_slice_timer);
+ }
+
+ reason = BFQ_BFQQ_NO_MORE_REQUESTS;
+ expire:
+ bfq_bfqq_expire(bfqd, bfqq, 0, reason);
+ new_queue:
+- bfqq = bfq_set_in_service_queue(bfqd, new_bfqq);
++ bfqq = bfq_set_in_service_queue(bfqd);
+ bfq_log(bfqd, "select_queue: new queue %d returned",
+ bfqq != NULL ? bfqq->pid : 0);
+ keep_queue:
+ return bfqq;
+ }
+
+-static void bfq_update_wr_data(struct bfq_data *bfqd,
+- struct bfq_queue *bfqq)
++static void bfq_update_wr_data(struct bfq_data *bfqd, struct bfq_queue *bfqq)
+ {
+- if (bfqq->wr_coeff > 1) { /* queue is being boosted */
+- struct bfq_entity *entity = &bfqq->entity;
+-
++ struct bfq_entity *entity = &bfqq->entity;
++ if (bfqq->wr_coeff > 1) { /* queue is being weight-raised */
+ bfq_log_bfqq(bfqd, bfqq,
+ "raising period dur %u/%u msec, old coeff %u, w %d(%d)",
+- jiffies_to_msecs(jiffies -
+- bfqq->last_wr_start_finish),
++ jiffies_to_msecs(jiffies - bfqq->last_wr_start_finish),
+ jiffies_to_msecs(bfqq->wr_cur_max_time),
+ bfqq->wr_coeff,
+ bfqq->entity.weight, bfqq->entity.orig_weight);
+@@ -2124,11 +2351,15 @@ static void bfq_update_wr_data(struct bfq_data *bfqd,
+ entity->orig_weight * bfqq->wr_coeff);
+ if (entity->ioprio_changed)
+ bfq_log_bfqq(bfqd, bfqq, "WARN: pending prio change");
++
+ /*
+ * If too much time has elapsed from the beginning
+- * of this weight-raising, stop it.
++ * of this weight-raising period, or the queue has
++ * exceeded the acceptable number of cooperations,
++ * stop it.
+ */
+- if (time_is_before_jiffies(bfqq->last_wr_start_finish +
++ if (bfq_bfqq_cooperations(bfqq) >= bfqd->bfq_coop_thresh ||
++ time_is_before_jiffies(bfqq->last_wr_start_finish +
+ bfqq->wr_cur_max_time)) {
+ bfqq->last_wr_start_finish = jiffies;
+ bfq_log_bfqq(bfqd, bfqq,
+@@ -2136,11 +2367,13 @@ static void bfq_update_wr_data(struct bfq_data *bfqd,
+ bfqq->last_wr_start_finish,
+ jiffies_to_msecs(bfqq->wr_cur_max_time));
+ bfq_bfqq_end_wr(bfqq);
+- __bfq_entity_update_weight_prio(
+- bfq_entity_service_tree(entity),
+- entity);
+ }
+ }
++ /* Update weight both if it must be raised and if it must be lowered */
++ if ((entity->weight > entity->orig_weight) != (bfqq->wr_coeff > 1))
++ __bfq_entity_update_weight_prio(
++ bfq_entity_service_tree(entity),
++ entity);
+ }
+
+ /*
+@@ -2377,6 +2610,25 @@ static inline void bfq_init_icq(struct io_cq *icq)
+ struct bfq_io_cq *bic = icq_to_bic(icq);
+
+ bic->ttime.last_end_request = jiffies;
++ /*
++ * A newly created bic indicates that the process has just
++ * started doing I/O, and is probably mapping into memory its
++ * executable and libraries: it definitely needs weight raising.
++ * There is however the possibility that the process performs,
++ * for a while, I/O close to some other process. EQM intercepts
++ * this behavior and may merge the queue corresponding to the
++ * process with some other queue, BEFORE the weight of the queue
++ * is raised. Merged queues are not weight-raised (they are assumed
++ * to belong to processes that benefit only from high throughput).
++ * If the merge is basically the consequence of an accident, then
++ * the queue will be split soon and will get back its old weight.
++ * It is then important to write down somewhere that this queue
++ * does need weight raising, even if it did not make it to get its
++ * weight raised before being merged. To this end, we overload
++ * the field wr_time_left and assign 1 to it, to mark the queue
++ * as needing weight raising.
++ */
++ bic->wr_time_left = 1;
+ }
+
+ static void bfq_exit_icq(struct io_cq *icq)
+@@ -2390,6 +2642,13 @@ static void bfq_exit_icq(struct io_cq *icq)
+ }
+
+ if (bic->bfqq[BLK_RW_SYNC]) {
++ /*
++ * If the bic is using a shared queue, put the reference
++ * taken on the io_context when the bic started using a
++ * shared bfq_queue.
++ */
++ if (bfq_bfqq_coop(bic->bfqq[BLK_RW_SYNC]))
++ put_io_context(icq->ioc);
+ bfq_exit_bfqq(bfqd, bic->bfqq[BLK_RW_SYNC]);
+ bic->bfqq[BLK_RW_SYNC] = NULL;
+ }
+@@ -2678,6 +2937,10 @@ static void bfq_update_idle_window(struct bfq_data *bfqd,
+ if (!bfq_bfqq_sync(bfqq) || bfq_class_idle(bfqq))
+ return;
+
++ /* Idle window just restored, statistics are meaningless. */
++ if (bfq_bfqq_just_split(bfqq))
++ return;
++
+ enable_idle = bfq_bfqq_idle_window(bfqq);
+
+ if (atomic_read(&bic->icq.ioc->active_ref) == 0 ||
+@@ -2725,6 +2988,7 @@ static void bfq_rq_enqueued(struct bfq_data *bfqd, struct bfq_queue *bfqq,
+ if (bfqq->entity.service > bfq_max_budget(bfqd) / 8 ||
+ !BFQQ_SEEKY(bfqq))
+ bfq_update_idle_window(bfqd, bfqq, bic);
++ bfq_clear_bfqq_just_split(bfqq);
+
+ bfq_log_bfqq(bfqd, bfqq,
+ "rq_enqueued: idle_window=%d (seeky %d, mean %llu)",
+@@ -2785,13 +3049,49 @@ static void bfq_rq_enqueued(struct bfq_data *bfqd, struct bfq_queue *bfqq,
+ static void bfq_insert_request(struct request_queue *q, struct request *rq)
+ {
+ struct bfq_data *bfqd = q->elevator->elevator_data;
+- struct bfq_queue *bfqq = RQ_BFQQ(rq);
++ struct bfq_queue *bfqq = RQ_BFQQ(rq), *new_bfqq;
+
+ assert_spin_locked(bfqd->queue->queue_lock);
++
++ /*
++ * An unplug may trigger a requeue of a request from the device
++ * driver: make sure we are in process context while trying to
++ * merge two bfq_queues.
++ */
++ if (!in_interrupt()) {
++ new_bfqq = bfq_setup_cooperator(bfqd, bfqq, rq, true);
++ if (new_bfqq != NULL) {
++ if (bic_to_bfqq(RQ_BIC(rq), 1) != bfqq)
++ new_bfqq = bic_to_bfqq(RQ_BIC(rq), 1);
++ /*
++ * Release the request's reference to the old bfqq
++ * and make sure one is taken to the shared queue.
++ */
++ new_bfqq->allocated[rq_data_dir(rq)]++;
++ bfqq->allocated[rq_data_dir(rq)]--;
++ atomic_inc(&new_bfqq->ref);
++ bfq_put_queue(bfqq);
++ if (bic_to_bfqq(RQ_BIC(rq), 1) == bfqq)
++ bfq_merge_bfqqs(bfqd, RQ_BIC(rq),
++ bfqq, new_bfqq);
++ rq->elv.priv[1] = new_bfqq;
++ bfqq = new_bfqq;
++ } else
++ bfq_bfqq_increase_failed_cooperations(bfqq);
++ }
++
+ bfq_init_prio_data(bfqq, RQ_BIC(rq));
+
+ bfq_add_request(rq);
+
++ /*
++ * Here a newly-created bfq_queue has already started a weight-raising
++ * period: clear raising_time_left to prevent bfq_bfqq_save_state()
++ * from assigning it a full weight-raising period. See the detailed
++ * comments about this field in bfq_init_icq().
++ */
++ if (bfqq->bic != NULL)
++ bfqq->bic->wr_time_left = 0;
+ rq->fifo_time = jiffies + bfqd->bfq_fifo_expire[rq_is_sync(rq)];
+ list_add_tail(&rq->queuelist, &bfqq->fifo);
+
+@@ -2956,18 +3256,6 @@ static void bfq_put_request(struct request *rq)
+ }
+ }
+
+-static struct bfq_queue *
+-bfq_merge_bfqqs(struct bfq_data *bfqd, struct bfq_io_cq *bic,
+- struct bfq_queue *bfqq)
+-{
+- bfq_log_bfqq(bfqd, bfqq, "merging with queue %lu",
+- (long unsigned)bfqq->new_bfqq->pid);
+- bic_set_bfqq(bic, bfqq->new_bfqq, 1);
+- bfq_mark_bfqq_coop(bfqq->new_bfqq);
+- bfq_put_queue(bfqq);
+- return bic_to_bfqq(bic, 1);
+-}
+-
+ /*
+ * Returns NULL if a new bfqq should be allocated, or the old bfqq if this
+ * was the last process referring to said bfqq.
+@@ -2976,6 +3264,9 @@ static struct bfq_queue *
+ bfq_split_bfqq(struct bfq_io_cq *bic, struct bfq_queue *bfqq)
+ {
+ bfq_log_bfqq(bfqq->bfqd, bfqq, "splitting queue");
++
++ put_io_context(bic->icq.ioc);
++
+ if (bfqq_process_refs(bfqq) == 1) {
+ bfqq->pid = current->pid;
+ bfq_clear_bfqq_coop(bfqq);
+@@ -3004,6 +3295,7 @@ static int bfq_set_request(struct request_queue *q, struct request *rq,
+ struct bfq_queue *bfqq;
+ struct bfq_group *bfqg;
+ unsigned long flags;
++ bool split = false;
+
+ might_sleep_if(gfp_mask & __GFP_WAIT);
+
+@@ -3022,24 +3314,14 @@ new_queue:
+ bfqq = bfq_get_queue(bfqd, bfqg, is_sync, bic, gfp_mask);
+ bic_set_bfqq(bic, bfqq, is_sync);
+ } else {
+- /*
+- * If the queue was seeky for too long, break it apart.
+- */
++ /* If the queue was seeky for too long, break it apart. */
+ if (bfq_bfqq_coop(bfqq) && bfq_bfqq_split_coop(bfqq)) {
+ bfq_log_bfqq(bfqd, bfqq, "breaking apart bfqq");
+ bfqq = bfq_split_bfqq(bic, bfqq);
++ split = true;
+ if (!bfqq)
+ goto new_queue;
+ }
+-
+- /*
+- * Check to see if this queue is scheduled to merge with
+- * another closely cooperating queue. The merging of queues
+- * happens here as it must be done in process context.
+- * The reference on new_bfqq was taken in merge_bfqqs.
+- */
+- if (bfqq->new_bfqq != NULL)
+- bfqq = bfq_merge_bfqqs(bfqd, bic, bfqq);
+ }
+
+ bfqq->allocated[rw]++;
+@@ -3050,6 +3332,26 @@ new_queue:
+ rq->elv.priv[0] = bic;
+ rq->elv.priv[1] = bfqq;
+
++ /*
++ * If a bfq_queue has only one process reference, it is owned
++ * by only one bfq_io_cq: we can set the bic field of the
++ * bfq_queue to the address of that structure. Also, if the
++ * queue has just been split, mark a flag so that the
++ * information is available to the other scheduler hooks.
++ */
++ if (bfqq_process_refs(bfqq) == 1) {
++ bfqq->bic = bic;
++ if (split) {
++ bfq_mark_bfqq_just_split(bfqq);
++ /*
++ * If the queue has just been split from a shared
++ * queue, restore the idle window and the possible
++ * weight raising period.
++ */
++ bfq_bfqq_resume_state(bfqq, bic);
++ }
++ }
++
+ spin_unlock_irqrestore(q->queue_lock, flags);
+
+ return 0;
+diff --git a/block/bfq-sched.c b/block/bfq-sched.c
+index c4831b7..546a254 100644
+--- a/block/bfq-sched.c
++++ b/block/bfq-sched.c
+@@ -1084,34 +1084,6 @@ static struct bfq_queue *bfq_get_next_queue(struct bfq_data *bfqd)
+ return bfqq;
+ }
+
+-/*
+- * Forced extraction of the given queue.
+- */
+-static void bfq_get_next_queue_forced(struct bfq_data *bfqd,
+- struct bfq_queue *bfqq)
+-{
+- struct bfq_entity *entity;
+- struct bfq_sched_data *sd;
+-
+- BUG_ON(bfqd->in_service_queue != NULL);
+-
+- entity = &bfqq->entity;
+- /*
+- * Bubble up extraction/update from the leaf to the root.
+- */
+- for_each_entity(entity) {
+- sd = entity->sched_data;
+- bfq_update_budget(entity);
+- bfq_update_vtime(bfq_entity_service_tree(entity));
+- bfq_active_extract(bfq_entity_service_tree(entity), entity);
+- sd->in_service_entity = entity;
+- sd->next_in_service = NULL;
+- entity->service = 0;
+- }
+-
+- return;
+-}
+-
+ static void __bfq_bfqd_reset_in_service(struct bfq_data *bfqd)
+ {
+ if (bfqd->in_service_bic != NULL) {
+diff --git a/block/bfq.h b/block/bfq.h
+index a83e69d..ebbd040 100644
+--- a/block/bfq.h
++++ b/block/bfq.h
+@@ -215,18 +215,21 @@ struct bfq_group;
+ * idle @bfq_queue with no outstanding requests, then
+ * the task associated with the queue it is deemed as
+ * soft real-time (see the comments to the function
+- * bfq_bfqq_softrt_next_start()).
++ * bfq_bfqq_softrt_next_start())
+ * @last_idle_bklogged: time of the last transition of the @bfq_queue from
+ * idle to backlogged
+ * @service_from_backlogged: cumulative service received from the @bfq_queue
+ * since the last transition from idle to
+ * backlogged
++ * @bic: pointer to the bfq_io_cq owning the bfq_queue, set to %NULL if the
++ * queue is shared
+ *
+- * A bfq_queue is a leaf request queue; it can be associated with an io_context
+- * or more, if it is async or shared between cooperating processes. @cgroup
+- * holds a reference to the cgroup, to be sure that it does not disappear while
+- * a bfqq still references it (mostly to avoid races between request issuing and
+- * task migration followed by cgroup destruction).
++ * A bfq_queue is a leaf request queue; it can be associated with an
++ * io_context or more, if it is async or shared between cooperating
++ * processes. @cgroup holds a reference to the cgroup, to be sure that it
++ * does not disappear while a bfqq still references it (mostly to avoid
++ * races between request issuing and task migration followed by cgroup
++ * destruction).
+ * All the fields are protected by the queue lock of the containing bfqd.
+ */
+ struct bfq_queue {
+@@ -264,6 +267,7 @@ struct bfq_queue {
+ unsigned int requests_within_timer;
+
+ pid_t pid;
++ struct bfq_io_cq *bic;
+
+ /* weight-raising fields */
+ unsigned long wr_cur_max_time;
+@@ -293,12 +297,34 @@ struct bfq_ttime {
+ * @icq: associated io_cq structure
+ * @bfqq: array of two process queues, the sync and the async
+ * @ttime: associated @bfq_ttime struct
++ * @wr_time_left: snapshot of the time left before weight raising ends
++ * for the sync queue associated to this process; this
++ * snapshot is taken to remember this value while the weight
++ * raising is suspended because the queue is merged with a
++ * shared queue, and is used to set @raising_cur_max_time
++ * when the queue is split from the shared queue and its
++ * weight is raised again
++ * @saved_idle_window: same purpose as the previous field for the idle
++ * window
++ * @saved_IO_bound: same purpose as the previous two fields for the I/O
++ * bound classification of a queue
++ * @cooperations: counter of consecutive successful queue merges underwent
++ * by any of the process' @bfq_queues
++ * @failed_cooperations: counter of consecutive failed queue merges of any
++ * of the process' @bfq_queues
+ */
+ struct bfq_io_cq {
+ struct io_cq icq; /* must be the first member */
+ struct bfq_queue *bfqq[2];
+ struct bfq_ttime ttime;
+ int ioprio;
++
++ unsigned int wr_time_left;
++ unsigned int saved_idle_window;
++ unsigned int saved_IO_bound;
++
++ unsigned int cooperations;
++ unsigned int failed_cooperations;
+ };
+
+ enum bfq_device_speed {
+@@ -511,7 +537,7 @@ enum bfqq_state_flags {
+ BFQ_BFQQ_FLAG_prio_changed, /* task priority has changed */
+ BFQ_BFQQ_FLAG_sync, /* synchronous queue */
+ BFQ_BFQQ_FLAG_budget_new, /* no completion with this budget */
+- BFQ_BFQQ_FLAG_IO_bound, /*
++ BFQ_BFQQ_FLAG_IO_bound, /*
+ * bfqq has timed-out at least once
+ * having consumed at most 2/10 of
+ * its budget
+@@ -520,12 +546,13 @@ enum bfqq_state_flags {
+ * bfqq has proved to be slow and
+ * seeky until budget timeout
+ */
+- BFQ_BFQQ_FLAG_softrt_update, /*
++ BFQ_BFQQ_FLAG_softrt_update, /*
+ * may need softrt-next-start
+ * update
+ */
+ BFQ_BFQQ_FLAG_coop, /* bfqq is shared */
+- BFQ_BFQQ_FLAG_split_coop, /* shared bfqq will be splitted */
++ BFQ_BFQQ_FLAG_split_coop, /* shared bfqq will be split */
++ BFQ_BFQQ_FLAG_just_split, /* queue has just been split */
+ };
+
+ #define BFQ_BFQQ_FNS(name) \
+@@ -554,6 +581,7 @@ BFQ_BFQQ_FNS(IO_bound);
+ BFQ_BFQQ_FNS(constantly_seeky);
+ BFQ_BFQQ_FNS(coop);
+ BFQ_BFQQ_FNS(split_coop);
++BFQ_BFQQ_FNS(just_split);
+ BFQ_BFQQ_FNS(softrt_update);
+ #undef BFQ_BFQQ_FNS
+
+--
+2.0.3
+
^ permalink raw reply related [flat|nested] 26+ messages in thread
* [gentoo-commits] proj/linux-patches:3.16 commit in: /
2014-08-19 11:44 Mike Pagano
@ 2014-08-14 11:51 ` Mike Pagano
0 siblings, 0 replies; 26+ messages in thread
From: Mike Pagano @ 2014-08-14 11:51 UTC (permalink / raw
To: gentoo-commits
commit: a2032151afc204dbfddee6acc420e09c3295ece5
Author: Mike Pagano <mpagano <AT> gentoo <DOT> org>
AuthorDate: Thu Aug 14 11:51:26 2014 +0000
Commit: Mike Pagano <mpagano <AT> gentoo <DOT> org>
CommitDate: Thu Aug 14 11:51:26 2014 +0000
URL: http://git.overlays.gentoo.org/gitweb/?p=proj/linux-patches.git;a=commit;h=a2032151
Linux patch 3.16.1
---
0000_README | 3 +
1000_linux-3.16.1.patch | 507 ++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 510 insertions(+)
diff --git a/0000_README b/0000_README
index a6ec2e6..f57085e 100644
--- a/0000_README
+++ b/0000_README
@@ -42,6 +42,9 @@ EXPERIMENTAL
Individual Patch Descriptions:
--------------------------------------------------------------------------
+Patch: 1000_linux-3.16.1.patch
+From: http://www.kernel.org
+Desc: Linux 3.16.1
Patch: 2400_kcopy-patch-for-infiniband-driver.patch
From: Alexey Shvetsov <alexxy@gentoo.org>
diff --git a/1000_linux-3.16.1.patch b/1000_linux-3.16.1.patch
new file mode 100644
index 0000000..29ac346
--- /dev/null
+++ b/1000_linux-3.16.1.patch
@@ -0,0 +1,507 @@
+diff --git a/Makefile b/Makefile
+index d0901b46b4bf..87663a2d1d10 100644
+--- a/Makefile
++++ b/Makefile
+@@ -1,8 +1,8 @@
+ VERSION = 3
+ PATCHLEVEL = 16
+-SUBLEVEL = 0
++SUBLEVEL = 1
+ EXTRAVERSION =
+-NAME = Shuffling Zombie Juror
++NAME = Museum of Fishiegoodies
+
+ # *DOCUMENTATION*
+ # To see a list of typical targets execute "make help"
+diff --git a/arch/sparc/include/asm/tlbflush_64.h b/arch/sparc/include/asm/tlbflush_64.h
+index 816d8202fa0a..dea1cfa2122b 100644
+--- a/arch/sparc/include/asm/tlbflush_64.h
++++ b/arch/sparc/include/asm/tlbflush_64.h
+@@ -34,6 +34,8 @@ static inline void flush_tlb_range(struct vm_area_struct *vma,
+ {
+ }
+
++void flush_tlb_kernel_range(unsigned long start, unsigned long end);
++
+ #define __HAVE_ARCH_ENTER_LAZY_MMU_MODE
+
+ void flush_tlb_pending(void);
+@@ -48,11 +50,6 @@ void __flush_tlb_kernel_range(unsigned long start, unsigned long end);
+
+ #ifndef CONFIG_SMP
+
+-#define flush_tlb_kernel_range(start,end) \
+-do { flush_tsb_kernel_range(start,end); \
+- __flush_tlb_kernel_range(start,end); \
+-} while (0)
+-
+ static inline void global_flush_tlb_page(struct mm_struct *mm, unsigned long vaddr)
+ {
+ __flush_tlb_page(CTX_HWBITS(mm->context), vaddr);
+@@ -63,11 +60,6 @@ static inline void global_flush_tlb_page(struct mm_struct *mm, unsigned long vad
+ void smp_flush_tlb_kernel_range(unsigned long start, unsigned long end);
+ void smp_flush_tlb_page(struct mm_struct *mm, unsigned long vaddr);
+
+-#define flush_tlb_kernel_range(start, end) \
+-do { flush_tsb_kernel_range(start,end); \
+- smp_flush_tlb_kernel_range(start, end); \
+-} while (0)
+-
+ #define global_flush_tlb_page(mm, vaddr) \
+ smp_flush_tlb_page(mm, vaddr)
+
+diff --git a/arch/sparc/kernel/ldc.c b/arch/sparc/kernel/ldc.c
+index e01d75d40329..66dacd56bb10 100644
+--- a/arch/sparc/kernel/ldc.c
++++ b/arch/sparc/kernel/ldc.c
+@@ -1336,7 +1336,7 @@ int ldc_connect(struct ldc_channel *lp)
+ if (!(lp->flags & LDC_FLAG_ALLOCED_QUEUES) ||
+ !(lp->flags & LDC_FLAG_REGISTERED_QUEUES) ||
+ lp->hs_state != LDC_HS_OPEN)
+- err = -EINVAL;
++ err = ((lp->hs_state > LDC_HS_OPEN) ? 0 : -EINVAL);
+ else
+ err = start_handshake(lp);
+
+diff --git a/arch/sparc/math-emu/math_32.c b/arch/sparc/math-emu/math_32.c
+index aa4d55b0bdf0..5ce8f2f64604 100644
+--- a/arch/sparc/math-emu/math_32.c
++++ b/arch/sparc/math-emu/math_32.c
+@@ -499,7 +499,7 @@ static int do_one_mathemu(u32 insn, unsigned long *pfsr, unsigned long *fregs)
+ case 0: fsr = *pfsr;
+ if (IR == -1) IR = 2;
+ /* fcc is always fcc0 */
+- fsr &= ~0xc00; fsr |= (IR << 10); break;
++ fsr &= ~0xc00; fsr |= (IR << 10);
+ *pfsr = fsr;
+ break;
+ case 1: rd->s = IR; break;
+diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
+index 16b58ff11e65..2cfb0f25e0ed 100644
+--- a/arch/sparc/mm/init_64.c
++++ b/arch/sparc/mm/init_64.c
+@@ -351,6 +351,10 @@ void update_mmu_cache(struct vm_area_struct *vma, unsigned long address, pte_t *
+
+ mm = vma->vm_mm;
+
++ /* Don't insert a non-valid PTE into the TSB, we'll deadlock. */
++ if (!pte_accessible(mm, pte))
++ return;
++
+ spin_lock_irqsave(&mm->context.lock, flags);
+
+ #if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE)
+@@ -2619,6 +2623,10 @@ void update_mmu_cache_pmd(struct vm_area_struct *vma, unsigned long addr,
+
+ pte = pmd_val(entry);
+
++ /* Don't insert a non-valid PMD into the TSB, we'll deadlock. */
++ if (!(pte & _PAGE_VALID))
++ return;
++
+ /* We are fabricating 8MB pages using 4MB real hw pages. */
+ pte |= (addr & (1UL << REAL_HPAGE_SHIFT));
+
+@@ -2699,3 +2707,26 @@ void hugetlb_setup(struct pt_regs *regs)
+ }
+ }
+ #endif
++
++#ifdef CONFIG_SMP
++#define do_flush_tlb_kernel_range smp_flush_tlb_kernel_range
++#else
++#define do_flush_tlb_kernel_range __flush_tlb_kernel_range
++#endif
++
++void flush_tlb_kernel_range(unsigned long start, unsigned long end)
++{
++ if (start < HI_OBP_ADDRESS && end > LOW_OBP_ADDRESS) {
++ if (start < LOW_OBP_ADDRESS) {
++ flush_tsb_kernel_range(start, LOW_OBP_ADDRESS);
++ do_flush_tlb_kernel_range(start, LOW_OBP_ADDRESS);
++ }
++ if (end > HI_OBP_ADDRESS) {
++ flush_tsb_kernel_range(end, HI_OBP_ADDRESS);
++ do_flush_tlb_kernel_range(end, HI_OBP_ADDRESS);
++ }
++ } else {
++ flush_tsb_kernel_range(start, end);
++ do_flush_tlb_kernel_range(start, end);
++ }
++}
+diff --git a/drivers/net/ethernet/broadcom/tg3.c b/drivers/net/ethernet/broadcom/tg3.c
+index 8afa579e7c40..a3dd5dc64f4c 100644
+--- a/drivers/net/ethernet/broadcom/tg3.c
++++ b/drivers/net/ethernet/broadcom/tg3.c
+@@ -7830,17 +7830,18 @@ static int tigon3_dma_hwbug_workaround(struct tg3_napi *tnapi,
+
+ static netdev_tx_t tg3_start_xmit(struct sk_buff *, struct net_device *);
+
+-/* Use GSO to workaround a rare TSO bug that may be triggered when the
+- * TSO header is greater than 80 bytes.
++/* Use GSO to workaround all TSO packets that meet HW bug conditions
++ * indicated in tg3_tx_frag_set()
+ */
+-static int tg3_tso_bug(struct tg3 *tp, struct sk_buff *skb)
++static int tg3_tso_bug(struct tg3 *tp, struct tg3_napi *tnapi,
++ struct netdev_queue *txq, struct sk_buff *skb)
+ {
+ struct sk_buff *segs, *nskb;
+ u32 frag_cnt_est = skb_shinfo(skb)->gso_segs * 3;
+
+ /* Estimate the number of fragments in the worst case */
+- if (unlikely(tg3_tx_avail(&tp->napi[0]) <= frag_cnt_est)) {
+- netif_stop_queue(tp->dev);
++ if (unlikely(tg3_tx_avail(tnapi) <= frag_cnt_est)) {
++ netif_tx_stop_queue(txq);
+
+ /* netif_tx_stop_queue() must be done before checking
+ * checking tx index in tg3_tx_avail() below, because in
+@@ -7848,13 +7849,14 @@ static int tg3_tso_bug(struct tg3 *tp, struct sk_buff *skb)
+ * netif_tx_queue_stopped().
+ */
+ smp_mb();
+- if (tg3_tx_avail(&tp->napi[0]) <= frag_cnt_est)
++ if (tg3_tx_avail(tnapi) <= frag_cnt_est)
+ return NETDEV_TX_BUSY;
+
+- netif_wake_queue(tp->dev);
++ netif_tx_wake_queue(txq);
+ }
+
+- segs = skb_gso_segment(skb, tp->dev->features & ~(NETIF_F_TSO | NETIF_F_TSO6));
++ segs = skb_gso_segment(skb, tp->dev->features &
++ ~(NETIF_F_TSO | NETIF_F_TSO6));
+ if (IS_ERR(segs) || !segs)
+ goto tg3_tso_bug_end;
+
+@@ -7930,7 +7932,7 @@ static netdev_tx_t tg3_start_xmit(struct sk_buff *skb, struct net_device *dev)
+ if (!skb_is_gso_v6(skb)) {
+ if (unlikely((ETH_HLEN + hdr_len) > 80) &&
+ tg3_flag(tp, TSO_BUG))
+- return tg3_tso_bug(tp, skb);
++ return tg3_tso_bug(tp, tnapi, txq, skb);
+
+ ip_csum = iph->check;
+ ip_tot_len = iph->tot_len;
+@@ -8061,7 +8063,7 @@ static netdev_tx_t tg3_start_xmit(struct sk_buff *skb, struct net_device *dev)
+ iph->tot_len = ip_tot_len;
+ }
+ tcph->check = tcp_csum;
+- return tg3_tso_bug(tp, skb);
++ return tg3_tso_bug(tp, tnapi, txq, skb);
+ }
+
+ /* If the workaround fails due to memory/mapping
+diff --git a/drivers/net/ethernet/brocade/bna/bnad.c b/drivers/net/ethernet/brocade/bna/bnad.c
+index 3a77f9ead004..556aab75f490 100644
+--- a/drivers/net/ethernet/brocade/bna/bnad.c
++++ b/drivers/net/ethernet/brocade/bna/bnad.c
+@@ -600,9 +600,9 @@ bnad_cq_process(struct bnad *bnad, struct bna_ccb *ccb, int budget)
+ prefetch(bnad->netdev);
+
+ cq = ccb->sw_q;
+- cmpl = &cq[ccb->producer_index];
+
+ while (packets < budget) {
++ cmpl = &cq[ccb->producer_index];
+ if (!cmpl->valid)
+ break;
+ /* The 'valid' field is set by the adapter, only after writing
+diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
+index 958df383068a..ef8a5c20236a 100644
+--- a/drivers/net/macvlan.c
++++ b/drivers/net/macvlan.c
+@@ -646,6 +646,7 @@ static int macvlan_init(struct net_device *dev)
+ (lowerdev->state & MACVLAN_STATE_MASK);
+ dev->features = lowerdev->features & MACVLAN_FEATURES;
+ dev->features |= ALWAYS_ON_FEATURES;
++ dev->vlan_features = lowerdev->vlan_features & MACVLAN_FEATURES;
+ dev->gso_max_size = lowerdev->gso_max_size;
+ dev->iflink = lowerdev->ifindex;
+ dev->hard_header_len = lowerdev->hard_header_len;
+diff --git a/drivers/net/phy/mdio_bus.c b/drivers/net/phy/mdio_bus.c
+index 203651ebccb0..4eaadcfcb0fe 100644
+--- a/drivers/net/phy/mdio_bus.c
++++ b/drivers/net/phy/mdio_bus.c
+@@ -255,7 +255,6 @@ int mdiobus_register(struct mii_bus *bus)
+
+ bus->dev.parent = bus->parent;
+ bus->dev.class = &mdio_bus_class;
+- bus->dev.driver = bus->parent->driver;
+ bus->dev.groups = NULL;
+ dev_set_name(&bus->dev, "%s", bus->id);
+
+diff --git a/drivers/sbus/char/bbc_envctrl.c b/drivers/sbus/char/bbc_envctrl.c
+index 160e7510aca6..0787b9756165 100644
+--- a/drivers/sbus/char/bbc_envctrl.c
++++ b/drivers/sbus/char/bbc_envctrl.c
+@@ -452,6 +452,9 @@ static void attach_one_temp(struct bbc_i2c_bus *bp, struct platform_device *op,
+ if (!tp)
+ return;
+
++ INIT_LIST_HEAD(&tp->bp_list);
++ INIT_LIST_HEAD(&tp->glob_list);
++
+ tp->client = bbc_i2c_attach(bp, op);
+ if (!tp->client) {
+ kfree(tp);
+@@ -497,6 +500,9 @@ static void attach_one_fan(struct bbc_i2c_bus *bp, struct platform_device *op,
+ if (!fp)
+ return;
+
++ INIT_LIST_HEAD(&fp->bp_list);
++ INIT_LIST_HEAD(&fp->glob_list);
++
+ fp->client = bbc_i2c_attach(bp, op);
+ if (!fp->client) {
+ kfree(fp);
+diff --git a/drivers/sbus/char/bbc_i2c.c b/drivers/sbus/char/bbc_i2c.c
+index c7763e482eb2..812b5f0361b6 100644
+--- a/drivers/sbus/char/bbc_i2c.c
++++ b/drivers/sbus/char/bbc_i2c.c
+@@ -300,13 +300,18 @@ static struct bbc_i2c_bus * attach_one_i2c(struct platform_device *op, int index
+ if (!bp)
+ return NULL;
+
++ INIT_LIST_HEAD(&bp->temps);
++ INIT_LIST_HEAD(&bp->fans);
++
+ bp->i2c_control_regs = of_ioremap(&op->resource[0], 0, 0x2, "bbc_i2c_regs");
+ if (!bp->i2c_control_regs)
+ goto fail;
+
+- bp->i2c_bussel_reg = of_ioremap(&op->resource[1], 0, 0x1, "bbc_i2c_bussel");
+- if (!bp->i2c_bussel_reg)
+- goto fail;
++ if (op->num_resources == 2) {
++ bp->i2c_bussel_reg = of_ioremap(&op->resource[1], 0, 0x1, "bbc_i2c_bussel");
++ if (!bp->i2c_bussel_reg)
++ goto fail;
++ }
+
+ bp->waiting = 0;
+ init_waitqueue_head(&bp->wq);
+diff --git a/drivers/tty/serial/sunsab.c b/drivers/tty/serial/sunsab.c
+index 2f57df9a71d9..a1e09c0d46f2 100644
+--- a/drivers/tty/serial/sunsab.c
++++ b/drivers/tty/serial/sunsab.c
+@@ -157,6 +157,15 @@ receive_chars(struct uart_sunsab_port *up,
+ (up->port.line == up->port.cons->index))
+ saw_console_brk = 1;
+
++ if (count == 0) {
++ if (unlikely(stat->sreg.isr1 & SAB82532_ISR1_BRK)) {
++ stat->sreg.isr0 &= ~(SAB82532_ISR0_PERR |
++ SAB82532_ISR0_FERR);
++ up->port.icount.brk++;
++ uart_handle_break(&up->port);
++ }
++ }
++
+ for (i = 0; i < count; i++) {
+ unsigned char ch = buf[i], flag;
+
+diff --git a/include/net/ip_tunnels.h b/include/net/ip_tunnels.h
+index a4daf9eb8562..8dd8cab88b87 100644
+--- a/include/net/ip_tunnels.h
++++ b/include/net/ip_tunnels.h
+@@ -40,6 +40,7 @@ struct ip_tunnel_prl_entry {
+
+ struct ip_tunnel_dst {
+ struct dst_entry __rcu *dst;
++ __be32 saddr;
+ };
+
+ struct ip_tunnel {
+diff --git a/lib/iovec.c b/lib/iovec.c
+index 7a7c2da4cddf..df3abd1eaa4a 100644
+--- a/lib/iovec.c
++++ b/lib/iovec.c
+@@ -85,6 +85,10 @@ EXPORT_SYMBOL(memcpy_toiovecend);
+ int memcpy_fromiovecend(unsigned char *kdata, const struct iovec *iov,
+ int offset, int len)
+ {
++ /* No data? Done! */
++ if (len == 0)
++ return 0;
++
+ /* Skip over the finished iovecs */
+ while (offset >= iov->iov_len) {
+ offset -= iov->iov_len;
+diff --git a/net/batman-adv/fragmentation.c b/net/batman-adv/fragmentation.c
+index f14e54a05691..022d18ab27a6 100644
+--- a/net/batman-adv/fragmentation.c
++++ b/net/batman-adv/fragmentation.c
+@@ -128,6 +128,7 @@ static bool batadv_frag_insert_packet(struct batadv_orig_node *orig_node,
+ {
+ struct batadv_frag_table_entry *chain;
+ struct batadv_frag_list_entry *frag_entry_new = NULL, *frag_entry_curr;
++ struct batadv_frag_list_entry *frag_entry_last = NULL;
+ struct batadv_frag_packet *frag_packet;
+ uint8_t bucket;
+ uint16_t seqno, hdr_size = sizeof(struct batadv_frag_packet);
+@@ -180,11 +181,14 @@ static bool batadv_frag_insert_packet(struct batadv_orig_node *orig_node,
+ ret = true;
+ goto out;
+ }
++
++ /* store current entry because it could be the last in list */
++ frag_entry_last = frag_entry_curr;
+ }
+
+- /* Reached the end of the list, so insert after 'frag_entry_curr'. */
+- if (likely(frag_entry_curr)) {
+- hlist_add_after(&frag_entry_curr->list, &frag_entry_new->list);
++ /* Reached the end of the list, so insert after 'frag_entry_last'. */
++ if (likely(frag_entry_last)) {
++ hlist_add_after(&frag_entry_last->list, &frag_entry_new->list);
+ chain->size += skb->len - hdr_size;
+ chain->timestamp = jiffies;
+ ret = true;
+diff --git a/net/core/skbuff.c b/net/core/skbuff.c
+index c1a33033cbe2..58ff88edbefd 100644
+--- a/net/core/skbuff.c
++++ b/net/core/skbuff.c
+@@ -2976,9 +2976,9 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
+ tail = nskb;
+
+ __copy_skb_header(nskb, head_skb);
+- nskb->mac_len = head_skb->mac_len;
+
+ skb_headers_offset_update(nskb, skb_headroom(nskb) - headroom);
++ skb_reset_mac_len(nskb);
+
+ skb_copy_from_linear_data_offset(head_skb, -tnl_hlen,
+ nskb->data - tnl_hlen,
+diff --git a/net/ipv4/ip_tunnel.c b/net/ipv4/ip_tunnel.c
+index 6f9de61dce5f..45920d928341 100644
+--- a/net/ipv4/ip_tunnel.c
++++ b/net/ipv4/ip_tunnel.c
+@@ -69,23 +69,25 @@ static unsigned int ip_tunnel_hash(__be32 key, __be32 remote)
+ }
+
+ static void __tunnel_dst_set(struct ip_tunnel_dst *idst,
+- struct dst_entry *dst)
++ struct dst_entry *dst, __be32 saddr)
+ {
+ struct dst_entry *old_dst;
+
+ dst_clone(dst);
+ old_dst = xchg((__force struct dst_entry **)&idst->dst, dst);
+ dst_release(old_dst);
++ idst->saddr = saddr;
+ }
+
+-static void tunnel_dst_set(struct ip_tunnel *t, struct dst_entry *dst)
++static void tunnel_dst_set(struct ip_tunnel *t,
++ struct dst_entry *dst, __be32 saddr)
+ {
+- __tunnel_dst_set(this_cpu_ptr(t->dst_cache), dst);
++ __tunnel_dst_set(this_cpu_ptr(t->dst_cache), dst, saddr);
+ }
+
+ static void tunnel_dst_reset(struct ip_tunnel *t)
+ {
+- tunnel_dst_set(t, NULL);
++ tunnel_dst_set(t, NULL, 0);
+ }
+
+ void ip_tunnel_dst_reset_all(struct ip_tunnel *t)
+@@ -93,20 +95,25 @@ void ip_tunnel_dst_reset_all(struct ip_tunnel *t)
+ int i;
+
+ for_each_possible_cpu(i)
+- __tunnel_dst_set(per_cpu_ptr(t->dst_cache, i), NULL);
++ __tunnel_dst_set(per_cpu_ptr(t->dst_cache, i), NULL, 0);
+ }
+ EXPORT_SYMBOL(ip_tunnel_dst_reset_all);
+
+-static struct rtable *tunnel_rtable_get(struct ip_tunnel *t, u32 cookie)
++static struct rtable *tunnel_rtable_get(struct ip_tunnel *t,
++ u32 cookie, __be32 *saddr)
+ {
++ struct ip_tunnel_dst *idst;
+ struct dst_entry *dst;
+
+ rcu_read_lock();
+- dst = rcu_dereference(this_cpu_ptr(t->dst_cache)->dst);
++ idst = this_cpu_ptr(t->dst_cache);
++ dst = rcu_dereference(idst->dst);
+ if (dst && !atomic_inc_not_zero(&dst->__refcnt))
+ dst = NULL;
+ if (dst) {
+- if (dst->obsolete && dst->ops->check(dst, cookie) == NULL) {
++ if (!dst->obsolete || dst->ops->check(dst, cookie)) {
++ *saddr = idst->saddr;
++ } else {
+ tunnel_dst_reset(t);
+ dst_release(dst);
+ dst = NULL;
+@@ -367,7 +374,7 @@ static int ip_tunnel_bind_dev(struct net_device *dev)
+
+ if (!IS_ERR(rt)) {
+ tdev = rt->dst.dev;
+- tunnel_dst_set(tunnel, &rt->dst);
++ tunnel_dst_set(tunnel, &rt->dst, fl4.saddr);
+ ip_rt_put(rt);
+ }
+ if (dev->type != ARPHRD_ETHER)
+@@ -610,7 +617,7 @@ void ip_tunnel_xmit(struct sk_buff *skb, struct net_device *dev,
+ init_tunnel_flow(&fl4, protocol, dst, tnl_params->saddr,
+ tunnel->parms.o_key, RT_TOS(tos), tunnel->parms.link);
+
+- rt = connected ? tunnel_rtable_get(tunnel, 0) : NULL;
++ rt = connected ? tunnel_rtable_get(tunnel, 0, &fl4.saddr) : NULL;
+
+ if (!rt) {
+ rt = ip_route_output_key(tunnel->net, &fl4);
+@@ -620,7 +627,7 @@ void ip_tunnel_xmit(struct sk_buff *skb, struct net_device *dev,
+ goto tx_error;
+ }
+ if (connected)
+- tunnel_dst_set(tunnel, &rt->dst);
++ tunnel_dst_set(tunnel, &rt->dst, fl4.saddr);
+ }
+
+ if (rt->dst.dev == dev) {
+diff --git a/net/ipv4/tcp_vegas.c b/net/ipv4/tcp_vegas.c
+index 9a5e05f27f4f..b40ad897f945 100644
+--- a/net/ipv4/tcp_vegas.c
++++ b/net/ipv4/tcp_vegas.c
+@@ -218,7 +218,8 @@ static void tcp_vegas_cong_avoid(struct sock *sk, u32 ack, u32 acked)
+ * This is:
+ * (actual rate in segments) * baseRTT
+ */
+- target_cwnd = tp->snd_cwnd * vegas->baseRTT / rtt;
++ target_cwnd = (u64)tp->snd_cwnd * vegas->baseRTT;
++ do_div(target_cwnd, rtt);
+
+ /* Calculate the difference between the window we had,
+ * and the window we would like to have. This quantity
+diff --git a/net/ipv4/tcp_veno.c b/net/ipv4/tcp_veno.c
+index 27b9825753d1..8276977d2c85 100644
+--- a/net/ipv4/tcp_veno.c
++++ b/net/ipv4/tcp_veno.c
+@@ -144,7 +144,7 @@ static void tcp_veno_cong_avoid(struct sock *sk, u32 ack, u32 acked)
+
+ rtt = veno->minrtt;
+
+- target_cwnd = (tp->snd_cwnd * veno->basertt);
++ target_cwnd = (u64)tp->snd_cwnd * veno->basertt;
+ target_cwnd <<= V_PARAM_SHIFT;
+ do_div(target_cwnd, rtt);
+
+diff --git a/net/sctp/output.c b/net/sctp/output.c
+index 01ab8e0723f0..407ae2bf97b0 100644
+--- a/net/sctp/output.c
++++ b/net/sctp/output.c
+@@ -599,7 +599,7 @@ out:
+ return err;
+ no_route:
+ kfree_skb(nskb);
+- IP_INC_STATS_BH(sock_net(asoc->base.sk), IPSTATS_MIB_OUTNOROUTES);
++ IP_INC_STATS(sock_net(asoc->base.sk), IPSTATS_MIB_OUTNOROUTES);
+
+ /* FIXME: Returning the 'err' will effect all the associations
+ * associated with a socket, although only one of the paths of the
++ }
++}
+diff --git a/drivers/net/ethernet/broadcom/tg3.c b/drivers/net/ethernet/broadcom/tg3.c
+index 8afa579e7c40..a3dd5dc64f4c 100644
+--- a/drivers/net/ethernet/broadcom/tg3.c
++++ b/drivers/net/ethernet/broadcom/tg3.c
+@@ -7830,17 +7830,18 @@ static int tigon3_dma_hwbug_workaround(struct tg3_napi *tnapi,
+
+ static netdev_tx_t tg3_start_xmit(struct sk_buff *, struct net_device *);
+
+-/* Use GSO to workaround a rare TSO bug that may be triggered when the
+- * TSO header is greater than 80 bytes.
++/* Use GSO to workaround all TSO packets that meet HW bug conditions
++ * indicated in tg3_tx_frag_set()
+ */
+-static int tg3_tso_bug(struct tg3 *tp, struct sk_buff *skb)
++static int tg3_tso_bug(struct tg3 *tp, struct tg3_napi *tnapi,
++ struct netdev_queue *txq, struct sk_buff *skb)
+ {
+ struct sk_buff *segs, *nskb;
+ u32 frag_cnt_est = skb_shinfo(skb)->gso_segs * 3;
+
+ /* Estimate the number of fragments in the worst case */
+- if (unlikely(tg3_tx_avail(&tp->napi[0]) <= frag_cnt_est)) {
+- netif_stop_queue(tp->dev);
++ if (unlikely(tg3_tx_avail(tnapi) <= frag_cnt_est)) {
++ netif_tx_stop_queue(txq);
+
+ /* netif_tx_stop_queue() must be done before checking
+ * checking tx index in tg3_tx_avail() below, because in
+@@ -7848,13 +7849,14 @@ static int tg3_tso_bug(struct tg3 *tp, struct sk_buff *skb)
+ * netif_tx_queue_stopped().
+ */
+ smp_mb();
+- if (tg3_tx_avail(&tp->napi[0]) <= frag_cnt_est)
++ if (tg3_tx_avail(tnapi) <= frag_cnt_est)
+ return NETDEV_TX_BUSY;
+
+- netif_wake_queue(tp->dev);
++ netif_tx_wake_queue(txq);
+ }
+
+- segs = skb_gso_segment(skb, tp->dev->features & ~(NETIF_F_TSO | NETIF_F_TSO6));
++ segs = skb_gso_segment(skb, tp->dev->features &
++ ~(NETIF_F_TSO | NETIF_F_TSO6));
+ if (IS_ERR(segs) || !segs)
+ goto tg3_tso_bug_end;
+
+@@ -7930,7 +7932,7 @@ static netdev_tx_t tg3_start_xmit(struct sk_buff *skb, struct net_device *dev)
+ if (!skb_is_gso_v6(skb)) {
+ if (unlikely((ETH_HLEN + hdr_len) > 80) &&
+ tg3_flag(tp, TSO_BUG))
+- return tg3_tso_bug(tp, skb);
++ return tg3_tso_bug(tp, tnapi, txq, skb);
+
+ ip_csum = iph->check;
+ ip_tot_len = iph->tot_len;
+@@ -8061,7 +8063,7 @@ static netdev_tx_t tg3_start_xmit(struct sk_buff *skb, struct net_device *dev)
+ iph->tot_len = ip_tot_len;
+ }
+ tcph->check = tcp_csum;
+- return tg3_tso_bug(tp, skb);
++ return tg3_tso_bug(tp, tnapi, txq, skb);
+ }
+
+ /* If the workaround fails due to memory/mapping
+diff --git a/drivers/net/ethernet/brocade/bna/bnad.c b/drivers/net/ethernet/brocade/bna/bnad.c
+index 3a77f9ead004..556aab75f490 100644
+--- a/drivers/net/ethernet/brocade/bna/bnad.c
++++ b/drivers/net/ethernet/brocade/bna/bnad.c
+@@ -600,9 +600,9 @@ bnad_cq_process(struct bnad *bnad, struct bna_ccb *ccb, int budget)
+ prefetch(bnad->netdev);
+
+ cq = ccb->sw_q;
+- cmpl = &cq[ccb->producer_index];
+
+ while (packets < budget) {
++ cmpl = &cq[ccb->producer_index];
+ if (!cmpl->valid)
+ break;
+ /* The 'valid' field is set by the adapter, only after writing
+diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
+index 958df383068a..ef8a5c20236a 100644
+--- a/drivers/net/macvlan.c
++++ b/drivers/net/macvlan.c
+@@ -646,6 +646,7 @@ static int macvlan_init(struct net_device *dev)
+ (lowerdev->state & MACVLAN_STATE_MASK);
+ dev->features = lowerdev->features & MACVLAN_FEATURES;
+ dev->features |= ALWAYS_ON_FEATURES;
++ dev->vlan_features = lowerdev->vlan_features & MACVLAN_FEATURES;
+ dev->gso_max_size = lowerdev->gso_max_size;
+ dev->iflink = lowerdev->ifindex;
+ dev->hard_header_len = lowerdev->hard_header_len;
+diff --git a/drivers/net/phy/mdio_bus.c b/drivers/net/phy/mdio_bus.c
+index 203651ebccb0..4eaadcfcb0fe 100644
+--- a/drivers/net/phy/mdio_bus.c
++++ b/drivers/net/phy/mdio_bus.c
+@@ -255,7 +255,6 @@ int mdiobus_register(struct mii_bus *bus)
+
+ bus->dev.parent = bus->parent;
+ bus->dev.class = &mdio_bus_class;
+- bus->dev.driver = bus->parent->driver;
+ bus->dev.groups = NULL;
+ dev_set_name(&bus->dev, "%s", bus->id);
+
+diff --git a/drivers/sbus/char/bbc_envctrl.c b/drivers/sbus/char/bbc_envctrl.c
+index 160e7510aca6..0787b9756165 100644
+--- a/drivers/sbus/char/bbc_envctrl.c
++++ b/drivers/sbus/char/bbc_envctrl.c
+@@ -452,6 +452,9 @@ static void attach_one_temp(struct bbc_i2c_bus *bp, struct platform_device *op,
+ if (!tp)
+ return;
+
++ INIT_LIST_HEAD(&tp->bp_list);
++ INIT_LIST_HEAD(&tp->glob_list);
++
+ tp->client = bbc_i2c_attach(bp, op);
+ if (!tp->client) {
+ kfree(tp);
+@@ -497,6 +500,9 @@ static void attach_one_fan(struct bbc_i2c_bus *bp, struct platform_device *op,
+ if (!fp)
+ return;
+
++ INIT_LIST_HEAD(&fp->bp_list);
++ INIT_LIST_HEAD(&fp->glob_list);
++
+ fp->client = bbc_i2c_attach(bp, op);
+ if (!fp->client) {
+ kfree(fp);
+diff --git a/drivers/sbus/char/bbc_i2c.c b/drivers/sbus/char/bbc_i2c.c
+index c7763e482eb2..812b5f0361b6 100644
+--- a/drivers/sbus/char/bbc_i2c.c
++++ b/drivers/sbus/char/bbc_i2c.c
+@@ -300,13 +300,18 @@ static struct bbc_i2c_bus * attach_one_i2c(struct platform_device *op, int index
+ if (!bp)
+ return NULL;
+
++ INIT_LIST_HEAD(&bp->temps);
++ INIT_LIST_HEAD(&bp->fans);
++
+ bp->i2c_control_regs = of_ioremap(&op->resource[0], 0, 0x2, "bbc_i2c_regs");
+ if (!bp->i2c_control_regs)
+ goto fail;
+
+- bp->i2c_bussel_reg = of_ioremap(&op->resource[1], 0, 0x1, "bbc_i2c_bussel");
+- if (!bp->i2c_bussel_reg)
+- goto fail;
++ if (op->num_resources == 2) {
++ bp->i2c_bussel_reg = of_ioremap(&op->resource[1], 0, 0x1, "bbc_i2c_bussel");
++ if (!bp->i2c_bussel_reg)
++ goto fail;
++ }
+
+ bp->waiting = 0;
+ init_waitqueue_head(&bp->wq);
+diff --git a/drivers/tty/serial/sunsab.c b/drivers/tty/serial/sunsab.c
+index 2f57df9a71d9..a1e09c0d46f2 100644
+--- a/drivers/tty/serial/sunsab.c
++++ b/drivers/tty/serial/sunsab.c
+@@ -157,6 +157,15 @@ receive_chars(struct uart_sunsab_port *up,
+ (up->port.line == up->port.cons->index))
+ saw_console_brk = 1;
+
++ if (count == 0) {
++ if (unlikely(stat->sreg.isr1 & SAB82532_ISR1_BRK)) {
++ stat->sreg.isr0 &= ~(SAB82532_ISR0_PERR |
++ SAB82532_ISR0_FERR);
++ up->port.icount.brk++;
++ uart_handle_break(&up->port);
++ }
++ }
++
+ for (i = 0; i < count; i++) {
+ unsigned char ch = buf[i], flag;
+
+diff --git a/include/net/ip_tunnels.h b/include/net/ip_tunnels.h
+index a4daf9eb8562..8dd8cab88b87 100644
+--- a/include/net/ip_tunnels.h
++++ b/include/net/ip_tunnels.h
+@@ -40,6 +40,7 @@ struct ip_tunnel_prl_entry {
+
+ struct ip_tunnel_dst {
+ struct dst_entry __rcu *dst;
++ __be32 saddr;
+ };
+
+ struct ip_tunnel {
+diff --git a/lib/iovec.c b/lib/iovec.c
+index 7a7c2da4cddf..df3abd1eaa4a 100644
+--- a/lib/iovec.c
++++ b/lib/iovec.c
+@@ -85,6 +85,10 @@ EXPORT_SYMBOL(memcpy_toiovecend);
+ int memcpy_fromiovecend(unsigned char *kdata, const struct iovec *iov,
+ int offset, int len)
+ {
++ /* No data? Done! */
++ if (len == 0)
++ return 0;
++
+ /* Skip over the finished iovecs */
+ while (offset >= iov->iov_len) {
+ offset -= iov->iov_len;
+diff --git a/net/batman-adv/fragmentation.c b/net/batman-adv/fragmentation.c
+index f14e54a05691..022d18ab27a6 100644
+--- a/net/batman-adv/fragmentation.c
++++ b/net/batman-adv/fragmentation.c
+@@ -128,6 +128,7 @@ static bool batadv_frag_insert_packet(struct batadv_orig_node *orig_node,
+ {
+ struct batadv_frag_table_entry *chain;
+ struct batadv_frag_list_entry *frag_entry_new = NULL, *frag_entry_curr;
++ struct batadv_frag_list_entry *frag_entry_last = NULL;
+ struct batadv_frag_packet *frag_packet;
+ uint8_t bucket;
+ uint16_t seqno, hdr_size = sizeof(struct batadv_frag_packet);
+@@ -180,11 +181,14 @@ static bool batadv_frag_insert_packet(struct batadv_orig_node *orig_node,
+ ret = true;
+ goto out;
+ }
++
++ /* store current entry because it could be the last in list */
++ frag_entry_last = frag_entry_curr;
+ }
+
+- /* Reached the end of the list, so insert after 'frag_entry_curr'. */
+- if (likely(frag_entry_curr)) {
+- hlist_add_after(&frag_entry_curr->list, &frag_entry_new->list);
++ /* Reached the end of the list, so insert after 'frag_entry_last'. */
++ if (likely(frag_entry_last)) {
++ hlist_add_after(&frag_entry_last->list, &frag_entry_new->list);
+ chain->size += skb->len - hdr_size;
+ chain->timestamp = jiffies;
+ ret = true;
+diff --git a/net/core/skbuff.c b/net/core/skbuff.c
+index c1a33033cbe2..58ff88edbefd 100644
+--- a/net/core/skbuff.c
++++ b/net/core/skbuff.c
+@@ -2976,9 +2976,9 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
+ tail = nskb;
+
+ __copy_skb_header(nskb, head_skb);
+- nskb->mac_len = head_skb->mac_len;
+
+ skb_headers_offset_update(nskb, skb_headroom(nskb) - headroom);
++ skb_reset_mac_len(nskb);
+
+ skb_copy_from_linear_data_offset(head_skb, -tnl_hlen,
+ nskb->data - tnl_hlen,
+diff --git a/net/ipv4/ip_tunnel.c b/net/ipv4/ip_tunnel.c
+index 6f9de61dce5f..45920d928341 100644
+--- a/net/ipv4/ip_tunnel.c
++++ b/net/ipv4/ip_tunnel.c
+@@ -69,23 +69,25 @@ static unsigned int ip_tunnel_hash(__be32 key, __be32 remote)
+ }
+
+ static void __tunnel_dst_set(struct ip_tunnel_dst *idst,
+- struct dst_entry *dst)
++ struct dst_entry *dst, __be32 saddr)
+ {
+ struct dst_entry *old_dst;
+
+ dst_clone(dst);
+ old_dst = xchg((__force struct dst_entry **)&idst->dst, dst);
+ dst_release(old_dst);
++ idst->saddr = saddr;
+ }
+
+-static void tunnel_dst_set(struct ip_tunnel *t, struct dst_entry *dst)
++static void tunnel_dst_set(struct ip_tunnel *t,
++ struct dst_entry *dst, __be32 saddr)
+ {
+- __tunnel_dst_set(this_cpu_ptr(t->dst_cache), dst);
++ __tunnel_dst_set(this_cpu_ptr(t->dst_cache), dst, saddr);
+ }
+
+ static void tunnel_dst_reset(struct ip_tunnel *t)
+ {
+- tunnel_dst_set(t, NULL);
++ tunnel_dst_set(t, NULL, 0);
+ }
+
+ void ip_tunnel_dst_reset_all(struct ip_tunnel *t)
+@@ -93,20 +95,25 @@ void ip_tunnel_dst_reset_all(struct ip_tunnel *t)
+ int i;
+
+ for_each_possible_cpu(i)
+- __tunnel_dst_set(per_cpu_ptr(t->dst_cache, i), NULL);
++ __tunnel_dst_set(per_cpu_ptr(t->dst_cache, i), NULL, 0);
+ }
+ EXPORT_SYMBOL(ip_tunnel_dst_reset_all);
+
+-static struct rtable *tunnel_rtable_get(struct ip_tunnel *t, u32 cookie)
++static struct rtable *tunnel_rtable_get(struct ip_tunnel *t,
++ u32 cookie, __be32 *saddr)
+ {
++ struct ip_tunnel_dst *idst;
+ struct dst_entry *dst;
+
+ rcu_read_lock();
+- dst = rcu_dereference(this_cpu_ptr(t->dst_cache)->dst);
++ idst = this_cpu_ptr(t->dst_cache);
++ dst = rcu_dereference(idst->dst);
+ if (dst && !atomic_inc_not_zero(&dst->__refcnt))
+ dst = NULL;
+ if (dst) {
+- if (dst->obsolete && dst->ops->check(dst, cookie) == NULL) {
++ if (!dst->obsolete || dst->ops->check(dst, cookie)) {
++ *saddr = idst->saddr;
++ } else {
+ tunnel_dst_reset(t);
+ dst_release(dst);
+ dst = NULL;
+@@ -367,7 +374,7 @@ static int ip_tunnel_bind_dev(struct net_device *dev)
+
+ if (!IS_ERR(rt)) {
+ tdev = rt->dst.dev;
+- tunnel_dst_set(tunnel, &rt->dst);
++ tunnel_dst_set(tunnel, &rt->dst, fl4.saddr);
+ ip_rt_put(rt);
+ }
+ if (dev->type != ARPHRD_ETHER)
+@@ -610,7 +617,7 @@ void ip_tunnel_xmit(struct sk_buff *skb, struct net_device *dev,
+ init_tunnel_flow(&fl4, protocol, dst, tnl_params->saddr,
+ tunnel->parms.o_key, RT_TOS(tos), tunnel->parms.link);
+
+- rt = connected ? tunnel_rtable_get(tunnel, 0) : NULL;
++ rt = connected ? tunnel_rtable_get(tunnel, 0, &fl4.saddr) : NULL;
+
+ if (!rt) {
+ rt = ip_route_output_key(tunnel->net, &fl4);
+@@ -620,7 +627,7 @@ void ip_tunnel_xmit(struct sk_buff *skb, struct net_device *dev,
+ goto tx_error;
+ }
+ if (connected)
+- tunnel_dst_set(tunnel, &rt->dst);
++ tunnel_dst_set(tunnel, &rt->dst, fl4.saddr);
+ }
+
+ if (rt->dst.dev == dev) {
+diff --git a/net/ipv4/tcp_vegas.c b/net/ipv4/tcp_vegas.c
+index 9a5e05f27f4f..b40ad897f945 100644
+--- a/net/ipv4/tcp_vegas.c
++++ b/net/ipv4/tcp_vegas.c
+@@ -218,7 +218,8 @@ static void tcp_vegas_cong_avoid(struct sock *sk, u32 ack, u32 acked)
+ * This is:
+ * (actual rate in segments) * baseRTT
+ */
+- target_cwnd = tp->snd_cwnd * vegas->baseRTT / rtt;
++ target_cwnd = (u64)tp->snd_cwnd * vegas->baseRTT;
++ do_div(target_cwnd, rtt);
+
+ /* Calculate the difference between the window we had,
+ * and the window we would like to have. This quantity
+diff --git a/net/ipv4/tcp_veno.c b/net/ipv4/tcp_veno.c
+index 27b9825753d1..8276977d2c85 100644
+--- a/net/ipv4/tcp_veno.c
++++ b/net/ipv4/tcp_veno.c
+@@ -144,7 +144,7 @@ static void tcp_veno_cong_avoid(struct sock *sk, u32 ack, u32 acked)
+
+ rtt = veno->minrtt;
+
+- target_cwnd = (tp->snd_cwnd * veno->basertt);
++ target_cwnd = (u64)tp->snd_cwnd * veno->basertt;
+ target_cwnd <<= V_PARAM_SHIFT;
+ do_div(target_cwnd, rtt);
+
+diff --git a/net/sctp/output.c b/net/sctp/output.c
+index 01ab8e0723f0..407ae2bf97b0 100644
+--- a/net/sctp/output.c
++++ b/net/sctp/output.c
+@@ -599,7 +599,7 @@ out:
+ return err;
+ no_route:
+ kfree_skb(nskb);
+- IP_INC_STATS_BH(sock_net(asoc->base.sk), IPSTATS_MIB_OUTNOROUTES);
++ IP_INC_STATS(sock_net(asoc->base.sk), IPSTATS_MIB_OUTNOROUTES);
+
+ /* FIXME: Returning the 'err' will effect all the associations
+ * associated with a socket, although only one of the paths of the
* [gentoo-commits] proj/linux-patches:3.16 commit in: /
2014-08-08 19:48 Mike Pagano
@ 2014-08-19 11:44 ` Mike Pagano
0 siblings, 0 replies; 26+ messages in thread
From: Mike Pagano @ 2014-08-19 11:44 UTC (permalink / raw
To: gentoo-commits
commit: 9df8c18cd85acf5655794c6de5da3a0690675965
Author: Mike Pagano <mpagano <AT> gentoo <DOT> org>
AuthorDate: Fri Aug 8 19:48:09 2014 +0000
Commit: Mike Pagano <mpagano <AT> gentoo <DOT> org>
CommitDate: Fri Aug 8 19:48:09 2014 +0000
URL: http://sources.gentoo.org/gitweb/?p=proj/linux-patches.git;a=commit;h=9df8c18c
BFQ patch for 3.16
---
0000_README | 11 +
...-cgroups-kconfig-build-bits-for-v7r5-3.16.patch | 104 +
...ck-introduce-the-v7r5-I-O-sched-for-3.16.patch1 | 6635 ++++++++++++++++++++
...add-Early-Queue-Merge-EQM-v7r5-for-3.16.0.patch | 1188 ++++
4 files changed, 7938 insertions(+)
diff --git a/0000_README b/0000_README
index da7da0d..a6ec2e6 100644
--- a/0000_README
+++ b/0000_README
@@ -75,3 +75,14 @@ Patch: 5000_enable-additional-cpu-optimizations-for-gcc.patch
From: https://github.com/graysky2/kernel_gcc_patch/
Desc: Kernel patch enables gcc optimizations for additional CPUs.
+Patch: 5001_BFQ-1-block-cgroups-kconfig-build-bits-for-v7r5-3.16.patch
+From: http://algo.ing.unimo.it/people/paolo/disk_sched/
+Desc: BFQ v7r5 patch 1 for 3.16: Build, cgroups and kconfig bits
+
+Patch: 5002_BFQ-2-block-introduce-the-v7r5-I-O-sched-for-3.16.patch1
+From: http://algo.ing.unimo.it/people/paolo/disk_sched/
+Desc: BFQ v7r5 patch 2 for 3.16: BFQ Scheduler
+
+Patch: 5003_BFQ-3-block-add-Early-Queue-Merge-EQM-v7r5-for-3.16.0.patch
+From: http://algo.ing.unimo.it/people/paolo/disk_sched/
+Desc: BFQ v7r5 patch 3 for 3.16: Early Queue Merge (EQM)
diff --git a/5001_BFQ-1-block-cgroups-kconfig-build-bits-for-v7r5-3.16.patch b/5001_BFQ-1-block-cgroups-kconfig-build-bits-for-v7r5-3.16.patch
new file mode 100644
index 0000000..088bd05
--- /dev/null
+++ b/5001_BFQ-1-block-cgroups-kconfig-build-bits-for-v7r5-3.16.patch
@@ -0,0 +1,104 @@
+From 6519e5beef1063a86d3fc917cff2592cb599e824 Mon Sep 17 00:00:00 2001
+From: Paolo Valente <paolo.valente@unimore.it>
+Date: Thu, 22 May 2014 11:59:35 +0200
+Subject: [PATCH 1/3] block: cgroups, kconfig, build bits for BFQ-v7r5-3.16
+
+Update Kconfig.iosched and do the related Makefile changes to include
+kernel configuration options for BFQ. Also add the bfqio controller
+to the cgroups subsystem.
+
+Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
+Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
+---
+ block/Kconfig.iosched | 32 ++++++++++++++++++++++++++++++++
+ block/Makefile | 1 +
+ include/linux/cgroup_subsys.h | 4 ++++
+ 3 files changed, 37 insertions(+)
+
+diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
+index 421bef9..0ee5f0f 100644
+--- a/block/Kconfig.iosched
++++ b/block/Kconfig.iosched
+@@ -39,6 +39,27 @@ config CFQ_GROUP_IOSCHED
+ ---help---
+ Enable group IO scheduling in CFQ.
+
++config IOSCHED_BFQ
++ tristate "BFQ I/O scheduler"
++ default n
++ ---help---
++ The BFQ I/O scheduler tries to distribute bandwidth among
++ all processes according to their weights.
++ It aims at distributing the bandwidth as desired, independently of
++ the disk parameters and with any workload. It also tries to
++ guarantee low latency to interactive and soft real-time
++ applications. If compiled built-in (saying Y here), BFQ can
++ be configured to support hierarchical scheduling.
++
++config CGROUP_BFQIO
++ bool "BFQ hierarchical scheduling support"
++ depends on CGROUPS && IOSCHED_BFQ=y
++ default n
++ ---help---
++ Enable hierarchical scheduling in BFQ, using the cgroups
++ filesystem interface. The name of the subsystem will be
++ bfqio.
++
+ choice
+ prompt "Default I/O scheduler"
+ default DEFAULT_CFQ
+@@ -52,6 +73,16 @@ choice
+ config DEFAULT_CFQ
+ bool "CFQ" if IOSCHED_CFQ=y
+
++ config DEFAULT_BFQ
++ bool "BFQ" if IOSCHED_BFQ=y
++ help
++ Selects BFQ as the default I/O scheduler which will be
++ used by default for all block devices.
++ The BFQ I/O scheduler aims at distributing the bandwidth
++ as desired, independently of the disk parameters and with
++ any workload. It also tries to guarantee low latency to
++ interactive and soft real-time applications.
++
+ config DEFAULT_NOOP
+ bool "No-op"
+
+@@ -61,6 +92,7 @@ config DEFAULT_IOSCHED
+ string
+ default "deadline" if DEFAULT_DEADLINE
+ default "cfq" if DEFAULT_CFQ
++ default "bfq" if DEFAULT_BFQ
+ default "noop" if DEFAULT_NOOP
+
+ endmenu
+diff --git a/block/Makefile b/block/Makefile
+index a2ce6ac..a0fc06a 100644
+--- a/block/Makefile
++++ b/block/Makefile
+@@ -18,6 +18,7 @@ obj-$(CONFIG_BLK_DEV_THROTTLING) += blk-throttle.o
+ obj-$(CONFIG_IOSCHED_NOOP) += noop-iosched.o
+ obj-$(CONFIG_IOSCHED_DEADLINE) += deadline-iosched.o
+ obj-$(CONFIG_IOSCHED_CFQ) += cfq-iosched.o
++obj-$(CONFIG_IOSCHED_BFQ) += bfq-iosched.o
+
+ obj-$(CONFIG_BLOCK_COMPAT) += compat_ioctl.o
+ obj-$(CONFIG_BLK_DEV_INTEGRITY) += blk-integrity.o
+diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
+index 98c4f9b..13b010d 100644
+--- a/include/linux/cgroup_subsys.h
++++ b/include/linux/cgroup_subsys.h
+@@ -35,6 +35,10 @@ SUBSYS(net_cls)
+ SUBSYS(blkio)
+ #endif
+
++#if IS_ENABLED(CONFIG_CGROUP_BFQIO)
++SUBSYS(bfqio)
++#endif
++
+ #if IS_ENABLED(CONFIG_CGROUP_PERF)
+ SUBSYS(perf_event)
+ #endif
+--
+2.0.3
+
diff --git a/5002_BFQ-2-block-introduce-the-v7r5-I-O-sched-for-3.16.patch1 b/5002_BFQ-2-block-introduce-the-v7r5-I-O-sched-for-3.16.patch1
new file mode 100644
index 0000000..6f630ba
--- /dev/null
+++ b/5002_BFQ-2-block-introduce-the-v7r5-I-O-sched-for-3.16.patch1
@@ -0,0 +1,6635 @@
+From c56e6c5db41f7137d3e0b38063ef0c944eec1898 Mon Sep 17 00:00:00 2001
+From: Paolo Valente <paolo.valente@unimore.it>
+Date: Thu, 9 May 2013 19:10:02 +0200
+Subject: [PATCH 2/3] block: introduce the BFQ-v7r5 I/O sched for 3.16
+
+Add the BFQ-v7r5 I/O scheduler to 3.16.
+The general structure is borrowed from CFQ, as much of the code for
+handling I/O contexts. Over time, several useful features have been
+ported from CFQ as well (details in the changelog in README.BFQ). A
+(bfq_)queue is associated to each task doing I/O on a device, and each
+time a scheduling decision has to be made a queue is selected and served
+until it expires.
+
+ - Slices are given in the service domain: tasks are assigned
+ budgets, measured in number of sectors. Once got the disk, a task
+ must however consume its assigned budget within a configurable
+ maximum time (by default, the maximum possible value of the
+ budgets is automatically computed to comply with this timeout).
+ This allows the desired latency vs "throughput boosting" tradeoff
+ to be set.
+
+ - Budgets are scheduled according to a variant of WF2Q+, implemented
+ using an augmented rb-tree to take eligibility into account while
+ preserving an O(log N) overall complexity.
+
+ - A low-latency tunable is provided; if enabled, both interactive
+ and soft real-time applications are guaranteed a very low latency.
+
+ - Latency guarantees are preserved also in the presence of NCQ.
+
+ - Also with flash-based devices, a high throughput is achieved
+ while still preserving latency guarantees.
+
+ - BFQ features Early Queue Merge (EQM), a sort of fusion of the
+ cooperating-queue-merging and the preemption mechanisms present
+ in CFQ. EQM is in fact a unified mechanism that tries to get a
+ sequential read pattern, and hence a high throughput, with any
+ set of processes performing interleaved I/O over a contiguous
+ sequence of sectors.
+
+ - BFQ supports full hierarchical scheduling, exporting a cgroups
+ interface. Since each node has a full scheduler, each group can
+ be assigned its own weight.
+
+ - If the cgroups interface is not used, only I/O priorities can be
+ assigned to processes, with ioprio values mapped to weights
+ with the relation weight = IOPRIO_BE_NR - ioprio.
+
+ - ioprio classes are served in strict priority order, i.e., lower
+ priority queues are not served as long as there are higher
+ priority queues. Among queues in the same class the bandwidth is
+ distributed in proportion to the weight of each queue. A very
+ thin extra bandwidth is however guaranteed to the Idle class, to
+ prevent it from starving.
+
+Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
+Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
+---
+ block/bfq-cgroup.c | 930 +++++++++++++
+ block/bfq-ioc.c | 36 +
+ block/bfq-iosched.c | 3617 +++++++++++++++++++++++++++++++++++++++++++++++++++
+ block/bfq-sched.c | 1207 +++++++++++++++++
+ block/bfq.h | 742 +++++++++++
+ 5 files changed, 6532 insertions(+)
+ create mode 100644 block/bfq-cgroup.c
+ create mode 100644 block/bfq-ioc.c
+ create mode 100644 block/bfq-iosched.c
+ create mode 100644 block/bfq-sched.c
+ create mode 100644 block/bfq.h
+
+diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c
+new file mode 100644
+index 0000000..f742806
+--- /dev/null
++++ b/block/bfq-cgroup.c
+@@ -0,0 +1,930 @@
++/*
++ * BFQ: CGROUPS support.
++ *
++ * Based on ideas and code from CFQ:
++ * Copyright (C) 2003 Jens Axboe <axboe@kernel.dk>
++ *
++ * Copyright (C) 2008 Fabio Checconi <fabio@gandalf.sssup.it>
++ * Paolo Valente <paolo.valente@unimore.it>
++ *
++ * Copyright (C) 2010 Paolo Valente <paolo.valente@unimore.it>
++ *
++ * Licensed under the GPL-2 as detailed in the accompanying COPYING.BFQ
++ * file.
++ */
++
++#ifdef CONFIG_CGROUP_BFQIO
++
++static DEFINE_MUTEX(bfqio_mutex);
++
++static bool bfqio_is_removed(struct bfqio_cgroup *bgrp)
++{
++ return bgrp ? !bgrp->online : false;
++}
++
++static struct bfqio_cgroup bfqio_root_cgroup = {
++ .weight = BFQ_DEFAULT_GRP_WEIGHT,
++ .ioprio = BFQ_DEFAULT_GRP_IOPRIO,
++ .ioprio_class = BFQ_DEFAULT_GRP_CLASS,
++};
++
++static inline void bfq_init_entity(struct bfq_entity *entity,
++ struct bfq_group *bfqg)
++{
++ entity->weight = entity->new_weight;
++ entity->orig_weight = entity->new_weight;
++ entity->ioprio = entity->new_ioprio;
++ entity->ioprio_class = entity->new_ioprio_class;
++ entity->parent = bfqg->my_entity;
++ entity->sched_data = &bfqg->sched_data;
++}
++
++static struct bfqio_cgroup *css_to_bfqio(struct cgroup_subsys_state *css)
++{
++ return css ? container_of(css, struct bfqio_cgroup, css) : NULL;
++}
++
++/*
++ * Search the bfq_group for bfqd into the hash table (by now only a list)
++ * of bgrp. Must be called under rcu_read_lock().
++ */
++static struct bfq_group *bfqio_lookup_group(struct bfqio_cgroup *bgrp,
++ struct bfq_data *bfqd)
++{
++ struct bfq_group *bfqg;
++ void *key;
++
++ hlist_for_each_entry_rcu(bfqg, &bgrp->group_data, group_node) {
++ key = rcu_dereference(bfqg->bfqd);
++ if (key == bfqd)
++ return bfqg;
++ }
++
++ return NULL;
++}
++
++static inline void bfq_group_init_entity(struct bfqio_cgroup *bgrp,
++ struct bfq_group *bfqg)
++{
++ struct bfq_entity *entity = &bfqg->entity;
++
++ /*
++ * If the weight of the entity has never been set via the sysfs
++ * interface, then bgrp->weight == 0. In this case we initialize
++ * the weight from the current ioprio value. Otherwise, the group
++ * weight, if set, has priority over the ioprio value.
++ */
++ if (bgrp->weight == 0) {
++ entity->new_weight = bfq_ioprio_to_weight(bgrp->ioprio);
++ entity->new_ioprio = bgrp->ioprio;
++ } else {
++ entity->new_weight = bgrp->weight;
++ entity->new_ioprio = bfq_weight_to_ioprio(bgrp->weight);
++ }
++ entity->orig_weight = entity->weight = entity->new_weight;
++ entity->ioprio = entity->new_ioprio;
++ entity->ioprio_class = entity->new_ioprio_class = bgrp->ioprio_class;
++ entity->my_sched_data = &bfqg->sched_data;
++ bfqg->active_entities = 0;
++}
++
++static inline void bfq_group_set_parent(struct bfq_group *bfqg,
++ struct bfq_group *parent)
++{
++ struct bfq_entity *entity;
++
++ BUG_ON(parent == NULL);
++ BUG_ON(bfqg == NULL);
++
++ entity = &bfqg->entity;
++ entity->parent = parent->my_entity;
++ entity->sched_data = &parent->sched_data;
++}
++
++/**
++ * bfq_group_chain_alloc - allocate a chain of groups.
++ * @bfqd: queue descriptor.
++ * @css: the leaf cgroup_subsys_state this chain starts from.
++ *
++ * Allocate a chain of groups starting from the one belonging to
++ * @cgroup up to the root cgroup. Stop if a cgroup on the chain
++ * to the root has already an allocated group on @bfqd.
++ */
++static struct bfq_group *bfq_group_chain_alloc(struct bfq_data *bfqd,
++ struct cgroup_subsys_state *css)
++{
++ struct bfqio_cgroup *bgrp;
++ struct bfq_group *bfqg, *prev = NULL, *leaf = NULL;
++
++ for (; css != NULL; css = css->parent) {
++ bgrp = css_to_bfqio(css);
++
++ bfqg = bfqio_lookup_group(bgrp, bfqd);
++ if (bfqg != NULL) {
++ /*
++ * All the cgroups in the path from there to the
++ * root must have a bfq_group for bfqd, so we don't
++ * need any more allocations.
++ */
++ break;
++ }
++
++ bfqg = kzalloc(sizeof(*bfqg), GFP_ATOMIC);
++ if (bfqg == NULL)
++ goto cleanup;
++
++ bfq_group_init_entity(bgrp, bfqg);
++ bfqg->my_entity = &bfqg->entity;
++
++ if (leaf == NULL) {
++ leaf = bfqg;
++ prev = leaf;
++ } else {
++ bfq_group_set_parent(prev, bfqg);
++ /*
++ * Build a list of allocated nodes using the bfqd
++ * filed, that is still unused and will be
++ * initialized only after the node will be
++ * connected.
++ */
++ prev->bfqd = bfqg;
++ prev = bfqg;
++ }
++ }
++
++ return leaf;
++
++cleanup:
++ while (leaf != NULL) {
++ prev = leaf;
++ leaf = leaf->bfqd;
++ kfree(prev);
++ }
++
++ return NULL;
++}
++
++/**
++ * bfq_group_chain_link - link an allocated group chain to a cgroup
++ * hierarchy.
++ * @bfqd: the queue descriptor.
++ * @css: the leaf cgroup_subsys_state to start from.
++ * @leaf: the leaf group (to be associated to @cgroup).
++ *
++ * Try to link a chain of groups to a cgroup hierarchy, connecting the
++ * nodes bottom-up, so we can be sure that when we find a cgroup in the
++ * hierarchy that already as a group associated to @bfqd all the nodes
++ * in the path to the root cgroup have one too.
++ *
++ * On locking: the queue lock protects the hierarchy (there is a hierarchy
++ * per device) while the bfqio_cgroup lock protects the list of groups
++ * belonging to the same cgroup.
++ */
++static void bfq_group_chain_link(struct bfq_data *bfqd,
++ struct cgroup_subsys_state *css,
++ struct bfq_group *leaf)
++{
++ struct bfqio_cgroup *bgrp;
++ struct bfq_group *bfqg, *next, *prev = NULL;
++ unsigned long flags;
++
++ assert_spin_locked(bfqd->queue->queue_lock);
++
++ for (; css != NULL && leaf != NULL; css = css->parent) {
++ bgrp = css_to_bfqio(css);
++ next = leaf->bfqd;
++
++ bfqg = bfqio_lookup_group(bgrp, bfqd);
++ BUG_ON(bfqg != NULL);
++
++ spin_lock_irqsave(&bgrp->lock, flags);
++
++ rcu_assign_pointer(leaf->bfqd, bfqd);
++ hlist_add_head_rcu(&leaf->group_node, &bgrp->group_data);
++ hlist_add_head(&leaf->bfqd_node, &bfqd->group_list);
++
++ spin_unlock_irqrestore(&bgrp->lock, flags);
++
++ prev = leaf;
++ leaf = next;
++ }
++
++ BUG_ON(css == NULL && leaf != NULL);
++ if (css != NULL && prev != NULL) {
++ bgrp = css_to_bfqio(css);
++ bfqg = bfqio_lookup_group(bgrp, bfqd);
++ bfq_group_set_parent(prev, bfqg);
++ }
++}
++
++/**
++ * bfq_find_alloc_group - return the group associated to @bfqd in @cgroup.
++ * @bfqd: queue descriptor.
++ * @cgroup: cgroup being searched for.
++ *
++ * Return a group associated to @bfqd in @cgroup, allocating one if
++ * necessary. When a group is returned all the cgroups in the path
++ * to the root have a group associated to @bfqd.
++ *
++ * If the allocation fails, return the root group: this breaks guarantees
++ * but is a safe fallback. If this loss becomes a problem it can be
++ * mitigated using the equivalent weight (given by the product of the
++ * weights of the groups in the path from @group to the root) in the
++ * root scheduler.
++ *
++ * We allocate all the missing nodes in the path from the leaf cgroup
++ * to the root and we connect the nodes only after all the allocations
++ * have been successful.
++ */
++static struct bfq_group *bfq_find_alloc_group(struct bfq_data *bfqd,
++ struct cgroup_subsys_state *css)
++{
++ struct bfqio_cgroup *bgrp = css_to_bfqio(css);
++ struct bfq_group *bfqg;
++
++ bfqg = bfqio_lookup_group(bgrp, bfqd);
++ if (bfqg != NULL)
++ return bfqg;
++
++ bfqg = bfq_group_chain_alloc(bfqd, css);
++ if (bfqg != NULL)
++ bfq_group_chain_link(bfqd, css, bfqg);
++ else
++ bfqg = bfqd->root_group;
++
++ return bfqg;
++}
++
++/**
++ * bfq_bfqq_move - migrate @bfqq to @bfqg.
++ * @bfqd: queue descriptor.
++ * @bfqq: the queue to move.
++ * @entity: @bfqq's entity.
++ * @bfqg: the group to move to.
++ *
++ * Move @bfqq to @bfqg, deactivating it from its old group and reactivating
++ * it on the new one. Avoid putting the entity on the old group idle tree.
++ *
++ * Must be called under the queue lock; the cgroup owning @bfqg must
++ * not disappear (by now this just means that we are called under
++ * rcu_read_lock()).
++ */
++static void bfq_bfqq_move(struct bfq_data *bfqd, struct bfq_queue *bfqq,
++ struct bfq_entity *entity, struct bfq_group *bfqg)
++{
++ int busy, resume;
++
++ busy = bfq_bfqq_busy(bfqq);
++ resume = !RB_EMPTY_ROOT(&bfqq->sort_list);
++
++ BUG_ON(resume && !entity->on_st);
++ BUG_ON(busy && !resume && entity->on_st &&
++ bfqq != bfqd->in_service_queue);
++
++ if (busy) {
++ BUG_ON(atomic_read(&bfqq->ref) < 2);
++
++ if (!resume)
++ bfq_del_bfqq_busy(bfqd, bfqq, 0);
++ else
++ bfq_deactivate_bfqq(bfqd, bfqq, 0);
++ } else if (entity->on_st)
++ bfq_put_idle_entity(bfq_entity_service_tree(entity), entity);
++
++ /*
++ * Here we use a reference to bfqg. We don't need a refcounter
++ * as the cgroup reference will not be dropped, so that its
++ * destroy() callback will not be invoked.
++ */
++ entity->parent = bfqg->my_entity;
++ entity->sched_data = &bfqg->sched_data;
++
++ if (busy && resume)
++ bfq_activate_bfqq(bfqd, bfqq);
++
++ if (bfqd->in_service_queue == NULL && !bfqd->rq_in_driver)
++ bfq_schedule_dispatch(bfqd);
++}
++
++/**
++ * __bfq_bic_change_cgroup - move @bic to @cgroup.
++ * @bfqd: the queue descriptor.
++ * @bic: the bic to move.
++ * @cgroup: the cgroup to move to.
++ *
++ * Move bic to cgroup, assuming that bfqd->queue is locked; the caller
++ * has to make sure that the reference to cgroup is valid across the call.
++ *
++ * NOTE: an alternative approach might have been to store the current
++ * cgroup in bfqq and to get a reference to it, reducing the lookup
++ * time here, at the price of slightly more complex code.
++ */
++static struct bfq_group *__bfq_bic_change_cgroup(struct bfq_data *bfqd,
++ struct bfq_io_cq *bic,
++ struct cgroup_subsys_state *css)
++{
++ struct bfq_queue *async_bfqq = bic_to_bfqq(bic, 0);
++ struct bfq_queue *sync_bfqq = bic_to_bfqq(bic, 1);
++ struct bfq_entity *entity;
++ struct bfq_group *bfqg;
++ struct bfqio_cgroup *bgrp;
++
++ bgrp = css_to_bfqio(css);
++
++ bfqg = bfq_find_alloc_group(bfqd, css);
++ if (async_bfqq != NULL) {
++ entity = &async_bfqq->entity;
++
++ if (entity->sched_data != &bfqg->sched_data) {
++ bic_set_bfqq(bic, NULL, 0);
++ bfq_log_bfqq(bfqd, async_bfqq,
++ "bic_change_group: %p %d",
++ async_bfqq, atomic_read(&async_bfqq->ref));
++ bfq_put_queue(async_bfqq);
++ }
++ }
++
++ if (sync_bfqq != NULL) {
++ entity = &sync_bfqq->entity;
++ if (entity->sched_data != &bfqg->sched_data)
++ bfq_bfqq_move(bfqd, sync_bfqq, entity, bfqg);
++ }
++
++ return bfqg;
++}
++
++/**
++ * bfq_bic_change_cgroup - move @bic to @cgroup.
++ * @bic: the bic being migrated.
++ * @cgroup: the destination cgroup.
++ *
++ * When the task owning @bic is moved to @cgroup, @bic is immediately
++ * moved into its new parent group.
++ */
++static void bfq_bic_change_cgroup(struct bfq_io_cq *bic,
++ struct cgroup_subsys_state *css)
++{
++ struct bfq_data *bfqd;
++ unsigned long uninitialized_var(flags);
++
++ bfqd = bfq_get_bfqd_locked(&(bic->icq.q->elevator->elevator_data),
++ &flags);
++ if (bfqd != NULL) {
++ __bfq_bic_change_cgroup(bfqd, bic, css);
++ bfq_put_bfqd_unlock(bfqd, &flags);
++ }
++}
++
++/**
++ * bfq_bic_update_cgroup - update the cgroup of @bic.
++ * @bic: the @bic to update.
++ *
++ * Make sure that @bic is enqueued in the cgroup of the current task.
++ * We need this in addition to moving bics during the cgroup attach
++ * phase because the task owning @bic could be at its first disk
++ * access or we may end up in the root cgroup as the result of a
++ * memory allocation failure and here we try to move to the right
++ * group.
++ *
++ * Must be called under the queue lock. It is safe to use the returned
++ * value even after the rcu_read_unlock() as the migration/destruction
++ * paths act under the queue lock too. IOW it is impossible to race with
++ * group migration/destruction and end up with an invalid group as:
++ * a) here cgroup has not yet been destroyed, nor its destroy callback
++ * has started execution, as current holds a reference to it,
++ * b) if it is destroyed after rcu_read_unlock() [after current is
++ * migrated to a different cgroup] its attach() callback will have
++ * taken care of removing all the references to the old cgroup data.
++ */
++static struct bfq_group *bfq_bic_update_cgroup(struct bfq_io_cq *bic)
++{
++ struct bfq_data *bfqd = bic_to_bfqd(bic);
++ struct bfq_group *bfqg;
++ struct cgroup_subsys_state *css;
++
++ BUG_ON(bfqd == NULL);
++
++ rcu_read_lock();
++ css = task_css(current, bfqio_cgrp_id);
++ bfqg = __bfq_bic_change_cgroup(bfqd, bic, css);
++ rcu_read_unlock();
++
++ return bfqg;
++}
++
++/**
++ * bfq_flush_idle_tree - deactivate any entity on the idle tree of @st.
++ * @st: the service tree being flushed.
++ */
++static inline void bfq_flush_idle_tree(struct bfq_service_tree *st)
++{
++ struct bfq_entity *entity = st->first_idle;
++
++ for (; entity != NULL; entity = st->first_idle)
++ __bfq_deactivate_entity(entity, 0);
++}
++
++/**
++ * bfq_reparent_leaf_entity - move leaf entity to the root_group.
++ * @bfqd: the device data structure with the root group.
++ * @entity: the entity to move.
++ */
++static inline void bfq_reparent_leaf_entity(struct bfq_data *bfqd,
++ struct bfq_entity *entity)
++{
++ struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
++
++ BUG_ON(bfqq == NULL);
++ bfq_bfqq_move(bfqd, bfqq, entity, bfqd->root_group);
++ return;
++}
++
++/**
++ * bfq_reparent_active_entities - move to the root group all active
++ * entities.
++ * @bfqd: the device data structure with the root group.
++ * @bfqg: the group to move from.
++ * @st: the service tree with the entities.
++ *
++ * Needs queue_lock to be taken and reference to be valid over the call.
++ */
++static inline void bfq_reparent_active_entities(struct bfq_data *bfqd,
++ struct bfq_group *bfqg,
++ struct bfq_service_tree *st)
++{
++ struct rb_root *active = &st->active;
++ struct bfq_entity *entity = NULL;
++
++ if (!RB_EMPTY_ROOT(&st->active))
++ entity = bfq_entity_of(rb_first(active));
++
++ for (; entity != NULL; entity = bfq_entity_of(rb_first(active)))
++ bfq_reparent_leaf_entity(bfqd, entity);
++
++ if (bfqg->sched_data.in_service_entity != NULL)
++ bfq_reparent_leaf_entity(bfqd,
++ bfqg->sched_data.in_service_entity);
++
++ return;
++}
++
++/**
++ * bfq_destroy_group - destroy @bfqg.
++ * @bgrp: the bfqio_cgroup containing @bfqg.
++ * @bfqg: the group being destroyed.
++ *
++ * Destroy @bfqg, making sure that it is not referenced from its parent.
++ */
++static void bfq_destroy_group(struct bfqio_cgroup *bgrp, struct bfq_group *bfqg)
++{
++ struct bfq_data *bfqd;
++ struct bfq_service_tree *st;
++ struct bfq_entity *entity = bfqg->my_entity;
++ unsigned long uninitialized_var(flags);
++ int i;
++
++ hlist_del(&bfqg->group_node);
++
++ /*
++ * Empty all service_trees belonging to this group before
++ * deactivating the group itself.
++ */
++ for (i = 0; i < BFQ_IOPRIO_CLASSES; i++) {
++ st = bfqg->sched_data.service_tree + i;
++
++ /*
++ * The idle tree may still contain bfq_queues belonging
++ * to exited tasks because they never migrated to a different
++ * cgroup from the one being destroyed now. No one else
++ * can access them so it's safe to act without any lock.
++ */
++ bfq_flush_idle_tree(st);
++
++ /*
++ * It may happen that some queues are still active
++ * (busy) upon group destruction (if the corresponding
++ * processes have been forced to terminate). We move
++ * all the leaf entities corresponding to these queues
++ * to the root_group.
++ * Also, it may happen that the group has an entity
++ * in service, which is disconnected from the active
++ * tree: it must be moved, too.
++ * There is no need to put the sync queues, as the
++ * scheduler has taken no reference.
++ */
++ bfqd = bfq_get_bfqd_locked(&bfqg->bfqd, &flags);
++ if (bfqd != NULL) {
++ bfq_reparent_active_entities(bfqd, bfqg, st);
++ bfq_put_bfqd_unlock(bfqd, &flags);
++ }
++ BUG_ON(!RB_EMPTY_ROOT(&st->active));
++ BUG_ON(!RB_EMPTY_ROOT(&st->idle));
++ }
++ BUG_ON(bfqg->sched_data.next_in_service != NULL);
++ BUG_ON(bfqg->sched_data.in_service_entity != NULL);
++
++ /*
++ * We may race with device destruction, take extra care when
++ * dereferencing bfqg->bfqd.
++ */
++ bfqd = bfq_get_bfqd_locked(&bfqg->bfqd, &flags);
++ if (bfqd != NULL) {
++ hlist_del(&bfqg->bfqd_node);
++ __bfq_deactivate_entity(entity, 0);
++ bfq_put_async_queues(bfqd, bfqg);
++ bfq_put_bfqd_unlock(bfqd, &flags);
++ }
++ BUG_ON(entity->tree != NULL);
++
++ /*
++ * No need to defer the kfree() to the end of the RCU grace
++ * period: we are called from the destroy() callback of our
++ * cgroup, so we can be sure that no one is a) still using
++ * this cgroup or b) doing lookups in it.
++ */
++ kfree(bfqg);
++}
++
++static void bfq_end_wr_async(struct bfq_data *bfqd)
++{
++ struct hlist_node *tmp;
++ struct bfq_group *bfqg;
++
++ hlist_for_each_entry_safe(bfqg, tmp, &bfqd->group_list, bfqd_node)
++ bfq_end_wr_async_queues(bfqd, bfqg);
++ bfq_end_wr_async_queues(bfqd, bfqd->root_group);
++}
++
++/**
++ * bfq_disconnect_groups - disconnect @bfqd from all its groups.
++ * @bfqd: the device descriptor being exited.
++ *
++ * When the device exits we just make sure that no lookup can return
++ * the now unused group structures. They will be deallocated on cgroup
++ * destruction.
++ */
++static void bfq_disconnect_groups(struct bfq_data *bfqd)
++{
++ struct hlist_node *tmp;
++ struct bfq_group *bfqg;
++
++ bfq_log(bfqd, "disconnect_groups beginning");
++ hlist_for_each_entry_safe(bfqg, tmp, &bfqd->group_list, bfqd_node) {
++ hlist_del(&bfqg->bfqd_node);
++
++ __bfq_deactivate_entity(bfqg->my_entity, 0);
++
++ /*
++ * Don't remove from the group hash, just set an
++ * invalid key. No lookups can race with the
++ * assignment as bfqd is being destroyed; this
++ * implies also that new elements cannot be added
++ * to the list.
++ */
++ rcu_assign_pointer(bfqg->bfqd, NULL);
++
++ bfq_log(bfqd, "disconnect_groups: put async for group %p",
++ bfqg);
++ bfq_put_async_queues(bfqd, bfqg);
++ }
++}
++
++static inline void bfq_free_root_group(struct bfq_data *bfqd)
++{
++ struct bfqio_cgroup *bgrp = &bfqio_root_cgroup;
++ struct bfq_group *bfqg = bfqd->root_group;
++
++ bfq_put_async_queues(bfqd, bfqg);
++
++ spin_lock_irq(&bgrp->lock);
++ hlist_del_rcu(&bfqg->group_node);
++ spin_unlock_irq(&bgrp->lock);
++
++ /*
++ * No need to synchronize_rcu() here: since the device is gone
++ * there cannot be any read-side access to its root_group.
++ */
++ kfree(bfqg);
++}
++
++static struct bfq_group *bfq_alloc_root_group(struct bfq_data *bfqd, int node)
++{
++ struct bfq_group *bfqg;
++ struct bfqio_cgroup *bgrp;
++ int i;
++
++ bfqg = kzalloc_node(sizeof(*bfqg), GFP_KERNEL, node);
++ if (bfqg == NULL)
++ return NULL;
++
++ bfqg->entity.parent = NULL;
++ for (i = 0; i < BFQ_IOPRIO_CLASSES; i++)
++ bfqg->sched_data.service_tree[i] = BFQ_SERVICE_TREE_INIT;
++
++ bgrp = &bfqio_root_cgroup;
++ spin_lock_irq(&bgrp->lock);
++ rcu_assign_pointer(bfqg->bfqd, bfqd);
++ hlist_add_head_rcu(&bfqg->group_node, &bgrp->group_data);
++ spin_unlock_irq(&bgrp->lock);
++
++ return bfqg;
++}
++
++#define SHOW_FUNCTION(__VAR) \
++static u64 bfqio_cgroup_##__VAR##_read(struct cgroup_subsys_state *css, \
++ struct cftype *cftype) \
++{ \
++ struct bfqio_cgroup *bgrp = css_to_bfqio(css); \
++ u64 ret = -ENODEV; \
++ \
++ mutex_lock(&bfqio_mutex); \
++ if (bfqio_is_removed(bgrp)) \
++ goto out_unlock; \
++ \
++ spin_lock_irq(&bgrp->lock); \
++ ret = bgrp->__VAR; \
++ spin_unlock_irq(&bgrp->lock); \
++ \
++out_unlock: \
++ mutex_unlock(&bfqio_mutex); \
++ return ret; \
++}
++
++SHOW_FUNCTION(weight);
++SHOW_FUNCTION(ioprio);
++SHOW_FUNCTION(ioprio_class);
++#undef SHOW_FUNCTION
++
++#define STORE_FUNCTION(__VAR, __MIN, __MAX) \
++static int bfqio_cgroup_##__VAR##_write(struct cgroup_subsys_state *css,\
++ struct cftype *cftype, \
++ u64 val) \
++{ \
++ struct bfqio_cgroup *bgrp = css_to_bfqio(css); \
++ struct bfq_group *bfqg; \
++ int ret = -EINVAL; \
++ \
++ if (val < (__MIN) || val > (__MAX)) \
++ return ret; \
++ \
++ ret = -ENODEV; \
++ mutex_lock(&bfqio_mutex); \
++ if (bfqio_is_removed(bgrp)) \
++ goto out_unlock; \
++ ret = 0; \
++ \
++ spin_lock_irq(&bgrp->lock); \
++ bgrp->__VAR = (unsigned short)val; \
++ hlist_for_each_entry(bfqg, &bgrp->group_data, group_node) { \
++ /* \
++ * Setting the ioprio_changed flag of the entity \
++ * to 1 with new_##__VAR == ##__VAR would re-set \
++ * the value of the weight to its ioprio mapping. \
++ * Set the flag only if necessary. \
++ */ \
++ if ((unsigned short)val != bfqg->entity.new_##__VAR) { \
++ bfqg->entity.new_##__VAR = (unsigned short)val; \
++ /* \
++ * Make sure that the above new value has been \
++ * stored in bfqg->entity.new_##__VAR before \
++ * setting the ioprio_changed flag. In fact, \
++ * this flag may be read asynchronously (in \
++ * critical sections protected by a different \
++ * lock than that held here), and finding this \
++ * flag set may cause the execution of the code \
++ * for updating parameters whose value may \
++ * depend also on bfqg->entity.new_##__VAR (in \
++ * __bfq_entity_update_weight_prio). \
++ * This barrier makes sure that the new value \
++ * of bfqg->entity.new_##__VAR is correctly \
++ * seen in that code. \
++ */ \
++ smp_wmb(); \
++ bfqg->entity.ioprio_changed = 1; \
++ } \
++ } \
++ spin_unlock_irq(&bgrp->lock); \
++ \
++out_unlock: \
++ mutex_unlock(&bfqio_mutex); \
++ return ret; \
++}
++
++STORE_FUNCTION(weight, BFQ_MIN_WEIGHT, BFQ_MAX_WEIGHT);
++STORE_FUNCTION(ioprio, 0, IOPRIO_BE_NR - 1);
++STORE_FUNCTION(ioprio_class, IOPRIO_CLASS_RT, IOPRIO_CLASS_IDLE);
++#undef STORE_FUNCTION
++
++static struct cftype bfqio_files[] = {
++ {
++ .name = "weight",
++ .read_u64 = bfqio_cgroup_weight_read,
++ .write_u64 = bfqio_cgroup_weight_write,
++ },
++ {
++ .name = "ioprio",
++ .read_u64 = bfqio_cgroup_ioprio_read,
++ .write_u64 = bfqio_cgroup_ioprio_write,
++ },
++ {
++ .name = "ioprio_class",
++ .read_u64 = bfqio_cgroup_ioprio_class_read,
++ .write_u64 = bfqio_cgroup_ioprio_class_write,
++ },
++ { }, /* terminate */
++};
++
++static struct cgroup_subsys_state *bfqio_create(struct cgroup_subsys_state
++ *parent_css)
++{
++ struct bfqio_cgroup *bgrp;
++
++ if (parent_css != NULL) {
++ bgrp = kzalloc(sizeof(*bgrp), GFP_KERNEL);
++ if (bgrp == NULL)
++ return ERR_PTR(-ENOMEM);
++ } else
++ bgrp = &bfqio_root_cgroup;
++
++ spin_lock_init(&bgrp->lock);
++ INIT_HLIST_HEAD(&bgrp->group_data);
++ bgrp->ioprio = BFQ_DEFAULT_GRP_IOPRIO;
++ bgrp->ioprio_class = BFQ_DEFAULT_GRP_CLASS;
++
++ return &bgrp->css;
++}
++
++/*
++ * We cannot support shared io contexts, as we have no means to support
++ * two tasks with the same ioc in two different groups without major rework
++ * of the main bic/bfqq data structures. By now we allow a task to change
++ * its cgroup only if it's the only owner of its ioc; the drawback of this
++ * behavior is that a group containing a task that forked using CLONE_IO
++ * will not be destroyed until the tasks sharing the ioc die.
++ */
++static int bfqio_can_attach(struct cgroup_subsys_state *css,
++ struct cgroup_taskset *tset)
++{
++ struct task_struct *task;
++ struct io_context *ioc;
++ int ret = 0;
++
++ cgroup_taskset_for_each(task, tset) {
++ /*
++ * task_lock() is needed to avoid races with
++ * exit_io_context()
++ */
++ task_lock(task);
++ ioc = task->io_context;
++ if (ioc != NULL && atomic_read(&ioc->nr_tasks) > 1)
++ /*
++ * ioc == NULL means that the task is either too
++ * young or exiting: if it still has no ioc, the
++ * ioc can't be shared; if the task is exiting, the
++ * attach will fail anyway, no matter what we
++ * return here.
++ */
++ ret = -EINVAL;
++ task_unlock(task);
++ if (ret)
++ break;
++ }
++
++ return ret;
++}
++
++static void bfqio_attach(struct cgroup_subsys_state *css,
++ struct cgroup_taskset *tset)
++{
++ struct task_struct *task;
++ struct io_context *ioc;
++ struct io_cq *icq;
++
++ /*
++ * IMPORTANT NOTE: The move of more than one process at a time to a
++ * new group has not yet been tested.
++ */
++ cgroup_taskset_for_each(task, tset) {
++ ioc = get_task_io_context(task, GFP_ATOMIC, NUMA_NO_NODE);
++ if (ioc) {
++ /*
++ * Handle cgroup change here.
++ */
++ rcu_read_lock();
++ hlist_for_each_entry_rcu(icq, &ioc->icq_list, ioc_node)
++ if (!strncmp(
++ icq->q->elevator->type->elevator_name,
++ "bfq", ELV_NAME_MAX))
++ bfq_bic_change_cgroup(icq_to_bic(icq),
++ css);
++ rcu_read_unlock();
++ put_io_context(ioc);
++ }
++ }
++}
++
++static void bfqio_destroy(struct cgroup_subsys_state *css)
++{
++ struct bfqio_cgroup *bgrp = css_to_bfqio(css);
++ struct hlist_node *tmp;
++ struct bfq_group *bfqg;
++
++ /*
++ * Since we are destroying the cgroup, there are no more tasks
++ * referencing it, and all the RCU grace periods that may have
++ * referenced it are ended (as the destruction of the parent
++ * cgroup is RCU-safe); bgrp->group_data will not be accessed by
++ * anything else and we don't need any synchronization.
++ */
++ hlist_for_each_entry_safe(bfqg, tmp, &bgrp->group_data, group_node)
++ bfq_destroy_group(bgrp, bfqg);
++
++ BUG_ON(!hlist_empty(&bgrp->group_data));
++
++ kfree(bgrp);
++}
++
++static int bfqio_css_online(struct cgroup_subsys_state *css)
++{
++ struct bfqio_cgroup *bgrp = css_to_bfqio(css);
++
++ mutex_lock(&bfqio_mutex);
++ bgrp->online = true;
++ mutex_unlock(&bfqio_mutex);
++
++ return 0;
++}
++
++static void bfqio_css_offline(struct cgroup_subsys_state *css)
++{
++ struct bfqio_cgroup *bgrp = css_to_bfqio(css);
++
++ mutex_lock(&bfqio_mutex);
++ bgrp->online = false;
++ mutex_unlock(&bfqio_mutex);
++}
++
++struct cgroup_subsys bfqio_cgrp_subsys = {
++ .css_alloc = bfqio_create,
++ .css_online = bfqio_css_online,
++ .css_offline = bfqio_css_offline,
++ .can_attach = bfqio_can_attach,
++ .attach = bfqio_attach,
++ .css_free = bfqio_destroy,
++ .base_cftypes = bfqio_files,
++};
++#else
++static inline void bfq_init_entity(struct bfq_entity *entity,
++ struct bfq_group *bfqg)
++{
++ entity->weight = entity->new_weight;
++ entity->orig_weight = entity->new_weight;
++ entity->ioprio = entity->new_ioprio;
++ entity->ioprio_class = entity->new_ioprio_class;
++ entity->sched_data = &bfqg->sched_data;
++}
++
++static inline struct bfq_group *
++bfq_bic_update_cgroup(struct bfq_io_cq *bic)
++{
++ struct bfq_data *bfqd = bic_to_bfqd(bic);
++ return bfqd->root_group;
++}
++
++static inline void bfq_bfqq_move(struct bfq_data *bfqd,
++ struct bfq_queue *bfqq,
++ struct bfq_entity *entity,
++ struct bfq_group *bfqg)
++{
++}
++
++static void bfq_end_wr_async(struct bfq_data *bfqd)
++{
++ bfq_end_wr_async_queues(bfqd, bfqd->root_group);
++}
++
++static inline void bfq_disconnect_groups(struct bfq_data *bfqd)
++{
++ bfq_put_async_queues(bfqd, bfqd->root_group);
++}
++
++static inline void bfq_free_root_group(struct bfq_data *bfqd)
++{
++ kfree(bfqd->root_group);
++}
++
++static struct bfq_group *bfq_alloc_root_group(struct bfq_data *bfqd, int node)
++{
++ struct bfq_group *bfqg;
++ int i;
++
++ bfqg = kmalloc_node(sizeof(*bfqg), GFP_KERNEL | __GFP_ZERO, node);
++ if (bfqg == NULL)
++ return NULL;
++
++ for (i = 0; i < BFQ_IOPRIO_CLASSES; i++)
++ bfqg->sched_data.service_tree[i] = BFQ_SERVICE_TREE_INIT;
++
++ return bfqg;
++}
++#endif
+diff --git a/block/bfq-ioc.c b/block/bfq-ioc.c
+new file mode 100644
+index 0000000..7f6b000
+--- /dev/null
++++ b/block/bfq-ioc.c
+@@ -0,0 +1,36 @@
++/*
++ * BFQ: I/O context handling.
++ *
++ * Based on ideas and code from CFQ:
++ * Copyright (C) 2003 Jens Axboe <axboe@kernel.dk>
++ *
++ * Copyright (C) 2008 Fabio Checconi <fabio@gandalf.sssup.it>
++ * Paolo Valente <paolo.valente@unimore.it>
++ *
++ * Copyright (C) 2010 Paolo Valente <paolo.valente@unimore.it>
++ */
++
++/**
++ * icq_to_bic - convert iocontext queue structure to bfq_io_cq.
++ * @icq: the iocontext queue.
++ */
++static inline struct bfq_io_cq *icq_to_bic(struct io_cq *icq)
++{
++ /* bic->icq is the first member, %NULL will convert to %NULL */
++ return container_of(icq, struct bfq_io_cq, icq);
++}
++
++/**
++ * bfq_bic_lookup - search into @ioc a bic associated to @bfqd.
++ * @bfqd: the lookup key.
++ * @ioc: the io_context of the process doing I/O.
++ *
++ * Queue lock must be held.
++ */
++static inline struct bfq_io_cq *bfq_bic_lookup(struct bfq_data *bfqd,
++ struct io_context *ioc)
++{
++ if (ioc)
++ return icq_to_bic(ioc_lookup_icq(ioc, bfqd->queue));
++ return NULL;
++}
+diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
+new file mode 100644
+index 0000000..0a0891b
+--- /dev/null
++++ b/block/bfq-iosched.c
+@@ -0,0 +1,3617 @@
++/*
++ * Budget Fair Queueing (BFQ) disk scheduler.
++ *
++ * Based on ideas and code from CFQ:
++ * Copyright (C) 2003 Jens Axboe <axboe@kernel.dk>
++ *
++ * Copyright (C) 2008 Fabio Checconi <fabio@gandalf.sssup.it>
++ * Paolo Valente <paolo.valente@unimore.it>
++ *
++ * Copyright (C) 2010 Paolo Valente <paolo.valente@unimore.it>
++ *
++ * Licensed under the GPL-2 as detailed in the accompanying COPYING.BFQ
++ * file.
++ *
++ * BFQ is a proportional-share storage-I/O scheduling algorithm based on
++ * the slice-by-slice service scheme of CFQ. But BFQ assigns budgets,
++ * measured in number of sectors, to processes instead of time slices. The
++ * device is not granted to the in-service process for a given time slice,
++ * but until it has exhausted its assigned budget. This change from the time
++ * to the service domain allows BFQ to distribute the device throughput
++ * among processes as desired, without any distortion due to ZBR, workload
++ * fluctuations or other factors. BFQ uses an ad hoc internal scheduler,
++ * called B-WF2Q+, to schedule processes according to their budgets. More
++ * precisely, BFQ schedules queues associated to processes. Thanks to the
++ * accurate policy of B-WF2Q+, BFQ can afford to assign high budgets to
++ * I/O-bound processes issuing sequential requests (to boost the
++ * throughput), and yet guarantee a low latency to interactive and soft
++ * real-time applications.
++ *
++ * BFQ is described in [1], where a reference to the initial, more
++ * theoretical paper on BFQ can also be found. The interested reader can find
++ * in the latter paper full details on the main algorithm, as well as
++ * formulas of the guarantees and formal proofs of all the properties.
++ * With respect to the version of BFQ presented in these papers, this
++ * implementation adds a few more heuristics, such as the one that
++ * guarantees a low latency to soft real-time applications, and a
++ * hierarchical extension based on H-WF2Q+.
++ *
++ * B-WF2Q+ is based on WF2Q+, that is described in [2], together with
++ * H-WF2Q+, while the augmented tree used to implement B-WF2Q+ with O(log N)
++ * complexity derives from the one introduced with EEVDF in [3].
++ *
++ * [1] P. Valente and M. Andreolini, ``Improving Application Responsiveness
++ * with the BFQ Disk I/O Scheduler'',
++ * Proceedings of the 5th Annual International Systems and Storage
++ * Conference (SYSTOR '12), June 2012.
++ *
++ * http://algogroup.unimo.it/people/paolo/disk_sched/bf1-v1-suite-results.pdf
++ *
++ * [2] Jon C.R. Bennett and H. Zhang, ``Hierarchical Packet Fair Queueing
++ * Algorithms,'' IEEE/ACM Transactions on Networking, 5(5):675-689,
++ * Oct 1997.
++ *
++ * http://www.cs.cmu.edu/~hzhang/papers/TON-97-Oct.ps.gz
++ *
++ * [3] I. Stoica and H. Abdel-Wahab, ``Earliest Eligible Virtual Deadline
++ * First: A Flexible and Accurate Mechanism for Proportional Share
++ * Resource Allocation,'' technical report.
++ *
++ * http://www.cs.berkeley.edu/~istoica/papers/eevdf-tr-95.pdf
++ */
++#include <linux/module.h>
++#include <linux/slab.h>
++#include <linux/blkdev.h>
++#include <linux/cgroup.h>
++#include <linux/elevator.h>
++#include <linux/jiffies.h>
++#include <linux/rbtree.h>
++#include <linux/ioprio.h>
++#include "bfq.h"
++#include "blk.h"
++
++/* Max number of dispatches in one round of service. */
++static const int bfq_quantum = 4;
++
++/* Expiration time of sync (0) and async (1) requests, in jiffies. */
++static const int bfq_fifo_expire[2] = { HZ / 4, HZ / 8 };
++
++/* Maximum backwards seek, in KiB. */
++static const int bfq_back_max = 16 * 1024;
++
++/* Penalty of a backwards seek, in number of sectors. */
++static const int bfq_back_penalty = 2;
++
++/* Idling period duration, in jiffies. */
++static int bfq_slice_idle = HZ / 125;
++
++/* Default maximum budget values, in sectors and number of requests. */
++static const int bfq_default_max_budget = 16 * 1024;
++static const int bfq_max_budget_async_rq = 4;
++
++/*
++ * Async to sync throughput distribution is controlled as follows:
++ * when an async request is served, the entity is charged the number
++ * of sectors of the request, multiplied by the factor below
++ */
++static const int bfq_async_charge_factor = 10;
++
++/* Default timeout values, in jiffies, approximating CFQ defaults. */
++static const int bfq_timeout_sync = HZ / 8;
++static int bfq_timeout_async = HZ / 25;
++
++struct kmem_cache *bfq_pool;
++
++/* Below this threshold (in ms), we consider thinktime immediate. */
++#define BFQ_MIN_TT 2
++
++/* hw_tag detection: parallel requests threshold and min samples needed. */
++#define BFQ_HW_QUEUE_THRESHOLD 4
++#define BFQ_HW_QUEUE_SAMPLES 32
++
++#define BFQQ_SEEK_THR (sector_t)(8 * 1024)
++#define BFQQ_SEEKY(bfqq) ((bfqq)->seek_mean > BFQQ_SEEK_THR)
++
++/* Min samples used for peak rate estimation (for autotuning). */
++#define BFQ_PEAK_RATE_SAMPLES 32
++
++/* Shift used for peak rate fixed precision calculations. */
++#define BFQ_RATE_SHIFT 16
++
++/*
++ * By default, BFQ computes the duration of the weight raising for
++ * interactive applications automatically, using the following formula:
++ * duration = (R / r) * T, where r is the peak rate of the device, and
++ * R and T are two reference parameters.
++ * In particular, R is the peak rate of the reference device (see below),
++ * and T is a reference time: given the systems that are likely to be
++ * installed on the reference device according to its speed class, T is
++ * about the maximum time needed, under BFQ and while reading two files in
++ * parallel, to load typical large applications on these systems.
++ * In practice, the slower/faster the device at hand is, the more/less it
++ * takes to load applications with respect to the reference device.
++ * Accordingly, the longer/shorter BFQ grants weight raising to interactive
++ * applications.
++ *
++ * BFQ uses four different reference pairs (R, T), depending on:
++ * . whether the device is rotational or non-rotational;
++ * . whether the device is slow, such as old or portable HDDs, as well as
++ * SD cards, or fast, such as newer HDDs and SSDs.
++ *
++ * The device's speed class is dynamically (re)detected in
++ * bfq_update_peak_rate() every time the estimated peak rate is updated.
++ *
++ * In the following definitions, R_slow[0]/R_fast[0] and T_slow[0]/T_fast[0]
++ * are the reference values for a slow/fast rotational device, whereas
++ * R_slow[1]/R_fast[1] and T_slow[1]/T_fast[1] are the reference values for
++ * a slow/fast non-rotational device. Finally, device_speed_thresh are the
++ * thresholds used to switch between speed classes.
++ * Both the reference peak rates and the thresholds are measured in
++ * sectors/usec, left-shifted by BFQ_RATE_SHIFT.
++ */
++static int R_slow[2] = {1536, 10752};
++static int R_fast[2] = {17415, 34791};
++/*
++ * To improve readability, a conversion function is used to initialize the
++ * following arrays, which entails that they can be initialized only in a
++ * function.
++ */
++static int T_slow[2];
++static int T_fast[2];
++static int device_speed_thresh[2];
++
++#define BFQ_SERVICE_TREE_INIT ((struct bfq_service_tree) \
++ { RB_ROOT, RB_ROOT, NULL, NULL, 0, 0 })
++
++#define RQ_BIC(rq) ((struct bfq_io_cq *) (rq)->elv.priv[0])
++#define RQ_BFQQ(rq) ((rq)->elv.priv[1])
++
++static inline void bfq_schedule_dispatch(struct bfq_data *bfqd);
++
++#include "bfq-ioc.c"
++#include "bfq-sched.c"
++#include "bfq-cgroup.c"
++
++#define bfq_class_idle(bfqq) ((bfqq)->entity.ioprio_class ==\
++ IOPRIO_CLASS_IDLE)
++#define bfq_class_rt(bfqq) ((bfqq)->entity.ioprio_class ==\
++ IOPRIO_CLASS_RT)
++
++#define bfq_sample_valid(samples) ((samples) > 80)
++
++/*
++ * We regard a request as SYNC if either it is a read or has the SYNC bit
++ * set (in which case it could also be a direct WRITE).
++ */
++static inline int bfq_bio_sync(struct bio *bio)
++{
++ if (bio_data_dir(bio) == READ || (bio->bi_rw & REQ_SYNC))
++ return 1;
++
++ return 0;
++}
++
++/*
++ * Scheduler run of queue, if there are requests pending and no one in the
++ * driver that will restart queueing.
++ */
++static inline void bfq_schedule_dispatch(struct bfq_data *bfqd)
++{
++ if (bfqd->queued != 0) {
++ bfq_log(bfqd, "schedule dispatch");
++ kblockd_schedule_work(&bfqd->unplug_work);
++ }
++}
++
++/*
++ * Lifted from AS - choose which of rq1 and rq2 is best served now.
++ * We choose the request that is closest to the head right now. Distance
++ * behind the head is penalized and only allowed to a certain extent.
++ */
++static struct request *bfq_choose_req(struct bfq_data *bfqd,
++ struct request *rq1,
++ struct request *rq2,
++ sector_t last)
++{
++ sector_t s1, s2, d1 = 0, d2 = 0;
++ unsigned long back_max;
++#define BFQ_RQ1_WRAP 0x01 /* request 1 wraps */
++#define BFQ_RQ2_WRAP 0x02 /* request 2 wraps */
++ unsigned wrap = 0; /* bit mask: requests behind the disk head? */
++
++ if (rq1 == NULL || rq1 == rq2)
++ return rq2;
++ if (rq2 == NULL)
++ return rq1;
++
++ if (rq_is_sync(rq1) && !rq_is_sync(rq2))
++ return rq1;
++ else if (rq_is_sync(rq2) && !rq_is_sync(rq1))
++ return rq2;
++ if ((rq1->cmd_flags & REQ_META) && !(rq2->cmd_flags & REQ_META))
++ return rq1;
++ else if ((rq2->cmd_flags & REQ_META) && !(rq1->cmd_flags & REQ_META))
++ return rq2;
++
++ s1 = blk_rq_pos(rq1);
++ s2 = blk_rq_pos(rq2);
++
++ /*
++ * By definition, 1KiB is 2 sectors.
++ */
++ back_max = bfqd->bfq_back_max * 2;
++
++ /*
++ * Strict one way elevator _except_ in the case where we allow
++ * short backward seeks which are biased as twice the cost of a
++ * similar forward seek.
++ */
++ if (s1 >= last)
++ d1 = s1 - last;
++ else if (s1 + back_max >= last)
++ d1 = (last - s1) * bfqd->bfq_back_penalty;
++ else
++ wrap |= BFQ_RQ1_WRAP;
++
++ if (s2 >= last)
++ d2 = s2 - last;
++ else if (s2 + back_max >= last)
++ d2 = (last - s2) * bfqd->bfq_back_penalty;
++ else
++ wrap |= BFQ_RQ2_WRAP;
++
++ /* Found required data */
++
++ /*
++ * By doing switch() on the bit mask "wrap" we avoid having to
++ * check two variables for all permutations: --> faster!
++ */
++ switch (wrap) {
++ case 0: /* common case for CFQ: rq1 and rq2 not wrapped */
++ if (d1 < d2)
++ return rq1;
++ else if (d2 < d1)
++ return rq2;
++ else {
++ if (s1 >= s2)
++ return rq1;
++ else
++ return rq2;
++ }
++
++ case BFQ_RQ2_WRAP:
++ return rq1;
++ case BFQ_RQ1_WRAP:
++ return rq2;
++ case (BFQ_RQ1_WRAP|BFQ_RQ2_WRAP): /* both rqs wrapped */
++ default:
++ /*
++ * Since both rqs are wrapped,
++ * start with the one that's further behind head
++ * (--> only *one* back seek required),
++ * since back seek takes more time than forward.
++ */
++ if (s1 <= s2)
++ return rq1;
++ else
++ return rq2;
++ }
++}
++
++static struct bfq_queue *
++bfq_rq_pos_tree_lookup(struct bfq_data *bfqd, struct rb_root *root,
++ sector_t sector, struct rb_node **ret_parent,
++ struct rb_node ***rb_link)
++{
++ struct rb_node **p, *parent;
++ struct bfq_queue *bfqq = NULL;
++
++ parent = NULL;
++ p = &root->rb_node;
++ while (*p) {
++ struct rb_node **n;
++
++ parent = *p;
++ bfqq = rb_entry(parent, struct bfq_queue, pos_node);
++
++ /*
++ * Sort strictly based on sector. Smallest to the left,
++ * largest to the right.
++ */
++ if (sector > blk_rq_pos(bfqq->next_rq))
++ n = &(*p)->rb_right;
++ else if (sector < blk_rq_pos(bfqq->next_rq))
++ n = &(*p)->rb_left;
++ else
++ break;
++ p = n;
++ bfqq = NULL;
++ }
++
++ *ret_parent = parent;
++ if (rb_link)
++ *rb_link = p;
++
++ bfq_log(bfqd, "rq_pos_tree_lookup %llu: returning %d",
++ (long long unsigned)sector,
++ bfqq != NULL ? bfqq->pid : 0);
++
++ return bfqq;
++}
++
++static void bfq_rq_pos_tree_add(struct bfq_data *bfqd, struct bfq_queue *bfqq)
++{
++ struct rb_node **p, *parent;
++ struct bfq_queue *__bfqq;
++
++ if (bfqq->pos_root != NULL) {
++ rb_erase(&bfqq->pos_node, bfqq->pos_root);
++ bfqq->pos_root = NULL;
++ }
++
++ if (bfq_class_idle(bfqq))
++ return;
++ if (!bfqq->next_rq)
++ return;
++
++ bfqq->pos_root = &bfqd->rq_pos_tree;
++ __bfqq = bfq_rq_pos_tree_lookup(bfqd, bfqq->pos_root,
++ blk_rq_pos(bfqq->next_rq), &parent, &p);
++ if (__bfqq == NULL) {
++ rb_link_node(&bfqq->pos_node, parent, p);
++ rb_insert_color(&bfqq->pos_node, bfqq->pos_root);
++ } else
++ bfqq->pos_root = NULL;
++}
++
++/*
++ * Tell whether there are active queues or groups with differentiated weights.
++ */
++static inline bool bfq_differentiated_weights(struct bfq_data *bfqd)
++{
++ BUG_ON(!bfqd->hw_tag);
++ /*
++ * For weights to differ, at least one of the trees must contain
++ * at least two nodes.
++ */
++ return (!RB_EMPTY_ROOT(&bfqd->queue_weights_tree) &&
++ (bfqd->queue_weights_tree.rb_node->rb_left ||
++ bfqd->queue_weights_tree.rb_node->rb_right)
++#ifdef CONFIG_CGROUP_BFQIO
++ ) ||
++ (!RB_EMPTY_ROOT(&bfqd->group_weights_tree) &&
++ (bfqd->group_weights_tree.rb_node->rb_left ||
++ bfqd->group_weights_tree.rb_node->rb_right)
++#endif
++ );
++}
++
++/*
++ * If the weight-counter tree passed as input contains no counter for
++ * the weight of the input entity, then add that counter; otherwise just
++ * increment the existing counter.
++ *
++ * Note that weight-counter trees contain few nodes in mostly symmetric
++ * scenarios. For example, if all queues have the same weight, then the
++ * weight-counter tree for the queues may contain at most one node.
++ * This holds even if low_latency is on, because weight-raised queues
++ * are not inserted in the tree.
++ * In most scenarios, the rate at which nodes are created/destroyed
++ * should be low too.
++ */
++static void bfq_weights_tree_add(struct bfq_data *bfqd,
++ struct bfq_entity *entity,
++ struct rb_root *root)
++{
++ struct rb_node **new = &(root->rb_node), *parent = NULL;
++
++ /*
++ * Do not insert if:
++ * - the device does not support queueing;
++ * - the entity is already associated with a counter, which happens if:
++ * 1) the entity is associated with a queue, 2) a request arrival
++ * has caused the queue to become both non-weight-raised, and hence
++ * change its weight, and backlogged; in this respect, each
++ * of the two events causes an invocation of this function,
++ * 3) this is the invocation of this function caused by the second
++ * event. This second invocation is actually useless, and we handle
++ * this fact by exiting immediately. More efficient or clearer
++ * solutions might possibly be adopted.
++ */
++ if (!bfqd->hw_tag || entity->weight_counter)
++ return;
++
++ while (*new) {
++ struct bfq_weight_counter *__counter = container_of(*new,
++ struct bfq_weight_counter,
++ weights_node);
++ parent = *new;
++
++ if (entity->weight == __counter->weight) {
++ entity->weight_counter = __counter;
++ goto inc_counter;
++ }
++ if (entity->weight < __counter->weight)
++ new = &((*new)->rb_left);
++ else
++ new = &((*new)->rb_right);
++ }
++
++	entity->weight_counter = kzalloc(sizeof(struct bfq_weight_counter),
++					 GFP_ATOMIC);
++	if (entity->weight_counter == NULL)
++		return;
++	entity->weight_counter->weight = entity->weight;
++ rb_link_node(&entity->weight_counter->weights_node, parent, new);
++ rb_insert_color(&entity->weight_counter->weights_node, root);
++
++inc_counter:
++ entity->weight_counter->num_active++;
++}
++
++/*
++ * Decrement the weight counter associated with the entity, and, if the
++ * counter reaches 0, remove the counter from the tree.
++ * See the comments to the function bfq_weights_tree_add() for considerations
++ * about overhead.
++ */
++static void bfq_weights_tree_remove(struct bfq_data *bfqd,
++ struct bfq_entity *entity,
++ struct rb_root *root)
++{
++ /*
++ * Check whether the entity is actually associated with a counter.
++ * In fact, the device may not be considered NCQ-capable for a while,
++ * which implies that no insertion in the weight trees is performed,
++ * after which the device may start to be deemed NCQ-capable, and hence
++ * this function may start to be invoked. This may cause the function
++ * to be invoked for entities that are not associated with any counter.
++ */
++ if (!entity->weight_counter)
++ return;
++
++ BUG_ON(RB_EMPTY_ROOT(root));
++ BUG_ON(entity->weight_counter->weight != entity->weight);
++
++ BUG_ON(!entity->weight_counter->num_active);
++ entity->weight_counter->num_active--;
++ if (entity->weight_counter->num_active > 0)
++ goto reset_entity_pointer;
++
++ rb_erase(&entity->weight_counter->weights_node, root);
++ kfree(entity->weight_counter);
++
++reset_entity_pointer:
++ entity->weight_counter = NULL;
++}
++
++static struct request *bfq_find_next_rq(struct bfq_data *bfqd,
++ struct bfq_queue *bfqq,
++ struct request *last)
++{
++ struct rb_node *rbnext = rb_next(&last->rb_node);
++ struct rb_node *rbprev = rb_prev(&last->rb_node);
++ struct request *next = NULL, *prev = NULL;
++
++ BUG_ON(RB_EMPTY_NODE(&last->rb_node));
++
++ if (rbprev != NULL)
++ prev = rb_entry_rq(rbprev);
++
++ if (rbnext != NULL)
++ next = rb_entry_rq(rbnext);
++ else {
++ rbnext = rb_first(&bfqq->sort_list);
++ if (rbnext && rbnext != &last->rb_node)
++ next = rb_entry_rq(rbnext);
++ }
++
++ return bfq_choose_req(bfqd, next, prev, blk_rq_pos(last));
++}
++
++/* see the definition of bfq_async_charge_factor for details */
++static inline unsigned long bfq_serv_to_charge(struct request *rq,
++ struct bfq_queue *bfqq)
++{
++ return blk_rq_sectors(rq) *
++ (1 + ((!bfq_bfqq_sync(bfqq)) * (bfqq->wr_coeff == 1) *
++ bfq_async_charge_factor));
++}
++
++/**
++ * bfq_updated_next_req - update the queue after a new next_rq selection.
++ * @bfqd: the device data the queue belongs to.
++ * @bfqq: the queue to update.
++ *
++ * If the first request of a queue changes we make sure that the queue
++ * has enough budget to serve at least its first request (if the
++ * request has grown). We do this because, if the queue does not have
++ * enough budget for its first request, it has to go through two dispatch
++ * rounds to actually get it dispatched.
++ */
++static void bfq_updated_next_req(struct bfq_data *bfqd,
++ struct bfq_queue *bfqq)
++{
++ struct bfq_entity *entity = &bfqq->entity;
++ struct bfq_service_tree *st = bfq_entity_service_tree(entity);
++ struct request *next_rq = bfqq->next_rq;
++ unsigned long new_budget;
++
++ if (next_rq == NULL)
++ return;
++
++ if (bfqq == bfqd->in_service_queue)
++ /*
++ * In order not to break guarantees, budgets cannot be
++ * changed after an entity has been selected.
++ */
++ return;
++
++ BUG_ON(entity->tree != &st->active);
++ BUG_ON(entity == entity->sched_data->in_service_entity);
++
++ new_budget = max_t(unsigned long, bfqq->max_budget,
++ bfq_serv_to_charge(next_rq, bfqq));
++ if (entity->budget != new_budget) {
++ entity->budget = new_budget;
++ bfq_log_bfqq(bfqd, bfqq, "updated next rq: new budget %lu",
++ new_budget);
++ bfq_activate_bfqq(bfqd, bfqq);
++ }
++}
++
++static inline unsigned int bfq_wr_duration(struct bfq_data *bfqd)
++{
++ u64 dur;
++
++ if (bfqd->bfq_wr_max_time > 0)
++ return bfqd->bfq_wr_max_time;
++
++ dur = bfqd->RT_prod;
++ do_div(dur, bfqd->peak_rate);
++
++ return dur;
++}
++
++static void bfq_add_request(struct request *rq)
++{
++ struct bfq_queue *bfqq = RQ_BFQQ(rq);
++ struct bfq_entity *entity = &bfqq->entity;
++ struct bfq_data *bfqd = bfqq->bfqd;
++ struct request *next_rq, *prev;
++ unsigned long old_wr_coeff = bfqq->wr_coeff;
++ int idle_for_long_time = 0;
++
++ bfq_log_bfqq(bfqd, bfqq, "add_request %d", rq_is_sync(rq));
++ bfqq->queued[rq_is_sync(rq)]++;
++ bfqd->queued++;
++
++ elv_rb_add(&bfqq->sort_list, rq);
++
++ /*
++ * Check if this request is a better next-serve candidate.
++ */
++ prev = bfqq->next_rq;
++ next_rq = bfq_choose_req(bfqd, bfqq->next_rq, rq, bfqd->last_position);
++ BUG_ON(next_rq == NULL);
++ bfqq->next_rq = next_rq;
++
++ /*
++ * Adjust priority tree position, if next_rq changes.
++ */
++ if (prev != bfqq->next_rq)
++ bfq_rq_pos_tree_add(bfqd, bfqq);
++
++ if (!bfq_bfqq_busy(bfqq)) {
++ int soft_rt = bfqd->bfq_wr_max_softrt_rate > 0 &&
++ time_is_before_jiffies(bfqq->soft_rt_next_start);
++ idle_for_long_time = time_is_before_jiffies(
++ bfqq->budget_timeout +
++ bfqd->bfq_wr_min_idle_time);
++ entity->budget = max_t(unsigned long, bfqq->max_budget,
++ bfq_serv_to_charge(next_rq, bfqq));
++
++ if (!bfq_bfqq_IO_bound(bfqq)) {
++ if (time_before(jiffies,
++ RQ_BIC(rq)->ttime.last_end_request +
++ bfqd->bfq_slice_idle)) {
++ bfqq->requests_within_timer++;
++ if (bfqq->requests_within_timer >=
++ bfqd->bfq_requests_within_timer)
++ bfq_mark_bfqq_IO_bound(bfqq);
++ } else
++ bfqq->requests_within_timer = 0;
++ }
++
++ if (!bfqd->low_latency)
++ goto add_bfqq_busy;
++
++ /*
++ * If the queue is not being boosted and has been idle
++ * for enough time, start a weight-raising period
++ */
++ if (old_wr_coeff == 1 && (idle_for_long_time || soft_rt)) {
++ bfqq->wr_coeff = bfqd->bfq_wr_coeff;
++ if (idle_for_long_time)
++ bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
++ else
++ bfqq->wr_cur_max_time =
++ bfqd->bfq_wr_rt_max_time;
++ bfq_log_bfqq(bfqd, bfqq,
++ "wrais starting at %lu, rais_max_time %u",
++ jiffies,
++ jiffies_to_msecs(bfqq->wr_cur_max_time));
++ } else if (old_wr_coeff > 1) {
++ if (idle_for_long_time)
++ bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
++ else if (bfqq->wr_cur_max_time ==
++ bfqd->bfq_wr_rt_max_time &&
++ !soft_rt) {
++ bfqq->wr_coeff = 1;
++ bfq_log_bfqq(bfqd, bfqq,
++ "wrais ending at %lu, rais_max_time %u",
++ jiffies,
++ jiffies_to_msecs(bfqq->
++ wr_cur_max_time));
++ } else if (time_before(
++ bfqq->last_wr_start_finish +
++ bfqq->wr_cur_max_time,
++ jiffies +
++ bfqd->bfq_wr_rt_max_time) &&
++ soft_rt) {
++ /*
++ *
++ * The remaining weight-raising time is lower
++ * than bfqd->bfq_wr_rt_max_time, which
++ * means that the application is enjoying
++ * weight raising either because deemed soft-
++ * rt in the near past, or because deemed
++ * interactive long ago. In both cases,
++ * resetting now the current remaining weight-
++ * raising time for the application to the
++ * weight-raising duration for soft rt
++ * applications would not cause any latency
++ * increase for the application (as the new
++ * duration would be higher than the remaining
++ * time).
++ *
++ * In addition, the application is now meeting
++ * the requirements for being deemed soft rt.
++ * In the end we can correctly and safely
++ * (re)charge the weight-raising duration for
++ * the application with the weight-raising
++ * duration for soft rt applications.
++ *
++ * In particular, doing this recharge now, i.e.,
++ * before the weight-raising period for the
++ * application finishes, reduces the probability
++ * of the following negative scenario:
++ * 1) the weight of a soft rt application is
++ * raised at startup (as for any newly
++ * created application),
++ * 2) since the application is not interactive,
++ * at a certain time weight-raising is
++ * stopped for the application,
++ * 3) at that time the application happens to
++ * still have pending requests, and hence
++ * is destined to not have a chance to be
++ * deemed soft rt before these requests are
++ * completed (see the comments to the
++ * function bfq_bfqq_softrt_next_start()
++ * for details on soft rt detection),
++ * 4) these pending requests experience a high
++ * latency because the application is not
++ * weight-raised while they are pending.
++ */
++ bfqq->last_wr_start_finish = jiffies;
++ bfqq->wr_cur_max_time =
++ bfqd->bfq_wr_rt_max_time;
++ }
++ }
++ if (old_wr_coeff != bfqq->wr_coeff)
++ entity->ioprio_changed = 1;
++add_bfqq_busy:
++ bfqq->last_idle_bklogged = jiffies;
++ bfqq->service_from_backlogged = 0;
++ bfq_clear_bfqq_softrt_update(bfqq);
++ bfq_add_bfqq_busy(bfqd, bfqq);
++ } else {
++ if (bfqd->low_latency && old_wr_coeff == 1 && !rq_is_sync(rq) &&
++ time_is_before_jiffies(
++ bfqq->last_wr_start_finish +
++ bfqd->bfq_wr_min_inter_arr_async)) {
++ bfqq->wr_coeff = bfqd->bfq_wr_coeff;
++ bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
++
++ bfqd->wr_busy_queues++;
++ entity->ioprio_changed = 1;
++ bfq_log_bfqq(bfqd, bfqq,
++ "non-idle wrais starting at %lu, rais_max_time %u",
++ jiffies,
++ jiffies_to_msecs(bfqq->wr_cur_max_time));
++ }
++ if (prev != bfqq->next_rq)
++ bfq_updated_next_req(bfqd, bfqq);
++ }
++
++ if (bfqd->low_latency &&
++ (old_wr_coeff == 1 || bfqq->wr_coeff == 1 ||
++ idle_for_long_time))
++ bfqq->last_wr_start_finish = jiffies;
++}
++
++static struct request *bfq_find_rq_fmerge(struct bfq_data *bfqd,
++ struct bio *bio)
++{
++ struct task_struct *tsk = current;
++ struct bfq_io_cq *bic;
++ struct bfq_queue *bfqq;
++
++ bic = bfq_bic_lookup(bfqd, tsk->io_context);
++ if (bic == NULL)
++ return NULL;
++
++ bfqq = bic_to_bfqq(bic, bfq_bio_sync(bio));
++ if (bfqq != NULL)
++ return elv_rb_find(&bfqq->sort_list, bio_end_sector(bio));
++
++ return NULL;
++}
++
++static void bfq_activate_request(struct request_queue *q, struct request *rq)
++{
++ struct bfq_data *bfqd = q->elevator->elevator_data;
++
++ bfqd->rq_in_driver++;
++ bfqd->last_position = blk_rq_pos(rq) + blk_rq_sectors(rq);
++ bfq_log(bfqd, "activate_request: new bfqd->last_position %llu",
++ (long long unsigned)bfqd->last_position);
++}
++
++static inline void bfq_deactivate_request(struct request_queue *q,
++ struct request *rq)
++{
++ struct bfq_data *bfqd = q->elevator->elevator_data;
++
++ BUG_ON(bfqd->rq_in_driver == 0);
++ bfqd->rq_in_driver--;
++}
++
++static void bfq_remove_request(struct request *rq)
++{
++ struct bfq_queue *bfqq = RQ_BFQQ(rq);
++ struct bfq_data *bfqd = bfqq->bfqd;
++ const int sync = rq_is_sync(rq);
++
++ if (bfqq->next_rq == rq) {
++ bfqq->next_rq = bfq_find_next_rq(bfqd, bfqq, rq);
++ bfq_updated_next_req(bfqd, bfqq);
++ }
++
++ list_del_init(&rq->queuelist);
++ BUG_ON(bfqq->queued[sync] == 0);
++ bfqq->queued[sync]--;
++ bfqd->queued--;
++ elv_rb_del(&bfqq->sort_list, rq);
++
++ if (RB_EMPTY_ROOT(&bfqq->sort_list)) {
++ if (bfq_bfqq_busy(bfqq) && bfqq != bfqd->in_service_queue)
++ bfq_del_bfqq_busy(bfqd, bfqq, 1);
++ /*
++ * Remove queue from request-position tree as it is empty.
++ */
++ if (bfqq->pos_root != NULL) {
++ rb_erase(&bfqq->pos_node, bfqq->pos_root);
++ bfqq->pos_root = NULL;
++ }
++ }
++
++ if (rq->cmd_flags & REQ_META) {
++ BUG_ON(bfqq->meta_pending == 0);
++ bfqq->meta_pending--;
++ }
++}
++
++static int bfq_merge(struct request_queue *q, struct request **req,
++ struct bio *bio)
++{
++ struct bfq_data *bfqd = q->elevator->elevator_data;
++ struct request *__rq;
++
++ __rq = bfq_find_rq_fmerge(bfqd, bio);
++ if (__rq != NULL && elv_rq_merge_ok(__rq, bio)) {
++ *req = __rq;
++ return ELEVATOR_FRONT_MERGE;
++ }
++
++ return ELEVATOR_NO_MERGE;
++}
++
++static void bfq_merged_request(struct request_queue *q, struct request *req,
++ int type)
++{
++ if (type == ELEVATOR_FRONT_MERGE &&
++ rb_prev(&req->rb_node) &&
++ blk_rq_pos(req) <
++ blk_rq_pos(container_of(rb_prev(&req->rb_node),
++ struct request, rb_node))) {
++ struct bfq_queue *bfqq = RQ_BFQQ(req);
++ struct bfq_data *bfqd = bfqq->bfqd;
++ struct request *prev, *next_rq;
++
++ /* Reposition request in its sort_list */
++ elv_rb_del(&bfqq->sort_list, req);
++ elv_rb_add(&bfqq->sort_list, req);
++ /* Choose next request to be served for bfqq */
++ prev = bfqq->next_rq;
++ next_rq = bfq_choose_req(bfqd, bfqq->next_rq, req,
++ bfqd->last_position);
++ BUG_ON(next_rq == NULL);
++ bfqq->next_rq = next_rq;
++ /*
++ * If next_rq changes, update both the queue's budget to
++ * fit the new request and the queue's position in its
++ * rq_pos_tree.
++ */
++ if (prev != bfqq->next_rq) {
++ bfq_updated_next_req(bfqd, bfqq);
++ bfq_rq_pos_tree_add(bfqd, bfqq);
++ }
++ }
++}
++
++static void bfq_merged_requests(struct request_queue *q, struct request *rq,
++ struct request *next)
++{
++ struct bfq_queue *bfqq = RQ_BFQQ(rq);
++
++ /*
++ * Reposition in fifo if next is older than rq.
++ */
++ if (!list_empty(&rq->queuelist) && !list_empty(&next->queuelist) &&
++ time_before(next->fifo_time, rq->fifo_time)) {
++ list_move(&rq->queuelist, &next->queuelist);
++ rq->fifo_time = next->fifo_time;
++ }
++
++ if (bfqq->next_rq == next)
++ bfqq->next_rq = rq;
++
++ bfq_remove_request(next);
++}
++
++/* Must be called with bfqq != NULL */
++static inline void bfq_bfqq_end_wr(struct bfq_queue *bfqq)
++{
++ BUG_ON(bfqq == NULL);
++ if (bfq_bfqq_busy(bfqq))
++ bfqq->bfqd->wr_busy_queues--;
++ bfqq->wr_coeff = 1;
++ bfqq->wr_cur_max_time = 0;
++ /* Trigger a weight change on the next activation of the queue */
++ bfqq->entity.ioprio_changed = 1;
++}
++
++static void bfq_end_wr_async_queues(struct bfq_data *bfqd,
++ struct bfq_group *bfqg)
++{
++ int i, j;
++
++ for (i = 0; i < 2; i++)
++ for (j = 0; j < IOPRIO_BE_NR; j++)
++ if (bfqg->async_bfqq[i][j] != NULL)
++ bfq_bfqq_end_wr(bfqg->async_bfqq[i][j]);
++ if (bfqg->async_idle_bfqq != NULL)
++ bfq_bfqq_end_wr(bfqg->async_idle_bfqq);
++}
++
++static void bfq_end_wr(struct bfq_data *bfqd)
++{
++ struct bfq_queue *bfqq;
++
++ spin_lock_irq(bfqd->queue->queue_lock);
++
++ list_for_each_entry(bfqq, &bfqd->active_list, bfqq_list)
++ bfq_bfqq_end_wr(bfqq);
++ list_for_each_entry(bfqq, &bfqd->idle_list, bfqq_list)
++ bfq_bfqq_end_wr(bfqq);
++ bfq_end_wr_async(bfqd);
++
++ spin_unlock_irq(bfqd->queue->queue_lock);
++}
++
++static int bfq_allow_merge(struct request_queue *q, struct request *rq,
++ struct bio *bio)
++{
++ struct bfq_data *bfqd = q->elevator->elevator_data;
++ struct bfq_io_cq *bic;
++ struct bfq_queue *bfqq;
++
++ /*
++ * Disallow merge of a sync bio into an async request.
++ */
++ if (bfq_bio_sync(bio) && !rq_is_sync(rq))
++ return 0;
++
++ /*
++ * Lookup the bfqq that this bio will be queued with. Allow
++ * merge only if rq is queued there.
++ * Queue lock is held here.
++ */
++ bic = bfq_bic_lookup(bfqd, current->io_context);
++ if (bic == NULL)
++ return 0;
++
++ bfqq = bic_to_bfqq(bic, bfq_bio_sync(bio));
++ return bfqq == RQ_BFQQ(rq);
++}
++
++static void __bfq_set_in_service_queue(struct bfq_data *bfqd,
++ struct bfq_queue *bfqq)
++{
++ if (bfqq != NULL) {
++ bfq_mark_bfqq_must_alloc(bfqq);
++ bfq_mark_bfqq_budget_new(bfqq);
++ bfq_clear_bfqq_fifo_expire(bfqq);
++
++ bfqd->budgets_assigned = (bfqd->budgets_assigned*7 + 256) / 8;
++
++ bfq_log_bfqq(bfqd, bfqq,
++ "set_in_service_queue, cur-budget = %lu",
++ bfqq->entity.budget);
++ }
++
++ bfqd->in_service_queue = bfqq;
++}
++
++/*
++ * Get and set a new queue for service.
++ */
++static struct bfq_queue *bfq_set_in_service_queue(struct bfq_data *bfqd,
++ struct bfq_queue *bfqq)
++{
++ if (!bfqq)
++ bfqq = bfq_get_next_queue(bfqd);
++ else
++ bfq_get_next_queue_forced(bfqd, bfqq);
++
++ __bfq_set_in_service_queue(bfqd, bfqq);
++ return bfqq;
++}
++
++static inline sector_t bfq_dist_from_last(struct bfq_data *bfqd,
++ struct request *rq)
++{
++ if (blk_rq_pos(rq) >= bfqd->last_position)
++ return blk_rq_pos(rq) - bfqd->last_position;
++ else
++ return bfqd->last_position - blk_rq_pos(rq);
++}
++
++/*
++ * Return true if rq is close enough to bfqd->last_position, i.e., lies
++ * within BFQQ_SEEK_THR sectors of it.
++ */
++static inline int bfq_rq_close(struct bfq_data *bfqd, struct request *rq)
++{
++ return bfq_dist_from_last(bfqd, rq) <= BFQQ_SEEK_THR;
++}
++
++static struct bfq_queue *bfqq_close(struct bfq_data *bfqd)
++{
++ struct rb_root *root = &bfqd->rq_pos_tree;
++ struct rb_node *parent, *node;
++ struct bfq_queue *__bfqq;
++ sector_t sector = bfqd->last_position;
++
++ if (RB_EMPTY_ROOT(root))
++ return NULL;
++
++ /*
++ * First, if we find a request starting at the end of the last
++ * request, choose it.
++ */
++ __bfqq = bfq_rq_pos_tree_lookup(bfqd, root, sector, &parent, NULL);
++ if (__bfqq != NULL)
++ return __bfqq;
++
++ /*
++ * If the exact sector wasn't found, the parent of the NULL leaf
++ * will contain the closest sector (rq_pos_tree sorted by
++ * next_request position).
++ */
++ __bfqq = rb_entry(parent, struct bfq_queue, pos_node);
++ if (bfq_rq_close(bfqd, __bfqq->next_rq))
++ return __bfqq;
++
++ if (blk_rq_pos(__bfqq->next_rq) < sector)
++ node = rb_next(&__bfqq->pos_node);
++ else
++ node = rb_prev(&__bfqq->pos_node);
++ if (node == NULL)
++ return NULL;
++
++ __bfqq = rb_entry(node, struct bfq_queue, pos_node);
++ if (bfq_rq_close(bfqd, __bfqq->next_rq))
++ return __bfqq;
++
++ return NULL;
++}
++
++/*
++ * bfqd - obvious
++ * cur_bfqq - passed in so that we don't decide that the current queue
++ * is closely cooperating with itself.
++ *
++ * We are assuming that cur_bfqq has dispatched at least one request,
++ * and that bfqd->last_position reflects a position on the disk associated
++ * with the I/O issued by cur_bfqq.
++ */
++static struct bfq_queue *bfq_close_cooperator(struct bfq_data *bfqd,
++ struct bfq_queue *cur_bfqq)
++{
++ struct bfq_queue *bfqq;
++
++ if (bfq_class_idle(cur_bfqq))
++ return NULL;
++ if (!bfq_bfqq_sync(cur_bfqq))
++ return NULL;
++ if (BFQQ_SEEKY(cur_bfqq))
++ return NULL;
++
++ /* If device has only one backlogged bfq_queue, don't search. */
++ if (bfqd->busy_queues == 1)
++ return NULL;
++
++ /*
++ * We should notice if some of the queues are cooperating, e.g.
++ * working closely on the same area of the disk. In that case,
++ * we can group them together and don't waste time idling.
++ */
++ bfqq = bfqq_close(bfqd);
++ if (bfqq == NULL || bfqq == cur_bfqq)
++ return NULL;
++
++ /*
++ * Do not merge queues from different bfq_groups.
++ */
++ if (bfqq->entity.parent != cur_bfqq->entity.parent)
++ return NULL;
++
++ /*
++ * It only makes sense to merge sync queues.
++ */
++ if (!bfq_bfqq_sync(bfqq))
++ return NULL;
++ if (BFQQ_SEEKY(bfqq))
++ return NULL;
++
++ /*
++ * Do not merge queues of different priority classes.
++ */
++ if (bfq_class_rt(bfqq) != bfq_class_rt(cur_bfqq))
++ return NULL;
++
++ return bfqq;
++}
++
++/*
++ * If enough samples have been computed, return the current max budget
++ * stored in bfqd, which is dynamically updated according to the
++ * estimated disk peak rate; otherwise return the default max budget
++ */
++static inline unsigned long bfq_max_budget(struct bfq_data *bfqd)
++{
++ if (bfqd->budgets_assigned < 194)
++ return bfq_default_max_budget;
++ else
++ return bfqd->bfq_max_budget;
++}
++
++/*
++ * Return min budget, which is a fraction of the current or default
++ * max budget (trying with 1/32)
++ */
++static inline unsigned long bfq_min_budget(struct bfq_data *bfqd)
++{
++ if (bfqd->budgets_assigned < 194)
++ return bfq_default_max_budget / 32;
++ else
++ return bfqd->bfq_max_budget / 32;
++}
++
++static void bfq_arm_slice_timer(struct bfq_data *bfqd)
++{
++ struct bfq_queue *bfqq = bfqd->in_service_queue;
++ struct bfq_io_cq *bic;
++ unsigned long sl;
++
++ BUG_ON(!RB_EMPTY_ROOT(&bfqq->sort_list));
++
++ /* Processes have exited, don't wait. */
++ bic = bfqd->in_service_bic;
++ if (bic == NULL || atomic_read(&bic->icq.ioc->active_ref) == 0)
++ return;
++
++ bfq_mark_bfqq_wait_request(bfqq);
++
++ /*
++ * We don't want to idle for seeks, but we do want to allow
++ * fair distribution of slice time for a process doing back-to-back
++	 * seeks. So allow a little bit of time for it to submit a new rq.
++ *
++ * To prevent processes with (partly) seeky workloads from
++ * being too ill-treated, grant them a small fraction of the
++ * assigned budget before reducing the waiting time to
++ * BFQ_MIN_TT. This happened to help reduce latency.
++ */
++ sl = bfqd->bfq_slice_idle;
++ /*
++ * Unless the queue is being weight-raised, grant only minimum idle
++ * time if the queue either has been seeky for long enough or has
++ * already proved to be constantly seeky.
++ */
++ if (bfq_sample_valid(bfqq->seek_samples) &&
++ ((BFQQ_SEEKY(bfqq) && bfqq->entity.service >
++ bfq_max_budget(bfqq->bfqd) / 8) ||
++ bfq_bfqq_constantly_seeky(bfqq)) && bfqq->wr_coeff == 1)
++ sl = min(sl, msecs_to_jiffies(BFQ_MIN_TT));
++ else if (bfqq->wr_coeff > 1)
++ sl = sl * 3;
++ bfqd->last_idling_start = ktime_get();
++ mod_timer(&bfqd->idle_slice_timer, jiffies + sl);
++ bfq_log(bfqd, "arm idle: %u/%u ms",
++ jiffies_to_msecs(sl), jiffies_to_msecs(bfqd->bfq_slice_idle));
++}
++
++/*
++ * Set the maximum time for the in-service queue to consume its
++ * budget. This prevents seeky processes from lowering the disk
++ * throughput (always guaranteed with a time slice scheme as in CFQ).
++ */
++static void bfq_set_budget_timeout(struct bfq_data *bfqd)
++{
++ struct bfq_queue *bfqq = bfqd->in_service_queue;
++ unsigned int timeout_coeff;
++ if (bfqq->wr_cur_max_time == bfqd->bfq_wr_rt_max_time)
++ timeout_coeff = 1;
++ else
++ timeout_coeff = bfqq->entity.weight / bfqq->entity.orig_weight;
++
++ bfqd->last_budget_start = ktime_get();
++
++ bfq_clear_bfqq_budget_new(bfqq);
++ bfqq->budget_timeout = jiffies +
++ bfqd->bfq_timeout[bfq_bfqq_sync(bfqq)] * timeout_coeff;
++
++ bfq_log_bfqq(bfqd, bfqq, "set budget_timeout %u",
++ jiffies_to_msecs(bfqd->bfq_timeout[bfq_bfqq_sync(bfqq)] *
++ timeout_coeff));
++}
++
++/*
++ * Move request from internal lists to the request queue dispatch list.
++ */
++static void bfq_dispatch_insert(struct request_queue *q, struct request *rq)
++{
++ struct bfq_data *bfqd = q->elevator->elevator_data;
++ struct bfq_queue *bfqq = RQ_BFQQ(rq);
++
++ /*
++ * For consistency, the next instruction should have been executed
++ * after removing the request from the queue and dispatching it.
++ * We execute instead this instruction before bfq_remove_request()
++ * (and hence introduce a temporary inconsistency), for efficiency.
++ * In fact, in a forced_dispatch, this prevents two counters related
++	 * to bfqq->dispatched from being uselessly decremented if bfqq
++	 * is not in service, and then incremented again after
++ * incrementing bfqq->dispatched.
++ */
++ bfqq->dispatched++;
++ bfq_remove_request(rq);
++ elv_dispatch_sort(q, rq);
++
++ if (bfq_bfqq_sync(bfqq))
++ bfqd->sync_flight++;
++}
++
++/*
++ * Return expired entry, or NULL to just start from scratch in rbtree.
++ */
++static struct request *bfq_check_fifo(struct bfq_queue *bfqq)
++{
++ struct request *rq = NULL;
++
++ if (bfq_bfqq_fifo_expire(bfqq))
++ return NULL;
++
++ bfq_mark_bfqq_fifo_expire(bfqq);
++
++ if (list_empty(&bfqq->fifo))
++ return NULL;
++
++ rq = rq_entry_fifo(bfqq->fifo.next);
++
++ if (time_before(jiffies, rq->fifo_time))
++ return NULL;
++
++ return rq;
++}
++
++/*
++ * Must be called with the queue_lock held.
++ */
++static int bfqq_process_refs(struct bfq_queue *bfqq)
++{
++ int process_refs, io_refs;
++
++ io_refs = bfqq->allocated[READ] + bfqq->allocated[WRITE];
++ process_refs = atomic_read(&bfqq->ref) - io_refs - bfqq->entity.on_st;
++ BUG_ON(process_refs < 0);
++ return process_refs;
++}
++
++static void bfq_setup_merge(struct bfq_queue *bfqq, struct bfq_queue *new_bfqq)
++{
++ int process_refs, new_process_refs;
++ struct bfq_queue *__bfqq;
++
++ /*
++ * If there are no process references on the new_bfqq, then it is
++ * unsafe to follow the ->new_bfqq chain as other bfqq's in the chain
++ * may have dropped their last reference (not just their last process
++ * reference).
++ */
++ if (!bfqq_process_refs(new_bfqq))
++ return;
++
++ /* Avoid a circular list and skip interim queue merges. */
++ while ((__bfqq = new_bfqq->new_bfqq)) {
++ if (__bfqq == bfqq)
++ return;
++ new_bfqq = __bfqq;
++ }
++
++ process_refs = bfqq_process_refs(bfqq);
++ new_process_refs = bfqq_process_refs(new_bfqq);
++ /*
++ * If the process for the bfqq has gone away, there is no
++ * sense in merging the queues.
++ */
++ if (process_refs == 0 || new_process_refs == 0)
++ return;
++
++ /*
++ * Merge in the direction of the lesser amount of work.
++ */
++ if (new_process_refs >= process_refs) {
++ bfqq->new_bfqq = new_bfqq;
++ atomic_add(process_refs, &new_bfqq->ref);
++ } else {
++ new_bfqq->new_bfqq = bfqq;
++ atomic_add(new_process_refs, &bfqq->ref);
++ }
++ bfq_log_bfqq(bfqq->bfqd, bfqq, "scheduling merge with queue %d",
++ new_bfqq->pid);
++}
++
++static inline unsigned long bfq_bfqq_budget_left(struct bfq_queue *bfqq)
++{
++ struct bfq_entity *entity = &bfqq->entity;
++ return entity->budget - entity->service;
++}
++
++static void __bfq_bfqq_expire(struct bfq_data *bfqd, struct bfq_queue *bfqq)
++{
++ BUG_ON(bfqq != bfqd->in_service_queue);
++
++ __bfq_bfqd_reset_in_service(bfqd);
++
++ /*
++ * If this bfqq is shared between multiple processes, check
++ * to make sure that those processes are still issuing I/Os
++ * within the mean seek distance. If not, it may be time to
++ * break the queues apart again.
++ */
++ if (bfq_bfqq_coop(bfqq) && BFQQ_SEEKY(bfqq))
++ bfq_mark_bfqq_split_coop(bfqq);
++
++ if (RB_EMPTY_ROOT(&bfqq->sort_list)) {
++ /*
++ * Overloading budget_timeout field to store the time
++ * at which the queue remains with no backlog; used by
++ * the weight-raising mechanism.
++ */
++ bfqq->budget_timeout = jiffies;
++ bfq_del_bfqq_busy(bfqd, bfqq, 1);
++ } else {
++ bfq_activate_bfqq(bfqd, bfqq);
++ /*
++ * Resort priority tree of potential close cooperators.
++ */
++ bfq_rq_pos_tree_add(bfqd, bfqq);
++ }
++}
++
++/**
++ * __bfq_bfqq_recalc_budget - try to adapt the budget to the @bfqq behavior.
++ * @bfqd: device data.
++ * @bfqq: queue to update.
++ * @reason: reason for expiration.
++ *
++ * Handle the feedback on @bfqq budget. See the body for detailed
++ * comments.
++ */
++static void __bfq_bfqq_recalc_budget(struct bfq_data *bfqd,
++ struct bfq_queue *bfqq,
++ enum bfqq_expiration reason)
++{
++ struct request *next_rq;
++ unsigned long budget, min_budget;
++
++ budget = bfqq->max_budget;
++ min_budget = bfq_min_budget(bfqd);
++
++ BUG_ON(bfqq != bfqd->in_service_queue);
++
++ bfq_log_bfqq(bfqd, bfqq, "recalc_budg: last budg %lu, budg left %lu",
++ bfqq->entity.budget, bfq_bfqq_budget_left(bfqq));
++ bfq_log_bfqq(bfqd, bfqq, "recalc_budg: last max_budg %lu, min budg %lu",
++ budget, bfq_min_budget(bfqd));
++ bfq_log_bfqq(bfqd, bfqq, "recalc_budg: sync %d, seeky %d",
++ bfq_bfqq_sync(bfqq), BFQQ_SEEKY(bfqd->in_service_queue));
++
++ if (bfq_bfqq_sync(bfqq)) {
++ switch (reason) {
++ /*
++ * Caveat: in all the following cases we trade latency
++ * for throughput.
++ */
++ case BFQ_BFQQ_TOO_IDLE:
++ /*
++ * This is the only case where we may reduce
++ * the budget: if there is no request of the
++ * process still waiting for completion, then
++ * we assume (tentatively) that the timer has
++ * expired because the batch of requests of
++ * the process could have been served with a
++	 * smaller budget. Hence, betting that the
++	 * process will behave in the same way when it
++ * becomes backlogged again, we reduce its
++ * next budget. As long as we guess right,
++ * this budget cut reduces the latency
++ * experienced by the process.
++ *
++ * However, if there are still outstanding
++ * requests, then the process may have not yet
++ * issued its next request just because it is
++ * still waiting for the completion of some of
++ * the still outstanding ones. So in this
++ * subcase we do not reduce its budget, on the
++ * contrary we increase it to possibly boost
++ * the throughput, as discussed in the
++ * comments to the BUDGET_TIMEOUT case.
++ */
++ if (bfqq->dispatched > 0) /* still outstanding reqs */
++ budget = min(budget * 2, bfqd->bfq_max_budget);
++ else {
++ if (budget > 5 * min_budget)
++ budget -= 4 * min_budget;
++ else
++ budget = min_budget;
++ }
++ break;
++ case BFQ_BFQQ_BUDGET_TIMEOUT:
++ /*
++ * We double the budget here because: 1) it
++ * gives the chance to boost the throughput if
++ * this is not a seeky process (which may have
++ * bumped into this timeout because of, e.g.,
++ * ZBR), 2) together with charge_full_budget
++ * it helps give seeky processes higher
++ * timestamps, and hence be served less
++ * frequently.
++ */
++ budget = min(budget * 2, bfqd->bfq_max_budget);
++ break;
++ case BFQ_BFQQ_BUDGET_EXHAUSTED:
++ /*
++ * The process still has backlog, and did not
++ * let either the budget timeout or the disk
++ * idling timeout expire. Hence it is not
++ * seeky, has a short thinktime and may be
++ * happy with a higher budget too. So
++ * definitely increase the budget of this good
++ * candidate to boost the disk throughput.
++ */
++ budget = min(budget * 4, bfqd->bfq_max_budget);
++ break;
++ case BFQ_BFQQ_NO_MORE_REQUESTS:
++ /*
++ * Leave the budget unchanged.
++ */
++ default:
++ return;
++ }
++ } else /* async queue */
++ /* async queues always get the maximum possible budget
++ * (their ability to dispatch is limited by
++ * @bfqd->bfq_max_budget_async_rq).
++ */
++ budget = bfqd->bfq_max_budget;
++
++ bfqq->max_budget = budget;
++
++ if (bfqd->budgets_assigned >= 194 && bfqd->bfq_user_max_budget == 0 &&
++ bfqq->max_budget > bfqd->bfq_max_budget)
++ bfqq->max_budget = bfqd->bfq_max_budget;
++
++ /*
++ * Make sure that we have enough budget for the next request.
++ * Since the finish time of the bfqq must be kept in sync with
++ * the budget, be sure to call __bfq_bfqq_expire() after the
++ * update.
++ */
++ next_rq = bfqq->next_rq;
++ if (next_rq != NULL)
++ bfqq->entity.budget = max_t(unsigned long, bfqq->max_budget,
++ bfq_serv_to_charge(next_rq, bfqq));
++ else
++ bfqq->entity.budget = bfqq->max_budget;
++
++ bfq_log_bfqq(bfqd, bfqq, "head sect: %u, new budget %lu",
++ next_rq != NULL ? blk_rq_sectors(next_rq) : 0,
++ bfqq->entity.budget);
++}
++
++static unsigned long bfq_calc_max_budget(u64 peak_rate, u64 timeout)
++{
++ unsigned long max_budget;
++
++ /*
++ * The max_budget calculated when autotuning is equal to the
++ * amount of sectors transferred in timeout_sync at the
++ * estimated peak rate.
++ */
++ max_budget = (unsigned long)(peak_rate * 1000 *
++ timeout >> BFQ_RATE_SHIFT);
++
++ return max_budget;
++}
++
++/*
++ * In addition to updating the peak rate, checks whether the process
++ * is "slow", and returns 1 if so. This slow flag is used, in addition
++ * to the budget timeout, to reduce the amount of service provided to
++ * seeky processes, and hence reduce their chances to lower the
++ * throughput. See the code for more details.
++ */
++static int bfq_update_peak_rate(struct bfq_data *bfqd, struct bfq_queue *bfqq,
++ int compensate, enum bfqq_expiration reason)
++{
++ u64 bw, usecs, expected, timeout;
++ ktime_t delta;
++ int update = 0;
++
++ if (!bfq_bfqq_sync(bfqq) || bfq_bfqq_budget_new(bfqq))
++ return 0;
++
++ if (compensate)
++ delta = bfqd->last_idling_start;
++ else
++ delta = ktime_get();
++ delta = ktime_sub(delta, bfqd->last_budget_start);
++ usecs = ktime_to_us(delta);
++
++ /* Don't trust short/unrealistic values. */
++ if (usecs < 100 || usecs >= LONG_MAX)
++ return 0;
++
++ /*
++ * Calculate the bandwidth for the last slice. We use a 64 bit
++ * value to store the peak rate, in sectors per usec in fixed
++ * point math. We do so to have enough precision in the estimate
++ * and to avoid overflows.
++ */
++ bw = (u64)bfqq->entity.service << BFQ_RATE_SHIFT;
++ do_div(bw, (unsigned long)usecs);
++
++ timeout = jiffies_to_msecs(bfqd->bfq_timeout[BLK_RW_SYNC]);
++
++ /*
++ * Use only long (> 20ms) intervals to filter out spikes for
++ * the peak rate estimation.
++ */
++ if (usecs > 20000) {
++ if (bw > bfqd->peak_rate ||
++ (!BFQQ_SEEKY(bfqq) &&
++ reason == BFQ_BFQQ_BUDGET_TIMEOUT)) {
++ bfq_log(bfqd, "measured bw =%llu", bw);
++ /*
++ * To smooth oscillations use a low-pass filter with
++ * alpha=7/8, i.e.,
++ * new_rate = (7/8) * old_rate + (1/8) * bw
++ */
++ do_div(bw, 8);
++ if (bw == 0)
++ return 0;
++ bfqd->peak_rate *= 7;
++ do_div(bfqd->peak_rate, 8);
++ bfqd->peak_rate += bw;
++ update = 1;
++ bfq_log(bfqd, "new peak_rate=%llu", bfqd->peak_rate);
++ }
++
++ update |= bfqd->peak_rate_samples == BFQ_PEAK_RATE_SAMPLES - 1;
++
++ if (bfqd->peak_rate_samples < BFQ_PEAK_RATE_SAMPLES)
++ bfqd->peak_rate_samples++;
++
++ if (bfqd->peak_rate_samples == BFQ_PEAK_RATE_SAMPLES &&
++ update) {
++ int dev_type = blk_queue_nonrot(bfqd->queue);
++ if (bfqd->bfq_user_max_budget == 0) {
++ bfqd->bfq_max_budget =
++ bfq_calc_max_budget(bfqd->peak_rate,
++ timeout);
++ bfq_log(bfqd, "new max_budget=%lu",
++ bfqd->bfq_max_budget);
++ }
++ if (bfqd->device_speed == BFQ_BFQD_FAST &&
++ bfqd->peak_rate < device_speed_thresh[dev_type]) {
++ bfqd->device_speed = BFQ_BFQD_SLOW;
++ bfqd->RT_prod = R_slow[dev_type] *
++ T_slow[dev_type];
++ } else if (bfqd->device_speed == BFQ_BFQD_SLOW &&
++ bfqd->peak_rate > device_speed_thresh[dev_type]) {
++ bfqd->device_speed = BFQ_BFQD_FAST;
++ bfqd->RT_prod = R_fast[dev_type] *
++ T_fast[dev_type];
++ }
++ }
++ }
++
++ /*
++ * If the process has been served for too short a time
++ * interval to let its possible sequential accesses prevail over
++ * the initial seek time needed to move the disk head to the
++ * first sector it requested, then give the process a chance
++ * and for the moment return false.
++ */
++ if (bfqq->entity.budget <= bfq_max_budget(bfqd) / 8)
++ return 0;
++
++ /*
++ * A process is considered ``slow'' (i.e., seeky, so that we
++ * cannot treat it fairly in the service domain, as it would
++ * slow down the other processes too much) if, when a slice
++ * ends for whatever reason, it has received service at a
++ * rate that would not be high enough to complete the budget
++ * before the budget timeout expiration.
++ */
++ expected = bw * 1000 * timeout >> BFQ_RATE_SHIFT;
++
++ /*
++ * Caveat: processes doing IO in the slower disk zones will
++ * tend to be slow(er) even if not seeky. And the estimated
++ * peak rate will actually be an average over the disk
++ * surface. Hence, to not be too harsh with unlucky processes,
++ * we keep a budget/3 margin of safety before declaring a
++ * process slow.
++ */
++ return expected > (4 * bfqq->entity.budget) / 3;
++}
++
++/*
++ * To be deemed as soft real-time, an application must meet two
++ * requirements. First, the application must not require an average
++ * bandwidth higher than the approximate bandwidth required to playback or
++ * record a compressed high-definition video.
++ * The next function is invoked on the completion of the last request of a
++ * batch, to compute the next-start time instant, soft_rt_next_start, such
++ * that, if the next request of the application does not arrive before
++ * soft_rt_next_start, then the above requirement on the bandwidth is met.
++ *
++ * The second requirement is that the request pattern of the application is
++ * isochronous, i.e., that, after issuing a request or a batch of requests,
++ * the application stops issuing new requests until all its pending requests
++ * have been completed. After that, the application may issue a new batch,
++ * and so on.
++ * For this reason the next function is invoked to compute
++ * soft_rt_next_start only for applications that meet this requirement,
++ * whereas soft_rt_next_start is set to infinity for applications that do
++ * not.
++ *
++ * Unfortunately, even a greedy application may happen to behave in an
++ * isochronous way if the CPU load is high. In fact, the application may
++ * stop issuing requests while the CPUs are busy serving other processes,
++ * then restart, then stop again for a while, and so on. In addition, if
++ * the disk achieves a low enough throughput with the request pattern
++ * issued by the application (e.g., because the request pattern is random
++ * and/or the device is slow), then the application may meet the above
++ * bandwidth requirement too. To prevent such a greedy application from
++ * being deemed as soft real-time, a further rule is used in the computation of
++ * soft_rt_next_start: soft_rt_next_start must be higher than the current
++ * time plus the maximum time for which the arrival of a request is waited
++ * for when a sync queue becomes idle, namely bfqd->bfq_slice_idle.
++ * This filters out greedy applications, as the latter issue instead their
++ * next request as soon as possible after the last one has been completed
++ * (in contrast, when a batch of requests is completed, a soft real-time
++ * application spends some time processing data).
++ *
++ * Unfortunately, the last filter may easily generate false positives if
++ * only bfqd->bfq_slice_idle is used as a reference time interval and one
++ * or both the following cases occur:
++ * 1) HZ is so low that the duration of a jiffy is comparable to or higher
++ * than bfqd->bfq_slice_idle. This happens, e.g., on slow devices with
++ * HZ=100.
++ * 2) jiffies, instead of increasing at a constant rate, may stop increasing
++ * for a while, then suddenly 'jump' by several units to recover the lost
++ * increments. This seems to happen, e.g., inside virtual machines.
++ * To address this issue, we do not use as a reference time interval just
++ * bfqd->bfq_slice_idle, but bfqd->bfq_slice_idle plus a few jiffies. In
++ * particular we add the minimum number of jiffies for which the filter
++ * seems to be quite precise also in embedded systems and KVM/QEMU virtual
++ * machines.
++ */
++static inline unsigned long bfq_bfqq_softrt_next_start(struct bfq_data *bfqd,
++ struct bfq_queue *bfqq)
++{
++ return max(bfqq->last_idle_bklogged +
++ HZ * bfqq->service_from_backlogged /
++ bfqd->bfq_wr_max_softrt_rate,
++ jiffies + bfqq->bfqd->bfq_slice_idle + 4);
++}
++
++/*
++ * Return the largest-possible time instant such that, for as long as possible,
++ * the current time will be lower than this time instant according to the macro
++ * time_is_before_jiffies().
++ */
++static inline unsigned long bfq_infinity_from_now(unsigned long now)
++{
++ return now + ULONG_MAX / 2;
++}
++
++/**
++ * bfq_bfqq_expire - expire a queue.
++ * @bfqd: device owning the queue.
++ * @bfqq: the queue to expire.
++ * @compensate: if true, compensate for the time spent idling.
++ * @reason: the reason causing the expiration.
++ *
++ * If the process associated to the queue is slow (i.e., seeky), or in
++ * case of budget timeout, or, finally, if it is async, we
++ * artificially charge it an entire budget (independently of the
++ * actual service it received). As a consequence, the queue will get
++ * higher timestamps than the correct ones upon reactivation, and
++ * hence it will be rescheduled as if it had received more service
++ * than what it actually received. In the end, this class of processes
++ * will receive less service in proportion to how slowly they consume
++ * their budgets (and hence how seriously they tend to lower the
++ * throughput).
++ *
++ * In contrast, when a queue expires because it has been idling for
++ * too long or because it exhausted its budget, we do not touch the
++ * amount of service it has received. Hence when the queue will be
++ * reactivated and its timestamps updated, the latter will be in sync
++ * with the actual service received by the queue until expiration.
++ *
++ * Charging a full budget to the first type of queues and the exact
++ * service to the others has the effect of using the WF2Q+ policy to
++ * schedule the former on a timeslice basis, without violating the
++ * service domain guarantees of the latter.
++ */
++static void bfq_bfqq_expire(struct bfq_data *bfqd,
++ struct bfq_queue *bfqq,
++ int compensate,
++ enum bfqq_expiration reason)
++{
++ int slow;
++ BUG_ON(bfqq != bfqd->in_service_queue);
++
++ /* Update disk peak rate for autotuning and check whether the
++ * process is slow (see bfq_update_peak_rate).
++ */
++ slow = bfq_update_peak_rate(bfqd, bfqq, compensate, reason);
++
++ /*
++ * As explained above, 'punish' slow (i.e., seeky), timed-out
++ * and async queues, to favor sequential sync workloads.
++ *
++ * Processes doing I/O in the slower disk zones will tend to be
++ * slow(er) even if not seeky. Hence, since the estimated peak
++ * rate is actually an average over the disk surface, these
++ * processes may timeout just for bad luck. To avoid punishing
++ * them we do not charge a full budget to a process that
++ * succeeded in consuming at least 2/3 of its budget.
++ */
++ if (slow || (reason == BFQ_BFQQ_BUDGET_TIMEOUT &&
++ bfq_bfqq_budget_left(bfqq) >= bfqq->entity.budget / 3))
++ bfq_bfqq_charge_full_budget(bfqq);
++
++ bfqq->service_from_backlogged += bfqq->entity.service;
++
++ if (BFQQ_SEEKY(bfqq) && reason == BFQ_BFQQ_BUDGET_TIMEOUT &&
++ !bfq_bfqq_constantly_seeky(bfqq)) {
++ bfq_mark_bfqq_constantly_seeky(bfqq);
++ if (!blk_queue_nonrot(bfqd->queue))
++ bfqd->const_seeky_busy_in_flight_queues++;
++ }
++
++ if (reason == BFQ_BFQQ_TOO_IDLE &&
++ bfqq->entity.service <= 2 * bfqq->entity.budget / 10)
++ bfq_clear_bfqq_IO_bound(bfqq);
++
++ if (bfqd->low_latency && bfqq->wr_coeff == 1)
++ bfqq->last_wr_start_finish = jiffies;
++
++ if (bfqd->low_latency && bfqd->bfq_wr_max_softrt_rate > 0 &&
++ RB_EMPTY_ROOT(&bfqq->sort_list)) {
++ /*
++ * If we get here, and there are no outstanding requests,
++ * then the request pattern is isochronous (see the comments
++ * to the function bfq_bfqq_softrt_next_start()). Hence we
++ * can compute soft_rt_next_start. If, instead, the queue
++ * still has outstanding requests, then we have to wait
++ * for the completion of all the outstanding requests to
++ * discover whether the request pattern is actually
++ * isochronous.
++ */
++ if (bfqq->dispatched == 0)
++ bfqq->soft_rt_next_start =
++ bfq_bfqq_softrt_next_start(bfqd, bfqq);
++ else {
++ /*
++ * The application is still waiting for the
++ * completion of one or more requests:
++ * prevent it from possibly being incorrectly
++ * deemed as soft real-time by setting its
++ * soft_rt_next_start to infinity. In fact,
++ * without this assignment, the application
++ * would be incorrectly deemed as soft
++ * real-time if:
++ * 1) it issued a new request before the
++ * completion of all its in-flight
++ * requests, and
++ * 2) at that time, its soft_rt_next_start
++ * happened to be in the past.
++ */
++ bfqq->soft_rt_next_start =
++ bfq_infinity_from_now(jiffies);
++ /*
++ * Schedule an update of soft_rt_next_start to when
++ * the task may be discovered to be isochronous.
++ */
++ bfq_mark_bfqq_softrt_update(bfqq);
++ }
++ }
++
++ bfq_log_bfqq(bfqd, bfqq,
++ "expire (%d, slow %d, num_disp %d, idle_win %d)", reason,
++ slow, bfqq->dispatched, bfq_bfqq_idle_window(bfqq));
++
++ /*
++ * Increase, decrease or leave budget unchanged according to
++ * reason.
++ */
++ __bfq_bfqq_recalc_budget(bfqd, bfqq, reason);
++ __bfq_bfqq_expire(bfqd, bfqq);
++}
++
++/*
++ * Budget timeout is not implemented through a dedicated timer, but
++ * just checked on request arrivals and completions, as well as on
++ * idle timer expirations.
++ */
++static int bfq_bfqq_budget_timeout(struct bfq_queue *bfqq)
++{
++ if (bfq_bfqq_budget_new(bfqq) ||
++ time_before(jiffies, bfqq->budget_timeout))
++ return 0;
++ return 1;
++}
++
++/*
++ * If we expire a queue that is waiting for the arrival of a new
++ * request, we may prevent the fictitious timestamp back-shifting that
++ * allows the guarantees of the queue to be preserved (see [1] for
++ * this tricky aspect). Hence we return true only if this condition
++ * does not hold, or if the queue is slow enough to deserve only to be
++ * kicked off for preserving a high throughput.
++ */
++static inline int bfq_may_expire_for_budg_timeout(struct bfq_queue *bfqq)
++{
++ bfq_log_bfqq(bfqq->bfqd, bfqq,
++ "may_budget_timeout: wait_request %d left %d timeout %d",
++ bfq_bfqq_wait_request(bfqq),
++ bfq_bfqq_budget_left(bfqq) >= bfqq->entity.budget / 3,
++ bfq_bfqq_budget_timeout(bfqq));
++
++ return (!bfq_bfqq_wait_request(bfqq) ||
++ bfq_bfqq_budget_left(bfqq) >= bfqq->entity.budget / 3)
++ &&
++ bfq_bfqq_budget_timeout(bfqq);
++}
++
++/*
++ * Device idling is allowed only for the queues for which this function
++ * returns true. For this reason, the return value of this function plays a
++ * critical role for both throughput boosting and service guarantees. The
++ * return value is computed through a logical expression. In this rather
++ * long comment, we try to briefly describe all the details and motivations
++ * behind the components of this logical expression.
++ *
++ * First, the expression may be true only for sync queues. Besides, if
++ * bfqq is also being weight-raised, then the expression always evaluates
++ * to true, as device idling is instrumental for preserving low-latency
++ * guarantees (see [1]). Otherwise, the expression evaluates to true only
++ * if bfqq has a non-null idle window and at least one of the following
++ * two conditions holds. The first condition is that the device is not
++ * performing NCQ, because idling the device most certainly boosts the
++ * throughput if this condition holds and bfqq has been granted a non-null
++ * idle window. The second compound condition is made of the logical AND of
++ * two components.
++ *
++ * The first component is true only if there is no weight-raised busy
++ * queue. This guarantees that the device is not idled for a sync non-
++ * weight-raised queue when there are busy weight-raised queues. The former
++ * is then expired immediately if empty. Combined with the timestamping
++ * rules of BFQ (see [1] for details), this causes sync non-weight-raised
++ * queues to get a lower number of requests served, and hence to ask for a
++ * lower number of requests from the request pool, before the busy weight-
++ * raised queues get served again.
++ *
++ * This is beneficial for the processes associated with weight-raised
++ * queues, when the request pool is saturated (e.g., in the presence of
++ * write hogs). In fact, if the processes associated with the other queues
++ * ask for requests at a lower rate, then weight-raised processes have a
++ * higher probability to get a request from the pool immediately (or at
++ * least soon) when they need one. Hence they have a higher probability to
++ * actually get a fraction of the disk throughput proportional to their
++ * high weight. This is especially true with NCQ-capable drives, which
++ * enqueue several requests in advance and further reorder internally-
++ * queued requests.
++ *
++ * In the end, mistreating non-weight-raised queues when there are busy
++ * weight-raised queues seems to mitigate starvation problems in the
++ * presence of heavy write workloads and NCQ, and hence to guarantee a
++ * higher application and system responsiveness in these hostile scenarios.
++ *
++ * If the first component of the compound condition is instead true, i.e.,
++ * there is no weight-raised busy queue, then the second component of the
++ * compound condition takes into account service-guarantee and throughput
++ * issues related to NCQ (recall that the compound condition is evaluated
++ * only if the device is detected as supporting NCQ).
++ *
++ * As for service guarantees, allowing the drive to enqueue more than one
++ * request at a time, and hence delegating de facto final scheduling
++ * decisions to the drive's internal scheduler, causes loss of control on
++ * the actual request service order. In this respect, when the drive is
++ * allowed to enqueue more than one request at a time, the service
++ * distribution enforced by the drive's internal scheduler is likely to
++ * coincide with the desired device-throughput distribution only in the
++ * following, perfectly symmetric, scenario:
++ * 1) all active queues have the same weight,
++ * 2) all active groups at the same level in the groups tree have the same
++ * weight,
++ * 3) all active groups at the same level in the groups tree have the same
++ * number of children.
++ *
++ * Even in such a scenario, sequential I/O may still receive a preferential
++ * treatment, but this is not likely to be a big issue with flash-based
++ * devices, because of their non-dramatic loss of throughput with random
++ * I/O. Things do differ with HDDs, for which additional care is taken, as
++ * explained after completing the discussion for flash-based devices.
++ *
++ * Unfortunately, keeping the necessary state for evaluating exactly the
++ * above symmetry conditions would be quite complex and time-consuming.
++ * Therefore BFQ evaluates instead the following stronger sub-conditions,
++ * for which it is much easier to maintain the needed state:
++ * 1) all active queues have the same weight,
++ * 2) all active groups have the same weight,
++ * 3) all active groups have at most one active child each.
++ * In particular, the last two conditions are always true if hierarchical
++ * support and the cgroups interface are not enabled, hence no state needs
++ * to be maintained in this case.
++ *
++ * According to the above considerations, the second component of the
++ * compound condition evaluates to true if any of the above symmetry
++ * sub-conditions does not hold, or the device is not flash-based. Therefore,
++ * if also the first component is true, then idling is allowed for a sync
++ * queue. These are the only sub-conditions considered if the device is
++ * flash-based, as, for such a device, it is sensible to force idling only
++ * for service-guarantee issues. In fact, as for throughput, idling
++ * NCQ-capable flash-based devices would not boost the throughput even
++ * with sequential I/O; rather it would lower the throughput in proportion
++ * to how fast the device is. In the end, (only) if all the three
++ * sub-conditions hold and the device is flash-based, the compound
++ * condition evaluates to false and therefore no idling is performed.
++ *
++ * As already said, things change with a rotational device, where idling
++ * boosts the throughput with sequential I/O (even with NCQ). Hence, for
++ * such a device the second component of the compound condition evaluates
++ * to true also if the following additional sub-condition does not hold:
++ * the queue is constantly seeky. Unfortunately, this different behavior
++ * with respect to flash-based devices causes an additional asymmetry: if
++ * some sync queues enjoy idling and some other sync queues do not, then
++ * the latter get a low share of the device throughput, simply because the
++ * former get many requests served after being set as in service, whereas
++ * the latter do not. As a consequence, to guarantee the desired throughput
++ * distribution, on HDDs the compound expression evaluates to true (and
++ * hence device idling is performed) also if the following last symmetry
++ * condition does not hold: no other queue is benefiting from idling. Also
++ * this last condition is actually replaced with a simpler-to-maintain and
++ * stronger condition: there is no busy queue which is not constantly seeky
++ * (and hence may also benefit from idling).
++ *
++ * To sum up, when all the required symmetry and throughput-boosting
++ * sub-conditions hold, the second component of the compound condition
++ * evaluates to false, and hence no idling is performed. This helps to
++ * keep the drives' internal queues full on NCQ-capable devices, and hence
++ * to boost the throughput, without causing 'almost' any loss of service
++ * guarantees. The 'almost' follows from the fact that, if the internal
++ * queue of one such device is filled while all the sub-conditions hold,
++ * but at some point in time some sub-condition ceases to hold, then it may
++ * become impossible to let requests be served in the new desired order
++ * until all the requests already queued in the device have been served.
++ */
++static inline bool bfq_bfqq_must_not_expire(struct bfq_queue *bfqq)
++{
++ struct bfq_data *bfqd = bfqq->bfqd;
++#ifdef CONFIG_CGROUP_BFQIO
++#define symmetric_scenario (!bfqd->active_numerous_groups && \
++ !bfq_differentiated_weights(bfqd))
++#else
++#define symmetric_scenario (!bfq_differentiated_weights(bfqd))
++#endif
++#define cond_for_seeky_on_ncq_hdd (bfq_bfqq_constantly_seeky(bfqq) && \
++ bfqd->busy_in_flight_queues == \
++ bfqd->const_seeky_busy_in_flight_queues)
++/*
++ * Condition for expiring a non-weight-raised queue (and hence not idling
++ * the device).
++ */
++#define cond_for_expiring_non_wr (bfqd->hw_tag && \
++ (bfqd->wr_busy_queues > 0 || \
++ (symmetric_scenario && \
++ (blk_queue_nonrot(bfqd->queue) || \
++ cond_for_seeky_on_ncq_hdd))))
++
++ return bfq_bfqq_sync(bfqq) &&
++ (bfq_bfqq_IO_bound(bfqq) || bfqq->wr_coeff > 1) &&
++ (bfqq->wr_coeff > 1 ||
++ (bfq_bfqq_idle_window(bfqq) &&
++ !cond_for_expiring_non_wr)
++ );
++}
++
++/*
++ * If the in-service queue is empty but sync, and the function
++ * bfq_bfqq_must_not_expire returns true, then:
++ * 1) the queue must remain in service and cannot be expired, and
++ * 2) the disk must be idled to wait for the possible arrival of a new
++ * request for the queue.
++ * See the comments to the function bfq_bfqq_must_not_expire for the reasons
++ * why performing device idling is the best choice to boost the throughput
++ * and preserve service guarantees when bfq_bfqq_must_not_expire itself
++ * returns true.
++ */
++static inline bool bfq_bfqq_must_idle(struct bfq_queue *bfqq)
++{
++ struct bfq_data *bfqd = bfqq->bfqd;
++
++ return RB_EMPTY_ROOT(&bfqq->sort_list) && bfqd->bfq_slice_idle != 0 &&
++ bfq_bfqq_must_not_expire(bfqq);
++}
++
++/*
++ * Select a queue for service. If we have a current queue in service,
++ * check whether to continue servicing it, or retrieve and set a new one.
++ */
++static struct bfq_queue *bfq_select_queue(struct bfq_data *bfqd)
++{
++ struct bfq_queue *bfqq, *new_bfqq = NULL;
++ struct request *next_rq;
++ enum bfqq_expiration reason = BFQ_BFQQ_BUDGET_TIMEOUT;
++
++ bfqq = bfqd->in_service_queue;
++ if (bfqq == NULL)
++ goto new_queue;
++
++ bfq_log_bfqq(bfqd, bfqq, "select_queue: already in-service queue");
++
++ /*
++ * If another queue has a request waiting within our mean seek
++ * distance, let it run. The expire code will check for close
++ * cooperators and put the close queue at the front of the
++ * service tree. If possible, merge the expiring queue with the
++ * new bfqq.
++ */
++ new_bfqq = bfq_close_cooperator(bfqd, bfqq);
++ if (new_bfqq != NULL && bfqq->new_bfqq == NULL)
++ bfq_setup_merge(bfqq, new_bfqq);
++
++ if (bfq_may_expire_for_budg_timeout(bfqq) &&
++ !timer_pending(&bfqd->idle_slice_timer) &&
++ !bfq_bfqq_must_idle(bfqq))
++ goto expire;
++
++ next_rq = bfqq->next_rq;
++ /*
++ * If bfqq has requests queued and it has enough budget left to
++ * serve them, keep the queue, otherwise expire it.
++ */
++ if (next_rq != NULL) {
++ if (bfq_serv_to_charge(next_rq, bfqq) >
++ bfq_bfqq_budget_left(bfqq)) {
++ reason = BFQ_BFQQ_BUDGET_EXHAUSTED;
++ goto expire;
++ } else {
++ /*
++ * The idle timer may be pending because we may
++ * not disable disk idling even when a new request
++ * arrives.
++ */
++ if (timer_pending(&bfqd->idle_slice_timer)) {
++ /*
++ * If we get here: 1) at least a new request
++ * has arrived but we have not disabled the
++ * timer because the request was too small,
++ * 2) then the block layer has unplugged
++ * the device, causing the dispatch to be
++ * invoked.
++ *
++ * Since the device is unplugged, now the
++ * requests are probably large enough to
++ * provide a reasonable throughput.
++ * So we disable idling.
++ */
++ bfq_clear_bfqq_wait_request(bfqq);
++ del_timer(&bfqd->idle_slice_timer);
++ }
++ if (new_bfqq == NULL)
++ goto keep_queue;
++ else
++ goto expire;
++ }
++ }
++
++ /*
++ * No requests pending. If the in-service queue still has requests
++ * in flight (possibly waiting for a completion) or is idling for a
++ * new request, then keep it.
++ */
++ if (new_bfqq == NULL && (timer_pending(&bfqd->idle_slice_timer) ||
++ (bfqq->dispatched != 0 && bfq_bfqq_must_not_expire(bfqq)))) {
++ bfqq = NULL;
++ goto keep_queue;
++ } else if (new_bfqq != NULL && timer_pending(&bfqd->idle_slice_timer)) {
++ /*
++ * Expiring the queue because there is a close cooperator,
++ * cancel timer.
++ */
++ bfq_clear_bfqq_wait_request(bfqq);
++ del_timer(&bfqd->idle_slice_timer);
++ }
++
++ reason = BFQ_BFQQ_NO_MORE_REQUESTS;
++expire:
++ bfq_bfqq_expire(bfqd, bfqq, 0, reason);
++new_queue:
++ bfqq = bfq_set_in_service_queue(bfqd, new_bfqq);
++ bfq_log(bfqd, "select_queue: new queue %d returned",
++ bfqq != NULL ? bfqq->pid : 0);
++keep_queue:
++ return bfqq;
++}
++
++static void bfq_update_wr_data(struct bfq_data *bfqd,
++ struct bfq_queue *bfqq)
++{
++ if (bfqq->wr_coeff > 1) { /* queue is being boosted */
++ struct bfq_entity *entity = &bfqq->entity;
++
++ bfq_log_bfqq(bfqd, bfqq,
++ "raising period dur %u/%u msec, old coeff %u, w %d(%d)",
++ jiffies_to_msecs(jiffies -
++ bfqq->last_wr_start_finish),
++ jiffies_to_msecs(bfqq->wr_cur_max_time),
++ bfqq->wr_coeff,
++ bfqq->entity.weight, bfqq->entity.orig_weight);
++
++ BUG_ON(bfqq != bfqd->in_service_queue && entity->weight !=
++ entity->orig_weight * bfqq->wr_coeff);
++ if (entity->ioprio_changed)
++ bfq_log_bfqq(bfqd, bfqq, "WARN: pending prio change");
++ /*
++ * If too much time has elapsed from the beginning
++ * of this weight-raising, stop it.
++ */
++ if (time_is_before_jiffies(bfqq->last_wr_start_finish +
++ bfqq->wr_cur_max_time)) {
++ bfqq->last_wr_start_finish = jiffies;
++ bfq_log_bfqq(bfqd, bfqq,
++ "wrais ending at %lu, rais_max_time %u",
++ bfqq->last_wr_start_finish,
++ jiffies_to_msecs(bfqq->wr_cur_max_time));
++ bfq_bfqq_end_wr(bfqq);
++ __bfq_entity_update_weight_prio(
++ bfq_entity_service_tree(entity),
++ entity);
++ }
++ }
++}
++
++/*
++ * Dispatch one request from bfqq, moving it to the request queue
++ * dispatch list.
++ */
++static int bfq_dispatch_request(struct bfq_data *bfqd,
++ struct bfq_queue *bfqq)
++{
++ int dispatched = 0;
++ struct request *rq;
++ unsigned long service_to_charge;
++
++ BUG_ON(RB_EMPTY_ROOT(&bfqq->sort_list));
++
++ /* Follow expired path, else get first next available. */
++ rq = bfq_check_fifo(bfqq);
++ if (rq == NULL)
++ rq = bfqq->next_rq;
++ service_to_charge = bfq_serv_to_charge(rq, bfqq);
++
++ if (service_to_charge > bfq_bfqq_budget_left(bfqq)) {
++ /*
++ * This may happen if the next rq is chosen in fifo order
++ * instead of sector order. The budget is properly
++ * dimensioned to be always sufficient to serve the next
++ * request only if it is chosen in sector order. The reason
++ * is that it would be quite inefficient and of little use
++ * to always make sure that the budget is large enough to
++ * serve even the possible next rq in fifo order.
++ * In fact, requests are seldom served in fifo order.
++ *
++ * Expire the queue for budget exhaustion, and make sure
++ * that the next act_budget is enough to serve the next
++ * request, even if it comes from the fifo expired path.
++ */
++ bfqq->next_rq = rq;
++ /*
++ * Since this dispatch failed, make sure that
++ * a new one will be performed.
++ */
++ if (!bfqd->rq_in_driver)
++ bfq_schedule_dispatch(bfqd);
++ goto expire;
++ }
++
++ /* Finally, insert request into driver dispatch list. */
++ bfq_bfqq_served(bfqq, service_to_charge);
++ bfq_dispatch_insert(bfqd->queue, rq);
++
++ bfq_update_wr_data(bfqd, bfqq);
++
++ bfq_log_bfqq(bfqd, bfqq,
++ "dispatched %u sec req (%llu), budg left %lu",
++ blk_rq_sectors(rq),
++ (unsigned long long)blk_rq_pos(rq),
++ bfq_bfqq_budget_left(bfqq));
++
++ dispatched++;
++
++ if (bfqd->in_service_bic == NULL) {
++ atomic_long_inc(&RQ_BIC(rq)->icq.ioc->refcount);
++ bfqd->in_service_bic = RQ_BIC(rq);
++ }
++
++ if (bfqd->busy_queues > 1 && ((!bfq_bfqq_sync(bfqq) &&
++ dispatched >= bfqd->bfq_max_budget_async_rq) ||
++ bfq_class_idle(bfqq)))
++ goto expire;
++
++ return dispatched;
++
++expire:
++ bfq_bfqq_expire(bfqd, bfqq, 0, BFQ_BFQQ_BUDGET_EXHAUSTED);
++ return dispatched;
++}
++
++static int __bfq_forced_dispatch_bfqq(struct bfq_queue *bfqq)
++{
++ int dispatched = 0;
++
++ while (bfqq->next_rq != NULL) {
++ bfq_dispatch_insert(bfqq->bfqd->queue, bfqq->next_rq);
++ dispatched++;
++ }
++
++ BUG_ON(!list_empty(&bfqq->fifo));
++ return dispatched;
++}
++
++/*
++ * Drain our current requests.
++ * Used for barriers and when switching io schedulers on-the-fly.
++ */
++static int bfq_forced_dispatch(struct bfq_data *bfqd)
++{
++ struct bfq_queue *bfqq, *n;
++ struct bfq_service_tree *st;
++ int dispatched = 0;
++
++ bfqq = bfqd->in_service_queue;
++ if (bfqq != NULL)
++ __bfq_bfqq_expire(bfqd, bfqq);
++
++ /*
++ * Loop through classes, and be careful to leave the scheduler
++ * in a consistent state, as feedback mechanisms and vtime
++ * updates cannot be disabled during the process.
++ */
++ list_for_each_entry_safe(bfqq, n, &bfqd->active_list, bfqq_list) {
++ st = bfq_entity_service_tree(&bfqq->entity);
++
++ dispatched += __bfq_forced_dispatch_bfqq(bfqq);
++ bfqq->max_budget = bfq_max_budget(bfqd);
++
++ bfq_forget_idle(st);
++ }
++
++ BUG_ON(bfqd->busy_queues != 0);
++
++ return dispatched;
++}
++
++static int bfq_dispatch_requests(struct request_queue *q, int force)
++{
++ struct bfq_data *bfqd = q->elevator->elevator_data;
++ struct bfq_queue *bfqq;
++ int max_dispatch;
++
++ bfq_log(bfqd, "dispatch requests: %d busy queues", bfqd->busy_queues);
++ if (bfqd->busy_queues == 0)
++ return 0;
++
++ if (unlikely(force))
++ return bfq_forced_dispatch(bfqd);
++
++ bfqq = bfq_select_queue(bfqd);
++ if (bfqq == NULL)
++ return 0;
++
++ max_dispatch = bfqd->bfq_quantum;
++ if (bfq_class_idle(bfqq))
++ max_dispatch = 1;
++
++ if (!bfq_bfqq_sync(bfqq))
++ max_dispatch = bfqd->bfq_max_budget_async_rq;
++
++ if (bfqq->dispatched >= max_dispatch) {
++ if (bfqd->busy_queues > 1)
++ return 0;
++ if (bfqq->dispatched >= 4 * max_dispatch)
++ return 0;
++ }
++
++ if (bfqd->sync_flight != 0 && !bfq_bfqq_sync(bfqq))
++ return 0;
++
++ bfq_clear_bfqq_wait_request(bfqq);
++ BUG_ON(timer_pending(&bfqd->idle_slice_timer));
++
++ if (!bfq_dispatch_request(bfqd, bfqq))
++ return 0;
++
++ bfq_log_bfqq(bfqd, bfqq, "dispatched one request of %d (max_disp %d)",
++ bfqq->pid, max_dispatch);
++
++ return 1;
++}
++
++/*
++ * Task holds one reference to the queue, dropped when task exits. Each rq
++ * in-flight on this queue also holds a reference, dropped when rq is freed.
++ *
++ * Queue lock must be held here.
++ */
++static void bfq_put_queue(struct bfq_queue *bfqq)
++{
++ struct bfq_data *bfqd = bfqq->bfqd;
++
++ BUG_ON(atomic_read(&bfqq->ref) <= 0);
++
++ bfq_log_bfqq(bfqd, bfqq, "put_queue: %p %d", bfqq,
++ atomic_read(&bfqq->ref));
++ if (!atomic_dec_and_test(&bfqq->ref))
++ return;
++
++ BUG_ON(rb_first(&bfqq->sort_list) != NULL);
++ BUG_ON(bfqq->allocated[READ] + bfqq->allocated[WRITE] != 0);
++ BUG_ON(bfqq->entity.tree != NULL);
++ BUG_ON(bfq_bfqq_busy(bfqq));
++ BUG_ON(bfqd->in_service_queue == bfqq);
++
++ bfq_log_bfqq(bfqd, bfqq, "put_queue: %p freed", bfqq);
++
++ kmem_cache_free(bfq_pool, bfqq);
++}
++
++static void bfq_put_cooperator(struct bfq_queue *bfqq)
++{
++ struct bfq_queue *__bfqq, *next;
++
++ /*
++ * If this queue was scheduled to merge with another queue, be
++ * sure to drop the reference taken on that queue (and others in
++ * the merge chain). See bfq_setup_merge and bfq_merge_bfqqs.
++ */
++ __bfqq = bfqq->new_bfqq;
++ while (__bfqq) {
++ if (__bfqq == bfqq)
++ break;
++ next = __bfqq->new_bfqq;
++ bfq_put_queue(__bfqq);
++ __bfqq = next;
++ }
++}
++
++static void bfq_exit_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq)
++{
++ if (bfqq == bfqd->in_service_queue) {
++ __bfq_bfqq_expire(bfqd, bfqq);
++ bfq_schedule_dispatch(bfqd);
++ }
++
++ bfq_log_bfqq(bfqd, bfqq, "exit_bfqq: %p, %d", bfqq,
++ atomic_read(&bfqq->ref));
++
++ bfq_put_cooperator(bfqq);
++
++ bfq_put_queue(bfqq);
++}
++
++static inline void bfq_init_icq(struct io_cq *icq)
++{
++ struct bfq_io_cq *bic = icq_to_bic(icq);
++
++ bic->ttime.last_end_request = jiffies;
++}
++
++static void bfq_exit_icq(struct io_cq *icq)
++{
++ struct bfq_io_cq *bic = icq_to_bic(icq);
++ struct bfq_data *bfqd = bic_to_bfqd(bic);
++
++ if (bic->bfqq[BLK_RW_ASYNC]) {
++ bfq_exit_bfqq(bfqd, bic->bfqq[BLK_RW_ASYNC]);
++ bic->bfqq[BLK_RW_ASYNC] = NULL;
++ }
++
++ if (bic->bfqq[BLK_RW_SYNC]) {
++ bfq_exit_bfqq(bfqd, bic->bfqq[BLK_RW_SYNC]);
++ bic->bfqq[BLK_RW_SYNC] = NULL;
++ }
++}
++
++/*
++ * Update the entity prio values; note that the new values will not
++ * be used until the next (re)activation.
++ */
++static void bfq_init_prio_data(struct bfq_queue *bfqq, struct bfq_io_cq *bic)
++{
++ struct task_struct *tsk = current;
++ int ioprio_class;
++
++ if (!bfq_bfqq_prio_changed(bfqq))
++ return;
++
++ ioprio_class = IOPRIO_PRIO_CLASS(bic->ioprio);
++ switch (ioprio_class) {
++ default:
++ dev_err(bfqq->bfqd->queue->backing_dev_info.dev,
++ "bfq: bad prio %x\n", ioprio_class);
++ /* fall through */
++ case IOPRIO_CLASS_NONE:
++ /*
++ * No prio set, inherit CPU scheduling settings.
++ */
++ bfqq->entity.new_ioprio = task_nice_ioprio(tsk);
++ bfqq->entity.new_ioprio_class = task_nice_ioclass(tsk);
++ break;
++ case IOPRIO_CLASS_RT:
++ bfqq->entity.new_ioprio = IOPRIO_PRIO_DATA(bic->ioprio);
++ bfqq->entity.new_ioprio_class = IOPRIO_CLASS_RT;
++ break;
++ case IOPRIO_CLASS_BE:
++ bfqq->entity.new_ioprio = IOPRIO_PRIO_DATA(bic->ioprio);
++ bfqq->entity.new_ioprio_class = IOPRIO_CLASS_BE;
++ break;
++ case IOPRIO_CLASS_IDLE:
++ bfqq->entity.new_ioprio_class = IOPRIO_CLASS_IDLE;
++ bfqq->entity.new_ioprio = 7;
++ bfq_clear_bfqq_idle_window(bfqq);
++ break;
++ }
++
++ bfqq->entity.ioprio_changed = 1;
++
++ bfq_clear_bfqq_prio_changed(bfqq);
++}
++
++static void bfq_changed_ioprio(struct bfq_io_cq *bic)
++{
++ struct bfq_data *bfqd;
++ struct bfq_queue *bfqq, *new_bfqq;
++ struct bfq_group *bfqg;
++ unsigned long uninitialized_var(flags);
++ int ioprio = bic->icq.ioc->ioprio;
++
++ bfqd = bfq_get_bfqd_locked(&(bic->icq.q->elevator->elevator_data),
++ &flags);
++ /*
++ * This condition may trigger on a newly created bic, be sure to
++ * drop the lock before returning.
++ */
++ if (unlikely(bfqd == NULL) || likely(bic->ioprio == ioprio))
++ goto out;
++
++ bfqq = bic->bfqq[BLK_RW_ASYNC];
++ if (bfqq != NULL) {
++ bfqg = container_of(bfqq->entity.sched_data, struct bfq_group,
++ sched_data);
++ new_bfqq = bfq_get_queue(bfqd, bfqg, BLK_RW_ASYNC, bic,
++ GFP_ATOMIC);
++ if (new_bfqq != NULL) {
++ bic->bfqq[BLK_RW_ASYNC] = new_bfqq;
++ bfq_log_bfqq(bfqd, bfqq,
++ "changed_ioprio: bfqq %p %d",
++ bfqq, atomic_read(&bfqq->ref));
++ bfq_put_queue(bfqq);
++ }
++ }
++
++ bfqq = bic->bfqq[BLK_RW_SYNC];
++ if (bfqq != NULL)
++ bfq_mark_bfqq_prio_changed(bfqq);
++
++ bic->ioprio = ioprio;
++
++out:
++ bfq_put_bfqd_unlock(bfqd, &flags);
++}
++
++static void bfq_init_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq,
++ pid_t pid, int is_sync)
++{
++ RB_CLEAR_NODE(&bfqq->entity.rb_node);
++ INIT_LIST_HEAD(&bfqq->fifo);
++
++ atomic_set(&bfqq->ref, 0);
++ bfqq->bfqd = bfqd;
++
++ bfq_mark_bfqq_prio_changed(bfqq);
++
++ if (is_sync) {
++ if (!bfq_class_idle(bfqq))
++ bfq_mark_bfqq_idle_window(bfqq);
++ bfq_mark_bfqq_sync(bfqq);
++ }
++ bfq_mark_bfqq_IO_bound(bfqq);
++
++ /* Tentative initial value to trade off between thr and lat */
++ bfqq->max_budget = (2 * bfq_max_budget(bfqd)) / 3;
++ bfqq->pid = pid;
++
++ bfqq->wr_coeff = 1;
++ bfqq->last_wr_start_finish = 0;
++ /*
++ * Set to the value for which bfqq will not be deemed as
++ * soft rt when it becomes backlogged.
++ */
++ bfqq->soft_rt_next_start = bfq_infinity_from_now(jiffies);
++}
++
++static struct bfq_queue *bfq_find_alloc_queue(struct bfq_data *bfqd,
++ struct bfq_group *bfqg,
++ int is_sync,
++ struct bfq_io_cq *bic,
++ gfp_t gfp_mask)
++{
++ struct bfq_queue *bfqq, *new_bfqq = NULL;
++
++retry:
++ /* bic always exists here */
++ bfqq = bic_to_bfqq(bic, is_sync);
++
++ /*
++ * If we originally fell back to the OOM bfqq, always try
++ * a new alloc, since the OOM condition should be just a
++ * temporary situation.
++ */
++ if (bfqq == NULL || bfqq == &bfqd->oom_bfqq) {
++ bfqq = NULL;
++ if (new_bfqq != NULL) {
++ bfqq = new_bfqq;
++ new_bfqq = NULL;
++ } else if (gfp_mask & __GFP_WAIT) {
++ spin_unlock_irq(bfqd->queue->queue_lock);
++ new_bfqq = kmem_cache_alloc_node(bfq_pool,
++ gfp_mask | __GFP_ZERO,
++ bfqd->queue->node);
++ spin_lock_irq(bfqd->queue->queue_lock);
++ if (new_bfqq != NULL)
++ goto retry;
++ } else {
++ bfqq = kmem_cache_alloc_node(bfq_pool,
++ gfp_mask | __GFP_ZERO,
++ bfqd->queue->node);
++ }
++
++ if (bfqq != NULL) {
++ bfq_init_bfqq(bfqd, bfqq, current->pid, is_sync);
++ bfq_log_bfqq(bfqd, bfqq, "allocated");
++ } else {
++ bfqq = &bfqd->oom_bfqq;
++ bfq_log_bfqq(bfqd, bfqq, "using oom bfqq");
++ }
++
++ bfq_init_prio_data(bfqq, bic);
++ bfq_init_entity(&bfqq->entity, bfqg);
++ }
++
++ if (new_bfqq != NULL)
++ kmem_cache_free(bfq_pool, new_bfqq);
++
++ return bfqq;
++}
++
++static struct bfq_queue **bfq_async_queue_prio(struct bfq_data *bfqd,
++ struct bfq_group *bfqg,
++ int ioprio_class, int ioprio)
++{
++ switch (ioprio_class) {
++ case IOPRIO_CLASS_RT:
++ return &bfqg->async_bfqq[0][ioprio];
++ case IOPRIO_CLASS_NONE:
++ ioprio = IOPRIO_NORM;
++ /* fall through */
++ case IOPRIO_CLASS_BE:
++ return &bfqg->async_bfqq[1][ioprio];
++ case IOPRIO_CLASS_IDLE:
++ return &bfqg->async_idle_bfqq;
++ default:
++ BUG();
++ }
++}
++
++static struct bfq_queue *bfq_get_queue(struct bfq_data *bfqd,
++ struct bfq_group *bfqg, int is_sync,
++ struct bfq_io_cq *bic, gfp_t gfp_mask)
++{
++ const int ioprio = IOPRIO_PRIO_DATA(bic->ioprio);
++ const int ioprio_class = IOPRIO_PRIO_CLASS(bic->ioprio);
++ struct bfq_queue **async_bfqq = NULL;
++ struct bfq_queue *bfqq = NULL;
++
++ if (!is_sync) {
++ async_bfqq = bfq_async_queue_prio(bfqd, bfqg, ioprio_class,
++ ioprio);
++ bfqq = *async_bfqq;
++ }
++
++ if (bfqq == NULL)
++ bfqq = bfq_find_alloc_queue(bfqd, bfqg, is_sync, bic, gfp_mask);
++
++ /*
++ * Pin the queue now that it's allocated, scheduler exit will
++ * prune it.
++ */
++ if (!is_sync && *async_bfqq == NULL) {
++ atomic_inc(&bfqq->ref);
++ bfq_log_bfqq(bfqd, bfqq, "get_queue, bfqq not in async: %p, %d",
++ bfqq, atomic_read(&bfqq->ref));
++ *async_bfqq = bfqq;
++ }
++
++ atomic_inc(&bfqq->ref);
++ bfq_log_bfqq(bfqd, bfqq, "get_queue, at end: %p, %d", bfqq,
++ atomic_read(&bfqq->ref));
++ return bfqq;
++}
++
++static void bfq_update_io_thinktime(struct bfq_data *bfqd,
++ struct bfq_io_cq *bic)
++{
++ unsigned long elapsed = jiffies - bic->ttime.last_end_request;
++ unsigned long ttime = min(elapsed, 2UL * bfqd->bfq_slice_idle);
++
++ bic->ttime.ttime_samples = (7*bic->ttime.ttime_samples + 256) / 8;
++ bic->ttime.ttime_total = (7*bic->ttime.ttime_total + 256*ttime) / 8;
++ bic->ttime.ttime_mean = (bic->ttime.ttime_total + 128) /
++ bic->ttime.ttime_samples;
++}
++
++static void bfq_update_io_seektime(struct bfq_data *bfqd,
++ struct bfq_queue *bfqq,
++ struct request *rq)
++{
++ sector_t sdist;
++ u64 total;
++
++ if (bfqq->last_request_pos < blk_rq_pos(rq))
++ sdist = blk_rq_pos(rq) - bfqq->last_request_pos;
++ else
++ sdist = bfqq->last_request_pos - blk_rq_pos(rq);
++
++ /*
++ * Don't allow the seek distance to get too large from the
++ * odd fragment, pagein, etc.
++ */
++ if (bfqq->seek_samples == 0) /* first request, not really a seek */
++ sdist = 0;
++ else if (bfqq->seek_samples <= 60) /* second & third seek */
++ sdist = min(sdist, (bfqq->seek_mean * 4) + 2*1024*1024);
++ else
++ sdist = min(sdist, (bfqq->seek_mean * 4) + 2*1024*64);
++
++ bfqq->seek_samples = (7*bfqq->seek_samples + 256) / 8;
++ bfqq->seek_total = (7*bfqq->seek_total + (u64)256*sdist) / 8;
++ total = bfqq->seek_total + (bfqq->seek_samples/2);
++ do_div(total, bfqq->seek_samples);
++ bfqq->seek_mean = (sector_t)total;
++
++ bfq_log_bfqq(bfqd, bfqq, "dist=%llu mean=%llu", (u64)sdist,
++ (u64)bfqq->seek_mean);
++}
++
++/*
++ * Disable idle window if the process thinks too long or seeks so much that
++ * it doesn't matter.
++ */
++static void bfq_update_idle_window(struct bfq_data *bfqd,
++ struct bfq_queue *bfqq,
++ struct bfq_io_cq *bic)
++{
++ int enable_idle;
++
++ /* Don't idle for async or idle io prio class. */
++ if (!bfq_bfqq_sync(bfqq) || bfq_class_idle(bfqq))
++ return;
++
++ enable_idle = bfq_bfqq_idle_window(bfqq);
++
++ if (atomic_read(&bic->icq.ioc->active_ref) == 0 ||
++ bfqd->bfq_slice_idle == 0 ||
++ (bfqd->hw_tag && BFQQ_SEEKY(bfqq) &&
++ bfqq->wr_coeff == 1))
++ enable_idle = 0;
++ else if (bfq_sample_valid(bic->ttime.ttime_samples)) {
++ if (bic->ttime.ttime_mean > bfqd->bfq_slice_idle &&
++ bfqq->wr_coeff == 1)
++ enable_idle = 0;
++ else
++ enable_idle = 1;
++ }
++ bfq_log_bfqq(bfqd, bfqq, "update_idle_window: enable_idle %d",
++ enable_idle);
++
++ if (enable_idle)
++ bfq_mark_bfqq_idle_window(bfqq);
++ else
++ bfq_clear_bfqq_idle_window(bfqq);
++}
++
++/*
++ * Called when a new fs request (rq) is added to bfqq. Check if there's
++ * something we should do about it.
++ */
++static void bfq_rq_enqueued(struct bfq_data *bfqd, struct bfq_queue *bfqq,
++ struct request *rq)
++{
++ struct bfq_io_cq *bic = RQ_BIC(rq);
++
++ if (rq->cmd_flags & REQ_META)
++ bfqq->meta_pending++;
++
++ bfq_update_io_thinktime(bfqd, bic);
++ bfq_update_io_seektime(bfqd, bfqq, rq);
++ if (!BFQQ_SEEKY(bfqq) && bfq_bfqq_constantly_seeky(bfqq)) {
++ bfq_clear_bfqq_constantly_seeky(bfqq);
++ if (!blk_queue_nonrot(bfqd->queue)) {
++ BUG_ON(!bfqd->const_seeky_busy_in_flight_queues);
++ bfqd->const_seeky_busy_in_flight_queues--;
++ }
++ }
++ if (bfqq->entity.service > bfq_max_budget(bfqd) / 8 ||
++ !BFQQ_SEEKY(bfqq))
++ bfq_update_idle_window(bfqd, bfqq, bic);
++
++ bfq_log_bfqq(bfqd, bfqq,
++ "rq_enqueued: idle_window=%d (seeky %d, mean %llu)",
++ bfq_bfqq_idle_window(bfqq), BFQQ_SEEKY(bfqq),
++ (unsigned long long)bfqq->seek_mean);
++
++ bfqq->last_request_pos = blk_rq_pos(rq) + blk_rq_sectors(rq);
++
++ if (bfqq == bfqd->in_service_queue && bfq_bfqq_wait_request(bfqq)) {
++ int small_req = bfqq->queued[rq_is_sync(rq)] == 1 &&
++ blk_rq_sectors(rq) < 32;
++ int budget_timeout = bfq_bfqq_budget_timeout(bfqq);
++
++ /*
++ * There is just this request queued: if the request
++ * is small and the queue is not to be expired, then
++ * just exit.
++ *
++ * In this way, if the disk is being idled to wait for
++ * a new request from the in-service queue, we avoid
++ * unplugging the device and committing the disk to serve
++ * just a small request. On the contrary, we wait for
++ * the block layer to decide when to unplug the device:
++ * hopefully, new requests will be merged to this one
++ * quickly, then the device will be unplugged and
++ * larger requests will be dispatched.
++ */
++ if (small_req && !budget_timeout)
++ return;
++
++ /*
++ * A large enough request arrived, or the queue is to
++ * be expired: in both cases disk idling is to be
++ * stopped, so clear wait_request flag and reset
++ * timer.
++ */
++ bfq_clear_bfqq_wait_request(bfqq);
++ del_timer(&bfqd->idle_slice_timer);
++
++ /*
++ * The queue is not empty, because a new request just
++ * arrived. Hence we can safely expire the queue, in
++ * case of budget timeout, without risking that the
++ * timestamps of the queue are not updated correctly.
++ * See [1] for more details.
++ */
++ if (budget_timeout)
++ bfq_bfqq_expire(bfqd, bfqq, 0, BFQ_BFQQ_BUDGET_TIMEOUT);
++
++ /*
++ * Let the request rip immediately, or let a new queue be
++ * selected if bfqq has just been expired.
++ */
++ __blk_run_queue(bfqd->queue);
++ }
++}
++
++static void bfq_insert_request(struct request_queue *q, struct request *rq)
++{
++ struct bfq_data *bfqd = q->elevator->elevator_data;
++ struct bfq_queue *bfqq = RQ_BFQQ(rq);
++
++ assert_spin_locked(bfqd->queue->queue_lock);
++ bfq_init_prio_data(bfqq, RQ_BIC(rq));
++
++ bfq_add_request(rq);
++
++ rq->fifo_time = jiffies + bfqd->bfq_fifo_expire[rq_is_sync(rq)];
++ list_add_tail(&rq->queuelist, &bfqq->fifo);
++
++ bfq_rq_enqueued(bfqd, bfqq, rq);
++}
++
++static void bfq_update_hw_tag(struct bfq_data *bfqd)
++{
++ bfqd->max_rq_in_driver = max(bfqd->max_rq_in_driver,
++ bfqd->rq_in_driver);
++
++ if (bfqd->hw_tag == 1)
++ return;
++
++ /*
++ * This sample is valid if the number of outstanding requests
++ * is large enough to allow a queueing behavior. Note that the
++ * sum is not exact, as it's not taking into account deactivated
++ * requests.
++ */
++ if (bfqd->rq_in_driver + bfqd->queued < BFQ_HW_QUEUE_THRESHOLD)
++ return;
++
++ if (bfqd->hw_tag_samples++ < BFQ_HW_QUEUE_SAMPLES)
++ return;
++
++ bfqd->hw_tag = bfqd->max_rq_in_driver > BFQ_HW_QUEUE_THRESHOLD;
++ bfqd->max_rq_in_driver = 0;
++ bfqd->hw_tag_samples = 0;
++}
++
++static void bfq_completed_request(struct request_queue *q, struct request *rq)
++{
++ struct bfq_queue *bfqq = RQ_BFQQ(rq);
++ struct bfq_data *bfqd = bfqq->bfqd;
++ bool sync = bfq_bfqq_sync(bfqq);
++
++ bfq_log_bfqq(bfqd, bfqq, "completed one req with %u sects left (%d)",
++ blk_rq_sectors(rq), sync);
++
++ bfq_update_hw_tag(bfqd);
++
++ BUG_ON(!bfqd->rq_in_driver);
++ BUG_ON(!bfqq->dispatched);
++ bfqd->rq_in_driver--;
++ bfqq->dispatched--;
++
++ if (!bfqq->dispatched && !bfq_bfqq_busy(bfqq)) {
++ bfq_weights_tree_remove(bfqd, &bfqq->entity,
++ &bfqd->queue_weights_tree);
++ if (!blk_queue_nonrot(bfqd->queue)) {
++ BUG_ON(!bfqd->busy_in_flight_queues);
++ bfqd->busy_in_flight_queues--;
++ if (bfq_bfqq_constantly_seeky(bfqq)) {
++ BUG_ON(!bfqd->
++ const_seeky_busy_in_flight_queues);
++ bfqd->const_seeky_busy_in_flight_queues--;
++ }
++ }
++ }
++
++ if (sync) {
++ bfqd->sync_flight--;
++ RQ_BIC(rq)->ttime.last_end_request = jiffies;
++ }
++
++ /*
++ * If we are waiting to discover whether the request pattern of the
++ * task associated with the queue is actually isochronous, and
++ * both requisites for this condition to hold are satisfied, then
++ * compute soft_rt_next_start (see the comments to the function
++ * bfq_bfqq_softrt_next_start()).
++ */
++ if (bfq_bfqq_softrt_update(bfqq) && bfqq->dispatched == 0 &&
++ RB_EMPTY_ROOT(&bfqq->sort_list))
++ bfqq->soft_rt_next_start =
++ bfq_bfqq_softrt_next_start(bfqd, bfqq);
++
++ /*
++ * If this is the in-service queue, check if it needs to be expired,
++ * or if we want to idle in case it has no pending requests.
++ */
++ if (bfqd->in_service_queue == bfqq) {
++ if (bfq_bfqq_budget_new(bfqq))
++ bfq_set_budget_timeout(bfqd);
++
++ if (bfq_bfqq_must_idle(bfqq)) {
++ bfq_arm_slice_timer(bfqd);
++ goto out;
++ } else if (bfq_may_expire_for_budg_timeout(bfqq))
++ bfq_bfqq_expire(bfqd, bfqq, 0, BFQ_BFQQ_BUDGET_TIMEOUT);
++ else if (RB_EMPTY_ROOT(&bfqq->sort_list) &&
++ (bfqq->dispatched == 0 ||
++ !bfq_bfqq_must_not_expire(bfqq)))
++ bfq_bfqq_expire(bfqd, bfqq, 0,
++ BFQ_BFQQ_NO_MORE_REQUESTS);
++ }
++
++ if (!bfqd->rq_in_driver)
++ bfq_schedule_dispatch(bfqd);
++
++out:
++ return;
++}
++
++static inline int __bfq_may_queue(struct bfq_queue *bfqq)
++{
++ if (bfq_bfqq_wait_request(bfqq) && bfq_bfqq_must_alloc(bfqq)) {
++ bfq_clear_bfqq_must_alloc(bfqq);
++ return ELV_MQUEUE_MUST;
++ }
++
++ return ELV_MQUEUE_MAY;
++}
++
++static int bfq_may_queue(struct request_queue *q, int rw)
++{
++ struct bfq_data *bfqd = q->elevator->elevator_data;
++ struct task_struct *tsk = current;
++ struct bfq_io_cq *bic;
++ struct bfq_queue *bfqq;
++
++ /*
++ * Don't force setup of a queue from here, as a call to may_queue
++ * does not necessarily imply that a request actually will be
++ * queued. So just lookup a possibly existing queue, or return
++ * 'may queue' if that fails.
++ */
++ bic = bfq_bic_lookup(bfqd, tsk->io_context);
++ if (bic == NULL)
++ return ELV_MQUEUE_MAY;
++
++ bfqq = bic_to_bfqq(bic, rw_is_sync(rw));
++ if (bfqq != NULL) {
++ bfq_init_prio_data(bfqq, bic);
++
++ return __bfq_may_queue(bfqq);
++ }
++
++ return ELV_MQUEUE_MAY;
++}
++
++/*
++ * Queue lock held here.
++ */
++static void bfq_put_request(struct request *rq)
++{
++ struct bfq_queue *bfqq = RQ_BFQQ(rq);
++
++ if (bfqq != NULL) {
++ const int rw = rq_data_dir(rq);
++
++ BUG_ON(!bfqq->allocated[rw]);
++ bfqq->allocated[rw]--;
++
++ rq->elv.priv[0] = NULL;
++ rq->elv.priv[1] = NULL;
++
++ bfq_log_bfqq(bfqq->bfqd, bfqq, "put_request %p, %d",
++ bfqq, atomic_read(&bfqq->ref));
++ bfq_put_queue(bfqq);
++ }
++}
++
++static struct bfq_queue *
++bfq_merge_bfqqs(struct bfq_data *bfqd, struct bfq_io_cq *bic,
++ struct bfq_queue *bfqq)
++{
++ bfq_log_bfqq(bfqd, bfqq, "merging with queue %lu",
++ (unsigned long)bfqq->new_bfqq->pid);
++ bic_set_bfqq(bic, bfqq->new_bfqq, 1);
++ bfq_mark_bfqq_coop(bfqq->new_bfqq);
++ bfq_put_queue(bfqq);
++ return bic_to_bfqq(bic, 1);
++}
++
++/*
++ * Returns NULL if a new bfqq should be allocated, or the old bfqq if this
++ * was the last process referring to said bfqq.
++ */
++static struct bfq_queue *
++bfq_split_bfqq(struct bfq_io_cq *bic, struct bfq_queue *bfqq)
++{
++ bfq_log_bfqq(bfqq->bfqd, bfqq, "splitting queue");
++ if (bfqq_process_refs(bfqq) == 1) {
++ bfqq->pid = current->pid;
++ bfq_clear_bfqq_coop(bfqq);
++ bfq_clear_bfqq_split_coop(bfqq);
++ return bfqq;
++ }
++
++ bic_set_bfqq(bic, NULL, 1);
++
++ bfq_put_cooperator(bfqq);
++
++ bfq_put_queue(bfqq);
++ return NULL;
++}
++
++/*
++ * Allocate bfq data structures associated with this request.
++ */
++static int bfq_set_request(struct request_queue *q, struct request *rq,
++ struct bio *bio, gfp_t gfp_mask)
++{
++ struct bfq_data *bfqd = q->elevator->elevator_data;
++ struct bfq_io_cq *bic = icq_to_bic(rq->elv.icq);
++ const int rw = rq_data_dir(rq);
++ const int is_sync = rq_is_sync(rq);
++ struct bfq_queue *bfqq;
++ struct bfq_group *bfqg;
++ unsigned long flags;
++
++ might_sleep_if(gfp_mask & __GFP_WAIT);
++
++ /*
++ * Check bic before dereferencing it in bfq_changed_ioprio(),
++ * and take the queue lock before jumping to queue_fail, which
++ * unlocks it.
++ */
++ if (bic == NULL) {
++ spin_lock_irqsave(q->queue_lock, flags);
++ goto queue_fail;
++ }
++
++ bfq_changed_ioprio(bic);
++
++ spin_lock_irqsave(q->queue_lock, flags);
++
++ bfqg = bfq_bic_update_cgroup(bic);
++
++new_queue:
++ bfqq = bic_to_bfqq(bic, is_sync);
++ if (bfqq == NULL || bfqq == &bfqd->oom_bfqq) {
++ bfqq = bfq_get_queue(bfqd, bfqg, is_sync, bic, gfp_mask);
++ bic_set_bfqq(bic, bfqq, is_sync);
++ } else {
++ /*
++ * If the queue was seeky for too long, break it apart.
++ */
++ if (bfq_bfqq_coop(bfqq) && bfq_bfqq_split_coop(bfqq)) {
++ bfq_log_bfqq(bfqd, bfqq, "breaking apart bfqq");
++ bfqq = bfq_split_bfqq(bic, bfqq);
++ if (!bfqq)
++ goto new_queue;
++ }
++
++ /*
++ * Check to see if this queue is scheduled to merge with
++ * another closely cooperating queue. The merging of queues
++ * happens here as it must be done in process context.
++ * The reference on new_bfqq was taken in merge_bfqqs.
++ */
++ if (bfqq->new_bfqq != NULL)
++ bfqq = bfq_merge_bfqqs(bfqd, bic, bfqq);
++ }
++
++ bfqq->allocated[rw]++;
++ atomic_inc(&bfqq->ref);
++ bfq_log_bfqq(bfqd, bfqq, "set_request: bfqq %p, %d", bfqq,
++ atomic_read(&bfqq->ref));
++
++ rq->elv.priv[0] = bic;
++ rq->elv.priv[1] = bfqq;
++
++ spin_unlock_irqrestore(q->queue_lock, flags);
++
++ return 0;
++
++queue_fail:
++ bfq_schedule_dispatch(bfqd);
++ spin_unlock_irqrestore(q->queue_lock, flags);
++
++ return 1;
++}
++
++static void bfq_kick_queue(struct work_struct *work)
++{
++ struct bfq_data *bfqd =
++ container_of(work, struct bfq_data, unplug_work);
++ struct request_queue *q = bfqd->queue;
++
++ spin_lock_irq(q->queue_lock);
++ __blk_run_queue(q);
++ spin_unlock_irq(q->queue_lock);
++}
++
++/*
++ * Handler of the expiration of the timer running if the in-service queue
++ * is idling inside its time slice.
++ */
++static void bfq_idle_slice_timer(unsigned long data)
++{
++ struct bfq_data *bfqd = (struct bfq_data *)data;
++ struct bfq_queue *bfqq;
++ unsigned long flags;
++ enum bfqq_expiration reason;
++
++ spin_lock_irqsave(bfqd->queue->queue_lock, flags);
++
++ bfqq = bfqd->in_service_queue;
++ /*
++ * Theoretical race here: the in-service queue can be NULL or
++ * different from the queue that was idling if the timer handler
++ * spins on the queue_lock and a new request arrives for the
++ * current queue and there is a full dispatch cycle that changes
++ * the in-service queue. This can hardly happen, but in the worst
++ * case we just expire a queue too early.
++ */
++ if (bfqq != NULL) {
++ bfq_log_bfqq(bfqd, bfqq, "slice_timer expired");
++ if (bfq_bfqq_budget_timeout(bfqq))
++ /*
++ * Also here the queue can be safely expired
++ * for budget timeout without wasting
++ * guarantees
++ */
++ reason = BFQ_BFQQ_BUDGET_TIMEOUT;
++ else if (bfqq->queued[0] == 0 && bfqq->queued[1] == 0)
++ /*
++ * The queue may not be empty upon timer expiration,
++ * because we may not disable the timer when the
++ * first request of the in-service queue arrives
++ * during disk idling.
++ */
++ reason = BFQ_BFQQ_TOO_IDLE;
++ else
++ goto schedule_dispatch;
++
++ bfq_bfqq_expire(bfqd, bfqq, 1, reason);
++ }
++
++schedule_dispatch:
++ bfq_schedule_dispatch(bfqd);
++
++ spin_unlock_irqrestore(bfqd->queue->queue_lock, flags);
++}
++
++static void bfq_shutdown_timer_wq(struct bfq_data *bfqd)
++{
++ del_timer_sync(&bfqd->idle_slice_timer);
++ cancel_work_sync(&bfqd->unplug_work);
++}
++
++static inline void __bfq_put_async_bfqq(struct bfq_data *bfqd,
++ struct bfq_queue **bfqq_ptr)
++{
++ struct bfq_group *root_group = bfqd->root_group;
++ struct bfq_queue *bfqq = *bfqq_ptr;
++
++ bfq_log(bfqd, "put_async_bfqq: %p", bfqq);
++ if (bfqq != NULL) {
++ bfq_bfqq_move(bfqd, bfqq, &bfqq->entity, root_group);
++ bfq_log_bfqq(bfqd, bfqq, "put_async_bfqq: putting %p, %d",
++ bfqq, atomic_read(&bfqq->ref));
++ bfq_put_queue(bfqq);
++ *bfqq_ptr = NULL;
++ }
++}
++
++/*
++ * Release all the bfqg references to its async queues. If we are
++ * deallocating the group these queues may still contain requests, so
++ * we reparent them to the root cgroup (i.e., the only one that will
++ * exist for sure until all the requests on a device are gone).
++ */
++static void bfq_put_async_queues(struct bfq_data *bfqd, struct bfq_group *bfqg)
++{
++ int i, j;
++
++ for (i = 0; i < 2; i++)
++ for (j = 0; j < IOPRIO_BE_NR; j++)
++ __bfq_put_async_bfqq(bfqd, &bfqg->async_bfqq[i][j]);
++
++ __bfq_put_async_bfqq(bfqd, &bfqg->async_idle_bfqq);
++}
++
++static void bfq_exit_queue(struct elevator_queue *e)
++{
++ struct bfq_data *bfqd = e->elevator_data;
++ struct request_queue *q = bfqd->queue;
++ struct bfq_queue *bfqq, *n;
++
++ bfq_shutdown_timer_wq(bfqd);
++
++ spin_lock_irq(q->queue_lock);
++
++ BUG_ON(bfqd->in_service_queue != NULL);
++ list_for_each_entry_safe(bfqq, n, &bfqd->idle_list, bfqq_list)
++ bfq_deactivate_bfqq(bfqd, bfqq, 0);
++
++ bfq_disconnect_groups(bfqd);
++ spin_unlock_irq(q->queue_lock);
++
++ bfq_shutdown_timer_wq(bfqd);
++
++ synchronize_rcu();
++
++ BUG_ON(timer_pending(&bfqd->idle_slice_timer));
++
++ bfq_free_root_group(bfqd);
++ kfree(bfqd);
++}
++
++static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
++{
++ struct bfq_group *bfqg;
++ struct bfq_data *bfqd;
++ struct elevator_queue *eq;
++
++ eq = elevator_alloc(q, e);
++ if (eq == NULL)
++ return -ENOMEM;
++
++ bfqd = kzalloc_node(sizeof(*bfqd), GFP_KERNEL, q->node);
++ if (bfqd == NULL) {
++ kobject_put(&eq->kobj);
++ return -ENOMEM;
++ }
++ eq->elevator_data = bfqd;
++
++ /*
++ * Our fallback bfqq if bfq_find_alloc_queue() runs into OOM issues.
++ * Grab a permanent reference to it, so that the normal code flow
++ * will not attempt to free it.
++ */
++ bfq_init_bfqq(bfqd, &bfqd->oom_bfqq, 1, 0);
++ atomic_inc(&bfqd->oom_bfqq.ref);
++
++ bfqd->queue = q;
++
++ spin_lock_irq(q->queue_lock);
++ q->elevator = eq;
++ spin_unlock_irq(q->queue_lock);
++
++ bfqg = bfq_alloc_root_group(bfqd, q->node);
++ if (bfqg == NULL) {
++ kfree(bfqd);
++ kobject_put(&eq->kobj);
++ return -ENOMEM;
++ }
++
++ bfqd->root_group = bfqg;
++#ifdef CONFIG_CGROUP_BFQIO
++ bfqd->active_numerous_groups = 0;
++#endif
++
++ init_timer(&bfqd->idle_slice_timer);
++ bfqd->idle_slice_timer.function = bfq_idle_slice_timer;
++ bfqd->idle_slice_timer.data = (unsigned long)bfqd;
++
++ bfqd->rq_pos_tree = RB_ROOT;
++ bfqd->queue_weights_tree = RB_ROOT;
++ bfqd->group_weights_tree = RB_ROOT;
++
++ INIT_WORK(&bfqd->unplug_work, bfq_kick_queue);
++
++ INIT_LIST_HEAD(&bfqd->active_list);
++ INIT_LIST_HEAD(&bfqd->idle_list);
++
++ bfqd->hw_tag = -1;
++
++ bfqd->bfq_max_budget = bfq_default_max_budget;
++
++ bfqd->bfq_quantum = bfq_quantum;
++ bfqd->bfq_fifo_expire[0] = bfq_fifo_expire[0];
++ bfqd->bfq_fifo_expire[1] = bfq_fifo_expire[1];
++ bfqd->bfq_back_max = bfq_back_max;
++ bfqd->bfq_back_penalty = bfq_back_penalty;
++ bfqd->bfq_slice_idle = bfq_slice_idle;
++ bfqd->bfq_class_idle_last_service = 0;
++ bfqd->bfq_max_budget_async_rq = bfq_max_budget_async_rq;
++ bfqd->bfq_timeout[BLK_RW_ASYNC] = bfq_timeout_async;
++ bfqd->bfq_timeout[BLK_RW_SYNC] = bfq_timeout_sync;
++
++ bfqd->bfq_coop_thresh = 2;
++ bfqd->bfq_failed_cooperations = 7000;
++ bfqd->bfq_requests_within_timer = 120;
++
++ bfqd->low_latency = true;
++
++ bfqd->bfq_wr_coeff = 20;
++ bfqd->bfq_wr_rt_max_time = msecs_to_jiffies(300);
++ bfqd->bfq_wr_max_time = 0;
++ bfqd->bfq_wr_min_idle_time = msecs_to_jiffies(2000);
++ bfqd->bfq_wr_min_inter_arr_async = msecs_to_jiffies(500);
++ bfqd->bfq_wr_max_softrt_rate = 7000; /*
++ * Approximate rate required
++ * to playback or record a
++ * high-definition compressed
++ * video.
++ */
++ bfqd->wr_busy_queues = 0;
++ bfqd->busy_in_flight_queues = 0;
++ bfqd->const_seeky_busy_in_flight_queues = 0;
++
++ /*
++ * Begin by assuming, optimistically, that the device peak rate is
++ * equal to the highest reference rate.
++ */
++ bfqd->RT_prod = R_fast[blk_queue_nonrot(bfqd->queue)] *
++ T_fast[blk_queue_nonrot(bfqd->queue)];
++ bfqd->peak_rate = R_fast[blk_queue_nonrot(bfqd->queue)];
++ bfqd->device_speed = BFQ_BFQD_FAST;
++
++ return 0;
++}
++
++static void bfq_slab_kill(void)
++{
++ if (bfq_pool != NULL)
++ kmem_cache_destroy(bfq_pool);
++}
++
++static int __init bfq_slab_setup(void)
++{
++ bfq_pool = KMEM_CACHE(bfq_queue, 0);
++ if (bfq_pool == NULL)
++ return -ENOMEM;
++ return 0;
++}
++
++static ssize_t bfq_var_show(unsigned int var, char *page)
++{
++ return sprintf(page, "%d\n", var);
++}
++
++static ssize_t bfq_var_store(unsigned long *var, const char *page,
++ size_t count)
++{
++ unsigned long new_val;
++ int ret = kstrtoul(page, 10, &new_val);
++
++ if (ret == 0)
++ *var = new_val;
++
++ return count;
++}
++
++static ssize_t bfq_wr_max_time_show(struct elevator_queue *e, char *page)
++{
++ struct bfq_data *bfqd = e->elevator_data;
++ return sprintf(page, "%d\n", bfqd->bfq_wr_max_time > 0 ?
++ jiffies_to_msecs(bfqd->bfq_wr_max_time) :
++ jiffies_to_msecs(bfq_wr_duration(bfqd)));
++}
++
++static ssize_t bfq_weights_show(struct elevator_queue *e, char *page)
++{
++ struct bfq_queue *bfqq;
++ struct bfq_data *bfqd = e->elevator_data;
++ ssize_t num_char = 0;
++
++ num_char += sprintf(page + num_char, "Tot reqs queued %d\n\n",
++ bfqd->queued);
++
++ spin_lock_irq(bfqd->queue->queue_lock);
++
++ num_char += sprintf(page + num_char, "Active:\n");
++ list_for_each_entry(bfqq, &bfqd->active_list, bfqq_list) {
++ num_char += sprintf(page + num_char,
++ "pid%d: weight %hu, nr_queued %d %d, dur %d/%u\n",
++ bfqq->pid,
++ bfqq->entity.weight,
++ bfqq->queued[0],
++ bfqq->queued[1],
++ jiffies_to_msecs(jiffies - bfqq->last_wr_start_finish),
++ jiffies_to_msecs(bfqq->wr_cur_max_time));
++ }
++
++ num_char += sprintf(page + num_char, "Idle:\n");
++ list_for_each_entry(bfqq, &bfqd->idle_list, bfqq_list) {
++ num_char += sprintf(page + num_char,
++ "pid%d: weight %hu, dur %d/%u\n",
++ bfqq->pid,
++ bfqq->entity.weight,
++ jiffies_to_msecs(jiffies -
++ bfqq->last_wr_start_finish),
++ jiffies_to_msecs(bfqq->wr_cur_max_time));
++ }
++
++ spin_unlock_irq(bfqd->queue->queue_lock);
++
++ return num_char;
++}
++
++#define SHOW_FUNCTION(__FUNC, __VAR, __CONV) \
++static ssize_t __FUNC(struct elevator_queue *e, char *page) \
++{ \
++ struct bfq_data *bfqd = e->elevator_data; \
++ unsigned int __data = __VAR; \
++ if (__CONV) \
++ __data = jiffies_to_msecs(__data); \
++ return bfq_var_show(__data, (page)); \
++}
++SHOW_FUNCTION(bfq_quantum_show, bfqd->bfq_quantum, 0);
++SHOW_FUNCTION(bfq_fifo_expire_sync_show, bfqd->bfq_fifo_expire[1], 1);
++SHOW_FUNCTION(bfq_fifo_expire_async_show, bfqd->bfq_fifo_expire[0], 1);
++SHOW_FUNCTION(bfq_back_seek_max_show, bfqd->bfq_back_max, 0);
++SHOW_FUNCTION(bfq_back_seek_penalty_show, bfqd->bfq_back_penalty, 0);
++SHOW_FUNCTION(bfq_slice_idle_show, bfqd->bfq_slice_idle, 1);
++SHOW_FUNCTION(bfq_max_budget_show, bfqd->bfq_user_max_budget, 0);
++SHOW_FUNCTION(bfq_max_budget_async_rq_show,
++ bfqd->bfq_max_budget_async_rq, 0);
++SHOW_FUNCTION(bfq_timeout_sync_show, bfqd->bfq_timeout[BLK_RW_SYNC], 1);
++SHOW_FUNCTION(bfq_timeout_async_show, bfqd->bfq_timeout[BLK_RW_ASYNC], 1);
++SHOW_FUNCTION(bfq_low_latency_show, bfqd->low_latency, 0);
++SHOW_FUNCTION(bfq_wr_coeff_show, bfqd->bfq_wr_coeff, 0);
++SHOW_FUNCTION(bfq_wr_rt_max_time_show, bfqd->bfq_wr_rt_max_time, 1);
++SHOW_FUNCTION(bfq_wr_min_idle_time_show, bfqd->bfq_wr_min_idle_time, 1);
++SHOW_FUNCTION(bfq_wr_min_inter_arr_async_show, bfqd->bfq_wr_min_inter_arr_async,
++ 1);
++SHOW_FUNCTION(bfq_wr_max_softrt_rate_show, bfqd->bfq_wr_max_softrt_rate, 0);
++#undef SHOW_FUNCTION
++
++#define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV) \
++static ssize_t \
++__FUNC(struct elevator_queue *e, const char *page, size_t count) \
++{ \
++ struct bfq_data *bfqd = e->elevator_data; \
++ unsigned long uninitialized_var(__data); \
++ int ret = bfq_var_store(&__data, (page), count); \
++ if (__data < (MIN)) \
++ __data = (MIN); \
++ else if (__data > (MAX)) \
++ __data = (MAX); \
++ if (__CONV) \
++ *(__PTR) = msecs_to_jiffies(__data); \
++ else \
++ *(__PTR) = __data; \
++ return ret; \
++}
++STORE_FUNCTION(bfq_quantum_store, &bfqd->bfq_quantum, 1, INT_MAX, 0);
++STORE_FUNCTION(bfq_fifo_expire_sync_store, &bfqd->bfq_fifo_expire[1], 1,
++ INT_MAX, 1);
++STORE_FUNCTION(bfq_fifo_expire_async_store, &bfqd->bfq_fifo_expire[0], 1,
++ INT_MAX, 1);
++STORE_FUNCTION(bfq_back_seek_max_store, &bfqd->bfq_back_max, 0, INT_MAX, 0);
++STORE_FUNCTION(bfq_back_seek_penalty_store, &bfqd->bfq_back_penalty, 1,
++ INT_MAX, 0);
++STORE_FUNCTION(bfq_slice_idle_store, &bfqd->bfq_slice_idle, 0, INT_MAX, 1);
++STORE_FUNCTION(bfq_max_budget_async_rq_store, &bfqd->bfq_max_budget_async_rq,
++ 1, INT_MAX, 0);
++STORE_FUNCTION(bfq_timeout_async_store, &bfqd->bfq_timeout[BLK_RW_ASYNC], 0,
++ INT_MAX, 1);
++STORE_FUNCTION(bfq_wr_coeff_store, &bfqd->bfq_wr_coeff, 1, INT_MAX, 0);
++STORE_FUNCTION(bfq_wr_max_time_store, &bfqd->bfq_wr_max_time, 0, INT_MAX, 1);
++STORE_FUNCTION(bfq_wr_rt_max_time_store, &bfqd->bfq_wr_rt_max_time, 0, INT_MAX,
++ 1);
++STORE_FUNCTION(bfq_wr_min_idle_time_store, &bfqd->bfq_wr_min_idle_time, 0,
++ INT_MAX, 1);
++STORE_FUNCTION(bfq_wr_min_inter_arr_async_store,
++ &bfqd->bfq_wr_min_inter_arr_async, 0, INT_MAX, 1);
++STORE_FUNCTION(bfq_wr_max_softrt_rate_store, &bfqd->bfq_wr_max_softrt_rate, 0,
++ INT_MAX, 0);
++#undef STORE_FUNCTION
++
++/* do nothing for the moment */
++static ssize_t bfq_weights_store(struct elevator_queue *e,
++ const char *page, size_t count)
++{
++ return count;
++}
++
++static inline unsigned long bfq_estimated_max_budget(struct bfq_data *bfqd)
++{
++ u64 timeout = jiffies_to_msecs(bfqd->bfq_timeout[BLK_RW_SYNC]);
++
++ if (bfqd->peak_rate_samples >= BFQ_PEAK_RATE_SAMPLES)
++ return bfq_calc_max_budget(bfqd->peak_rate, timeout);
++ else
++ return bfq_default_max_budget;
++}
++
++static ssize_t bfq_max_budget_store(struct elevator_queue *e,
++ const char *page, size_t count)
++{
++ struct bfq_data *bfqd = e->elevator_data;
++ unsigned long uninitialized_var(__data);
++ int ret = bfq_var_store(&__data, (page), count);
++
++ if (__data == 0)
++ bfqd->bfq_max_budget = bfq_estimated_max_budget(bfqd);
++ else {
++ if (__data > INT_MAX)
++ __data = INT_MAX;
++ bfqd->bfq_max_budget = __data;
++ }
++
++ bfqd->bfq_user_max_budget = __data;
++
++ return ret;
++}
++
++static ssize_t bfq_timeout_sync_store(struct elevator_queue *e,
++ const char *page, size_t count)
++{
++ struct bfq_data *bfqd = e->elevator_data;
++ unsigned long uninitialized_var(__data);
++ int ret = bfq_var_store(&__data, (page), count);
++
++ if (__data < 1)
++ __data = 1;
++ else if (__data > INT_MAX)
++ __data = INT_MAX;
++
++ bfqd->bfq_timeout[BLK_RW_SYNC] = msecs_to_jiffies(__data);
++ if (bfqd->bfq_user_max_budget == 0)
++ bfqd->bfq_max_budget = bfq_estimated_max_budget(bfqd);
++
++ return ret;
++}
++
++static ssize_t bfq_low_latency_store(struct elevator_queue *e,
++ const char *page, size_t count)
++{
++ struct bfq_data *bfqd = e->elevator_data;
++ unsigned long uninitialized_var(__data);
++ int ret = bfq_var_store(&__data, (page), count);
++
++ if (__data > 1)
++ __data = 1;
++ if (__data == 0 && bfqd->low_latency != 0)
++ bfq_end_wr(bfqd);
++ bfqd->low_latency = __data;
++
++ return ret;
++}
++
++#define BFQ_ATTR(name) \
++ __ATTR(name, S_IRUGO|S_IWUSR, bfq_##name##_show, bfq_##name##_store)
++
++static struct elv_fs_entry bfq_attrs[] = {
++ BFQ_ATTR(quantum),
++ BFQ_ATTR(fifo_expire_sync),
++ BFQ_ATTR(fifo_expire_async),
++ BFQ_ATTR(back_seek_max),
++ BFQ_ATTR(back_seek_penalty),
++ BFQ_ATTR(slice_idle),
++ BFQ_ATTR(max_budget),
++ BFQ_ATTR(max_budget_async_rq),
++ BFQ_ATTR(timeout_sync),
++ BFQ_ATTR(timeout_async),
++ BFQ_ATTR(low_latency),
++ BFQ_ATTR(wr_coeff),
++ BFQ_ATTR(wr_max_time),
++ BFQ_ATTR(wr_rt_max_time),
++ BFQ_ATTR(wr_min_idle_time),
++ BFQ_ATTR(wr_min_inter_arr_async),
++ BFQ_ATTR(wr_max_softrt_rate),
++ BFQ_ATTR(weights),
++ __ATTR_NULL
++};
++
++static struct elevator_type iosched_bfq = {
++ .ops = {
++ .elevator_merge_fn = bfq_merge,
++ .elevator_merged_fn = bfq_merged_request,
++ .elevator_merge_req_fn = bfq_merged_requests,
++ .elevator_allow_merge_fn = bfq_allow_merge,
++ .elevator_dispatch_fn = bfq_dispatch_requests,
++ .elevator_add_req_fn = bfq_insert_request,
++ .elevator_activate_req_fn = bfq_activate_request,
++ .elevator_deactivate_req_fn = bfq_deactivate_request,
++ .elevator_completed_req_fn = bfq_completed_request,
++ .elevator_former_req_fn = elv_rb_former_request,
++ .elevator_latter_req_fn = elv_rb_latter_request,
++ .elevator_init_icq_fn = bfq_init_icq,
++ .elevator_exit_icq_fn = bfq_exit_icq,
++ .elevator_set_req_fn = bfq_set_request,
++ .elevator_put_req_fn = bfq_put_request,
++ .elevator_may_queue_fn = bfq_may_queue,
++ .elevator_init_fn = bfq_init_queue,
++ .elevator_exit_fn = bfq_exit_queue,
++ },
++ .icq_size = sizeof(struct bfq_io_cq),
++ .icq_align = __alignof__(struct bfq_io_cq),
++ .elevator_attrs = bfq_attrs,
++ .elevator_name = "bfq",
++ .elevator_owner = THIS_MODULE,
++};
++
++static int __init bfq_init(void)
++{
++ /*
++ * Can be 0 on HZ < 1000 setups.
++ */
++ if (bfq_slice_idle == 0)
++ bfq_slice_idle = 1;
++
++ if (bfq_timeout_async == 0)
++ bfq_timeout_async = 1;
++
++ if (bfq_slab_setup())
++ return -ENOMEM;
++
++ /*
++ * Times to load large popular applications for the typical systems
++ * installed on the reference devices (see the comments before the
++ * definitions of the two arrays).
++ */
++ T_slow[0] = msecs_to_jiffies(2600);
++ T_slow[1] = msecs_to_jiffies(1000);
++ T_fast[0] = msecs_to_jiffies(5500);
++ T_fast[1] = msecs_to_jiffies(2000);
++
++ /*
++ * Thresholds that determine the switch between speed classes (see
++ * the comments before the definition of the array).
++ */
++ device_speed_thresh[0] = (R_fast[0] + R_slow[0]) / 2;
++ device_speed_thresh[1] = (R_fast[1] + R_slow[1]) / 2;
++
++ elv_register(&iosched_bfq);
++ pr_info("BFQ I/O-scheduler version: v7r5");
++
++ return 0;
++}
++
++static void __exit bfq_exit(void)
++{
++ elv_unregister(&iosched_bfq);
++ bfq_slab_kill();
++}
++
++module_init(bfq_init);
++module_exit(bfq_exit);
++
++MODULE_AUTHOR("Fabio Checconi, Paolo Valente");
++MODULE_LICENSE("GPL");
+diff --git a/block/bfq-sched.c b/block/bfq-sched.c
+new file mode 100644
+index 0000000..c4831b7
+--- /dev/null
++++ b/block/bfq-sched.c
+@@ -0,0 +1,1207 @@
++/*
++ * BFQ: Hierarchical B-WF2Q+ scheduler.
++ *
++ * Based on ideas and code from CFQ:
++ * Copyright (C) 2003 Jens Axboe <axboe@kernel.dk>
++ *
++ * Copyright (C) 2008 Fabio Checconi <fabio@gandalf.sssup.it>
++ * Paolo Valente <paolo.valente@unimore.it>
++ *
++ * Copyright (C) 2010 Paolo Valente <paolo.valente@unimore.it>
++ */
++
++#ifdef CONFIG_CGROUP_BFQIO
++#define for_each_entity(entity) \
++ for (; entity != NULL; entity = entity->parent)
++
++#define for_each_entity_safe(entity, parent) \
++ for (; entity && ({ parent = entity->parent; 1; }); entity = parent)
++
++static struct bfq_entity *bfq_lookup_next_entity(struct bfq_sched_data *sd,
++ int extract,
++ struct bfq_data *bfqd);
++
++static inline void bfq_update_budget(struct bfq_entity *next_in_service)
++{
++ struct bfq_entity *bfqg_entity;
++ struct bfq_group *bfqg;
++ struct bfq_sched_data *group_sd;
++
++ BUG_ON(next_in_service == NULL);
++
++ group_sd = next_in_service->sched_data;
++
++ bfqg = container_of(group_sd, struct bfq_group, sched_data);
++ /*
++ * bfq_group's my_entity field is not NULL only if the group
++ * is not the root group. We must not touch the root entity
++ * as it must never become an in-service entity.
++ */
++ bfqg_entity = bfqg->my_entity;
++ if (bfqg_entity != NULL)
++ bfqg_entity->budget = next_in_service->budget;
++}
++
++static int bfq_update_next_in_service(struct bfq_sched_data *sd)
++{
++ struct bfq_entity *next_in_service;
++
++ if (sd->in_service_entity != NULL)
++ /* will update/requeue at the end of service */
++ return 0;
++
++ /*
++ * NOTE: this can be improved in many ways, such as returning
++ * 1 (and thus propagating upwards the update) only when the
++ * budget changes, or caching the bfqq that will be scheduled
++ * next from this subtree. For now we worry more about
++ * correctness than about performance...
++ */
++ next_in_service = bfq_lookup_next_entity(sd, 0, NULL);
++ sd->next_in_service = next_in_service;
++
++ if (next_in_service != NULL)
++ bfq_update_budget(next_in_service);
++
++ return 1;
++}
++
++static inline void bfq_check_next_in_service(struct bfq_sched_data *sd,
++ struct bfq_entity *entity)
++{
++ BUG_ON(sd->next_in_service != entity);
++}
++#else
++#define for_each_entity(entity) \
++ for (; entity != NULL; entity = NULL)
++
++#define for_each_entity_safe(entity, parent) \
++ for (parent = NULL; entity != NULL; entity = parent)
++
++static inline int bfq_update_next_in_service(struct bfq_sched_data *sd)
++{
++ return 0;
++}
++
++static inline void bfq_check_next_in_service(struct bfq_sched_data *sd,
++ struct bfq_entity *entity)
++{
++}
++
++static inline void bfq_update_budget(struct bfq_entity *next_in_service)
++{
++}
++#endif
++
++/*
++ * Shift for timestamp calculations. This actually limits the maximum
++ * service allowed in one timestamp delta (small shift values increase it),
++ * the maximum total weight that can be used for the queues in the system
++ * (big shift values increase it), and the period of virtual time
++ * wraparounds.
++ */
++#define WFQ_SERVICE_SHIFT 22
++
++/**
++ * bfq_gt - compare two timestamps.
++ * @a: first ts.
++ * @b: second ts.
++ *
++ * Return @a > @b, dealing with wrapping correctly.
++ */
++static inline int bfq_gt(u64 a, u64 b)
++{
++ return (s64)(a - b) > 0;
++}
++
++static inline struct bfq_queue *bfq_entity_to_bfqq(struct bfq_entity *entity)
++{
++ struct bfq_queue *bfqq = NULL;
++
++ BUG_ON(entity == NULL);
++
++ if (entity->my_sched_data == NULL)
++ bfqq = container_of(entity, struct bfq_queue, entity);
++
++ return bfqq;
++}
++
++
++/**
++ * bfq_delta - map service into the virtual time domain.
++ * @service: amount of service.
++ * @weight: scale factor (weight of an entity or weight sum).
++ */
++static inline u64 bfq_delta(unsigned long service,
++ unsigned long weight)
++{
++ u64 d = (u64)service << WFQ_SERVICE_SHIFT;
++
++ do_div(d, weight);
++ return d;
++}
++
++/**
++ * bfq_calc_finish - assign the finish time to an entity.
++ * @entity: the entity to act upon.
++ * @service: the service to be charged to the entity.
++ */
++static inline void bfq_calc_finish(struct bfq_entity *entity,
++ unsigned long service)
++{
++ struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
++
++ BUG_ON(entity->weight == 0);
++
++ entity->finish = entity->start +
++ bfq_delta(service, entity->weight);
++
++ if (bfqq != NULL) {
++ bfq_log_bfqq(bfqq->bfqd, bfqq,
++ "calc_finish: serv %lu, w %d",
++ service, entity->weight);
++ bfq_log_bfqq(bfqq->bfqd, bfqq,
++ "calc_finish: start %llu, finish %llu, delta %llu",
++ entity->start, entity->finish,
++ bfq_delta(service, entity->weight));
++ }
++}
++
++/**
++ * bfq_entity_of - get an entity from a node.
++ * @node: the node field of the entity.
++ *
++ * Convert a node pointer to the relative entity. This is used only
++ * to simplify the logic of some functions and not as the generic
++ * conversion mechanism because, e.g., in the tree walking functions,
++ * the check for a %NULL value would be redundant.
++ */
++static inline struct bfq_entity *bfq_entity_of(struct rb_node *node)
++{
++ struct bfq_entity *entity = NULL;
++
++ if (node != NULL)
++ entity = rb_entry(node, struct bfq_entity, rb_node);
++
++ return entity;
++}
++
++/**
++ * bfq_extract - remove an entity from a tree.
++ * @root: the tree root.
++ * @entity: the entity to remove.
++ */
++static inline void bfq_extract(struct rb_root *root,
++ struct bfq_entity *entity)
++{
++ BUG_ON(entity->tree != root);
++
++ entity->tree = NULL;
++ rb_erase(&entity->rb_node, root);
++}
++
++/**
++ * bfq_idle_extract - extract an entity from the idle tree.
++ * @st: the service tree of the owning @entity.
++ * @entity: the entity being removed.
++ */
++static void bfq_idle_extract(struct bfq_service_tree *st,
++ struct bfq_entity *entity)
++{
++ struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
++ struct rb_node *next;
++
++ BUG_ON(entity->tree != &st->idle);
++
++ if (entity == st->first_idle) {
++ next = rb_next(&entity->rb_node);
++ st->first_idle = bfq_entity_of(next);
++ }
++
++ if (entity == st->last_idle) {
++ next = rb_prev(&entity->rb_node);
++ st->last_idle = bfq_entity_of(next);
++ }
++
++ bfq_extract(&st->idle, entity);
++
++ if (bfqq != NULL)
++ list_del(&bfqq->bfqq_list);
++}
++
++/**
++ * bfq_insert - generic tree insertion.
++ * @root: tree root.
++ * @entity: entity to insert.
++ *
++ * This is used for the idle and the active tree, since they are both
++ * ordered by finish time.
++ */
++static void bfq_insert(struct rb_root *root, struct bfq_entity *entity)
++{
++ struct bfq_entity *entry;
++ struct rb_node **node = &root->rb_node;
++ struct rb_node *parent = NULL;
++
++ BUG_ON(entity->tree != NULL);
++
++ while (*node != NULL) {
++ parent = *node;
++ entry = rb_entry(parent, struct bfq_entity, rb_node);
++
++ if (bfq_gt(entry->finish, entity->finish))
++ node = &parent->rb_left;
++ else
++ node = &parent->rb_right;
++ }
++
++ rb_link_node(&entity->rb_node, parent, node);
++ rb_insert_color(&entity->rb_node, root);
++
++ entity->tree = root;
++}
++
++/**
++ * bfq_update_min - update the min_start field of an entity.
++ * @entity: the entity to update.
++ * @node: one of its children.
++ *
++ * This function is called when @entity may store an invalid value for
++ * min_start due to updates to the active tree. The function assumes
++ * that the subtree rooted at @node (which may be its left or its right
++ * child) has a valid min_start value.
++ */
++static inline void bfq_update_min(struct bfq_entity *entity,
++ struct rb_node *node)
++{
++ struct bfq_entity *child;
++
++ if (node != NULL) {
++ child = rb_entry(node, struct bfq_entity, rb_node);
++ if (bfq_gt(entity->min_start, child->min_start))
++ entity->min_start = child->min_start;
++ }
++}
++
++/**
++ * bfq_update_active_node - recalculate min_start.
++ * @node: the node to update.
++ *
++ * @node may have changed position or one of its children may have moved,
++ * this function updates its min_start value. The left and right subtrees
++ * are assumed to hold a correct min_start value.
++ */
++static inline void bfq_update_active_node(struct rb_node *node)
++{
++ struct bfq_entity *entity = rb_entry(node, struct bfq_entity, rb_node);
++
++ entity->min_start = entity->start;
++ bfq_update_min(entity, node->rb_right);
++ bfq_update_min(entity, node->rb_left);
++}
++
++/**
++ * bfq_update_active_tree - update min_start for the whole active tree.
++ * @node: the starting node.
++ *
++ * @node must be the deepest modified node after an update. This function
++ * updates its min_start using the values held by its children, assuming
++ * that they did not change, and then updates all the nodes that may have
++ * changed in the path to the root. The only nodes that may have changed
++ * are the ones in the path or their siblings.
++ */
++static void bfq_update_active_tree(struct rb_node *node)
++{
++ struct rb_node *parent;
++
++up:
++ bfq_update_active_node(node);
++
++ parent = rb_parent(node);
++ if (parent == NULL)
++ return;
++
++ if (node == parent->rb_left && parent->rb_right != NULL)
++ bfq_update_active_node(parent->rb_right);
++ else if (parent->rb_left != NULL)
++ bfq_update_active_node(parent->rb_left);
++
++ node = parent;
++ goto up;
++}
++
++static void bfq_weights_tree_add(struct bfq_data *bfqd,
++ struct bfq_entity *entity,
++ struct rb_root *root);
++
++static void bfq_weights_tree_remove(struct bfq_data *bfqd,
++ struct bfq_entity *entity,
++ struct rb_root *root);
++
++
++/**
++ * bfq_active_insert - insert an entity in the active tree of its
++ * group/device.
++ * @st: the service tree of the entity.
++ * @entity: the entity being inserted.
++ *
++ * The active tree is ordered by finish time, but an extra key is kept
++ * for each node, containing the minimum value for the start times of
++ * its children (and the node itself), so it's possible to search for
++ * the eligible node with the lowest finish time in logarithmic time.
++ */
++static void bfq_active_insert(struct bfq_service_tree *st,
++ struct bfq_entity *entity)
++{
++ struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
++ struct rb_node *node = &entity->rb_node;
++#ifdef CONFIG_CGROUP_BFQIO
++ struct bfq_sched_data *sd = NULL;
++ struct bfq_group *bfqg = NULL;
++ struct bfq_data *bfqd = NULL;
++#endif
++
++ bfq_insert(&st->active, entity);
++
++ if (node->rb_left != NULL)
++ node = node->rb_left;
++ else if (node->rb_right != NULL)
++ node = node->rb_right;
++
++ bfq_update_active_tree(node);
++
++#ifdef CONFIG_CGROUP_BFQIO
++ sd = entity->sched_data;
++ bfqg = container_of(sd, struct bfq_group, sched_data);
++ BUG_ON(!bfqg);
++ bfqd = (struct bfq_data *)bfqg->bfqd;
++#endif
++ if (bfqq != NULL)
++ list_add(&bfqq->bfqq_list, &bfqq->bfqd->active_list);
++#ifdef CONFIG_CGROUP_BFQIO
++ else { /* bfq_group */
++ BUG_ON(!bfqd);
++ bfq_weights_tree_add(bfqd, entity, &bfqd->group_weights_tree);
++ }
++ if (bfqg != bfqd->root_group) {
++ BUG_ON(!bfqg);
++ BUG_ON(!bfqd);
++ bfqg->active_entities++;
++ if (bfqg->active_entities == 2)
++ bfqd->active_numerous_groups++;
++ }
++#endif
++}
++
++/**
++ * bfq_ioprio_to_weight - calc a weight from an ioprio.
++ * @ioprio: the ioprio value to convert.
++ */
++static inline unsigned short bfq_ioprio_to_weight(int ioprio)
++{
++ BUG_ON(ioprio < 0 || ioprio >= IOPRIO_BE_NR);
++ return IOPRIO_BE_NR - ioprio;
++}
++
++/**
++ * bfq_weight_to_ioprio - calc an ioprio from a weight.
++ * @weight: the weight value to convert.
++ *
++ * To preserve as much as possible the old only-ioprio user interface,
++ * 0 is used as an escape ioprio value for weights (numerically) equal to
++ * or larger than IOPRIO_BE_NR.
++ */
++static inline unsigned short bfq_weight_to_ioprio(int weight)
++{
++ BUG_ON(weight < BFQ_MIN_WEIGHT || weight > BFQ_MAX_WEIGHT);
++ return IOPRIO_BE_NR - weight < 0 ? 0 : IOPRIO_BE_NR - weight;
++}
++
++static inline void bfq_get_entity(struct bfq_entity *entity)
++{
++ struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
++
++ if (bfqq != NULL) {
++ atomic_inc(&bfqq->ref);
++ bfq_log_bfqq(bfqq->bfqd, bfqq, "get_entity: %p %d",
++ bfqq, atomic_read(&bfqq->ref));
++ }
++}
++
++/**
++ * bfq_find_deepest - find the deepest node that an extraction can modify.
++ * @node: the node being removed.
++ *
++ * Do the first step of an extraction in an rb tree, looking for the
++ * node that will replace @node, and returning the deepest node that
++ * the following modifications to the tree can touch. If @node is the
++ * last node in the tree return %NULL.
++ */
++static struct rb_node *bfq_find_deepest(struct rb_node *node)
++{
++ struct rb_node *deepest;
++
++ if (node->rb_right == NULL && node->rb_left == NULL)
++ deepest = rb_parent(node);
++ else if (node->rb_right == NULL)
++ deepest = node->rb_left;
++ else if (node->rb_left == NULL)
++ deepest = node->rb_right;
++ else {
++ deepest = rb_next(node);
++ if (deepest->rb_right != NULL)
++ deepest = deepest->rb_right;
++ else if (rb_parent(deepest) != node)
++ deepest = rb_parent(deepest);
++ }
++
++ return deepest;
++}
++
++/**
++ * bfq_active_extract - remove an entity from the active tree.
++ * @st: the service_tree containing the tree.
++ * @entity: the entity being removed.
++ */
++static void bfq_active_extract(struct bfq_service_tree *st,
++ struct bfq_entity *entity)
++{
++ struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
++ struct rb_node *node;
++#ifdef CONFIG_CGROUP_BFQIO
++ struct bfq_sched_data *sd = NULL;
++ struct bfq_group *bfqg = NULL;
++ struct bfq_data *bfqd = NULL;
++#endif
++
++ node = bfq_find_deepest(&entity->rb_node);
++ bfq_extract(&st->active, entity);
++
++ if (node != NULL)
++ bfq_update_active_tree(node);
++
++#ifdef CONFIG_CGROUP_BFQIO
++ sd = entity->sched_data;
++ bfqg = container_of(sd, struct bfq_group, sched_data);
++ BUG_ON(!bfqg);
++ bfqd = (struct bfq_data *)bfqg->bfqd;
++#endif
++ if (bfqq != NULL)
++ list_del(&bfqq->bfqq_list);
++#ifdef CONFIG_CGROUP_BFQIO
++ else { /* bfq_group */
++ BUG_ON(!bfqd);
++ bfq_weights_tree_remove(bfqd, entity,
++ &bfqd->group_weights_tree);
++ }
++ if (bfqg != bfqd->root_group) {
++ BUG_ON(!bfqg);
++ BUG_ON(!bfqd);
++ BUG_ON(!bfqg->active_entities);
++ bfqg->active_entities--;
++ if (bfqg->active_entities == 1) {
++ BUG_ON(!bfqd->active_numerous_groups);
++ bfqd->active_numerous_groups--;
++ }
++ }
++#endif
++}
++
++/**
++ * bfq_idle_insert - insert an entity into the idle tree.
++ * @st: the service tree containing the tree.
++ * @entity: the entity to insert.
++ */
++static void bfq_idle_insert(struct bfq_service_tree *st,
++ struct bfq_entity *entity)
++{
++ struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
++ struct bfq_entity *first_idle = st->first_idle;
++ struct bfq_entity *last_idle = st->last_idle;
++
++ if (first_idle == NULL || bfq_gt(first_idle->finish, entity->finish))
++ st->first_idle = entity;
++ if (last_idle == NULL || bfq_gt(entity->finish, last_idle->finish))
++ st->last_idle = entity;
++
++ bfq_insert(&st->idle, entity);
++
++ if (bfqq != NULL)
++ list_add(&bfqq->bfqq_list, &bfqq->bfqd->idle_list);
++}
++
++/**
++ * bfq_forget_entity - remove an entity from the wfq trees.
++ * @st: the service tree.
++ * @entity: the entity being removed.
++ *
++ * Update the device status and forget everything about @entity, putting
++ * the device reference to it, if it is a queue. Entities belonging to
++ * groups are not refcounted.
++ */
++static void bfq_forget_entity(struct bfq_service_tree *st,
++ struct bfq_entity *entity)
++{
++ struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
++ struct bfq_sched_data *sd;
++
++ BUG_ON(!entity->on_st);
++
++ entity->on_st = 0;
++ st->wsum -= entity->weight;
++ if (bfqq != NULL) {
++ sd = entity->sched_data;
++ bfq_log_bfqq(bfqq->bfqd, bfqq, "forget_entity: %p %d",
++ bfqq, atomic_read(&bfqq->ref));
++ bfq_put_queue(bfqq);
++ }
++}
++
++/**
++ * bfq_put_idle_entity - release the idle tree ref of an entity.
++ * @st: service tree for the entity.
++ * @entity: the entity being released.
++ */
++static void bfq_put_idle_entity(struct bfq_service_tree *st,
++ struct bfq_entity *entity)
++{
++ bfq_idle_extract(st, entity);
++ bfq_forget_entity(st, entity);
++}
++
++/**
++ * bfq_forget_idle - update the idle tree if necessary.
++ * @st: the service tree to act upon.
++ *
++ * To preserve the global O(log N) complexity we only remove one entry here;
++ * as the idle tree will not grow indefinitely this can be done safely.
++ */
++static void bfq_forget_idle(struct bfq_service_tree *st)
++{
++ struct bfq_entity *first_idle = st->first_idle;
++ struct bfq_entity *last_idle = st->last_idle;
++
++ if (RB_EMPTY_ROOT(&st->active) && last_idle != NULL &&
++ !bfq_gt(last_idle->finish, st->vtime)) {
++ /*
++ * Forget the whole idle tree, increasing the vtime past
++ * the last finish time of idle entities.
++ */
++ st->vtime = last_idle->finish;
++ }
++
++ if (first_idle != NULL && !bfq_gt(first_idle->finish, st->vtime))
++ bfq_put_idle_entity(st, first_idle);
++}
++
++static struct bfq_service_tree *
++__bfq_entity_update_weight_prio(struct bfq_service_tree *old_st,
++ struct bfq_entity *entity)
++{
++ struct bfq_service_tree *new_st = old_st;
++
++ if (entity->ioprio_changed) {
++ struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
++ unsigned short prev_weight, new_weight;
++ struct bfq_data *bfqd = NULL;
++ struct rb_root *root;
++#ifdef CONFIG_CGROUP_BFQIO
++ struct bfq_sched_data *sd;
++ struct bfq_group *bfqg;
++#endif
++
++ if (bfqq != NULL)
++ bfqd = bfqq->bfqd;
++#ifdef CONFIG_CGROUP_BFQIO
++ else {
++ sd = entity->my_sched_data;
++ bfqg = container_of(sd, struct bfq_group, sched_data);
++ BUG_ON(!bfqg);
++ bfqd = (struct bfq_data *)bfqg->bfqd;
++ BUG_ON(!bfqd);
++ }
++#endif
++
++ BUG_ON(old_st->wsum < entity->weight);
++ old_st->wsum -= entity->weight;
++
++ if (entity->new_weight != entity->orig_weight) {
++ entity->orig_weight = entity->new_weight;
++ entity->ioprio =
++ bfq_weight_to_ioprio(entity->orig_weight);
++ } else if (entity->new_ioprio != entity->ioprio) {
++ entity->ioprio = entity->new_ioprio;
++ entity->orig_weight =
++ bfq_ioprio_to_weight(entity->ioprio);
++ } else
++ entity->new_weight = entity->orig_weight =
++ bfq_ioprio_to_weight(entity->ioprio);
++
++ entity->ioprio_class = entity->new_ioprio_class;
++ entity->ioprio_changed = 0;
++
++ /*
++ * NOTE: here we may be changing the weight too early,
++ * this will cause unfairness. The correct approach
++ * would have required additional complexity to defer
++ * weight changes to the proper time instants (i.e.,
++ * when entity->finish <= old_st->vtime).
++ */
++ new_st = bfq_entity_service_tree(entity);
++
++ prev_weight = entity->weight;
++ new_weight = entity->orig_weight *
++ (bfqq != NULL ? bfqq->wr_coeff : 1);
++ /*
++ * If the weight of the entity changes, remove the entity
++ * from its old weight counter (if there is a counter
++ * associated with the entity), and add it to the counter
++ * associated with its new weight.
++ */
++ if (prev_weight != new_weight) {
++ root = bfqq ? &bfqd->queue_weights_tree :
++ &bfqd->group_weights_tree;
++ bfq_weights_tree_remove(bfqd, entity, root);
++ }
++ entity->weight = new_weight;
++ /*
++ * Add the entity to its weights tree only if it is
++ * not associated with a weight-raised queue.
++ */
++ if (prev_weight != new_weight &&
++ (bfqq ? bfqq->wr_coeff == 1 : 1))
++ /* If we get here, root has been initialized. */
++ bfq_weights_tree_add(bfqd, entity, root);
++
++ new_st->wsum += entity->weight;
++
++ if (new_st != old_st)
++ entity->start = new_st->vtime;
++ }
++
++ return new_st;
++}
++
++/**
++ * bfq_bfqq_served - update the scheduler status after selection for
++ * service.
++ * @bfqq: the queue being served.
++ * @served: bytes to transfer.
++ *
++ * NOTE: this can be optimized, as the timestamps of upper level entities
++ * are synchronized every time a new bfqq is selected for service. For now,
++ * we keep it this way to better check consistency.
++ */
++static void bfq_bfqq_served(struct bfq_queue *bfqq, unsigned long served)
++{
++ struct bfq_entity *entity = &bfqq->entity;
++ struct bfq_service_tree *st;
++
++ for_each_entity(entity) {
++ st = bfq_entity_service_tree(entity);
++
++ entity->service += served;
++ BUG_ON(entity->service > entity->budget);
++ BUG_ON(st->wsum == 0);
++
++ st->vtime += bfq_delta(served, st->wsum);
++ bfq_forget_idle(st);
++ }
++ bfq_log_bfqq(bfqq->bfqd, bfqq, "bfqq_served %lu secs", served);
++}
++
++/**
++ * bfq_bfqq_charge_full_budget - set the service to the entity budget.
++ * @bfqq: the queue that needs a service update.
++ *
++ * When it's not possible to be fair in the service domain, because
++ * a queue is not consuming its budget fast enough (the meaning of
++ * fast depends on the timeout parameter), we charge it a full
++ * budget. In this way we should obtain a sort of time-domain
++ * fairness among all the seeky/slow queues.
++ */
++static inline void bfq_bfqq_charge_full_budget(struct bfq_queue *bfqq)
++{
++ struct bfq_entity *entity = &bfqq->entity;
++
++ bfq_log_bfqq(bfqq->bfqd, bfqq, "charge_full_budget");
++
++ bfq_bfqq_served(bfqq, entity->budget - entity->service);
++}
++
++/**
++ * __bfq_activate_entity - activate an entity.
++ * @entity: the entity being activated.
++ *
++ * Called whenever an entity is activated, i.e., it is not active and one
++ * of its children receives a new request, or has to be reactivated due to
++ * budget exhaustion. It uses the current budget of the entity (and the
++ * service received if @entity is active) of the queue to calculate its
++ * timestamps.
++ */
++static void __bfq_activate_entity(struct bfq_entity *entity)
++{
++ struct bfq_sched_data *sd = entity->sched_data;
++ struct bfq_service_tree *st = bfq_entity_service_tree(entity);
++
++ if (entity == sd->in_service_entity) {
++ BUG_ON(entity->tree != NULL);
++ /*
++ * If we are requeueing the current entity we have
++ * to take care of not charging to it service it has
++ * not received.
++ */
++ bfq_calc_finish(entity, entity->service);
++ entity->start = entity->finish;
++ sd->in_service_entity = NULL;
++ } else if (entity->tree == &st->active) {
++ /*
++ * Requeueing an entity due to a change of some
++ * next_in_service entity below it. We reuse the
++ * old start time.
++ */
++ bfq_active_extract(st, entity);
++ } else if (entity->tree == &st->idle) {
++ /*
++ * Must be on the idle tree, bfq_idle_extract() will
++ * check for that.
++ */
++ bfq_idle_extract(st, entity);
++ entity->start = bfq_gt(st->vtime, entity->finish) ?
++ st->vtime : entity->finish;
++ } else {
++ /*
++ * The finish time of the entity may be invalid, and
++ * it is in the past for sure, otherwise the queue
++ * would have been on the idle tree.
++ */
++ entity->start = st->vtime;
++ st->wsum += entity->weight;
++ bfq_get_entity(entity);
++
++ BUG_ON(entity->on_st);
++ entity->on_st = 1;
++ }
++
++ st = __bfq_entity_update_weight_prio(st, entity);
++ bfq_calc_finish(entity, entity->budget);
++ bfq_active_insert(st, entity);
++}
++
++/**
++ * bfq_activate_entity - activate an entity and its ancestors if necessary.
++ * @entity: the entity to activate.
++ *
++ * Activate @entity and all the entities on the path from it to the root.
++ */
++static void bfq_activate_entity(struct bfq_entity *entity)
++{
++ struct bfq_sched_data *sd;
++
++ for_each_entity(entity) {
++ __bfq_activate_entity(entity);
++
++ sd = entity->sched_data;
++ if (!bfq_update_next_in_service(sd))
++ /*
++ * No need to propagate the activation to the
++ * upper entities, as they will be updated when
++ * the in-service entity is rescheduled.
++ */
++ break;
++ }
++}
++
++/**
++ * __bfq_deactivate_entity - deactivate an entity from its service tree.
++ * @entity: the entity to deactivate.
++ * @requeue: if false, the entity will not be put into the idle tree.
++ *
++ * Deactivate an entity, independently of its previous state. If the
++ * entity was not on a service tree just return, otherwise if it is on
++ * any scheduler tree, extract it from that tree, and if necessary
++ * and if the caller did not specify @requeue, put it on the idle tree.
++ *
++ * Return %1 if the caller should update the entity hierarchy, i.e.,
++ * if the entity was in service or if it was the next_in_service for
++ * its sched_data; return %0 otherwise.
++ */
++static int __bfq_deactivate_entity(struct bfq_entity *entity, int requeue)
++{
++ struct bfq_sched_data *sd = entity->sched_data;
++ struct bfq_service_tree *st = bfq_entity_service_tree(entity);
++ int was_in_service = entity == sd->in_service_entity;
++ int ret = 0;
++
++ if (!entity->on_st)
++ return 0;
++
++ BUG_ON(was_in_service && entity->tree != NULL);
++
++ if (was_in_service) {
++ bfq_calc_finish(entity, entity->service);
++ sd->in_service_entity = NULL;
++ } else if (entity->tree == &st->active)
++ bfq_active_extract(st, entity);
++ else if (entity->tree == &st->idle)
++ bfq_idle_extract(st, entity);
++ else if (entity->tree != NULL)
++ BUG();
++
++ if (was_in_service || sd->next_in_service == entity)
++ ret = bfq_update_next_in_service(sd);
++
++ if (!requeue || !bfq_gt(entity->finish, st->vtime))
++ bfq_forget_entity(st, entity);
++ else
++ bfq_idle_insert(st, entity);
++
++ BUG_ON(sd->in_service_entity == entity);
++ BUG_ON(sd->next_in_service == entity);
++
++ return ret;
++}
++
++/**
++ * bfq_deactivate_entity - deactivate an entity.
++ * @entity: the entity to deactivate.
++ * @requeue: true if the entity can be put on the idle tree
++ */
++static void bfq_deactivate_entity(struct bfq_entity *entity, int requeue)
++{
++ struct bfq_sched_data *sd;
++ struct bfq_entity *parent;
++
++ for_each_entity_safe(entity, parent) {
++ sd = entity->sched_data;
++
++ if (!__bfq_deactivate_entity(entity, requeue))
++ /*
++ * The parent entity is still backlogged, and
++ * we don't need to update it as it is still
++ * in service.
++ */
++ break;
++
++ if (sd->next_in_service != NULL)
++ /*
++ * The parent entity is still backlogged and
++ * the budgets on the path towards the root
++ * need to be updated.
++ */
++ goto update;
++
++ /*
++ * If we reach this point, the parent is no longer backlogged
++ * and we want to propagate the dequeue upwards.
++ */
++ requeue = 1;
++ }
++
++ return;
++
++update:
++ entity = parent;
++ for_each_entity(entity) {
++ __bfq_activate_entity(entity);
++
++ sd = entity->sched_data;
++ if (!bfq_update_next_in_service(sd))
++ break;
++ }
++}
++
++/**
++ * bfq_update_vtime - update vtime if necessary.
++ * @st: the service tree to act upon.
++ *
++ * If necessary update the service tree vtime to have at least one
++ * eligible entity, skipping to its start time. Assumes that the
++ * active tree of the device is not empty.
++ *
++ * NOTE: this hierarchical implementation updates vtimes quite often;
++ * we may end up with reactivated processes getting timestamps after a
++ * vtime skip done because we needed a ->first_active entity on some
++ * intermediate node.
++ */
++static void bfq_update_vtime(struct bfq_service_tree *st)
++{
++ struct bfq_entity *entry;
++ struct rb_node *node = st->active.rb_node;
++
++ entry = rb_entry(node, struct bfq_entity, rb_node);
++ if (bfq_gt(entry->min_start, st->vtime)) {
++ st->vtime = entry->min_start;
++ bfq_forget_idle(st);
++ }
++}
++
++/**
++ * bfq_first_active_entity - find the eligible entity with
++ * the smallest finish time
++ * @st: the service tree to select from.
++ *
++ * This function searches for the first schedulable entity, starting from
++ * the root of the tree and descending into the left subtree whenever it
++ * contains at least one eligible (start <= vtime) entity. The path on
++ * the right is followed only if a) the left subtree contains no eligible
++ * entities and b) no eligible entity has been found yet.
++ */
++static struct bfq_entity *bfq_first_active_entity(struct bfq_service_tree *st)
++{
++ struct bfq_entity *entry, *first = NULL;
++ struct rb_node *node = st->active.rb_node;
++
++ while (node != NULL) {
++ entry = rb_entry(node, struct bfq_entity, rb_node);
++left:
++ if (!bfq_gt(entry->start, st->vtime))
++ first = entry;
++
++ BUG_ON(bfq_gt(entry->min_start, st->vtime));
++
++ if (node->rb_left != NULL) {
++ entry = rb_entry(node->rb_left,
++ struct bfq_entity, rb_node);
++ if (!bfq_gt(entry->min_start, st->vtime)) {
++ node = node->rb_left;
++ goto left;
++ }
++ }
++ if (first != NULL)
++ break;
++ node = node->rb_right;
++ }
++
++ BUG_ON(first == NULL && !RB_EMPTY_ROOT(&st->active));
++ return first;
++}
++
++/**
++ * __bfq_lookup_next_entity - return the first eligible entity in @st.
++ * @st: the service tree.
++ *
++ * Update the virtual time in @st and return the first eligible entity
++ * it contains.
++ */
++static struct bfq_entity *__bfq_lookup_next_entity(struct bfq_service_tree *st,
++ bool force)
++{
++ struct bfq_entity *entity, *new_next_in_service = NULL;
++
++ if (RB_EMPTY_ROOT(&st->active))
++ return NULL;
++
++ bfq_update_vtime(st);
++ entity = bfq_first_active_entity(st);
++ BUG_ON(bfq_gt(entity->start, st->vtime));
++
++ /*
++ * If the chosen entity does not match with the sched_data's
++ * next_in_service and we are forcibly serving the IDLE priority
++ * class tree, bubble the budget update up.
++ */
++ if (unlikely(force && entity != entity->sched_data->next_in_service)) {
++ new_next_in_service = entity;
++ for_each_entity(new_next_in_service)
++ bfq_update_budget(new_next_in_service);
++ }
++
++ return entity;
++}
++
++/**
++ * bfq_lookup_next_entity - return the first eligible entity in @sd.
++ * @sd: the sched_data.
++ * @extract: if true the returned entity will be also extracted from @sd.
++ *
++ * NOTE: since we cache the next_in_service entity at each level of the
++ * hierarchy, the complexity of the lookup can be decreased with
++ * absolutely no effort just returning the cached next_in_service value;
++ * we prefer to do full lookups to test the consistency of the data
++ * structures.
++ */
++static struct bfq_entity *bfq_lookup_next_entity(struct bfq_sched_data *sd,
++ int extract,
++ struct bfq_data *bfqd)
++{
++ struct bfq_service_tree *st = sd->service_tree;
++ struct bfq_entity *entity;
++ int i = 0;
++
++ BUG_ON(sd->in_service_entity != NULL);
++
++ if (bfqd != NULL &&
++ jiffies - bfqd->bfq_class_idle_last_service > BFQ_CL_IDLE_TIMEOUT) {
++ entity = __bfq_lookup_next_entity(st + BFQ_IOPRIO_CLASSES - 1,
++ true);
++ if (entity != NULL) {
++ i = BFQ_IOPRIO_CLASSES - 1;
++ bfqd->bfq_class_idle_last_service = jiffies;
++ sd->next_in_service = entity;
++ }
++ }
++ for (; i < BFQ_IOPRIO_CLASSES; i++) {
++ entity = __bfq_lookup_next_entity(st + i, false);
++ if (entity != NULL) {
++ if (extract) {
++ bfq_check_next_in_service(sd, entity);
++ bfq_active_extract(st + i, entity);
++ sd->in_service_entity = entity;
++ sd->next_in_service = NULL;
++ }
++ break;
++ }
++ }
++
++ return entity;
++}
++
++/*
++ * Get next queue for service.
++ */
++static struct bfq_queue *bfq_get_next_queue(struct bfq_data *bfqd)
++{
++ struct bfq_entity *entity = NULL;
++ struct bfq_sched_data *sd;
++ struct bfq_queue *bfqq;
++
++ BUG_ON(bfqd->in_service_queue != NULL);
++
++ if (bfqd->busy_queues == 0)
++ return NULL;
++
++ sd = &bfqd->root_group->sched_data;
++ for (; sd != NULL; sd = entity->my_sched_data) {
++ entity = bfq_lookup_next_entity(sd, 1, bfqd);
++ BUG_ON(entity == NULL);
++ entity->service = 0;
++ }
++
++ bfqq = bfq_entity_to_bfqq(entity);
++ BUG_ON(bfqq == NULL);
++
++ return bfqq;
++}
++
++/*
++ * Forced extraction of the given queue.
++ */
++static void bfq_get_next_queue_forced(struct bfq_data *bfqd,
++ struct bfq_queue *bfqq)
++{
++ struct bfq_entity *entity;
++ struct bfq_sched_data *sd;
++
++ BUG_ON(bfqd->in_service_queue != NULL);
++
++ entity = &bfqq->entity;
++ /*
++ * Bubble up extraction/update from the leaf to the root.
++ */
++ for_each_entity(entity) {
++ sd = entity->sched_data;
++ bfq_update_budget(entity);
++ bfq_update_vtime(bfq_entity_service_tree(entity));
++ bfq_active_extract(bfq_entity_service_tree(entity), entity);
++ sd->in_service_entity = entity;
++ sd->next_in_service = NULL;
++ entity->service = 0;
++ }
++
++ return;
++}
++
++static void __bfq_bfqd_reset_in_service(struct bfq_data *bfqd)
++{
++ if (bfqd->in_service_bic != NULL) {
++ put_io_context(bfqd->in_service_bic->icq.ioc);
++ bfqd->in_service_bic = NULL;
++ }
++
++ bfqd->in_service_queue = NULL;
++ del_timer(&bfqd->idle_slice_timer);
++}
++
++static void bfq_deactivate_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq,
++ int requeue)
++{
++ struct bfq_entity *entity = &bfqq->entity;
++
++ if (bfqq == bfqd->in_service_queue)
++ __bfq_bfqd_reset_in_service(bfqd);
++
++ bfq_deactivate_entity(entity, requeue);
++}
++
++static void bfq_activate_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq)
++{
++ struct bfq_entity *entity = &bfqq->entity;
++
++ bfq_activate_entity(entity);
++}
++
++/*
++ * Called when the bfqq no longer has requests pending, remove it from
++ * the service tree.
++ */
++static void bfq_del_bfqq_busy(struct bfq_data *bfqd, struct bfq_queue *bfqq,
++ int requeue)
++{
++ BUG_ON(!bfq_bfqq_busy(bfqq));
++ BUG_ON(!RB_EMPTY_ROOT(&bfqq->sort_list));
++
++ bfq_log_bfqq(bfqd, bfqq, "del from busy");
++
++ bfq_clear_bfqq_busy(bfqq);
++
++ BUG_ON(bfqd->busy_queues == 0);
++ bfqd->busy_queues--;
++
++ if (!bfqq->dispatched) {
++ bfq_weights_tree_remove(bfqd, &bfqq->entity,
++ &bfqd->queue_weights_tree);
++ if (!blk_queue_nonrot(bfqd->queue)) {
++ BUG_ON(!bfqd->busy_in_flight_queues);
++ bfqd->busy_in_flight_queues--;
++ if (bfq_bfqq_constantly_seeky(bfqq)) {
++ BUG_ON(!bfqd->
++ const_seeky_busy_in_flight_queues);
++ bfqd->const_seeky_busy_in_flight_queues--;
++ }
++ }
++ }
++ if (bfqq->wr_coeff > 1)
++ bfqd->wr_busy_queues--;
++
++ bfq_deactivate_bfqq(bfqd, bfqq, requeue);
++}
++
++/*
++ * Called when an inactive queue receives a new request.
++ */
++static void bfq_add_bfqq_busy(struct bfq_data *bfqd, struct bfq_queue *bfqq)
++{
++ BUG_ON(bfq_bfqq_busy(bfqq));
++ BUG_ON(bfqq == bfqd->in_service_queue);
++
++ bfq_log_bfqq(bfqd, bfqq, "add to busy");
++
++ bfq_activate_bfqq(bfqd, bfqq);
++
++ bfq_mark_bfqq_busy(bfqq);
++ bfqd->busy_queues++;
++
++ if (!bfqq->dispatched) {
++ if (bfqq->wr_coeff == 1)
++ bfq_weights_tree_add(bfqd, &bfqq->entity,
++ &bfqd->queue_weights_tree);
++ if (!blk_queue_nonrot(bfqd->queue)) {
++ bfqd->busy_in_flight_queues++;
++ if (bfq_bfqq_constantly_seeky(bfqq))
++ bfqd->const_seeky_busy_in_flight_queues++;
++ }
++ }
++ if (bfqq->wr_coeff > 1)
++ bfqd->wr_busy_queues++;
++}
+diff --git a/block/bfq.h b/block/bfq.h
+new file mode 100644
+index 0000000..a83e69d
+--- /dev/null
++++ b/block/bfq.h
+@@ -0,0 +1,742 @@
++/*
++ * BFQ-v7r5 for 3.16.0: data structures and common functions prototypes.
++ *
++ * Based on ideas and code from CFQ:
++ * Copyright (C) 2003 Jens Axboe <axboe@kernel.dk>
++ *
++ * Copyright (C) 2008 Fabio Checconi <fabio@gandalf.sssup.it>
++ * Paolo Valente <paolo.valente@unimore.it>
++ *
++ * Copyright (C) 2010 Paolo Valente <paolo.valente@unimore.it>
++ */
++
++#ifndef _BFQ_H
++#define _BFQ_H
++
++#include <linux/blktrace_api.h>
++#include <linux/hrtimer.h>
++#include <linux/ioprio.h>
++#include <linux/rbtree.h>
++
++#define BFQ_IOPRIO_CLASSES 3
++#define BFQ_CL_IDLE_TIMEOUT (HZ/5)
++
++#define BFQ_MIN_WEIGHT 1
++#define BFQ_MAX_WEIGHT 1000
++
++#define BFQ_DEFAULT_GRP_WEIGHT 10
++#define BFQ_DEFAULT_GRP_IOPRIO 0
++#define BFQ_DEFAULT_GRP_CLASS IOPRIO_CLASS_BE
++
++struct bfq_entity;
++
++/**
++ * struct bfq_service_tree - per ioprio_class service tree.
++ * @active: tree for active entities (i.e., those backlogged).
++ * @idle: tree for idle entities (i.e., those not backlogged, with V <= F_i).
++ * @first_idle: idle entity with minimum F_i.
++ * @last_idle: idle entity with maximum F_i.
++ * @vtime: scheduler virtual time.
++ * @wsum: scheduler weight sum; active and idle entities contribute to it.
++ *
++ * Each service tree represents a B-WF2Q+ scheduler on its own. Each
++ * ioprio_class has its own independent scheduler, and so its own
++ * bfq_service_tree. All the fields are protected by the queue lock
++ * of the containing bfqd.
++ */
++struct bfq_service_tree {
++ struct rb_root active;
++ struct rb_root idle;
++
++ struct bfq_entity *first_idle;
++ struct bfq_entity *last_idle;
++
++ u64 vtime;
++ unsigned long wsum;
++};
++
++/**
++ * struct bfq_sched_data - multi-class scheduler.
++ * @in_service_entity: entity in service.
++ * @next_in_service: head-of-the-line entity in the scheduler.
++ * @service_tree: array of service trees, one per ioprio_class.
++ *
++ * bfq_sched_data is the basic scheduler queue. It supports three
++ * ioprio_classes, and can be used either as a toplevel queue or as
++ * an intermediate queue on a hierarchical setup.
++ * @next_in_service points to the active entity of the sched_data
++ * service trees that will be scheduled next.
++ *
++ * The supported ioprio_classes are the same as in CFQ, in descending
++ * priority order, IOPRIO_CLASS_RT, IOPRIO_CLASS_BE, IOPRIO_CLASS_IDLE.
++ * Requests from higher priority queues are served before all the
++ * requests from lower priority queues; among requests of the same
++ * queue requests are served according to B-WF2Q+.
++ * All the fields are protected by the queue lock of the containing bfqd.
++ */
++struct bfq_sched_data {
++ struct bfq_entity *in_service_entity;
++ struct bfq_entity *next_in_service;
++ struct bfq_service_tree service_tree[BFQ_IOPRIO_CLASSES];
++};
++
++/**
++ * struct bfq_weight_counter - counter of the number of all active entities
++ * with a given weight.
++ * @weight: weight of the entities that this counter refers to.
++ * @num_active: number of active entities with this weight.
++ * @weights_node: weights tree member (see bfq_data's @queue_weights_tree
++ * and @group_weights_tree).
++ */
++struct bfq_weight_counter {
++ short int weight;
++ unsigned int num_active;
++ struct rb_node weights_node;
++};
++
++/**
++ * struct bfq_entity - schedulable entity.
++ * @rb_node: service_tree member.
++ * @weight_counter: pointer to the weight counter associated with this entity.
++ * @on_st: flag, true if the entity is on a tree (either the active or
++ * the idle one of its service_tree).
++ * @finish: B-WF2Q+ finish timestamp (aka F_i).
++ * @start: B-WF2Q+ start timestamp (aka S_i).
++ * @tree: tree the entity is enqueued into; %NULL if not on a tree.
++ * @min_start: minimum start time of the (active) subtree rooted at
++ * this entity; used for O(log N) lookups into active trees.
++ * @service: service received during the last round of service.
++ * @budget: budget used to calculate F_i; F_i = S_i + @budget / @weight.
++ * @weight: weight of the queue
++ * @parent: parent entity, for hierarchical scheduling.
++ * @my_sched_data: for non-leaf nodes in the cgroup hierarchy, the
++ * associated scheduler queue, %NULL on leaf nodes.
++ * @sched_data: the scheduler queue this entity belongs to.
++ * @ioprio: the ioprio in use.
++ * @new_weight: when a weight change is requested, the new weight value.
++ * @orig_weight: original weight, used to implement weight boosting
++ * @new_ioprio: when an ioprio change is requested, the new ioprio value.
++ * @ioprio_class: the ioprio_class in use.
++ * @new_ioprio_class: when an ioprio_class change is requested, the new
++ * ioprio_class value.
++ * @ioprio_changed: flag, true when the user requested a weight, ioprio or
++ * ioprio_class change.
++ *
++ * A bfq_entity is used to represent either a bfq_queue (leaf node in the
++ * cgroup hierarchy) or a bfq_group into the upper level scheduler. Each
++ * entity belongs to the sched_data of the parent group in the cgroup
++ * hierarchy. Non-leaf entities have also their own sched_data, stored
++ * in @my_sched_data.
++ *
++ * Each entity independently stores its priority values; this would
++ * allow different weights on different devices, but this
++ * functionality is not yet exported to userspace. Priorities and
++ * weights are updated lazily, first storing the new values into the
++ * new_* fields, then setting the @ioprio_changed flag. As soon as
++ * there is a transition in the entity state that allows the priority
++ * update to take place the effective and the requested priority
++ * values are synchronized.
++ *
++ * Unless cgroups are used, the weight value is calculated from the
++ * ioprio to export the same interface as CFQ. When dealing with
++ * ``well-behaved'' queues (i.e., queues that do not spend too much
++ * time to consume their budget and have true sequential behavior, and
++ * when there are no external factors breaking anticipation) the
++ * relative weights at each level of the cgroups hierarchy should be
++ * guaranteed. All the fields are protected by the queue lock of the
++ * containing bfqd.
++ */
++struct bfq_entity {
++ struct rb_node rb_node;
++ struct bfq_weight_counter *weight_counter;
++
++ int on_st;
++
++ u64 finish;
++ u64 start;
++
++ struct rb_root *tree;
++
++ u64 min_start;
++
++ unsigned long service, budget;
++ unsigned short weight, new_weight;
++ unsigned short orig_weight;
++
++ struct bfq_entity *parent;
++
++ struct bfq_sched_data *my_sched_data;
++ struct bfq_sched_data *sched_data;
++
++ unsigned short ioprio, new_ioprio;
++ unsigned short ioprio_class, new_ioprio_class;
++
++ int ioprio_changed;
++};
++
++struct bfq_group;
++
++/**
++ * struct bfq_queue - leaf schedulable entity.
++ * @ref: reference counter.
++ * @bfqd: parent bfq_data.
++ * @new_bfqq: shared bfq_queue if queue is cooperating with
++ * one or more other queues.
++ * @pos_node: request-position tree member (see bfq_data's @rq_pos_tree).
++ * @pos_root: request-position tree root (see bfq_data's @rq_pos_tree).
++ * @sort_list: sorted list of pending requests.
++ * @next_rq: if fifo isn't expired, next request to serve.
++ * @queued: nr of requests queued in @sort_list.
++ * @allocated: currently allocated requests.
++ * @meta_pending: pending metadata requests.
++ * @fifo: fifo list of requests in sort_list.
++ * @entity: entity representing this queue in the scheduler.
++ * @max_budget: maximum budget allowed from the feedback mechanism.
++ * @budget_timeout: budget expiration (in jiffies).
++ * @dispatched: number of requests on the dispatch list or inside driver.
++ * @flags: status flags.
++ * @bfqq_list: node for active/idle bfqq list inside our bfqd.
++ * @seek_samples: number of seeks sampled
++ * @seek_total: sum of the distances of the seeks sampled
++ * @seek_mean: mean seek distance
++ * @last_request_pos: position of the last request enqueued
++ * @requests_within_timer: number of consecutive pairs of request completion
++ * and arrival, such that the queue becomes idle
++ * after the completion, but the next request arrives
++ * within an idle time slice; used only if the queue's
++ * IO_bound has been cleared.
++ * @pid: pid of the process owning the queue, used for logging purposes.
++ * @last_wr_start_finish: start time of the current weight-raising period if
++ * the @bfq-queue is being weight-raised, otherwise
++ * finish time of the last weight-raising period
++ * @wr_cur_max_time: current max raising time for this queue
++ * @soft_rt_next_start: minimum time instant such that, only if a new
++ * request is enqueued after this time instant in an
++ * idle @bfq_queue with no outstanding requests, then
++ * the task associated with the queue is deemed
++ * soft real-time (see the comments to the function
++ * bfq_bfqq_softrt_next_start()).
++ * @last_idle_bklogged: time of the last transition of the @bfq_queue from
++ * idle to backlogged
++ * @service_from_backlogged: cumulative service received from the @bfq_queue
++ * since the last transition from idle to
++ * backlogged
++ *
++ * A bfq_queue is a leaf request queue; it can be associated with one or more
++ * io_contexts, if it is async or shared between cooperating processes. @cgroup
++ * holds a reference to the cgroup, to be sure that it does not disappear while
++ * a bfqq still references it (mostly to avoid races between request issuing and
++ * task migration followed by cgroup destruction).
++ * All the fields are protected by the queue lock of the containing bfqd.
++ */
++struct bfq_queue {
++ atomic_t ref;
++ struct bfq_data *bfqd;
++
++ /* fields for cooperating queues handling */
++ struct bfq_queue *new_bfqq;
++ struct rb_node pos_node;
++ struct rb_root *pos_root;
++
++ struct rb_root sort_list;
++ struct request *next_rq;
++ int queued[2];
++ int allocated[2];
++ int meta_pending;
++ struct list_head fifo;
++
++ struct bfq_entity entity;
++
++ unsigned long max_budget;
++ unsigned long budget_timeout;
++
++ int dispatched;
++
++ unsigned int flags;
++
++ struct list_head bfqq_list;
++
++ unsigned int seek_samples;
++ u64 seek_total;
++ sector_t seek_mean;
++ sector_t last_request_pos;
++
++ unsigned int requests_within_timer;
++
++ pid_t pid;
++
++ /* weight-raising fields */
++ unsigned long wr_cur_max_time;
++ unsigned long soft_rt_next_start;
++ unsigned long last_wr_start_finish;
++ unsigned int wr_coeff;
++ unsigned long last_idle_bklogged;
++ unsigned long service_from_backlogged;
++};
++
++/**
++ * struct bfq_ttime - per process thinktime stats.
++ * @ttime_total: total process thinktime
++ * @ttime_samples: number of thinktime samples
++ * @ttime_mean: average process thinktime
++ */
++struct bfq_ttime {
++ unsigned long last_end_request;
++
++ unsigned long ttime_total;
++ unsigned long ttime_samples;
++ unsigned long ttime_mean;
++};
++
++/**
++ * struct bfq_io_cq - per (request_queue, io_context) structure.
++ * @icq: associated io_cq structure
++ * @bfqq: array of two process queues, the sync and the async
++ * @ttime: associated @bfq_ttime struct
++ */
++struct bfq_io_cq {
++ struct io_cq icq; /* must be the first member */
++ struct bfq_queue *bfqq[2];
++ struct bfq_ttime ttime;
++ int ioprio;
++};
++
++enum bfq_device_speed {
++ BFQ_BFQD_FAST,
++ BFQ_BFQD_SLOW,
++};
++
++/**
++ * struct bfq_data - per device data structure.
++ * @queue: request queue for the managed device.
++ * @root_group: root bfq_group for the device.
++ * @rq_pos_tree: rbtree sorted by next_request position, used when
++ * determining if two or more queues have interleaving
++ * requests (see bfq_close_cooperator()).
++ * @active_numerous_groups: number of bfq_groups containing more than one
++ * active @bfq_entity.
++ * @queue_weights_tree: rbtree of weight counters of @bfq_queues, sorted by
++ * weight. Used to keep track of whether all @bfq_queues
++ * have the same weight. The tree contains one counter
++ * for each distinct weight associated to some active
++ * and not weight-raised @bfq_queue (see the comments to
++ * the functions bfq_weights_tree_[add|remove] for
++ * further details).
++ * @group_weights_tree: rbtree of non-queue @bfq_entity weight counters, sorted
++ * by weight. Used to keep track of whether all
++ * @bfq_groups have the same weight. The tree contains
++ * one counter for each distinct weight associated to
++ * some active @bfq_group (see the comments to the
++ * functions bfq_weights_tree_[add|remove] for further
++ * details).
++ * @busy_queues: number of bfq_queues containing requests (including the
++ * queue in service, even if it is idling).
++ * @busy_in_flight_queues: number of @bfq_queues containing pending or
++ * in-flight requests, plus the @bfq_queue in
++ * service, even if idle but waiting for the
++ * possible arrival of its next sync request. This
++ * field is updated only if the device is rotational,
++ * but used only if the device is also NCQ-capable.
++ * The reason why the field is updated also for non-
++ * NCQ-capable rotational devices is related to the
++ * fact that the value of @hw_tag may be set also
++ * later than when busy_in_flight_queues may need to
++ * be incremented for the first time(s). Taking also
++ * this possibility into account, to avoid unbalanced
++ * increments/decrements, would imply more overhead
++ * than just updating busy_in_flight_queues
++ * regardless of the value of @hw_tag.
++ * @const_seeky_busy_in_flight_queues: number of constantly-seeky @bfq_queues
++ * (that is, seeky queues that expired
++ * for budget timeout at least once)
++ * containing pending or in-flight
++ * requests, including the in-service
++ * @bfq_queue if constantly seeky. This
++ * field is updated only if the device
++ * is rotational, but used only if the
++ * device is also NCQ-capable (see the
++ * comments to @busy_in_flight_queues).
++ * @wr_busy_queues: number of weight-raised busy @bfq_queues.
++ * @queued: number of queued requests.
++ * @rq_in_driver: number of requests dispatched and waiting for completion.
++ * @sync_flight: number of sync requests in the driver.
++ * @max_rq_in_driver: max number of reqs in driver in the last
++ * @hw_tag_samples completed requests.
++ * @hw_tag_samples: nr of samples used to calculate hw_tag.
++ * @hw_tag: flag set to one if the driver is showing a queueing behavior.
++ * @budgets_assigned: number of budgets assigned.
++ * @idle_slice_timer: timer set when idling for the next sequential request
++ * from the queue in service.
++ * @unplug_work: delayed work to restart dispatching on the request queue.
++ * @in_service_queue: bfq_queue in service.
++ * @in_service_bic: bfq_io_cq (bic) associated with the @in_service_queue.
++ * @last_position: on-disk position of the last served request.
++ * @last_budget_start: beginning of the last budget.
++ * @last_idling_start: beginning of the last idle slice.
++ * @peak_rate: peak transfer rate observed for a budget.
++ * @peak_rate_samples: number of samples used to calculate @peak_rate.
++ * @bfq_max_budget: maximum budget allotted to a bfq_queue before
++ * rescheduling.
++ * @group_list: list of all the bfq_groups active on the device.
++ * @active_list: list of all the bfq_queues active on the device.
++ * @idle_list: list of all the bfq_queues idle on the device.
++ * @bfq_quantum: max number of requests dispatched per dispatch round.
++ * @bfq_fifo_expire: timeout for async/sync requests; when it expires
++ * requests are served in fifo order.
++ * @bfq_back_penalty: weight of backward seeks wrt forward ones.
++ * @bfq_back_max: maximum allowed backward seek.
++ * @bfq_slice_idle: maximum idling time.
++ * @bfq_user_max_budget: user-configured max budget value
++ * (0 for auto-tuning).
++ * @bfq_max_budget_async_rq: maximum budget (in nr of requests) allotted to
++ * async queues.
++ * @bfq_timeout: timeout for bfq_queues to consume their budget; used
++ * to prevent seeky queues from imposing long latencies on
++ * well-behaved ones (this also implies that seeky queues cannot
++ * receive guarantees in the service domain; after a timeout
++ * they are charged for the whole allocated budget, to try
++ * to preserve a behavior reasonably fair among them, but
++ * without service-domain guarantees).
++ * @bfq_coop_thresh: number of queue merges after which a @bfq_queue is
++ * no more granted any weight-raising.
++ * @bfq_failed_cooperations: number of consecutive failed cooperation
++ * chances after which weight-raising is restored
++ * to a queue subject to more than bfq_coop_thresh
++ * queue merges.
++ * @bfq_requests_within_timer: number of consecutive requests that must be
++ * issued within the idle time slice to
++ * re-enable idling for a queue that was marked as
++ * non-I/O-bound (see the definition of the
++ * IO_bound flag for further details).
++ * @bfq_wr_coeff: Maximum factor by which the weight of a weight-raised
++ * queue is multiplied
++ * @bfq_wr_max_time: maximum duration of a weight-raising period (jiffies)
++ * @bfq_wr_rt_max_time: maximum duration for soft real-time processes
++ * @bfq_wr_min_idle_time: minimum idle period after which weight-raising
++ * may be reactivated for a queue (in jiffies)
++ * @bfq_wr_min_inter_arr_async: minimum period between request arrivals
++ * after which weight-raising may be
++ * reactivated for an already busy queue
++ * (in jiffies)
++ * @bfq_wr_max_softrt_rate: max service-rate for a soft real-time queue,
++ * in sectors per second
++ * @RT_prod: cached value of the product R*T used for computing the maximum
++ * duration of the weight raising automatically
++ * @device_speed: device-speed class for the low-latency heuristic
++ * @oom_bfqq: fallback dummy bfqq for extreme OOM conditions
++ *
++ * All the fields are protected by the @queue lock.
++ */
++struct bfq_data {
++ struct request_queue *queue;
++
++ struct bfq_group *root_group;
++ struct rb_root rq_pos_tree;
++
++#ifdef CONFIG_CGROUP_BFQIO
++ int active_numerous_groups;
++#endif
++
++ struct rb_root queue_weights_tree;
++ struct rb_root group_weights_tree;
++
++ int busy_queues;
++ int busy_in_flight_queues;
++ int const_seeky_busy_in_flight_queues;
++ int wr_busy_queues;
++ int queued;
++ int rq_in_driver;
++ int sync_flight;
++
++ int max_rq_in_driver;
++ int hw_tag_samples;
++ int hw_tag;
++
++ int budgets_assigned;
++
++ struct timer_list idle_slice_timer;
++ struct work_struct unplug_work;
++
++ struct bfq_queue *in_service_queue;
++ struct bfq_io_cq *in_service_bic;
++
++ sector_t last_position;
++
++ ktime_t last_budget_start;
++ ktime_t last_idling_start;
++ int peak_rate_samples;
++ u64 peak_rate;
++ unsigned long bfq_max_budget;
++
++ struct hlist_head group_list;
++ struct list_head active_list;
++ struct list_head idle_list;
++
++ unsigned int bfq_quantum;
++ unsigned int bfq_fifo_expire[2];
++ unsigned int bfq_back_penalty;
++ unsigned int bfq_back_max;
++ unsigned int bfq_slice_idle;
++ u64 bfq_class_idle_last_service;
++
++ unsigned int bfq_user_max_budget;
++ unsigned int bfq_max_budget_async_rq;
++ unsigned int bfq_timeout[2];
++
++ unsigned int bfq_coop_thresh;
++ unsigned int bfq_failed_cooperations;
++ unsigned int bfq_requests_within_timer;
++
++ bool low_latency;
++
++ /* parameters of the low_latency heuristics */
++ unsigned int bfq_wr_coeff;
++ unsigned int bfq_wr_max_time;
++ unsigned int bfq_wr_rt_max_time;
++ unsigned int bfq_wr_min_idle_time;
++ unsigned long bfq_wr_min_inter_arr_async;
++ unsigned int bfq_wr_max_softrt_rate;
++ u64 RT_prod;
++ enum bfq_device_speed device_speed;
++
++ struct bfq_queue oom_bfqq;
++};
++
++enum bfqq_state_flags {
++ BFQ_BFQQ_FLAG_busy = 0, /* has requests or is in service */
++ BFQ_BFQQ_FLAG_wait_request, /* waiting for a request */
++ BFQ_BFQQ_FLAG_must_alloc, /* must be allowed rq alloc */
++ BFQ_BFQQ_FLAG_fifo_expire, /* FIFO checked in this slice */
++ BFQ_BFQQ_FLAG_idle_window, /* slice idling enabled */
++ BFQ_BFQQ_FLAG_prio_changed, /* task priority has changed */
++ BFQ_BFQQ_FLAG_sync, /* synchronous queue */
++ BFQ_BFQQ_FLAG_budget_new, /* no completion with this budget */
++ BFQ_BFQQ_FLAG_IO_bound, /*
++ * bfqq has timed-out at least once
++ * having consumed at most 2/10 of
++ * its budget
++ */
++ BFQ_BFQQ_FLAG_constantly_seeky, /*
++ * bfqq has proved to be slow and
++ * seeky until budget timeout
++ */
++ BFQ_BFQQ_FLAG_softrt_update, /*
++ * may need softrt-next-start
++ * update
++ */
++ BFQ_BFQQ_FLAG_coop, /* bfqq is shared */
++ BFQ_BFQQ_FLAG_split_coop, /* shared bfqq will be split */
++};
++
++#define BFQ_BFQQ_FNS(name) \
++static inline void bfq_mark_bfqq_##name(struct bfq_queue *bfqq) \
++{ \
++ (bfqq)->flags |= (1 << BFQ_BFQQ_FLAG_##name); \
++} \
++static inline void bfq_clear_bfqq_##name(struct bfq_queue *bfqq) \
++{ \
++ (bfqq)->flags &= ~(1 << BFQ_BFQQ_FLAG_##name); \
++} \
++static inline int bfq_bfqq_##name(const struct bfq_queue *bfqq) \
++{ \
++ return ((bfqq)->flags & (1 << BFQ_BFQQ_FLAG_##name)) != 0; \
++}
++
++BFQ_BFQQ_FNS(busy);
++BFQ_BFQQ_FNS(wait_request);
++BFQ_BFQQ_FNS(must_alloc);
++BFQ_BFQQ_FNS(fifo_expire);
++BFQ_BFQQ_FNS(idle_window);
++BFQ_BFQQ_FNS(prio_changed);
++BFQ_BFQQ_FNS(sync);
++BFQ_BFQQ_FNS(budget_new);
++BFQ_BFQQ_FNS(IO_bound);
++BFQ_BFQQ_FNS(constantly_seeky);
++BFQ_BFQQ_FNS(coop);
++BFQ_BFQQ_FNS(split_coop);
++BFQ_BFQQ_FNS(softrt_update);
++#undef BFQ_BFQQ_FNS
++
++/* Logging facilities. */
++#define bfq_log_bfqq(bfqd, bfqq, fmt, args...) \
++ blk_add_trace_msg((bfqd)->queue, "bfq%d " fmt, (bfqq)->pid, ##args)
++
++#define bfq_log(bfqd, fmt, args...) \
++ blk_add_trace_msg((bfqd)->queue, "bfq " fmt, ##args)
++
++/* Expiration reasons. */
++enum bfqq_expiration {
++ BFQ_BFQQ_TOO_IDLE = 0, /*
++ * queue has been idling for
++ * too long
++ */
++ BFQ_BFQQ_BUDGET_TIMEOUT, /* budget took too long to be used */
++ BFQ_BFQQ_BUDGET_EXHAUSTED, /* budget consumed */
++ BFQ_BFQQ_NO_MORE_REQUESTS, /* the queue has no more requests */
++};
++
++#ifdef CONFIG_CGROUP_BFQIO
++/**
++ * struct bfq_group - per (device, cgroup) data structure.
++ * @entity: schedulable entity to insert into the parent group sched_data.
++ * @sched_data: own sched_data, to contain child entities (they may be
++ * both bfq_queues and bfq_groups).
++ * @group_node: node to be inserted into the bfqio_cgroup->group_data
++ * list of the containing cgroup's bfqio_cgroup.
++ * @bfqd_node: node to be inserted into the @bfqd->group_list list
++ * of the groups active on the same device; used for cleanup.
++ * @bfqd: the bfq_data for the device this group acts upon.
++ * @async_bfqq: array of async queues for all the tasks belonging to
++ * the group, one queue per ioprio value per ioprio_class,
++ * except for the idle class that has only one queue.
++ * @async_idle_bfqq: async queue for the idle class (ioprio is ignored).
++ * @my_entity: pointer to @entity, %NULL for the toplevel group; used
++ * to avoid too many special cases during group creation/
++ * migration.
++ * @active_entities: number of active entities belonging to the group;
++ * unused for the root group. Used to know whether there
++ * are groups with more than one active @bfq_entity
++ * (see the comments to the function
++ * bfq_bfqq_must_not_expire()).
++ *
++ * Each (device, cgroup) pair has its own bfq_group, i.e., for each cgroup
++ * there is a set of bfq_groups, each one collecting the lower-level
++ * entities belonging to the group that are acting on the same device.
++ *
++ * Locking works as follows:
++ * o @group_node is protected by the bfqio_cgroup lock, and is accessed
++ * via RCU from its readers.
++ * o @bfqd is protected by the queue lock, RCU is used to access it
++ * from the readers.
++ * o All the other fields are protected by the @bfqd queue lock.
++ */
++struct bfq_group {
++ struct bfq_entity entity;
++ struct bfq_sched_data sched_data;
++
++ struct hlist_node group_node;
++ struct hlist_node bfqd_node;
++
++ void *bfqd;
++
++ struct bfq_queue *async_bfqq[2][IOPRIO_BE_NR];
++ struct bfq_queue *async_idle_bfqq;
++
++ struct bfq_entity *my_entity;
++
++ int active_entities;
++};
++
++/**
++ * struct bfqio_cgroup - bfq cgroup data structure.
++ * @css: subsystem state for bfq in the containing cgroup.
++ * @online: flag marked when the subsystem is inserted.
++ * @weight: cgroup weight.
++ * @ioprio: cgroup ioprio.
++ * @ioprio_class: cgroup ioprio_class.
++ * @lock: spinlock that protects @ioprio, @ioprio_class and @group_data.
++ * @group_data: list containing the bfq_group belonging to this cgroup.
++ *
++ * @group_data is accessed using RCU, with @lock protecting the updates,
++ * @ioprio and @ioprio_class are protected by @lock.
++ */
++struct bfqio_cgroup {
++ struct cgroup_subsys_state css;
++ bool online;
++
++ unsigned short weight, ioprio, ioprio_class;
++
++ spinlock_t lock;
++ struct hlist_head group_data;
++};
++#else
++struct bfq_group {
++ struct bfq_sched_data sched_data;
++
++ struct bfq_queue *async_bfqq[2][IOPRIO_BE_NR];
++ struct bfq_queue *async_idle_bfqq;
++};
++#endif
++
++static inline struct bfq_service_tree *
++bfq_entity_service_tree(struct bfq_entity *entity)
++{
++ struct bfq_sched_data *sched_data = entity->sched_data;
++ unsigned int idx = entity->ioprio_class - 1;
++
++ BUG_ON(idx >= BFQ_IOPRIO_CLASSES);
++ BUG_ON(sched_data == NULL);
++
++ return sched_data->service_tree + idx;
++}
++
++static inline struct bfq_queue *bic_to_bfqq(struct bfq_io_cq *bic,
++ int is_sync)
++{
++ return bic->bfqq[!!is_sync];
++}
++
++static inline void bic_set_bfqq(struct bfq_io_cq *bic,
++ struct bfq_queue *bfqq, int is_sync)
++{
++ bic->bfqq[!!is_sync] = bfqq;
++}
++
++static inline struct bfq_data *bic_to_bfqd(struct bfq_io_cq *bic)
++{
++ return bic->icq.q->elevator->elevator_data;
++}
++
++/**
++ * bfq_get_bfqd_locked - get a lock to a bfqd using an RCU-protected pointer.
++ * @ptr: a pointer to a bfqd.
++ * @flags: storage for the flags to be saved.
++ *
++ * This function allows bfqg->bfqd to be protected by the
++ * queue lock of the bfqd they reference; the pointer is dereferenced
++ * under RCU, so the storage for bfqd is assured to be safe as long
++ * as the RCU read side critical section does not end. After the
++ * bfqd->queue->queue_lock is taken the pointer is rechecked, to be
++ * sure that no other writer accessed it. If we raced with a writer,
++ * the function returns NULL, with the queue unlocked, otherwise it
++ * returns the dereferenced pointer, with the queue locked.
++ */
++static inline struct bfq_data *bfq_get_bfqd_locked(void **ptr,
++ unsigned long *flags)
++{
++ struct bfq_data *bfqd;
++
++ rcu_read_lock();
++ bfqd = rcu_dereference(*(struct bfq_data **)ptr);
++
++ if (bfqd != NULL) {
++ spin_lock_irqsave(bfqd->queue->queue_lock, *flags);
++ if (*ptr == bfqd)
++ goto out;
++ spin_unlock_irqrestore(bfqd->queue->queue_lock, *flags);
++ }
++
++ bfqd = NULL;
++out:
++ rcu_read_unlock();
++ return bfqd;
++}
++
++static inline void bfq_put_bfqd_unlock(struct bfq_data *bfqd,
++ unsigned long *flags)
++{
++ spin_unlock_irqrestore(bfqd->queue->queue_lock, *flags);
++}
++
++static void bfq_changed_ioprio(struct bfq_io_cq *bic);
++static void bfq_put_queue(struct bfq_queue *bfqq);
++static void bfq_dispatch_insert(struct request_queue *q, struct request *rq);
++static struct bfq_queue *bfq_get_queue(struct bfq_data *bfqd,
++ struct bfq_group *bfqg, int is_sync,
++ struct bfq_io_cq *bic, gfp_t gfp_mask);
++static void bfq_end_wr_async_queues(struct bfq_data *bfqd,
++ struct bfq_group *bfqg);
++static void bfq_put_async_queues(struct bfq_data *bfqd, struct bfq_group *bfqg);
++static void bfq_exit_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq);
++
++#endif /* _BFQ_H */
+--
+2.0.3
+
diff --git a/5003_BFQ-3-block-add-Early-Queue-Merge-EQM-v7r5-for-3.16.0.patch b/5003_BFQ-3-block-add-Early-Queue-Merge-EQM-v7r5-for-3.16.0.patch
new file mode 100644
index 0000000..e606f5d
--- /dev/null
+++ b/5003_BFQ-3-block-add-Early-Queue-Merge-EQM-v7r5-for-3.16.0.patch
@@ -0,0 +1,1188 @@
+From 5b290be286aa74051b4b77a216032b771ceadd23 Mon Sep 17 00:00:00 2001
+From: Mauro Andreolini <mauro.andreolini@unimore.it>
+Date: Wed, 18 Jun 2014 17:38:07 +0200
+Subject: [PATCH 3/3] block, bfq: add Early Queue Merge (EQM) to BFQ-v7r5 for
+ 3.16.0
+
+A set of processes may happen to perform interleaved reads, i.e., requests
+whose union would give rise to a sequential read pattern. There are two
+typical cases: in the first case, processes read fixed-size chunks of
+data at a fixed distance from each other, while in the second case processes
+may read variable-size chunks at variable distances. The latter case occurs
+for example with QEMU, which splits the I/O generated by the guest into
+multiple chunks, and lets these chunks be served by a pool of cooperating
+processes, iteratively assigning the next chunk of I/O to the first
+available process. CFQ uses actual queue merging for the first type of
+processes, whereas it uses preemption to get a sequential read pattern out
+of the read requests performed by the second type of processes. In the end
+it uses two different mechanisms to achieve the same goal: boosting the
+throughput with interleaved I/O.
+
+This patch introduces Early Queue Merge (EQM), a unified mechanism to get a
+sequential read pattern with both types of processes. The main idea is
+checking newly arrived requests against the next request of the active queue
+both in case of actual request insert and in case of request merge. By doing
+so, both the types of processes can be handled by just merging their queues.
+EQM is then simpler and more compact than the pair of mechanisms used in
+CFQ.
+
+Finally, EQM also preserves the typical low-latency properties of BFQ, by
+properly restoring the weight-raising state of a queue when it gets back to
+a non-merged state.
+
+Signed-off-by: Mauro Andreolini <mauro.andreolini@unimore.it>
+Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
+Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
+---
+ block/bfq-iosched.c | 736 ++++++++++++++++++++++++++++++++++++----------------
+ block/bfq-sched.c | 28 --
+ block/bfq.h | 46 +++-
+ 3 files changed, 556 insertions(+), 254 deletions(-)
+
+diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
+index 0a0891b..d1d8e67 100644
+--- a/block/bfq-iosched.c
++++ b/block/bfq-iosched.c
+@@ -571,6 +571,57 @@ static inline unsigned int bfq_wr_duration(struct bfq_data *bfqd)
+ return dur;
+ }
+
++static inline unsigned
++bfq_bfqq_cooperations(struct bfq_queue *bfqq)
++{
++ return bfqq->bic ? bfqq->bic->cooperations : 0;
++}
++
++static inline void
++bfq_bfqq_resume_state(struct bfq_queue *bfqq, struct bfq_io_cq *bic)
++{
++ if (bic->saved_idle_window)
++ bfq_mark_bfqq_idle_window(bfqq);
++ else
++ bfq_clear_bfqq_idle_window(bfqq);
++ if (bic->saved_IO_bound)
++ bfq_mark_bfqq_IO_bound(bfqq);
++ else
++ bfq_clear_bfqq_IO_bound(bfqq);
++ if (bic->wr_time_left && bfqq->bfqd->low_latency &&
++ bic->cooperations < bfqq->bfqd->bfq_coop_thresh) {
++ /*
++ * Start a weight raising period with the duration given by
++ * the raising_time_left snapshot.
++ */
++ if (bfq_bfqq_busy(bfqq))
++ bfqq->bfqd->wr_busy_queues++;
++ bfqq->wr_coeff = bfqq->bfqd->bfq_wr_coeff;
++ bfqq->wr_cur_max_time = bic->wr_time_left;
++ bfqq->last_wr_start_finish = jiffies;
++ bfqq->entity.ioprio_changed = 1;
++ }
++ /*
++ * Clear wr_time_left to prevent bfq_bfqq_save_state() from
++ * getting confused about the queue's need of a weight-raising
++ * period.
++ */
++ bic->wr_time_left = 0;
++}
++
++/*
++ * Must be called with the queue_lock held.
++ */
++static int bfqq_process_refs(struct bfq_queue *bfqq)
++{
++ int process_refs, io_refs;
++
++ io_refs = bfqq->allocated[READ] + bfqq->allocated[WRITE];
++ process_refs = atomic_read(&bfqq->ref) - io_refs - bfqq->entity.on_st;
++ BUG_ON(process_refs < 0);
++ return process_refs;
++}
++
+ static void bfq_add_request(struct request *rq)
+ {
+ struct bfq_queue *bfqq = RQ_BFQQ(rq);
+@@ -602,8 +653,11 @@ static void bfq_add_request(struct request *rq)
+
+ if (!bfq_bfqq_busy(bfqq)) {
+ int soft_rt = bfqd->bfq_wr_max_softrt_rate > 0 &&
++ bfq_bfqq_cooperations(bfqq) < bfqd->bfq_coop_thresh &&
+ time_is_before_jiffies(bfqq->soft_rt_next_start);
+- idle_for_long_time = time_is_before_jiffies(
++ idle_for_long_time = bfq_bfqq_cooperations(bfqq) <
++ bfqd->bfq_coop_thresh &&
++ time_is_before_jiffies(
+ bfqq->budget_timeout +
+ bfqd->bfq_wr_min_idle_time);
+ entity->budget = max_t(unsigned long, bfqq->max_budget,
+@@ -624,11 +678,20 @@ static void bfq_add_request(struct request *rq)
+ if (!bfqd->low_latency)
+ goto add_bfqq_busy;
+
++ if (bfq_bfqq_just_split(bfqq))
++ goto set_ioprio_changed;
++
+ /*
+- * If the queue is not being boosted and has been idle
+- * for enough time, start a weight-raising period
++ * If the queue:
++ * - is not being boosted,
++ * - has been idle for enough time,
++ * - is not a sync queue or is linked to a bfq_io_cq (it is
++ * shared "for its nature" or it is not shared and its
++ * requests have not been redirected to a shared queue)
++ * start a weight-raising period.
+ */
+- if (old_wr_coeff == 1 && (idle_for_long_time || soft_rt)) {
++ if (old_wr_coeff == 1 && (idle_for_long_time || soft_rt) &&
++ (!bfq_bfqq_sync(bfqq) || bfqq->bic != NULL)) {
+ bfqq->wr_coeff = bfqd->bfq_wr_coeff;
+ if (idle_for_long_time)
+ bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
+@@ -642,9 +705,11 @@ static void bfq_add_request(struct request *rq)
+ } else if (old_wr_coeff > 1) {
+ if (idle_for_long_time)
+ bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
+- else if (bfqq->wr_cur_max_time ==
+- bfqd->bfq_wr_rt_max_time &&
+- !soft_rt) {
++ else if (bfq_bfqq_cooperations(bfqq) >=
++ bfqd->bfq_coop_thresh ||
++ (bfqq->wr_cur_max_time ==
++ bfqd->bfq_wr_rt_max_time &&
++ !soft_rt)) {
+ bfqq->wr_coeff = 1;
+ bfq_log_bfqq(bfqd, bfqq,
+ "wrais ending at %lu, rais_max_time %u",
+@@ -660,18 +725,18 @@ static void bfq_add_request(struct request *rq)
+ /*
+ *
+ * The remaining weight-raising time is lower
+- * than bfqd->bfq_wr_rt_max_time, which
+- * means that the application is enjoying
+- * weight raising either because deemed soft-
+- * rt in the near past, or because deemed
+- * interactive a long ago. In both cases,
+- * resetting now the current remaining weight-
+- * raising time for the application to the
+- * weight-raising duration for soft rt
+- * applications would not cause any latency
+- * increase for the application (as the new
+- * duration would be higher than the remaining
+- * time).
++ * than bfqd->bfq_wr_rt_max_time, which means
++ * that the application is enjoying weight
++ * raising either because deemed soft-rt in
++ * the near past, or because deemed interactive
++ * long ago.
++ * In both cases, resetting now the current
++ * remaining weight-raising time for the
++ * application to the weight-raising duration
++ * for soft rt applications would not cause any
++ * latency increase for the application (as the
++ * new duration would be higher than the
++ * remaining time).
+ *
+ * In addition, the application is now meeting
+ * the requirements for being deemed soft rt.
+@@ -706,6 +771,7 @@ static void bfq_add_request(struct request *rq)
+ bfqd->bfq_wr_rt_max_time;
+ }
+ }
++set_ioprio_changed:
+ if (old_wr_coeff != bfqq->wr_coeff)
+ entity->ioprio_changed = 1;
+ add_bfqq_busy:
+@@ -918,90 +984,35 @@ static void bfq_end_wr(struct bfq_data *bfqd)
+ spin_unlock_irq(bfqd->queue->queue_lock);
+ }
+
+-static int bfq_allow_merge(struct request_queue *q, struct request *rq,
+- struct bio *bio)
++static inline sector_t bfq_io_struct_pos(void *io_struct, bool request)
+ {
+- struct bfq_data *bfqd = q->elevator->elevator_data;
+- struct bfq_io_cq *bic;
+- struct bfq_queue *bfqq;
+-
+- /*
+- * Disallow merge of a sync bio into an async request.
+- */
+- if (bfq_bio_sync(bio) && !rq_is_sync(rq))
+- return 0;
+-
+- /*
+- * Lookup the bfqq that this bio will be queued with. Allow
+- * merge only if rq is queued there.
+- * Queue lock is held here.
+- */
+- bic = bfq_bic_lookup(bfqd, current->io_context);
+- if (bic == NULL)
+- return 0;
+-
+- bfqq = bic_to_bfqq(bic, bfq_bio_sync(bio));
+- return bfqq == RQ_BFQQ(rq);
+-}
+-
+-static void __bfq_set_in_service_queue(struct bfq_data *bfqd,
+- struct bfq_queue *bfqq)
+-{
+- if (bfqq != NULL) {
+- bfq_mark_bfqq_must_alloc(bfqq);
+- bfq_mark_bfqq_budget_new(bfqq);
+- bfq_clear_bfqq_fifo_expire(bfqq);
+-
+- bfqd->budgets_assigned = (bfqd->budgets_assigned*7 + 256) / 8;
+-
+- bfq_log_bfqq(bfqd, bfqq,
+- "set_in_service_queue, cur-budget = %lu",
+- bfqq->entity.budget);
+- }
+-
+- bfqd->in_service_queue = bfqq;
+-}
+-
+-/*
+- * Get and set a new queue for service.
+- */
+-static struct bfq_queue *bfq_set_in_service_queue(struct bfq_data *bfqd,
+- struct bfq_queue *bfqq)
+-{
+- if (!bfqq)
+- bfqq = bfq_get_next_queue(bfqd);
++ if (request)
++ return blk_rq_pos(io_struct);
+ else
+- bfq_get_next_queue_forced(bfqd, bfqq);
+-
+- __bfq_set_in_service_queue(bfqd, bfqq);
+- return bfqq;
++ return ((struct bio *)io_struct)->bi_iter.bi_sector;
+ }
+
+-static inline sector_t bfq_dist_from_last(struct bfq_data *bfqd,
+- struct request *rq)
++static inline sector_t bfq_dist_from(sector_t pos1,
++ sector_t pos2)
+ {
+- if (blk_rq_pos(rq) >= bfqd->last_position)
+- return blk_rq_pos(rq) - bfqd->last_position;
++ if (pos1 >= pos2)
++ return pos1 - pos2;
+ else
+- return bfqd->last_position - blk_rq_pos(rq);
++ return pos2 - pos1;
+ }
+
+-/*
+- * Return true if bfqq has no request pending and rq is close enough to
+- * bfqd->last_position, or if rq is closer to bfqd->last_position than
+- * bfqq->next_rq
+- */
+-static inline int bfq_rq_close(struct bfq_data *bfqd, struct request *rq)
++static inline int bfq_rq_close_to_sector(void *io_struct, bool request,
++ sector_t sector)
+ {
+- return bfq_dist_from_last(bfqd, rq) <= BFQQ_SEEK_THR;
++ return bfq_dist_from(bfq_io_struct_pos(io_struct, request), sector) <=
++ BFQQ_SEEK_THR;
+ }
+
+-static struct bfq_queue *bfqq_close(struct bfq_data *bfqd)
++static struct bfq_queue *bfqq_close(struct bfq_data *bfqd, sector_t sector)
+ {
+ struct rb_root *root = &bfqd->rq_pos_tree;
+ struct rb_node *parent, *node;
+ struct bfq_queue *__bfqq;
+- sector_t sector = bfqd->last_position;
+
+ if (RB_EMPTY_ROOT(root))
+ return NULL;
+@@ -1020,7 +1031,7 @@ static struct bfq_queue *bfqq_close(struct bfq_data *bfqd)
+ * next_request position).
+ */
+ __bfqq = rb_entry(parent, struct bfq_queue, pos_node);
+- if (bfq_rq_close(bfqd, __bfqq->next_rq))
++ if (bfq_rq_close_to_sector(__bfqq->next_rq, true, sector))
+ return __bfqq;
+
+ if (blk_rq_pos(__bfqq->next_rq) < sector)
+@@ -1031,7 +1042,7 @@ static struct bfq_queue *bfqq_close(struct bfq_data *bfqd)
+ return NULL;
+
+ __bfqq = rb_entry(node, struct bfq_queue, pos_node);
+- if (bfq_rq_close(bfqd, __bfqq->next_rq))
++ if (bfq_rq_close_to_sector(__bfqq->next_rq, true, sector))
+ return __bfqq;
+
+ return NULL;
+@@ -1040,14 +1051,12 @@ static struct bfq_queue *bfqq_close(struct bfq_data *bfqd)
+ /*
+ * bfqd - obvious
+ * cur_bfqq - passed in so that we don't decide that the current queue
+- * is closely cooperating with itself.
+- *
+- * We are assuming that cur_bfqq has dispatched at least one request,
+- * and that bfqd->last_position reflects a position on the disk associated
+- * with the I/O issued by cur_bfqq.
++ * is closely cooperating with itself
++ * sector - used as a reference point to search for a close queue
+ */
+ static struct bfq_queue *bfq_close_cooperator(struct bfq_data *bfqd,
+- struct bfq_queue *cur_bfqq)
++ struct bfq_queue *cur_bfqq,
++ sector_t sector)
+ {
+ struct bfq_queue *bfqq;
+
+@@ -1067,7 +1076,7 @@ static struct bfq_queue *bfq_close_cooperator(struct bfq_data *bfqd,
+ * working closely on the same area of the disk. In that case,
+ * we can group them together and don't waste time idling.
+ */
+- bfqq = bfqq_close(bfqd);
++ bfqq = bfqq_close(bfqd, sector);
+ if (bfqq == NULL || bfqq == cur_bfqq)
+ return NULL;
+
+@@ -1094,6 +1103,305 @@ static struct bfq_queue *bfq_close_cooperator(struct bfq_data *bfqd,
+ return bfqq;
+ }
+
++static struct bfq_queue *
++bfq_setup_merge(struct bfq_queue *bfqq, struct bfq_queue *new_bfqq)
++{
++ int process_refs, new_process_refs;
++ struct bfq_queue *__bfqq;
++
++ /*
++ * If there are no process references on the new_bfqq, then it is
++ * unsafe to follow the ->new_bfqq chain as other bfqq's in the chain
++ * may have dropped their last reference (not just their last process
++ * reference).
++ */
++ if (!bfqq_process_refs(new_bfqq))
++ return NULL;
++
++ /* Avoid a circular list and skip interim queue merges. */
++ while ((__bfqq = new_bfqq->new_bfqq)) {
++ if (__bfqq == bfqq)
++ return NULL;
++ new_bfqq = __bfqq;
++ }
++
++ process_refs = bfqq_process_refs(bfqq);
++ new_process_refs = bfqq_process_refs(new_bfqq);
++ /*
++ * If the process for the bfqq has gone away, there is no
++ * sense in merging the queues.
++ */
++ if (process_refs == 0 || new_process_refs == 0)
++ return NULL;
++
++ bfq_log_bfqq(bfqq->bfqd, bfqq, "scheduling merge with queue %d",
++ new_bfqq->pid);
++
++ /*
++ * Merging is just a redirection: the requests of the process
++ * owning one of the two queues are redirected to the other queue.
++ * The latter queue, in its turn, is set as shared if this is the
++ * first time that the requests of some process are redirected to
++ * it.
++ *
++ * We redirect bfqq to new_bfqq and not the opposite, because we
++ * are in the context of the process owning bfqq, hence we have
++ * the io_cq of this process. So we can immediately configure this
++ * io_cq to redirect the requests of the process to new_bfqq.
++ *
++ * NOTE, even if new_bfqq coincides with the in-service queue, the
++ * io_cq of new_bfqq is not available, because, if the in-service
++ * queue is shared, bfqd->in_service_bic may not point to the
++ * io_cq of the in-service queue.
++ * Redirecting the requests of the process owning bfqq to the
++ * currently in-service queue is in any case the best option, as
++ * we feed the in-service queue with new requests close to the
++ * last request served and, by doing so, hopefully increase the
++ * throughput.
++ */
++ bfqq->new_bfqq = new_bfqq;
++ atomic_add(process_refs, &new_bfqq->ref);
++ return new_bfqq;
++}
++
++/*
++ * Attempt to schedule a merge of bfqq with the currently in-service queue
++ * or with a close queue among the scheduled queues.
++ * Return NULL if no merge was scheduled, a pointer to the shared bfq_queue
++ * structure otherwise.
++ */
++static struct bfq_queue *
++bfq_setup_cooperator(struct bfq_data *bfqd, struct bfq_queue *bfqq,
++ void *io_struct, bool request)
++{
++ struct bfq_queue *in_service_bfqq, *new_bfqq;
++
++ if (bfqq->new_bfqq)
++ return bfqq->new_bfqq;
++
++ if (!io_struct)
++ return NULL;
++
++ in_service_bfqq = bfqd->in_service_queue;
++
++ if (in_service_bfqq == NULL || in_service_bfqq == bfqq ||
++ !bfqd->in_service_bic)
++ goto check_scheduled;
++
++ if (bfq_class_idle(in_service_bfqq) || bfq_class_idle(bfqq))
++ goto check_scheduled;
++
++ if (bfq_class_rt(in_service_bfqq) != bfq_class_rt(bfqq))
++ goto check_scheduled;
++
++ if (in_service_bfqq->entity.parent != bfqq->entity.parent)
++ goto check_scheduled;
++
++ if (bfq_rq_close_to_sector(io_struct, request, bfqd->last_position) &&
++ bfq_bfqq_sync(in_service_bfqq) && bfq_bfqq_sync(bfqq)) {
++ new_bfqq = bfq_setup_merge(bfqq, in_service_bfqq);
++ if (new_bfqq != NULL)
++ return new_bfqq; /* Merge with in-service queue */
++ }
++
++ /*
++ * Check whether there is a cooperator among currently scheduled
++ * queues. The only thing we need is that the bio/request is not
++ * NULL, as we need it to establish whether a cooperator exists.
++ */
++check_scheduled:
++ new_bfqq = bfq_close_cooperator(bfqd, bfqq,
++ bfq_io_struct_pos(io_struct, request));
++ if (new_bfqq)
++ return bfq_setup_merge(bfqq, new_bfqq);
++
++ return NULL;
++}
++
++static inline void
++bfq_bfqq_save_state(struct bfq_queue *bfqq)
++{
++ /*
++ * If bfqq->bic == NULL, the queue is already shared or its requests
++ * have already been redirected to a shared queue; both idle window
++ * and weight raising state have already been saved. Do nothing.
++ */
++ if (bfqq->bic == NULL)
++ return;
++ if (bfqq->bic->wr_time_left)
++ /*
++ * This is the queue of a just-started process, and would
++ * deserve weight raising: we set wr_time_left to the full
++ * weight-raising duration to trigger weight-raising when
++ * and if the queue is split and the first request of the
++ * queue is enqueued.
++ */
++ bfqq->bic->wr_time_left = bfq_wr_duration(bfqq->bfqd);
++ else if (bfqq->wr_coeff > 1) {
++ unsigned long wr_duration =
++ jiffies - bfqq->last_wr_start_finish;
++ /*
++ * It may happen that a queue's weight raising period lasts
++ * longer than its wr_cur_max_time, as weight raising is
++ * handled only when a request is enqueued or dispatched (it
++ * does not use any timer). If the weight raising period is
++ * about to end, don't save it.
++ */
++ if (bfqq->wr_cur_max_time <= wr_duration)
++ bfqq->bic->wr_time_left = 0;
++ else
++ bfqq->bic->wr_time_left =
++ bfqq->wr_cur_max_time - wr_duration;
++ /*
++ * The bfq_queue is becoming shared or the requests of the
++ * process owning the queue are being redirected to a shared
++ * queue. Stop the weight raising period of the queue, as in
++ * both cases it should not be owned by an interactive or
++ * soft real-time application.
++ */
++ bfq_bfqq_end_wr(bfqq);
++ } else
++ bfqq->bic->wr_time_left = 0;
++ bfqq->bic->saved_idle_window = bfq_bfqq_idle_window(bfqq);
++ bfqq->bic->saved_IO_bound = bfq_bfqq_IO_bound(bfqq);
++ bfqq->bic->cooperations++;
++ bfqq->bic->failed_cooperations = 0;
++}
++
++static inline void
++bfq_get_bic_reference(struct bfq_queue *bfqq)
++{
++ /*
++ * If bfqq->bic has a non-NULL value, the bic to which it belongs
++ * is about to begin using a shared bfq_queue.
++ */
++ if (bfqq->bic)
++ atomic_long_inc(&bfqq->bic->icq.ioc->refcount);
++}
++
++static void
++bfq_merge_bfqqs(struct bfq_data *bfqd, struct bfq_io_cq *bic,
++ struct bfq_queue *bfqq, struct bfq_queue *new_bfqq)
++{
++ bfq_log_bfqq(bfqd, bfqq, "merging with queue %lu",
++ (long unsigned)new_bfqq->pid);
++ /* Save weight raising and idle window of the merged queues */
++ bfq_bfqq_save_state(bfqq);
++ bfq_bfqq_save_state(new_bfqq);
++ if (bfq_bfqq_IO_bound(bfqq))
++ bfq_mark_bfqq_IO_bound(new_bfqq);
++ bfq_clear_bfqq_IO_bound(bfqq);
++ /*
++ * Grab a reference to the bic, to prevent it from being destroyed
++ * before being possibly touched by a bfq_split_bfqq().
++ */
++ bfq_get_bic_reference(bfqq);
++ bfq_get_bic_reference(new_bfqq);
++ /*
++ * Merge queues (that is, let bic redirect its requests to new_bfqq)
++ */
++ bic_set_bfqq(bic, new_bfqq, 1);
++ bfq_mark_bfqq_coop(new_bfqq);
++ /*
++ * new_bfqq now belongs to at least two bics (it is a shared queue):
++ * set new_bfqq->bic to NULL. bfqq either:
++ * - does not belong to any bic any more, and hence bfqq->bic must
++ * be set to NULL, or
++ * - is a queue whose owning bics have already been redirected to a
++ * different queue, hence the queue is destined to not belong to
++ * any bic soon and bfqq->bic is already NULL (therefore the next
++ * assignment causes no harm).
++ */
++ new_bfqq->bic = NULL;
++ bfqq->bic = NULL;
++ bfq_put_queue(bfqq);
++}
++
++static inline void bfq_bfqq_increase_failed_cooperations(struct bfq_queue *bfqq)
++{
++ struct bfq_io_cq *bic = bfqq->bic;
++ struct bfq_data *bfqd = bfqq->bfqd;
++
++ if (bic && bfq_bfqq_cooperations(bfqq) >= bfqd->bfq_coop_thresh) {
++ bic->failed_cooperations++;
++ if (bic->failed_cooperations >= bfqd->bfq_failed_cooperations)
++ bic->cooperations = 0;
++ }
++}
++
++static int bfq_allow_merge(struct request_queue *q, struct request *rq,
++ struct bio *bio)
++{
++ struct bfq_data *bfqd = q->elevator->elevator_data;
++ struct bfq_io_cq *bic;
++ struct bfq_queue *bfqq, *new_bfqq;
++
++ /*
++ * Disallow merge of a sync bio into an async request.
++ */
++ if (bfq_bio_sync(bio) && !rq_is_sync(rq))
++ return 0;
++
++ /*
++ * Lookup the bfqq that this bio will be queued with. Allow
++ * merge only if rq is queued there.
++ * Queue lock is held here.
++ */
++ bic = bfq_bic_lookup(bfqd, current->io_context);
++ if (bic == NULL)
++ return 0;
++
++ bfqq = bic_to_bfqq(bic, bfq_bio_sync(bio));
++ /*
++ * We take advantage of this function to perform an early merge
++ * of the queues of possible cooperating processes.
++ */
++ if (bfqq != NULL) {
++ new_bfqq = bfq_setup_cooperator(bfqd, bfqq, bio, false);
++ if (new_bfqq != NULL) {
++ bfq_merge_bfqqs(bfqd, bic, bfqq, new_bfqq);
++ /*
++ * If we get here, the bio will be queued in the
++ * shared queue, i.e., new_bfqq, so use new_bfqq
++ * to decide whether bio and rq can be merged.
++ */
++ bfqq = new_bfqq;
++ } else
++ bfq_bfqq_increase_failed_cooperations(bfqq);
++ }
++
++ return bfqq == RQ_BFQQ(rq);
++}
++
++static void __bfq_set_in_service_queue(struct bfq_data *bfqd,
++ struct bfq_queue *bfqq)
++{
++ if (bfqq != NULL) {
++ bfq_mark_bfqq_must_alloc(bfqq);
++ bfq_mark_bfqq_budget_new(bfqq);
++ bfq_clear_bfqq_fifo_expire(bfqq);
++
++ bfqd->budgets_assigned = (bfqd->budgets_assigned*7 + 256) / 8;
++
++ bfq_log_bfqq(bfqd, bfqq,
++ "set_in_service_queue, cur-budget = %lu",
++ bfqq->entity.budget);
++ }
++
++ bfqd->in_service_queue = bfqq;
++}
++
++/*
++ * Get and set a new queue for service.
++ */
++static struct bfq_queue *bfq_set_in_service_queue(struct bfq_data *bfqd)
++{
++ struct bfq_queue *bfqq = bfq_get_next_queue(bfqd);
++
++ __bfq_set_in_service_queue(bfqd, bfqq);
++ return bfqq;
++}
++
+ /*
+ * If enough samples have been computed, return the current max budget
+ * stored in bfqd, which is dynamically updated according to the
+@@ -1237,63 +1545,6 @@ static struct request *bfq_check_fifo(struct bfq_queue *bfqq)
+ return rq;
+ }
+
+-/*
+- * Must be called with the queue_lock held.
+- */
+-static int bfqq_process_refs(struct bfq_queue *bfqq)
+-{
+- int process_refs, io_refs;
+-
+- io_refs = bfqq->allocated[READ] + bfqq->allocated[WRITE];
+- process_refs = atomic_read(&bfqq->ref) - io_refs - bfqq->entity.on_st;
+- BUG_ON(process_refs < 0);
+- return process_refs;
+-}
+-
+-static void bfq_setup_merge(struct bfq_queue *bfqq, struct bfq_queue *new_bfqq)
+-{
+- int process_refs, new_process_refs;
+- struct bfq_queue *__bfqq;
+-
+- /*
+- * If there are no process references on the new_bfqq, then it is
+- * unsafe to follow the ->new_bfqq chain as other bfqq's in the chain
+- * may have dropped their last reference (not just their last process
+- * reference).
+- */
+- if (!bfqq_process_refs(new_bfqq))
+- return;
+-
+- /* Avoid a circular list and skip interim queue merges. */
+- while ((__bfqq = new_bfqq->new_bfqq)) {
+- if (__bfqq == bfqq)
+- return;
+- new_bfqq = __bfqq;
+- }
+-
+- process_refs = bfqq_process_refs(bfqq);
+- new_process_refs = bfqq_process_refs(new_bfqq);
+- /*
+- * If the process for the bfqq has gone away, there is no
+- * sense in merging the queues.
+- */
+- if (process_refs == 0 || new_process_refs == 0)
+- return;
+-
+- /*
+- * Merge in the direction of the lesser amount of work.
+- */
+- if (new_process_refs >= process_refs) {
+- bfqq->new_bfqq = new_bfqq;
+- atomic_add(process_refs, &new_bfqq->ref);
+- } else {
+- new_bfqq->new_bfqq = bfqq;
+- atomic_add(new_process_refs, &bfqq->ref);
+- }
+- bfq_log_bfqq(bfqq->bfqd, bfqq, "scheduling merge with queue %d",
+- new_bfqq->pid);
+-}
+-
+ static inline unsigned long bfq_bfqq_budget_left(struct bfq_queue *bfqq)
+ {
+ struct bfq_entity *entity = &bfqq->entity;
+@@ -2011,7 +2262,7 @@ static inline bool bfq_bfqq_must_idle(struct bfq_queue *bfqq)
+ */
+ static struct bfq_queue *bfq_select_queue(struct bfq_data *bfqd)
+ {
+- struct bfq_queue *bfqq, *new_bfqq = NULL;
++ struct bfq_queue *bfqq;
+ struct request *next_rq;
+ enum bfqq_expiration reason = BFQ_BFQQ_BUDGET_TIMEOUT;
+
+@@ -2021,17 +2272,6 @@ static struct bfq_queue *bfq_select_queue(struct bfq_data *bfqd)
+
+ bfq_log_bfqq(bfqd, bfqq, "select_queue: already in-service queue");
+
+- /*
+- * If another queue has a request waiting within our mean seek
+- * distance, let it run. The expire code will check for close
+- * cooperators and put the close queue at the front of the
+- * service tree. If possible, merge the expiring queue with the
+- * new bfqq.
+- */
+- new_bfqq = bfq_close_cooperator(bfqd, bfqq);
+- if (new_bfqq != NULL && bfqq->new_bfqq == NULL)
+- bfq_setup_merge(bfqq, new_bfqq);
+-
+ if (bfq_may_expire_for_budg_timeout(bfqq) &&
+ !timer_pending(&bfqd->idle_slice_timer) &&
+ !bfq_bfqq_must_idle(bfqq))
+@@ -2070,10 +2310,7 @@ static struct bfq_queue *bfq_select_queue(struct bfq_data *bfqd)
+ bfq_clear_bfqq_wait_request(bfqq);
+ del_timer(&bfqd->idle_slice_timer);
+ }
+- if (new_bfqq == NULL)
+- goto keep_queue;
+- else
+- goto expire;
++ goto keep_queue;
+ }
+ }
+
+@@ -2082,40 +2319,30 @@ static struct bfq_queue *bfq_select_queue(struct bfq_data *bfqd)
+ * in flight (possibly waiting for a completion) or is idling for a
+ * new request, then keep it.
+ */
+- if (new_bfqq == NULL && (timer_pending(&bfqd->idle_slice_timer) ||
+- (bfqq->dispatched != 0 && bfq_bfqq_must_not_expire(bfqq)))) {
++ if (timer_pending(&bfqd->idle_slice_timer) ||
++ (bfqq->dispatched != 0 && bfq_bfqq_must_not_expire(bfqq))) {
+ bfqq = NULL;
+ goto keep_queue;
+- } else if (new_bfqq != NULL && timer_pending(&bfqd->idle_slice_timer)) {
+- /*
+- * Expiring the queue because there is a close cooperator,
+- * cancel timer.
+- */
+- bfq_clear_bfqq_wait_request(bfqq);
+- del_timer(&bfqd->idle_slice_timer);
+ }
+
+ reason = BFQ_BFQQ_NO_MORE_REQUESTS;
+ expire:
+ bfq_bfqq_expire(bfqd, bfqq, 0, reason);
+ new_queue:
+- bfqq = bfq_set_in_service_queue(bfqd, new_bfqq);
++ bfqq = bfq_set_in_service_queue(bfqd);
+ bfq_log(bfqd, "select_queue: new queue %d returned",
+ bfqq != NULL ? bfqq->pid : 0);
+ keep_queue:
+ return bfqq;
+ }
+
+-static void bfq_update_wr_data(struct bfq_data *bfqd,
+- struct bfq_queue *bfqq)
++static void bfq_update_wr_data(struct bfq_data *bfqd, struct bfq_queue *bfqq)
+ {
+- if (bfqq->wr_coeff > 1) { /* queue is being boosted */
+- struct bfq_entity *entity = &bfqq->entity;
+-
++ struct bfq_entity *entity = &bfqq->entity;
++ if (bfqq->wr_coeff > 1) { /* queue is being weight-raised */
+ bfq_log_bfqq(bfqd, bfqq,
+ "raising period dur %u/%u msec, old coeff %u, w %d(%d)",
+- jiffies_to_msecs(jiffies -
+- bfqq->last_wr_start_finish),
++ jiffies_to_msecs(jiffies - bfqq->last_wr_start_finish),
+ jiffies_to_msecs(bfqq->wr_cur_max_time),
+ bfqq->wr_coeff,
+ bfqq->entity.weight, bfqq->entity.orig_weight);
+@@ -2124,11 +2351,15 @@ static void bfq_update_wr_data(struct bfq_data *bfqd,
+ entity->orig_weight * bfqq->wr_coeff);
+ if (entity->ioprio_changed)
+ bfq_log_bfqq(bfqd, bfqq, "WARN: pending prio change");
++
+ /*
+ * If too much time has elapsed from the beginning
+- * of this weight-raising, stop it.
++ * of this weight-raising period, or the queue has
++ * exceeded the acceptable number of cooperations,
++ * stop it.
+ */
+- if (time_is_before_jiffies(bfqq->last_wr_start_finish +
++ if (bfq_bfqq_cooperations(bfqq) >= bfqd->bfq_coop_thresh ||
++ time_is_before_jiffies(bfqq->last_wr_start_finish +
+ bfqq->wr_cur_max_time)) {
+ bfqq->last_wr_start_finish = jiffies;
+ bfq_log_bfqq(bfqd, bfqq,
+@@ -2136,11 +2367,13 @@ static void bfq_update_wr_data(struct bfq_data *bfqd,
+ bfqq->last_wr_start_finish,
+ jiffies_to_msecs(bfqq->wr_cur_max_time));
+ bfq_bfqq_end_wr(bfqq);
+- __bfq_entity_update_weight_prio(
+- bfq_entity_service_tree(entity),
+- entity);
+ }
+ }
++ /* Update weight both if it must be raised and if it must be lowered */
++ if ((entity->weight > entity->orig_weight) != (bfqq->wr_coeff > 1))
++ __bfq_entity_update_weight_prio(
++ bfq_entity_service_tree(entity),
++ entity);
+ }
+
+ /*
+@@ -2377,6 +2610,25 @@ static inline void bfq_init_icq(struct io_cq *icq)
+ struct bfq_io_cq *bic = icq_to_bic(icq);
+
+ bic->ttime.last_end_request = jiffies;
++ /*
++ * A newly created bic indicates that the process has just
++ * started doing I/O, and is probably mapping into memory its
++ * executable and libraries: it definitely needs weight raising.
++ * There is however the possibility that the process performs,
++ * for a while, I/O close to some other process. EQM intercepts
++ * this behavior and may merge the queue corresponding to the
++ * process with some other queue, BEFORE the weight of the queue
++ * is raised. Merged queues are not weight-raised (they are assumed
++ * to belong to processes that benefit only from high throughput).
++ * If the merge is basically the consequence of an accident, then
++ * the queue will be split soon and will get back its old weight.
++ * It is then important to write down somewhere that this queue
++ * does need weight raising, even if it did not make it to get its
++ * weight raised before being merged. To this purpose, we overload
++ * the field raising_time_left and assign 1 to it, to mark the queue
++ * as needing weight raising.
++ */
++ bic->wr_time_left = 1;
+ }
+
+ static void bfq_exit_icq(struct io_cq *icq)
+@@ -2390,6 +2642,13 @@ static void bfq_exit_icq(struct io_cq *icq)
+ }
+
+ if (bic->bfqq[BLK_RW_SYNC]) {
++ /*
++ * If the bic is using a shared queue, put the reference
++ * taken on the io_context when the bic started using a
++ * shared bfq_queue.
++ */
++ if (bfq_bfqq_coop(bic->bfqq[BLK_RW_SYNC]))
++ put_io_context(icq->ioc);
+ bfq_exit_bfqq(bfqd, bic->bfqq[BLK_RW_SYNC]);
+ bic->bfqq[BLK_RW_SYNC] = NULL;
+ }
+@@ -2678,6 +2937,10 @@ static void bfq_update_idle_window(struct bfq_data *bfqd,
+ if (!bfq_bfqq_sync(bfqq) || bfq_class_idle(bfqq))
+ return;
+
++ /* Idle window just restored, statistics are meaningless. */
++ if (bfq_bfqq_just_split(bfqq))
++ return;
++
+ enable_idle = bfq_bfqq_idle_window(bfqq);
+
+ if (atomic_read(&bic->icq.ioc->active_ref) == 0 ||
+@@ -2725,6 +2988,7 @@ static void bfq_rq_enqueued(struct bfq_data *bfqd, struct bfq_queue *bfqq,
+ if (bfqq->entity.service > bfq_max_budget(bfqd) / 8 ||
+ !BFQQ_SEEKY(bfqq))
+ bfq_update_idle_window(bfqd, bfqq, bic);
++ bfq_clear_bfqq_just_split(bfqq);
+
+ bfq_log_bfqq(bfqd, bfqq,
+ "rq_enqueued: idle_window=%d (seeky %d, mean %llu)",
+@@ -2785,13 +3049,49 @@ static void bfq_rq_enqueued(struct bfq_data *bfqd, struct bfq_queue *bfqq,
+ static void bfq_insert_request(struct request_queue *q, struct request *rq)
+ {
+ struct bfq_data *bfqd = q->elevator->elevator_data;
+- struct bfq_queue *bfqq = RQ_BFQQ(rq);
++ struct bfq_queue *bfqq = RQ_BFQQ(rq), *new_bfqq;
+
+ assert_spin_locked(bfqd->queue->queue_lock);
++
++ /*
++ * An unplug may trigger a requeue of a request from the device
++ * driver: make sure we are in process context while trying to
++ * merge two bfq_queues.
++ */
++ if (!in_interrupt()) {
++ new_bfqq = bfq_setup_cooperator(bfqd, bfqq, rq, true);
++ if (new_bfqq != NULL) {
++ if (bic_to_bfqq(RQ_BIC(rq), 1) != bfqq)
++ new_bfqq = bic_to_bfqq(RQ_BIC(rq), 1);
++ /*
++ * Release the request's reference to the old bfqq
++ * and make sure one is taken to the shared queue.
++ */
++ new_bfqq->allocated[rq_data_dir(rq)]++;
++ bfqq->allocated[rq_data_dir(rq)]--;
++ atomic_inc(&new_bfqq->ref);
++ bfq_put_queue(bfqq);
++ if (bic_to_bfqq(RQ_BIC(rq), 1) == bfqq)
++ bfq_merge_bfqqs(bfqd, RQ_BIC(rq),
++ bfqq, new_bfqq);
++ rq->elv.priv[1] = new_bfqq;
++ bfqq = new_bfqq;
++ } else
++ bfq_bfqq_increase_failed_cooperations(bfqq);
++ }
++
+ bfq_init_prio_data(bfqq, RQ_BIC(rq));
+
+ bfq_add_request(rq);
+
++ /*
++ * Here a newly-created bfq_queue has already started a weight-raising
++ * period: clear raising_time_left to prevent bfq_bfqq_save_state()
++ * from assigning it a full weight-raising period. See the detailed
++ * comments about this field in bfq_init_icq().
++ */
++ if (bfqq->bic != NULL)
++ bfqq->bic->wr_time_left = 0;
+ rq->fifo_time = jiffies + bfqd->bfq_fifo_expire[rq_is_sync(rq)];
+ list_add_tail(&rq->queuelist, &bfqq->fifo);
+
+@@ -2956,18 +3256,6 @@ static void bfq_put_request(struct request *rq)
+ }
+ }
+
+-static struct bfq_queue *
+-bfq_merge_bfqqs(struct bfq_data *bfqd, struct bfq_io_cq *bic,
+- struct bfq_queue *bfqq)
+-{
+- bfq_log_bfqq(bfqd, bfqq, "merging with queue %lu",
+- (long unsigned)bfqq->new_bfqq->pid);
+- bic_set_bfqq(bic, bfqq->new_bfqq, 1);
+- bfq_mark_bfqq_coop(bfqq->new_bfqq);
+- bfq_put_queue(bfqq);
+- return bic_to_bfqq(bic, 1);
+-}
+-
+ /*
+ * Returns NULL if a new bfqq should be allocated, or the old bfqq if this
+ * was the last process referring to said bfqq.
+@@ -2976,6 +3264,9 @@ static struct bfq_queue *
+ bfq_split_bfqq(struct bfq_io_cq *bic, struct bfq_queue *bfqq)
+ {
+ bfq_log_bfqq(bfqq->bfqd, bfqq, "splitting queue");
++
++ put_io_context(bic->icq.ioc);
++
+ if (bfqq_process_refs(bfqq) == 1) {
+ bfqq->pid = current->pid;
+ bfq_clear_bfqq_coop(bfqq);
+@@ -3004,6 +3295,7 @@ static int bfq_set_request(struct request_queue *q, struct request *rq,
+ struct bfq_queue *bfqq;
+ struct bfq_group *bfqg;
+ unsigned long flags;
++ bool split = false;
+
+ might_sleep_if(gfp_mask & __GFP_WAIT);
+
+@@ -3022,24 +3314,14 @@ new_queue:
+ bfqq = bfq_get_queue(bfqd, bfqg, is_sync, bic, gfp_mask);
+ bic_set_bfqq(bic, bfqq, is_sync);
+ } else {
+- /*
+- * If the queue was seeky for too long, break it apart.
+- */
++ /* If the queue was seeky for too long, break it apart. */
+ if (bfq_bfqq_coop(bfqq) && bfq_bfqq_split_coop(bfqq)) {
+ bfq_log_bfqq(bfqd, bfqq, "breaking apart bfqq");
+ bfqq = bfq_split_bfqq(bic, bfqq);
++ split = true;
+ if (!bfqq)
+ goto new_queue;
+ }
+-
+- /*
+- * Check to see if this queue is scheduled to merge with
+- * another closely cooperating queue. The merging of queues
+- * happens here as it must be done in process context.
+- * The reference on new_bfqq was taken in merge_bfqqs.
+- */
+- if (bfqq->new_bfqq != NULL)
+- bfqq = bfq_merge_bfqqs(bfqd, bic, bfqq);
+ }
+
+ bfqq->allocated[rw]++;
+@@ -3050,6 +3332,26 @@ new_queue:
+ rq->elv.priv[0] = bic;
+ rq->elv.priv[1] = bfqq;
+
++ /*
++ * If a bfq_queue has only one process reference, it is owned
++ * by only one bfq_io_cq: we can set the bic field of the
++ * bfq_queue to the address of that structure. Also, if the
++ * queue has just been split, mark a flag so that the
++ * information is available to the other scheduler hooks.
++ */
++ if (bfqq_process_refs(bfqq) == 1) {
++ bfqq->bic = bic;
++ if (split) {
++ bfq_mark_bfqq_just_split(bfqq);
++ /*
++ * If the queue has just been split from a shared
++ * queue, restore the idle window and the possible
++ * weight raising period.
++ */
++ bfq_bfqq_resume_state(bfqq, bic);
++ }
++ }
++
+ spin_unlock_irqrestore(q->queue_lock, flags);
+
+ return 0;
+diff --git a/block/bfq-sched.c b/block/bfq-sched.c
+index c4831b7..546a254 100644
+--- a/block/bfq-sched.c
++++ b/block/bfq-sched.c
+@@ -1084,34 +1084,6 @@ static struct bfq_queue *bfq_get_next_queue(struct bfq_data *bfqd)
+ return bfqq;
+ }
+
+-/*
+- * Forced extraction of the given queue.
+- */
+-static void bfq_get_next_queue_forced(struct bfq_data *bfqd,
+- struct bfq_queue *bfqq)
+-{
+- struct bfq_entity *entity;
+- struct bfq_sched_data *sd;
+-
+- BUG_ON(bfqd->in_service_queue != NULL);
+-
+- entity = &bfqq->entity;
+- /*
+- * Bubble up extraction/update from the leaf to the root.
+- */
+- for_each_entity(entity) {
+- sd = entity->sched_data;
+- bfq_update_budget(entity);
+- bfq_update_vtime(bfq_entity_service_tree(entity));
+- bfq_active_extract(bfq_entity_service_tree(entity), entity);
+- sd->in_service_entity = entity;
+- sd->next_in_service = NULL;
+- entity->service = 0;
+- }
+-
+- return;
+-}
+-
+ static void __bfq_bfqd_reset_in_service(struct bfq_data *bfqd)
+ {
+ if (bfqd->in_service_bic != NULL) {
+diff --git a/block/bfq.h b/block/bfq.h
+index a83e69d..ebbd040 100644
+--- a/block/bfq.h
++++ b/block/bfq.h
+@@ -215,18 +215,21 @@ struct bfq_group;
+ * idle @bfq_queue with no outstanding requests, then
+ * the task associated with the queue it is deemed as
+ * soft real-time (see the comments to the function
+- * bfq_bfqq_softrt_next_start()).
++ * bfq_bfqq_softrt_next_start())
+ * @last_idle_bklogged: time of the last transition of the @bfq_queue from
+ * idle to backlogged
+ * @service_from_backlogged: cumulative service received from the @bfq_queue
+ * since the last transition from idle to
+ * backlogged
++ * @bic: pointer to the bfq_io_cq owning the bfq_queue, set to %NULL if the
++ * queue is shared
+ *
+- * A bfq_queue is a leaf request queue; it can be associated with an io_context
+- * or more, if it is async or shared between cooperating processes. @cgroup
+- * holds a reference to the cgroup, to be sure that it does not disappear while
+- * a bfqq still references it (mostly to avoid races between request issuing and
+- * task migration followed by cgroup destruction).
++ * A bfq_queue is a leaf request queue; it can be associated with an
++ * io_context or more, if it is async or shared between cooperating
++ * processes. @cgroup holds a reference to the cgroup, to be sure that it
++ * does not disappear while a bfqq still references it (mostly to avoid
++ * races between request issuing and task migration followed by cgroup
++ * destruction).
+ * All the fields are protected by the queue lock of the containing bfqd.
+ */
+ struct bfq_queue {
+@@ -264,6 +267,7 @@ struct bfq_queue {
+ unsigned int requests_within_timer;
+
+ pid_t pid;
++ struct bfq_io_cq *bic;
+
+ /* weight-raising fields */
+ unsigned long wr_cur_max_time;
+@@ -293,12 +297,34 @@ struct bfq_ttime {
+ * @icq: associated io_cq structure
+ * @bfqq: array of two process queues, the sync and the async
+ * @ttime: associated @bfq_ttime struct
++ * @wr_time_left: snapshot of the time left before weight raising ends
++ * for the sync queue associated to this process; this
++ * snapshot is taken to remember this value while the weight
++ * raising is suspended because the queue is merged with a
++ * shared queue, and is used to set @raising_cur_max_time
++ * when the queue is split from the shared queue and its
++ * weight is raised again
++ * @saved_idle_window: same purpose as the previous field for the idle
++ * window
++ * @saved_IO_bound: same purpose as the previous two fields for the I/O
++ * bound classification of a queue
++ * @cooperations: counter of consecutive successful queue merges underwent
++ * by any of the process' @bfq_queues
++ * @failed_cooperations: counter of consecutive failed queue merges of any
++ * of the process' @bfq_queues
+ */
+ struct bfq_io_cq {
+ struct io_cq icq; /* must be the first member */
+ struct bfq_queue *bfqq[2];
+ struct bfq_ttime ttime;
+ int ioprio;
++
++ unsigned int wr_time_left;
++ unsigned int saved_idle_window;
++ unsigned int saved_IO_bound;
++
++ unsigned int cooperations;
++ unsigned int failed_cooperations;
+ };
+
+ enum bfq_device_speed {
+@@ -511,7 +537,7 @@ enum bfqq_state_flags {
+ BFQ_BFQQ_FLAG_prio_changed, /* task priority has changed */
+ BFQ_BFQQ_FLAG_sync, /* synchronous queue */
+ BFQ_BFQQ_FLAG_budget_new, /* no completion with this budget */
+- BFQ_BFQQ_FLAG_IO_bound, /*
++ BFQ_BFQQ_FLAG_IO_bound, /*
+ * bfqq has timed-out at least once
+ * having consumed at most 2/10 of
+ * its budget
+@@ -520,12 +546,13 @@ enum bfqq_state_flags {
+ * bfqq has proved to be slow and
+ * seeky until budget timeout
+ */
+- BFQ_BFQQ_FLAG_softrt_update, /*
++ BFQ_BFQQ_FLAG_softrt_update, /*
+ * may need softrt-next-start
+ * update
+ */
+ BFQ_BFQQ_FLAG_coop, /* bfqq is shared */
+- BFQ_BFQQ_FLAG_split_coop, /* shared bfqq will be splitted */
++ BFQ_BFQQ_FLAG_split_coop, /* shared bfqq will be split */
++ BFQ_BFQQ_FLAG_just_split, /* queue has just been split */
+ };
+
+ #define BFQ_BFQQ_FNS(name) \
+@@ -554,6 +581,7 @@ BFQ_BFQQ_FNS(IO_bound);
+ BFQ_BFQQ_FNS(constantly_seeky);
+ BFQ_BFQQ_FNS(coop);
+ BFQ_BFQQ_FNS(split_coop);
++BFQ_BFQQ_FNS(just_split);
+ BFQ_BFQQ_FNS(softrt_update);
+ #undef BFQ_BFQQ_FNS
+
+--
+2.0.3
+
^ permalink raw reply related [flat|nested] 26+ messages in thread
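The long comment added in bfq_init_icq() above describes overloading the new wr_time_left field: a freshly created bic sets it to 1 as a marker that the queue deserves weight raising even if it gets merged before the raising starts, and bfq_insert_request() clears it to 0 once a real weight-raising period has begun. A minimal standalone sketch of that marker protocol, with all of the surrounding scheduler machinery stubbed out (the struct and function names here are illustrative, not the real kernel symbols):

```c
#include <assert.h>

/* Stand-in for the wr_time_left field added to struct bfq_io_cq. */
struct bic_sketch {
    unsigned int wr_time_left; /* 0: nothing saved; >0: weight raising owed */
};

/* bfq_init_icq(): mark a brand-new bic as needing weight raising, so the
 * information survives an early EQM merge that would otherwise lose it. */
static void sketch_init_icq(struct bic_sketch *bic)
{
    bic->wr_time_left = 1;
}

/* bfq_insert_request(): once a weight-raising period has actually started,
 * clear the marker so a later save/restore cycle does not assign a second
 * full weight-raising period. */
static void sketch_request_inserted(struct bic_sketch *bic)
{
    bic->wr_time_left = 0;
}

/* On a queue split, this is what bfq_bfqq_resume_state() would consult. */
static int sketch_needs_wr(const struct bic_sketch *bic)
{
    return bic->wr_time_left > 0;
}
```

This only models the flag's lifecycle, not the actual weight computation; in the patch the same field also stores a real jiffies-based remainder when a raising period is suspended by a merge.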
* [gentoo-commits] proj/linux-patches:3.16 commit in: /
@ 2014-08-26 12:16 Mike Pagano
0 siblings, 0 replies; 26+ messages in thread
From: Mike Pagano @ 2014-08-26 12:16 UTC (permalink / raw
To: gentoo-commits
commit: eb0a44de4e660928fbf347dae020a3b6cde29d7b
Author: Mike Pagano <mpagano <AT> gentoo <DOT> org>
AuthorDate: Tue Aug 26 12:16:43 2014 +0000
Commit: Mike Pagano <mpagano <AT> gentoo <DOT> org>
CommitDate: Tue Aug 26 12:16:43 2014 +0000
URL: http://sources.gentoo.org/gitweb/?p=proj/linux-patches.git;a=commit;h=eb0a44de
Update to correct double mount thanks to mgorny
---
2900_dev-root-proc-mount-fix.patch | 11 ++++++-----
1 file changed, 6 insertions(+), 5 deletions(-)
diff --git a/2900_dev-root-proc-mount-fix.patch b/2900_dev-root-proc-mount-fix.patch
index 4c89adf..6ea86e2 100644
--- a/2900_dev-root-proc-mount-fix.patch
+++ b/2900_dev-root-proc-mount-fix.patch
@@ -1,6 +1,6 @@
---- a/init/do_mounts.c 2013-01-25 19:11:11.609802424 -0500
-+++ b/init/do_mounts.c 2013-01-25 19:14:20.606053568 -0500
-@@ -461,7 +461,10 @@ void __init change_floppy(char *fmt, ...
+--- a/init/do_mounts.c 2014-08-26 08:03:30.000013100 -0400
++++ b/init/do_mounts.c 2014-08-26 08:11:19.720014712 -0400
+@@ -484,7 +484,10 @@ void __init change_floppy(char *fmt, ...
va_start(args, fmt);
vsprintf(buf, fmt, args);
va_end(args);
@@ -12,10 +12,11 @@
if (fd >= 0) {
sys_ioctl(fd, FDEJECT, 0);
sys_close(fd);
-@@ -505,7 +508,13 @@ void __init mount_root(void)
+@@ -527,8 +530,13 @@ void __init mount_root(void)
+ }
#endif
#ifdef CONFIG_BLOCK
- create_dev("/dev/root", ROOT_DEV);
+- create_dev("/dev/root", ROOT_DEV);
- mount_block_root("/dev/root", root_mountflags);
+ if (saved_root_name[0]) {
+ create_dev(saved_root_name, ROOT_DEV);
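The visible hunk of the updated 2900_dev-root-proc-mount-fix.patch replaces the unconditional create_dev("/dev/root", ROOT_DEV) with a branch on saved_root_name[0], so the device is created and mounted under the name the user passed via root= rather than the /dev/root alias. A hypothetical userspace sketch of just that name-selection decision (the helper name and the fallback behavior for an empty saved name are assumptions; the tail of the hunk is truncated in this message):

```c
#include <string.h>

/* Return the node name a mount_root()-like function would create and mount:
 * prefer the saved root= name from the kernel command line, fall back to
 * the generic /dev/root alias when none was recorded. */
static const char *pick_root_node(const char *saved_root_name)
{
    if (saved_root_name != NULL && saved_root_name[0] != '\0')
        return saved_root_name; /* e.g. "/dev/sda2" from root=/dev/sda2 */
    return "/dev/root";         /* legacy alias */
}
```

Mounting under the real name is what keeps the opaque /dev/root entry out of /proc/mounts when booting without an initramfs, which is the bug this patch series addresses.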
* [gentoo-commits] proj/linux-patches:3.16 commit in: /
@ 2014-09-09 21:38 Vlastimil Babka
0 siblings, 0 replies; 26+ messages in thread
From: Vlastimil Babka @ 2014-09-09 21:38 UTC (permalink / raw
To: gentoo-commits
commit: 3cbefb09946b411dbf2d5efb82db9628598dd2bb
Author: Caster <caster <AT> gentoo <DOT> org>
AuthorDate: Tue Sep 9 21:35:39 2014 +0000
Commit: Vlastimil Babka <caster <AT> gentoo <DOT> org>
CommitDate: Tue Sep 9 21:35:39 2014 +0000
URL: http://sources.gentoo.org/gitweb/?p=proj/linux-patches.git;a=commit;h=3cbefb09
Linux patch 3.16.2
---
0000_README | 4 +
1001_linux-3.16.2.patch | 5945 +++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 5949 insertions(+)
diff --git a/0000_README b/0000_README
index f57085e..1ecfc95 100644
--- a/0000_README
+++ b/0000_README
@@ -46,6 +46,10 @@ Patch: 1000_linux-3.16.1.patch
From: http://www.kernel.org
Desc: Linux 3.16.1
+Patch: 1001_linux-3.16.2.patch
+From: http://www.kernel.org
+Desc: Linux 3.16.2
+
Patch: 2400_kcopy-patch-for-infiniband-driver.patch
From: Alexey Shvetsov <alexxy@gentoo.org>
Desc: Zero copy for infiniband psm userspace driver
diff --git a/1001_linux-3.16.2.patch b/1001_linux-3.16.2.patch
new file mode 100644
index 0000000..b0b883d
--- /dev/null
+++ b/1001_linux-3.16.2.patch
@@ -0,0 +1,5945 @@
+diff --git a/Documentation/sound/alsa/ALSA-Configuration.txt b/Documentation/sound/alsa/ALSA-Configuration.txt
+index 7ccf933bfbe0..48148d6d9307 100644
+--- a/Documentation/sound/alsa/ALSA-Configuration.txt
++++ b/Documentation/sound/alsa/ALSA-Configuration.txt
+@@ -2026,8 +2026,8 @@ Prior to version 0.9.0rc4 options had a 'snd_' prefix. This was removed.
+ -------------------
+
+ Module for sound cards based on the Asus AV66/AV100/AV200 chips,
+- i.e., Xonar D1, DX, D2, D2X, DS, Essence ST (Deluxe), Essence STX,
+- HDAV1.3 (Deluxe), and HDAV1.3 Slim.
++ i.e., Xonar D1, DX, D2, D2X, DS, DSX, Essence ST (Deluxe),
++ Essence STX (II), HDAV1.3 (Deluxe), and HDAV1.3 Slim.
+
+ This module supports autoprobe and multiple cards.
+
+diff --git a/Documentation/stable_kernel_rules.txt b/Documentation/stable_kernel_rules.txt
+index cbc2f03056bd..aee73e78c7d4 100644
+--- a/Documentation/stable_kernel_rules.txt
++++ b/Documentation/stable_kernel_rules.txt
+@@ -29,6 +29,9 @@ Rules on what kind of patches are accepted, and which ones are not, into the
+
+ Procedure for submitting patches to the -stable tree:
+
++ - If the patch covers files in net/ or drivers/net please follow netdev stable
++ submission guidelines as described in
++ Documentation/networking/netdev-FAQ.txt
+ - Send the patch, after verifying that it follows the above rules, to
+ stable@vger.kernel.org. You must note the upstream commit ID in the
+ changelog of your submission, as well as the kernel version you wish
+diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
+index 0fe36497642c..612e6e99d1e5 100644
+--- a/Documentation/virtual/kvm/api.txt
++++ b/Documentation/virtual/kvm/api.txt
+@@ -1869,7 +1869,8 @@ registers, find a list below:
+ PPC | KVM_REG_PPC_PID | 64
+ PPC | KVM_REG_PPC_ACOP | 64
+ PPC | KVM_REG_PPC_VRSAVE | 32
+- PPC | KVM_REG_PPC_LPCR | 64
++ PPC | KVM_REG_PPC_LPCR | 32
++ PPC | KVM_REG_PPC_LPCR_64 | 64
+ PPC | KVM_REG_PPC_PPR | 64
+ PPC | KVM_REG_PPC_ARCH_COMPAT 32
+ PPC | KVM_REG_PPC_DABRX | 32
+diff --git a/Makefile b/Makefile
+index 87663a2d1d10..c2617526e605 100644
+--- a/Makefile
++++ b/Makefile
+@@ -1,6 +1,6 @@
+ VERSION = 3
+ PATCHLEVEL = 16
+-SUBLEVEL = 1
++SUBLEVEL = 2
+ EXTRAVERSION =
+ NAME = Museum of Fishiegoodies
+
+diff --git a/arch/arm/boot/dts/am4372.dtsi b/arch/arm/boot/dts/am4372.dtsi
+index 49fa59622254..c9aee0e799bb 100644
+--- a/arch/arm/boot/dts/am4372.dtsi
++++ b/arch/arm/boot/dts/am4372.dtsi
+@@ -168,9 +168,6 @@
+ ti,hwmods = "mailbox";
+ ti,mbox-num-users = <4>;
+ ti,mbox-num-fifos = <8>;
+- ti,mbox-names = "wkup_m3";
+- ti,mbox-data = <0 0 0 0>;
+- status = "disabled";
+ };
+
+ timer1: timer@44e31000 {
+diff --git a/arch/arm/include/asm/unistd.h b/arch/arm/include/asm/unistd.h
+index 43876245fc57..21ca0cebcab0 100644
+--- a/arch/arm/include/asm/unistd.h
++++ b/arch/arm/include/asm/unistd.h
+@@ -15,7 +15,17 @@
+
+ #include <uapi/asm/unistd.h>
+
++/*
++ * This may need to be greater than __NR_last_syscall+1 in order to
++ * account for the padding in the syscall table
++ */
+ #define __NR_syscalls (384)
++
++/*
++ * *NOTE*: This is a ghost syscall private to the kernel. Only the
++ * __kuser_cmpxchg code in entry-armv.S should be aware of its
++ * existence. Don't ever use this from user code.
++ */
+ #define __ARM_NR_cmpxchg (__ARM_NR_BASE+0x00fff0)
+
+ #define __ARCH_WANT_STAT64
+diff --git a/arch/arm/include/uapi/asm/unistd.h b/arch/arm/include/uapi/asm/unistd.h
+index ba94446c72d9..acd5b66ea3aa 100644
+--- a/arch/arm/include/uapi/asm/unistd.h
++++ b/arch/arm/include/uapi/asm/unistd.h
+@@ -411,11 +411,6 @@
+ #define __NR_renameat2 (__NR_SYSCALL_BASE+382)
+
+ /*
+- * This may need to be greater than __NR_last_syscall+1 in order to
+- * account for the padding in the syscall table
+- */
+-
+-/*
+ * The following SWIs are ARM private.
+ */
+ #define __ARM_NR_BASE (__NR_SYSCALL_BASE+0x0f0000)
+@@ -426,12 +421,6 @@
+ #define __ARM_NR_set_tls (__ARM_NR_BASE+5)
+
+ /*
+- * *NOTE*: This is a ghost syscall private to the kernel. Only the
+- * __kuser_cmpxchg code in entry-armv.S should be aware of its
+- * existence. Don't ever use this from user code.
+- */
+-
+-/*
+ * The following syscalls are obsolete and no longer available for EABI.
+ */
+ #if !defined(__KERNEL__)
+diff --git a/arch/arm/mach-omap2/control.c b/arch/arm/mach-omap2/control.c
+index 751f3549bf6f..acadac0992b6 100644
+--- a/arch/arm/mach-omap2/control.c
++++ b/arch/arm/mach-omap2/control.c
+@@ -314,7 +314,8 @@ void omap3_save_scratchpad_contents(void)
+ scratchpad_contents.public_restore_ptr =
+ virt_to_phys(omap3_restore_3630);
+ else if (omap_rev() != OMAP3430_REV_ES3_0 &&
+- omap_rev() != OMAP3430_REV_ES3_1)
++ omap_rev() != OMAP3430_REV_ES3_1 &&
++ omap_rev() != OMAP3430_REV_ES3_1_2)
+ scratchpad_contents.public_restore_ptr =
+ virt_to_phys(omap3_restore);
+ else
+diff --git a/arch/arm/mach-omap2/omap_hwmod.c b/arch/arm/mach-omap2/omap_hwmod.c
+index 6c074f37cdd2..da1b256caccc 100644
+--- a/arch/arm/mach-omap2/omap_hwmod.c
++++ b/arch/arm/mach-omap2/omap_hwmod.c
+@@ -2185,6 +2185,8 @@ static int _enable(struct omap_hwmod *oh)
+ oh->mux->pads_dynamic))) {
+ omap_hwmod_mux(oh->mux, _HWMOD_STATE_ENABLED);
+ _reconfigure_io_chain();
++ } else if (oh->flags & HWMOD_FORCE_MSTANDBY) {
++ _reconfigure_io_chain();
+ }
+
+ _add_initiator_dep(oh, mpu_oh);
+@@ -2291,6 +2293,8 @@ static int _idle(struct omap_hwmod *oh)
+ if (oh->mux && oh->mux->pads_dynamic) {
+ omap_hwmod_mux(oh->mux, _HWMOD_STATE_IDLE);
+ _reconfigure_io_chain();
++ } else if (oh->flags & HWMOD_FORCE_MSTANDBY) {
++ _reconfigure_io_chain();
+ }
+
+ oh->_state = _HWMOD_STATE_IDLE;
+diff --git a/arch/arm64/include/asm/cacheflush.h b/arch/arm64/include/asm/cacheflush.h
+index a5176cf32dad..f2defe1c380c 100644
+--- a/arch/arm64/include/asm/cacheflush.h
++++ b/arch/arm64/include/asm/cacheflush.h
+@@ -138,19 +138,10 @@ static inline void __flush_icache_all(void)
+ #define flush_icache_page(vma,page) do { } while (0)
+
+ /*
+- * flush_cache_vmap() is used when creating mappings (eg, via vmap,
+- * vmalloc, ioremap etc) in kernel space for pages. On non-VIPT
+- * caches, since the direct-mappings of these pages may contain cached
+- * data, we need to do a full cache flush to ensure that writebacks
+- * don't corrupt data placed into these pages via the new mappings.
++ * Not required on AArch64 (PIPT or VIPT non-aliasing D-cache).
+ */
+ static inline void flush_cache_vmap(unsigned long start, unsigned long end)
+ {
+- /*
+- * set_pte_at() called from vmap_pte_range() does not
+- * have a DSB after cleaning the cache line.
+- */
+- dsb(ish);
+ }
+
+ static inline void flush_cache_vunmap(unsigned long start, unsigned long end)
+diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
+index e0ccceb317d9..2a1508cdead0 100644
+--- a/arch/arm64/include/asm/pgtable.h
++++ b/arch/arm64/include/asm/pgtable.h
+@@ -138,6 +138,8 @@ extern struct page *empty_zero_page;
+
+ #define pte_valid_user(pte) \
+ ((pte_val(pte) & (PTE_VALID | PTE_USER)) == (PTE_VALID | PTE_USER))
++#define pte_valid_not_user(pte) \
++ ((pte_val(pte) & (PTE_VALID | PTE_USER)) == PTE_VALID)
+
+ static inline pte_t pte_wrprotect(pte_t pte)
+ {
+@@ -184,6 +186,15 @@ static inline pte_t pte_mkspecial(pte_t pte)
+ static inline void set_pte(pte_t *ptep, pte_t pte)
+ {
+ *ptep = pte;
++
++ /*
++ * Only if the new pte is valid and kernel, otherwise TLB maintenance
++ * or update_mmu_cache() have the necessary barriers.
++ */
++ if (pte_valid_not_user(pte)) {
++ dsb(ishst);
++ isb();
++ }
+ }
+
+ extern void __sync_icache_dcache(pte_t pteval, unsigned long addr);
+@@ -303,6 +314,7 @@ static inline void set_pmd(pmd_t *pmdp, pmd_t pmd)
+ {
+ *pmdp = pmd;
+ dsb(ishst);
++ isb();
+ }
+
+ static inline void pmd_clear(pmd_t *pmdp)
+@@ -333,6 +345,7 @@ static inline void set_pud(pud_t *pudp, pud_t pud)
+ {
+ *pudp = pud;
+ dsb(ishst);
++ isb();
+ }
+
+ static inline void pud_clear(pud_t *pudp)
+diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
+index b9349c4513ea..3796ea6bb734 100644
+--- a/arch/arm64/include/asm/tlbflush.h
++++ b/arch/arm64/include/asm/tlbflush.h
+@@ -122,6 +122,7 @@ static inline void flush_tlb_kernel_range(unsigned long start, unsigned long end
+ for (addr = start; addr < end; addr += 1 << (PAGE_SHIFT - 12))
+ asm("tlbi vaae1is, %0" : : "r"(addr));
+ dsb(ish);
++ isb();
+ }
+
+ /*
+@@ -131,8 +132,8 @@ static inline void update_mmu_cache(struct vm_area_struct *vma,
+ unsigned long addr, pte_t *ptep)
+ {
+ /*
+- * set_pte() does not have a DSB, so make sure that the page table
+- * write is visible.
++ * set_pte() does not have a DSB for user mappings, so make sure that
++ * the page table write is visible.
+ */
+ dsb(ishst);
+ }
+diff --git a/arch/arm64/kernel/debug-monitors.c b/arch/arm64/kernel/debug-monitors.c
+index a7fb874b595e..fe5b94078d82 100644
+--- a/arch/arm64/kernel/debug-monitors.c
++++ b/arch/arm64/kernel/debug-monitors.c
+@@ -315,20 +315,20 @@ static int brk_handler(unsigned long addr, unsigned int esr,
+ {
+ siginfo_t info;
+
+- if (call_break_hook(regs, esr) == DBG_HOOK_HANDLED)
+- return 0;
++ if (user_mode(regs)) {
++ info = (siginfo_t) {
++ .si_signo = SIGTRAP,
++ .si_errno = 0,
++ .si_code = TRAP_BRKPT,
++ .si_addr = (void __user *)instruction_pointer(regs),
++ };
+
+- if (!user_mode(regs))
++ force_sig_info(SIGTRAP, &info, current);
++ } else if (call_break_hook(regs, esr) != DBG_HOOK_HANDLED) {
++ pr_warning("Unexpected kernel BRK exception at EL1\n");
+ return -EFAULT;
++ }
+
+- info = (siginfo_t) {
+- .si_signo = SIGTRAP,
+- .si_errno = 0,
+- .si_code = TRAP_BRKPT,
+- .si_addr = (void __user *)instruction_pointer(regs),
+- };
+-
+- force_sig_info(SIGTRAP, &info, current);
+ return 0;
+ }
+
+diff --git a/arch/arm64/kernel/efi.c b/arch/arm64/kernel/efi.c
+index 14db1f6e8d7f..c0aead7d1a72 100644
+--- a/arch/arm64/kernel/efi.c
++++ b/arch/arm64/kernel/efi.c
+@@ -464,6 +464,8 @@ static int __init arm64_enter_virtual_mode(void)
+
+ set_bit(EFI_RUNTIME_SERVICES, &efi.flags);
+
++ efi.runtime_version = efi.systab->hdr.revision;
++
+ return 0;
+ }
+ early_initcall(arm64_enter_virtual_mode);
+diff --git a/arch/mips/math-emu/cp1emu.c b/arch/mips/math-emu/cp1emu.c
+index 736c17a226e9..bf0fc6b16ad9 100644
+--- a/arch/mips/math-emu/cp1emu.c
++++ b/arch/mips/math-emu/cp1emu.c
+@@ -1827,7 +1827,7 @@ dcopuop:
+ case -1:
+
+ if (cpu_has_mips_4_5_r)
+- cbit = fpucondbit[MIPSInst_RT(ir) >> 2];
++ cbit = fpucondbit[MIPSInst_FD(ir) >> 2];
+ else
+ cbit = FPU_CSR_COND;
+ if (rv.w)
+diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
+index 2bc4a9409a93..de7d426a9b0c 100644
+--- a/arch/powerpc/include/uapi/asm/kvm.h
++++ b/arch/powerpc/include/uapi/asm/kvm.h
+@@ -548,6 +548,7 @@ struct kvm_get_htab_header {
+
+ #define KVM_REG_PPC_VRSAVE (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0xb4)
+ #define KVM_REG_PPC_LPCR (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0xb5)
++#define KVM_REG_PPC_LPCR_64 (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xb5)
+ #define KVM_REG_PPC_PPR (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xb6)
+
+ /* Architecture compatibility level */
+diff --git a/arch/powerpc/kernel/eeh_pe.c b/arch/powerpc/kernel/eeh_pe.c
+index fbd01eba4473..94802d267022 100644
+--- a/arch/powerpc/kernel/eeh_pe.c
++++ b/arch/powerpc/kernel/eeh_pe.c
+@@ -802,53 +802,33 @@ void eeh_pe_restore_bars(struct eeh_pe *pe)
+ */
+ const char *eeh_pe_loc_get(struct eeh_pe *pe)
+ {
+- struct pci_controller *hose;
+ struct pci_bus *bus = eeh_pe_bus_get(pe);
+- struct pci_dev *pdev;
+- struct device_node *dn;
+- const char *loc;
++ struct device_node *dn = pci_bus_to_OF_node(bus);
++ const char *loc = NULL;
+
+- if (!bus)
+- return "N/A";
++ if (!dn)
++ goto out;
+
+ /* PHB PE or root PE ? */
+ if (pci_is_root_bus(bus)) {
+- hose = pci_bus_to_host(bus);
+- loc = of_get_property(hose->dn,
+- "ibm,loc-code", NULL);
+- if (loc)
+- return loc;
+- loc = of_get_property(hose->dn,
+- "ibm,io-base-loc-code", NULL);
++ loc = of_get_property(dn, "ibm,loc-code", NULL);
++ if (!loc)
++ loc = of_get_property(dn, "ibm,io-base-loc-code", NULL);
+ if (loc)
+- return loc;
+-
+- pdev = pci_get_slot(bus, 0x0);
+- } else {
+- pdev = bus->self;
+- }
+-
+- if (!pdev) {
+- loc = "N/A";
+- goto out;
+- }
++ goto out;
+
+- dn = pci_device_to_OF_node(pdev);
+- if (!dn) {
+- loc = "N/A";
+- goto out;
++ /* Check the root port */
++ dn = dn->child;
++ if (!dn)
++ goto out;
+ }
+
+ loc = of_get_property(dn, "ibm,loc-code", NULL);
+ if (!loc)
+ loc = of_get_property(dn, "ibm,slot-location-code", NULL);
+- if (!loc)
+- loc = "N/A";
+
+ out:
+- if (pci_is_root_bus(bus) && pdev)
+- pci_dev_put(pdev);
+- return loc;
++ return loc ? loc : "N/A";
+ }
+
+ /**
+diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
+index 7a12edbb61e7..0f3a19237444 100644
+--- a/arch/powerpc/kvm/book3s_hv.c
++++ b/arch/powerpc/kvm/book3s_hv.c
+@@ -785,7 +785,8 @@ static int kvm_arch_vcpu_ioctl_set_sregs_hv(struct kvm_vcpu *vcpu,
+ return 0;
+ }
+
+-static void kvmppc_set_lpcr(struct kvm_vcpu *vcpu, u64 new_lpcr)
++static void kvmppc_set_lpcr(struct kvm_vcpu *vcpu, u64 new_lpcr,
++ bool preserve_top32)
+ {
+ struct kvmppc_vcore *vc = vcpu->arch.vcore;
+ u64 mask;
+@@ -820,6 +821,10 @@ static void kvmppc_set_lpcr(struct kvm_vcpu *vcpu, u64 new_lpcr)
+ mask = LPCR_DPFD | LPCR_ILE | LPCR_TC;
+ if (cpu_has_feature(CPU_FTR_ARCH_207S))
+ mask |= LPCR_AIL;
++
++ /* Broken 32-bit version of LPCR must not clear top bits */
++ if (preserve_top32)
++ mask &= 0xFFFFFFFF;
+ vc->lpcr = (vc->lpcr & ~mask) | (new_lpcr & mask);
+ spin_unlock(&vc->lock);
+ }
+@@ -939,6 +944,7 @@ static int kvmppc_get_one_reg_hv(struct kvm_vcpu *vcpu, u64 id,
+ *val = get_reg_val(id, vcpu->arch.vcore->tb_offset);
+ break;
+ case KVM_REG_PPC_LPCR:
++ case KVM_REG_PPC_LPCR_64:
+ *val = get_reg_val(id, vcpu->arch.vcore->lpcr);
+ break;
+ case KVM_REG_PPC_PPR:
+@@ -1150,7 +1156,10 @@ static int kvmppc_set_one_reg_hv(struct kvm_vcpu *vcpu, u64 id,
+ ALIGN(set_reg_val(id, *val), 1UL << 24);
+ break;
+ case KVM_REG_PPC_LPCR:
+- kvmppc_set_lpcr(vcpu, set_reg_val(id, *val));
++ kvmppc_set_lpcr(vcpu, set_reg_val(id, *val), true);
++ break;
++ case KVM_REG_PPC_LPCR_64:
++ kvmppc_set_lpcr(vcpu, set_reg_val(id, *val), false);
+ break;
+ case KVM_REG_PPC_PPR:
+ vcpu->arch.ppr = set_reg_val(id, *val);
+diff --git a/arch/powerpc/kvm/book3s_pr.c b/arch/powerpc/kvm/book3s_pr.c
+index 8eef1e519077..66b7afec250f 100644
+--- a/arch/powerpc/kvm/book3s_pr.c
++++ b/arch/powerpc/kvm/book3s_pr.c
+@@ -1233,6 +1233,7 @@ static int kvmppc_get_one_reg_pr(struct kvm_vcpu *vcpu, u64 id,
+ *val = get_reg_val(id, to_book3s(vcpu)->hior);
+ break;
+ case KVM_REG_PPC_LPCR:
++ case KVM_REG_PPC_LPCR_64:
+ /*
+ * We are only interested in the LPCR_ILE bit
+ */
+@@ -1268,6 +1269,7 @@ static int kvmppc_set_one_reg_pr(struct kvm_vcpu *vcpu, u64 id,
+ to_book3s(vcpu)->hior_explicit = true;
+ break;
+ case KVM_REG_PPC_LPCR:
++ case KVM_REG_PPC_LPCR_64:
+ kvmppc_set_lpcr_pr(vcpu, set_reg_val(id, *val));
+ break;
+ default:
+diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
+index de19edeaa7a7..3136ae2f75af 100644
+--- a/arch/powerpc/platforms/powernv/pci-ioda.c
++++ b/arch/powerpc/platforms/powernv/pci-ioda.c
+@@ -491,6 +491,7 @@ static int pnv_pci_ioda_dma_set_mask(struct pnv_phb *phb,
+ set_dma_ops(&pdev->dev, &dma_iommu_ops);
+ set_iommu_table_base(&pdev->dev, &pe->tce32_table);
+ }
++ *pdev->dev.dma_mask = dma_mask;
+ return 0;
+ }
+
+diff --git a/arch/powerpc/platforms/pseries/pci_dlpar.c b/arch/powerpc/platforms/pseries/pci_dlpar.c
+index 203cbf0dc101..89e23811199c 100644
+--- a/arch/powerpc/platforms/pseries/pci_dlpar.c
++++ b/arch/powerpc/platforms/pseries/pci_dlpar.c
+@@ -118,10 +118,10 @@ int remove_phb_dynamic(struct pci_controller *phb)
+ }
+ }
+
+- /* Unregister the bridge device from sysfs and remove the PCI bus */
+- device_unregister(b->bridge);
++ /* Remove the PCI bus and unregister the bridge device from sysfs */
+ phb->bus = NULL;
+ pci_remove_bus(b);
++ device_unregister(b->bridge);
+
+ /* Now release the IO resource */
+ if (res->flags & IORESOURCE_IO)
+diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
+index 37b8241ec784..f90ad8592b36 100644
+--- a/arch/s390/mm/pgtable.c
++++ b/arch/s390/mm/pgtable.c
+@@ -1279,6 +1279,7 @@ static unsigned long page_table_realloc_pmd(struct mmu_gather *tlb,
+ {
+ unsigned long next, *table, *new;
+ struct page *page;
++ spinlock_t *ptl;
+ pmd_t *pmd;
+
+ pmd = pmd_offset(pud, addr);
+@@ -1296,7 +1297,7 @@ again:
+ if (!new)
+ return -ENOMEM;
+
+- spin_lock(&mm->page_table_lock);
++ ptl = pmd_lock(mm, pmd);
+ if (likely((unsigned long *) pmd_deref(*pmd) == table)) {
+ /* Nuke pmd entry pointing to the "short" page table */
+ pmdp_flush_lazy(mm, addr, pmd);
+@@ -1310,7 +1311,7 @@ again:
+ page_table_free_rcu(tlb, table);
+ new = NULL;
+ }
+- spin_unlock(&mm->page_table_lock);
++ spin_unlock(ptl);
+ if (new) {
+ page_table_free_pgste(new);
+ goto again;
+diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
+index d24887b645dc..27adfd902c6f 100644
+--- a/arch/x86/Kconfig
++++ b/arch/x86/Kconfig
+@@ -1537,6 +1537,7 @@ config EFI
+ config EFI_STUB
+ bool "EFI stub support"
+ depends on EFI
++ select RELOCATABLE
+ ---help---
+ This kernel feature allows a bzImage to be loaded directly
+ by EFI firmware without the use of a bootloader.
+diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
+index 49205d01b9ad..9f83c171ac18 100644
+--- a/arch/x86/include/asm/kvm_host.h
++++ b/arch/x86/include/asm/kvm_host.h
+@@ -95,7 +95,7 @@ static inline gfn_t gfn_to_index(gfn_t gfn, gfn_t base_gfn, int level)
+ #define KVM_REFILL_PAGES 25
+ #define KVM_MAX_CPUID_ENTRIES 80
+ #define KVM_NR_FIXED_MTRR_REGION 88
+-#define KVM_NR_VAR_MTRR 10
++#define KVM_NR_VAR_MTRR 8
+
+ #define ASYNC_PF_PER_VCPU 64
+
+diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
+index 0ec056012618..aa97a070f09f 100644
+--- a/arch/x86/include/asm/pgtable.h
++++ b/arch/x86/include/asm/pgtable.h
+@@ -131,8 +131,13 @@ static inline int pte_exec(pte_t pte)
+
+ static inline int pte_special(pte_t pte)
+ {
+- return (pte_flags(pte) & (_PAGE_PRESENT|_PAGE_SPECIAL)) ==
+- (_PAGE_PRESENT|_PAGE_SPECIAL);
++ /*
++ * See CONFIG_NUMA_BALANCING pte_numa in include/asm-generic/pgtable.h.
++ * On x86 we have _PAGE_BIT_NUMA == _PAGE_BIT_GLOBAL+1 ==
++ * __PAGE_BIT_SOFTW1 == _PAGE_BIT_SPECIAL.
++ */
++ return (pte_flags(pte) & _PAGE_SPECIAL) &&
++ (pte_flags(pte) & (_PAGE_PRESENT|_PAGE_PROTNONE));
+ }
+
+ static inline unsigned long pte_pfn(pte_t pte)
+diff --git a/arch/x86/kernel/cpu/mcheck/mce_intel.c b/arch/x86/kernel/cpu/mcheck/mce_intel.c
+index 9a316b21df8b..3bdb95ae8c43 100644
+--- a/arch/x86/kernel/cpu/mcheck/mce_intel.c
++++ b/arch/x86/kernel/cpu/mcheck/mce_intel.c
+@@ -42,7 +42,7 @@ static DEFINE_PER_CPU(mce_banks_t, mce_banks_owned);
+ * cmci_discover_lock protects against parallel discovery attempts
+ * which could race against each other.
+ */
+-static DEFINE_SPINLOCK(cmci_discover_lock);
++static DEFINE_RAW_SPINLOCK(cmci_discover_lock);
+
+ #define CMCI_THRESHOLD 1
+ #define CMCI_POLL_INTERVAL (30 * HZ)
+@@ -144,14 +144,14 @@ static void cmci_storm_disable_banks(void)
+ int bank;
+ u64 val;
+
+- spin_lock_irqsave(&cmci_discover_lock, flags);
++ raw_spin_lock_irqsave(&cmci_discover_lock, flags);
+ owned = __get_cpu_var(mce_banks_owned);
+ for_each_set_bit(bank, owned, MAX_NR_BANKS) {
+ rdmsrl(MSR_IA32_MCx_CTL2(bank), val);
+ val &= ~MCI_CTL2_CMCI_EN;
+ wrmsrl(MSR_IA32_MCx_CTL2(bank), val);
+ }
+- spin_unlock_irqrestore(&cmci_discover_lock, flags);
++ raw_spin_unlock_irqrestore(&cmci_discover_lock, flags);
+ }
+
+ static bool cmci_storm_detect(void)
+@@ -211,7 +211,7 @@ static void cmci_discover(int banks)
+ int i;
+ int bios_wrong_thresh = 0;
+
+- spin_lock_irqsave(&cmci_discover_lock, flags);
++ raw_spin_lock_irqsave(&cmci_discover_lock, flags);
+ for (i = 0; i < banks; i++) {
+ u64 val;
+ int bios_zero_thresh = 0;
+@@ -266,7 +266,7 @@ static void cmci_discover(int banks)
+ WARN_ON(!test_bit(i, __get_cpu_var(mce_poll_banks)));
+ }
+ }
+- spin_unlock_irqrestore(&cmci_discover_lock, flags);
++ raw_spin_unlock_irqrestore(&cmci_discover_lock, flags);
+ if (mca_cfg.bios_cmci_threshold && bios_wrong_thresh) {
+ pr_info_once(
+ "bios_cmci_threshold: Some banks do not have valid thresholds set\n");
+@@ -316,10 +316,10 @@ void cmci_clear(void)
+
+ if (!cmci_supported(&banks))
+ return;
+- spin_lock_irqsave(&cmci_discover_lock, flags);
++ raw_spin_lock_irqsave(&cmci_discover_lock, flags);
+ for (i = 0; i < banks; i++)
+ __cmci_disable_bank(i);
+- spin_unlock_irqrestore(&cmci_discover_lock, flags);
++ raw_spin_unlock_irqrestore(&cmci_discover_lock, flags);
+ }
+
+ static void cmci_rediscover_work_func(void *arg)
+@@ -360,9 +360,9 @@ void cmci_disable_bank(int bank)
+ if (!cmci_supported(&banks))
+ return;
+
+- spin_lock_irqsave(&cmci_discover_lock, flags);
++ raw_spin_lock_irqsave(&cmci_discover_lock, flags);
+ __cmci_disable_bank(bank);
+- spin_unlock_irqrestore(&cmci_discover_lock, flags);
++ raw_spin_unlock_irqrestore(&cmci_discover_lock, flags);
+ }
+
+ static void intel_init_cmci(void)
+diff --git a/arch/x86/kernel/resource.c b/arch/x86/kernel/resource.c
+index 2a26819bb6a8..80eab01c1a68 100644
+--- a/arch/x86/kernel/resource.c
++++ b/arch/x86/kernel/resource.c
+@@ -37,10 +37,12 @@ static void remove_e820_regions(struct resource *avail)
+
+ void arch_remove_reservations(struct resource *avail)
+ {
+- /* Trim out BIOS areas (low 1MB and high 2MB) and E820 regions */
++ /*
++ * Trim out BIOS area (high 2MB) and E820 regions. We do not remove
++ * the low 1MB unconditionally, as this area is needed for some ISA
++ * cards requiring a memory range, e.g. the i82365 PCMCIA controller.
++ */
+ if (avail->flags & IORESOURCE_MEM) {
+- if (avail->start < BIOS_END)
+- avail->start = BIOS_END;
+ resource_clip(avail, BIOS_ROM_BASE, BIOS_ROM_END);
+
+ remove_e820_regions(avail);
+diff --git a/arch/x86/kernel/vsyscall_64.c b/arch/x86/kernel/vsyscall_64.c
+index ea5b5709aa76..e1e1e80fc6a6 100644
+--- a/arch/x86/kernel/vsyscall_64.c
++++ b/arch/x86/kernel/vsyscall_64.c
+@@ -81,10 +81,10 @@ static void warn_bad_vsyscall(const char *level, struct pt_regs *regs,
+ if (!show_unhandled_signals)
+ return;
+
+- pr_notice_ratelimited("%s%s[%d] %s ip:%lx cs:%lx sp:%lx ax:%lx si:%lx di:%lx\n",
+- level, current->comm, task_pid_nr(current),
+- message, regs->ip, regs->cs,
+- regs->sp, regs->ax, regs->si, regs->di);
++ printk_ratelimited("%s%s[%d] %s ip:%lx cs:%lx sp:%lx ax:%lx si:%lx di:%lx\n",
++ level, current->comm, task_pid_nr(current),
++ message, regs->ip, regs->cs,
++ regs->sp, regs->ax, regs->si, regs->di);
+ }
+
+ static int addr_to_vsyscall_nr(unsigned long addr)
+diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
+index e4e833d3d7d7..2d3b8d0efa0f 100644
+--- a/arch/x86/kvm/emulate.c
++++ b/arch/x86/kvm/emulate.c
+@@ -2017,6 +2017,7 @@ static int em_ret_far(struct x86_emulate_ctxt *ctxt)
+ {
+ int rc;
+ unsigned long cs;
++ int cpl = ctxt->ops->cpl(ctxt);
+
+ rc = emulate_pop(ctxt, &ctxt->_eip, ctxt->op_bytes);
+ if (rc != X86EMUL_CONTINUE)
+@@ -2026,6 +2027,9 @@ static int em_ret_far(struct x86_emulate_ctxt *ctxt)
+ rc = emulate_pop(ctxt, &cs, ctxt->op_bytes);
+ if (rc != X86EMUL_CONTINUE)
+ return rc;
++ /* Outer-privilege level return is not implemented */
++ if (ctxt->mode >= X86EMUL_MODE_PROT16 && (cs & 3) > cpl)
++ return X86EMUL_UNHANDLEABLE;
+ rc = load_segment_descriptor(ctxt, (u16)cs, VCPU_SREG_CS);
+ return rc;
+ }
+diff --git a/arch/x86/kvm/irq.c b/arch/x86/kvm/irq.c
+index bd0da433e6d7..a1ec6a50a05a 100644
+--- a/arch/x86/kvm/irq.c
++++ b/arch/x86/kvm/irq.c
+@@ -108,7 +108,7 @@ int kvm_cpu_get_interrupt(struct kvm_vcpu *v)
+
+ vector = kvm_cpu_get_extint(v);
+
+- if (kvm_apic_vid_enabled(v->kvm) || vector != -1)
++ if (vector != -1)
+ return vector; /* PIC */
+
+ return kvm_get_apic_interrupt(v); /* APIC */
+diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
+index 006911858174..453e5fbbb7ae 100644
+--- a/arch/x86/kvm/lapic.c
++++ b/arch/x86/kvm/lapic.c
+@@ -352,25 +352,46 @@ static inline int apic_find_highest_irr(struct kvm_lapic *apic)
+
+ static inline void apic_clear_irr(int vec, struct kvm_lapic *apic)
+ {
+- apic->irr_pending = false;
++ struct kvm_vcpu *vcpu;
++
++ vcpu = apic->vcpu;
++
+ apic_clear_vector(vec, apic->regs + APIC_IRR);
+- if (apic_search_irr(apic) != -1)
+- apic->irr_pending = true;
++ if (unlikely(kvm_apic_vid_enabled(vcpu->kvm)))
++ /* try to update RVI */
++ kvm_make_request(KVM_REQ_EVENT, vcpu);
++ else {
++ vec = apic_search_irr(apic);
++ apic->irr_pending = (vec != -1);
++ }
+ }
+
+ static inline void apic_set_isr(int vec, struct kvm_lapic *apic)
+ {
+- /* Note that we never get here with APIC virtualization enabled. */
++ struct kvm_vcpu *vcpu;
++
++ if (__apic_test_and_set_vector(vec, apic->regs + APIC_ISR))
++ return;
++
++ vcpu = apic->vcpu;
+
+- if (!__apic_test_and_set_vector(vec, apic->regs + APIC_ISR))
+- ++apic->isr_count;
+- BUG_ON(apic->isr_count > MAX_APIC_VECTOR);
+ /*
+- * ISR (in service register) bit is set when injecting an interrupt.
+- * The highest vector is injected. Thus the latest bit set matches
+- * the highest bit in ISR.
++ * With APIC virtualization enabled, all caching is disabled
++ * because the processor can modify ISR under the hood. Instead
++ * just set SVI.
+ */
+- apic->highest_isr_cache = vec;
++ if (unlikely(kvm_apic_vid_enabled(vcpu->kvm)))
++ kvm_x86_ops->hwapic_isr_update(vcpu->kvm, vec);
++ else {
++ ++apic->isr_count;
++ BUG_ON(apic->isr_count > MAX_APIC_VECTOR);
++ /*
++ * ISR (in service register) bit is set when injecting an interrupt.
++ * The highest vector is injected. Thus the latest bit set matches
++ * the highest bit in ISR.
++ */
++ apic->highest_isr_cache = vec;
++ }
+ }
+
+ static inline int apic_find_highest_isr(struct kvm_lapic *apic)
+@@ -1627,11 +1648,16 @@ int kvm_get_apic_interrupt(struct kvm_vcpu *vcpu)
+ int vector = kvm_apic_has_interrupt(vcpu);
+ struct kvm_lapic *apic = vcpu->arch.apic;
+
+- /* Note that we never get here with APIC virtualization enabled. */
+-
+ if (vector == -1)
+ return -1;
+
++ /*
++ * We get here even with APIC virtualization enabled, if doing
++ * nested virtualization and L1 runs with the "acknowledge interrupt
++ * on exit" mode. Then we cannot inject the interrupt via RVI,
++ * because the process would deliver it through the IDT.
++ */
++
+ apic_set_isr(vector, apic);
+ apic_update_ppr(apic);
+ apic_clear_irr(vector, apic);
+diff --git a/arch/x86/pci/i386.c b/arch/x86/pci/i386.c
+index a19ed92e74e4..2ae525e0d8ba 100644
+--- a/arch/x86/pci/i386.c
++++ b/arch/x86/pci/i386.c
+@@ -162,6 +162,10 @@ pcibios_align_resource(void *data, const struct resource *res,
+ return start;
+ if (start & 0x300)
+ start = (start + 0x3ff) & ~0x3ff;
++ } else if (res->flags & IORESOURCE_MEM) {
++ /* The low 1MB range is reserved for ISA cards */
++ if (start < BIOS_END)
++ start = BIOS_END;
+ }
+ return start;
+ }
+diff --git a/arch/x86/xen/grant-table.c b/arch/x86/xen/grant-table.c
+index ebfa9b2c871d..767c9cbb869f 100644
+--- a/arch/x86/xen/grant-table.c
++++ b/arch/x86/xen/grant-table.c
+@@ -168,6 +168,7 @@ static int __init xlated_setup_gnttab_pages(void)
+ {
+ struct page **pages;
+ xen_pfn_t *pfns;
++ void *vaddr;
+ int rc;
+ unsigned int i;
+ unsigned long nr_grant_frames = gnttab_max_grant_frames();
+@@ -193,21 +194,20 @@ static int __init xlated_setup_gnttab_pages(void)
+ for (i = 0; i < nr_grant_frames; i++)
+ pfns[i] = page_to_pfn(pages[i]);
+
+- rc = arch_gnttab_map_shared(pfns, nr_grant_frames, nr_grant_frames,
+- &xen_auto_xlat_grant_frames.vaddr);
+-
+- if (rc) {
++ vaddr = vmap(pages, nr_grant_frames, 0, PAGE_KERNEL);
++ if (!vaddr) {
+ pr_warn("%s Couldn't map %ld pfns rc:%d\n", __func__,
+ nr_grant_frames, rc);
+ free_xenballooned_pages(nr_grant_frames, pages);
+ kfree(pages);
+ kfree(pfns);
+- return rc;
++ return -ENOMEM;
+ }
+ kfree(pages);
+
+ xen_auto_xlat_grant_frames.pfn = pfns;
+ xen_auto_xlat_grant_frames.count = nr_grant_frames;
++ xen_auto_xlat_grant_frames.vaddr = vaddr;
+
+ return 0;
+ }
+diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
+index 7b78f88c1707..5718b0b58b60 100644
+--- a/arch/x86/xen/time.c
++++ b/arch/x86/xen/time.c
+@@ -444,7 +444,7 @@ void xen_setup_timer(int cpu)
+
+ irq = bind_virq_to_irqhandler(VIRQ_TIMER, cpu, xen_timer_interrupt,
+ IRQF_PERCPU|IRQF_NOBALANCING|IRQF_TIMER|
+- IRQF_FORCE_RESUME,
++ IRQF_FORCE_RESUME|IRQF_EARLY_RESUME,
+ name, NULL);
+ (void)xen_set_irq_priority(irq, XEN_IRQ_PRIORITY_MAX);
+
+diff --git a/drivers/char/tpm/tpm_i2c_stm_st33.c b/drivers/char/tpm/tpm_i2c_stm_st33.c
+index 3b7bf2162898..4669e3713428 100644
+--- a/drivers/char/tpm/tpm_i2c_stm_st33.c
++++ b/drivers/char/tpm/tpm_i2c_stm_st33.c
+@@ -714,6 +714,7 @@ tpm_st33_i2c_probe(struct i2c_client *client, const struct i2c_device_id *id)
+ }
+
+ tpm_get_timeouts(chip);
++ tpm_do_selftest(chip);
+
+ dev_info(chip->dev, "TPM I2C Initialized\n");
+ return 0;
+diff --git a/drivers/crypto/ux500/cryp/cryp_core.c b/drivers/crypto/ux500/cryp/cryp_core.c
+index a999f537228f..92105f3dc8e0 100644
+--- a/drivers/crypto/ux500/cryp/cryp_core.c
++++ b/drivers/crypto/ux500/cryp/cryp_core.c
+@@ -190,7 +190,7 @@ static void add_session_id(struct cryp_ctx *ctx)
+ static irqreturn_t cryp_interrupt_handler(int irq, void *param)
+ {
+ struct cryp_ctx *ctx;
+- int i;
++ int count;
+ struct cryp_device_data *device_data;
+
+ if (param == NULL) {
+@@ -215,12 +215,11 @@ static irqreturn_t cryp_interrupt_handler(int irq, void *param)
+ if (cryp_pending_irq_src(device_data,
+ CRYP_IRQ_SRC_OUTPUT_FIFO)) {
+ if (ctx->outlen / ctx->blocksize > 0) {
+- for (i = 0; i < ctx->blocksize / 4; i++) {
+- *(ctx->outdata) = readl_relaxed(
+- &device_data->base->dout);
+- ctx->outdata += 4;
+- ctx->outlen -= 4;
+- }
++ count = ctx->blocksize / 4;
++
++ readsl(&device_data->base->dout, ctx->outdata, count);
++ ctx->outdata += count;
++ ctx->outlen -= count;
+
+ if (ctx->outlen == 0) {
+ cryp_disable_irq_src(device_data,
+@@ -230,12 +229,12 @@ static irqreturn_t cryp_interrupt_handler(int irq, void *param)
+ } else if (cryp_pending_irq_src(device_data,
+ CRYP_IRQ_SRC_INPUT_FIFO)) {
+ if (ctx->datalen / ctx->blocksize > 0) {
+- for (i = 0 ; i < ctx->blocksize / 4; i++) {
+- writel_relaxed(ctx->indata,
+- &device_data->base->din);
+- ctx->indata += 4;
+- ctx->datalen -= 4;
+- }
++ count = ctx->blocksize / 4;
++
++ writesl(&device_data->base->din, ctx->indata, count);
++
++ ctx->indata += count;
++ ctx->datalen -= count;
+
+ if (ctx->datalen == 0)
+ cryp_disable_irq_src(device_data,
+diff --git a/drivers/gpu/drm/omapdrm/omap_dmm_tiler.c b/drivers/gpu/drm/omapdrm/omap_dmm_tiler.c
+index f926b4caf449..56c60552abba 100644
+--- a/drivers/gpu/drm/omapdrm/omap_dmm_tiler.c
++++ b/drivers/gpu/drm/omapdrm/omap_dmm_tiler.c
+@@ -199,7 +199,7 @@ static struct dmm_txn *dmm_txn_init(struct dmm *dmm, struct tcm *tcm)
+ static void dmm_txn_append(struct dmm_txn *txn, struct pat_area *area,
+ struct page **pages, uint32_t npages, uint32_t roll)
+ {
+- dma_addr_t pat_pa = 0;
++ dma_addr_t pat_pa = 0, data_pa = 0;
+ uint32_t *data;
+ struct pat *pat;
+ struct refill_engine *engine = txn->engine_handle;
+@@ -223,7 +223,9 @@ static void dmm_txn_append(struct dmm_txn *txn, struct pat_area *area,
+ .lut_id = engine->tcm->lut_id,
+ };
+
+- data = alloc_dma(txn, 4*i, &pat->data_pa);
++ data = alloc_dma(txn, 4*i, &data_pa);
++ /* FIXME: what if data_pa is more than 32-bit ? */
++ pat->data_pa = data_pa;
+
+ while (i--) {
+ int n = i + roll;
+diff --git a/drivers/gpu/drm/omapdrm/omap_gem.c b/drivers/gpu/drm/omapdrm/omap_gem.c
+index 95dbce286a41..d9f5e5241af4 100644
+--- a/drivers/gpu/drm/omapdrm/omap_gem.c
++++ b/drivers/gpu/drm/omapdrm/omap_gem.c
+@@ -791,7 +791,7 @@ int omap_gem_get_paddr(struct drm_gem_object *obj,
+ omap_obj->paddr = tiler_ssptr(block);
+ omap_obj->block = block;
+
+- DBG("got paddr: %08x", omap_obj->paddr);
++ DBG("got paddr: %pad", &omap_obj->paddr);
+ }
+
+ omap_obj->paddr_cnt++;
+@@ -985,9 +985,9 @@ void omap_gem_describe(struct drm_gem_object *obj, struct seq_file *m)
+
+ off = drm_vma_node_start(&obj->vma_node);
+
+- seq_printf(m, "%08x: %2d (%2d) %08llx %08Zx (%2d) %p %4d",
++ seq_printf(m, "%08x: %2d (%2d) %08llx %pad (%2d) %p %4d",
+ omap_obj->flags, obj->name, obj->refcount.refcount.counter,
+- off, omap_obj->paddr, omap_obj->paddr_cnt,
++ off, &omap_obj->paddr, omap_obj->paddr_cnt,
+ omap_obj->vaddr, omap_obj->roll);
+
+ if (omap_obj->flags & OMAP_BO_TILED) {
+@@ -1467,8 +1467,8 @@ void omap_gem_init(struct drm_device *dev)
+ entry->paddr = tiler_ssptr(block);
+ entry->block = block;
+
+- DBG("%d:%d: %dx%d: paddr=%08x stride=%d", i, j, w, h,
+- entry->paddr,
++ DBG("%d:%d: %dx%d: paddr=%pad stride=%d", i, j, w, h,
++ &entry->paddr,
+ usergart[i].stride_pfn << PAGE_SHIFT);
+ }
+ }
+diff --git a/drivers/gpu/drm/omapdrm/omap_plane.c b/drivers/gpu/drm/omapdrm/omap_plane.c
+index 3cf31ee59aac..6af3398b5278 100644
+--- a/drivers/gpu/drm/omapdrm/omap_plane.c
++++ b/drivers/gpu/drm/omapdrm/omap_plane.c
+@@ -142,8 +142,8 @@ static void omap_plane_pre_apply(struct omap_drm_apply *apply)
+ DBG("%dx%d -> %dx%d (%d)", info->width, info->height,
+ info->out_width, info->out_height,
+ info->screen_width);
+- DBG("%d,%d %08x %08x", info->pos_x, info->pos_y,
+- info->paddr, info->p_uv_addr);
++ DBG("%d,%d %pad %pad", info->pos_x, info->pos_y,
++ &info->paddr, &info->p_uv_addr);
+
+ /* TODO: */
+ ilace = false;
+diff --git a/drivers/gpu/drm/radeon/cik.c b/drivers/gpu/drm/radeon/cik.c
+index c0ea66192fe0..767f2cc44bd8 100644
+--- a/drivers/gpu/drm/radeon/cik.c
++++ b/drivers/gpu/drm/radeon/cik.c
+@@ -3320,6 +3320,7 @@ static void cik_gpu_init(struct radeon_device *rdev)
+ (rdev->pdev->device == 0x130B) ||
+ (rdev->pdev->device == 0x130E) ||
+ (rdev->pdev->device == 0x1315) ||
++ (rdev->pdev->device == 0x1318) ||
+ (rdev->pdev->device == 0x131B)) {
+ rdev->config.cik.max_cu_per_sh = 4;
+ rdev->config.cik.max_backends_per_se = 1;
+diff --git a/drivers/hid/hid-cherry.c b/drivers/hid/hid-cherry.c
+index 1bdcccc54a1d..f745d2c1325e 100644
+--- a/drivers/hid/hid-cherry.c
++++ b/drivers/hid/hid-cherry.c
+@@ -28,7 +28,7 @@
+ static __u8 *ch_report_fixup(struct hid_device *hdev, __u8 *rdesc,
+ unsigned int *rsize)
+ {
+- if (*rsize >= 17 && rdesc[11] == 0x3c && rdesc[12] == 0x02) {
++ if (*rsize >= 18 && rdesc[11] == 0x3c && rdesc[12] == 0x02) {
+ hid_info(hdev, "fixing up Cherry Cymotion report descriptor\n");
+ rdesc[11] = rdesc[16] = 0xff;
+ rdesc[12] = rdesc[17] = 0x03;
+diff --git a/drivers/hid/hid-kye.c b/drivers/hid/hid-kye.c
+index e77696367591..b92bf01a1ae8 100644
+--- a/drivers/hid/hid-kye.c
++++ b/drivers/hid/hid-kye.c
+@@ -300,7 +300,7 @@ static __u8 *kye_report_fixup(struct hid_device *hdev, __u8 *rdesc,
+ * - change the button usage range to 4-7 for the extra
+ * buttons
+ */
+- if (*rsize >= 74 &&
++ if (*rsize >= 75 &&
+ rdesc[61] == 0x05 && rdesc[62] == 0x08 &&
+ rdesc[63] == 0x19 && rdesc[64] == 0x08 &&
+ rdesc[65] == 0x29 && rdesc[66] == 0x0f &&
+diff --git a/drivers/hid/hid-lg.c b/drivers/hid/hid-lg.c
+index a976f48263f6..f91ff145db9a 100644
+--- a/drivers/hid/hid-lg.c
++++ b/drivers/hid/hid-lg.c
+@@ -345,14 +345,14 @@ static __u8 *lg_report_fixup(struct hid_device *hdev, __u8 *rdesc,
+ struct usb_device_descriptor *udesc;
+ __u16 bcdDevice, rev_maj, rev_min;
+
+- if ((drv_data->quirks & LG_RDESC) && *rsize >= 90 && rdesc[83] == 0x26 &&
++ if ((drv_data->quirks & LG_RDESC) && *rsize >= 91 && rdesc[83] == 0x26 &&
+ rdesc[84] == 0x8c && rdesc[85] == 0x02) {
+ hid_info(hdev,
+ "fixing up Logitech keyboard report descriptor\n");
+ rdesc[84] = rdesc[89] = 0x4d;
+ rdesc[85] = rdesc[90] = 0x10;
+ }
+- if ((drv_data->quirks & LG_RDESC_REL_ABS) && *rsize >= 50 &&
++ if ((drv_data->quirks & LG_RDESC_REL_ABS) && *rsize >= 51 &&
+ rdesc[32] == 0x81 && rdesc[33] == 0x06 &&
+ rdesc[49] == 0x81 && rdesc[50] == 0x06) {
+ hid_info(hdev,
+diff --git a/drivers/hid/hid-logitech-dj.c b/drivers/hid/hid-logitech-dj.c
+index 486dbde2ba2d..b7ba82960c79 100644
+--- a/drivers/hid/hid-logitech-dj.c
++++ b/drivers/hid/hid-logitech-dj.c
+@@ -238,13 +238,6 @@ static void logi_dj_recv_add_djhid_device(struct dj_receiver_dev *djrcv_dev,
+ return;
+ }
+
+- if ((dj_report->device_index < DJ_DEVICE_INDEX_MIN) ||
+- (dj_report->device_index > DJ_DEVICE_INDEX_MAX)) {
+- dev_err(&djrcv_hdev->dev, "%s: invalid device index:%d\n",
+- __func__, dj_report->device_index);
+- return;
+- }
+-
+ if (djrcv_dev->paired_dj_devices[dj_report->device_index]) {
+ /* The device is already known. No need to reallocate it. */
+ dbg_hid("%s: device is already known\n", __func__);
+@@ -557,7 +550,7 @@ static int logi_dj_ll_raw_request(struct hid_device *hid,
+ if (!out_buf)
+ return -ENOMEM;
+
+- if (count < DJREPORT_SHORT_LENGTH - 2)
++ if (count > DJREPORT_SHORT_LENGTH - 2)
+ count = DJREPORT_SHORT_LENGTH - 2;
+
+ out_buf[0] = REPORT_ID_DJ_SHORT;
+@@ -690,6 +683,12 @@ static int logi_dj_raw_event(struct hid_device *hdev,
+ * device (via hid_input_report() ) and return 1 so hid-core does not do
+ * anything else with it.
+ */
++ if ((dj_report->device_index < DJ_DEVICE_INDEX_MIN) ||
++ (dj_report->device_index > DJ_DEVICE_INDEX_MAX)) {
++ dev_err(&hdev->dev, "%s: invalid device index:%d\n",
++ __func__, dj_report->device_index);
++ return false;
++ }
+
+ spin_lock_irqsave(&djrcv_dev->lock, flags);
+ if (dj_report->report_id == REPORT_ID_DJ_SHORT) {
+diff --git a/drivers/hid/hid-monterey.c b/drivers/hid/hid-monterey.c
+index 9e14c00eb1b6..25daf28b26bd 100644
+--- a/drivers/hid/hid-monterey.c
++++ b/drivers/hid/hid-monterey.c
+@@ -24,7 +24,7 @@
+ static __u8 *mr_report_fixup(struct hid_device *hdev, __u8 *rdesc,
+ unsigned int *rsize)
+ {
+- if (*rsize >= 30 && rdesc[29] == 0x05 && rdesc[30] == 0x09) {
++ if (*rsize >= 31 && rdesc[29] == 0x05 && rdesc[30] == 0x09) {
+ hid_info(hdev, "fixing up button/consumer in HID report descriptor\n");
+ rdesc[30] = 0x0c;
+ }
+diff --git a/drivers/hid/hid-petalynx.c b/drivers/hid/hid-petalynx.c
+index 736b2502df4f..6aca4f2554bf 100644
+--- a/drivers/hid/hid-petalynx.c
++++ b/drivers/hid/hid-petalynx.c
+@@ -25,7 +25,7 @@
+ static __u8 *pl_report_fixup(struct hid_device *hdev, __u8 *rdesc,
+ unsigned int *rsize)
+ {
+- if (*rsize >= 60 && rdesc[39] == 0x2a && rdesc[40] == 0xf5 &&
++ if (*rsize >= 62 && rdesc[39] == 0x2a && rdesc[40] == 0xf5 &&
+ rdesc[41] == 0x00 && rdesc[59] == 0x26 &&
+ rdesc[60] == 0xf9 && rdesc[61] == 0x00) {
+ hid_info(hdev, "fixing up Petalynx Maxter Remote report descriptor\n");
+diff --git a/drivers/hid/hid-sunplus.c b/drivers/hid/hid-sunplus.c
+index 87fc91e1c8de..91072fa54663 100644
+--- a/drivers/hid/hid-sunplus.c
++++ b/drivers/hid/hid-sunplus.c
+@@ -24,7 +24,7 @@
+ static __u8 *sp_report_fixup(struct hid_device *hdev, __u8 *rdesc,
+ unsigned int *rsize)
+ {
+- if (*rsize >= 107 && rdesc[104] == 0x26 && rdesc[105] == 0x80 &&
++ if (*rsize >= 112 && rdesc[104] == 0x26 && rdesc[105] == 0x80 &&
+ rdesc[106] == 0x03) {
+ hid_info(hdev, "fixing up Sunplus Wireless Desktop report descriptor\n");
+ rdesc[105] = rdesc[110] = 0x03;
+diff --git a/drivers/hwmon/ads1015.c b/drivers/hwmon/ads1015.c
+index 7f9dc2f86b63..126516414c11 100644
+--- a/drivers/hwmon/ads1015.c
++++ b/drivers/hwmon/ads1015.c
+@@ -198,7 +198,7 @@ static int ads1015_get_channels_config_of(struct i2c_client *client)
+ }
+
+ channel = be32_to_cpup(property);
+- if (channel > ADS1015_CHANNELS) {
++ if (channel >= ADS1015_CHANNELS) {
+ dev_err(&client->dev,
+ "invalid channel index %d on %s\n",
+ channel, node->full_name);
+@@ -212,6 +212,7 @@ static int ads1015_get_channels_config_of(struct i2c_client *client)
+ dev_err(&client->dev,
+ "invalid gain on %s\n",
+ node->full_name);
++ return -EINVAL;
+ }
+ }
+
+@@ -222,6 +223,7 @@ static int ads1015_get_channels_config_of(struct i2c_client *client)
+ dev_err(&client->dev,
+ "invalid data_rate on %s\n",
+ node->full_name);
++ return -EINVAL;
+ }
+ }
+
+diff --git a/drivers/hwmon/amc6821.c b/drivers/hwmon/amc6821.c
+index 9f2be3dd28f3..8a67ec6279a4 100644
+--- a/drivers/hwmon/amc6821.c
++++ b/drivers/hwmon/amc6821.c
+@@ -360,11 +360,13 @@ static ssize_t set_pwm1_enable(
+ if (config)
+ return config;
+
++ mutex_lock(&data->update_lock);
+ config = i2c_smbus_read_byte_data(client, AMC6821_REG_CONF1);
+ if (config < 0) {
+ dev_err(&client->dev,
+ "Error reading configuration register, aborting.\n");
+- return config;
++ count = config;
++ goto unlock;
+ }
+
+ switch (val) {
+@@ -381,14 +383,15 @@ static ssize_t set_pwm1_enable(
+ config |= AMC6821_CONF1_FDRC1;
+ break;
+ default:
+- return -EINVAL;
++ count = -EINVAL;
++ goto unlock;
+ }
+- mutex_lock(&data->update_lock);
+ if (i2c_smbus_write_byte_data(client, AMC6821_REG_CONF1, config)) {
+ dev_err(&client->dev,
+ "Configuration register write error, aborting.\n");
+ count = -EIO;
+ }
++unlock:
+ mutex_unlock(&data->update_lock);
+ return count;
+ }
+@@ -493,8 +496,9 @@ static ssize_t set_temp_auto_point_temp(
+ return -EINVAL;
+ }
+
+- data->valid = 0;
+ mutex_lock(&data->update_lock);
++ data->valid = 0;
++
+ switch (ix) {
+ case 0:
+ ptemp[0] = clamp_val(val / 1000, 0,
+@@ -658,13 +662,14 @@ static ssize_t set_fan1_div(
+ if (config)
+ return config;
+
++ mutex_lock(&data->update_lock);
+ config = i2c_smbus_read_byte_data(client, AMC6821_REG_CONF4);
+ if (config < 0) {
+ dev_err(&client->dev,
+ "Error reading configuration register, aborting.\n");
+- return config;
++ count = config;
++ goto EXIT;
+ }
+- mutex_lock(&data->update_lock);
+ switch (val) {
+ case 2:
+ config &= ~AMC6821_CONF4_PSPR;
+diff --git a/drivers/hwmon/dme1737.c b/drivers/hwmon/dme1737.c
+index 4ae3fff13f44..bea0a344fab5 100644
+--- a/drivers/hwmon/dme1737.c
++++ b/drivers/hwmon/dme1737.c
+@@ -247,8 +247,8 @@ struct dme1737_data {
+ u8 pwm_acz[3];
+ u8 pwm_freq[6];
+ u8 pwm_rr[2];
+- u8 zone_low[3];
+- u8 zone_abs[3];
++ s8 zone_low[3];
++ s8 zone_abs[3];
+ u8 zone_hyst[2];
+ u32 alarms;
+ };
+@@ -277,7 +277,7 @@ static inline int IN_FROM_REG(int reg, int nominal, int res)
+ return (reg * nominal + (3 << (res - 3))) / (3 << (res - 2));
+ }
+
+-static inline int IN_TO_REG(int val, int nominal)
++static inline int IN_TO_REG(long val, int nominal)
+ {
+ return clamp_val((val * 192 + nominal / 2) / nominal, 0, 255);
+ }
+@@ -293,7 +293,7 @@ static inline int TEMP_FROM_REG(int reg, int res)
+ return (reg * 1000) >> (res - 8);
+ }
+
+-static inline int TEMP_TO_REG(int val)
++static inline int TEMP_TO_REG(long val)
+ {
+ return clamp_val((val < 0 ? val - 500 : val + 500) / 1000, -128, 127);
+ }
+@@ -308,7 +308,7 @@ static inline int TEMP_RANGE_FROM_REG(int reg)
+ return TEMP_RANGE[(reg >> 4) & 0x0f];
+ }
+
+-static int TEMP_RANGE_TO_REG(int val, int reg)
++static int TEMP_RANGE_TO_REG(long val, int reg)
+ {
+ int i;
+
+@@ -331,7 +331,7 @@ static inline int TEMP_HYST_FROM_REG(int reg, int ix)
+ return (((ix == 1) ? reg : reg >> 4) & 0x0f) * 1000;
+ }
+
+-static inline int TEMP_HYST_TO_REG(int val, int ix, int reg)
++static inline int TEMP_HYST_TO_REG(long val, int ix, int reg)
+ {
+ int hyst = clamp_val((val + 500) / 1000, 0, 15);
+
+@@ -347,7 +347,7 @@ static inline int FAN_FROM_REG(int reg, int tpc)
+ return (reg == 0 || reg == 0xffff) ? 0 : 90000 * 60 / reg;
+ }
+
+-static inline int FAN_TO_REG(int val, int tpc)
++static inline int FAN_TO_REG(long val, int tpc)
+ {
+ if (tpc) {
+ return clamp_val(val / tpc, 0, 0xffff);
+@@ -379,7 +379,7 @@ static inline int FAN_TYPE_FROM_REG(int reg)
+ return (edge > 0) ? 1 << (edge - 1) : 0;
+ }
+
+-static inline int FAN_TYPE_TO_REG(int val, int reg)
++static inline int FAN_TYPE_TO_REG(long val, int reg)
+ {
+ int edge = (val == 4) ? 3 : val;
+
+@@ -402,7 +402,7 @@ static int FAN_MAX_FROM_REG(int reg)
+ return 1000 + i * 500;
+ }
+
+-static int FAN_MAX_TO_REG(int val)
++static int FAN_MAX_TO_REG(long val)
+ {
+ int i;
+
+@@ -460,7 +460,7 @@ static inline int PWM_ACZ_FROM_REG(int reg)
+ return acz[(reg >> 5) & 0x07];
+ }
+
+-static inline int PWM_ACZ_TO_REG(int val, int reg)
++static inline int PWM_ACZ_TO_REG(long val, int reg)
+ {
+ int acz = (val == 4) ? 2 : val - 1;
+
+@@ -476,7 +476,7 @@ static inline int PWM_FREQ_FROM_REG(int reg)
+ return PWM_FREQ[reg & 0x0f];
+ }
+
+-static int PWM_FREQ_TO_REG(int val, int reg)
++static int PWM_FREQ_TO_REG(long val, int reg)
+ {
+ int i;
+
+@@ -510,7 +510,7 @@ static inline int PWM_RR_FROM_REG(int reg, int ix)
+ return (rr & 0x08) ? PWM_RR[rr & 0x07] : 0;
+ }
+
+-static int PWM_RR_TO_REG(int val, int ix, int reg)
++static int PWM_RR_TO_REG(long val, int ix, int reg)
+ {
+ int i;
+
+@@ -528,7 +528,7 @@ static inline int PWM_RR_EN_FROM_REG(int reg, int ix)
+ return PWM_RR_FROM_REG(reg, ix) ? 1 : 0;
+ }
+
+-static inline int PWM_RR_EN_TO_REG(int val, int ix, int reg)
++static inline int PWM_RR_EN_TO_REG(long val, int ix, int reg)
+ {
+ int en = (ix == 1) ? 0x80 : 0x08;
+
+@@ -1481,13 +1481,16 @@ static ssize_t set_vrm(struct device *dev, struct device_attribute *attr,
+ const char *buf, size_t count)
+ {
+ struct dme1737_data *data = dev_get_drvdata(dev);
+- long val;
++ unsigned long val;
+ int err;
+
+- err = kstrtol(buf, 10, &val);
++ err = kstrtoul(buf, 10, &val);
+ if (err)
+ return err;
+
++ if (val > 255)
++ return -EINVAL;
++
+ data->vrm = val;
+ return count;
+ }
+diff --git a/drivers/hwmon/gpio-fan.c b/drivers/hwmon/gpio-fan.c
+index 2566c43dd1e9..d10aa7b46cca 100644
+--- a/drivers/hwmon/gpio-fan.c
++++ b/drivers/hwmon/gpio-fan.c
+@@ -173,7 +173,7 @@ static int get_fan_speed_index(struct gpio_fan_data *fan_data)
+ return -ENODEV;
+ }
+
+-static int rpm_to_speed_index(struct gpio_fan_data *fan_data, int rpm)
++static int rpm_to_speed_index(struct gpio_fan_data *fan_data, unsigned long rpm)
+ {
+ struct gpio_fan_speed *speed = fan_data->speed;
+ int i;
+diff --git a/drivers/hwmon/lm78.c b/drivers/hwmon/lm78.c
+index 9efadfc851bc..c1eb464f0fd0 100644
+--- a/drivers/hwmon/lm78.c
++++ b/drivers/hwmon/lm78.c
+@@ -108,7 +108,7 @@ static inline int FAN_FROM_REG(u8 val, int div)
+ * TEMP: mC (-128C to +127C)
+ * REG: 1C/bit, two's complement
+ */
+-static inline s8 TEMP_TO_REG(int val)
++static inline s8 TEMP_TO_REG(long val)
+ {
+ int nval = clamp_val(val, -128000, 127000) ;
+ return nval < 0 ? (nval - 500) / 1000 : (nval + 500) / 1000;
+diff --git a/drivers/hwmon/lm85.c b/drivers/hwmon/lm85.c
+index b0129a54e1a6..ef627ea71cc8 100644
+--- a/drivers/hwmon/lm85.c
++++ b/drivers/hwmon/lm85.c
+@@ -155,7 +155,7 @@ static inline u16 FAN_TO_REG(unsigned long val)
+
+ /* Temperature is reported in .001 degC increments */
+ #define TEMP_TO_REG(val) \
+- clamp_val(SCALE(val, 1000, 1), -127, 127)
++ DIV_ROUND_CLOSEST(clamp_val((val), -127000, 127000), 1000)
+ #define TEMPEXT_FROM_REG(val, ext) \
+ SCALE(((val) << 4) + (ext), 16, 1000)
+ #define TEMP_FROM_REG(val) ((val) * 1000)
+@@ -189,7 +189,7 @@ static const int lm85_range_map[] = {
+ 13300, 16000, 20000, 26600, 32000, 40000, 53300, 80000
+ };
+
+-static int RANGE_TO_REG(int range)
++static int RANGE_TO_REG(long range)
+ {
+ int i;
+
+@@ -211,7 +211,7 @@ static const int adm1027_freq_map[8] = { /* 1 Hz */
+ 11, 15, 22, 29, 35, 44, 59, 88
+ };
+
+-static int FREQ_TO_REG(const int *map, int freq)
++static int FREQ_TO_REG(const int *map, unsigned long freq)
+ {
+ int i;
+
+@@ -460,6 +460,9 @@ static ssize_t store_vrm_reg(struct device *dev, struct device_attribute *attr,
+ if (err)
+ return err;
+
++ if (val > 255)
++ return -EINVAL;
++
+ data->vrm = val;
+ return count;
+ }
+diff --git a/drivers/hwmon/lm92.c b/drivers/hwmon/lm92.c
+index d2060e245ff5..cfaf70b9cba7 100644
+--- a/drivers/hwmon/lm92.c
++++ b/drivers/hwmon/lm92.c
+@@ -74,12 +74,9 @@ static inline int TEMP_FROM_REG(s16 reg)
+ return reg / 8 * 625 / 10;
+ }
+
+-static inline s16 TEMP_TO_REG(int val)
++static inline s16 TEMP_TO_REG(long val)
+ {
+- if (val <= -60000)
+- return -60000 * 10 / 625 * 8;
+- if (val >= 160000)
+- return 160000 * 10 / 625 * 8;
++ val = clamp_val(val, -60000, 160000);
+ return val * 10 / 625 * 8;
+ }
+
+@@ -206,10 +203,12 @@ static ssize_t set_temp_hyst(struct device *dev,
+ if (err)
+ return err;
+
++ val = clamp_val(val, -120000, 220000);
+ mutex_lock(&data->update_lock);
+- data->temp[t_hyst] = TEMP_FROM_REG(data->temp[attr->index]) - val;
++ data->temp[t_hyst] =
++ TEMP_TO_REG(TEMP_FROM_REG(data->temp[attr->index]) - val);
+ i2c_smbus_write_word_swapped(client, LM92_REG_TEMP_HYST,
+- TEMP_TO_REG(data->temp[t_hyst]));
++ data->temp[t_hyst]);
+ mutex_unlock(&data->update_lock);
+ return count;
+ }
+diff --git a/drivers/hwmon/sis5595.c b/drivers/hwmon/sis5595.c
+index 3532026e25da..bf1d7893d51c 100644
+--- a/drivers/hwmon/sis5595.c
++++ b/drivers/hwmon/sis5595.c
+@@ -159,7 +159,7 @@ static inline int TEMP_FROM_REG(s8 val)
+ {
+ return val * 830 + 52120;
+ }
+-static inline s8 TEMP_TO_REG(int val)
++static inline s8 TEMP_TO_REG(long val)
+ {
+ int nval = clamp_val(val, -54120, 157530) ;
+ return nval < 0 ? (nval - 5212 - 415) / 830 : (nval - 5212 + 415) / 830;
+diff --git a/drivers/i2c/busses/i2c-at91.c b/drivers/i2c/busses/i2c-at91.c
+index e95f9ba96790..83c989382be9 100644
+--- a/drivers/i2c/busses/i2c-at91.c
++++ b/drivers/i2c/busses/i2c-at91.c
+@@ -210,7 +210,7 @@ static void at91_twi_write_data_dma_callback(void *data)
+ struct at91_twi_dev *dev = (struct at91_twi_dev *)data;
+
+ dma_unmap_single(dev->dev, sg_dma_address(&dev->dma.sg),
+- dev->buf_len, DMA_MEM_TO_DEV);
++ dev->buf_len, DMA_TO_DEVICE);
+
+ at91_twi_write(dev, AT91_TWI_CR, AT91_TWI_STOP);
+ }
+@@ -289,7 +289,7 @@ static void at91_twi_read_data_dma_callback(void *data)
+ struct at91_twi_dev *dev = (struct at91_twi_dev *)data;
+
+ dma_unmap_single(dev->dev, sg_dma_address(&dev->dma.sg),
+- dev->buf_len, DMA_DEV_TO_MEM);
++ dev->buf_len, DMA_FROM_DEVICE);
+
+ /* The last two bytes have to be read without using dma */
+ dev->buf += dev->buf_len - 2;
+diff --git a/drivers/i2c/busses/i2c-rk3x.c b/drivers/i2c/busses/i2c-rk3x.c
+index a9791509966a..69e11853e8bf 100644
+--- a/drivers/i2c/busses/i2c-rk3x.c
++++ b/drivers/i2c/busses/i2c-rk3x.c
+@@ -399,7 +399,7 @@ static irqreturn_t rk3x_i2c_irq(int irqno, void *dev_id)
+ }
+
+ /* is there anything left to handle? */
+- if (unlikely(ipd == 0))
++ if (unlikely((ipd & REG_INT_ALL) == 0))
+ goto out;
+
+ switch (i2c->state) {
+diff --git a/drivers/misc/mei/client.c b/drivers/misc/mei/client.c
+index 59d20c599b16..2da05c0e113d 100644
+--- a/drivers/misc/mei/client.c
++++ b/drivers/misc/mei/client.c
+@@ -459,7 +459,7 @@ int mei_cl_disconnect(struct mei_cl *cl)
+ {
+ struct mei_device *dev;
+ struct mei_cl_cb *cb;
+- int rets, err;
++ int rets;
+
+ if (WARN_ON(!cl || !cl->dev))
+ return -ENODEV;
+@@ -491,6 +491,7 @@ int mei_cl_disconnect(struct mei_cl *cl)
+ cl_err(dev, cl, "failed to disconnect.\n");
+ goto free;
+ }
++ cl->timer_count = MEI_CONNECT_TIMEOUT;
+ mdelay(10); /* Wait for hardware disconnection ready */
+ list_add_tail(&cb->list, &dev->ctrl_rd_list.list);
+ } else {
+@@ -500,23 +501,18 @@ int mei_cl_disconnect(struct mei_cl *cl)
+ }
+ mutex_unlock(&dev->device_lock);
+
+- err = wait_event_timeout(dev->wait_recvd_msg,
++ wait_event_timeout(dev->wait_recvd_msg,
+ MEI_FILE_DISCONNECTED == cl->state,
+ mei_secs_to_jiffies(MEI_CL_CONNECT_TIMEOUT));
+
+ mutex_lock(&dev->device_lock);
++
+ if (MEI_FILE_DISCONNECTED == cl->state) {
+ rets = 0;
+ cl_dbg(dev, cl, "successfully disconnected from FW client.\n");
+ } else {
+- rets = -ENODEV;
+- if (MEI_FILE_DISCONNECTED != cl->state)
+- cl_err(dev, cl, "wrong status client disconnect.\n");
+-
+- if (err)
+- cl_dbg(dev, cl, "wait failed disconnect err=%d\n", err);
+-
+- cl_err(dev, cl, "failed to disconnect from FW client.\n");
++ cl_dbg(dev, cl, "timeout on disconnect from FW client.\n");
++ rets = -ETIME;
+ }
+
+ mei_io_list_flush(&dev->ctrl_rd_list, cl);
+@@ -605,6 +601,7 @@ int mei_cl_connect(struct mei_cl *cl, struct file *file)
+ cl->timer_count = MEI_CONNECT_TIMEOUT;
+ list_add_tail(&cb->list, &dev->ctrl_rd_list.list);
+ } else {
++ cl->state = MEI_FILE_INITIALIZING;
+ list_add_tail(&cb->list, &dev->ctrl_wr_list.list);
+ }
+
+@@ -616,6 +613,7 @@ int mei_cl_connect(struct mei_cl *cl, struct file *file)
+ mutex_lock(&dev->device_lock);
+
+ if (cl->state != MEI_FILE_CONNECTED) {
++ cl->state = MEI_FILE_DISCONNECTED;
+ /* something went really wrong */
+ if (!cl->status)
+ cl->status = -EFAULT;
+diff --git a/drivers/misc/mei/nfc.c b/drivers/misc/mei/nfc.c
+index 3095fc514a65..5ccc23bc7690 100644
+--- a/drivers/misc/mei/nfc.c
++++ b/drivers/misc/mei/nfc.c
+@@ -342,9 +342,10 @@ static int mei_nfc_send(struct mei_cl_device *cldev, u8 *buf, size_t length)
+ ndev = (struct mei_nfc_dev *) cldev->priv_data;
+ dev = ndev->cl->dev;
+
++ err = -ENOMEM;
+ mei_buf = kzalloc(length + MEI_NFC_HEADER_SIZE, GFP_KERNEL);
+ if (!mei_buf)
+- return -ENOMEM;
++ goto out;
+
+ hdr = (struct mei_nfc_hci_hdr *) mei_buf;
+ hdr->cmd = MEI_NFC_CMD_HCI_SEND;
+@@ -354,12 +355,9 @@ static int mei_nfc_send(struct mei_cl_device *cldev, u8 *buf, size_t length)
+ hdr->data_size = length;
+
+ memcpy(mei_buf + MEI_NFC_HEADER_SIZE, buf, length);
+-
+ err = __mei_cl_send(ndev->cl, mei_buf, length + MEI_NFC_HEADER_SIZE);
+ if (err < 0)
+- return err;
+-
+- kfree(mei_buf);
++ goto out;
+
+ if (!wait_event_interruptible_timeout(ndev->send_wq,
+ ndev->recv_req_id == ndev->req_id, HZ)) {
+@@ -368,7 +366,8 @@ static int mei_nfc_send(struct mei_cl_device *cldev, u8 *buf, size_t length)
+ } else {
+ ndev->req_id++;
+ }
+-
++out:
++ kfree(mei_buf);
+ return err;
+ }
+
+diff --git a/drivers/misc/mei/pci-me.c b/drivers/misc/mei/pci-me.c
+index 1b46c64a649f..4b821b4360e1 100644
+--- a/drivers/misc/mei/pci-me.c
++++ b/drivers/misc/mei/pci-me.c
+@@ -369,7 +369,7 @@ static int mei_me_pm_runtime_idle(struct device *device)
+ if (!dev)
+ return -ENODEV;
+ if (mei_write_is_idle(dev))
+- pm_schedule_suspend(device, MEI_ME_RPM_TIMEOUT * 2);
++ pm_runtime_autosuspend(device);
+
+ return -EBUSY;
+ }
+diff --git a/drivers/misc/mei/pci-txe.c b/drivers/misc/mei/pci-txe.c
+index 2343c6236df9..32fef4d5b0b6 100644
+--- a/drivers/misc/mei/pci-txe.c
++++ b/drivers/misc/mei/pci-txe.c
+@@ -306,7 +306,7 @@ static int mei_txe_pm_runtime_idle(struct device *device)
+ if (!dev)
+ return -ENODEV;
+ if (mei_write_is_idle(dev))
+- pm_schedule_suspend(device, MEI_TXI_RPM_TIMEOUT * 2);
++ pm_runtime_autosuspend(device);
+
+ return -EBUSY;
+ }
+diff --git a/drivers/mmc/host/mmci.c b/drivers/mmc/host/mmci.c
+index 7ad463e9741c..249ab80cbb45 100644
+--- a/drivers/mmc/host/mmci.c
++++ b/drivers/mmc/host/mmci.c
+@@ -834,6 +834,10 @@ static void
+ mmci_data_irq(struct mmci_host *host, struct mmc_data *data,
+ unsigned int status)
+ {
++ /* Make sure we have data to handle */
++ if (!data)
++ return;
++
+ /* First check for errors */
+ if (status & (MCI_DATACRCFAIL|MCI_DATATIMEOUT|MCI_STARTBITERR|
+ MCI_TXUNDERRUN|MCI_RXOVERRUN)) {
+@@ -902,9 +906,17 @@ mmci_cmd_irq(struct mmci_host *host, struct mmc_command *cmd,
+ unsigned int status)
+ {
+ void __iomem *base = host->base;
+- bool sbc = (cmd == host->mrq->sbc);
+- bool busy_resp = host->variant->busy_detect &&
+- (cmd->flags & MMC_RSP_BUSY);
++ bool sbc, busy_resp;
++
++ if (!cmd)
++ return;
++
++ sbc = (cmd == host->mrq->sbc);
++ busy_resp = host->variant->busy_detect && (cmd->flags & MMC_RSP_BUSY);
++
++ if (!((status|host->busy_status) & (MCI_CMDCRCFAIL|MCI_CMDTIMEOUT|
++ MCI_CMDSENT|MCI_CMDRESPEND)))
++ return;
+
+ /* Check if we need to wait for busy completion. */
+ if (host->busy_status && (status & MCI_ST_CARDBUSY))
+@@ -1132,9 +1144,6 @@ static irqreturn_t mmci_irq(int irq, void *dev_id)
+ spin_lock(&host->lock);
+
+ do {
+- struct mmc_command *cmd;
+- struct mmc_data *data;
+-
+ status = readl(host->base + MMCISTATUS);
+
+ if (host->singleirq) {
+@@ -1154,16 +1163,8 @@ static irqreturn_t mmci_irq(int irq, void *dev_id)
+
+ dev_dbg(mmc_dev(host->mmc), "irq0 (data+cmd) %08x\n", status);
+
+- cmd = host->cmd;
+- if ((status|host->busy_status) & (MCI_CMDCRCFAIL|MCI_CMDTIMEOUT|
+- MCI_CMDSENT|MCI_CMDRESPEND) && cmd)
+- mmci_cmd_irq(host, cmd, status);
+-
+- data = host->data;
+- if (status & (MCI_DATACRCFAIL|MCI_DATATIMEOUT|MCI_STARTBITERR|
+- MCI_TXUNDERRUN|MCI_RXOVERRUN|MCI_DATAEND|
+- MCI_DATABLOCKEND) && data)
+- mmci_data_irq(host, data, status);
++ mmci_cmd_irq(host, host->cmd, status);
++ mmci_data_irq(host, host->data, status);
+
+ /* Don't poll for busy completion in irq context. */
+ if (host->busy_status)
+diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c
+index 42914e04d110..056841651a80 100644
+--- a/drivers/pci/hotplug/pciehp_hpc.c
++++ b/drivers/pci/hotplug/pciehp_hpc.c
+@@ -794,7 +794,7 @@ struct controller *pcie_init(struct pcie_device *dev)
+ pcie_capability_write_word(pdev, PCI_EXP_SLTSTA,
+ PCI_EXP_SLTSTA_ABP | PCI_EXP_SLTSTA_PFD |
+ PCI_EXP_SLTSTA_MRLSC | PCI_EXP_SLTSTA_PDC |
+- PCI_EXP_SLTSTA_CC);
++ PCI_EXP_SLTSTA_CC | PCI_EXP_SLTSTA_DLLSC);
+
+ /* Disable software notification */
+ pcie_disable_notification(ctrl);
+diff --git a/drivers/pci/pci-label.c b/drivers/pci/pci-label.c
+index a3fbe2012ea3..2ab1b47c7651 100644
+--- a/drivers/pci/pci-label.c
++++ b/drivers/pci/pci-label.c
+@@ -161,8 +161,8 @@ enum acpi_attr_enum {
+ static void dsm_label_utf16s_to_utf8s(union acpi_object *obj, char *buf)
+ {
+ int len;
+- len = utf16s_to_utf8s((const wchar_t *)obj->string.pointer,
+- obj->string.length,
++ len = utf16s_to_utf8s((const wchar_t *)obj->buffer.pointer,
++ obj->buffer.length,
+ UTF16_LITTLE_ENDIAN,
+ buf, PAGE_SIZE);
+ buf[len] = '\n';
+@@ -187,16 +187,22 @@ static int dsm_get_label(struct device *dev, char *buf,
+ tmp = obj->package.elements;
+ if (obj->type == ACPI_TYPE_PACKAGE && obj->package.count == 2 &&
+ tmp[0].type == ACPI_TYPE_INTEGER &&
+- tmp[1].type == ACPI_TYPE_STRING) {
++ (tmp[1].type == ACPI_TYPE_STRING ||
++ tmp[1].type == ACPI_TYPE_BUFFER)) {
+ /*
+ * The second string element is optional even when
+ * this _DSM is implemented; when not implemented,
+ * this entry must return a null string.
+ */
+- if (attr == ACPI_ATTR_INDEX_SHOW)
++ if (attr == ACPI_ATTR_INDEX_SHOW) {
+ scnprintf(buf, PAGE_SIZE, "%llu\n", tmp->integer.value);
+- else if (attr == ACPI_ATTR_LABEL_SHOW)
+- dsm_label_utf16s_to_utf8s(tmp + 1, buf);
++ } else if (attr == ACPI_ATTR_LABEL_SHOW) {
++ if (tmp[1].type == ACPI_TYPE_STRING)
++ scnprintf(buf, PAGE_SIZE, "%s\n",
++ tmp[1].string.pointer);
++ else if (tmp[1].type == ACPI_TYPE_BUFFER)
++ dsm_label_utf16s_to_utf8s(tmp + 1, buf);
++ }
+ len = strlen(buf) > 0 ? strlen(buf) : -1;
+ }
+
+diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
+index 1c8592b0e146..81d49d3ab221 100644
+--- a/drivers/pci/pci.c
++++ b/drivers/pci/pci.c
+@@ -839,12 +839,6 @@ int pci_set_power_state(struct pci_dev *dev, pci_power_t state)
+
+ if (!__pci_complete_power_transition(dev, state))
+ error = 0;
+- /*
+- * When aspm_policy is "powersave" this call ensures
+- * that ASPM is configured.
+- */
+- if (!error && dev->bus->self)
+- pcie_aspm_powersave_config_link(dev->bus->self);
+
+ return error;
+ }
+@@ -1195,12 +1189,18 @@ int __weak pcibios_enable_device(struct pci_dev *dev, int bars)
+ static int do_pci_enable_device(struct pci_dev *dev, int bars)
+ {
+ int err;
++ struct pci_dev *bridge;
+ u16 cmd;
+ u8 pin;
+
+ err = pci_set_power_state(dev, PCI_D0);
+ if (err < 0 && err != -EIO)
+ return err;
++
++ bridge = pci_upstream_bridge(dev);
++ if (bridge)
++ pcie_aspm_powersave_config_link(bridge);
++
+ err = pcibios_enable_device(dev, bars);
+ if (err < 0)
+ return err;
+diff --git a/drivers/pci/setup-res.c b/drivers/pci/setup-res.c
+index caed1ce6facd..481c4e18693a 100644
+--- a/drivers/pci/setup-res.c
++++ b/drivers/pci/setup-res.c
+@@ -320,9 +320,11 @@ int pci_reassign_resource(struct pci_dev *dev, int resno, resource_size_t addsiz
+ resource_size_t min_align)
+ {
+ struct resource *res = dev->resource + resno;
++ unsigned long flags;
+ resource_size_t new_size;
+ int ret;
+
++ flags = res->flags;
+ res->flags |= IORESOURCE_UNSET;
+ if (!res->parent) {
+ dev_info(&dev->dev, "BAR %d: can't reassign an unassigned resource %pR\n",
+@@ -339,7 +341,12 @@ int pci_reassign_resource(struct pci_dev *dev, int resno, resource_size_t addsiz
+ dev_info(&dev->dev, "BAR %d: reassigned %pR\n", resno, res);
+ if (resno < PCI_BRIDGE_RESOURCES)
+ pci_update_resource(dev, resno);
++ } else {
++ res->flags = flags;
++ dev_info(&dev->dev, "BAR %d: %pR (failed to expand by %#llx)\n",
++ resno, res, (unsigned long long) addsize);
+ }
++
+ return ret;
+ }
+
+diff --git a/drivers/scsi/hpsa.c b/drivers/scsi/hpsa.c
+index 31184b35370f..489e83b6b5e1 100644
+--- a/drivers/scsi/hpsa.c
++++ b/drivers/scsi/hpsa.c
+@@ -5092,7 +5092,7 @@ static int hpsa_big_passthru_ioctl(struct ctlr_info *h, void __user *argp)
+ }
+ if (ioc->Request.Type.Direction & XFER_WRITE) {
+ if (copy_from_user(buff[sg_used], data_ptr, sz)) {
+- status = -ENOMEM;
++ status = -EFAULT;
+ goto cleanup1;
+ }
+ } else
+@@ -6365,9 +6365,9 @@ static inline void hpsa_set_driver_support_bits(struct ctlr_info *h)
+ {
+ u32 driver_support;
+
+-#ifdef CONFIG_X86
+- /* Need to enable prefetch in the SCSI core for 6400 in x86 */
+ driver_support = readl(&(h->cfgtable->driver_support));
++ /* Need to enable prefetch in the SCSI core for 6400 in x86 */
++#ifdef CONFIG_X86
+ driver_support |= ENABLE_SCSI_PREFETCH;
+ #endif
+ driver_support |= ENABLE_UNIT_ATTN;
+diff --git a/drivers/staging/et131x/et131x.c b/drivers/staging/et131x/et131x.c
+index 08356b6955a4..2d36eac6889c 100644
+--- a/drivers/staging/et131x/et131x.c
++++ b/drivers/staging/et131x/et131x.c
+@@ -1423,22 +1423,16 @@ static int et131x_mii_read(struct et131x_adapter *adapter, u8 reg, u16 *value)
+ * @reg: the register to read
+ * @value: 16-bit value to write
+ */
+-static int et131x_mii_write(struct et131x_adapter *adapter, u8 reg, u16 value)
++static int et131x_mii_write(struct et131x_adapter *adapter, u8 addr, u8 reg,
++ u16 value)
+ {
+ struct mac_regs __iomem *mac = &adapter->regs->mac;
+- struct phy_device *phydev = adapter->phydev;
+ int status = 0;
+- u8 addr;
+ u32 delay = 0;
+ u32 mii_addr;
+ u32 mii_cmd;
+ u32 mii_indicator;
+
+- if (!phydev)
+- return -EIO;
+-
+- addr = phydev->addr;
+-
+ /* Save a local copy of the registers we are dealing with so we can
+ * set them back
+ */
+@@ -1633,17 +1627,7 @@ static int et131x_mdio_write(struct mii_bus *bus, int phy_addr,
+ struct net_device *netdev = bus->priv;
+ struct et131x_adapter *adapter = netdev_priv(netdev);
+
+- return et131x_mii_write(adapter, reg, value);
+-}
+-
+-static int et131x_mdio_reset(struct mii_bus *bus)
+-{
+- struct net_device *netdev = bus->priv;
+- struct et131x_adapter *adapter = netdev_priv(netdev);
+-
+- et131x_mii_write(adapter, MII_BMCR, BMCR_RESET);
+-
+- return 0;
++ return et131x_mii_write(adapter, phy_addr, reg, value);
+ }
+
+ /* et1310_phy_power_switch - PHY power control
+@@ -1658,18 +1642,20 @@ static int et131x_mdio_reset(struct mii_bus *bus)
+ static void et1310_phy_power_switch(struct et131x_adapter *adapter, bool down)
+ {
+ u16 data;
++ struct phy_device *phydev = adapter->phydev;
+
+ et131x_mii_read(adapter, MII_BMCR, &data);
+ data &= ~BMCR_PDOWN;
+ if (down)
+ data |= BMCR_PDOWN;
+- et131x_mii_write(adapter, MII_BMCR, data);
++ et131x_mii_write(adapter, phydev->addr, MII_BMCR, data);
+ }
+
+ /* et131x_xcvr_init - Init the phy if we are setting it into force mode */
+ static void et131x_xcvr_init(struct et131x_adapter *adapter)
+ {
+ u16 lcr2;
++ struct phy_device *phydev = adapter->phydev;
+
+ /* Set the LED behavior such that LED 1 indicates speed (off =
+ * 10Mbits, blink = 100Mbits, on = 1000Mbits) and LED 2 indicates
+@@ -1690,7 +1676,7 @@ static void et131x_xcvr_init(struct et131x_adapter *adapter)
+ else
+ lcr2 |= (LED_VAL_LINKON << LED_TXRX_SHIFT);
+
+- et131x_mii_write(adapter, PHY_LED_2, lcr2);
++ et131x_mii_write(adapter, phydev->addr, PHY_LED_2, lcr2);
+ }
+ }
+
+@@ -3645,14 +3631,14 @@ static void et131x_adjust_link(struct net_device *netdev)
+
+ et131x_mii_read(adapter, PHY_MPHY_CONTROL_REG,
+ &register18);
+- et131x_mii_write(adapter, PHY_MPHY_CONTROL_REG,
+- register18 | 0x4);
+- et131x_mii_write(adapter, PHY_INDEX_REG,
++ et131x_mii_write(adapter, phydev->addr,
++ PHY_MPHY_CONTROL_REG, register18 | 0x4);
++ et131x_mii_write(adapter, phydev->addr, PHY_INDEX_REG,
+ register18 | 0x8402);
+- et131x_mii_write(adapter, PHY_DATA_REG,
++ et131x_mii_write(adapter, phydev->addr, PHY_DATA_REG,
+ register18 | 511);
+- et131x_mii_write(adapter, PHY_MPHY_CONTROL_REG,
+- register18);
++ et131x_mii_write(adapter, phydev->addr,
++ PHY_MPHY_CONTROL_REG, register18);
+ }
+
+ et1310_config_flow_control(adapter);
+@@ -3664,7 +3650,8 @@ static void et131x_adjust_link(struct net_device *netdev)
+ et131x_mii_read(adapter, PHY_CONFIG, &reg);
+ reg &= ~ET_PHY_CONFIG_TX_FIFO_DEPTH;
+ reg |= ET_PHY_CONFIG_FIFO_DEPTH_32;
+- et131x_mii_write(adapter, PHY_CONFIG, reg);
++ et131x_mii_write(adapter, phydev->addr, PHY_CONFIG,
++ reg);
+ }
+
+ et131x_set_rx_dma_timer(adapter);
+@@ -3677,14 +3664,14 @@ static void et131x_adjust_link(struct net_device *netdev)
+
+ et131x_mii_read(adapter, PHY_MPHY_CONTROL_REG,
+ &register18);
+- et131x_mii_write(adapter, PHY_MPHY_CONTROL_REG,
+- register18 | 0x4);
+- et131x_mii_write(adapter, PHY_INDEX_REG,
+- register18 | 0x8402);
+- et131x_mii_write(adapter, PHY_DATA_REG,
+- register18 | 511);
+- et131x_mii_write(adapter, PHY_MPHY_CONTROL_REG,
+- register18);
++ et131x_mii_write(adapter, phydev->addr,
++ PHY_MPHY_CONTROL_REG, register18 | 0x4);
++ et131x_mii_write(adapter, phydev->addr,
++ PHY_INDEX_REG, register18 | 0x8402);
++ et131x_mii_write(adapter, phydev->addr,
++ PHY_DATA_REG, register18 | 511);
++ et131x_mii_write(adapter, phydev->addr,
++ PHY_MPHY_CONTROL_REG, register18);
+ }
+
+ /* Free the packets being actively sent & stopped */
+@@ -4646,10 +4633,6 @@ static int et131x_pci_setup(struct pci_dev *pdev,
+ /* Copy address into the net_device struct */
+ memcpy(netdev->dev_addr, adapter->addr, ETH_ALEN);
+
+- /* Init variable for counting how long we do not have link status */
+- adapter->boot_coma = 0;
+- et1310_disable_phy_coma(adapter);
+-
+ rc = -ENOMEM;
+
+ /* Setup the mii_bus struct */
+@@ -4665,7 +4648,6 @@ static int et131x_pci_setup(struct pci_dev *pdev,
+ adapter->mii_bus->priv = netdev;
+ adapter->mii_bus->read = et131x_mdio_read;
+ adapter->mii_bus->write = et131x_mdio_write;
+- adapter->mii_bus->reset = et131x_mdio_reset;
+ adapter->mii_bus->irq = kmalloc_array(PHY_MAX_ADDR, sizeof(int),
+ GFP_KERNEL);
+ if (!adapter->mii_bus->irq)
+@@ -4689,6 +4671,10 @@ static int et131x_pci_setup(struct pci_dev *pdev,
+ /* Setup et1310 as per the documentation */
+ et131x_adapter_setup(adapter);
+
++ /* Init variable for counting how long we do not have link status */
++ adapter->boot_coma = 0;
++ et1310_disable_phy_coma(adapter);
++
+ /* We can enable interrupts now
+ *
+ * NOTE - Because registration of interrupt handler is done in the
+diff --git a/drivers/staging/lustre/lustre/obdclass/class_obd.c b/drivers/staging/lustre/lustre/obdclass/class_obd.c
+index dde04b767a6d..b16687625c44 100644
+--- a/drivers/staging/lustre/lustre/obdclass/class_obd.c
++++ b/drivers/staging/lustre/lustre/obdclass/class_obd.c
+@@ -35,7 +35,7 @@
+ */
+
+ #define DEBUG_SUBSYSTEM S_CLASS
+-# include <asm/atomic.h>
++# include <linux/atomic.h>
+
+ #include <obd_support.h>
+ #include <obd_class.h>
+diff --git a/drivers/staging/rtl8188eu/os_dep/usb_intf.c b/drivers/staging/rtl8188eu/os_dep/usb_intf.c
+index 7526b989dcbf..c4273cd5f7ed 100644
+--- a/drivers/staging/rtl8188eu/os_dep/usb_intf.c
++++ b/drivers/staging/rtl8188eu/os_dep/usb_intf.c
+@@ -54,9 +54,11 @@ static struct usb_device_id rtw_usb_id_tbl[] = {
+ {USB_DEVICE(USB_VENDER_ID_REALTEK, 0x0179)}, /* 8188ETV */
+ /*=== Customer ID ===*/
+ /****** 8188EUS ********/
++ {USB_DEVICE(0x056e, 0x4008)}, /* Elecom WDC-150SU2M */
+ {USB_DEVICE(0x07b8, 0x8179)}, /* Abocom - Abocom */
+ {USB_DEVICE(0x2001, 0x330F)}, /* DLink DWA-125 REV D1 */
+ {USB_DEVICE(0x2001, 0x3310)}, /* Dlink DWA-123 REV D1 */
++ {USB_DEVICE(0x0df6, 0x0076)}, /* Sitecom N150 v2 */
+ {} /* Terminating entry */
+ };
+
+diff --git a/drivers/tty/serial/serial_core.c b/drivers/tty/serial/serial_core.c
+index fbf6c5ad222f..ef2fb367d179 100644
+--- a/drivers/tty/serial/serial_core.c
++++ b/drivers/tty/serial/serial_core.c
+@@ -243,6 +243,9 @@ static void uart_shutdown(struct tty_struct *tty, struct uart_state *state)
+ /*
+ * Turn off DTR and RTS early.
+ */
++ if (uart_console(uport) && tty)
++ uport->cons->cflag = tty->termios.c_cflag;
++
+ if (!tty || (tty->termios.c_cflag & HUPCL))
+ uart_clear_mctrl(uport, TIOCM_DTR | TIOCM_RTS);
+
+diff --git a/drivers/usb/core/devio.c b/drivers/usb/core/devio.c
+index 257876ea03a1..0b59731c3021 100644
+--- a/drivers/usb/core/devio.c
++++ b/drivers/usb/core/devio.c
+@@ -1509,7 +1509,7 @@ static int proc_do_submiturb(struct usb_dev_state *ps, struct usbdevfs_urb *uurb
+ u = (is_in ? URB_DIR_IN : URB_DIR_OUT);
+ if (uurb->flags & USBDEVFS_URB_ISO_ASAP)
+ u |= URB_ISO_ASAP;
+- if (uurb->flags & USBDEVFS_URB_SHORT_NOT_OK)
++ if (uurb->flags & USBDEVFS_URB_SHORT_NOT_OK && is_in)
+ u |= URB_SHORT_NOT_OK;
+ if (uurb->flags & USBDEVFS_URB_NO_FSBR)
+ u |= URB_NO_FSBR;
+diff --git a/drivers/usb/core/hub.c b/drivers/usb/core/hub.c
+index 0e950ad8cb25..27f217107ef1 100644
+--- a/drivers/usb/core/hub.c
++++ b/drivers/usb/core/hub.c
+@@ -1728,8 +1728,14 @@ static int hub_probe(struct usb_interface *intf, const struct usb_device_id *id)
+ * - Change autosuspend delay of hub can avoid unnecessary auto
+ * suspend timer for hub, also may decrease power consumption
+ * of USB bus.
++ *
++ * - If user has indicated to prevent autosuspend by passing
++ * usbcore.autosuspend = -1 then keep autosuspend disabled.
+ */
+- pm_runtime_set_autosuspend_delay(&hdev->dev, 0);
++#ifdef CONFIG_PM_RUNTIME
++ if (hdev->dev.power.autosuspend_delay >= 0)
++ pm_runtime_set_autosuspend_delay(&hdev->dev, 0);
++#endif
+
+ /*
+ * Hubs have proper suspend/resume support, except for root hubs
+@@ -3264,6 +3270,43 @@ static int finish_port_resume(struct usb_device *udev)
+ }
+
+ /*
++ * There are some SS USB devices which take longer time for link training.
++ * XHCI specs 4.19.4 says that when Link training is successful, port
++ * sets CSC bit to 1. So if SW reads port status before successful link
++ * training, then it will not find device to be present.
++ * USB Analyzer log with such buggy devices show that in some cases
++ * device switch on the RX termination after long delay of host enabling
++ * the VBUS. In few other cases it has been seen that device fails to
++ * negotiate link training in first attempt. It has been
++ * reported till now that few devices take as long as 2000 ms to train
++ * the link after host enabling its VBUS and termination. Following
++ * routine implements a 2000 ms timeout for link training. If in a case
++ * link trains before timeout, loop will exit earlier.
++ *
++ * FIXME: If a device was connected before suspend, but was removed
++ * while system was asleep, then the loop in the following routine will
++ * only exit at timeout.
++ *
++ * This routine should only be called when persist is enabled for a SS
++ * device.
++ */
++static int wait_for_ss_port_enable(struct usb_device *udev,
++ struct usb_hub *hub, int *port1,
++ u16 *portchange, u16 *portstatus)
++{
++ int status = 0, delay_ms = 0;
++
++ while (delay_ms < 2000) {
++ if (status || *portstatus & USB_PORT_STAT_CONNECTION)
++ break;
++ msleep(20);
++ delay_ms += 20;
++ status = hub_port_status(hub, *port1, portstatus, portchange);
++ }
++ return status;
++}
++
++/*
+ * usb_port_resume - re-activate a suspended usb device's upstream port
+ * @udev: device to re-activate, not a root hub
+ * Context: must be able to sleep; device not locked; pm locks held
+@@ -3359,6 +3402,10 @@ int usb_port_resume(struct usb_device *udev, pm_message_t msg)
+ }
+ }
+
++ if (udev->persist_enabled && hub_is_superspeed(hub->hdev))
++ status = wait_for_ss_port_enable(udev, hub, &port1, &portchange,
++ &portstatus);
++
+ status = check_port_resume_type(udev,
+ hub, port1, status, portchange, portstatus);
+ if (status == 0)
+@@ -4550,6 +4597,7 @@ static void hub_port_connect(struct usb_hub *hub, int port1, u16 portstatus,
+ struct usb_hcd *hcd = bus_to_hcd(hdev->bus);
+ struct usb_port *port_dev = hub->ports[port1 - 1];
+ struct usb_device *udev = port_dev->child;
++ static int unreliable_port = -1;
+
+ /* Disconnect any existing devices under this port */
+ if (udev) {
+@@ -4570,10 +4618,12 @@ static void hub_port_connect(struct usb_hub *hub, int port1, u16 portstatus,
+ USB_PORT_STAT_C_ENABLE)) {
+ status = hub_port_debounce_be_stable(hub, port1);
+ if (status < 0) {
+- if (status != -ENODEV && printk_ratelimit())
+- dev_err(&port_dev->dev,
+- "connect-debounce failed\n");
++ if (status != -ENODEV &&
++ port1 != unreliable_port &&
++ printk_ratelimit())
++ dev_err(&port_dev->dev, "connect-debounce failed\n");
+ portstatus &= ~USB_PORT_STAT_CONNECTION;
++ unreliable_port = port1;
+ } else {
+ portstatus = status;
+ }
+diff --git a/drivers/usb/host/ehci-hub.c b/drivers/usb/host/ehci-hub.c
+index cc305c71ac3d..6130b7574908 100644
+--- a/drivers/usb/host/ehci-hub.c
++++ b/drivers/usb/host/ehci-hub.c
+@@ -1230,7 +1230,7 @@ int ehci_hub_control(
+ if (selector == EHSET_TEST_SINGLE_STEP_SET_FEATURE) {
+ spin_unlock_irqrestore(&ehci->lock, flags);
+ retval = ehset_single_step_set_feature(hcd,
+- wIndex);
++ wIndex + 1);
+ spin_lock_irqsave(&ehci->lock, flags);
+ break;
+ }
+diff --git a/drivers/usb/host/ehci-pci.c b/drivers/usb/host/ehci-pci.c
+index 3e86bf4371b3..ca7b964124af 100644
+--- a/drivers/usb/host/ehci-pci.c
++++ b/drivers/usb/host/ehci-pci.c
+@@ -35,6 +35,21 @@ static const char hcd_name[] = "ehci-pci";
+ #define PCI_DEVICE_ID_INTEL_CE4100_USB 0x2e70
+
+ /*-------------------------------------------------------------------------*/
++#define PCI_DEVICE_ID_INTEL_QUARK_X1000_SOC 0x0939
++static inline bool is_intel_quark_x1000(struct pci_dev *pdev)
++{
++ return pdev->vendor == PCI_VENDOR_ID_INTEL &&
++ pdev->device == PCI_DEVICE_ID_INTEL_QUARK_X1000_SOC;
++}
++
++/*
++ * 0x84 is the offset of in/out threshold register,
++ * and it is the same offset as the register of 'hostpc'.
++ */
++#define intel_quark_x1000_insnreg01 hostpc
++
++/* Maximum usable threshold value is 0x7f dwords for both IN and OUT */
++#define INTEL_QUARK_X1000_EHCI_MAX_THRESHOLD 0x007f007f
+
+ /* called after powerup, by probe or system-pm "wakeup" */
+ static int ehci_pci_reinit(struct ehci_hcd *ehci, struct pci_dev *pdev)
+@@ -50,6 +65,16 @@ static int ehci_pci_reinit(struct ehci_hcd *ehci, struct pci_dev *pdev)
+ if (!retval)
+ ehci_dbg(ehci, "MWI active\n");
+
++ /* Reset the threshold limit */
++ if (is_intel_quark_x1000(pdev)) {
++ /*
++ * For the Intel QUARK X1000, raise the I/O threshold to the
++ * maximum usable value in order to improve performance.
++ */
++ ehci_writel(ehci, INTEL_QUARK_X1000_EHCI_MAX_THRESHOLD,
++ ehci->regs->intel_quark_x1000_insnreg01);
++ }
++
+ return 0;
+ }
+
+diff --git a/drivers/usb/host/ohci-dbg.c b/drivers/usb/host/ohci-dbg.c
+index 45032e933e18..04f2186939d2 100644
+--- a/drivers/usb/host/ohci-dbg.c
++++ b/drivers/usb/host/ohci-dbg.c
+@@ -236,7 +236,7 @@ ohci_dump_roothub (
+ }
+ }
+
+-static void ohci_dump (struct ohci_hcd *controller, int verbose)
++static void ohci_dump(struct ohci_hcd *controller)
+ {
+ ohci_dbg (controller, "OHCI controller state\n");
+
+@@ -464,15 +464,16 @@ show_list (struct ohci_hcd *ohci, char *buf, size_t count, struct ed *ed)
+ static ssize_t fill_async_buffer(struct debug_buffer *buf)
+ {
+ struct ohci_hcd *ohci;
+- size_t temp;
++ size_t temp, size;
+ unsigned long flags;
+
+ ohci = buf->ohci;
++ size = PAGE_SIZE;
+
+ /* display control and bulk lists together, for simplicity */
+ spin_lock_irqsave (&ohci->lock, flags);
+- temp = show_list(ohci, buf->page, buf->count, ohci->ed_controltail);
+- temp += show_list(ohci, buf->page + temp, buf->count - temp,
++ temp = show_list(ohci, buf->page, size, ohci->ed_controltail);
++ temp += show_list(ohci, buf->page + temp, size - temp,
+ ohci->ed_bulktail);
+ spin_unlock_irqrestore (&ohci->lock, flags);
+
+diff --git a/drivers/usb/host/ohci-hcd.c b/drivers/usb/host/ohci-hcd.c
+index f98d03f3144c..a21a36500fd7 100644
+--- a/drivers/usb/host/ohci-hcd.c
++++ b/drivers/usb/host/ohci-hcd.c
+@@ -76,8 +76,8 @@ static const char hcd_name [] = "ohci_hcd";
+ #include "ohci.h"
+ #include "pci-quirks.h"
+
+-static void ohci_dump (struct ohci_hcd *ohci, int verbose);
+-static void ohci_stop (struct usb_hcd *hcd);
++static void ohci_dump(struct ohci_hcd *ohci);
++static void ohci_stop(struct usb_hcd *hcd);
+
+ #include "ohci-hub.c"
+ #include "ohci-dbg.c"
+@@ -744,7 +744,7 @@ retry:
+ ohci->ed_to_check = NULL;
+ }
+
+- ohci_dump (ohci, 1);
++ ohci_dump(ohci);
+
+ return 0;
+ }
+@@ -825,7 +825,7 @@ static irqreturn_t ohci_irq (struct usb_hcd *hcd)
+ usb_hc_died(hcd);
+ }
+
+- ohci_dump (ohci, 1);
++ ohci_dump(ohci);
+ ohci_usb_reset (ohci);
+ }
+
+@@ -925,7 +925,7 @@ static void ohci_stop (struct usb_hcd *hcd)
+ {
+ struct ohci_hcd *ohci = hcd_to_ohci (hcd);
+
+- ohci_dump (ohci, 1);
++ ohci_dump(ohci);
+
+ if (quirk_nec(ohci))
+ flush_work(&ohci->nec_work);
+diff --git a/drivers/usb/host/ohci-q.c b/drivers/usb/host/ohci-q.c
+index d4253e319428..a8bde5b8cbdd 100644
+--- a/drivers/usb/host/ohci-q.c
++++ b/drivers/usb/host/ohci-q.c
+@@ -311,8 +311,7 @@ static void periodic_unlink (struct ohci_hcd *ohci, struct ed *ed)
+ * - ED_OPER: when there's any request queued, the ED gets rescheduled
+ * immediately. HC should be working on them.
+ *
+- * - ED_IDLE: when there's no TD queue. there's no reason for the HC
+- * to care about this ED; safe to disable the endpoint.
++ * - ED_IDLE: when there's no TD queue or the HC isn't running.
+ *
+ * When finish_unlinks() runs later, after SOF interrupt, it will often
+ * complete one or more URB unlinks before making that state change.
+@@ -926,6 +925,10 @@ rescan_all:
+ int completed, modified;
+ __hc32 *prev;
+
++ /* Is this ED already invisible to the hardware? */
++ if (ed->state == ED_IDLE)
++ goto ed_idle;
++
+ /* only take off EDs that the HC isn't using, accounting for
+ * frame counter wraps and EDs with partially retired TDs
+ */
+@@ -955,12 +958,20 @@ skip_ed:
+ }
+ }
+
++ /* ED's now officially unlinked, hc doesn't see */
++ ed->state = ED_IDLE;
++ if (quirk_zfmicro(ohci) && ed->type == PIPE_INTERRUPT)
++ ohci->eds_scheduled--;
++ ed->hwHeadP &= ~cpu_to_hc32(ohci, ED_H);
++ ed->hwNextED = 0;
++ wmb();
++ ed->hwINFO &= ~cpu_to_hc32(ohci, ED_SKIP | ED_DEQUEUE);
++ed_idle:
++
+ /* reentrancy: if we drop the schedule lock, someone might
+ * have modified this list. normally it's just prepending
+ * entries (which we'd ignore), but paranoia won't hurt.
+ */
+- *last = ed->ed_next;
+- ed->ed_next = NULL;
+ modified = 0;
+
+ /* unlink urbs as requested, but rescan the list after
+@@ -1018,19 +1029,20 @@ rescan_this:
+ if (completed && !list_empty (&ed->td_list))
+ goto rescan_this;
+
+- /* ED's now officially unlinked, hc doesn't see */
+- ed->state = ED_IDLE;
+- if (quirk_zfmicro(ohci) && ed->type == PIPE_INTERRUPT)
+- ohci->eds_scheduled--;
+- ed->hwHeadP &= ~cpu_to_hc32(ohci, ED_H);
+- ed->hwNextED = 0;
+- wmb ();
+- ed->hwINFO &= ~cpu_to_hc32 (ohci, ED_SKIP | ED_DEQUEUE);
+-
+- /* but if there's work queued, reschedule */
+- if (!list_empty (&ed->td_list)) {
+- if (ohci->rh_state == OHCI_RH_RUNNING)
+- ed_schedule (ohci, ed);
++ /*
++ * If no TDs are queued, take ED off the ed_rm_list.
++ * Otherwise, if the HC is running, reschedule.
++ * If not, leave it on the list for further dequeues.
++ */
++ if (list_empty(&ed->td_list)) {
++ *last = ed->ed_next;
++ ed->ed_next = NULL;
++ } else if (ohci->rh_state == OHCI_RH_RUNNING) {
++ *last = ed->ed_next;
++ ed->ed_next = NULL;
++ ed_schedule(ohci, ed);
++ } else {
++ last = &ed->ed_next;
+ }
+
+ if (modified)
+diff --git a/drivers/usb/host/xhci-pci.c b/drivers/usb/host/xhci-pci.c
+index e20520f42753..994a36e582ca 100644
+--- a/drivers/usb/host/xhci-pci.c
++++ b/drivers/usb/host/xhci-pci.c
+@@ -101,6 +101,10 @@ static void xhci_pci_quirks(struct device *dev, struct xhci_hcd *xhci)
+ /* AMD PLL quirk */
+ if (pdev->vendor == PCI_VENDOR_ID_AMD && usb_amd_find_chipset_info())
+ xhci->quirks |= XHCI_AMD_PLL_FIX;
++
++ if (pdev->vendor == PCI_VENDOR_ID_AMD)
++ xhci->quirks |= XHCI_TRUST_TX_LENGTH;
++
+ if (pdev->vendor == PCI_VENDOR_ID_INTEL) {
+ xhci->quirks |= XHCI_LPM_SUPPORT;
+ xhci->quirks |= XHCI_INTEL_HOST;
+@@ -143,6 +147,7 @@ static void xhci_pci_quirks(struct device *dev, struct xhci_hcd *xhci)
+ pdev->device == PCI_DEVICE_ID_ASROCK_P67) {
+ xhci->quirks |= XHCI_RESET_ON_RESUME;
+ xhci->quirks |= XHCI_TRUST_TX_LENGTH;
++ xhci->quirks |= XHCI_BROKEN_STREAMS;
+ }
+ if (pdev->vendor == PCI_VENDOR_ID_RENESAS &&
+ pdev->device == 0x0015)
+@@ -150,6 +155,11 @@ static void xhci_pci_quirks(struct device *dev, struct xhci_hcd *xhci)
+ if (pdev->vendor == PCI_VENDOR_ID_VIA)
+ xhci->quirks |= XHCI_RESET_ON_RESUME;
+
++ /* See https://bugzilla.kernel.org/show_bug.cgi?id=79511 */
++ if (pdev->vendor == PCI_VENDOR_ID_VIA &&
++ pdev->device == 0x3432)
++ xhci->quirks |= XHCI_BROKEN_STREAMS;
++
+ if (xhci->quirks & XHCI_RESET_ON_RESUME)
+ xhci_dbg_trace(xhci, trace_xhci_dbg_quirks,
+ "QUIRK: Resetting on resume");
+@@ -230,7 +240,8 @@ static int xhci_pci_probe(struct pci_dev *dev, const struct pci_device_id *id)
+ goto put_usb3_hcd;
+ /* Roothub already marked as USB 3.0 speed */
+
+- if (HCC_MAX_PSA(xhci->hcc_params) >= 4)
++ if (!(xhci->quirks & XHCI_BROKEN_STREAMS) &&
++ HCC_MAX_PSA(xhci->hcc_params) >= 4)
+ xhci->shared_hcd->can_do_streams = 1;
+
+ /* USB-2 and USB-3 roothubs initialized, allow runtime pm suspend */
+diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
+index 749fc68eb5c1..28a929d45cfe 100644
+--- a/drivers/usb/host/xhci-ring.c
++++ b/drivers/usb/host/xhci-ring.c
+@@ -364,32 +364,6 @@ static void ring_doorbell_for_active_rings(struct xhci_hcd *xhci,
+ }
+ }
+
+-/*
+- * Find the segment that trb is in. Start searching in start_seg.
+- * If we must move past a segment that has a link TRB with a toggle cycle state
+- * bit set, then we will toggle the value pointed at by cycle_state.
+- */
+-static struct xhci_segment *find_trb_seg(
+- struct xhci_segment *start_seg,
+- union xhci_trb *trb, int *cycle_state)
+-{
+- struct xhci_segment *cur_seg = start_seg;
+- struct xhci_generic_trb *generic_trb;
+-
+- while (cur_seg->trbs > trb ||
+- &cur_seg->trbs[TRBS_PER_SEGMENT - 1] < trb) {
+- generic_trb = &cur_seg->trbs[TRBS_PER_SEGMENT - 1].generic;
+- if (generic_trb->field[3] & cpu_to_le32(LINK_TOGGLE))
+- *cycle_state ^= 0x1;
+- cur_seg = cur_seg->next;
+- if (cur_seg == start_seg)
+- /* Looped over the entire list. Oops! */
+- return NULL;
+- }
+- return cur_seg;
+-}
+-
+-
+ static struct xhci_ring *xhci_triad_to_transfer_ring(struct xhci_hcd *xhci,
+ unsigned int slot_id, unsigned int ep_index,
+ unsigned int stream_id)
+@@ -459,9 +433,12 @@ void xhci_find_new_dequeue_state(struct xhci_hcd *xhci,
+ struct xhci_virt_device *dev = xhci->devs[slot_id];
+ struct xhci_virt_ep *ep = &dev->eps[ep_index];
+ struct xhci_ring *ep_ring;
+- struct xhci_generic_trb *trb;
++ struct xhci_segment *new_seg;
++ union xhci_trb *new_deq;
+ dma_addr_t addr;
+ u64 hw_dequeue;
++ bool cycle_found = false;
++ bool td_last_trb_found = false;
+
+ ep_ring = xhci_triad_to_transfer_ring(xhci, slot_id,
+ ep_index, stream_id);
+@@ -486,45 +463,45 @@ void xhci_find_new_dequeue_state(struct xhci_hcd *xhci,
+ hw_dequeue = le64_to_cpu(ep_ctx->deq);
+ }
+
+- /* Find virtual address and segment of hardware dequeue pointer */
+- state->new_deq_seg = ep_ring->deq_seg;
+- state->new_deq_ptr = ep_ring->dequeue;
+- while (xhci_trb_virt_to_dma(state->new_deq_seg, state->new_deq_ptr)
+- != (dma_addr_t)(hw_dequeue & ~0xf)) {
+- next_trb(xhci, ep_ring, &state->new_deq_seg,
+- &state->new_deq_ptr);
+- if (state->new_deq_ptr == ep_ring->dequeue) {
+- WARN_ON(1);
+- return;
+- }
+- }
++ new_seg = ep_ring->deq_seg;
++ new_deq = ep_ring->dequeue;
++ state->new_cycle_state = hw_dequeue & 0x1;
++
+ /*
+- * Find cycle state for last_trb, starting at old cycle state of
+- * hw_dequeue. If there is only one segment ring, find_trb_seg() will
+- * return immediately and cannot toggle the cycle state if this search
+- * wraps around, so add one more toggle manually in that case.
++ * We want to find the pointer, segment and cycle state of the new trb
++ * (the one after current TD's last_trb). We know the cycle state at
++ * hw_dequeue, so walk the ring until both hw_dequeue and last_trb are
++ * found.
+ */
+- state->new_cycle_state = hw_dequeue & 0x1;
+- if (ep_ring->first_seg == ep_ring->first_seg->next &&
+- cur_td->last_trb < state->new_deq_ptr)
+- state->new_cycle_state ^= 0x1;
++ do {
++ if (!cycle_found && xhci_trb_virt_to_dma(new_seg, new_deq)
++ == (dma_addr_t)(hw_dequeue & ~0xf)) {
++ cycle_found = true;
++ if (td_last_trb_found)
++ break;
++ }
++ if (new_deq == cur_td->last_trb)
++ td_last_trb_found = true;
+
+- state->new_deq_ptr = cur_td->last_trb;
+- xhci_dbg_trace(xhci, trace_xhci_dbg_cancel_urb,
+- "Finding segment containing last TRB in TD.");
+- state->new_deq_seg = find_trb_seg(state->new_deq_seg,
+- state->new_deq_ptr, &state->new_cycle_state);
+- if (!state->new_deq_seg) {
+- WARN_ON(1);
+- return;
+- }
++ if (cycle_found &&
++ TRB_TYPE_LINK_LE32(new_deq->generic.field[3]) &&
++ new_deq->generic.field[3] & cpu_to_le32(LINK_TOGGLE))
++ state->new_cycle_state ^= 0x1;
++
++ next_trb(xhci, ep_ring, &new_seg, &new_deq);
++
++ /* Search wrapped around, bail out */
++ if (new_deq == ep->ring->dequeue) {
++ xhci_err(xhci, "Error: Failed finding new dequeue state\n");
++ state->new_deq_seg = NULL;
++ state->new_deq_ptr = NULL;
++ return;
++ }
++
++ } while (!cycle_found || !td_last_trb_found);
+
+- /* Increment to find next TRB after last_trb. Cycle if appropriate. */
+- trb = &state->new_deq_ptr->generic;
+- if (TRB_TYPE_LINK_LE32(trb->field[3]) &&
+- (trb->field[3] & cpu_to_le32(LINK_TOGGLE)))
+- state->new_cycle_state ^= 0x1;
+- next_trb(xhci, ep_ring, &state->new_deq_seg, &state->new_deq_ptr);
++ state->new_deq_seg = new_seg;
++ state->new_deq_ptr = new_deq;
+
+ /* Don't update the ring cycle state for the producer (us). */
+ xhci_dbg_trace(xhci, trace_xhci_dbg_cancel_urb,
+@@ -2483,7 +2460,8 @@ static int handle_tx_event(struct xhci_hcd *xhci,
+ * last TRB of the previous TD. The command completion handle
+ * will take care the rest.
+ */
+- if (!event_seg && trb_comp_code == COMP_STOP_INVAL) {
++ if (!event_seg && (trb_comp_code == COMP_STOP ||
++ trb_comp_code == COMP_STOP_INVAL)) {
+ ret = 0;
+ goto cleanup;
+ }
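The rewritten xhci-ring.c dequeue search above replaces find_trb_seg() with a single walk that tracks two flags: whether the hardware dequeue slot has been seen (so the cycle state is known to be valid) and whether the TD's last TRB has been passed. The control flow can be modeled outside the kernel as a small sketch; the ring, types, and helper name here are toy stand-ins, not kernel API:

```c
#include <assert.h>
#include <stdbool.h>

#define RING_SIZE 8

/* Toy ring: toggle_link[i] marks a link TRB with LINK_TOGGLE set. */
struct toy_ring {
    bool toggle_link[RING_SIZE];
};

/*
 * Walk the ring from 'start' until both the hardware dequeue slot and
 * the TD's last TRB have been seen, mirroring the do/while loop the
 * patch adds to xhci_find_new_dequeue_state().  Returns the new
 * dequeue index, or -1 if the search wraps around (the bail-out case),
 * and updates *cycle as link TRBs with the toggle bit are crossed.
 */
static int find_new_dequeue(const struct toy_ring *ring, int start,
                            int hw_dequeue, int last_trb, int *cycle)
{
    bool cycle_found = false, td_last_trb_found = false;
    int i = start;

    do {
        if (!cycle_found && i == hw_dequeue) {
            cycle_found = true;
            if (td_last_trb_found)
                break;          /* hw dequeue already past last_trb */
        }
        if (i == last_trb)
            td_last_trb_found = true;

        /* Only toggle once the cycle state is known to be valid here. */
        if (cycle_found && ring->toggle_link[i])
            *cycle ^= 1;

        i = (i + 1) % RING_SIZE;
        if (i == start)
            return -1;          /* wrapped: no valid dequeue state */
    } while (!cycle_found || !td_last_trb_found);

    return i;
}
```

In the normal case the result is the slot just past last_trb; when the controller's dequeue pointer is already beyond last_trb, the break path returns the hardware position itself, which is why the caller (xhci_cleanup_stalled_ring) now checks for the NULL/failed outcome before issuing a Set TR Dequeue command.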
+diff --git a/drivers/usb/host/xhci.c b/drivers/usb/host/xhci.c
+index 7436d5f5e67a..e32cc6cf86dc 100644
+--- a/drivers/usb/host/xhci.c
++++ b/drivers/usb/host/xhci.c
+@@ -2891,6 +2891,9 @@ void xhci_cleanup_stalled_ring(struct xhci_hcd *xhci,
+ ep_index, ep->stopped_stream, ep->stopped_td,
+ &deq_state);
+
++ if (!deq_state.new_deq_ptr || !deq_state.new_deq_seg)
++ return;
++
+ /* HW with the reset endpoint quirk will use the saved dequeue state to
+ * issue a configure endpoint command later.
+ */
+@@ -3163,7 +3166,8 @@ int xhci_alloc_streams(struct usb_hcd *hcd, struct usb_device *udev,
+ num_streams);
+
+ /* MaxPSASize value 0 (2 streams) means streams are not supported */
+- if (HCC_MAX_PSA(xhci->hcc_params) < 4) {
++ if ((xhci->quirks & XHCI_BROKEN_STREAMS) ||
++ HCC_MAX_PSA(xhci->hcc_params) < 4) {
+ xhci_dbg(xhci, "xHCI controller does not support streams.\n");
+ return -ENOSYS;
+ }
+diff --git a/drivers/usb/host/xhci.h b/drivers/usb/host/xhci.h
+index 9ffecd56600d..dace5152e179 100644
+--- a/drivers/usb/host/xhci.h
++++ b/drivers/usb/host/xhci.h
+@@ -1558,6 +1558,8 @@ struct xhci_hcd {
+ #define XHCI_PLAT (1 << 16)
+ #define XHCI_SLOW_SUSPEND (1 << 17)
+ #define XHCI_SPURIOUS_WAKEUP (1 << 18)
++/* For controllers with a broken beyond repair streams implementation */
++#define XHCI_BROKEN_STREAMS (1 << 19)
+ unsigned int num_active_eps;
+ unsigned int limit_active_eps;
+ /* There are two roothubs to keep track of bus suspend info for */
+diff --git a/drivers/usb/serial/ftdi_sio.c b/drivers/usb/serial/ftdi_sio.c
+index 8a3813be1b28..8b0f517abb6b 100644
+--- a/drivers/usb/serial/ftdi_sio.c
++++ b/drivers/usb/serial/ftdi_sio.c
+@@ -151,6 +151,7 @@ static const struct usb_device_id id_table_combined[] = {
+ { USB_DEVICE(FTDI_VID, FTDI_AMC232_PID) },
+ { USB_DEVICE(FTDI_VID, FTDI_CANUSB_PID) },
+ { USB_DEVICE(FTDI_VID, FTDI_CANDAPTER_PID) },
++ { USB_DEVICE(FTDI_VID, FTDI_BM_ATOM_NANO_PID) },
+ { USB_DEVICE(FTDI_VID, FTDI_NXTCAM_PID) },
+ { USB_DEVICE(FTDI_VID, FTDI_EV3CON_PID) },
+ { USB_DEVICE(FTDI_VID, FTDI_SCS_DEVICE_0_PID) },
+@@ -673,6 +674,8 @@ static const struct usb_device_id id_table_combined[] = {
+ { USB_DEVICE(FTDI_VID, XSENS_CONVERTER_5_PID) },
+ { USB_DEVICE(FTDI_VID, XSENS_CONVERTER_6_PID) },
+ { USB_DEVICE(FTDI_VID, XSENS_CONVERTER_7_PID) },
++ { USB_DEVICE(XSENS_VID, XSENS_CONVERTER_PID) },
++ { USB_DEVICE(XSENS_VID, XSENS_MTW_PID) },
+ { USB_DEVICE(FTDI_VID, FTDI_OMNI1509) },
+ { USB_DEVICE(MOBILITY_VID, MOBILITY_USB_SERIAL_PID) },
+ { USB_DEVICE(FTDI_VID, FTDI_ACTIVE_ROBOTS_PID) },
+@@ -945,6 +948,8 @@ static const struct usb_device_id id_table_combined[] = {
+ { USB_DEVICE(BRAINBOXES_VID, BRAINBOXES_US_842_2_PID) },
+ { USB_DEVICE(BRAINBOXES_VID, BRAINBOXES_US_842_3_PID) },
+ { USB_DEVICE(BRAINBOXES_VID, BRAINBOXES_US_842_4_PID) },
++ /* ekey Devices */
++ { USB_DEVICE(FTDI_VID, FTDI_EKEY_CONV_USB_PID) },
+ /* Infineon Devices */
+ { USB_DEVICE_INTERFACE_NUMBER(INFINEON_VID, INFINEON_TRIBOARD_PID, 1) },
+ { } /* Terminating entry */
+diff --git a/drivers/usb/serial/ftdi_sio_ids.h b/drivers/usb/serial/ftdi_sio_ids.h
+index c4777bc6aee0..70b0b1d88ae9 100644
+--- a/drivers/usb/serial/ftdi_sio_ids.h
++++ b/drivers/usb/serial/ftdi_sio_ids.h
+@@ -42,6 +42,8 @@
+ /* www.candapter.com Ewert Energy Systems CANdapter device */
+ #define FTDI_CANDAPTER_PID 0x9F80 /* Product Id */
+
++#define FTDI_BM_ATOM_NANO_PID 0xa559 /* Basic Micro ATOM Nano USB2Serial */
++
+ /*
+ * Texas Instruments XDS100v2 JTAG / BeagleBone A3
+ * http://processors.wiki.ti.com/index.php/XDS100
+@@ -140,12 +142,15 @@
+ /*
+ * Xsens Technologies BV products (http://www.xsens.com).
+ */
+-#define XSENS_CONVERTER_0_PID 0xD388
+-#define XSENS_CONVERTER_1_PID 0xD389
++#define XSENS_VID 0x2639
++#define XSENS_CONVERTER_PID 0xD00D /* Xsens USB-serial converter */
++#define XSENS_MTW_PID 0x0200 /* Xsens MTw */
++#define XSENS_CONVERTER_0_PID 0xD388 /* Xsens USB converter */
++#define XSENS_CONVERTER_1_PID 0xD389 /* Xsens Wireless Receiver */
+ #define XSENS_CONVERTER_2_PID 0xD38A
+-#define XSENS_CONVERTER_3_PID 0xD38B
+-#define XSENS_CONVERTER_4_PID 0xD38C
+-#define XSENS_CONVERTER_5_PID 0xD38D
++#define XSENS_CONVERTER_3_PID 0xD38B /* Xsens USB-serial converter */
++#define XSENS_CONVERTER_4_PID 0xD38C /* Xsens Wireless Receiver */
++#define XSENS_CONVERTER_5_PID 0xD38D /* Xsens Awinda Station */
+ #define XSENS_CONVERTER_6_PID 0xD38E
+ #define XSENS_CONVERTER_7_PID 0xD38F
+
+@@ -1375,3 +1380,8 @@
+ #define BRAINBOXES_US_160_6_PID 0x9006 /* US-160 16xRS232 1Mbaud Port 11 and 12 */
+ #define BRAINBOXES_US_160_7_PID 0x9007 /* US-160 16xRS232 1Mbaud Port 13 and 14 */
+ #define BRAINBOXES_US_160_8_PID 0x9008 /* US-160 16xRS232 1Mbaud Port 15 and 16 */
++
++/*
++ * ekey biometric systems GmbH (http://ekey.net/)
++ */
++#define FTDI_EKEY_CONV_USB_PID 0xCB08 /* Converter USB */
+diff --git a/drivers/usb/serial/whiteheat.c b/drivers/usb/serial/whiteheat.c
+index e62f2dff8b7d..6c3734d2b45a 100644
+--- a/drivers/usb/serial/whiteheat.c
++++ b/drivers/usb/serial/whiteheat.c
+@@ -514,6 +514,10 @@ static void command_port_read_callback(struct urb *urb)
+ dev_dbg(&urb->dev->dev, "%s - command_info is NULL, exiting.\n", __func__);
+ return;
+ }
++ if (!urb->actual_length) {
++ dev_dbg(&urb->dev->dev, "%s - empty response, exiting.\n", __func__);
++ return;
++ }
+ if (status) {
+ dev_dbg(&urb->dev->dev, "%s - nonzero urb status: %d\n", __func__, status);
+ if (status != -ENOENT)
+@@ -534,7 +538,8 @@ static void command_port_read_callback(struct urb *urb)
+ /* These are unsolicited reports from the firmware, hence no
+ waiting command to wakeup */
+ dev_dbg(&urb->dev->dev, "%s - event received\n", __func__);
+- } else if (data[0] == WHITEHEAT_GET_DTR_RTS) {
++ } else if ((data[0] == WHITEHEAT_GET_DTR_RTS) &&
++ (urb->actual_length - 1 <= sizeof(command_info->result_buffer))) {
+ memcpy(command_info->result_buffer, &data[1],
+ urb->actual_length - 1);
+ command_info->command_finished = WHITEHEAT_CMD_COMPLETE;
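The whiteheat change above adds two guards: reject zero-length responses early, and only memcpy into the fixed result buffer when the payload (everything after the 1-byte opcode) actually fits. The shape of that check can be sketched as follows; the buffer size and function name are hypothetical, chosen only for illustration:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define RESULT_BUFFER_SIZE 64   /* hypothetical size, for illustration */

/*
 * Sketch of the validation the patch adds to
 * command_port_read_callback(): an empty URB carries nothing to parse
 * (and actual_length - 1 would underflow), and an oversized payload
 * must not be copied into the fixed-size result buffer.  Returns false
 * when the response is rejected.
 */
static bool copy_dtr_rts_result(unsigned char *result_buffer,
                                const unsigned char *data,
                                size_t actual_length)
{
    if (actual_length == 0)
        return false;           /* empty response: nothing to parse */
    if (actual_length - 1 > RESULT_BUFFER_SIZE)
        return false;           /* would overflow the result buffer */
    memcpy(result_buffer, &data[1], actual_length - 1);
    return true;
}
```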
+diff --git a/drivers/usb/storage/uas.c b/drivers/usb/storage/uas.c
+index 511b22953167..3f42785f653c 100644
+--- a/drivers/usb/storage/uas.c
++++ b/drivers/usb/storage/uas.c
+@@ -1026,7 +1026,7 @@ static int uas_configure_endpoints(struct uas_dev_info *devinfo)
+ usb_endpoint_num(&eps[3]->desc));
+
+ if (udev->speed != USB_SPEED_SUPER) {
+- devinfo->qdepth = 256;
++ devinfo->qdepth = 32;
+ devinfo->use_streams = 0;
+ } else {
+ devinfo->qdepth = usb_alloc_streams(devinfo->intf, eps + 1,
+diff --git a/drivers/xen/events/events_fifo.c b/drivers/xen/events/events_fifo.c
+index 84b4bfb84344..500713882ad5 100644
+--- a/drivers/xen/events/events_fifo.c
++++ b/drivers/xen/events/events_fifo.c
+@@ -67,10 +67,9 @@ static event_word_t *event_array[MAX_EVENT_ARRAY_PAGES] __read_mostly;
+ static unsigned event_array_pages __read_mostly;
+
+ /*
+- * sync_set_bit() and friends must be unsigned long aligned on non-x86
+- * platforms.
++ * sync_set_bit() and friends must be unsigned long aligned.
+ */
+-#if !defined(CONFIG_X86) && BITS_PER_LONG > 32
++#if BITS_PER_LONG > 32
+
+ #define BM(w) (unsigned long *)((unsigned long)w & ~0x7UL)
+ #define EVTCHN_FIFO_BIT(b, w) \
+diff --git a/fs/btrfs/async-thread.c b/fs/btrfs/async-thread.c
+index 5a201d81049c..fbd76ded9a34 100644
+--- a/fs/btrfs/async-thread.c
++++ b/fs/btrfs/async-thread.c
+@@ -22,7 +22,6 @@
+ #include <linux/list.h>
+ #include <linux/spinlock.h>
+ #include <linux/freezer.h>
+-#include <linux/workqueue.h>
+ #include "async-thread.h"
+ #include "ctree.h"
+
+@@ -55,8 +54,39 @@ struct btrfs_workqueue {
+ struct __btrfs_workqueue *high;
+ };
+
+-static inline struct __btrfs_workqueue
+-*__btrfs_alloc_workqueue(const char *name, int flags, int max_active,
++static void normal_work_helper(struct btrfs_work *work);
++
++#define BTRFS_WORK_HELPER(name) \
++void btrfs_##name(struct work_struct *arg) \
++{ \
++ struct btrfs_work *work = container_of(arg, struct btrfs_work, \
++ normal_work); \
++ normal_work_helper(work); \
++}
++
++BTRFS_WORK_HELPER(worker_helper);
++BTRFS_WORK_HELPER(delalloc_helper);
++BTRFS_WORK_HELPER(flush_delalloc_helper);
++BTRFS_WORK_HELPER(cache_helper);
++BTRFS_WORK_HELPER(submit_helper);
++BTRFS_WORK_HELPER(fixup_helper);
++BTRFS_WORK_HELPER(endio_helper);
++BTRFS_WORK_HELPER(endio_meta_helper);
++BTRFS_WORK_HELPER(endio_meta_write_helper);
++BTRFS_WORK_HELPER(endio_raid56_helper);
++BTRFS_WORK_HELPER(rmw_helper);
++BTRFS_WORK_HELPER(endio_write_helper);
++BTRFS_WORK_HELPER(freespace_write_helper);
++BTRFS_WORK_HELPER(delayed_meta_helper);
++BTRFS_WORK_HELPER(readahead_helper);
++BTRFS_WORK_HELPER(qgroup_rescan_helper);
++BTRFS_WORK_HELPER(extent_refs_helper);
++BTRFS_WORK_HELPER(scrub_helper);
++BTRFS_WORK_HELPER(scrubwrc_helper);
++BTRFS_WORK_HELPER(scrubnc_helper);
++
++static struct __btrfs_workqueue *
++__btrfs_alloc_workqueue(const char *name, int flags, int max_active,
+ int thresh)
+ {
+ struct __btrfs_workqueue *ret = kzalloc(sizeof(*ret), GFP_NOFS);
+@@ -232,13 +262,11 @@ static void run_ordered_work(struct __btrfs_workqueue *wq)
+ spin_unlock_irqrestore(lock, flags);
+ }
+
+-static void normal_work_helper(struct work_struct *arg)
++static void normal_work_helper(struct btrfs_work *work)
+ {
+- struct btrfs_work *work;
+ struct __btrfs_workqueue *wq;
+ int need_order = 0;
+
+- work = container_of(arg, struct btrfs_work, normal_work);
+ /*
+ * We should not touch things inside work in the following cases:
+ * 1) after work->func() if it has no ordered_free
+@@ -262,7 +290,7 @@ static void normal_work_helper(struct work_struct *arg)
+ trace_btrfs_all_work_done(work);
+ }
+
+-void btrfs_init_work(struct btrfs_work *work,
++void btrfs_init_work(struct btrfs_work *work, btrfs_work_func_t uniq_func,
+ btrfs_func_t func,
+ btrfs_func_t ordered_func,
+ btrfs_func_t ordered_free)
+@@ -270,7 +298,7 @@ void btrfs_init_work(struct btrfs_work *work,
+ work->func = func;
+ work->ordered_func = ordered_func;
+ work->ordered_free = ordered_free;
+- INIT_WORK(&work->normal_work, normal_work_helper);
++ INIT_WORK(&work->normal_work, uniq_func);
+ INIT_LIST_HEAD(&work->ordered_list);
+ work->flags = 0;
+ }
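The BTRFS_WORK_HELPER() macro introduced above stamps out one thin wrapper per work type, each recovering the btrfs_work from its embedded work_struct and forwarding to the shared normal_work_helper(). The point is that every work type now has a distinct work function address, so the workqueue machinery no longer treats unrelated btrfs work items as the same work. A minimal userspace model of the macro, with toy stand-ins for the kernel types:

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for the kernel structures, for illustration only. */
struct work_struct { int dummy; };
struct btrfs_work {
    struct work_struct normal_work;
    int ran;
};

static void normal_work_helper(struct btrfs_work *work)
{
    work->ran = 1;
}

/* Classic container_of arithmetic: member pointer back to its parent. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/*
 * Mirror of BTRFS_WORK_HELPER(): each expansion is a separate function,
 * so btrfs_endio_helper and btrfs_delalloc_helper have different
 * addresses even though both just dispatch to normal_work_helper().
 */
#define BTRFS_WORK_HELPER(name)                                        \
    static void btrfs_##name(struct work_struct *arg)                  \
    {                                                                  \
        struct btrfs_work *work =                                      \
            container_of(arg, struct btrfs_work, normal_work);         \
        normal_work_helper(work);                                      \
    }

BTRFS_WORK_HELPER(endio_helper)
BTRFS_WORK_HELPER(delalloc_helper)
```

This is why the later hunks thread a `btrfs_work_func_t uniq_func` through btrfs_init_work(): callers pick the wrapper matching the queue they submit to, and INIT_WORK records that unique function instead of the shared helper.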
+diff --git a/fs/btrfs/async-thread.h b/fs/btrfs/async-thread.h
+index 9c6b66d15fb0..e9e31c94758f 100644
+--- a/fs/btrfs/async-thread.h
++++ b/fs/btrfs/async-thread.h
+@@ -19,12 +19,14 @@
+
+ #ifndef __BTRFS_ASYNC_THREAD_
+ #define __BTRFS_ASYNC_THREAD_
++#include <linux/workqueue.h>
+
+ struct btrfs_workqueue;
+ /* Internal use only */
+ struct __btrfs_workqueue;
+ struct btrfs_work;
+ typedef void (*btrfs_func_t)(struct btrfs_work *arg);
++typedef void (*btrfs_work_func_t)(struct work_struct *arg);
+
+ struct btrfs_work {
+ btrfs_func_t func;
+@@ -38,11 +40,35 @@ struct btrfs_work {
+ unsigned long flags;
+ };
+
++#define BTRFS_WORK_HELPER_PROTO(name) \
++void btrfs_##name(struct work_struct *arg)
++
++BTRFS_WORK_HELPER_PROTO(worker_helper);
++BTRFS_WORK_HELPER_PROTO(delalloc_helper);
++BTRFS_WORK_HELPER_PROTO(flush_delalloc_helper);
++BTRFS_WORK_HELPER_PROTO(cache_helper);
++BTRFS_WORK_HELPER_PROTO(submit_helper);
++BTRFS_WORK_HELPER_PROTO(fixup_helper);
++BTRFS_WORK_HELPER_PROTO(endio_helper);
++BTRFS_WORK_HELPER_PROTO(endio_meta_helper);
++BTRFS_WORK_HELPER_PROTO(endio_meta_write_helper);
++BTRFS_WORK_HELPER_PROTO(endio_raid56_helper);
++BTRFS_WORK_HELPER_PROTO(rmw_helper);
++BTRFS_WORK_HELPER_PROTO(endio_write_helper);
++BTRFS_WORK_HELPER_PROTO(freespace_write_helper);
++BTRFS_WORK_HELPER_PROTO(delayed_meta_helper);
++BTRFS_WORK_HELPER_PROTO(readahead_helper);
++BTRFS_WORK_HELPER_PROTO(qgroup_rescan_helper);
++BTRFS_WORK_HELPER_PROTO(extent_refs_helper);
++BTRFS_WORK_HELPER_PROTO(scrub_helper);
++BTRFS_WORK_HELPER_PROTO(scrubwrc_helper);
++BTRFS_WORK_HELPER_PROTO(scrubnc_helper);
++
+ struct btrfs_workqueue *btrfs_alloc_workqueue(const char *name,
+ int flags,
+ int max_active,
+ int thresh);
+-void btrfs_init_work(struct btrfs_work *work,
++void btrfs_init_work(struct btrfs_work *work, btrfs_work_func_t helper,
+ btrfs_func_t func,
+ btrfs_func_t ordered_func,
+ btrfs_func_t ordered_free);
+diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
+index e25564bfcb46..54a201dac7f9 100644
+--- a/fs/btrfs/backref.c
++++ b/fs/btrfs/backref.c
+@@ -276,9 +276,8 @@ static int add_all_parents(struct btrfs_root *root, struct btrfs_path *path,
+ }
+ if (ret > 0)
+ goto next;
+- ret = ulist_add_merge(parents, eb->start,
+- (uintptr_t)eie,
+- (u64 *)&old, GFP_NOFS);
++ ret = ulist_add_merge_ptr(parents, eb->start,
++ eie, (void **)&old, GFP_NOFS);
+ if (ret < 0)
+ break;
+ if (!ret && extent_item_pos) {
+@@ -1001,16 +1000,19 @@ again:
+ ret = -EIO;
+ goto out;
+ }
++ btrfs_tree_read_lock(eb);
++ btrfs_set_lock_blocking_rw(eb, BTRFS_READ_LOCK);
+ ret = find_extent_in_eb(eb, bytenr,
+ *extent_item_pos, &eie);
++ btrfs_tree_read_unlock_blocking(eb);
+ free_extent_buffer(eb);
+ if (ret < 0)
+ goto out;
+ ref->inode_list = eie;
+ }
+- ret = ulist_add_merge(refs, ref->parent,
+- (uintptr_t)ref->inode_list,
+- (u64 *)&eie, GFP_NOFS);
++ ret = ulist_add_merge_ptr(refs, ref->parent,
++ ref->inode_list,
++ (void **)&eie, GFP_NOFS);
+ if (ret < 0)
+ goto out;
+ if (!ret && extent_item_pos) {
+diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
+index 4794923c410c..43527fd78825 100644
+--- a/fs/btrfs/btrfs_inode.h
++++ b/fs/btrfs/btrfs_inode.h
+@@ -84,12 +84,6 @@ struct btrfs_inode {
+ */
+ struct list_head delalloc_inodes;
+
+- /*
+- * list for tracking inodes that must be sent to disk before a
+- * rename or truncate commit
+- */
+- struct list_head ordered_operations;
+-
+ /* node for the red-black tree that links inodes in subvolume root */
+ struct rb_node rb_node;
+
+diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
+index da775bfdebc9..a2e90f855d7d 100644
+--- a/fs/btrfs/delayed-inode.c
++++ b/fs/btrfs/delayed-inode.c
+@@ -1395,8 +1395,8 @@ static int btrfs_wq_run_delayed_node(struct btrfs_delayed_root *delayed_root,
+ return -ENOMEM;
+
+ async_work->delayed_root = delayed_root;
+- btrfs_init_work(&async_work->work, btrfs_async_run_delayed_root,
+- NULL, NULL);
++ btrfs_init_work(&async_work->work, btrfs_delayed_meta_helper,
++ btrfs_async_run_delayed_root, NULL, NULL);
+ async_work->nr = nr;
+
+ btrfs_queue_work(root->fs_info->delayed_workers, &async_work->work);
+diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
+index 08e65e9cf2aa..0229c3720b30 100644
+--- a/fs/btrfs/disk-io.c
++++ b/fs/btrfs/disk-io.c
+@@ -39,7 +39,6 @@
+ #include "btrfs_inode.h"
+ #include "volumes.h"
+ #include "print-tree.h"
+-#include "async-thread.h"
+ #include "locking.h"
+ #include "tree-log.h"
+ #include "free-space-cache.h"
+@@ -60,8 +59,6 @@ static void end_workqueue_fn(struct btrfs_work *work);
+ static void free_fs_root(struct btrfs_root *root);
+ static int btrfs_check_super_valid(struct btrfs_fs_info *fs_info,
+ int read_only);
+-static void btrfs_destroy_ordered_operations(struct btrfs_transaction *t,
+- struct btrfs_root *root);
+ static void btrfs_destroy_ordered_extents(struct btrfs_root *root);
+ static int btrfs_destroy_delayed_refs(struct btrfs_transaction *trans,
+ struct btrfs_root *root);
+@@ -695,35 +692,41 @@ static void end_workqueue_bio(struct bio *bio, int err)
+ {
+ struct end_io_wq *end_io_wq = bio->bi_private;
+ struct btrfs_fs_info *fs_info;
++ struct btrfs_workqueue *wq;
++ btrfs_work_func_t func;
+
+ fs_info = end_io_wq->info;
+ end_io_wq->error = err;
+- btrfs_init_work(&end_io_wq->work, end_workqueue_fn, NULL, NULL);
+
+ if (bio->bi_rw & REQ_WRITE) {
+- if (end_io_wq->metadata == BTRFS_WQ_ENDIO_METADATA)
+- btrfs_queue_work(fs_info->endio_meta_write_workers,
+- &end_io_wq->work);
+- else if (end_io_wq->metadata == BTRFS_WQ_ENDIO_FREE_SPACE)
+- btrfs_queue_work(fs_info->endio_freespace_worker,
+- &end_io_wq->work);
+- else if (end_io_wq->metadata == BTRFS_WQ_ENDIO_RAID56)
+- btrfs_queue_work(fs_info->endio_raid56_workers,
+- &end_io_wq->work);
+- else
+- btrfs_queue_work(fs_info->endio_write_workers,
+- &end_io_wq->work);
++ if (end_io_wq->metadata == BTRFS_WQ_ENDIO_METADATA) {
++ wq = fs_info->endio_meta_write_workers;
++ func = btrfs_endio_meta_write_helper;
++ } else if (end_io_wq->metadata == BTRFS_WQ_ENDIO_FREE_SPACE) {
++ wq = fs_info->endio_freespace_worker;
++ func = btrfs_freespace_write_helper;
++ } else if (end_io_wq->metadata == BTRFS_WQ_ENDIO_RAID56) {
++ wq = fs_info->endio_raid56_workers;
++ func = btrfs_endio_raid56_helper;
++ } else {
++ wq = fs_info->endio_write_workers;
++ func = btrfs_endio_write_helper;
++ }
+ } else {
+- if (end_io_wq->metadata == BTRFS_WQ_ENDIO_RAID56)
+- btrfs_queue_work(fs_info->endio_raid56_workers,
+- &end_io_wq->work);
+- else if (end_io_wq->metadata)
+- btrfs_queue_work(fs_info->endio_meta_workers,
+- &end_io_wq->work);
+- else
+- btrfs_queue_work(fs_info->endio_workers,
+- &end_io_wq->work);
++ if (end_io_wq->metadata == BTRFS_WQ_ENDIO_RAID56) {
++ wq = fs_info->endio_raid56_workers;
++ func = btrfs_endio_raid56_helper;
++ } else if (end_io_wq->metadata) {
++ wq = fs_info->endio_meta_workers;
++ func = btrfs_endio_meta_helper;
++ } else {
++ wq = fs_info->endio_workers;
++ func = btrfs_endio_helper;
++ }
+ }
++
++ btrfs_init_work(&end_io_wq->work, func, end_workqueue_fn, NULL, NULL);
++ btrfs_queue_work(wq, &end_io_wq->work);
+ }
+
+ /*
+@@ -830,7 +833,7 @@ int btrfs_wq_submit_bio(struct btrfs_fs_info *fs_info, struct inode *inode,
+ async->submit_bio_start = submit_bio_start;
+ async->submit_bio_done = submit_bio_done;
+
+- btrfs_init_work(&async->work, run_one_async_start,
++ btrfs_init_work(&async->work, btrfs_worker_helper, run_one_async_start,
+ run_one_async_done, run_one_async_free);
+
+ async->bio_flags = bio_flags;
+@@ -3829,34 +3832,6 @@ static void btrfs_error_commit_super(struct btrfs_root *root)
+ btrfs_cleanup_transaction(root);
+ }
+
+-static void btrfs_destroy_ordered_operations(struct btrfs_transaction *t,
+- struct btrfs_root *root)
+-{
+- struct btrfs_inode *btrfs_inode;
+- struct list_head splice;
+-
+- INIT_LIST_HEAD(&splice);
+-
+- mutex_lock(&root->fs_info->ordered_operations_mutex);
+- spin_lock(&root->fs_info->ordered_root_lock);
+-
+- list_splice_init(&t->ordered_operations, &splice);
+- while (!list_empty(&splice)) {
+- btrfs_inode = list_entry(splice.next, struct btrfs_inode,
+- ordered_operations);
+-
+- list_del_init(&btrfs_inode->ordered_operations);
+- spin_unlock(&root->fs_info->ordered_root_lock);
+-
+- btrfs_invalidate_inodes(btrfs_inode->root);
+-
+- spin_lock(&root->fs_info->ordered_root_lock);
+- }
+-
+- spin_unlock(&root->fs_info->ordered_root_lock);
+- mutex_unlock(&root->fs_info->ordered_operations_mutex);
+-}
+-
+ static void btrfs_destroy_ordered_extents(struct btrfs_root *root)
+ {
+ struct btrfs_ordered_extent *ordered;
+@@ -4093,8 +4068,6 @@ again:
+ void btrfs_cleanup_one_transaction(struct btrfs_transaction *cur_trans,
+ struct btrfs_root *root)
+ {
+- btrfs_destroy_ordered_operations(cur_trans, root);
+-
+ btrfs_destroy_delayed_refs(cur_trans, root);
+
+ cur_trans->state = TRANS_STATE_COMMIT_START;
+diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
+index 813537f362f9..8edb9fcc38d5 100644
+--- a/fs/btrfs/extent-tree.c
++++ b/fs/btrfs/extent-tree.c
+@@ -552,7 +552,8 @@ static int cache_block_group(struct btrfs_block_group_cache *cache,
+ caching_ctl->block_group = cache;
+ caching_ctl->progress = cache->key.objectid;
+ atomic_set(&caching_ctl->count, 1);
+- btrfs_init_work(&caching_ctl->work, caching_thread, NULL, NULL);
++ btrfs_init_work(&caching_ctl->work, btrfs_cache_helper,
++ caching_thread, NULL, NULL);
+
+ spin_lock(&cache->lock);
+ /*
+@@ -2749,8 +2750,8 @@ int btrfs_async_run_delayed_refs(struct btrfs_root *root,
+ async->sync = 0;
+ init_completion(&async->wait);
+
+- btrfs_init_work(&async->work, delayed_ref_async_start,
+- NULL, NULL);
++ btrfs_init_work(&async->work, btrfs_extent_refs_helper,
++ delayed_ref_async_start, NULL, NULL);
+
+ btrfs_queue_work(root->fs_info->extent_workers, &async->work);
+
+diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
+index a389820d158b..09b4e3165e2c 100644
+--- a/fs/btrfs/extent_io.c
++++ b/fs/btrfs/extent_io.c
+@@ -2532,6 +2532,7 @@ static void end_bio_extent_readpage(struct bio *bio, int err)
+ test_bit(BIO_UPTODATE, &bio->bi_flags);
+ if (err)
+ uptodate = 0;
++ offset += len;
+ continue;
+ }
+ }
+diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
+index f46cfe45d686..54c84daec9b5 100644
+--- a/fs/btrfs/file-item.c
++++ b/fs/btrfs/file-item.c
+@@ -756,7 +756,7 @@ again:
+ found_next = 1;
+ if (ret != 0)
+ goto insert;
+- slot = 0;
++ slot = path->slots[0];
+ }
+ btrfs_item_key_to_cpu(path->nodes[0], &found_key, slot);
+ if (found_key.objectid != BTRFS_EXTENT_CSUM_OBJECTID ||
+diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
+index 1f2b99cb55ea..ab1fd668020d 100644
+--- a/fs/btrfs/file.c
++++ b/fs/btrfs/file.c
+@@ -1838,6 +1838,8 @@ out:
+
+ int btrfs_release_file(struct inode *inode, struct file *filp)
+ {
++ if (filp->private_data)
++ btrfs_ioctl_trans_end(filp);
+ /*
+ * ordered_data_close is set by settattr when we are about to truncate
+ * a file from a non-zero size to a zero size. This tries to
+@@ -1845,26 +1847,8 @@ int btrfs_release_file(struct inode *inode, struct file *filp)
+ * application were using truncate to replace a file in place.
+ */
+ if (test_and_clear_bit(BTRFS_INODE_ORDERED_DATA_CLOSE,
+- &BTRFS_I(inode)->runtime_flags)) {
+- struct btrfs_trans_handle *trans;
+- struct btrfs_root *root = BTRFS_I(inode)->root;
+-
+- /*
+- * We need to block on a committing transaction to keep us from
+- * throwing a ordered operation on to the list and causing
+- * something like sync to deadlock trying to flush out this
+- * inode.
+- */
+- trans = btrfs_start_transaction(root, 0);
+- if (IS_ERR(trans))
+- return PTR_ERR(trans);
+- btrfs_add_ordered_operation(trans, BTRFS_I(inode)->root, inode);
+- btrfs_end_transaction(trans, root);
+- if (inode->i_size > BTRFS_ORDERED_OPERATIONS_FLUSH_LIMIT)
++ &BTRFS_I(inode)->runtime_flags))
+ filemap_flush(inode->i_mapping);
+- }
+- if (filp->private_data)
+- btrfs_ioctl_trans_end(filp);
+ return 0;
+ }
+
+diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
+index 3668048e16f8..c6cd34e699d0 100644
+--- a/fs/btrfs/inode.c
++++ b/fs/btrfs/inode.c
+@@ -709,6 +709,18 @@ retry:
+ unlock_extent(io_tree, async_extent->start,
+ async_extent->start +
+ async_extent->ram_size - 1);
++
++ /*
++ * we need to redirty the pages if we decide to
++ * fallback to uncompressed IO, otherwise we
++ * will not submit these pages down to lower
++ * layers.
++ */
++ extent_range_redirty_for_io(inode,
++ async_extent->start,
++ async_extent->start +
++ async_extent->ram_size - 1);
++
+ goto retry;
+ }
+ goto out_free;
+@@ -1084,8 +1096,10 @@ static int cow_file_range_async(struct inode *inode, struct page *locked_page,
+ async_cow->end = cur_end;
+ INIT_LIST_HEAD(&async_cow->extents);
+
+- btrfs_init_work(&async_cow->work, async_cow_start,
+- async_cow_submit, async_cow_free);
++ btrfs_init_work(&async_cow->work,
++ btrfs_delalloc_helper,
++ async_cow_start, async_cow_submit,
++ async_cow_free);
+
+ nr_pages = (cur_end - start + PAGE_CACHE_SIZE) >>
+ PAGE_CACHE_SHIFT;
+@@ -1869,7 +1883,8 @@ static int btrfs_writepage_start_hook(struct page *page, u64 start, u64 end)
+
+ SetPageChecked(page);
+ page_cache_get(page);
+- btrfs_init_work(&fixup->work, btrfs_writepage_fixup_worker, NULL, NULL);
++ btrfs_init_work(&fixup->work, btrfs_fixup_helper,
++ btrfs_writepage_fixup_worker, NULL, NULL);
+ fixup->page = page;
+ btrfs_queue_work(root->fs_info->fixup_workers, &fixup->work);
+ return -EBUSY;
+@@ -2810,7 +2825,8 @@ static int btrfs_writepage_end_io_hook(struct page *page, u64 start, u64 end,
+ struct inode *inode = page->mapping->host;
+ struct btrfs_root *root = BTRFS_I(inode)->root;
+ struct btrfs_ordered_extent *ordered_extent = NULL;
+- struct btrfs_workqueue *workers;
++ struct btrfs_workqueue *wq;
++ btrfs_work_func_t func;
+
+ trace_btrfs_writepage_end_io_hook(page, start, end, uptodate);
+
+@@ -2819,13 +2835,17 @@ static int btrfs_writepage_end_io_hook(struct page *page, u64 start, u64 end,
+ end - start + 1, uptodate))
+ return 0;
+
+- btrfs_init_work(&ordered_extent->work, finish_ordered_fn, NULL, NULL);
++ if (btrfs_is_free_space_inode(inode)) {
++ wq = root->fs_info->endio_freespace_worker;
++ func = btrfs_freespace_write_helper;
++ } else {
++ wq = root->fs_info->endio_write_workers;
++ func = btrfs_endio_write_helper;
++ }
+
+- if (btrfs_is_free_space_inode(inode))
+- workers = root->fs_info->endio_freespace_worker;
+- else
+- workers = root->fs_info->endio_write_workers;
+- btrfs_queue_work(workers, &ordered_extent->work);
++ btrfs_init_work(&ordered_extent->work, func, finish_ordered_fn, NULL,
++ NULL);
++ btrfs_queue_work(wq, &ordered_extent->work);
+
+ return 0;
+ }
+@@ -7146,7 +7166,8 @@ again:
+ if (!ret)
+ goto out_test;
+
+- btrfs_init_work(&ordered->work, finish_ordered_fn, NULL, NULL);
++ btrfs_init_work(&ordered->work, btrfs_endio_write_helper,
++ finish_ordered_fn, NULL, NULL);
+ btrfs_queue_work(root->fs_info->endio_write_workers,
+ &ordered->work);
+ out_test:
+@@ -7939,27 +7960,6 @@ static int btrfs_truncate(struct inode *inode)
+ BUG_ON(ret);
+
+ /*
+- * setattr is responsible for setting the ordered_data_close flag,
+- * but that is only tested during the last file release. That
+- * could happen well after the next commit, leaving a great big
+- * window where new writes may get lost if someone chooses to write
+- * to this file after truncating to zero
+- *
+- * The inode doesn't have any dirty data here, and so if we commit
+- * this is a noop. If someone immediately starts writing to the inode
+- * it is very likely we'll catch some of their writes in this
+- * transaction, and the commit will find this file on the ordered
+- * data list with good things to send down.
+- *
+- * This is a best effort solution, there is still a window where
+- * using truncate to replace the contents of the file will
+- * end up with a zero length file after a crash.
+- */
+- if (inode->i_size == 0 && test_bit(BTRFS_INODE_ORDERED_DATA_CLOSE,
+- &BTRFS_I(inode)->runtime_flags))
+- btrfs_add_ordered_operation(trans, root, inode);
+-
+- /*
+ * So if we truncate and then write and fsync we normally would just
+ * write the extents that changed, which is a problem if we need to
+ * first truncate that entire inode. So set this flag so we write out
+@@ -8106,7 +8106,6 @@ struct inode *btrfs_alloc_inode(struct super_block *sb)
+ mutex_init(&ei->delalloc_mutex);
+ btrfs_ordered_inode_tree_init(&ei->ordered_tree);
+ INIT_LIST_HEAD(&ei->delalloc_inodes);
+- INIT_LIST_HEAD(&ei->ordered_operations);
+ RB_CLEAR_NODE(&ei->rb_node);
+
+ return inode;
+@@ -8146,17 +8145,6 @@ void btrfs_destroy_inode(struct inode *inode)
+ if (!root)
+ goto free;
+
+- /*
+- * Make sure we're properly removed from the ordered operation
+- * lists.
+- */
+- smp_mb();
+- if (!list_empty(&BTRFS_I(inode)->ordered_operations)) {
+- spin_lock(&root->fs_info->ordered_root_lock);
+- list_del_init(&BTRFS_I(inode)->ordered_operations);
+- spin_unlock(&root->fs_info->ordered_root_lock);
+- }
+-
+ if (test_bit(BTRFS_INODE_HAS_ORPHAN_ITEM,
+ &BTRFS_I(inode)->runtime_flags)) {
+ btrfs_info(root->fs_info, "inode %llu still on the orphan list",
+@@ -8338,12 +8326,10 @@ static int btrfs_rename(struct inode *old_dir, struct dentry *old_dentry,
+ ret = 0;
+
+ /*
+- * we're using rename to replace one file with another.
+- * and the replacement file is large. Start IO on it now so
+- * we don't add too much work to the end of the transaction
++ * we're using rename to replace one file with another. Start IO on it
++ * now so we don't add too much work to the end of the transaction
+ */
+- if (new_inode && S_ISREG(old_inode->i_mode) && new_inode->i_size &&
+- old_inode->i_size > BTRFS_ORDERED_OPERATIONS_FLUSH_LIMIT)
++ if (new_inode && S_ISREG(old_inode->i_mode) && new_inode->i_size)
+ filemap_flush(old_inode->i_mapping);
+
+ /* close the racy window with snapshot create/destroy ioctl */
+@@ -8391,12 +8377,6 @@ static int btrfs_rename(struct inode *old_dir, struct dentry *old_dentry,
+ */
+ btrfs_pin_log_trans(root);
+ }
+- /*
+- * make sure the inode gets flushed if it is replacing
+- * something.
+- */
+- if (new_inode && new_inode->i_size && S_ISREG(old_inode->i_mode))
+- btrfs_add_ordered_operation(trans, root, old_inode);
+
+ inode_inc_iversion(old_dir);
+ inode_inc_iversion(new_dir);
+@@ -8514,7 +8494,9 @@ struct btrfs_delalloc_work *btrfs_alloc_delalloc_work(struct inode *inode,
+ work->inode = inode;
+ work->wait = wait;
+ work->delay_iput = delay_iput;
+- btrfs_init_work(&work->work, btrfs_run_delalloc_work, NULL, NULL);
++ WARN_ON_ONCE(!inode);
++ btrfs_init_work(&work->work, btrfs_flush_delalloc_helper,
++ btrfs_run_delalloc_work, NULL, NULL);
+
+ return work;
+ }
+diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
+index 7187b14faa6c..ac734ec4cc20 100644
+--- a/fs/btrfs/ordered-data.c
++++ b/fs/btrfs/ordered-data.c
+@@ -571,18 +571,6 @@ void btrfs_remove_ordered_extent(struct inode *inode,
+
+ trace_btrfs_ordered_extent_remove(inode, entry);
+
+- /*
+- * we have no more ordered extents for this inode and
+- * no dirty pages. We can safely remove it from the
+- * list of ordered extents
+- */
+- if (RB_EMPTY_ROOT(&tree->tree) &&
+- !mapping_tagged(inode->i_mapping, PAGECACHE_TAG_DIRTY)) {
+- spin_lock(&root->fs_info->ordered_root_lock);
+- list_del_init(&BTRFS_I(inode)->ordered_operations);
+- spin_unlock(&root->fs_info->ordered_root_lock);
+- }
+-
+ if (!root->nr_ordered_extents) {
+ spin_lock(&root->fs_info->ordered_root_lock);
+ BUG_ON(list_empty(&root->ordered_root));
+@@ -627,6 +615,7 @@ int btrfs_wait_ordered_extents(struct btrfs_root *root, int nr)
+ spin_unlock(&root->ordered_extent_lock);
+
+ btrfs_init_work(&ordered->flush_work,
++ btrfs_flush_delalloc_helper,
+ btrfs_run_ordered_extent_work, NULL, NULL);
+ list_add_tail(&ordered->work_list, &works);
+ btrfs_queue_work(root->fs_info->flush_workers,
+@@ -687,81 +676,6 @@ void btrfs_wait_ordered_roots(struct btrfs_fs_info *fs_info, int nr)
+ }
+
+ /*
+- * this is used during transaction commit to write all the inodes
+- * added to the ordered operation list. These files must be fully on
+- * disk before the transaction commits.
+- *
+- * we have two modes here, one is to just start the IO via filemap_flush
+- * and the other is to wait for all the io. When we wait, we have an
+- * extra check to make sure the ordered operation list really is empty
+- * before we return
+- */
+-int btrfs_run_ordered_operations(struct btrfs_trans_handle *trans,
+- struct btrfs_root *root, int wait)
+-{
+- struct btrfs_inode *btrfs_inode;
+- struct inode *inode;
+- struct btrfs_transaction *cur_trans = trans->transaction;
+- struct list_head splice;
+- struct list_head works;
+- struct btrfs_delalloc_work *work, *next;
+- int ret = 0;
+-
+- INIT_LIST_HEAD(&splice);
+- INIT_LIST_HEAD(&works);
+-
+- mutex_lock(&root->fs_info->ordered_extent_flush_mutex);
+- spin_lock(&root->fs_info->ordered_root_lock);
+- list_splice_init(&cur_trans->ordered_operations, &splice);
+- while (!list_empty(&splice)) {
+- btrfs_inode = list_entry(splice.next, struct btrfs_inode,
+- ordered_operations);
+- inode = &btrfs_inode->vfs_inode;
+-
+- list_del_init(&btrfs_inode->ordered_operations);
+-
+- /*
+- * the inode may be getting freed (in sys_unlink path).
+- */
+- inode = igrab(inode);
+- if (!inode)
+- continue;
+-
+- if (!wait)
+- list_add_tail(&BTRFS_I(inode)->ordered_operations,
+- &cur_trans->ordered_operations);
+- spin_unlock(&root->fs_info->ordered_root_lock);
+-
+- work = btrfs_alloc_delalloc_work(inode, wait, 1);
+- if (!work) {
+- spin_lock(&root->fs_info->ordered_root_lock);
+- if (list_empty(&BTRFS_I(inode)->ordered_operations))
+- list_add_tail(&btrfs_inode->ordered_operations,
+- &splice);
+- list_splice_tail(&splice,
+- &cur_trans->ordered_operations);
+- spin_unlock(&root->fs_info->ordered_root_lock);
+- ret = -ENOMEM;
+- goto out;
+- }
+- list_add_tail(&work->list, &works);
+- btrfs_queue_work(root->fs_info->flush_workers,
+- &work->work);
+-
+- cond_resched();
+- spin_lock(&root->fs_info->ordered_root_lock);
+- }
+- spin_unlock(&root->fs_info->ordered_root_lock);
+-out:
+- list_for_each_entry_safe(work, next, &works, list) {
+- list_del_init(&work->list);
+- btrfs_wait_and_free_delalloc_work(work);
+- }
+- mutex_unlock(&root->fs_info->ordered_extent_flush_mutex);
+- return ret;
+-}
+-
+-/*
+ * Used to start IO or wait for a given ordered extent to finish.
+ *
+ * If wait is one, this effectively waits on page writeback for all the pages
+@@ -1120,42 +1034,6 @@ out:
+ return index;
+ }
+
+-
+-/*
+- * add a given inode to the list of inodes that must be fully on
+- * disk before a transaction commit finishes.
+- *
+- * This basically gives us the ext3 style data=ordered mode, and it is mostly
+- * used to make sure renamed files are fully on disk.
+- *
+- * It is a noop if the inode is already fully on disk.
+- *
+- * If trans is not null, we'll do a friendly check for a transaction that
+- * is already flushing things and force the IO down ourselves.
+- */
+-void btrfs_add_ordered_operation(struct btrfs_trans_handle *trans,
+- struct btrfs_root *root, struct inode *inode)
+-{
+- struct btrfs_transaction *cur_trans = trans->transaction;
+- u64 last_mod;
+-
+- last_mod = max(BTRFS_I(inode)->generation, BTRFS_I(inode)->last_trans);
+-
+- /*
+- * if this file hasn't been changed since the last transaction
+- * commit, we can safely return without doing anything
+- */
+- if (last_mod <= root->fs_info->last_trans_committed)
+- return;
+-
+- spin_lock(&root->fs_info->ordered_root_lock);
+- if (list_empty(&BTRFS_I(inode)->ordered_operations)) {
+- list_add_tail(&BTRFS_I(inode)->ordered_operations,
+- &cur_trans->ordered_operations);
+- }
+- spin_unlock(&root->fs_info->ordered_root_lock);
+-}
+-
+ int __init ordered_data_init(void)
+ {
+ btrfs_ordered_extent_cache = kmem_cache_create("btrfs_ordered_extent",
+diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
+index 246897058efb..d81a274d621e 100644
+--- a/fs/btrfs/ordered-data.h
++++ b/fs/btrfs/ordered-data.h
+@@ -190,11 +190,6 @@ int btrfs_ordered_update_i_size(struct inode *inode, u64 offset,
+ struct btrfs_ordered_extent *ordered);
+ int btrfs_find_ordered_sum(struct inode *inode, u64 offset, u64 disk_bytenr,
+ u32 *sum, int len);
+-int btrfs_run_ordered_operations(struct btrfs_trans_handle *trans,
+- struct btrfs_root *root, int wait);
+-void btrfs_add_ordered_operation(struct btrfs_trans_handle *trans,
+- struct btrfs_root *root,
+- struct inode *inode);
+ int btrfs_wait_ordered_extents(struct btrfs_root *root, int nr);
+ void btrfs_wait_ordered_roots(struct btrfs_fs_info *fs_info, int nr);
+ void btrfs_get_logged_extents(struct inode *inode,
+diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
+index 98cb6b2630f9..3eec914710b2 100644
+--- a/fs/btrfs/qgroup.c
++++ b/fs/btrfs/qgroup.c
+@@ -2551,6 +2551,7 @@ qgroup_rescan_init(struct btrfs_fs_info *fs_info, u64 progress_objectid,
+ memset(&fs_info->qgroup_rescan_work, 0,
+ sizeof(fs_info->qgroup_rescan_work));
+ btrfs_init_work(&fs_info->qgroup_rescan_work,
++ btrfs_qgroup_rescan_helper,
+ btrfs_qgroup_rescan_worker, NULL, NULL);
+
+ if (ret) {
+diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
+index 4a88f073fdd7..0a6b6e4bcbb9 100644
+--- a/fs/btrfs/raid56.c
++++ b/fs/btrfs/raid56.c
+@@ -1416,7 +1416,8 @@ cleanup:
+
+ static void async_rmw_stripe(struct btrfs_raid_bio *rbio)
+ {
+- btrfs_init_work(&rbio->work, rmw_work, NULL, NULL);
++ btrfs_init_work(&rbio->work, btrfs_rmw_helper,
++ rmw_work, NULL, NULL);
+
+ btrfs_queue_work(rbio->fs_info->rmw_workers,
+ &rbio->work);
+@@ -1424,7 +1425,8 @@ static void async_rmw_stripe(struct btrfs_raid_bio *rbio)
+
+ static void async_read_rebuild(struct btrfs_raid_bio *rbio)
+ {
+- btrfs_init_work(&rbio->work, read_rebuild_work, NULL, NULL);
++ btrfs_init_work(&rbio->work, btrfs_rmw_helper,
++ read_rebuild_work, NULL, NULL);
+
+ btrfs_queue_work(rbio->fs_info->rmw_workers,
+ &rbio->work);
+@@ -1665,7 +1667,8 @@ static void btrfs_raid_unplug(struct blk_plug_cb *cb, bool from_schedule)
+ plug = container_of(cb, struct btrfs_plug_cb, cb);
+
+ if (from_schedule) {
+- btrfs_init_work(&plug->work, unplug_work, NULL, NULL);
++ btrfs_init_work(&plug->work, btrfs_rmw_helper,
++ unplug_work, NULL, NULL);
+ btrfs_queue_work(plug->info->rmw_workers,
+ &plug->work);
+ return;
+diff --git a/fs/btrfs/reada.c b/fs/btrfs/reada.c
+index 09230cf3a244..20408c6b665a 100644
+--- a/fs/btrfs/reada.c
++++ b/fs/btrfs/reada.c
+@@ -798,7 +798,8 @@ static void reada_start_machine(struct btrfs_fs_info *fs_info)
+ /* FIXME we cannot handle this properly right now */
+ BUG();
+ }
+- btrfs_init_work(&rmw->work, reada_start_machine_worker, NULL, NULL);
++ btrfs_init_work(&rmw->work, btrfs_readahead_helper,
++ reada_start_machine_worker, NULL, NULL);
+ rmw->fs_info = fs_info;
+
+ btrfs_queue_work(fs_info->readahead_workers, &rmw->work);
+diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
+index b6d198f5181e..8dddedcfa961 100644
+--- a/fs/btrfs/scrub.c
++++ b/fs/btrfs/scrub.c
+@@ -428,8 +428,8 @@ struct scrub_ctx *scrub_setup_ctx(struct btrfs_device *dev, int is_dev_replace)
+ sbio->index = i;
+ sbio->sctx = sctx;
+ sbio->page_count = 0;
+- btrfs_init_work(&sbio->work, scrub_bio_end_io_worker,
+- NULL, NULL);
++ btrfs_init_work(&sbio->work, btrfs_scrub_helper,
++ scrub_bio_end_io_worker, NULL, NULL);
+
+ if (i != SCRUB_BIOS_PER_SCTX - 1)
+ sctx->bios[i]->next_free = i + 1;
+@@ -999,8 +999,8 @@ nodatasum_case:
+ fixup_nodatasum->root = fs_info->extent_root;
+ fixup_nodatasum->mirror_num = failed_mirror_index + 1;
+ scrub_pending_trans_workers_inc(sctx);
+- btrfs_init_work(&fixup_nodatasum->work, scrub_fixup_nodatasum,
+- NULL, NULL);
++ btrfs_init_work(&fixup_nodatasum->work, btrfs_scrub_helper,
++ scrub_fixup_nodatasum, NULL, NULL);
+ btrfs_queue_work(fs_info->scrub_workers,
+ &fixup_nodatasum->work);
+ goto out;
+@@ -1616,7 +1616,8 @@ static void scrub_wr_bio_end_io(struct bio *bio, int err)
+ sbio->err = err;
+ sbio->bio = bio;
+
+- btrfs_init_work(&sbio->work, scrub_wr_bio_end_io_worker, NULL, NULL);
++ btrfs_init_work(&sbio->work, btrfs_scrubwrc_helper,
++ scrub_wr_bio_end_io_worker, NULL, NULL);
+ btrfs_queue_work(fs_info->scrub_wr_completion_workers, &sbio->work);
+ }
+
+@@ -3203,7 +3204,8 @@ static int copy_nocow_pages(struct scrub_ctx *sctx, u64 logical, u64 len,
+ nocow_ctx->len = len;
+ nocow_ctx->mirror_num = mirror_num;
+ nocow_ctx->physical_for_dev_replace = physical_for_dev_replace;
+- btrfs_init_work(&nocow_ctx->work, copy_nocow_pages_worker, NULL, NULL);
++ btrfs_init_work(&nocow_ctx->work, btrfs_scrubnc_helper,
++ copy_nocow_pages_worker, NULL, NULL);
+ INIT_LIST_HEAD(&nocow_ctx->inodes);
+ btrfs_queue_work(fs_info->scrub_nocow_workers,
+ &nocow_ctx->work);
+diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
+index 5f379affdf23..d89c6d3542ca 100644
+--- a/fs/btrfs/transaction.c
++++ b/fs/btrfs/transaction.c
+@@ -218,7 +218,6 @@ loop:
+ spin_lock_init(&cur_trans->delayed_refs.lock);
+
+ INIT_LIST_HEAD(&cur_trans->pending_snapshots);
+- INIT_LIST_HEAD(&cur_trans->ordered_operations);
+ INIT_LIST_HEAD(&cur_trans->pending_chunks);
+ INIT_LIST_HEAD(&cur_trans->switch_commits);
+ list_add_tail(&cur_trans->list, &fs_info->trans_list);
+@@ -1612,27 +1611,6 @@ static void cleanup_transaction(struct btrfs_trans_handle *trans,
+ kmem_cache_free(btrfs_trans_handle_cachep, trans);
+ }
+
+-static int btrfs_flush_all_pending_stuffs(struct btrfs_trans_handle *trans,
+- struct btrfs_root *root)
+-{
+- int ret;
+-
+- ret = btrfs_run_delayed_items(trans, root);
+- if (ret)
+- return ret;
+-
+- /*
+- * rename don't use btrfs_join_transaction, so, once we
+- * set the transaction to blocked above, we aren't going
+- * to get any new ordered operations. We can safely run
+- * it here and no for sure that nothing new will be added
+- * to the list
+- */
+- ret = btrfs_run_ordered_operations(trans, root, 1);
+-
+- return ret;
+-}
+-
+ static inline int btrfs_start_delalloc_flush(struct btrfs_fs_info *fs_info)
+ {
+ if (btrfs_test_opt(fs_info->tree_root, FLUSHONCOMMIT))
+@@ -1653,13 +1631,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans,
+ struct btrfs_transaction *prev_trans = NULL;
+ int ret;
+
+- ret = btrfs_run_ordered_operations(trans, root, 0);
+- if (ret) {
+- btrfs_abort_transaction(trans, root, ret);
+- btrfs_end_transaction(trans, root);
+- return ret;
+- }
+-
+ /* Stop the commit early if ->aborted is set */
+ if (unlikely(ACCESS_ONCE(cur_trans->aborted))) {
+ ret = cur_trans->aborted;
+@@ -1740,7 +1711,7 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans,
+ if (ret)
+ goto cleanup_transaction;
+
+- ret = btrfs_flush_all_pending_stuffs(trans, root);
++ ret = btrfs_run_delayed_items(trans, root);
+ if (ret)
+ goto cleanup_transaction;
+
+@@ -1748,7 +1719,7 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans,
+ extwriter_counter_read(cur_trans) == 0);
+
+ /* some pending stuffs might be added after the previous flush. */
+- ret = btrfs_flush_all_pending_stuffs(trans, root);
++ ret = btrfs_run_delayed_items(trans, root);
+ if (ret)
+ goto cleanup_transaction;
+
+diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
+index 7dd558ed0716..579be51b27e5 100644
+--- a/fs/btrfs/transaction.h
++++ b/fs/btrfs/transaction.h
+@@ -55,7 +55,6 @@ struct btrfs_transaction {
+ wait_queue_head_t writer_wait;
+ wait_queue_head_t commit_wait;
+ struct list_head pending_snapshots;
+- struct list_head ordered_operations;
+ struct list_head pending_chunks;
+ struct list_head switch_commits;
+ struct btrfs_delayed_ref_root delayed_refs;
+diff --git a/fs/btrfs/ulist.h b/fs/btrfs/ulist.h
+index 7f78cbf5cf41..4c29db604bbe 100644
+--- a/fs/btrfs/ulist.h
++++ b/fs/btrfs/ulist.h
+@@ -57,6 +57,21 @@ void ulist_free(struct ulist *ulist);
+ int ulist_add(struct ulist *ulist, u64 val, u64 aux, gfp_t gfp_mask);
+ int ulist_add_merge(struct ulist *ulist, u64 val, u64 aux,
+ u64 *old_aux, gfp_t gfp_mask);
++
++/* just like ulist_add_merge() but take a pointer for the aux data */
++static inline int ulist_add_merge_ptr(struct ulist *ulist, u64 val, void *aux,
++ void **old_aux, gfp_t gfp_mask)
++{
++#if BITS_PER_LONG == 32
++ u64 old64 = (uintptr_t)*old_aux;
++ int ret = ulist_add_merge(ulist, val, (uintptr_t)aux, &old64, gfp_mask);
++ *old_aux = (void *)((uintptr_t)old64);
++ return ret;
++#else
++ return ulist_add_merge(ulist, val, (u64)aux, (u64 *)old_aux, gfp_mask);
++#endif
++}
++
+ struct ulist_node *ulist_next(struct ulist *ulist,
+ struct ulist_iterator *uiter);
+
+diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
+index 6cb82f62cb7c..81bec9fd8f19 100644
+--- a/fs/btrfs/volumes.c
++++ b/fs/btrfs/volumes.c
+@@ -5800,7 +5800,8 @@ struct btrfs_device *btrfs_alloc_device(struct btrfs_fs_info *fs_info,
+ else
+ generate_random_uuid(dev->uuid);
+
+- btrfs_init_work(&dev->work, pending_bios_fn, NULL, NULL);
++ btrfs_init_work(&dev->work, btrfs_submit_helper,
++ pending_bios_fn, NULL, NULL);
+
+ return dev;
+ }
+diff --git a/fs/debugfs/inode.c b/fs/debugfs/inode.c
+index 8c41b52da358..16a46b6a6fee 100644
+--- a/fs/debugfs/inode.c
++++ b/fs/debugfs/inode.c
+@@ -534,7 +534,7 @@ EXPORT_SYMBOL_GPL(debugfs_remove);
+ */
+ void debugfs_remove_recursive(struct dentry *dentry)
+ {
+- struct dentry *child, *next, *parent;
++ struct dentry *child, *parent;
+
+ if (IS_ERR_OR_NULL(dentry))
+ return;
+@@ -546,30 +546,49 @@ void debugfs_remove_recursive(struct dentry *dentry)
+ parent = dentry;
+ down:
+ mutex_lock(&parent->d_inode->i_mutex);
+- list_for_each_entry_safe(child, next, &parent->d_subdirs, d_u.d_child) {
++ loop:
++ /*
++ * The parent->d_subdirs is protected by the d_lock. Outside that
++ * lock, the child can be unlinked and set to be freed which can
++ * use the d_u.d_child as the rcu head and corrupt this list.
++ */
++ spin_lock(&parent->d_lock);
++ list_for_each_entry(child, &parent->d_subdirs, d_u.d_child) {
+ if (!debugfs_positive(child))
+ continue;
+
+ /* perhaps simple_empty(child) makes more sense */
+ if (!list_empty(&child->d_subdirs)) {
++ spin_unlock(&parent->d_lock);
+ mutex_unlock(&parent->d_inode->i_mutex);
+ parent = child;
+ goto down;
+ }
+- up:
++
++ spin_unlock(&parent->d_lock);
++
+ if (!__debugfs_remove(child, parent))
+ simple_release_fs(&debugfs_mount, &debugfs_mount_count);
++
++ /*
++ * The parent->d_lock protects agaist child from unlinking
++ * from d_subdirs. When releasing the parent->d_lock we can
++ * no longer trust that the next pointer is valid.
++ * Restart the loop. We'll skip this one with the
++ * debugfs_positive() check.
++ */
++ goto loop;
+ }
++ spin_unlock(&parent->d_lock);
+
+ mutex_unlock(&parent->d_inode->i_mutex);
+ child = parent;
+ parent = parent->d_parent;
+ mutex_lock(&parent->d_inode->i_mutex);
+
+- if (child != dentry) {
+- next = list_next_entry(child, d_u.d_child);
+- goto up;
+- }
++ if (child != dentry)
++ /* go up */
++ goto loop;
+
+ if (!__debugfs_remove(child, parent))
+ simple_release_fs(&debugfs_mount, &debugfs_mount_count);
+diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
+index 7cc5a0e23688..1bbe7c315138 100644
+--- a/fs/ext4/ext4.h
++++ b/fs/ext4/ext4.h
+@@ -2144,8 +2144,8 @@ extern ssize_t ext4_ind_direct_IO(int rw, struct kiocb *iocb,
+ extern int ext4_ind_calc_metadata_amount(struct inode *inode, sector_t lblock);
+ extern int ext4_ind_trans_blocks(struct inode *inode, int nrblocks);
+ extern void ext4_ind_truncate(handle_t *, struct inode *inode);
+-extern int ext4_free_hole_blocks(handle_t *handle, struct inode *inode,
+- ext4_lblk_t first, ext4_lblk_t stop);
++extern int ext4_ind_remove_space(handle_t *handle, struct inode *inode,
++ ext4_lblk_t start, ext4_lblk_t end);
+
+ /* ioctl.c */
+ extern long ext4_ioctl(struct file *, unsigned int, unsigned long);
+@@ -2453,6 +2453,22 @@ static inline void ext4_update_i_disksize(struct inode *inode, loff_t newsize)
+ up_write(&EXT4_I(inode)->i_data_sem);
+ }
+
++/* Update i_size, i_disksize. Requires i_mutex to avoid races with truncate */
++static inline int ext4_update_inode_size(struct inode *inode, loff_t newsize)
++{
++ int changed = 0;
++
++ if (newsize > inode->i_size) {
++ i_size_write(inode, newsize);
++ changed = 1;
++ }
++ if (newsize > EXT4_I(inode)->i_disksize) {
++ ext4_update_i_disksize(inode, newsize);
++ changed |= 2;
++ }
++ return changed;
++}
++
+ struct ext4_group_info {
+ unsigned long bb_state;
+ struct rb_root bb_free_root;
+diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
+index 4da228a0e6d0..7dfd6300e1c2 100644
+--- a/fs/ext4/extents.c
++++ b/fs/ext4/extents.c
+@@ -4664,7 +4664,8 @@ retry:
+ }
+
+ static int ext4_alloc_file_blocks(struct file *file, ext4_lblk_t offset,
+- ext4_lblk_t len, int flags, int mode)
++ ext4_lblk_t len, loff_t new_size,
++ int flags, int mode)
+ {
+ struct inode *inode = file_inode(file);
+ handle_t *handle;
+@@ -4673,8 +4674,10 @@ static int ext4_alloc_file_blocks(struct file *file, ext4_lblk_t offset,
+ int retries = 0;
+ struct ext4_map_blocks map;
+ unsigned int credits;
++ loff_t epos;
+
+ map.m_lblk = offset;
++ map.m_len = len;
+ /*
+ * Don't normalize the request if it can fit in one extent so
+ * that it doesn't get unnecessarily split into multiple
+@@ -4689,9 +4692,7 @@ static int ext4_alloc_file_blocks(struct file *file, ext4_lblk_t offset,
+ credits = ext4_chunk_trans_blocks(inode, len);
+
+ retry:
+- while (ret >= 0 && ret < len) {
+- map.m_lblk = map.m_lblk + ret;
+- map.m_len = len = len - ret;
++ while (ret >= 0 && len) {
+ handle = ext4_journal_start(inode, EXT4_HT_MAP_BLOCKS,
+ credits);
+ if (IS_ERR(handle)) {
+@@ -4708,6 +4709,21 @@ retry:
+ ret2 = ext4_journal_stop(handle);
+ break;
+ }
++ map.m_lblk += ret;
++ map.m_len = len = len - ret;
++ epos = (loff_t)map.m_lblk << inode->i_blkbits;
++ inode->i_ctime = ext4_current_time(inode);
++ if (new_size) {
++ if (epos > new_size)
++ epos = new_size;
++ if (ext4_update_inode_size(inode, epos) & 0x1)
++ inode->i_mtime = inode->i_ctime;
++ } else {
++ if (epos > inode->i_size)
++ ext4_set_inode_flag(inode,
++ EXT4_INODE_EOFBLOCKS);
++ }
++ ext4_mark_inode_dirty(handle, inode);
+ ret2 = ext4_journal_stop(handle);
+ if (ret2)
+ break;
+@@ -4730,7 +4746,8 @@ static long ext4_zero_range(struct file *file, loff_t offset,
+ loff_t new_size = 0;
+ int ret = 0;
+ int flags;
+- int partial;
++ int credits;
++ int partial_begin, partial_end;
+ loff_t start, end;
+ ext4_lblk_t lblk;
+ struct address_space *mapping = inode->i_mapping;
+@@ -4770,7 +4787,8 @@ static long ext4_zero_range(struct file *file, loff_t offset,
+
+ if (start < offset || end > offset + len)
+ return -EINVAL;
+- partial = (offset + len) & ((1 << blkbits) - 1);
++ partial_begin = offset & ((1 << blkbits) - 1);
++ partial_end = (offset + len) & ((1 << blkbits) - 1);
+
+ lblk = start >> blkbits;
+ max_blocks = (end >> blkbits);
+@@ -4804,7 +4822,7 @@ static long ext4_zero_range(struct file *file, loff_t offset,
+ * If we have a partial block after EOF we have to allocate
+ * the entire block.
+ */
+- if (partial)
++ if (partial_end)
+ max_blocks += 1;
+ }
+
+@@ -4812,6 +4830,7 @@ static long ext4_zero_range(struct file *file, loff_t offset,
+
+ /* Now release the pages and zero block aligned part of pages*/
+ truncate_pagecache_range(inode, start, end - 1);
++ inode->i_mtime = inode->i_ctime = ext4_current_time(inode);
+
+ /* Wait all existing dio workers, newcomers will block on i_mutex */
+ ext4_inode_block_unlocked_dio(inode);
+@@ -4824,13 +4843,22 @@ static long ext4_zero_range(struct file *file, loff_t offset,
+ if (ret)
+ goto out_dio;
+
+- ret = ext4_alloc_file_blocks(file, lblk, max_blocks, flags,
+- mode);
++ ret = ext4_alloc_file_blocks(file, lblk, max_blocks, new_size,
++ flags, mode);
+ if (ret)
+ goto out_dio;
+ }
++ if (!partial_begin && !partial_end)
++ goto out_dio;
+
+- handle = ext4_journal_start(inode, EXT4_HT_MISC, 4);
++ /*
++ * In worst case we have to writeout two nonadjacent unwritten
++ * blocks and update the inode
++ */
++ credits = (2 * ext4_ext_index_trans_blocks(inode, 2)) + 1;
++ if (ext4_should_journal_data(inode))
++ credits += 2;
++ handle = ext4_journal_start(inode, EXT4_HT_MISC, credits);
+ if (IS_ERR(handle)) {
+ ret = PTR_ERR(handle);
+ ext4_std_error(inode->i_sb, ret);
+@@ -4838,12 +4866,8 @@ static long ext4_zero_range(struct file *file, loff_t offset,
+ }
+
+ inode->i_mtime = inode->i_ctime = ext4_current_time(inode);
+-
+ if (new_size) {
+- if (new_size > i_size_read(inode))
+- i_size_write(inode, new_size);
+- if (new_size > EXT4_I(inode)->i_disksize)
+- ext4_update_i_disksize(inode, new_size);
++ ext4_update_inode_size(inode, new_size);
+ } else {
+ /*
+ * Mark that we allocate beyond EOF so the subsequent truncate
+@@ -4852,7 +4876,6 @@ static long ext4_zero_range(struct file *file, loff_t offset,
+ if ((offset + len) > i_size_read(inode))
+ ext4_set_inode_flag(inode, EXT4_INODE_EOFBLOCKS);
+ }
+-
+ ext4_mark_inode_dirty(handle, inode);
+
+ /* Zero out partial block at the edges of the range */
+@@ -4879,13 +4902,11 @@ out_mutex:
+ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
+ {
+ struct inode *inode = file_inode(file);
+- handle_t *handle;
+ loff_t new_size = 0;
+ unsigned int max_blocks;
+ int ret = 0;
+ int flags;
+ ext4_lblk_t lblk;
+- struct timespec tv;
+ unsigned int blkbits = inode->i_blkbits;
+
+ /* Return error if mode is not supported */
+@@ -4936,36 +4957,15 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
+ goto out;
+ }
+
+- ret = ext4_alloc_file_blocks(file, lblk, max_blocks, flags, mode);
++ ret = ext4_alloc_file_blocks(file, lblk, max_blocks, new_size,
++ flags, mode);
+ if (ret)
+ goto out;
+
+- handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
+- if (IS_ERR(handle))
+- goto out;
+-
+- tv = inode->i_ctime = ext4_current_time(inode);
+-
+- if (new_size) {
+- if (new_size > i_size_read(inode)) {
+- i_size_write(inode, new_size);
+- inode->i_mtime = tv;
+- }
+- if (new_size > EXT4_I(inode)->i_disksize)
+- ext4_update_i_disksize(inode, new_size);
+- } else {
+- /*
+- * Mark that we allocate beyond EOF so the subsequent truncate
+- * can proceed even if the new size is the same as i_size.
+- */
+- if ((offset + len) > i_size_read(inode))
+- ext4_set_inode_flag(inode, EXT4_INODE_EOFBLOCKS);
++ if (file->f_flags & O_SYNC && EXT4_SB(inode->i_sb)->s_journal) {
++ ret = jbd2_complete_transaction(EXT4_SB(inode->i_sb)->s_journal,
++ EXT4_I(inode)->i_sync_tid);
+ }
+- ext4_mark_inode_dirty(handle, inode);
+- if (file->f_flags & O_SYNC)
+- ext4_handle_sync(handle);
+-
+- ext4_journal_stop(handle);
+ out:
+ mutex_unlock(&inode->i_mutex);
+ trace_ext4_fallocate_exit(inode, offset, max_blocks, ret);
+diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
+index fd69da194826..e75f840000a0 100644
+--- a/fs/ext4/indirect.c
++++ b/fs/ext4/indirect.c
+@@ -1295,97 +1295,220 @@ do_indirects:
+ }
+ }
+
+-static int free_hole_blocks(handle_t *handle, struct inode *inode,
+- struct buffer_head *parent_bh, __le32 *i_data,
+- int level, ext4_lblk_t first,
+- ext4_lblk_t count, int max)
++/**
++ * ext4_ind_remove_space - remove space from the range
++ * @handle: JBD handle for this transaction
++ * @inode: inode we are dealing with
++ * @start: First block to remove
++ * @end: One block after the last block to remove (exclusive)
++ *
++ * Free the blocks in the defined range (end is exclusive endpoint of
++ * range). This is used by ext4_punch_hole().
++ */
++int ext4_ind_remove_space(handle_t *handle, struct inode *inode,
++ ext4_lblk_t start, ext4_lblk_t end)
+ {
+- struct buffer_head *bh = NULL;
++ struct ext4_inode_info *ei = EXT4_I(inode);
++ __le32 *i_data = ei->i_data;
+ int addr_per_block = EXT4_ADDR_PER_BLOCK(inode->i_sb);
+- int ret = 0;
+- int i, inc;
+- ext4_lblk_t offset;
+- __le32 blk;
+-
+- inc = 1 << ((EXT4_BLOCK_SIZE_BITS(inode->i_sb) - 2) * level);
+- for (i = 0, offset = 0; i < max; i++, i_data++, offset += inc) {
+- if (offset >= count + first)
+- break;
+- if (*i_data == 0 || (offset + inc) <= first)
+- continue;
+- blk = *i_data;
+- if (level > 0) {
+- ext4_lblk_t first2;
+- ext4_lblk_t count2;
++ ext4_lblk_t offsets[4], offsets2[4];
++ Indirect chain[4], chain2[4];
++ Indirect *partial, *partial2;
++ ext4_lblk_t max_block;
++ __le32 nr = 0, nr2 = 0;
++ int n = 0, n2 = 0;
++ unsigned blocksize = inode->i_sb->s_blocksize;
+
+- bh = sb_bread(inode->i_sb, le32_to_cpu(blk));
+- if (!bh) {
+- EXT4_ERROR_INODE_BLOCK(inode, le32_to_cpu(blk),
+- "Read failure");
+- return -EIO;
+- }
+- if (first > offset) {
+- first2 = first - offset;
+- count2 = count;
++ max_block = (EXT4_SB(inode->i_sb)->s_bitmap_maxbytes + blocksize-1)
++ >> EXT4_BLOCK_SIZE_BITS(inode->i_sb);
++ if (end >= max_block)
++ end = max_block;
++ if ((start >= end) || (start > max_block))
++ return 0;
++
++ n = ext4_block_to_path(inode, start, offsets, NULL);
++ n2 = ext4_block_to_path(inode, end, offsets2, NULL);
++
++ BUG_ON(n > n2);
++
++ if ((n == 1) && (n == n2)) {
++ /* We're punching only within direct block range */
++ ext4_free_data(handle, inode, NULL, i_data + offsets[0],
++ i_data + offsets2[0]);
++ return 0;
++ } else if (n2 > n) {
++ /*
++ * Start and end are on a different levels so we're going to
++ * free partial block at start, and partial block at end of
++ * the range. If there are some levels in between then
++ * do_indirects label will take care of that.
++ */
++
++ if (n == 1) {
++ /*
++ * Start is at the direct block level, free
++ * everything to the end of the level.
++ */
++ ext4_free_data(handle, inode, NULL, i_data + offsets[0],
++ i_data + EXT4_NDIR_BLOCKS);
++ goto end_range;
++ }
++
++
++ partial = ext4_find_shared(inode, n, offsets, chain, &nr);
++ if (nr) {
++ if (partial == chain) {
++ /* Shared branch grows from the inode */
++ ext4_free_branches(handle, inode, NULL,
++ &nr, &nr+1, (chain+n-1) - partial);
++ *partial->p = 0;
+ } else {
+- first2 = 0;
+- count2 = count - (offset - first);
++ /* Shared branch grows from an indirect block */
++ BUFFER_TRACE(partial->bh, "get_write_access");
++ ext4_free_branches(handle, inode, partial->bh,
++ partial->p,
++ partial->p+1, (chain+n-1) - partial);
+ }
+- ret = free_hole_blocks(handle, inode, bh,
+- (__le32 *)bh->b_data, level - 1,
+- first2, count2,
+- inode->i_sb->s_blocksize >> 2);
+- if (ret) {
+- brelse(bh);
+- goto err;
++ }
++
++ /*
++ * Clear the ends of indirect blocks on the shared branch
++ * at the start of the range
++ */
++ while (partial > chain) {
++ ext4_free_branches(handle, inode, partial->bh,
++ partial->p + 1,
++ (__le32 *)partial->bh->b_data+addr_per_block,
++ (chain+n-1) - partial);
++ BUFFER_TRACE(partial->bh, "call brelse");
++ brelse(partial->bh);
++ partial--;
++ }
++
++end_range:
++ partial2 = ext4_find_shared(inode, n2, offsets2, chain2, &nr2);
++ if (nr2) {
++ if (partial2 == chain2) {
++ /*
++ * Remember, end is exclusive so here we're at
++ * the start of the next level we're not going
++ * to free. Everything was covered by the start
++ * of the range.
++ */
++ return 0;
++ } else {
++ /* Shared branch grows from an indirect block */
++ partial2--;
+ }
++ } else {
++ /*
++ * ext4_find_shared returns Indirect structure which
++ * points to the last element which should not be
++ * removed by truncate. But this is end of the range
++ * in punch_hole so we need to point to the next element
++ */
++ partial2->p++;
+ }
+- if (level == 0 ||
+- (bh && all_zeroes((__le32 *)bh->b_data,
+- (__le32 *)bh->b_data + addr_per_block))) {
+- ext4_free_data(handle, inode, parent_bh,
+- i_data, i_data + 1);
++
++ /*
++ * Clear the ends of indirect blocks on the shared branch
++ * at the end of the range
++ */
++ while (partial2 > chain2) {
++ ext4_free_branches(handle, inode, partial2->bh,
++ (__le32 *)partial2->bh->b_data,
++ partial2->p,
++ (chain2+n2-1) - partial2);
++ BUFFER_TRACE(partial2->bh, "call brelse");
++ brelse(partial2->bh);
++ partial2--;
+ }
+- brelse(bh);
+- bh = NULL;
++ goto do_indirects;
+ }
+
+-err:
+- return ret;
+-}
+-
+-int ext4_free_hole_blocks(handle_t *handle, struct inode *inode,
+- ext4_lblk_t first, ext4_lblk_t stop)
+-{
+- int addr_per_block = EXT4_ADDR_PER_BLOCK(inode->i_sb);
+- int level, ret = 0;
+- int num = EXT4_NDIR_BLOCKS;
+- ext4_lblk_t count, max = EXT4_NDIR_BLOCKS;
+- __le32 *i_data = EXT4_I(inode)->i_data;
+-
+- count = stop - first;
+- for (level = 0; level < 4; level++, max *= addr_per_block) {
+- if (first < max) {
+- ret = free_hole_blocks(handle, inode, NULL, i_data,
+- level, first, count, num);
+- if (ret)
+- goto err;
+- if (count > max - first)
+- count -= max - first;
+- else
+- break;
+- first = 0;
+- } else {
+- first -= max;
++ /* Punch happened within the same level (n == n2) */
++ partial = ext4_find_shared(inode, n, offsets, chain, &nr);
++ partial2 = ext4_find_shared(inode, n2, offsets2, chain2, &nr2);
++ /*
++ * ext4_find_shared returns Indirect structure which
++ * points to the last element which should not be
++ * removed by truncate. But this is end of the range
++ * in punch_hole so we need to point to the next element
++ */
++ partial2->p++;
++ while ((partial > chain) || (partial2 > chain2)) {
++ /* We're at the same block, so we're almost finished */
++ if ((partial->bh && partial2->bh) &&
++ (partial->bh->b_blocknr == partial2->bh->b_blocknr)) {
++ if ((partial > chain) && (partial2 > chain2)) {
++ ext4_free_branches(handle, inode, partial->bh,
++ partial->p + 1,
++ partial2->p,
++ (chain+n-1) - partial);
++ BUFFER_TRACE(partial->bh, "call brelse");
++ brelse(partial->bh);
++ BUFFER_TRACE(partial2->bh, "call brelse");
++ brelse(partial2->bh);
++ }
++ return 0;
+ }
+- i_data += num;
+- if (level == 0) {
+- num = 1;
+- max = 1;
++ /*
++ * Clear the ends of indirect blocks on the shared branch
++ * at the start of the range
++ */
++ if (partial > chain) {
++ ext4_free_branches(handle, inode, partial->bh,
++ partial->p + 1,
++ (__le32 *)partial->bh->b_data+addr_per_block,
++ (chain+n-1) - partial);
++ BUFFER_TRACE(partial->bh, "call brelse");
++ brelse(partial->bh);
++ partial--;
++ }
++ /*
++ * Clear the ends of indirect blocks on the shared branch
++ * at the end of the range
++ */
++ if (partial2 > chain2) {
++ ext4_free_branches(handle, inode, partial2->bh,
++ (__le32 *)partial2->bh->b_data,
++ partial2->p,
++ (chain2+n-1) - partial2);
++ BUFFER_TRACE(partial2->bh, "call brelse");
++ brelse(partial2->bh);
++ partial2--;
+ }
+ }
+
+-err:
+- return ret;
++do_indirects:
++ /* Kill the remaining (whole) subtrees */
++ switch (offsets[0]) {
++ default:
++ if (++n >= n2)
++ return 0;
++ nr = i_data[EXT4_IND_BLOCK];
++ if (nr) {
++ ext4_free_branches(handle, inode, NULL, &nr, &nr+1, 1);
++ i_data[EXT4_IND_BLOCK] = 0;
++ }
++ case EXT4_IND_BLOCK:
++ if (++n >= n2)
++ return 0;
++ nr = i_data[EXT4_DIND_BLOCK];
++ if (nr) {
++ ext4_free_branches(handle, inode, NULL, &nr, &nr+1, 2);
++ i_data[EXT4_DIND_BLOCK] = 0;
++ }
++ case EXT4_DIND_BLOCK:
++ if (++n >= n2)
++ return 0;
++ nr = i_data[EXT4_TIND_BLOCK];
++ if (nr) {
++ ext4_free_branches(handle, inode, NULL, &nr, &nr+1, 3);
++ i_data[EXT4_TIND_BLOCK] = 0;
++ }
++ case EXT4_TIND_BLOCK:
++ ;
++ }
++ return 0;
+ }
+-
+diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
+index 8a064734e6eb..e9c9b5bd906a 100644
+--- a/fs/ext4/inode.c
++++ b/fs/ext4/inode.c
+@@ -1092,27 +1092,11 @@ static int ext4_write_end(struct file *file,
+ } else
+ copied = block_write_end(file, mapping, pos,
+ len, copied, page, fsdata);
+-
+ /*
+- * No need to use i_size_read() here, the i_size
+- * cannot change under us because we hole i_mutex.
+- *
+- * But it's important to update i_size while still holding page lock:
++ * it's important to update i_size while still holding page lock:
+ * page writeout could otherwise come in and zero beyond i_size.
+ */
+- if (pos + copied > inode->i_size) {
+- i_size_write(inode, pos + copied);
+- i_size_changed = 1;
+- }
+-
+- if (pos + copied > EXT4_I(inode)->i_disksize) {
+- /* We need to mark inode dirty even if
+- * new_i_size is less that inode->i_size
+- * but greater than i_disksize. (hint delalloc)
+- */
+- ext4_update_i_disksize(inode, (pos + copied));
+- i_size_changed = 1;
+- }
++ i_size_changed = ext4_update_inode_size(inode, pos + copied);
+ unlock_page(page);
+ page_cache_release(page);
+
+@@ -1160,7 +1144,7 @@ static int ext4_journalled_write_end(struct file *file,
+ int ret = 0, ret2;
+ int partial = 0;
+ unsigned from, to;
+- loff_t new_i_size;
++ int size_changed = 0;
+
+ trace_ext4_journalled_write_end(inode, pos, len, copied);
+ from = pos & (PAGE_CACHE_SIZE - 1);
+@@ -1183,20 +1167,18 @@ static int ext4_journalled_write_end(struct file *file,
+ if (!partial)
+ SetPageUptodate(page);
+ }
+- new_i_size = pos + copied;
+- if (new_i_size > inode->i_size)
+- i_size_write(inode, pos+copied);
++ size_changed = ext4_update_inode_size(inode, pos + copied);
+ ext4_set_inode_state(inode, EXT4_STATE_JDATA);
+ EXT4_I(inode)->i_datasync_tid = handle->h_transaction->t_tid;
+- if (new_i_size > EXT4_I(inode)->i_disksize) {
+- ext4_update_i_disksize(inode, new_i_size);
++ unlock_page(page);
++ page_cache_release(page);
++
++ if (size_changed) {
+ ret2 = ext4_mark_inode_dirty(handle, inode);
+ if (!ret)
+ ret = ret2;
+ }
+
+- unlock_page(page);
+- page_cache_release(page);
+ if (pos + len > inode->i_size && ext4_can_truncate(inode))
+ /* if we have allocated more blocks and copied
+ * less. We will have blocks allocated outside
+@@ -2212,6 +2194,7 @@ static int mpage_map_and_submit_extent(handle_t *handle,
+ struct ext4_map_blocks *map = &mpd->map;
+ int err;
+ loff_t disksize;
++ int progress = 0;
+
+ mpd->io_submit.io_end->offset =
+ ((loff_t)map->m_lblk) << inode->i_blkbits;
+@@ -2228,8 +2211,11 @@ static int mpage_map_and_submit_extent(handle_t *handle,
+ * is non-zero, a commit should free up blocks.
+ */
+ if ((err == -ENOMEM) ||
+- (err == -ENOSPC && ext4_count_free_clusters(sb)))
++ (err == -ENOSPC && ext4_count_free_clusters(sb))) {
++ if (progress)
++ goto update_disksize;
+ return err;
++ }
+ ext4_msg(sb, KERN_CRIT,
+ "Delayed block allocation failed for "
+ "inode %lu at logical offset %llu with"
+@@ -2246,15 +2232,17 @@ static int mpage_map_and_submit_extent(handle_t *handle,
+ *give_up_on_write = true;
+ return err;
+ }
++ progress = 1;
+ /*
+ * Update buffer state, submit mapped pages, and get us new
+ * extent to map
+ */
+ err = mpage_map_and_submit_buffers(mpd);
+ if (err < 0)
+- return err;
++ goto update_disksize;
+ } while (map->m_len);
+
++update_disksize:
+ /*
+ * Update on-disk size after IO is submitted. Races with
+ * truncate are avoided by checking i_size under i_data_sem.
+@@ -3624,7 +3612,7 @@ int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length)
+ ret = ext4_ext_remove_space(inode, first_block,
+ stop_block - 1);
+ else
+- ret = ext4_free_hole_blocks(handle, inode, first_block,
++ ret = ext4_ind_remove_space(handle, inode, first_block,
+ stop_block);
+
+ up_write(&EXT4_I(inode)->i_data_sem);
+diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
+index 2dcb936be90e..c3e7418a6811 100644
+--- a/fs/ext4/mballoc.c
++++ b/fs/ext4/mballoc.c
+@@ -1412,6 +1412,8 @@ static void mb_free_blocks(struct inode *inode, struct ext4_buddy *e4b,
+ int last = first + count - 1;
+ struct super_block *sb = e4b->bd_sb;
+
++ if (WARN_ON(count == 0))
++ return;
+ BUG_ON(last >= (sb->s_blocksize << 3));
+ assert_spin_locked(ext4_group_lock_ptr(sb, e4b->bd_group));
+ /* Don't bother if the block group is corrupt. */
+@@ -3216,8 +3218,30 @@ static void ext4_mb_collect_stats(struct ext4_allocation_context *ac)
+ static void ext4_discard_allocated_blocks(struct ext4_allocation_context *ac)
+ {
+ struct ext4_prealloc_space *pa = ac->ac_pa;
++ struct ext4_buddy e4b;
++ int err;
+
+- if (pa && pa->pa_type == MB_INODE_PA)
++ if (pa == NULL) {
++ if (ac->ac_f_ex.fe_len == 0)
++ return;
++ err = ext4_mb_load_buddy(ac->ac_sb, ac->ac_f_ex.fe_group, &e4b);
++ if (err) {
++ /*
++ * This should never happen since we pin the
++ * pages in the ext4_allocation_context so
++ * ext4_mb_load_buddy() should never fail.
++ */
++ WARN(1, "mb_load_buddy failed (%d)", err);
++ return;
++ }
++ ext4_lock_group(ac->ac_sb, ac->ac_f_ex.fe_group);
++ mb_free_blocks(ac->ac_inode, &e4b, ac->ac_f_ex.fe_start,
++ ac->ac_f_ex.fe_len);
++ ext4_unlock_group(ac->ac_sb, ac->ac_f_ex.fe_group);
++ ext4_mb_unload_buddy(&e4b);
++ return;
++ }
++ if (pa->pa_type == MB_INODE_PA)
+ pa->pa_free += ac->ac_b_ex.fe_len;
+ }
+
+diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
+index 3520ab8a6639..9e6eced1605b 100644
+--- a/fs/ext4/namei.c
++++ b/fs/ext4/namei.c
+@@ -3128,7 +3128,8 @@ static int ext4_find_delete_entry(handle_t *handle, struct inode *dir,
+ return retval;
+ }
+
+-static void ext4_rename_delete(handle_t *handle, struct ext4_renament *ent)
++static void ext4_rename_delete(handle_t *handle, struct ext4_renament *ent,
++ int force_reread)
+ {
+ int retval;
+ /*
+@@ -3140,7 +3141,8 @@ static void ext4_rename_delete(handle_t *handle, struct ext4_renament *ent)
+ if (le32_to_cpu(ent->de->inode) != ent->inode->i_ino ||
+ ent->de->name_len != ent->dentry->d_name.len ||
+ strncmp(ent->de->name, ent->dentry->d_name.name,
+- ent->de->name_len)) {
++ ent->de->name_len) ||
++ force_reread) {
+ retval = ext4_find_delete_entry(handle, ent->dir,
+ &ent->dentry->d_name);
+ } else {
+@@ -3191,6 +3193,7 @@ static int ext4_rename(struct inode *old_dir, struct dentry *old_dentry,
+ .dentry = new_dentry,
+ .inode = new_dentry->d_inode,
+ };
++ int force_reread;
+ int retval;
+
+ dquot_initialize(old.dir);
+@@ -3246,6 +3249,15 @@ static int ext4_rename(struct inode *old_dir, struct dentry *old_dentry,
+ if (retval)
+ goto end_rename;
+ }
++ /*
++ * If we're renaming a file within an inline_data dir and adding or
++ * setting the new dirent causes a conversion from inline_data to
++ * extents/blockmap, we need to force the dirent delete code to
++ * re-read the directory, or else we end up trying to delete a dirent
++ * from what is now the extent tree root (or a block map).
++ */
++ force_reread = (new.dir->i_ino == old.dir->i_ino &&
++ ext4_test_inode_flag(new.dir, EXT4_INODE_INLINE_DATA));
+ if (!new.bh) {
+ retval = ext4_add_entry(handle, new.dentry, old.inode);
+ if (retval)
+@@ -3256,6 +3268,9 @@ static int ext4_rename(struct inode *old_dir, struct dentry *old_dentry,
+ if (retval)
+ goto end_rename;
+ }
++ if (force_reread)
++ force_reread = !ext4_test_inode_flag(new.dir,
++ EXT4_INODE_INLINE_DATA);
+
+ /*
+ * Like most other Unix systems, set the ctime for inodes on a
+@@ -3267,7 +3282,7 @@ static int ext4_rename(struct inode *old_dir, struct dentry *old_dentry,
+ /*
+ * ok, that's it
+ */
+- ext4_rename_delete(handle, &old);
++ ext4_rename_delete(handle, &old, force_reread);
+
+ if (new.inode) {
+ ext4_dec_count(handle, new.inode);
+diff --git a/fs/ext4/super.c b/fs/ext4/super.c
+index 6df7bc611dbd..beeb5c4e1f9d 100644
+--- a/fs/ext4/super.c
++++ b/fs/ext4/super.c
+@@ -3185,9 +3185,9 @@ static int set_journal_csum_feature_set(struct super_block *sb)
+
+ if (EXT4_HAS_RO_COMPAT_FEATURE(sb,
+ EXT4_FEATURE_RO_COMPAT_METADATA_CSUM)) {
+- /* journal checksum v2 */
++ /* journal checksum v3 */
+ compat = 0;
+- incompat = JBD2_FEATURE_INCOMPAT_CSUM_V2;
++ incompat = JBD2_FEATURE_INCOMPAT_CSUM_V3;
+ } else {
+ /* journal checksum v1 */
+ compat = JBD2_FEATURE_COMPAT_CHECKSUM;
+@@ -3209,6 +3209,7 @@ static int set_journal_csum_feature_set(struct super_block *sb)
+ jbd2_journal_clear_features(sbi->s_journal,
+ JBD2_FEATURE_COMPAT_CHECKSUM, 0,
+ JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT |
++ JBD2_FEATURE_INCOMPAT_CSUM_V3 |
+ JBD2_FEATURE_INCOMPAT_CSUM_V2);
+ }
+
+diff --git a/fs/isofs/inode.c b/fs/isofs/inode.c
+index 4556ce1af5b0..5ddaf8625d3b 100644
+--- a/fs/isofs/inode.c
++++ b/fs/isofs/inode.c
+@@ -61,7 +61,7 @@ static void isofs_put_super(struct super_block *sb)
+ return;
+ }
+
+-static int isofs_read_inode(struct inode *);
++static int isofs_read_inode(struct inode *, int relocated);
+ static int isofs_statfs (struct dentry *, struct kstatfs *);
+
+ static struct kmem_cache *isofs_inode_cachep;
+@@ -1259,7 +1259,7 @@ out_toomany:
+ goto out;
+ }
+
+-static int isofs_read_inode(struct inode *inode)
++static int isofs_read_inode(struct inode *inode, int relocated)
+ {
+ struct super_block *sb = inode->i_sb;
+ struct isofs_sb_info *sbi = ISOFS_SB(sb);
+@@ -1404,7 +1404,7 @@ static int isofs_read_inode(struct inode *inode)
+ */
+
+ if (!high_sierra) {
+- parse_rock_ridge_inode(de, inode);
++ parse_rock_ridge_inode(de, inode, relocated);
+ /* if we want uid/gid set, override the rock ridge setting */
+ if (sbi->s_uid_set)
+ inode->i_uid = sbi->s_uid;
+@@ -1483,9 +1483,10 @@ static int isofs_iget5_set(struct inode *ino, void *data)
+ * offset that point to the underlying meta-data for the inode. The
+ * code below is otherwise similar to the iget() code in
+ * include/linux/fs.h */
+-struct inode *isofs_iget(struct super_block *sb,
+- unsigned long block,
+- unsigned long offset)
++struct inode *__isofs_iget(struct super_block *sb,
++ unsigned long block,
++ unsigned long offset,
++ int relocated)
+ {
+ unsigned long hashval;
+ struct inode *inode;
+@@ -1507,7 +1508,7 @@ struct inode *isofs_iget(struct super_block *sb,
+ return ERR_PTR(-ENOMEM);
+
+ if (inode->i_state & I_NEW) {
+- ret = isofs_read_inode(inode);
++ ret = isofs_read_inode(inode, relocated);
+ if (ret < 0) {
+ iget_failed(inode);
+ inode = ERR_PTR(ret);
+diff --git a/fs/isofs/isofs.h b/fs/isofs/isofs.h
+index 99167238518d..0ac4c1f73fbd 100644
+--- a/fs/isofs/isofs.h
++++ b/fs/isofs/isofs.h
+@@ -107,7 +107,7 @@ extern int iso_date(char *, int);
+
+ struct inode; /* To make gcc happy */
+
+-extern int parse_rock_ridge_inode(struct iso_directory_record *, struct inode *);
++extern int parse_rock_ridge_inode(struct iso_directory_record *, struct inode *, int relocated);
+ extern int get_rock_ridge_filename(struct iso_directory_record *, char *, struct inode *);
+ extern int isofs_name_translate(struct iso_directory_record *, char *, struct inode *);
+
+@@ -118,9 +118,24 @@ extern struct dentry *isofs_lookup(struct inode *, struct dentry *, unsigned int
+ extern struct buffer_head *isofs_bread(struct inode *, sector_t);
+ extern int isofs_get_blocks(struct inode *, sector_t, struct buffer_head **, unsigned long);
+
+-extern struct inode *isofs_iget(struct super_block *sb,
+- unsigned long block,
+- unsigned long offset);
++struct inode *__isofs_iget(struct super_block *sb,
++ unsigned long block,
++ unsigned long offset,
++ int relocated);
++
++static inline struct inode *isofs_iget(struct super_block *sb,
++ unsigned long block,
++ unsigned long offset)
++{
++ return __isofs_iget(sb, block, offset, 0);
++}
++
++static inline struct inode *isofs_iget_reloc(struct super_block *sb,
++ unsigned long block,
++ unsigned long offset)
++{
++ return __isofs_iget(sb, block, offset, 1);
++}
+
+ /* Because the inode number is no longer relevant to finding the
+ * underlying meta-data for an inode, we are free to choose a more
+diff --git a/fs/isofs/rock.c b/fs/isofs/rock.c
+index c0bf42472e40..f488bbae541a 100644
+--- a/fs/isofs/rock.c
++++ b/fs/isofs/rock.c
+@@ -288,12 +288,16 @@ eio:
+ goto out;
+ }
+
++#define RR_REGARD_XA 1
++#define RR_RELOC_DE 2
++
+ static int
+ parse_rock_ridge_inode_internal(struct iso_directory_record *de,
+- struct inode *inode, int regard_xa)
++ struct inode *inode, int flags)
+ {
+ int symlink_len = 0;
+ int cnt, sig;
++ unsigned int reloc_block;
+ struct inode *reloc;
+ struct rock_ridge *rr;
+ int rootflag;
+@@ -305,7 +309,7 @@ parse_rock_ridge_inode_internal(struct iso_directory_record *de,
+
+ init_rock_state(&rs, inode);
+ setup_rock_ridge(de, inode, &rs);
+- if (regard_xa) {
++ if (flags & RR_REGARD_XA) {
+ rs.chr += 14;
+ rs.len -= 14;
+ if (rs.len < 0)
+@@ -485,12 +489,22 @@ repeat:
+ "relocated directory\n");
+ goto out;
+ case SIG('C', 'L'):
+- ISOFS_I(inode)->i_first_extent =
+- isonum_733(rr->u.CL.location);
+- reloc =
+- isofs_iget(inode->i_sb,
+- ISOFS_I(inode)->i_first_extent,
+- 0);
++ if (flags & RR_RELOC_DE) {
++ printk(KERN_ERR
++ "ISOFS: Recursive directory relocation "
++ "is not supported\n");
++ goto eio;
++ }
++ reloc_block = isonum_733(rr->u.CL.location);
++ if (reloc_block == ISOFS_I(inode)->i_iget5_block &&
++ ISOFS_I(inode)->i_iget5_offset == 0) {
++ printk(KERN_ERR
++ "ISOFS: Directory relocation points to "
++ "itself\n");
++ goto eio;
++ }
++ ISOFS_I(inode)->i_first_extent = reloc_block;
++ reloc = isofs_iget_reloc(inode->i_sb, reloc_block, 0);
+ if (IS_ERR(reloc)) {
+ ret = PTR_ERR(reloc);
+ goto out;
+@@ -637,9 +651,11 @@ static char *get_symlink_chunk(char *rpnt, struct rock_ridge *rr, char *plimit)
+ return rpnt;
+ }
+
+-int parse_rock_ridge_inode(struct iso_directory_record *de, struct inode *inode)
++int parse_rock_ridge_inode(struct iso_directory_record *de, struct inode *inode,
++ int relocated)
+ {
+- int result = parse_rock_ridge_inode_internal(de, inode, 0);
++ int flags = relocated ? RR_RELOC_DE : 0;
++ int result = parse_rock_ridge_inode_internal(de, inode, flags);
+
+ /*
+ * if rockridge flag was reset and we didn't look for attributes
+@@ -647,7 +663,8 @@ int parse_rock_ridge_inode(struct iso_directory_record *de, struct inode *inode)
+ */
+ if ((ISOFS_SB(inode->i_sb)->s_rock_offset == -1)
+ && (ISOFS_SB(inode->i_sb)->s_rock == 2)) {
+- result = parse_rock_ridge_inode_internal(de, inode, 14);
++ result = parse_rock_ridge_inode_internal(de, inode,
++ flags | RR_REGARD_XA);
+ }
+ return result;
+ }
+diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
+index 6fac74349856..b73e0215baa7 100644
+--- a/fs/jbd2/commit.c
++++ b/fs/jbd2/commit.c
+@@ -97,7 +97,7 @@ static void jbd2_commit_block_csum_set(journal_t *j, struct buffer_head *bh)
+ struct commit_header *h;
+ __u32 csum;
+
+- if (!JBD2_HAS_INCOMPAT_FEATURE(j, JBD2_FEATURE_INCOMPAT_CSUM_V2))
++ if (!jbd2_journal_has_csum_v2or3(j))
+ return;
+
+ h = (struct commit_header *)(bh->b_data);
+@@ -313,11 +313,11 @@ static __u32 jbd2_checksum_data(__u32 crc32_sum, struct buffer_head *bh)
+ return checksum;
+ }
+
+-static void write_tag_block(int tag_bytes, journal_block_tag_t *tag,
++static void write_tag_block(journal_t *j, journal_block_tag_t *tag,
+ unsigned long long block)
+ {
+ tag->t_blocknr = cpu_to_be32(block & (u32)~0);
+- if (tag_bytes > JBD2_TAG_SIZE32)
++ if (JBD2_HAS_INCOMPAT_FEATURE(j, JBD2_FEATURE_INCOMPAT_64BIT))
+ tag->t_blocknr_high = cpu_to_be32((block >> 31) >> 1);
+ }
+
+@@ -327,7 +327,7 @@ static void jbd2_descr_block_csum_set(journal_t *j,
+ struct jbd2_journal_block_tail *tail;
+ __u32 csum;
+
+- if (!JBD2_HAS_INCOMPAT_FEATURE(j, JBD2_FEATURE_INCOMPAT_CSUM_V2))
++ if (!jbd2_journal_has_csum_v2or3(j))
+ return;
+
+ tail = (struct jbd2_journal_block_tail *)(bh->b_data + j->j_blocksize -
+@@ -340,12 +340,13 @@ static void jbd2_descr_block_csum_set(journal_t *j,
+ static void jbd2_block_tag_csum_set(journal_t *j, journal_block_tag_t *tag,
+ struct buffer_head *bh, __u32 sequence)
+ {
++ journal_block_tag3_t *tag3 = (journal_block_tag3_t *)tag;
+ struct page *page = bh->b_page;
+ __u8 *addr;
+ __u32 csum32;
+ __be32 seq;
+
+- if (!JBD2_HAS_INCOMPAT_FEATURE(j, JBD2_FEATURE_INCOMPAT_CSUM_V2))
++ if (!jbd2_journal_has_csum_v2or3(j))
+ return;
+
+ seq = cpu_to_be32(sequence);
+@@ -355,8 +356,10 @@ static void jbd2_block_tag_csum_set(journal_t *j, journal_block_tag_t *tag,
+ bh->b_size);
+ kunmap_atomic(addr);
+
+- /* We only have space to store the lower 16 bits of the crc32c. */
+- tag->t_checksum = cpu_to_be16(csum32);
++ if (JBD2_HAS_INCOMPAT_FEATURE(j, JBD2_FEATURE_INCOMPAT_CSUM_V3))
++ tag3->t_checksum = cpu_to_be32(csum32);
++ else
++ tag->t_checksum = cpu_to_be16(csum32);
+ }
+ /*
+ * jbd2_journal_commit_transaction
+@@ -396,7 +399,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
+ LIST_HEAD(io_bufs);
+ LIST_HEAD(log_bufs);
+
+- if (JBD2_HAS_INCOMPAT_FEATURE(journal, JBD2_FEATURE_INCOMPAT_CSUM_V2))
++ if (jbd2_journal_has_csum_v2or3(journal))
+ csum_size = sizeof(struct jbd2_journal_block_tail);
+
+ /*
+@@ -690,7 +693,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
+ tag_flag |= JBD2_FLAG_SAME_UUID;
+
+ tag = (journal_block_tag_t *) tagp;
+- write_tag_block(tag_bytes, tag, jh2bh(jh)->b_blocknr);
++ write_tag_block(journal, tag, jh2bh(jh)->b_blocknr);
+ tag->t_flags = cpu_to_be16(tag_flag);
+ jbd2_block_tag_csum_set(journal, tag, wbuf[bufs],
+ commit_transaction->t_tid);
+diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
+index 67b8e303946c..19d74d86d99c 100644
+--- a/fs/jbd2/journal.c
++++ b/fs/jbd2/journal.c
+@@ -124,7 +124,7 @@ EXPORT_SYMBOL(__jbd2_debug);
+ /* Checksumming functions */
+ static int jbd2_verify_csum_type(journal_t *j, journal_superblock_t *sb)
+ {
+- if (!JBD2_HAS_INCOMPAT_FEATURE(j, JBD2_FEATURE_INCOMPAT_CSUM_V2))
++ if (!jbd2_journal_has_csum_v2or3(j))
+ return 1;
+
+ return sb->s_checksum_type == JBD2_CRC32C_CHKSUM;
+@@ -145,7 +145,7 @@ static __be32 jbd2_superblock_csum(journal_t *j, journal_superblock_t *sb)
+
+ static int jbd2_superblock_csum_verify(journal_t *j, journal_superblock_t *sb)
+ {
+- if (!JBD2_HAS_INCOMPAT_FEATURE(j, JBD2_FEATURE_INCOMPAT_CSUM_V2))
++ if (!jbd2_journal_has_csum_v2or3(j))
+ return 1;
+
+ return sb->s_checksum == jbd2_superblock_csum(j, sb);
+@@ -153,7 +153,7 @@ static int jbd2_superblock_csum_verify(journal_t *j, journal_superblock_t *sb)
+
+ static void jbd2_superblock_csum_set(journal_t *j, journal_superblock_t *sb)
+ {
+- if (!JBD2_HAS_INCOMPAT_FEATURE(j, JBD2_FEATURE_INCOMPAT_CSUM_V2))
++ if (!jbd2_journal_has_csum_v2or3(j))
+ return;
+
+ sb->s_checksum = jbd2_superblock_csum(j, sb);
+@@ -1522,21 +1522,29 @@ static int journal_get_superblock(journal_t *journal)
+ goto out;
+ }
+
+- if (JBD2_HAS_COMPAT_FEATURE(journal, JBD2_FEATURE_COMPAT_CHECKSUM) &&
+- JBD2_HAS_INCOMPAT_FEATURE(journal, JBD2_FEATURE_INCOMPAT_CSUM_V2)) {
++ if (jbd2_journal_has_csum_v2or3(journal) &&
++ JBD2_HAS_COMPAT_FEATURE(journal, JBD2_FEATURE_COMPAT_CHECKSUM)) {
+ /* Can't have checksum v1 and v2 on at the same time! */
+ printk(KERN_ERR "JBD2: Can't enable checksumming v1 and v2 "
+ "at the same time!\n");
+ goto out;
+ }
+
++ if (JBD2_HAS_INCOMPAT_FEATURE(journal, JBD2_FEATURE_INCOMPAT_CSUM_V2) &&
++ JBD2_HAS_INCOMPAT_FEATURE(journal, JBD2_FEATURE_INCOMPAT_CSUM_V3)) {
++ /* Can't have checksum v2 and v3 at the same time! */
++ printk(KERN_ERR "JBD2: Can't enable checksumming v2 and v3 "
++ "at the same time!\n");
++ goto out;
++ }
++
+ if (!jbd2_verify_csum_type(journal, sb)) {
+ printk(KERN_ERR "JBD2: Unknown checksum type\n");
+ goto out;
+ }
+
+ /* Load the checksum driver */
+- if (JBD2_HAS_INCOMPAT_FEATURE(journal, JBD2_FEATURE_INCOMPAT_CSUM_V2)) {
++ if (jbd2_journal_has_csum_v2or3(journal)) {
+ journal->j_chksum_driver = crypto_alloc_shash("crc32c", 0, 0);
+ if (IS_ERR(journal->j_chksum_driver)) {
+ printk(KERN_ERR "JBD2: Cannot load crc32c driver.\n");
+@@ -1553,7 +1561,7 @@ static int journal_get_superblock(journal_t *journal)
+ }
+
+ /* Precompute checksum seed for all metadata */
+- if (JBD2_HAS_INCOMPAT_FEATURE(journal, JBD2_FEATURE_INCOMPAT_CSUM_V2))
++ if (jbd2_journal_has_csum_v2or3(journal))
+ journal->j_csum_seed = jbd2_chksum(journal, ~0, sb->s_uuid,
+ sizeof(sb->s_uuid));
+
+@@ -1813,8 +1821,14 @@ int jbd2_journal_set_features (journal_t *journal, unsigned long compat,
+ if (!jbd2_journal_check_available_features(journal, compat, ro, incompat))
+ return 0;
+
+- /* Asking for checksumming v2 and v1? Only give them v2. */
+- if (incompat & JBD2_FEATURE_INCOMPAT_CSUM_V2 &&
++ /* If enabling v2 checksums, turn on v3 instead */
++ if (incompat & JBD2_FEATURE_INCOMPAT_CSUM_V2) {
++ incompat &= ~JBD2_FEATURE_INCOMPAT_CSUM_V2;
++ incompat |= JBD2_FEATURE_INCOMPAT_CSUM_V3;
++ }
++
++ /* Asking for checksumming v3 and v1? Only give them v3. */
++ if (incompat & JBD2_FEATURE_INCOMPAT_CSUM_V3 &&
+ compat & JBD2_FEATURE_COMPAT_CHECKSUM)
+ compat &= ~JBD2_FEATURE_COMPAT_CHECKSUM;
+
+@@ -1823,8 +1837,8 @@ int jbd2_journal_set_features (journal_t *journal, unsigned long compat,
+
+ sb = journal->j_superblock;
+
+- /* If enabling v2 checksums, update superblock */
+- if (INCOMPAT_FEATURE_ON(JBD2_FEATURE_INCOMPAT_CSUM_V2)) {
++ /* If enabling v3 checksums, update superblock */
++ if (INCOMPAT_FEATURE_ON(JBD2_FEATURE_INCOMPAT_CSUM_V3)) {
+ sb->s_checksum_type = JBD2_CRC32C_CHKSUM;
+ sb->s_feature_compat &=
+ ~cpu_to_be32(JBD2_FEATURE_COMPAT_CHECKSUM);
+@@ -1842,8 +1856,7 @@ int jbd2_journal_set_features (journal_t *journal, unsigned long compat,
+ }
+
+ /* Precompute checksum seed for all metadata */
+- if (JBD2_HAS_INCOMPAT_FEATURE(journal,
+- JBD2_FEATURE_INCOMPAT_CSUM_V2))
++ if (jbd2_journal_has_csum_v2or3(journal))
+ journal->j_csum_seed = jbd2_chksum(journal, ~0,
+ sb->s_uuid,
+ sizeof(sb->s_uuid));
+@@ -1852,7 +1865,8 @@ int jbd2_journal_set_features (journal_t *journal, unsigned long compat,
+ /* If enabling v1 checksums, downgrade superblock */
+ if (COMPAT_FEATURE_ON(JBD2_FEATURE_COMPAT_CHECKSUM))
+ sb->s_feature_incompat &=
+- ~cpu_to_be32(JBD2_FEATURE_INCOMPAT_CSUM_V2);
++ ~cpu_to_be32(JBD2_FEATURE_INCOMPAT_CSUM_V2 |
++ JBD2_FEATURE_INCOMPAT_CSUM_V3);
+
+ sb->s_feature_compat |= cpu_to_be32(compat);
+ sb->s_feature_ro_compat |= cpu_to_be32(ro);
+@@ -2165,16 +2179,20 @@ int jbd2_journal_blocks_per_page(struct inode *inode)
+ */
+ size_t journal_tag_bytes(journal_t *journal)
+ {
+- journal_block_tag_t tag;
+- size_t x = 0;
++ size_t sz;
++
++ if (JBD2_HAS_INCOMPAT_FEATURE(journal, JBD2_FEATURE_INCOMPAT_CSUM_V3))
++ return sizeof(journal_block_tag3_t);
++
++ sz = sizeof(journal_block_tag_t);
+
+ if (JBD2_HAS_INCOMPAT_FEATURE(journal, JBD2_FEATURE_INCOMPAT_CSUM_V2))
+- x += sizeof(tag.t_checksum);
++ sz += sizeof(__u16);
+
+ if (JBD2_HAS_INCOMPAT_FEATURE(journal, JBD2_FEATURE_INCOMPAT_64BIT))
+- return x + JBD2_TAG_SIZE64;
++ return sz;
+ else
+- return x + JBD2_TAG_SIZE32;
++ return sz - sizeof(__u32);
+ }
+
+ /*
+diff --git a/fs/jbd2/recovery.c b/fs/jbd2/recovery.c
+index 3b6bb19d60b1..9b329b55ffe3 100644
+--- a/fs/jbd2/recovery.c
++++ b/fs/jbd2/recovery.c
+@@ -181,7 +181,7 @@ static int jbd2_descr_block_csum_verify(journal_t *j,
+ __be32 provided;
+ __u32 calculated;
+
+- if (!JBD2_HAS_INCOMPAT_FEATURE(j, JBD2_FEATURE_INCOMPAT_CSUM_V2))
++ if (!jbd2_journal_has_csum_v2or3(j))
+ return 1;
+
+ tail = (struct jbd2_journal_block_tail *)(buf + j->j_blocksize -
+@@ -205,7 +205,7 @@ static int count_tags(journal_t *journal, struct buffer_head *bh)
+ int nr = 0, size = journal->j_blocksize;
+ int tag_bytes = journal_tag_bytes(journal);
+
+- if (JBD2_HAS_INCOMPAT_FEATURE(journal, JBD2_FEATURE_INCOMPAT_CSUM_V2))
++ if (jbd2_journal_has_csum_v2or3(journal))
+ size -= sizeof(struct jbd2_journal_block_tail);
+
+ tagp = &bh->b_data[sizeof(journal_header_t)];
+@@ -338,10 +338,11 @@ int jbd2_journal_skip_recovery(journal_t *journal)
+ return err;
+ }
+
+-static inline unsigned long long read_tag_block(int tag_bytes, journal_block_tag_t *tag)
++static inline unsigned long long read_tag_block(journal_t *journal,
++ journal_block_tag_t *tag)
+ {
+ unsigned long long block = be32_to_cpu(tag->t_blocknr);
+- if (tag_bytes > JBD2_TAG_SIZE32)
++ if (JBD2_HAS_INCOMPAT_FEATURE(journal, JBD2_FEATURE_INCOMPAT_64BIT))
+ block |= (u64)be32_to_cpu(tag->t_blocknr_high) << 32;
+ return block;
+ }
+@@ -384,7 +385,7 @@ static int jbd2_commit_block_csum_verify(journal_t *j, void *buf)
+ __be32 provided;
+ __u32 calculated;
+
+- if (!JBD2_HAS_INCOMPAT_FEATURE(j, JBD2_FEATURE_INCOMPAT_CSUM_V2))
++ if (!jbd2_journal_has_csum_v2or3(j))
+ return 1;
+
+ h = buf;
+@@ -399,17 +400,21 @@ static int jbd2_commit_block_csum_verify(journal_t *j, void *buf)
+ static int jbd2_block_tag_csum_verify(journal_t *j, journal_block_tag_t *tag,
+ void *buf, __u32 sequence)
+ {
++ journal_block_tag3_t *tag3 = (journal_block_tag3_t *)tag;
+ __u32 csum32;
+ __be32 seq;
+
+- if (!JBD2_HAS_INCOMPAT_FEATURE(j, JBD2_FEATURE_INCOMPAT_CSUM_V2))
++ if (!jbd2_journal_has_csum_v2or3(j))
+ return 1;
+
+ seq = cpu_to_be32(sequence);
+ csum32 = jbd2_chksum(j, j->j_csum_seed, (__u8 *)&seq, sizeof(seq));
+ csum32 = jbd2_chksum(j, csum32, buf, j->j_blocksize);
+
+- return tag->t_checksum == cpu_to_be16(csum32);
++ if (JBD2_HAS_INCOMPAT_FEATURE(j, JBD2_FEATURE_INCOMPAT_CSUM_V3))
++ return tag3->t_checksum == cpu_to_be32(csum32);
++ else
++ return tag->t_checksum == cpu_to_be16(csum32);
+ }
+
+ static int do_one_pass(journal_t *journal,
+@@ -426,6 +431,7 @@ static int do_one_pass(journal_t *journal,
+ int tag_bytes = journal_tag_bytes(journal);
+ __u32 crc32_sum = ~0; /* Transactional Checksums */
+ int descr_csum_size = 0;
++ int block_error = 0;
+
+ /*
+ * First thing is to establish what we expect to find in the log
+@@ -512,8 +518,7 @@ static int do_one_pass(journal_t *journal,
+ switch(blocktype) {
+ case JBD2_DESCRIPTOR_BLOCK:
+ /* Verify checksum first */
+- if (JBD2_HAS_INCOMPAT_FEATURE(journal,
+- JBD2_FEATURE_INCOMPAT_CSUM_V2))
++ if (jbd2_journal_has_csum_v2or3(journal))
+ descr_csum_size =
+ sizeof(struct jbd2_journal_block_tail);
+ if (descr_csum_size > 0 &&
+@@ -574,7 +579,7 @@ static int do_one_pass(journal_t *journal,
+ unsigned long long blocknr;
+
+ J_ASSERT(obh != NULL);
+- blocknr = read_tag_block(tag_bytes,
++ blocknr = read_tag_block(journal,
+ tag);
+
+ /* If the block has been
+@@ -598,7 +603,8 @@ static int do_one_pass(journal_t *journal,
+ "checksum recovering "
+ "block %llu in log\n",
+ blocknr);
+- continue;
++ block_error = 1;
++ goto skip_write;
+ }
+
+ /* Find a buffer for the new
+@@ -797,7 +803,8 @@ static int do_one_pass(journal_t *journal,
+ success = -EIO;
+ }
+ }
+-
++ if (block_error && success == 0)
++ success = -EIO;
+ return success;
+
+ failed:
+@@ -811,7 +818,7 @@ static int jbd2_revoke_block_csum_verify(journal_t *j,
+ __be32 provided;
+ __u32 calculated;
+
+- if (!JBD2_HAS_INCOMPAT_FEATURE(j, JBD2_FEATURE_INCOMPAT_CSUM_V2))
++ if (!jbd2_journal_has_csum_v2or3(j))
+ return 1;
+
+ tail = (struct jbd2_journal_revoke_tail *)(buf + j->j_blocksize -
+diff --git a/fs/jbd2/revoke.c b/fs/jbd2/revoke.c
+index 198c9c10276d..d5e95a175c92 100644
+--- a/fs/jbd2/revoke.c
++++ b/fs/jbd2/revoke.c
+@@ -91,8 +91,8 @@
+ #include <linux/list.h>
+ #include <linux/init.h>
+ #include <linux/bio.h>
+-#endif
+ #include <linux/log2.h>
++#endif
+
+ static struct kmem_cache *jbd2_revoke_record_cache;
+ static struct kmem_cache *jbd2_revoke_table_cache;
+@@ -597,7 +597,7 @@ static void write_one_revoke_record(journal_t *journal,
+ offset = *offsetp;
+
+ /* Do we need to leave space at the end for a checksum? */
+- if (JBD2_HAS_INCOMPAT_FEATURE(journal, JBD2_FEATURE_INCOMPAT_CSUM_V2))
++ if (jbd2_journal_has_csum_v2or3(journal))
+ csum_size = sizeof(struct jbd2_journal_revoke_tail);
+
+ /* Make sure we have a descriptor with space left for the record */
+@@ -644,7 +644,7 @@ static void jbd2_revoke_csum_set(journal_t *j, struct buffer_head *bh)
+ struct jbd2_journal_revoke_tail *tail;
+ __u32 csum;
+
+- if (!JBD2_HAS_INCOMPAT_FEATURE(j, JBD2_FEATURE_INCOMPAT_CSUM_V2))
++ if (!jbd2_journal_has_csum_v2or3(j))
+ return;
+
+ tail = (struct jbd2_journal_revoke_tail *)(bh->b_data + j->j_blocksize -
+diff --git a/fs/nfs/nfs3acl.c b/fs/nfs/nfs3acl.c
+index 8f854dde4150..24c6898159cc 100644
+--- a/fs/nfs/nfs3acl.c
++++ b/fs/nfs/nfs3acl.c
+@@ -129,7 +129,10 @@ static int __nfs3_proc_setacls(struct inode *inode, struct posix_acl *acl,
+ .rpc_argp = &args,
+ .rpc_resp = &fattr,
+ };
+- int status;
++ int status = 0;
++
++ if (acl == NULL && (!S_ISDIR(inode->i_mode) || dfacl == NULL))
++ goto out;
+
+ status = -EOPNOTSUPP;
+ if (!nfs_server_capable(inode, NFS_CAP_ACLS))
+@@ -256,7 +259,7 @@ nfs3_list_one_acl(struct inode *inode, int type, const char *name, void *data,
+ char *p = data + *result;
+
+ acl = get_acl(inode, type);
+- if (!acl)
++ if (IS_ERR_OR_NULL(acl))
+ return 0;
+
+ posix_acl_release(acl);
+diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
+index 4bf3d97cc5a0..dac979866f83 100644
+--- a/fs/nfs/nfs4proc.c
++++ b/fs/nfs/nfs4proc.c
+@@ -2545,6 +2545,7 @@ static void nfs4_close_done(struct rpc_task *task, void *data)
+ struct nfs4_closedata *calldata = data;
+ struct nfs4_state *state = calldata->state;
+ struct nfs_server *server = NFS_SERVER(calldata->inode);
++ nfs4_stateid *res_stateid = NULL;
+
+ dprintk("%s: begin!\n", __func__);
+ if (!nfs4_sequence_done(task, &calldata->res.seq_res))
+@@ -2555,12 +2556,12 @@ static void nfs4_close_done(struct rpc_task *task, void *data)
+ */
+ switch (task->tk_status) {
+ case 0:
+- if (calldata->roc)
++ res_stateid = &calldata->res.stateid;
++ if (calldata->arg.fmode == 0 && calldata->roc)
+ pnfs_roc_set_barrier(state->inode,
+ calldata->roc_barrier);
+- nfs_clear_open_stateid(state, &calldata->res.stateid, 0);
+ renew_lease(server, calldata->timestamp);
+- goto out_release;
++ break;
+ case -NFS4ERR_ADMIN_REVOKED:
+ case -NFS4ERR_STALE_STATEID:
+ case -NFS4ERR_OLD_STATEID:
+@@ -2574,7 +2575,7 @@ static void nfs4_close_done(struct rpc_task *task, void *data)
+ goto out_release;
+ }
+ }
+- nfs_clear_open_stateid(state, NULL, calldata->arg.fmode);
++ nfs_clear_open_stateid(state, res_stateid, calldata->arg.fmode);
+ out_release:
+ nfs_release_seqid(calldata->arg.seqid);
+ nfs_refresh_inode(calldata->inode, calldata->res.fattr);
+@@ -2586,6 +2587,7 @@ static void nfs4_close_prepare(struct rpc_task *task, void *data)
+ struct nfs4_closedata *calldata = data;
+ struct nfs4_state *state = calldata->state;
+ struct inode *inode = calldata->inode;
++ bool is_rdonly, is_wronly, is_rdwr;
+ int call_close = 0;
+
+ dprintk("%s: begin!\n", __func__);
+@@ -2593,18 +2595,24 @@ static void nfs4_close_prepare(struct rpc_task *task, void *data)
+ goto out_wait;
+
+ task->tk_msg.rpc_proc = &nfs4_procedures[NFSPROC4_CLNT_OPEN_DOWNGRADE];
+- calldata->arg.fmode = FMODE_READ|FMODE_WRITE;
+ spin_lock(&state->owner->so_lock);
++ is_rdwr = test_bit(NFS_O_RDWR_STATE, &state->flags);
++ is_rdonly = test_bit(NFS_O_RDONLY_STATE, &state->flags);
++ is_wronly = test_bit(NFS_O_WRONLY_STATE, &state->flags);
++ /* Calculate the current open share mode */
++ calldata->arg.fmode = 0;
++ if (is_rdonly || is_rdwr)
++ calldata->arg.fmode |= FMODE_READ;
++ if (is_wronly || is_rdwr)
++ calldata->arg.fmode |= FMODE_WRITE;
+ /* Calculate the change in open mode */
+ if (state->n_rdwr == 0) {
+ if (state->n_rdonly == 0) {
+- call_close |= test_bit(NFS_O_RDONLY_STATE, &state->flags);
+- call_close |= test_bit(NFS_O_RDWR_STATE, &state->flags);
++ call_close |= is_rdonly || is_rdwr;
+ calldata->arg.fmode &= ~FMODE_READ;
+ }
+ if (state->n_wronly == 0) {
+- call_close |= test_bit(NFS_O_WRONLY_STATE, &state->flags);
+- call_close |= test_bit(NFS_O_RDWR_STATE, &state->flags);
++ call_close |= is_wronly || is_rdwr;
+ calldata->arg.fmode &= ~FMODE_WRITE;
+ }
+ }
+diff --git a/fs/nfs/super.c b/fs/nfs/super.c
+index 084af1060d79..3fd83327bbad 100644
+--- a/fs/nfs/super.c
++++ b/fs/nfs/super.c
+@@ -2180,7 +2180,7 @@ out_no_address:
+ return -EINVAL;
+ }
+
+-#define NFS_MOUNT_CMP_FLAGMASK ~(NFS_MOUNT_INTR \
++#define NFS_REMOUNT_CMP_FLAGMASK ~(NFS_MOUNT_INTR \
+ | NFS_MOUNT_SECURE \
+ | NFS_MOUNT_TCP \
+ | NFS_MOUNT_VER3 \
+@@ -2188,15 +2188,16 @@ out_no_address:
+ | NFS_MOUNT_NONLM \
+ | NFS_MOUNT_BROKEN_SUID \
+ | NFS_MOUNT_STRICTLOCK \
+- | NFS_MOUNT_UNSHARED \
+- | NFS_MOUNT_NORESVPORT \
+ | NFS_MOUNT_LEGACY_INTERFACE)
+
++#define NFS_MOUNT_CMP_FLAGMASK (NFS_REMOUNT_CMP_FLAGMASK & \
++ ~(NFS_MOUNT_UNSHARED | NFS_MOUNT_NORESVPORT))
++
+ static int
+ nfs_compare_remount_data(struct nfs_server *nfss,
+ struct nfs_parsed_mount_data *data)
+ {
+- if ((data->flags ^ nfss->flags) & NFS_MOUNT_CMP_FLAGMASK ||
++ if ((data->flags ^ nfss->flags) & NFS_REMOUNT_CMP_FLAGMASK ||
+ data->rsize != nfss->rsize ||
+ data->wsize != nfss->wsize ||
+ data->version != nfss->nfs_client->rpc_ops->version ||
+diff --git a/fs/nfsd/nfs4callback.c b/fs/nfsd/nfs4callback.c
+index 2c73cae9899d..0f23ad005826 100644
+--- a/fs/nfsd/nfs4callback.c
++++ b/fs/nfsd/nfs4callback.c
+@@ -689,7 +689,8 @@ static int setup_callback_client(struct nfs4_client *clp, struct nfs4_cb_conn *c
+ clp->cl_cb_session = ses;
+ args.bc_xprt = conn->cb_xprt;
+ args.prognumber = clp->cl_cb_session->se_cb_prog;
+- args.protocol = XPRT_TRANSPORT_BC_TCP;
++ args.protocol = conn->cb_xprt->xpt_class->xcl_ident |
++ XPRT_TRANSPORT_BC;
+ args.authflavor = ses->se_cb_sec.flavor;
+ }
+ /* Create RPC client */
+diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
+index 1879e43f2868..2f2edbb2a4a3 100644
+--- a/fs/nfsd/nfssvc.c
++++ b/fs/nfsd/nfssvc.c
+@@ -221,7 +221,8 @@ static int nfsd_startup_generic(int nrservs)
+ */
+ ret = nfsd_racache_init(2*nrservs);
+ if (ret)
+- return ret;
++ goto dec_users;
++
+ ret = nfs4_state_start();
+ if (ret)
+ goto out_racache;
+@@ -229,6 +230,8 @@ static int nfsd_startup_generic(int nrservs)
+
+ out_racache:
+ nfsd_racache_shutdown();
++dec_users:
++ nfsd_users--;
+ return ret;
+ }
+
+diff --git a/include/drm/drm_pciids.h b/include/drm/drm_pciids.h
+index 6dfd64b3a604..e973540cd15b 100644
+--- a/include/drm/drm_pciids.h
++++ b/include/drm/drm_pciids.h
+@@ -17,6 +17,7 @@
+ {0x1002, 0x1315, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_KAVERI|RADEON_NEW_MEMMAP|RADEON_IS_IGP}, \
+ {0x1002, 0x1316, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_KAVERI|RADEON_NEW_MEMMAP|RADEON_IS_IGP}, \
+ {0x1002, 0x1317, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_KAVERI|RADEON_IS_MOBILITY|RADEON_NEW_MEMMAP|RADEON_IS_IGP}, \
++ {0x1002, 0x1318, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_KAVERI|RADEON_IS_MOBILITY|RADEON_NEW_MEMMAP|RADEON_IS_IGP}, \
+ {0x1002, 0x131B, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_KAVERI|RADEON_NEW_MEMMAP|RADEON_IS_IGP}, \
+ {0x1002, 0x131C, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_KAVERI|RADEON_NEW_MEMMAP|RADEON_IS_IGP}, \
+ {0x1002, 0x131D, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_KAVERI|RADEON_NEW_MEMMAP|RADEON_IS_IGP}, \
+@@ -164,8 +165,11 @@
+ {0x1002, 0x6601, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_OLAND|RADEON_IS_MOBILITY|RADEON_NEW_MEMMAP}, \
+ {0x1002, 0x6602, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_OLAND|RADEON_IS_MOBILITY|RADEON_NEW_MEMMAP}, \
+ {0x1002, 0x6603, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_OLAND|RADEON_IS_MOBILITY|RADEON_NEW_MEMMAP}, \
++ {0x1002, 0x6604, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_OLAND|RADEON_IS_MOBILITY|RADEON_NEW_MEMMAP}, \
++ {0x1002, 0x6605, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_OLAND|RADEON_IS_MOBILITY|RADEON_NEW_MEMMAP}, \
+ {0x1002, 0x6606, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_OLAND|RADEON_IS_MOBILITY|RADEON_NEW_MEMMAP}, \
+ {0x1002, 0x6607, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_OLAND|RADEON_IS_MOBILITY|RADEON_NEW_MEMMAP}, \
++ {0x1002, 0x6608, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_OLAND|RADEON_NEW_MEMMAP}, \
+ {0x1002, 0x6610, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_OLAND|RADEON_NEW_MEMMAP}, \
+ {0x1002, 0x6611, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_OLAND|RADEON_NEW_MEMMAP}, \
+ {0x1002, 0x6613, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_OLAND|RADEON_NEW_MEMMAP}, \
+@@ -175,6 +179,8 @@
+ {0x1002, 0x6631, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_OLAND|RADEON_NEW_MEMMAP}, \
+ {0x1002, 0x6640, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_BONAIRE|RADEON_IS_MOBILITY|RADEON_NEW_MEMMAP}, \
+ {0x1002, 0x6641, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_BONAIRE|RADEON_IS_MOBILITY|RADEON_NEW_MEMMAP}, \
++ {0x1002, 0x6646, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_BONAIRE|RADEON_IS_MOBILITY|RADEON_NEW_MEMMAP}, \
++ {0x1002, 0x6647, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_BONAIRE|RADEON_IS_MOBILITY|RADEON_NEW_MEMMAP}, \
+ {0x1002, 0x6649, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_BONAIRE|RADEON_NEW_MEMMAP}, \
+ {0x1002, 0x6650, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_BONAIRE|RADEON_NEW_MEMMAP}, \
+ {0x1002, 0x6651, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_BONAIRE|RADEON_NEW_MEMMAP}, \
+@@ -297,6 +303,7 @@
+ {0x1002, 0x6829, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_VERDE|RADEON_NEW_MEMMAP}, \
+ {0x1002, 0x682A, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_VERDE|RADEON_IS_MOBILITY|RADEON_NEW_MEMMAP}, \
+ {0x1002, 0x682B, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_VERDE|RADEON_IS_MOBILITY|RADEON_NEW_MEMMAP}, \
++ {0x1002, 0x682C, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_VERDE|RADEON_NEW_MEMMAP}, \
+ {0x1002, 0x682D, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_VERDE|RADEON_IS_MOBILITY|RADEON_NEW_MEMMAP}, \
+ {0x1002, 0x682F, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_VERDE|RADEON_IS_MOBILITY|RADEON_NEW_MEMMAP}, \
+ {0x1002, 0x6830, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_VERDE|RADEON_IS_MOBILITY|RADEON_NEW_MEMMAP}, \
+diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
+index d5b50a19463c..0dae71e9971c 100644
+--- a/include/linux/jbd2.h
++++ b/include/linux/jbd2.h
+@@ -159,7 +159,11 @@ typedef struct journal_header_s
+ * journal_block_tag (in the descriptor). The other h_chksum* fields are
+ * not used.
+ *
+- * Checksum v1 and v2 are mutually exclusive features.
++ * If FEATURE_INCOMPAT_CSUM_V3 is set, the descriptor block uses
++ * journal_block_tag3_t to store a full 32-bit checksum. Everything else
++ * is the same as v2.
++ *
++ * Checksum v1, v2, and v3 are mutually exclusive features.
+ */
+ struct commit_header {
+ __be32 h_magic;
+@@ -179,6 +183,14 @@ struct commit_header {
+ * raw struct shouldn't be used for pointer math or sizeof() - use
+ * journal_tag_bytes(journal) instead to compute this.
+ */
++typedef struct journal_block_tag3_s
++{
++ __be32 t_blocknr; /* The on-disk block number */
++ __be32 t_flags; /* See below */
++ __be32 t_blocknr_high; /* most-significant high 32bits. */
++ __be32 t_checksum; /* crc32c(uuid+seq+block) */
++} journal_block_tag3_t;
++
+ typedef struct journal_block_tag_s
+ {
+ __be32 t_blocknr; /* The on-disk block number */
+@@ -187,9 +199,6 @@ typedef struct journal_block_tag_s
+ __be32 t_blocknr_high; /* most-significant high 32bits. */
+ } journal_block_tag_t;
+
+-#define JBD2_TAG_SIZE32 (offsetof(journal_block_tag_t, t_blocknr_high))
+-#define JBD2_TAG_SIZE64 (sizeof(journal_block_tag_t))
+-
+ /* Tail of descriptor block, for checksumming */
+ struct jbd2_journal_block_tail {
+ __be32 t_checksum; /* crc32c(uuid+descr_block) */
+@@ -284,6 +293,7 @@ typedef struct journal_superblock_s
+ #define JBD2_FEATURE_INCOMPAT_64BIT 0x00000002
+ #define JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT 0x00000004
+ #define JBD2_FEATURE_INCOMPAT_CSUM_V2 0x00000008
++#define JBD2_FEATURE_INCOMPAT_CSUM_V3 0x00000010
+
+ /* Features known to this kernel version: */
+ #define JBD2_KNOWN_COMPAT_FEATURES JBD2_FEATURE_COMPAT_CHECKSUM
+@@ -291,7 +301,8 @@ typedef struct journal_superblock_s
+ #define JBD2_KNOWN_INCOMPAT_FEATURES (JBD2_FEATURE_INCOMPAT_REVOKE | \
+ JBD2_FEATURE_INCOMPAT_64BIT | \
+ JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT | \
+- JBD2_FEATURE_INCOMPAT_CSUM_V2)
++ JBD2_FEATURE_INCOMPAT_CSUM_V2 | \
++ JBD2_FEATURE_INCOMPAT_CSUM_V3)
+
+ #ifdef __KERNEL__
+
+@@ -1296,6 +1307,15 @@ static inline int tid_geq(tid_t x, tid_t y)
+ extern int jbd2_journal_blocks_per_page(struct inode *inode);
+ extern size_t journal_tag_bytes(journal_t *journal);
+
++static inline int jbd2_journal_has_csum_v2or3(journal_t *journal)
++{
++ if (JBD2_HAS_INCOMPAT_FEATURE(journal, JBD2_FEATURE_INCOMPAT_CSUM_V2) ||
++ JBD2_HAS_INCOMPAT_FEATURE(journal, JBD2_FEATURE_INCOMPAT_CSUM_V3))
++ return 1;
++
++ return 0;
++}
++
+ /*
+ * We reserve t_outstanding_credits >> JBD2_CONTROL_BLOCKS_SHIFT for
+ * transaction control blocks.
+diff --git a/include/linux/sunrpc/svc_xprt.h b/include/linux/sunrpc/svc_xprt.h
+index 7235040a19b2..5d9d6f84b382 100644
+--- a/include/linux/sunrpc/svc_xprt.h
++++ b/include/linux/sunrpc/svc_xprt.h
+@@ -33,6 +33,7 @@ struct svc_xprt_class {
+ struct svc_xprt_ops *xcl_ops;
+ struct list_head xcl_list;
+ u32 xcl_max_payload;
++ int xcl_ident;
+ };
+
+ /*
+diff --git a/kernel/sched/core.c b/kernel/sched/core.c
+index bc1638b33449..0acf96b790c5 100644
+--- a/kernel/sched/core.c
++++ b/kernel/sched/core.c
+@@ -3558,9 +3558,10 @@ static int _sched_setscheduler(struct task_struct *p, int policy,
+ };
+
+ /*
+- * Fixup the legacy SCHED_RESET_ON_FORK hack
++ * Fixup the legacy SCHED_RESET_ON_FORK hack, except if
++ * the policy=-1 was passed by sched_setparam().
+ */
+- if (policy & SCHED_RESET_ON_FORK) {
++ if ((policy != -1) && (policy & SCHED_RESET_ON_FORK)) {
+ attr.sched_flags |= SCHED_FLAG_RESET_ON_FORK;
+ policy &= ~SCHED_RESET_ON_FORK;
+ attr.sched_policy = policy;
+diff --git a/mm/memory.c b/mm/memory.c
+index 8b44f765b645..0a21f3d162ae 100644
+--- a/mm/memory.c
++++ b/mm/memory.c
+@@ -751,7 +751,7 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
+ unsigned long pfn = pte_pfn(pte);
+
+ if (HAVE_PTE_SPECIAL) {
+- if (likely(!pte_special(pte) || pte_numa(pte)))
++ if (likely(!pte_special(pte)))
+ goto check_pfn;
+ if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
+ return NULL;
+@@ -777,15 +777,14 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
+ }
+ }
+
++ if (is_zero_pfn(pfn))
++ return NULL;
+ check_pfn:
+ if (unlikely(pfn > highest_memmap_pfn)) {
+ print_bad_pte(vma, addr, pte, NULL);
+ return NULL;
+ }
+
+- if (is_zero_pfn(pfn))
+- return NULL;
+-
+ /*
+ * NOTE! We still have PageReserved() pages in the page tables.
+ * eg. VDSO mappings can cause them to exist.
+diff --git a/mm/util.c b/mm/util.c
+index d5ea733c5082..33e9f4455800 100644
+--- a/mm/util.c
++++ b/mm/util.c
+@@ -277,17 +277,14 @@ pid_t vm_is_stack(struct task_struct *task,
+
+ if (in_group) {
+ struct task_struct *t;
+- rcu_read_lock();
+- if (!pid_alive(task))
+- goto done;
+
+- t = task;
+- do {
++ rcu_read_lock();
++ for_each_thread(task, t) {
+ if (vm_is_stack_for_task(t, vma)) {
+ ret = t->pid;
+ goto done;
+ }
+- } while_each_thread(task, t);
++ }
+ done:
+ rcu_read_unlock();
+ }
+diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
+index b507cd327d9b..b2437ee93657 100644
+--- a/net/sunrpc/svcsock.c
++++ b/net/sunrpc/svcsock.c
+@@ -692,6 +692,7 @@ static struct svc_xprt_class svc_udp_class = {
+ .xcl_owner = THIS_MODULE,
+ .xcl_ops = &svc_udp_ops,
+ .xcl_max_payload = RPCSVC_MAXPAYLOAD_UDP,
++ .xcl_ident = XPRT_TRANSPORT_UDP,
+ };
+
+ static void svc_udp_init(struct svc_sock *svsk, struct svc_serv *serv)
+@@ -1292,6 +1293,7 @@ static struct svc_xprt_class svc_tcp_class = {
+ .xcl_owner = THIS_MODULE,
+ .xcl_ops = &svc_tcp_ops,
+ .xcl_max_payload = RPCSVC_MAXPAYLOAD_TCP,
++ .xcl_ident = XPRT_TRANSPORT_TCP,
+ };
+
+ void svc_init_xprt_sock(void)
+diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
+index c3b2b3369e52..51c63165073c 100644
+--- a/net/sunrpc/xprt.c
++++ b/net/sunrpc/xprt.c
+@@ -1306,7 +1306,7 @@ struct rpc_xprt *xprt_create_transport(struct xprt_create *args)
+ }
+ }
+ spin_unlock(&xprt_list_lock);
+- printk(KERN_ERR "RPC: transport (%d) not supported\n", args->ident);
++ dprintk("RPC: transport (%d) not supported\n", args->ident);
+ return ERR_PTR(-EIO);
+
+ found:
+diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
+index e7323fbbd348..06a5d9235107 100644
+--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
++++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
+@@ -92,6 +92,7 @@ struct svc_xprt_class svc_rdma_class = {
+ .xcl_owner = THIS_MODULE,
+ .xcl_ops = &svc_rdma_ops,
+ .xcl_max_payload = RPCSVC_MAXPAYLOAD_TCP,
++ .xcl_ident = XPRT_TRANSPORT_RDMA,
+ };
+
+ struct svc_rdma_op_ctxt *svc_rdma_get_context(struct svcxprt_rdma *xprt)
+diff --git a/sound/pci/Kconfig b/sound/pci/Kconfig
+index 3a3a3a71088b..50dd0086cfb1 100644
+--- a/sound/pci/Kconfig
++++ b/sound/pci/Kconfig
+@@ -858,8 +858,8 @@ config SND_VIRTUOSO
+ select SND_JACK if INPUT=y || INPUT=SND
+ help
+ Say Y here to include support for sound cards based on the
+- Asus AV66/AV100/AV200 chips, i.e., Xonar D1, DX, D2, D2X, DS,
+- Essence ST (Deluxe), and Essence STX.
++ Asus AV66/AV100/AV200 chips, i.e., Xonar D1, DX, D2, D2X, DS, DSX,
++ Essence ST (Deluxe), and Essence STX (II).
+ Support for the HDAV1.3 (Deluxe) and HDAV1.3 Slim is experimental;
+ for the Xense, missing.
+
+diff --git a/sound/pci/hda/patch_ca0132.c b/sound/pci/hda/patch_ca0132.c
+index 092f2bd030bd..b686aca7f000 100644
+--- a/sound/pci/hda/patch_ca0132.c
++++ b/sound/pci/hda/patch_ca0132.c
+@@ -4376,6 +4376,9 @@ static void ca0132_download_dsp(struct hda_codec *codec)
+ return; /* NOP */
+ #endif
+
++ if (spec->dsp_state == DSP_DOWNLOAD_FAILED)
++ return; /* don't retry failures */
++
+ chipio_enable_clocks(codec);
+ spec->dsp_state = DSP_DOWNLOADING;
+ if (!ca0132_download_dsp_images(codec))
+@@ -4552,7 +4555,8 @@ static int ca0132_init(struct hda_codec *codec)
+ struct auto_pin_cfg *cfg = &spec->autocfg;
+ int i;
+
+- spec->dsp_state = DSP_DOWNLOAD_INIT;
++ if (spec->dsp_state != DSP_DOWNLOAD_FAILED)
++ spec->dsp_state = DSP_DOWNLOAD_INIT;
+ spec->curr_chip_addx = INVALID_CHIP_ADDRESS;
+
+ snd_hda_power_up(codec);
+@@ -4663,6 +4667,7 @@ static int patch_ca0132(struct hda_codec *codec)
+ codec->spec = spec;
+ spec->codec = codec;
+
++ spec->dsp_state = DSP_DOWNLOAD_INIT;
+ spec->num_mixers = 1;
+ spec->mixers[0] = ca0132_mixer;
+
+diff --git a/sound/pci/hda/patch_realtek.c b/sound/pci/hda/patch_realtek.c
+index b60824e90408..25728aaacc26 100644
+--- a/sound/pci/hda/patch_realtek.c
++++ b/sound/pci/hda/patch_realtek.c
+@@ -180,6 +180,8 @@ static void alc_fix_pll(struct hda_codec *codec)
+ spec->pll_coef_idx);
+ val = snd_hda_codec_read(codec, spec->pll_nid, 0,
+ AC_VERB_GET_PROC_COEF, 0);
++ if (val == -1)
++ return;
+ snd_hda_codec_write(codec, spec->pll_nid, 0, AC_VERB_SET_COEF_INDEX,
+ spec->pll_coef_idx);
+ snd_hda_codec_write(codec, spec->pll_nid, 0, AC_VERB_SET_PROC_COEF,
+@@ -2784,6 +2786,8 @@ static int alc269_parse_auto_config(struct hda_codec *codec)
+ static void alc269vb_toggle_power_output(struct hda_codec *codec, int power_up)
+ {
+ int val = alc_read_coef_idx(codec, 0x04);
++ if (val == -1)
++ return;
+ if (power_up)
+ val |= 1 << 11;
+ else
+@@ -3242,6 +3246,15 @@ static int alc269_resume(struct hda_codec *codec)
+ snd_hda_codec_resume_cache(codec);
+ alc_inv_dmic_sync(codec, true);
+ hda_call_check_power_status(codec, 0x01);
++
++ /* on some machine, the BIOS will clear the codec gpio data when enter
++ * suspend, and won't restore the data after resume, so we restore it
++ * in the driver.
++ */
++ if (spec->gpio_led)
++ snd_hda_codec_write(codec, codec->afg, 0, AC_VERB_SET_GPIO_DATA,
++ spec->gpio_led);
++
+ if (spec->has_alc5505_dsp)
+ alc5505_dsp_resume(codec);
+
+@@ -4782,6 +4795,8 @@ static const struct snd_pci_quirk alc269_fixup_tbl[] = {
+ SND_PCI_QUIRK(0x103c, 0x1983, "HP Pavilion", ALC269_FIXUP_HP_MUTE_LED_MIC1),
+ SND_PCI_QUIRK(0x103c, 0x218b, "HP", ALC269_FIXUP_LIMIT_INT_MIC_BOOST_MUTE_LED),
+ /* ALC282 */
++ SND_PCI_QUIRK(0x103c, 0x2191, "HP Touchsmart 14", ALC269_FIXUP_HP_MUTE_LED_MIC1),
++ SND_PCI_QUIRK(0x103c, 0x2192, "HP Touchsmart 15", ALC269_FIXUP_HP_MUTE_LED_MIC1),
+ SND_PCI_QUIRK(0x103c, 0x220d, "HP", ALC269_FIXUP_HP_MUTE_LED_MIC1),
+ SND_PCI_QUIRK(0x103c, 0x220e, "HP", ALC269_FIXUP_HP_MUTE_LED_MIC1),
+ SND_PCI_QUIRK(0x103c, 0x220f, "HP", ALC269_FIXUP_HP_MUTE_LED_MIC1),
+@@ -5122,27 +5137,30 @@ static void alc269_fill_coef(struct hda_codec *codec)
+ if ((alc_get_coef0(codec) & 0x00ff) == 0x017) {
+ val = alc_read_coef_idx(codec, 0x04);
+ /* Power up output pin */
+- alc_write_coef_idx(codec, 0x04, val | (1<<11));
++ if (val != -1)
++ alc_write_coef_idx(codec, 0x04, val | (1<<11));
+ }
+
+ if ((alc_get_coef0(codec) & 0x00ff) == 0x018) {
+ val = alc_read_coef_idx(codec, 0xd);
+- if ((val & 0x0c00) >> 10 != 0x1) {
++ if (val != -1 && (val & 0x0c00) >> 10 != 0x1) {
+ /* Capless ramp up clock control */
+ alc_write_coef_idx(codec, 0xd, val | (1<<10));
+ }
+ val = alc_read_coef_idx(codec, 0x17);
+- if ((val & 0x01c0) >> 6 != 0x4) {
++ if (val != -1 && (val & 0x01c0) >> 6 != 0x4) {
+ /* Class D power on reset */
+ alc_write_coef_idx(codec, 0x17, val | (1<<7));
+ }
+ }
+
+ val = alc_read_coef_idx(codec, 0xd); /* Class D */
+- alc_write_coef_idx(codec, 0xd, val | (1<<14));
++ if (val != -1)
++ alc_write_coef_idx(codec, 0xd, val | (1<<14));
+
+ val = alc_read_coef_idx(codec, 0x4); /* HP */
+- alc_write_coef_idx(codec, 0x4, val | (1<<11));
++ if (val != -1)
++ alc_write_coef_idx(codec, 0x4, val | (1<<11));
+ }
+
+ /*
+diff --git a/sound/pci/hda/patch_sigmatel.c b/sound/pci/hda/patch_sigmatel.c
+index 3744ea4e843d..4d3a3b932690 100644
+--- a/sound/pci/hda/patch_sigmatel.c
++++ b/sound/pci/hda/patch_sigmatel.c
+@@ -84,6 +84,7 @@ enum {
+ STAC_DELL_EQ,
+ STAC_ALIENWARE_M17X,
+ STAC_92HD89XX_HP_FRONT_JACK,
++ STAC_92HD89XX_HP_Z1_G2_RIGHT_MIC_JACK,
+ STAC_92HD73XX_MODELS
+ };
+
+@@ -1809,6 +1810,11 @@ static const struct hda_pintbl stac92hd89xx_hp_front_jack_pin_configs[] = {
+ {}
+ };
+
++static const struct hda_pintbl stac92hd89xx_hp_z1_g2_right_mic_jack_pin_configs[] = {
++ { 0x0e, 0x400000f0 },
++ {}
++};
++
+ static void stac92hd73xx_fixup_ref(struct hda_codec *codec,
+ const struct hda_fixup *fix, int action)
+ {
+@@ -1931,6 +1937,10 @@ static const struct hda_fixup stac92hd73xx_fixups[] = {
+ [STAC_92HD89XX_HP_FRONT_JACK] = {
+ .type = HDA_FIXUP_PINS,
+ .v.pins = stac92hd89xx_hp_front_jack_pin_configs,
++ },
++ [STAC_92HD89XX_HP_Z1_G2_RIGHT_MIC_JACK] = {
++ .type = HDA_FIXUP_PINS,
++ .v.pins = stac92hd89xx_hp_z1_g2_right_mic_jack_pin_configs,
+ }
+ };
+
+@@ -1991,6 +2001,8 @@ static const struct snd_pci_quirk stac92hd73xx_fixup_tbl[] = {
+ "Alienware M17x", STAC_ALIENWARE_M17X),
+ SND_PCI_QUIRK(PCI_VENDOR_ID_DELL, 0x0490,
+ "Alienware M17x R3", STAC_DELL_EQ),
++ SND_PCI_QUIRK(PCI_VENDOR_ID_HP, 0x1927,
++ "HP Z1 G2", STAC_92HD89XX_HP_Z1_G2_RIGHT_MIC_JACK),
+ SND_PCI_QUIRK(PCI_VENDOR_ID_HP, 0x2b17,
+ "unknown HP", STAC_92HD89XX_HP_FRONT_JACK),
+ {} /* terminator */
+diff --git a/sound/pci/oxygen/virtuoso.c b/sound/pci/oxygen/virtuoso.c
+index 64b9fda5f04a..dbbbacfd535e 100644
+--- a/sound/pci/oxygen/virtuoso.c
++++ b/sound/pci/oxygen/virtuoso.c
+@@ -53,6 +53,7 @@ static DEFINE_PCI_DEVICE_TABLE(xonar_ids) = {
+ { OXYGEN_PCI_SUBID(0x1043, 0x835e) },
+ { OXYGEN_PCI_SUBID(0x1043, 0x838e) },
+ { OXYGEN_PCI_SUBID(0x1043, 0x8522) },
++ { OXYGEN_PCI_SUBID(0x1043, 0x85f4) },
+ { OXYGEN_PCI_SUBID_BROKEN_EEPROM },
+ { }
+ };
+diff --git a/sound/pci/oxygen/xonar_pcm179x.c b/sound/pci/oxygen/xonar_pcm179x.c
+index c8c7f2c9b355..e02605931669 100644
+--- a/sound/pci/oxygen/xonar_pcm179x.c
++++ b/sound/pci/oxygen/xonar_pcm179x.c
+@@ -100,8 +100,8 @@
+ */
+
+ /*
+- * Xonar Essence ST (Deluxe)/STX
+- * -----------------------------
++ * Xonar Essence ST (Deluxe)/STX (II)
++ * ----------------------------------
+ *
+ * CMI8788:
+ *
+@@ -1138,6 +1138,14 @@ int get_xonar_pcm179x_model(struct oxygen *chip,
+ chip->model.resume = xonar_stx_resume;
+ chip->model.set_dac_params = set_pcm1796_params;
+ break;
++ case 0x85f4:
++ chip->model = model_xonar_st;
++ /* TODO: daughterboard support */
++ chip->model.shortname = "Xonar STX II";
++ chip->model.init = xonar_stx_init;
++ chip->model.resume = xonar_stx_resume;
++ chip->model.set_dac_params = set_pcm1796_params;
++ break;
+ default:
+ return -EINVAL;
+ }
+diff --git a/sound/usb/quirks-table.h b/sound/usb/quirks-table.h
+index f652b10ce905..223c47b33ba3 100644
+--- a/sound/usb/quirks-table.h
++++ b/sound/usb/quirks-table.h
+@@ -1581,6 +1581,35 @@ YAMAHA_DEVICE(0x7010, "UB99"),
+ }
+ },
+ {
++ /* BOSS ME-25 */
++ USB_DEVICE(0x0582, 0x0113),
++ .driver_info = (unsigned long) & (const struct snd_usb_audio_quirk) {
++ .ifnum = QUIRK_ANY_INTERFACE,
++ .type = QUIRK_COMPOSITE,
++ .data = (const struct snd_usb_audio_quirk[]) {
++ {
++ .ifnum = 0,
++ .type = QUIRK_AUDIO_STANDARD_INTERFACE
++ },
++ {
++ .ifnum = 1,
++ .type = QUIRK_AUDIO_STANDARD_INTERFACE
++ },
++ {
++ .ifnum = 2,
++ .type = QUIRK_MIDI_FIXED_ENDPOINT,
++ .data = & (const struct snd_usb_midi_endpoint_info) {
++ .out_cables = 0x0001,
++ .in_cables = 0x0001
++ }
++ },
++ {
++ .ifnum = -1
++ }
++ }
++ }
++},
++{
+ /* only 44.1 kHz works at the moment */
+ USB_DEVICE(0x0582, 0x0120),
+ .driver_info = (unsigned long) & (const struct snd_usb_audio_quirk) {
+diff --git a/sound/usb/quirks.c b/sound/usb/quirks.c
+index 7c57f2268dd7..19a921eb75f1 100644
+--- a/sound/usb/quirks.c
++++ b/sound/usb/quirks.c
+@@ -670,7 +670,7 @@ static int snd_usb_gamecon780_boot_quirk(struct usb_device *dev)
+ /* set the initial volume and don't change; other values are either
+ * too loud or silent due to firmware bug (bko#65251)
+ */
+- u8 buf[2] = { 0x74, 0xdc };
++ u8 buf[2] = { 0x74, 0xe3 };
+ return snd_usb_ctl_msg(dev, usb_sndctrlpipe(dev, 0), UAC_SET_CUR,
+ USB_RECIP_INTERFACE | USB_TYPE_CLASS | USB_DIR_OUT,
+ UAC_FU_VOLUME << 8, 9 << 8, buf, 2);
+diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c
+index 2458a1dc2ba9..e8ce34c9db32 100644
+--- a/virt/kvm/ioapic.c
++++ b/virt/kvm/ioapic.c
+@@ -254,10 +254,9 @@ void kvm_ioapic_scan_entry(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap,
+ spin_lock(&ioapic->lock);
+ for (index = 0; index < IOAPIC_NUM_PINS; index++) {
+ e = &ioapic->redirtbl[index];
+- if (!e->fields.mask &&
+- (e->fields.trig_mode == IOAPIC_LEVEL_TRIG ||
+- kvm_irq_has_notifier(ioapic->kvm, KVM_IRQCHIP_IOAPIC,
+- index) || index == RTC_GSI)) {
++ if (e->fields.trig_mode == IOAPIC_LEVEL_TRIG ||
++ kvm_irq_has_notifier(ioapic->kvm, KVM_IRQCHIP_IOAPIC, index) ||
++ index == RTC_GSI) {
+ if (kvm_apic_match_dest(vcpu, NULL, 0,
+ e->fields.dest_id, e->fields.dest_mode)) {
+ __set_bit(e->fields.vector,
+diff --git a/virt/kvm/iommu.c b/virt/kvm/iommu.c
+index 0df7d4b34dfe..714b94932312 100644
+--- a/virt/kvm/iommu.c
++++ b/virt/kvm/iommu.c
+@@ -61,6 +61,14 @@ static pfn_t kvm_pin_pages(struct kvm_memory_slot *slot, gfn_t gfn,
+ return pfn;
+ }
+
++static void kvm_unpin_pages(struct kvm *kvm, pfn_t pfn, unsigned long npages)
++{
++ unsigned long i;
++
++ for (i = 0; i < npages; ++i)
++ kvm_release_pfn_clean(pfn + i);
++}
++
+ int kvm_iommu_map_pages(struct kvm *kvm, struct kvm_memory_slot *slot)
+ {
+ gfn_t gfn, end_gfn;
+@@ -123,6 +131,7 @@ int kvm_iommu_map_pages(struct kvm *kvm, struct kvm_memory_slot *slot)
+ if (r) {
+ printk(KERN_ERR "kvm_iommu_map_address:"
+ "iommu failed to map pfn=%llx\n", pfn);
++ kvm_unpin_pages(kvm, pfn, page_size);
+ goto unmap_pages;
+ }
+
+@@ -134,7 +143,7 @@ int kvm_iommu_map_pages(struct kvm *kvm, struct kvm_memory_slot *slot)
+ return 0;
+
+ unmap_pages:
+- kvm_iommu_put_pages(kvm, slot->base_gfn, gfn);
++ kvm_iommu_put_pages(kvm, slot->base_gfn, gfn - slot->base_gfn);
+ return r;
+ }
+
+@@ -266,14 +275,6 @@ out_unlock:
+ return r;
+ }
+
+-static void kvm_unpin_pages(struct kvm *kvm, pfn_t pfn, unsigned long npages)
+-{
+- unsigned long i;
+-
+- for (i = 0; i < npages; ++i)
+- kvm_release_pfn_clean(pfn + i);
+-}
+-
+ static void kvm_iommu_put_pages(struct kvm *kvm,
+ gfn_t base_gfn, unsigned long npages)
+ {
* [gentoo-commits] proj/linux-patches:3.16 commit in: /
@ 2014-09-17 22:19 Anthony G. Basile
0 siblings, 0 replies; 26+ messages in thread
From: Anthony G. Basile @ 2014-09-17 22:19 UTC (permalink / raw
To: gentoo-commits
commit: e086bd08b11f58e9c3bedb30b3d52f2ca6fdcf7d
Author: Anthony G. Basile <blueness <AT> gentoo <DOT> org>
AuthorDate: Wed Sep 17 22:22:02 2014 +0000
Commit: Anthony G. Basile <blueness <AT> gentoo <DOT> org>
CommitDate: Wed Sep 17 22:22:02 2014 +0000
URL: http://sources.gentoo.org/gitweb/?p=proj/linux-patches.git;a=commit;h=e086bd08
Linux patch 3.16.3
---
0000_README | 4 +
1002_linux-3.16.3.patch | 7142 +++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 7146 insertions(+)
diff --git a/0000_README b/0000_README
index 1ecfc95..706e53e 100644
--- a/0000_README
+++ b/0000_README
@@ -50,6 +50,10 @@ Patch: 1001_linux-3.16.2.patch
From: http://www.kernel.org
Desc: Linux 3.16.2
+Patch: 1002_linux-3.16.3.patch
+From: http://www.kernel.org
+Desc: Linux 3.16.3
+
Patch: 2400_kcopy-patch-for-infiniband-driver.patch
From: Alexey Shvetsov <alexxy@gentoo.org>
Desc: Zero copy for infiniband psm userspace driver
diff --git a/1002_linux-3.16.3.patch b/1002_linux-3.16.3.patch
new file mode 100644
index 0000000..987f475
--- /dev/null
+++ b/1002_linux-3.16.3.patch
@@ -0,0 +1,7142 @@
+diff --git a/Documentation/devicetree/bindings/sound/adi,axi-spdif-tx.txt b/Documentation/devicetree/bindings/sound/adi,axi-spdif-tx.txt
+index 46f344965313..4eb7997674a0 100644
+--- a/Documentation/devicetree/bindings/sound/adi,axi-spdif-tx.txt
++++ b/Documentation/devicetree/bindings/sound/adi,axi-spdif-tx.txt
+@@ -1,7 +1,7 @@
+ ADI AXI-SPDIF controller
+
+ Required properties:
+- - compatible : Must be "adi,axi-spdif-1.00.a"
++ - compatible : Must be "adi,axi-spdif-tx-1.00.a"
+ - reg : Must contain SPDIF core's registers location and length
+ - clocks : Pairs of phandle and specifier referencing the controller's clocks.
+ The controller expects two clocks, the clock used for the AXI interface and
+diff --git a/Makefile b/Makefile
+index c2617526e605..9b25a830a9d7 100644
+--- a/Makefile
++++ b/Makefile
+@@ -1,6 +1,6 @@
+ VERSION = 3
+ PATCHLEVEL = 16
+-SUBLEVEL = 2
++SUBLEVEL = 3
+ EXTRAVERSION =
+ NAME = Museum of Fishiegoodies
+
+diff --git a/arch/arm/boot/dts/omap3-n900.dts b/arch/arm/boot/dts/omap3-n900.dts
+index b15f1a77d684..1fe45d1f75ec 100644
+--- a/arch/arm/boot/dts/omap3-n900.dts
++++ b/arch/arm/boot/dts/omap3-n900.dts
+@@ -353,7 +353,7 @@
+ };
+
+ twl_power: power {
+- compatible = "ti,twl4030-power-n900";
++ compatible = "ti,twl4030-power-n900", "ti,twl4030-power-idle-osc-off";
+ ti,use_poweroff;
+ };
+ };
+diff --git a/arch/mips/cavium-octeon/setup.c b/arch/mips/cavium-octeon/setup.c
+index 008e9c8b8eac..c9d9c627e244 100644
+--- a/arch/mips/cavium-octeon/setup.c
++++ b/arch/mips/cavium-octeon/setup.c
+@@ -458,6 +458,18 @@ static void octeon_halt(void)
+ octeon_kill_core(NULL);
+ }
+
++static char __read_mostly octeon_system_type[80];
++
++static int __init init_octeon_system_type(void)
++{
++ snprintf(octeon_system_type, sizeof(octeon_system_type), "%s (%s)",
++ cvmx_board_type_to_string(octeon_bootinfo->board_type),
++ octeon_model_get_string(read_c0_prid()));
++
++ return 0;
++}
++early_initcall(init_octeon_system_type);
++
+ /**
+ * Return a string representing the system type
+ *
+@@ -465,11 +477,7 @@ static void octeon_halt(void)
+ */
+ const char *octeon_board_type_string(void)
+ {
+- static char name[80];
+- sprintf(name, "%s (%s)",
+- cvmx_board_type_to_string(octeon_bootinfo->board_type),
+- octeon_model_get_string(read_c0_prid()));
+- return name;
++ return octeon_system_type;
+ }
+
+ const char *get_system_type(void)
+diff --git a/arch/mips/include/asm/eva.h b/arch/mips/include/asm/eva.h
+new file mode 100644
+index 000000000000..a3d1807f227c
+--- /dev/null
++++ b/arch/mips/include/asm/eva.h
+@@ -0,0 +1,43 @@
++/*
++ * This file is subject to the terms and conditions of the GNU General Public
++ * License. See the file "COPYING" in the main directory of this archive
++ * for more details.
++ *
++ * Copyright (C) 2014, Imagination Technologies Ltd.
++ *
++ * EVA functions for generic code
++ */
++
++#ifndef _ASM_EVA_H
++#define _ASM_EVA_H
++
++#include <kernel-entry-init.h>
++
++#ifdef __ASSEMBLY__
++
++#ifdef CONFIG_EVA
++
++/*
++ * EVA early init code
++ *
++ * Platforms must define their own 'platform_eva_init' macro in
++ * their kernel-entry-init.h header. This macro usually does the
++ * platform specific configuration of the segmentation registers,
++ * and it is normally called from assembly code.
++ *
++ */
++
++.macro eva_init
++platform_eva_init
++.endm
++
++#else
++
++.macro eva_init
++.endm
++
++#endif /* CONFIG_EVA */
++
++#endif /* __ASSEMBLY__ */
++
++#endif
+diff --git a/arch/mips/include/asm/mach-malta/kernel-entry-init.h b/arch/mips/include/asm/mach-malta/kernel-entry-init.h
+index 77eeda77e73c..0cf8622db27f 100644
+--- a/arch/mips/include/asm/mach-malta/kernel-entry-init.h
++++ b/arch/mips/include/asm/mach-malta/kernel-entry-init.h
+@@ -10,14 +10,15 @@
+ #ifndef __ASM_MACH_MIPS_KERNEL_ENTRY_INIT_H
+ #define __ASM_MACH_MIPS_KERNEL_ENTRY_INIT_H
+
++#include <asm/regdef.h>
++#include <asm/mipsregs.h>
++
+ /*
+ * Prepare segments for EVA boot:
+ *
+ * This is in case the processor boots in legacy configuration
+ * (SI_EVAReset is de-asserted and CONFIG5.K == 0)
+ *
+- * On entry, t1 is loaded with CP0_CONFIG
+- *
+ * ========================= Mappings =============================
+ * Virtual memory Physical memory Mapping
+ * 0x00000000 - 0x7fffffff 0x80000000 - 0xfffffffff MUSUK (kuseg)
+@@ -30,12 +31,20 @@
+ *
+ *
+ * Lowmem is expanded to 2GB
++ *
++ * The following code uses the t0, t1, t2 and ra registers without
++ * previously preserving them.
++ *
+ */
+- .macro eva_entry
++ .macro platform_eva_init
++
++ .set push
++ .set reorder
+ /*
+ * Get Config.K0 value and use it to program
+ * the segmentation registers
+ */
++ mfc0 t1, CP0_CONFIG
+ andi t1, 0x7 /* CCA */
+ move t2, t1
+ ins t2, t1, 16, 3
+@@ -77,6 +86,8 @@
+ mtc0 t0, $16, 5
+ sync
+ jal mips_ihb
++
++ .set pop
+ .endm
+
+ .macro kernel_entry_setup
+@@ -95,7 +106,7 @@
+ sll t0, t0, 6 /* SC bit */
+ bgez t0, 9f
+
+- eva_entry
++ platform_eva_init
+ b 0f
+ 9:
+ /* Assume we came from YAMON... */
+@@ -127,8 +138,7 @@ nonsc_processor:
+ #ifdef CONFIG_EVA
+ sync
+ ehb
+- mfc0 t1, CP0_CONFIG
+- eva_entry
++ platform_eva_init
+ #endif
+ .endm
+
+diff --git a/arch/mips/include/asm/ptrace.h b/arch/mips/include/asm/ptrace.h
+index 7e6e682aece3..c301fa9b139f 100644
+--- a/arch/mips/include/asm/ptrace.h
++++ b/arch/mips/include/asm/ptrace.h
+@@ -23,7 +23,7 @@
+ struct pt_regs {
+ #ifdef CONFIG_32BIT
+ /* Pad bytes for argument save space on the stack. */
+- unsigned long pad0[6];
++ unsigned long pad0[8];
+ #endif
+
+ /* Saved main processor registers. */
+diff --git a/arch/mips/include/asm/reg.h b/arch/mips/include/asm/reg.h
+index 910e71a12466..b8343ccbc989 100644
+--- a/arch/mips/include/asm/reg.h
++++ b/arch/mips/include/asm/reg.h
+@@ -12,116 +12,194 @@
+ #ifndef __ASM_MIPS_REG_H
+ #define __ASM_MIPS_REG_H
+
+-
+-#if defined(CONFIG_32BIT) || defined(WANT_COMPAT_REG_H)
+-
+-#define EF_R0 6
+-#define EF_R1 7
+-#define EF_R2 8
+-#define EF_R3 9
+-#define EF_R4 10
+-#define EF_R5 11
+-#define EF_R6 12
+-#define EF_R7 13
+-#define EF_R8 14
+-#define EF_R9 15
+-#define EF_R10 16
+-#define EF_R11 17
+-#define EF_R12 18
+-#define EF_R13 19
+-#define EF_R14 20
+-#define EF_R15 21
+-#define EF_R16 22
+-#define EF_R17 23
+-#define EF_R18 24
+-#define EF_R19 25
+-#define EF_R20 26
+-#define EF_R21 27
+-#define EF_R22 28
+-#define EF_R23 29
+-#define EF_R24 30
+-#define EF_R25 31
++#define MIPS32_EF_R0 6
++#define MIPS32_EF_R1 7
++#define MIPS32_EF_R2 8
++#define MIPS32_EF_R3 9
++#define MIPS32_EF_R4 10
++#define MIPS32_EF_R5 11
++#define MIPS32_EF_R6 12
++#define MIPS32_EF_R7 13
++#define MIPS32_EF_R8 14
++#define MIPS32_EF_R9 15
++#define MIPS32_EF_R10 16
++#define MIPS32_EF_R11 17
++#define MIPS32_EF_R12 18
++#define MIPS32_EF_R13 19
++#define MIPS32_EF_R14 20
++#define MIPS32_EF_R15 21
++#define MIPS32_EF_R16 22
++#define MIPS32_EF_R17 23
++#define MIPS32_EF_R18 24
++#define MIPS32_EF_R19 25
++#define MIPS32_EF_R20 26
++#define MIPS32_EF_R21 27
++#define MIPS32_EF_R22 28
++#define MIPS32_EF_R23 29
++#define MIPS32_EF_R24 30
++#define MIPS32_EF_R25 31
+
+ /*
+ * k0/k1 unsaved
+ */
+-#define EF_R26 32
+-#define EF_R27 33
++#define MIPS32_EF_R26 32
++#define MIPS32_EF_R27 33
+
+-#define EF_R28 34
+-#define EF_R29 35
+-#define EF_R30 36
+-#define EF_R31 37
++#define MIPS32_EF_R28 34
++#define MIPS32_EF_R29 35
++#define MIPS32_EF_R30 36
++#define MIPS32_EF_R31 37
+
+ /*
+ * Saved special registers
+ */
+-#define EF_LO 38
+-#define EF_HI 39
+-
+-#define EF_CP0_EPC 40
+-#define EF_CP0_BADVADDR 41
+-#define EF_CP0_STATUS 42
+-#define EF_CP0_CAUSE 43
+-#define EF_UNUSED0 44
+-
+-#define EF_SIZE 180
+-
+-#endif
+-
+-#if defined(CONFIG_64BIT) && !defined(WANT_COMPAT_REG_H)
+-
+-#define EF_R0 0
+-#define EF_R1 1
+-#define EF_R2 2
+-#define EF_R3 3
+-#define EF_R4 4
+-#define EF_R5 5
+-#define EF_R6 6
+-#define EF_R7 7
+-#define EF_R8 8
+-#define EF_R9 9
+-#define EF_R10 10
+-#define EF_R11 11
+-#define EF_R12 12
+-#define EF_R13 13
+-#define EF_R14 14
+-#define EF_R15 15
+-#define EF_R16 16
+-#define EF_R17 17
+-#define EF_R18 18
+-#define EF_R19 19
+-#define EF_R20 20
+-#define EF_R21 21
+-#define EF_R22 22
+-#define EF_R23 23
+-#define EF_R24 24
+-#define EF_R25 25
++#define MIPS32_EF_LO 38
++#define MIPS32_EF_HI 39
++
++#define MIPS32_EF_CP0_EPC 40
++#define MIPS32_EF_CP0_BADVADDR 41
++#define MIPS32_EF_CP0_STATUS 42
++#define MIPS32_EF_CP0_CAUSE 43
++#define MIPS32_EF_UNUSED0 44
++
++#define MIPS32_EF_SIZE 180
++
++#define MIPS64_EF_R0 0
++#define MIPS64_EF_R1 1
++#define MIPS64_EF_R2 2
++#define MIPS64_EF_R3 3
++#define MIPS64_EF_R4 4
++#define MIPS64_EF_R5 5
++#define MIPS64_EF_R6 6
++#define MIPS64_EF_R7 7
++#define MIPS64_EF_R8 8
++#define MIPS64_EF_R9 9
++#define MIPS64_EF_R10 10
++#define MIPS64_EF_R11 11
++#define MIPS64_EF_R12 12
++#define MIPS64_EF_R13 13
++#define MIPS64_EF_R14 14
++#define MIPS64_EF_R15 15
++#define MIPS64_EF_R16 16
++#define MIPS64_EF_R17 17
++#define MIPS64_EF_R18 18
++#define MIPS64_EF_R19 19
++#define MIPS64_EF_R20 20
++#define MIPS64_EF_R21 21
++#define MIPS64_EF_R22 22
++#define MIPS64_EF_R23 23
++#define MIPS64_EF_R24 24
++#define MIPS64_EF_R25 25
+
+ /*
+ * k0/k1 unsaved
+ */
+-#define EF_R26 26
+-#define EF_R27 27
++#define MIPS64_EF_R26 26
++#define MIPS64_EF_R27 27
+
+
+-#define EF_R28 28
+-#define EF_R29 29
+-#define EF_R30 30
+-#define EF_R31 31
++#define MIPS64_EF_R28 28
++#define MIPS64_EF_R29 29
++#define MIPS64_EF_R30 30
++#define MIPS64_EF_R31 31
+
+ /*
+ * Saved special registers
+ */
+-#define EF_LO 32
+-#define EF_HI 33
+-
+-#define EF_CP0_EPC 34
+-#define EF_CP0_BADVADDR 35
+-#define EF_CP0_STATUS 36
+-#define EF_CP0_CAUSE 37
+-
+-#define EF_SIZE 304 /* size in bytes */
++#define MIPS64_EF_LO 32
++#define MIPS64_EF_HI 33
++
++#define MIPS64_EF_CP0_EPC 34
++#define MIPS64_EF_CP0_BADVADDR 35
++#define MIPS64_EF_CP0_STATUS 36
++#define MIPS64_EF_CP0_CAUSE 37
++
++#define MIPS64_EF_SIZE 304 /* size in bytes */
++
++#if defined(CONFIG_32BIT)
++
++#define EF_R0 MIPS32_EF_R0
++#define EF_R1 MIPS32_EF_R1
++#define EF_R2 MIPS32_EF_R2
++#define EF_R3 MIPS32_EF_R3
++#define EF_R4 MIPS32_EF_R4
++#define EF_R5 MIPS32_EF_R5
++#define EF_R6 MIPS32_EF_R6
++#define EF_R7 MIPS32_EF_R7
++#define EF_R8 MIPS32_EF_R8
++#define EF_R9 MIPS32_EF_R9
++#define EF_R10 MIPS32_EF_R10
++#define EF_R11 MIPS32_EF_R11
++#define EF_R12 MIPS32_EF_R12
++#define EF_R13 MIPS32_EF_R13
++#define EF_R14 MIPS32_EF_R14
++#define EF_R15 MIPS32_EF_R15
++#define EF_R16 MIPS32_EF_R16
++#define EF_R17 MIPS32_EF_R17
++#define EF_R18 MIPS32_EF_R18
++#define EF_R19 MIPS32_EF_R19
++#define EF_R20 MIPS32_EF_R20
++#define EF_R21 MIPS32_EF_R21
++#define EF_R22 MIPS32_EF_R22
++#define EF_R23 MIPS32_EF_R23
++#define EF_R24 MIPS32_EF_R24
++#define EF_R25 MIPS32_EF_R25
++#define EF_R26 MIPS32_EF_R26
++#define EF_R27 MIPS32_EF_R27
++#define EF_R28 MIPS32_EF_R28
++#define EF_R29 MIPS32_EF_R29
++#define EF_R30 MIPS32_EF_R30
++#define EF_R31 MIPS32_EF_R31
++#define EF_LO MIPS32_EF_LO
++#define EF_HI MIPS32_EF_HI
++#define EF_CP0_EPC MIPS32_EF_CP0_EPC
++#define EF_CP0_BADVADDR MIPS32_EF_CP0_BADVADDR
++#define EF_CP0_STATUS MIPS32_EF_CP0_STATUS
++#define EF_CP0_CAUSE MIPS32_EF_CP0_CAUSE
++#define EF_UNUSED0 MIPS32_EF_UNUSED0
++#define EF_SIZE MIPS32_EF_SIZE
++
++#elif defined(CONFIG_64BIT)
++
++#define EF_R0 MIPS64_EF_R0
++#define EF_R1 MIPS64_EF_R1
++#define EF_R2 MIPS64_EF_R2
++#define EF_R3 MIPS64_EF_R3
++#define EF_R4 MIPS64_EF_R4
++#define EF_R5 MIPS64_EF_R5
++#define EF_R6 MIPS64_EF_R6
++#define EF_R7 MIPS64_EF_R7
++#define EF_R8 MIPS64_EF_R8
++#define EF_R9 MIPS64_EF_R9
++#define EF_R10 MIPS64_EF_R10
++#define EF_R11 MIPS64_EF_R11
++#define EF_R12 MIPS64_EF_R12
++#define EF_R13 MIPS64_EF_R13
++#define EF_R14 MIPS64_EF_R14
++#define EF_R15 MIPS64_EF_R15
++#define EF_R16 MIPS64_EF_R16
++#define EF_R17 MIPS64_EF_R17
++#define EF_R18 MIPS64_EF_R18
++#define EF_R19 MIPS64_EF_R19
++#define EF_R20 MIPS64_EF_R20
++#define EF_R21 MIPS64_EF_R21
++#define EF_R22 MIPS64_EF_R22
++#define EF_R23 MIPS64_EF_R23
++#define EF_R24 MIPS64_EF_R24
++#define EF_R25 MIPS64_EF_R25
++#define EF_R26 MIPS64_EF_R26
++#define EF_R27 MIPS64_EF_R27
++#define EF_R28 MIPS64_EF_R28
++#define EF_R29 MIPS64_EF_R29
++#define EF_R30 MIPS64_EF_R30
++#define EF_R31 MIPS64_EF_R31
++#define EF_LO MIPS64_EF_LO
++#define EF_HI MIPS64_EF_HI
++#define EF_CP0_EPC MIPS64_EF_CP0_EPC
++#define EF_CP0_BADVADDR MIPS64_EF_CP0_BADVADDR
++#define EF_CP0_STATUS MIPS64_EF_CP0_STATUS
++#define EF_CP0_CAUSE MIPS64_EF_CP0_CAUSE
++#define EF_SIZE MIPS64_EF_SIZE
+
+ #endif /* CONFIG_64BIT */
+
+diff --git a/arch/mips/include/asm/syscall.h b/arch/mips/include/asm/syscall.h
+index 17960fe7a8ce..cdf68b33bd65 100644
+--- a/arch/mips/include/asm/syscall.h
++++ b/arch/mips/include/asm/syscall.h
+@@ -131,10 +131,12 @@ static inline int syscall_get_arch(void)
+ {
+ int arch = EM_MIPS;
+ #ifdef CONFIG_64BIT
+- if (!test_thread_flag(TIF_32BIT_REGS))
++ if (!test_thread_flag(TIF_32BIT_REGS)) {
+ arch |= __AUDIT_ARCH_64BIT;
+- if (test_thread_flag(TIF_32BIT_ADDR))
+- arch |= __AUDIT_ARCH_CONVENTION_MIPS64_N32;
++ /* N32 sets only TIF_32BIT_ADDR */
++ if (test_thread_flag(TIF_32BIT_ADDR))
++ arch |= __AUDIT_ARCH_CONVENTION_MIPS64_N32;
++ }
+ #endif
+ #if defined(__LITTLE_ENDIAN)
+ arch |= __AUDIT_ARCH_LE;
+diff --git a/arch/mips/kernel/binfmt_elfo32.c b/arch/mips/kernel/binfmt_elfo32.c
+index 7faf5f2bee25..71df942fb77c 100644
+--- a/arch/mips/kernel/binfmt_elfo32.c
++++ b/arch/mips/kernel/binfmt_elfo32.c
+@@ -72,12 +72,6 @@ typedef elf_fpreg_t elf_fpregset_t[ELF_NFPREG];
+
+ #include <asm/processor.h>
+
+-/*
+- * When this file is selected, we are definitely running a 64bit kernel.
+- * So using the right regs define in asm/reg.h
+- */
+-#define WANT_COMPAT_REG_H
+-
+ /* These MUST be defined before elf.h gets included */
+ extern void elf32_core_copy_regs(elf_gregset_t grp, struct pt_regs *regs);
+ #define ELF_CORE_COPY_REGS(_dest, _regs) elf32_core_copy_regs(_dest, _regs);
+@@ -149,21 +143,21 @@ void elf32_core_copy_regs(elf_gregset_t grp, struct pt_regs *regs)
+ {
+ int i;
+
+- for (i = 0; i < EF_R0; i++)
++ for (i = 0; i < MIPS32_EF_R0; i++)
+ grp[i] = 0;
+- grp[EF_R0] = 0;
++ grp[MIPS32_EF_R0] = 0;
+ for (i = 1; i <= 31; i++)
+- grp[EF_R0 + i] = (elf_greg_t) regs->regs[i];
+- grp[EF_R26] = 0;
+- grp[EF_R27] = 0;
+- grp[EF_LO] = (elf_greg_t) regs->lo;
+- grp[EF_HI] = (elf_greg_t) regs->hi;
+- grp[EF_CP0_EPC] = (elf_greg_t) regs->cp0_epc;
+- grp[EF_CP0_BADVADDR] = (elf_greg_t) regs->cp0_badvaddr;
+- grp[EF_CP0_STATUS] = (elf_greg_t) regs->cp0_status;
+- grp[EF_CP0_CAUSE] = (elf_greg_t) regs->cp0_cause;
+-#ifdef EF_UNUSED0
+- grp[EF_UNUSED0] = 0;
++ grp[MIPS32_EF_R0 + i] = (elf_greg_t) regs->regs[i];
++ grp[MIPS32_EF_R26] = 0;
++ grp[MIPS32_EF_R27] = 0;
++ grp[MIPS32_EF_LO] = (elf_greg_t) regs->lo;
++ grp[MIPS32_EF_HI] = (elf_greg_t) regs->hi;
++ grp[MIPS32_EF_CP0_EPC] = (elf_greg_t) regs->cp0_epc;
++ grp[MIPS32_EF_CP0_BADVADDR] = (elf_greg_t) regs->cp0_badvaddr;
++ grp[MIPS32_EF_CP0_STATUS] = (elf_greg_t) regs->cp0_status;
++ grp[MIPS32_EF_CP0_CAUSE] = (elf_greg_t) regs->cp0_cause;
++#ifdef MIPS32_EF_UNUSED0
++ grp[MIPS32_EF_UNUSED0] = 0;
+ #endif
+ }
+
+diff --git a/arch/mips/kernel/cps-vec.S b/arch/mips/kernel/cps-vec.S
+index 6f4f739dad96..e6e97d2a5c9e 100644
+--- a/arch/mips/kernel/cps-vec.S
++++ b/arch/mips/kernel/cps-vec.S
+@@ -13,6 +13,7 @@
+ #include <asm/asm-offsets.h>
+ #include <asm/asmmacro.h>
+ #include <asm/cacheops.h>
++#include <asm/eva.h>
+ #include <asm/mipsregs.h>
+ #include <asm/mipsmtregs.h>
+ #include <asm/pm.h>
+@@ -166,6 +167,9 @@ dcache_done:
+ 1: jal mips_cps_core_init
+ nop
+
++ /* Do any EVA initialization if necessary */
++ eva_init
++
+ /*
+ * Boot any other VPEs within this core that should be online, and
+ * deactivate this VPE if it should be offline.
+diff --git a/arch/mips/kernel/irq-gic.c b/arch/mips/kernel/irq-gic.c
+index 88e4c323382c..d5e59b8f4863 100644
+--- a/arch/mips/kernel/irq-gic.c
++++ b/arch/mips/kernel/irq-gic.c
+@@ -269,11 +269,13 @@ static void __init gic_setup_intr(unsigned int intr, unsigned int cpu,
+
+ /* Setup Intr to Pin mapping */
+ if (pin & GIC_MAP_TO_NMI_MSK) {
++ int i;
++
+ GICWRITE(GIC_REG_ADDR(SHARED, GIC_SH_MAP_TO_PIN(intr)), pin);
+ /* FIXME: hack to route NMI to all cpu's */
+- for (cpu = 0; cpu < NR_CPUS; cpu += 32) {
++ for (i = 0; i < NR_CPUS; i += 32) {
+ GICWRITE(GIC_REG_ADDR(SHARED,
+- GIC_SH_MAP_TO_VPE_REG_OFF(intr, cpu)),
++ GIC_SH_MAP_TO_VPE_REG_OFF(intr, i)),
+ 0xffffffff);
+ }
+ } else {
+diff --git a/arch/mips/kernel/ptrace.c b/arch/mips/kernel/ptrace.c
+index f639ccd5060c..aae71198b515 100644
+--- a/arch/mips/kernel/ptrace.c
++++ b/arch/mips/kernel/ptrace.c
+@@ -129,7 +129,7 @@ int ptrace_getfpregs(struct task_struct *child, __u32 __user *data)
+ }
+
+ __put_user(child->thread.fpu.fcr31, data + 64);
+- __put_user(current_cpu_data.fpu_id, data + 65);
++ __put_user(boot_cpu_data.fpu_id, data + 65);
+
+ return 0;
+ }
+@@ -151,6 +151,7 @@ int ptrace_setfpregs(struct task_struct *child, __u32 __user *data)
+ }
+
+ __get_user(child->thread.fpu.fcr31, data + 64);
++ child->thread.fpu.fcr31 &= ~FPU_CSR_ALL_X;
+
+ /* FIR may not be written. */
+
+@@ -246,36 +247,160 @@ int ptrace_set_watch_regs(struct task_struct *child,
+
+ /* regset get/set implementations */
+
+-static int gpr_get(struct task_struct *target,
+- const struct user_regset *regset,
+- unsigned int pos, unsigned int count,
+- void *kbuf, void __user *ubuf)
++#if defined(CONFIG_32BIT) || defined(CONFIG_MIPS32_O32)
++
++static int gpr32_get(struct task_struct *target,
++ const struct user_regset *regset,
++ unsigned int pos, unsigned int count,
++ void *kbuf, void __user *ubuf)
+ {
+ struct pt_regs *regs = task_pt_regs(target);
++ u32 uregs[ELF_NGREG] = {};
++ unsigned i;
++
++ for (i = MIPS32_EF_R1; i <= MIPS32_EF_R31; i++) {
++ /* k0/k1 are copied as zero. */
++ if (i == MIPS32_EF_R26 || i == MIPS32_EF_R27)
++ continue;
++
++ uregs[i] = regs->regs[i - MIPS32_EF_R0];
++ }
+
+- return user_regset_copyout(&pos, &count, &kbuf, &ubuf,
+- regs, 0, sizeof(*regs));
++ uregs[MIPS32_EF_LO] = regs->lo;
++ uregs[MIPS32_EF_HI] = regs->hi;
++ uregs[MIPS32_EF_CP0_EPC] = regs->cp0_epc;
++ uregs[MIPS32_EF_CP0_BADVADDR] = regs->cp0_badvaddr;
++ uregs[MIPS32_EF_CP0_STATUS] = regs->cp0_status;
++ uregs[MIPS32_EF_CP0_CAUSE] = regs->cp0_cause;
++
++ return user_regset_copyout(&pos, &count, &kbuf, &ubuf, uregs, 0,
++ sizeof(uregs));
+ }
+
+-static int gpr_set(struct task_struct *target,
+- const struct user_regset *regset,
+- unsigned int pos, unsigned int count,
+- const void *kbuf, const void __user *ubuf)
++static int gpr32_set(struct task_struct *target,
++ const struct user_regset *regset,
++ unsigned int pos, unsigned int count,
++ const void *kbuf, const void __user *ubuf)
+ {
+- struct pt_regs newregs;
+- int ret;
++ struct pt_regs *regs = task_pt_regs(target);
++ u32 uregs[ELF_NGREG];
++ unsigned start, num_regs, i;
++ int err;
++
++ start = pos / sizeof(u32);
++ num_regs = count / sizeof(u32);
++
++ if (start + num_regs > ELF_NGREG)
++ return -EIO;
++
++ err = user_regset_copyin(&pos, &count, &kbuf, &ubuf, uregs, 0,
++ sizeof(uregs));
++ if (err)
++ return err;
++
++ for (i = start; i < num_regs; i++) {
++ /*
++ * Cast all values to signed here so that if this is a 64-bit
++ * kernel, the supplied 32-bit values will be sign extended.
++ */
++ switch (i) {
++ case MIPS32_EF_R1 ... MIPS32_EF_R25:
++ /* k0/k1 are ignored. */
++ case MIPS32_EF_R28 ... MIPS32_EF_R31:
++ regs->regs[i - MIPS32_EF_R0] = (s32)uregs[i];
++ break;
++ case MIPS32_EF_LO:
++ regs->lo = (s32)uregs[i];
++ break;
++ case MIPS32_EF_HI:
++ regs->hi = (s32)uregs[i];
++ break;
++ case MIPS32_EF_CP0_EPC:
++ regs->cp0_epc = (s32)uregs[i];
++ break;
++ }
++ }
++
++ return 0;
++}
++
++#endif /* CONFIG_32BIT || CONFIG_MIPS32_O32 */
++
++#ifdef CONFIG_64BIT
++
++static int gpr64_get(struct task_struct *target,
++ const struct user_regset *regset,
++ unsigned int pos, unsigned int count,
++ void *kbuf, void __user *ubuf)
++{
++ struct pt_regs *regs = task_pt_regs(target);
++ u64 uregs[ELF_NGREG] = {};
++ unsigned i;
++
++ for (i = MIPS64_EF_R1; i <= MIPS64_EF_R31; i++) {
++ /* k0/k1 are copied as zero. */
++ if (i == MIPS64_EF_R26 || i == MIPS64_EF_R27)
++ continue;
++
++ uregs[i] = regs->regs[i - MIPS64_EF_R0];
++ }
++
++ uregs[MIPS64_EF_LO] = regs->lo;
++ uregs[MIPS64_EF_HI] = regs->hi;
++ uregs[MIPS64_EF_CP0_EPC] = regs->cp0_epc;
++ uregs[MIPS64_EF_CP0_BADVADDR] = regs->cp0_badvaddr;
++ uregs[MIPS64_EF_CP0_STATUS] = regs->cp0_status;
++ uregs[MIPS64_EF_CP0_CAUSE] = regs->cp0_cause;
++
++ return user_regset_copyout(&pos, &count, &kbuf, &ubuf, uregs, 0,
++ sizeof(uregs));
++}
+
+- ret = user_regset_copyin(&pos, &count, &kbuf, &ubuf,
+- &newregs,
+- 0, sizeof(newregs));
+- if (ret)
+- return ret;
++static int gpr64_set(struct task_struct *target,
++ const struct user_regset *regset,
++ unsigned int pos, unsigned int count,
++ const void *kbuf, const void __user *ubuf)
++{
++ struct pt_regs *regs = task_pt_regs(target);
++ u64 uregs[ELF_NGREG];
++ unsigned start, num_regs, i;
++ int err;
++
++ start = pos / sizeof(u64);
++ num_regs = count / sizeof(u64);
+
+- *task_pt_regs(target) = newregs;
++ if (start + num_regs > ELF_NGREG)
++ return -EIO;
++
++ err = user_regset_copyin(&pos, &count, &kbuf, &ubuf, uregs, 0,
++ sizeof(uregs));
++ if (err)
++ return err;
++
++ for (i = start; i < num_regs; i++) {
++ switch (i) {
++ case MIPS64_EF_R1 ... MIPS64_EF_R25:
++ /* k0/k1 are ignored. */
++ case MIPS64_EF_R28 ... MIPS64_EF_R31:
++ regs->regs[i - MIPS64_EF_R0] = uregs[i];
++ break;
++ case MIPS64_EF_LO:
++ regs->lo = uregs[i];
++ break;
++ case MIPS64_EF_HI:
++ regs->hi = uregs[i];
++ break;
++ case MIPS64_EF_CP0_EPC:
++ regs->cp0_epc = uregs[i];
++ break;
++ }
++ }
+
+ return 0;
+ }
+
++#endif /* CONFIG_64BIT */
++
+ static int fpr_get(struct task_struct *target,
+ const struct user_regset *regset,
+ unsigned int pos, unsigned int count,
+@@ -337,14 +462,16 @@ enum mips_regset {
+ REGSET_FPR,
+ };
+
++#if defined(CONFIG_32BIT) || defined(CONFIG_MIPS32_O32)
++
+ static const struct user_regset mips_regsets[] = {
+ [REGSET_GPR] = {
+ .core_note_type = NT_PRSTATUS,
+ .n = ELF_NGREG,
+ .size = sizeof(unsigned int),
+ .align = sizeof(unsigned int),
+- .get = gpr_get,
+- .set = gpr_set,
++ .get = gpr32_get,
++ .set = gpr32_set,
+ },
+ [REGSET_FPR] = {
+ .core_note_type = NT_PRFPREG,
+@@ -364,14 +491,18 @@ static const struct user_regset_view user_mips_view = {
+ .n = ARRAY_SIZE(mips_regsets),
+ };
+
++#endif /* CONFIG_32BIT || CONFIG_MIPS32_O32 */
++
++#ifdef CONFIG_64BIT
++
+ static const struct user_regset mips64_regsets[] = {
+ [REGSET_GPR] = {
+ .core_note_type = NT_PRSTATUS,
+ .n = ELF_NGREG,
+ .size = sizeof(unsigned long),
+ .align = sizeof(unsigned long),
+- .get = gpr_get,
+- .set = gpr_set,
++ .get = gpr64_get,
++ .set = gpr64_set,
+ },
+ [REGSET_FPR] = {
+ .core_note_type = NT_PRFPREG,
+@@ -384,25 +515,26 @@ static const struct user_regset mips64_regsets[] = {
+ };
+
+ static const struct user_regset_view user_mips64_view = {
+- .name = "mips",
++ .name = "mips64",
+ .e_machine = ELF_ARCH,
+ .ei_osabi = ELF_OSABI,
+ .regsets = mips64_regsets,
+- .n = ARRAY_SIZE(mips_regsets),
++ .n = ARRAY_SIZE(mips64_regsets),
+ };
+
++#endif /* CONFIG_64BIT */
++
+ const struct user_regset_view *task_user_regset_view(struct task_struct *task)
+ {
+ #ifdef CONFIG_32BIT
+ return &user_mips_view;
+-#endif
+-
++#else
+ #ifdef CONFIG_MIPS32_O32
+- if (test_thread_flag(TIF_32BIT_REGS))
+- return &user_mips_view;
++ if (test_tsk_thread_flag(task, TIF_32BIT_REGS))
++ return &user_mips_view;
+ #endif
+-
+ return &user_mips64_view;
++#endif
+ }
+
+ long arch_ptrace(struct task_struct *child, long request,
+@@ -480,7 +612,7 @@ long arch_ptrace(struct task_struct *child, long request,
+ break;
+ case FPC_EIR:
+ /* implementation / version register */
+- tmp = current_cpu_data.fpu_id;
++ tmp = boot_cpu_data.fpu_id;
+ break;
+ case DSP_BASE ... DSP_BASE + 5: {
+ dspreg_t *dregs;
+@@ -565,7 +697,7 @@ long arch_ptrace(struct task_struct *child, long request,
+ break;
+ #endif
+ case FPC_CSR:
+- child->thread.fpu.fcr31 = data;
++ child->thread.fpu.fcr31 = data & ~FPU_CSR_ALL_X;
+ break;
+ case DSP_BASE ... DSP_BASE + 5: {
+ dspreg_t *dregs;
+diff --git a/arch/mips/kernel/ptrace32.c b/arch/mips/kernel/ptrace32.c
+index b40c3ca60ee5..a83fb730b387 100644
+--- a/arch/mips/kernel/ptrace32.c
++++ b/arch/mips/kernel/ptrace32.c
+@@ -129,7 +129,7 @@ long compat_arch_ptrace(struct task_struct *child, compat_long_t request,
+ break;
+ case FPC_EIR:
+ /* implementation / version register */
+- tmp = current_cpu_data.fpu_id;
++ tmp = boot_cpu_data.fpu_id;
+ break;
+ case DSP_BASE ... DSP_BASE + 5: {
+ dspreg_t *dregs;
+diff --git a/arch/mips/kernel/scall64-o32.S b/arch/mips/kernel/scall64-o32.S
+index f1343ccd7ed7..7f5feb25ae04 100644
+--- a/arch/mips/kernel/scall64-o32.S
++++ b/arch/mips/kernel/scall64-o32.S
+@@ -113,15 +113,19 @@ trace_a_syscall:
+ move s0, t2 # Save syscall pointer
+ move a0, sp
+ /*
+- * syscall number is in v0 unless we called syscall(__NR_###)
++ * absolute syscall number is in v0 unless we called syscall(__NR_###)
+ * where the real syscall number is in a0
+ * note: NR_syscall is the first O32 syscall but the macro is
+ * only defined when compiling with -mabi=32 (CONFIG_32BIT)
+ * therefore __NR_O32_Linux is used (4000)
+ */
+- addiu a1, v0, __NR_O32_Linux
+- bnez v0, 1f /* __NR_syscall at offset 0 */
+- lw a1, PT_R4(sp)
++ .set push
++ .set reorder
++ subu t1, v0, __NR_O32_Linux
++ move a1, v0
++ bnez t1, 1f /* __NR_syscall at offset 0 */
++ lw a1, PT_R4(sp) /* Arg1 for __NR_syscall case */
++ .set pop
+
+ 1: jal syscall_trace_enter
+
+diff --git a/arch/mips/kernel/smp-mt.c b/arch/mips/kernel/smp-mt.c
+index 3babf6e4f894..21f23add04f4 100644
+--- a/arch/mips/kernel/smp-mt.c
++++ b/arch/mips/kernel/smp-mt.c
+@@ -288,6 +288,7 @@ struct plat_smp_ops vsmp_smp_ops = {
+ .prepare_cpus = vsmp_prepare_cpus,
+ };
+
++#ifdef CONFIG_PROC_FS
+ static int proc_cpuinfo_chain_call(struct notifier_block *nfb,
+ unsigned long action_unused, void *data)
+ {
+@@ -309,3 +310,4 @@ static int __init proc_cpuinfo_notifier_init(void)
+ }
+
+ subsys_initcall(proc_cpuinfo_notifier_init);
++#endif
+diff --git a/arch/mips/kernel/unaligned.c b/arch/mips/kernel/unaligned.c
+index 2b3517214d6d..e11906dff885 100644
+--- a/arch/mips/kernel/unaligned.c
++++ b/arch/mips/kernel/unaligned.c
+@@ -690,7 +690,6 @@ static void emulate_load_store_insn(struct pt_regs *regs,
+ case sdc1_op:
+ die_if_kernel("Unaligned FP access in kernel code", regs);
+ BUG_ON(!used_math());
+- BUG_ON(!is_fpu_owner());
+
+ lose_fpu(1); /* Save FPU state for the emulator. */
+ res = fpu_emulator_cop1Handler(regs, ¤t->thread.fpu, 1,
+diff --git a/arch/mips/mm/tlbex.c b/arch/mips/mm/tlbex.c
+index e80e10bafc83..343fe0f559b1 100644
+--- a/arch/mips/mm/tlbex.c
++++ b/arch/mips/mm/tlbex.c
+@@ -1299,6 +1299,7 @@ static void build_r4000_tlb_refill_handler(void)
+ }
+ #ifdef CONFIG_MIPS_HUGE_TLB_SUPPORT
+ uasm_l_tlb_huge_update(&l, p);
++ UASM_i_LW(&p, K0, 0, K1);
+ build_huge_update_entries(&p, htlb_info.huge_pte, K1);
+ build_huge_tlb_write_entry(&p, &l, &r, K0, tlb_random,
+ htlb_info.restore_scratch);
+diff --git a/arch/mips/mti-malta/malta-memory.c b/arch/mips/mti-malta/malta-memory.c
+index 6d9773096750..fdffc806664f 100644
+--- a/arch/mips/mti-malta/malta-memory.c
++++ b/arch/mips/mti-malta/malta-memory.c
+@@ -34,13 +34,19 @@ fw_memblock_t * __init fw_getmdesc(int eva)
+ /* otherwise look in the environment */
+
+ memsize_str = fw_getenv("memsize");
+- if (memsize_str)
+- tmp = kstrtol(memsize_str, 0, &memsize);
++ if (memsize_str) {
++ tmp = kstrtoul(memsize_str, 0, &memsize);
++ if (tmp)
++ pr_warn("Failed to read the 'memsize' env variable.\n");
++ }
+ if (eva) {
+ /* Look for ememsize for EVA */
+ ememsize_str = fw_getenv("ememsize");
+- if (ememsize_str)
+- tmp = kstrtol(ememsize_str, 0, &ememsize);
++ if (ememsize_str) {
++ tmp = kstrtoul(ememsize_str, 0, &ememsize);
++ if (tmp)
++ pr_warn("Failed to read the 'ememsize' env variable.\n");
++ }
+ }
+ if (!memsize && !ememsize) {
+ pr_warn("memsize not set in YAMON, set to default (32Mb)\n");
+diff --git a/arch/powerpc/include/asm/machdep.h b/arch/powerpc/include/asm/machdep.h
+index f92b0b54e921..8dcb721d03d8 100644
+--- a/arch/powerpc/include/asm/machdep.h
++++ b/arch/powerpc/include/asm/machdep.h
+@@ -57,10 +57,10 @@ struct machdep_calls {
+ void (*hpte_removebolted)(unsigned long ea,
+ int psize, int ssize);
+ void (*flush_hash_range)(unsigned long number, int local);
+- void (*hugepage_invalidate)(struct mm_struct *mm,
++ void (*hugepage_invalidate)(unsigned long vsid,
++ unsigned long addr,
+ unsigned char *hpte_slot_array,
+- unsigned long addr, int psize);
+-
++ int psize, int ssize);
+ /* special for kexec, to be called in real mode, linear mapping is
+ * destroyed as well */
+ void (*hpte_clear_all)(void);
+diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h b/arch/powerpc/include/asm/pgtable-ppc64.h
+index eb9261024f51..7b3d54fae46f 100644
+--- a/arch/powerpc/include/asm/pgtable-ppc64.h
++++ b/arch/powerpc/include/asm/pgtable-ppc64.h
+@@ -413,7 +413,7 @@ static inline char *get_hpte_slot_array(pmd_t *pmdp)
+ }
+
+ extern void hpte_do_hugepage_flush(struct mm_struct *mm, unsigned long addr,
+- pmd_t *pmdp);
++ pmd_t *pmdp, unsigned long old_pmd);
+ #ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ extern pmd_t pfn_pmd(unsigned long pfn, pgprot_t pgprot);
+ extern pmd_t mk_pmd(struct page *page, pgprot_t pgprot);
+diff --git a/arch/powerpc/include/asm/pte-hash64-64k.h b/arch/powerpc/include/asm/pte-hash64-64k.h
+index d836d945068d..9ecede1e124c 100644
+--- a/arch/powerpc/include/asm/pte-hash64-64k.h
++++ b/arch/powerpc/include/asm/pte-hash64-64k.h
+@@ -46,11 +46,31 @@
+ * in order to deal with 64K made of 4K HW pages. Thus we override the
+ * generic accessors and iterators here
+ */
+-#define __real_pte(e,p) ((real_pte_t) { \
+- (e), (pte_val(e) & _PAGE_COMBO) ? \
+- (pte_val(*((p) + PTRS_PER_PTE))) : 0 })
+-#define __rpte_to_hidx(r,index) ((pte_val((r).pte) & _PAGE_COMBO) ? \
+- (((r).hidx >> ((index)<<2)) & 0xf) : ((pte_val((r).pte) >> 12) & 0xf))
++#define __real_pte __real_pte
++static inline real_pte_t __real_pte(pte_t pte, pte_t *ptep)
++{
++ real_pte_t rpte;
++
++ rpte.pte = pte;
++ rpte.hidx = 0;
++ if (pte_val(pte) & _PAGE_COMBO) {
++ /*
++ * Make sure we order the hidx load against the _PAGE_COMBO
++ * check. The store side ordering is done in __hash_page_4K
++ */
++ smp_rmb();
++ rpte.hidx = pte_val(*((ptep) + PTRS_PER_PTE));
++ }
++ return rpte;
++}
++
++static inline unsigned long __rpte_to_hidx(real_pte_t rpte, unsigned long index)
++{
++ if ((pte_val(rpte.pte) & _PAGE_COMBO))
++ return (rpte.hidx >> (index<<2)) & 0xf;
++ return (pte_val(rpte.pte) >> 12) & 0xf;
++}
++
+ #define __rpte_to_pte(r) ((r).pte)
+ #define __rpte_sub_valid(rpte, index) \
+ (pte_val(rpte.pte) & (_PAGE_HPTE_SUB0 >> (index)))
+diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
+index 88e3ec6e1d96..48fb2c18fa81 100644
+--- a/arch/powerpc/kernel/iommu.c
++++ b/arch/powerpc/kernel/iommu.c
+@@ -1120,37 +1120,41 @@ EXPORT_SYMBOL_GPL(iommu_release_ownership);
+ int iommu_add_device(struct device *dev)
+ {
+ struct iommu_table *tbl;
+- int ret = 0;
+
+- if (WARN_ON(dev->iommu_group)) {
+- pr_warn("iommu_tce: device %s is already in iommu group %d, skipping\n",
+- dev_name(dev),
+- iommu_group_id(dev->iommu_group));
++ /*
++ * The sysfs entries should be populated before
++ * binding IOMMU group. If sysfs entries isn't
++ * ready, we simply bail.
++ */
++ if (!device_is_registered(dev))
++ return -ENOENT;
++
++ if (dev->iommu_group) {
++ pr_debug("%s: Skipping device %s with iommu group %d\n",
++ __func__, dev_name(dev),
++ iommu_group_id(dev->iommu_group));
+ return -EBUSY;
+ }
+
+ tbl = get_iommu_table_base(dev);
+ if (!tbl || !tbl->it_group) {
+- pr_debug("iommu_tce: skipping device %s with no tbl\n",
+- dev_name(dev));
++ pr_debug("%s: Skipping device %s with no tbl\n",
++ __func__, dev_name(dev));
+ return 0;
+ }
+
+- pr_debug("iommu_tce: adding %s to iommu group %d\n",
+- dev_name(dev), iommu_group_id(tbl->it_group));
++ pr_debug("%s: Adding %s to iommu group %d\n",
++ __func__, dev_name(dev),
++ iommu_group_id(tbl->it_group));
+
+ if (PAGE_SIZE < IOMMU_PAGE_SIZE(tbl)) {
+- pr_err("iommu_tce: unsupported iommu page size.");
+- pr_err("%s has not been added\n", dev_name(dev));
++ pr_err("%s: Invalid IOMMU page size %lx (%lx) on %s\n",
++ __func__, IOMMU_PAGE_SIZE(tbl),
++ PAGE_SIZE, dev_name(dev));
+ return -EINVAL;
+ }
+
+- ret = iommu_group_add_device(tbl->it_group, dev);
+- if (ret < 0)
+- pr_err("iommu_tce: %s has not been added, ret=%d\n",
+- dev_name(dev), ret);
+-
+- return ret;
++ return iommu_group_add_device(tbl->it_group, dev);
+ }
+ EXPORT_SYMBOL_GPL(iommu_add_device);
+
+diff --git a/arch/powerpc/mm/hash_native_64.c b/arch/powerpc/mm/hash_native_64.c
+index cf1d325eae8b..afc0a8295f84 100644
+--- a/arch/powerpc/mm/hash_native_64.c
++++ b/arch/powerpc/mm/hash_native_64.c
+@@ -412,18 +412,18 @@ static void native_hpte_invalidate(unsigned long slot, unsigned long vpn,
+ local_irq_restore(flags);
+ }
+
+-static void native_hugepage_invalidate(struct mm_struct *mm,
++static void native_hugepage_invalidate(unsigned long vsid,
++ unsigned long addr,
+ unsigned char *hpte_slot_array,
+- unsigned long addr, int psize)
++ int psize, int ssize)
+ {
+- int ssize = 0, i;
+- int lock_tlbie;
++ int i;
+ struct hash_pte *hptep;
+ int actual_psize = MMU_PAGE_16M;
+ unsigned int max_hpte_count, valid;
+ unsigned long flags, s_addr = addr;
+ unsigned long hpte_v, want_v, shift;
+- unsigned long hidx, vpn = 0, vsid, hash, slot;
++ unsigned long hidx, vpn = 0, hash, slot;
+
+ shift = mmu_psize_defs[psize].shift;
+ max_hpte_count = 1U << (PMD_SHIFT - shift);
+@@ -437,15 +437,6 @@ static void native_hugepage_invalidate(struct mm_struct *mm,
+
+ /* get the vpn */
+ addr = s_addr + (i * (1ul << shift));
+- if (!is_kernel_addr(addr)) {
+- ssize = user_segment_size(addr);
+- vsid = get_vsid(mm->context.id, addr, ssize);
+- WARN_ON(vsid == 0);
+- } else {
+- vsid = get_kernel_vsid(addr, mmu_kernel_ssize);
+- ssize = mmu_kernel_ssize;
+- }
+-
+ vpn = hpt_vpn(addr, vsid, ssize);
+ hash = hpt_hash(vpn, shift, ssize);
+ if (hidx & _PTEIDX_SECONDARY)
+@@ -465,22 +456,13 @@ static void native_hugepage_invalidate(struct mm_struct *mm,
+ else
+ /* Invalidate the hpte. NOTE: this also unlocks it */
+ hptep->v = 0;
++ /*
++ * We need to do a TLB invalidate for each address; the
++ * tlbie instruction compares the entry's VA in the TLB with
++ * the VA specified here.
++ */
++ tlbie(vpn, psize, actual_psize, ssize, 0);
+ }
+- /*
+- * Since this is a hugepage, we just need a single tlbie.
+- * use the last vpn.
+- */
+- lock_tlbie = !mmu_has_feature(MMU_FTR_LOCKLESS_TLBIE);
+- if (lock_tlbie)
+- raw_spin_lock(&native_tlbie_lock);
+-
+- asm volatile("ptesync":::"memory");
+- __tlbie(vpn, psize, actual_psize, ssize);
+- asm volatile("eieio; tlbsync; ptesync":::"memory");
+-
+- if (lock_tlbie)
+- raw_spin_unlock(&native_tlbie_lock);
+-
+ local_irq_restore(flags);
+ }
+
+diff --git a/arch/powerpc/mm/hugepage-hash64.c b/arch/powerpc/mm/hugepage-hash64.c
+index 826893fcb3a7..5f5e6328c21c 100644
+--- a/arch/powerpc/mm/hugepage-hash64.c
++++ b/arch/powerpc/mm/hugepage-hash64.c
+@@ -18,6 +18,57 @@
+ #include <linux/mm.h>
+ #include <asm/machdep.h>
+
++static void invalidate_old_hpte(unsigned long vsid, unsigned long addr,
++ pmd_t *pmdp, unsigned int psize, int ssize)
++{
++ int i, max_hpte_count, valid;
++ unsigned long s_addr;
++ unsigned char *hpte_slot_array;
++ unsigned long hidx, shift, vpn, hash, slot;
++
++ s_addr = addr & HPAGE_PMD_MASK;
++ hpte_slot_array = get_hpte_slot_array(pmdp);
++ /*
++ * If we try to do a huge PTE update after a withdraw is done,
++ * we will find hpte_slot_array NULL below. This happens when
++ * we do split_huge_page_pmd.
++ */
++ if (!hpte_slot_array)
++ return;
++
++ if (ppc_md.hugepage_invalidate)
++ return ppc_md.hugepage_invalidate(vsid, s_addr, hpte_slot_array,
++ psize, ssize);
++ /*
++ * No bulk hpte removal support, invalidate each entry
++ */
++ shift = mmu_psize_defs[psize].shift;
++ max_hpte_count = HPAGE_PMD_SIZE >> shift;
++ for (i = 0; i < max_hpte_count; i++) {
++ /*
++ * 8 bits per each hpte entries
++ * 000| [ secondary group (one bit) | hidx (3 bits) | valid bit]
++ */
++ valid = hpte_valid(hpte_slot_array, i);
++ if (!valid)
++ continue;
++ hidx = hpte_hash_index(hpte_slot_array, i);
++
++ /* get the vpn */
++ addr = s_addr + (i * (1ul << shift));
++ vpn = hpt_vpn(addr, vsid, ssize);
++ hash = hpt_hash(vpn, shift, ssize);
++ if (hidx & _PTEIDX_SECONDARY)
++ hash = ~hash;
++
++ slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
++ slot += hidx & _PTEIDX_GROUP_IX;
++ ppc_md.hpte_invalidate(slot, vpn, psize,
++ MMU_PAGE_16M, ssize, 0);
++ }
++}
++
++
+ int __hash_page_thp(unsigned long ea, unsigned long access, unsigned long vsid,
+ pmd_t *pmdp, unsigned long trap, int local, int ssize,
+ unsigned int psize)
+@@ -33,7 +84,9 @@ int __hash_page_thp(unsigned long ea, unsigned long access, unsigned long vsid,
+ * atomically mark the linux large page PMD busy and dirty
+ */
+ do {
+- old_pmd = pmd_val(*pmdp);
++ pmd_t pmd = ACCESS_ONCE(*pmdp);
++
++ old_pmd = pmd_val(pmd);
+ /* If PMD busy, retry the access */
+ if (unlikely(old_pmd & _PAGE_BUSY))
+ return 0;
+@@ -85,6 +138,15 @@ int __hash_page_thp(unsigned long ea, unsigned long access, unsigned long vsid,
+ vpn = hpt_vpn(ea, vsid, ssize);
+ hash = hpt_hash(vpn, shift, ssize);
+ hpte_slot_array = get_hpte_slot_array(pmdp);
++ if (psize == MMU_PAGE_4K) {
++ /*
++ * invalidate the old hpte entry if we have that mapped via 64K
++ * base page size. This is because demote_segment won't flush
++ * hash page table entries.
++ */
++ if ((old_pmd & _PAGE_HASHPTE) && !(old_pmd & _PAGE_COMBO))
++ invalidate_old_hpte(vsid, ea, pmdp, MMU_PAGE_64K, ssize);
++ }
+
+ valid = hpte_valid(hpte_slot_array, index);
+ if (valid) {
+@@ -107,11 +169,8 @@ int __hash_page_thp(unsigned long ea, unsigned long access, unsigned long vsid,
+ * safely update this here.
+ */
+ valid = 0;
+- new_pmd &= ~_PAGE_HPTEFLAGS;
+ hpte_slot_array[index] = 0;
+- } else
+- /* clear the busy bits and set the hash pte bits */
+- new_pmd = (new_pmd & ~_PAGE_HPTEFLAGS) | _PAGE_HASHPTE;
++ }
+ }
+
+ if (!valid) {
+@@ -119,11 +178,7 @@ int __hash_page_thp(unsigned long ea, unsigned long access, unsigned long vsid,
+
+ /* insert new entry */
+ pa = pmd_pfn(__pmd(old_pmd)) << PAGE_SHIFT;
+-repeat:
+- hpte_group = ((hash & htab_hash_mask) * HPTES_PER_GROUP) & ~0x7UL;
+-
+- /* clear the busy bits and set the hash pte bits */
+- new_pmd = (new_pmd & ~_PAGE_HPTEFLAGS) | _PAGE_HASHPTE;
++ new_pmd |= _PAGE_HASHPTE;
+
+ /* Add in WIMG bits */
+ rflags |= (new_pmd & (_PAGE_WRITETHRU | _PAGE_NO_CACHE |
+@@ -132,6 +187,8 @@ repeat:
+ * enable the memory coherence always
+ */
+ rflags |= HPTE_R_M;
++repeat:
++ hpte_group = ((hash & htab_hash_mask) * HPTES_PER_GROUP) & ~0x7UL;
+
+ /* Insert into the hash table, primary slot */
+ slot = ppc_md.hpte_insert(hpte_group, vpn, pa, rflags, 0,
+@@ -172,8 +229,17 @@ repeat:
+ mark_hpte_slot_valid(hpte_slot_array, index, slot);
+ }
+ /*
+- * No need to use ldarx/stdcx here
++ * Mark the pte with _PAGE_COMBO, if we are trying to hash it with
++ * base page size 4k.
++ */
++ if (psize == MMU_PAGE_4K)
++ new_pmd |= _PAGE_COMBO;
++ /*
++ * The hpte valid is stored in the pgtable whose address is in the
++ * second half of the PMD. Order this against clearing of the busy bit in
++ * huge pmd.
+ */
++ smp_wmb();
+ *pmdp = __pmd(new_pmd & ~_PAGE_BUSY);
+ return 0;
+ }
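The `__hash_page_thp` hunk above replaces a plain dereference of `*pmdp` with an `ACCESS_ONCE` snapshot inside a retry loop, so the busy check and the later compare-and-swap operate on the same value. A minimal userspace sketch of that read-once/CAS pattern, using C11 atomics and a hypothetical `PAGE_BUSY` bit (names are illustrative, not the kernel API):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define PAGE_BUSY 0x1u  /* hypothetical busy bit, standing in for _PAGE_BUSY */

/*
 * Snapshot the word once (the ACCESS_ONCE step), bail out if it is
 * already busy, otherwise try to set the busy bit atomically and
 * retry on contention -- mirroring the do/while loop in the patch.
 */
static bool mark_busy(atomic_uint *pmdp, unsigned int *snapshot)
{
    unsigned int old;

    do {
        old = atomic_load(pmdp);   /* single read; reused for check and CAS */
        if (old & PAGE_BUSY)
            return false;          /* someone else owns it: caller retries */
    } while (!atomic_compare_exchange_weak(pmdp, &old, old | PAGE_BUSY));

    *snapshot = old;               /* stable pre-update value */
    return true;
}
```

The point of the fix is that reading `*pmdp` twice (once for the busy test, once for the CAS) could observe two different values; the single snapshot removes that window.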
+diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
+index 3b181b22cd46..d3e9a78eaed3 100644
+--- a/arch/powerpc/mm/numa.c
++++ b/arch/powerpc/mm/numa.c
+@@ -611,8 +611,8 @@ static int cpu_numa_callback(struct notifier_block *nfb, unsigned long action,
+ case CPU_UP_CANCELED:
+ case CPU_UP_CANCELED_FROZEN:
+ unmap_cpu_from_node(lcpu);
+- break;
+ ret = NOTIFY_OK;
++ break;
+ #endif
+ }
+ return ret;
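The one-line numa.c swap above fixes a classic dead-code bug: `ret = NOTIFY_OK` sat *after* the `break`, so the case always fell out with the default return value. A tiny sketch of the corrected ordering (constants and the callback are illustrative stand-ins for the kernel notifier):

```c
#include <assert.h>

/* Hypothetical stand-ins for the kernel notifier return codes. */
#define NOTIFY_DONE 0x0000
#define NOTIFY_OK   0x0001

/*
 * Any statement placed after `break` inside a switch case is
 * unreachable, so the status must be assigned before breaking --
 * exactly what the numa.c hunk does.
 */
static int cpu_callback(int action)
{
    int ret = NOTIFY_DONE;

    switch (action) {
    case 1: /* e.g. CPU_UP_CANCELED */
        /* unmap_cpu_from_node(lcpu); */
        ret = NOTIFY_OK;  /* must come before the break */
        break;
    }
    return ret;
}
```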
+diff --git a/arch/powerpc/mm/pgtable_64.c b/arch/powerpc/mm/pgtable_64.c
+index f6ce1f111f5b..71d084b6f766 100644
+--- a/arch/powerpc/mm/pgtable_64.c
++++ b/arch/powerpc/mm/pgtable_64.c
+@@ -538,7 +538,7 @@ unsigned long pmd_hugepage_update(struct mm_struct *mm, unsigned long addr,
+ *pmdp = __pmd((old & ~clr) | set);
+ #endif
+ if (old & _PAGE_HASHPTE)
+- hpte_do_hugepage_flush(mm, addr, pmdp);
++ hpte_do_hugepage_flush(mm, addr, pmdp, old);
+ return old;
+ }
+
+@@ -645,7 +645,7 @@ void pmdp_splitting_flush(struct vm_area_struct *vma,
+ if (!(old & _PAGE_SPLITTING)) {
+ /* We need to flush the hpte */
+ if (old & _PAGE_HASHPTE)
+- hpte_do_hugepage_flush(vma->vm_mm, address, pmdp);
++ hpte_do_hugepage_flush(vma->vm_mm, address, pmdp, old);
+ }
+ /*
+ * This ensures that generic code that rely on IRQ disabling
+@@ -723,7 +723,7 @@ void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
+ * neesd to be flushed.
+ */
+ void hpte_do_hugepage_flush(struct mm_struct *mm, unsigned long addr,
+- pmd_t *pmdp)
++ pmd_t *pmdp, unsigned long old_pmd)
+ {
+ int ssize, i;
+ unsigned long s_addr;
+@@ -745,12 +745,29 @@ void hpte_do_hugepage_flush(struct mm_struct *mm, unsigned long addr,
+ if (!hpte_slot_array)
+ return;
+
+- /* get the base page size */
++ /* get the base page size, vsid and segment size */
++#ifdef CONFIG_DEBUG_VM
+ psize = get_slice_psize(mm, s_addr);
++ BUG_ON(psize == MMU_PAGE_16M);
++#endif
++ if (old_pmd & _PAGE_COMBO)
++ psize = MMU_PAGE_4K;
++ else
++ psize = MMU_PAGE_64K;
++
++ if (!is_kernel_addr(s_addr)) {
++ ssize = user_segment_size(s_addr);
++ vsid = get_vsid(mm->context.id, s_addr, ssize);
++ WARN_ON(vsid == 0);
++ } else {
++ vsid = get_kernel_vsid(s_addr, mmu_kernel_ssize);
++ ssize = mmu_kernel_ssize;
++ }
+
+ if (ppc_md.hugepage_invalidate)
+- return ppc_md.hugepage_invalidate(mm, hpte_slot_array,
+- s_addr, psize);
++ return ppc_md.hugepage_invalidate(vsid, s_addr,
++ hpte_slot_array,
++ psize, ssize);
+ /*
+ * No bluk hpte removal support, invalidate each entry
+ */
+@@ -768,15 +785,6 @@ void hpte_do_hugepage_flush(struct mm_struct *mm, unsigned long addr,
+
+ /* get the vpn */
+ addr = s_addr + (i * (1ul << shift));
+- if (!is_kernel_addr(addr)) {
+- ssize = user_segment_size(addr);
+- vsid = get_vsid(mm->context.id, addr, ssize);
+- WARN_ON(vsid == 0);
+- } else {
+- vsid = get_kernel_vsid(addr, mmu_kernel_ssize);
+- ssize = mmu_kernel_ssize;
+- }
+-
+ vpn = hpt_vpn(addr, vsid, ssize);
+ hash = hpt_hash(vpn, shift, ssize);
+ if (hidx & _PTEIDX_SECONDARY)
+diff --git a/arch/powerpc/mm/tlb_hash64.c b/arch/powerpc/mm/tlb_hash64.c
+index c99f6510a0b2..9adda5790463 100644
+--- a/arch/powerpc/mm/tlb_hash64.c
++++ b/arch/powerpc/mm/tlb_hash64.c
+@@ -216,7 +216,7 @@ void __flush_hash_table_range(struct mm_struct *mm, unsigned long start,
+ if (!(pte & _PAGE_HASHPTE))
+ continue;
+ if (unlikely(hugepage_shift && pmd_trans_huge(*(pmd_t *)pte)))
+- hpte_do_hugepage_flush(mm, start, (pmd_t *)pte);
++ hpte_do_hugepage_flush(mm, start, (pmd_t *)ptep, pte);
+ else
+ hpte_need_flush(mm, start, ptep, pte, 0);
+ }
+diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
+index 3136ae2f75af..dc30aa5a2ce8 100644
+--- a/arch/powerpc/platforms/powernv/pci-ioda.c
++++ b/arch/powerpc/platforms/powernv/pci-ioda.c
+@@ -462,7 +462,7 @@ static void pnv_pci_ioda_dma_dev_setup(struct pnv_phb *phb, struct pci_dev *pdev
+
+ pe = &phb->ioda.pe_array[pdn->pe_number];
+ WARN_ON(get_dma_ops(&pdev->dev) != &dma_iommu_ops);
+- set_iommu_table_base(&pdev->dev, &pe->tce32_table);
++ set_iommu_table_base_and_group(&pdev->dev, &pe->tce32_table);
+ }
+
+ static int pnv_pci_ioda_dma_set_mask(struct pnv_phb *phb,
+diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c
+index 7995135170a3..24abc5c223c7 100644
+--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
++++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
+@@ -146,7 +146,7 @@ static inline int pseries_remove_memblock(unsigned long base,
+ }
+ static inline int pseries_remove_mem_node(struct device_node *np)
+ {
+- return -EOPNOTSUPP;
++ return 0;
+ }
+ #endif /* CONFIG_MEMORY_HOTREMOVE */
+
+diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
+index 33b552ffbe57..4642d6a4d356 100644
+--- a/arch/powerpc/platforms/pseries/iommu.c
++++ b/arch/powerpc/platforms/pseries/iommu.c
+@@ -721,13 +721,13 @@ static int __init disable_ddw_setup(char *str)
+
+ early_param("disable_ddw", disable_ddw_setup);
+
+-static void remove_ddw(struct device_node *np)
++static void remove_ddw(struct device_node *np, bool remove_prop)
+ {
+ struct dynamic_dma_window_prop *dwp;
+ struct property *win64;
+ const u32 *ddw_avail;
+ u64 liobn;
+- int len, ret;
++ int len, ret = 0;
+
+ ddw_avail = of_get_property(np, "ibm,ddw-applicable", &len);
+ win64 = of_find_property(np, DIRECT64_PROPNAME, NULL);
+@@ -761,7 +761,8 @@ static void remove_ddw(struct device_node *np)
+ np->full_name, ret, ddw_avail[2], liobn);
+
+ delprop:
+- ret = of_remove_property(np, win64);
++ if (remove_prop)
++ ret = of_remove_property(np, win64);
+ if (ret)
+ pr_warning("%s: failed to remove direct window property: %d\n",
+ np->full_name, ret);
+@@ -805,7 +806,7 @@ static int find_existing_ddw_windows(void)
+ window = kzalloc(sizeof(*window), GFP_KERNEL);
+ if (!window || len < sizeof(struct dynamic_dma_window_prop)) {
+ kfree(window);
+- remove_ddw(pdn);
++ remove_ddw(pdn, true);
+ continue;
+ }
+
+@@ -1045,7 +1046,7 @@ out_free_window:
+ kfree(window);
+
+ out_clear_window:
+- remove_ddw(pdn);
++ remove_ddw(pdn, true);
+
+ out_free_prop:
+ kfree(win64->name);
+@@ -1255,7 +1256,14 @@ static int iommu_reconfig_notifier(struct notifier_block *nb, unsigned long acti
+
+ switch (action) {
+ case OF_RECONFIG_DETACH_NODE:
+- remove_ddw(np);
++ /*
++ * Removing the property will invoke the reconfig
++ * notifier again, which causes a deadlock on the
++ * read-write semaphore of the notifier chain. So
++ * we have to remove the property when releasing
++ * the device node.
++ */
++ remove_ddw(np, false);
+ if (pci && pci->iommu_table)
+ iommu_free_table(pci->iommu_table, np->full_name);
+
+diff --git a/arch/powerpc/platforms/pseries/lpar.c b/arch/powerpc/platforms/pseries/lpar.c
+index b02af9ef3ff6..ccf6f162f69c 100644
+--- a/arch/powerpc/platforms/pseries/lpar.c
++++ b/arch/powerpc/platforms/pseries/lpar.c
+@@ -430,16 +430,17 @@ static void __pSeries_lpar_hugepage_invalidate(unsigned long *slot,
+ spin_unlock_irqrestore(&pSeries_lpar_tlbie_lock, flags);
+ }
+
+-static void pSeries_lpar_hugepage_invalidate(struct mm_struct *mm,
+- unsigned char *hpte_slot_array,
+- unsigned long addr, int psize)
++static void pSeries_lpar_hugepage_invalidate(unsigned long vsid,
++ unsigned long addr,
++ unsigned char *hpte_slot_array,
++ int psize, int ssize)
+ {
+- int ssize = 0, i, index = 0;
++ int i, index = 0;
+ unsigned long s_addr = addr;
+ unsigned int max_hpte_count, valid;
+ unsigned long vpn_array[PPC64_HUGE_HPTE_BATCH];
+ unsigned long slot_array[PPC64_HUGE_HPTE_BATCH];
+- unsigned long shift, hidx, vpn = 0, vsid, hash, slot;
++ unsigned long shift, hidx, vpn = 0, hash, slot;
+
+ shift = mmu_psize_defs[psize].shift;
+ max_hpte_count = 1U << (PMD_SHIFT - shift);
+@@ -452,15 +453,6 @@ static void pSeries_lpar_hugepage_invalidate(struct mm_struct *mm,
+
+ /* get the vpn */
+ addr = s_addr + (i * (1ul << shift));
+- if (!is_kernel_addr(addr)) {
+- ssize = user_segment_size(addr);
+- vsid = get_vsid(mm->context.id, addr, ssize);
+- WARN_ON(vsid == 0);
+- } else {
+- vsid = get_kernel_vsid(addr, mmu_kernel_ssize);
+- ssize = mmu_kernel_ssize;
+- }
+-
+ vpn = hpt_vpn(addr, vsid, ssize);
+ hash = hpt_hash(vpn, shift, ssize);
+ if (hidx & _PTEIDX_SECONDARY)
+diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
+index bb63499fc5d3..9f00f9301613 100644
+--- a/arch/s390/Kconfig
++++ b/arch/s390/Kconfig
+@@ -92,6 +92,7 @@ config S390
+ select ARCH_INLINE_WRITE_UNLOCK_IRQ
+ select ARCH_INLINE_WRITE_UNLOCK_IRQRESTORE
+ select ARCH_SAVE_PAGE_KEYS if HIBERNATION
++ select ARCH_SUPPORTS_ATOMIC_RMW
+ select ARCH_USE_CMPXCHG_LOCKREF
+ select ARCH_WANT_IPC_PARSE_VERSION
+ select BUILDTIME_EXTABLE_SORT
+diff --git a/arch/sh/include/asm/io_noioport.h b/arch/sh/include/asm/io_noioport.h
+index 4d48f1436a63..c727e6ddf69e 100644
+--- a/arch/sh/include/asm/io_noioport.h
++++ b/arch/sh/include/asm/io_noioport.h
+@@ -34,6 +34,17 @@ static inline void outl(unsigned int x, unsigned long port)
+ BUG();
+ }
+
++static inline void __iomem *ioport_map(unsigned long port, unsigned int size)
++{
++ BUG();
++ return NULL;
++}
++
++static inline void ioport_unmap(void __iomem *addr)
++{
++ BUG();
++}
++
+ #define inb_p(addr) inb(addr)
+ #define inw_p(addr) inw(addr)
+ #define inl_p(addr) inl(addr)
+diff --git a/block/scsi_ioctl.c b/block/scsi_ioctl.c
+index 14695c6221c8..84ab119b6ffa 100644
+--- a/block/scsi_ioctl.c
++++ b/block/scsi_ioctl.c
+@@ -438,6 +438,11 @@ int sg_scsi_ioctl(struct request_queue *q, struct gendisk *disk, fmode_t mode,
+ }
+
+ rq = blk_get_request(q, in_len ? WRITE : READ, __GFP_WAIT);
++ if (!rq) {
++ err = -ENOMEM;
++ goto error;
++ }
++ blk_rq_set_block_pc(rq);
+
+ cmdlen = COMMAND_SIZE(opcode);
+
+@@ -491,7 +496,6 @@ int sg_scsi_ioctl(struct request_queue *q, struct gendisk *disk, fmode_t mode,
+ memset(sense, 0, sizeof(sense));
+ rq->sense = sense;
+ rq->sense_len = 0;
+- blk_rq_set_block_pc(rq);
+
+ blk_execute_rq(q, disk, rq, 0);
+
+@@ -511,7 +515,8 @@ out:
+
+ error:
+ kfree(buffer);
+- blk_put_request(rq);
++ if (rq)
++ blk_put_request(rq);
+ return err;
+ }
+ EXPORT_SYMBOL_GPL(sg_scsi_ioctl);
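The sg_scsi_ioctl hunks add two related safeguards: the result of the request allocation is checked before use, and the shared `error:` cleanup path only releases the request if it was actually obtained. A userspace sketch of that goto-based error-path shape (all names here are hypothetical stand-ins for `blk_get_request()`/`blk_put_request()`):

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

struct fake_request { int id; };

static struct fake_request *get_request(int fail)
{
    return fail ? NULL : calloc(1, sizeof(struct fake_request));
}

/*
 * Allocate, bail to a single cleanup label on failure, and guard the
 * release so the error path never frees a NULL request -- the same
 * structure the patch gives sg_scsi_ioctl().
 */
static int do_ioctl(int fail_alloc)
{
    struct fake_request *rq;
    char *buffer = malloc(16);
    int err = 0;

    rq = get_request(fail_alloc);
    if (!rq) {
        err = -ENOMEM;
        goto error;        /* rq is NULL: must not be released below */
    }

    /* ... fill in and execute the request ... */

error:
    free(buffer);
    if (rq)                /* guard added by the fix */
        free(rq);
    return err;
}
```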
+diff --git a/drivers/acpi/acpica/nsobject.c b/drivers/acpi/acpica/nsobject.c
+index fe54a8c73b8c..f1ea8e56cd87 100644
+--- a/drivers/acpi/acpica/nsobject.c
++++ b/drivers/acpi/acpica/nsobject.c
+@@ -239,6 +239,17 @@ void acpi_ns_detach_object(struct acpi_namespace_node *node)
+ }
+ }
+
++ /*
++ * Detach the object from any data objects (which are still held by
++ * the namespace node)
++ */
++
++ if (obj_desc->common.next_object &&
++ ((obj_desc->common.next_object)->common.type ==
++ ACPI_TYPE_LOCAL_DATA)) {
++ obj_desc->common.next_object = NULL;
++ }
++
+ /* Reset the node type to untyped */
+
+ node->type = ACPI_TYPE_ANY;
+diff --git a/drivers/acpi/acpica/utcopy.c b/drivers/acpi/acpica/utcopy.c
+index 270c16464dd9..ff601c0f7c7a 100644
+--- a/drivers/acpi/acpica/utcopy.c
++++ b/drivers/acpi/acpica/utcopy.c
+@@ -1001,5 +1001,11 @@ acpi_ut_copy_iobject_to_iobject(union acpi_operand_object *source_desc,
+ status = acpi_ut_copy_simple_object(source_desc, *dest_desc);
+ }
+
++ /* Delete the allocated object if copy failed */
++
++ if (ACPI_FAILURE(status)) {
++ acpi_ut_remove_reference(*dest_desc);
++ }
++
+ return_ACPI_STATUS(status);
+ }
+diff --git a/drivers/acpi/ec.c b/drivers/acpi/ec.c
+index a66ab658abbc..9922cc46b15c 100644
+--- a/drivers/acpi/ec.c
++++ b/drivers/acpi/ec.c
+@@ -197,6 +197,8 @@ static bool advance_transaction(struct acpi_ec *ec)
+ t->rdata[t->ri++] = acpi_ec_read_data(ec);
+ if (t->rlen == t->ri) {
+ t->flags |= ACPI_EC_COMMAND_COMPLETE;
++ if (t->command == ACPI_EC_COMMAND_QUERY)
++ pr_debug("hardware QR_EC completion\n");
+ wakeup = true;
+ }
+ } else
+@@ -208,7 +210,20 @@ static bool advance_transaction(struct acpi_ec *ec)
+ }
+ return wakeup;
+ } else {
+- if ((status & ACPI_EC_FLAG_IBF) == 0) {
++ /*
++ * Some firmware refuses to respond to QR_EC when SCI_EVT is
++ * not set; in that case, complete the QR_EC transaction
++ * without issuing it to the firmware.
++ * https://bugzilla.kernel.org/show_bug.cgi?id=86211
++ */
++ if (!(status & ACPI_EC_FLAG_SCI) &&
++ (t->command == ACPI_EC_COMMAND_QUERY)) {
++ t->flags |= ACPI_EC_COMMAND_POLL;
++ t->rdata[t->ri++] = 0x00;
++ t->flags |= ACPI_EC_COMMAND_COMPLETE;
++ pr_debug("software QR_EC completion\n");
++ wakeup = true;
++ } else if ((status & ACPI_EC_FLAG_IBF) == 0) {
+ acpi_ec_write_cmd(ec, t->command);
+ t->flags |= ACPI_EC_COMMAND_POLL;
+ } else
+@@ -288,11 +303,11 @@ static int acpi_ec_transaction_unlocked(struct acpi_ec *ec,
+ /* following two actions should be kept atomic */
+ ec->curr = t;
+ start_transaction(ec);
+- if (ec->curr->command == ACPI_EC_COMMAND_QUERY)
+- clear_bit(EC_FLAGS_QUERY_PENDING, &ec->flags);
+ spin_unlock_irqrestore(&ec->lock, tmp);
+ ret = ec_poll(ec);
+ spin_lock_irqsave(&ec->lock, tmp);
++ if (ec->curr->command == ACPI_EC_COMMAND_QUERY)
++ clear_bit(EC_FLAGS_QUERY_PENDING, &ec->flags);
+ ec->curr = NULL;
+ spin_unlock_irqrestore(&ec->lock, tmp);
+ return ret;
+diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
+index 3dca36d4ad26..17f9ec501972 100644
+--- a/drivers/acpi/processor_idle.c
++++ b/drivers/acpi/processor_idle.c
+@@ -1071,9 +1071,9 @@ int acpi_processor_cst_has_changed(struct acpi_processor *pr)
+
+ if (pr->id == 0 && cpuidle_get_driver() == &acpi_idle_driver) {
+
+- cpuidle_pause_and_lock();
+ /* Protect against cpu-hotplug */
+ get_online_cpus();
++ cpuidle_pause_and_lock();
+
+ /* Disable all cpuidle devices */
+ for_each_online_cpu(cpu) {
+@@ -1100,8 +1100,8 @@ int acpi_processor_cst_has_changed(struct acpi_processor *pr)
+ cpuidle_enable_device(dev);
+ }
+ }
+- put_online_cpus();
+ cpuidle_resume_and_unlock();
++ put_online_cpus();
+ }
+
+ return 0;
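The processor_idle.c hunks reorder the two locks so the cpu-hotplug lock is taken before the cpuidle lock and released after it: nested acquisition in one fixed order, release in exact reverse order, which is what prevents an ABBA deadlock against other paths. A single-threaded sketch that records the events to make the nesting visible (the four function names are illustrative shims, not the kernel API):

```c
#include <assert.h>
#include <string.h>

static char log_buf[128];

static void ev(const char *s) { strcat(log_buf, s); }

/* Shims standing in for the two lock pairs in the patch. */
static void get_online_cpus(void)           { ev("A("); }
static void cpuidle_pause_and_lock(void)    { ev("B("); }
static void cpuidle_resume_and_unlock(void) { ev(")B"); }
static void put_online_cpus(void)           { ev(")A"); }

/* Acquire outer-then-inner, release inner-then-outer. */
static const char *reconfigure(void)
{
    log_buf[0] = '\0';
    get_online_cpus();           /* outer lock first */
    cpuidle_pause_and_lock();    /* then inner lock */
    /* ... disable and re-enable the cpuidle devices ... */
    cpuidle_resume_and_unlock(); /* release inner first */
    put_online_cpus();           /* outer last */
    return log_buf;
}
```

The event string comes out properly nested, `A(B()B)A`; the pre-patch ordering would have produced the interleaved `B(A(…)B)A` shape that can deadlock against a path taking the locks the other way around.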
+diff --git a/drivers/acpi/scan.c b/drivers/acpi/scan.c
+index f775fa0d850f..551f29127369 100644
+--- a/drivers/acpi/scan.c
++++ b/drivers/acpi/scan.c
+@@ -351,7 +351,8 @@ static int acpi_scan_hot_remove(struct acpi_device *device)
+ unsigned long long sta;
+ acpi_status status;
+
+- if (device->handler->hotplug.demand_offline && !acpi_force_hot_remove) {
++ if (device->handler && device->handler->hotplug.demand_offline
++ && !acpi_force_hot_remove) {
+ if (!acpi_scan_is_offline(device, true))
+ return -EBUSY;
+ } else {
+@@ -664,8 +665,14 @@ static ssize_t
+ acpi_device_sun_show(struct device *dev, struct device_attribute *attr,
+ char *buf) {
+ struct acpi_device *acpi_dev = to_acpi_device(dev);
++ acpi_status status;
++ unsigned long long sun;
++
++ status = acpi_evaluate_integer(acpi_dev->handle, "_SUN", NULL, &sun);
++ if (ACPI_FAILURE(status))
++ return -ENODEV;
+
+- return sprintf(buf, "%lu\n", acpi_dev->pnp.sun);
++ return sprintf(buf, "%llu\n", sun);
+ }
+ static DEVICE_ATTR(sun, 0444, acpi_device_sun_show, NULL);
+
+@@ -687,7 +694,6 @@ static int acpi_device_setup_files(struct acpi_device *dev)
+ {
+ struct acpi_buffer buffer = {ACPI_ALLOCATE_BUFFER, NULL};
+ acpi_status status;
+- unsigned long long sun;
+ int result = 0;
+
+ /*
+@@ -728,14 +734,10 @@ static int acpi_device_setup_files(struct acpi_device *dev)
+ if (dev->pnp.unique_id)
+ result = device_create_file(&dev->dev, &dev_attr_uid);
+
+- status = acpi_evaluate_integer(dev->handle, "_SUN", NULL, &sun);
+- if (ACPI_SUCCESS(status)) {
+- dev->pnp.sun = (unsigned long)sun;
++ if (acpi_has_method(dev->handle, "_SUN")) {
+ result = device_create_file(&dev->dev, &dev_attr_sun);
+ if (result)
+ goto end;
+- } else {
+- dev->pnp.sun = (unsigned long)-1;
+ }
+
+ if (acpi_has_method(dev->handle, "_STA")) {
+@@ -919,12 +921,17 @@ static void acpi_device_notify(acpi_handle handle, u32 event, void *data)
+ device->driver->ops.notify(device, event);
+ }
+
+-static acpi_status acpi_device_notify_fixed(void *data)
++static void acpi_device_notify_fixed(void *data)
+ {
+ struct acpi_device *device = data;
+
+ /* Fixed hardware devices have no handles */
+ acpi_device_notify(NULL, ACPI_FIXED_HARDWARE_EVENT, device);
++}
++
++static acpi_status acpi_device_fixed_event(void *data)
++{
++ acpi_os_execute(OSL_NOTIFY_HANDLER, acpi_device_notify_fixed, data);
+ return AE_OK;
+ }
+
+@@ -935,12 +942,12 @@ static int acpi_device_install_notify_handler(struct acpi_device *device)
+ if (device->device_type == ACPI_BUS_TYPE_POWER_BUTTON)
+ status =
+ acpi_install_fixed_event_handler(ACPI_EVENT_POWER_BUTTON,
+- acpi_device_notify_fixed,
++ acpi_device_fixed_event,
+ device);
+ else if (device->device_type == ACPI_BUS_TYPE_SLEEP_BUTTON)
+ status =
+ acpi_install_fixed_event_handler(ACPI_EVENT_SLEEP_BUTTON,
+- acpi_device_notify_fixed,
++ acpi_device_fixed_event,
+ device);
+ else
+ status = acpi_install_notify_handler(device->handle,
+@@ -957,10 +964,10 @@ static void acpi_device_remove_notify_handler(struct acpi_device *device)
+ {
+ if (device->device_type == ACPI_BUS_TYPE_POWER_BUTTON)
+ acpi_remove_fixed_event_handler(ACPI_EVENT_POWER_BUTTON,
+- acpi_device_notify_fixed);
++ acpi_device_fixed_event);
+ else if (device->device_type == ACPI_BUS_TYPE_SLEEP_BUTTON)
+ acpi_remove_fixed_event_handler(ACPI_EVENT_SLEEP_BUTTON,
+- acpi_device_notify_fixed);
++ acpi_device_fixed_event);
+ else
+ acpi_remove_notify_handler(device->handle, ACPI_DEVICE_NOTIFY,
+ acpi_device_notify);
+@@ -972,7 +979,7 @@ static int acpi_device_probe(struct device *dev)
+ struct acpi_driver *acpi_drv = to_acpi_driver(dev->driver);
+ int ret;
+
+- if (acpi_dev->handler)
++ if (acpi_dev->handler && !acpi_is_pnp_device(acpi_dev))
+ return -EINVAL;
+
+ if (!acpi_drv->ops.add)
+diff --git a/drivers/acpi/video.c b/drivers/acpi/video.c
+index 350d52a8f781..4834b4cae540 100644
+--- a/drivers/acpi/video.c
++++ b/drivers/acpi/video.c
+@@ -82,9 +82,9 @@ module_param(allow_duplicates, bool, 0644);
+ * For Windows 8 systems: used to decide if video module
+ * should skip registering backlight interface of its own.
+ */
+-static int use_native_backlight_param = 1;
++static int use_native_backlight_param = -1;
+ module_param_named(use_native_backlight, use_native_backlight_param, int, 0444);
+-static bool use_native_backlight_dmi = false;
++static bool use_native_backlight_dmi = true;
+
+ static int register_count;
+ static struct mutex video_list_lock;
+@@ -415,6 +415,12 @@ static int __init video_set_use_native_backlight(const struct dmi_system_id *d)
+ return 0;
+ }
+
++static int __init video_disable_native_backlight(const struct dmi_system_id *d)
++{
++ use_native_backlight_dmi = false;
++ return 0;
++}
++
+ static struct dmi_system_id video_dmi_table[] __initdata = {
+ /*
+ * Broken _BQC workaround http://bugzilla.kernel.org/show_bug.cgi?id=13121
+@@ -645,6 +651,41 @@ static struct dmi_system_id video_dmi_table[] __initdata = {
+ DMI_MATCH(DMI_PRODUCT_NAME, "HP EliteBook 8780w"),
+ },
+ },
++
++ /*
++ * These models have a working acpi_video backlight control, and using
++ * native backlight causes a regression where backlight does not work
++ * when userspace is not handling brightness key events. Disable
++ * native_backlight on these to fix this:
++ * https://bugzilla.kernel.org/show_bug.cgi?id=81691
++ */
++ {
++ .callback = video_disable_native_backlight,
++ .ident = "ThinkPad T420",
++ .matches = {
++ DMI_MATCH(DMI_SYS_VENDOR, "LENOVO"),
++ DMI_MATCH(DMI_PRODUCT_VERSION, "ThinkPad T420"),
++ },
++ },
++ {
++ .callback = video_disable_native_backlight,
++ .ident = "ThinkPad T520",
++ .matches = {
++ DMI_MATCH(DMI_SYS_VENDOR, "LENOVO"),
++ DMI_MATCH(DMI_PRODUCT_VERSION, "ThinkPad T520"),
++ },
++ },
++
++ /* The native backlight controls do not work on some older machines */
++ {
++ /* https://bugs.freedesktop.org/show_bug.cgi?id=81515 */
++ .callback = video_disable_native_backlight,
++ .ident = "HP ENVY 15 Notebook",
++ .matches = {
++ DMI_MATCH(DMI_SYS_VENDOR, "Hewlett-Packard"),
++ DMI_MATCH(DMI_PRODUCT_NAME, "HP ENVY 15 Notebook PC"),
++ },
++ },
+ {}
+ };
+
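The video.c hunk extends a DMI quirk table: a `{}`-terminated array of match entries whose callback flips a policy flag when the machine's vendor/product strings match. A compact userspace sketch of that table-plus-callback pattern (the struct and matcher here are simplified stand-ins for `dmi_system_id`/`dmi_check_system`):

```c
#include <assert.h>
#include <string.h>

static int use_native_backlight = 1;

struct quirk {
    const char *vendor;
    const char *product;
    void (*callback)(void);
};

static void disable_native_backlight(void) { use_native_backlight = 0; }

/* NULL-terminated, like the trailing {} entry in video_dmi_table[]. */
static const struct quirk quirks[] = {
    { "LENOVO", "ThinkPad T420", disable_native_backlight },
    { "LENOVO", "ThinkPad T520", disable_native_backlight },
    { NULL, NULL, NULL }
};

static void apply_quirks(const char *vendor, const char *product)
{
    const struct quirk *q;

    for (q = quirks; q->vendor; q++)
        if (!strcmp(q->vendor, vendor) && !strcmp(q->product, product))
            q->callback();
}
```

The design keeps per-machine policy out of the driver logic: adding a model is one table entry, and the default (`use_native_backlight = 1` here) applies to everything unmatched.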
+diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
+index b2c98c1bc037..9dc02c429771 100644
+--- a/drivers/block/rbd.c
++++ b/drivers/block/rbd.c
+@@ -42,6 +42,7 @@
+ #include <linux/blkdev.h>
+ #include <linux/slab.h>
+ #include <linux/idr.h>
++#include <linux/workqueue.h>
+
+ #include "rbd_types.h"
+
+@@ -332,7 +333,10 @@ struct rbd_device {
+
+ char name[DEV_NAME_LEN]; /* blkdev name, e.g. rbd3 */
+
++ struct list_head rq_queue; /* incoming rq queue */
+ spinlock_t lock; /* queue, flags, open_count */
++ struct workqueue_struct *rq_wq;
++ struct work_struct rq_work;
+
+ struct rbd_image_header header;
+ unsigned long flags; /* possibly lock protected */
+@@ -3183,102 +3187,129 @@ out:
+ return ret;
+ }
+
+-static void rbd_request_fn(struct request_queue *q)
+- __releases(q->queue_lock) __acquires(q->queue_lock)
++static void rbd_handle_request(struct rbd_device *rbd_dev, struct request *rq)
+ {
+- struct rbd_device *rbd_dev = q->queuedata;
+- struct request *rq;
++ struct rbd_img_request *img_request;
++ u64 offset = (u64)blk_rq_pos(rq) << SECTOR_SHIFT;
++ u64 length = blk_rq_bytes(rq);
++ bool wr = rq_data_dir(rq) == WRITE;
+ int result;
+
+- while ((rq = blk_fetch_request(q))) {
+- bool write_request = rq_data_dir(rq) == WRITE;
+- struct rbd_img_request *img_request;
+- u64 offset;
+- u64 length;
++ /* Ignore/skip any zero-length requests */
+
+- /* Ignore any non-FS requests that filter through. */
++ if (!length) {
++ dout("%s: zero-length request\n", __func__);
++ result = 0;
++ goto err_rq;
++ }
+
+- if (rq->cmd_type != REQ_TYPE_FS) {
+- dout("%s: non-fs request type %d\n", __func__,
+- (int) rq->cmd_type);
+- __blk_end_request_all(rq, 0);
+- continue;
++ /* Disallow writes to a read-only device */
++
++ if (wr) {
++ if (rbd_dev->mapping.read_only) {
++ result = -EROFS;
++ goto err_rq;
+ }
++ rbd_assert(rbd_dev->spec->snap_id == CEPH_NOSNAP);
++ }
+
+- /* Ignore/skip any zero-length requests */
++ /*
++ * Quit early if the mapped snapshot no longer exists. It's
++ * still possible the snapshot will have disappeared by the
++ * time our request arrives at the osd, but there's no sense in
++ * sending it if we already know.
++ */
++ if (!test_bit(RBD_DEV_FLAG_EXISTS, &rbd_dev->flags)) {
++ dout("request for non-existent snapshot");
++ rbd_assert(rbd_dev->spec->snap_id != CEPH_NOSNAP);
++ result = -ENXIO;
++ goto err_rq;
++ }
+
+- offset = (u64) blk_rq_pos(rq) << SECTOR_SHIFT;
+- length = (u64) blk_rq_bytes(rq);
++ if (offset && length > U64_MAX - offset + 1) {
++ rbd_warn(rbd_dev, "bad request range (%llu~%llu)", offset,
++ length);
++ result = -EINVAL;
++ goto err_rq; /* Shouldn't happen */
++ }
+
+- if (!length) {
+- dout("%s: zero-length request\n", __func__);
+- __blk_end_request_all(rq, 0);
+- continue;
+- }
++ if (offset + length > rbd_dev->mapping.size) {
++ rbd_warn(rbd_dev, "beyond EOD (%llu~%llu > %llu)", offset,
++ length, rbd_dev->mapping.size);
++ result = -EIO;
++ goto err_rq;
++ }
+
+- spin_unlock_irq(q->queue_lock);
++ img_request = rbd_img_request_create(rbd_dev, offset, length, wr);
++ if (!img_request) {
++ result = -ENOMEM;
++ goto err_rq;
++ }
++ img_request->rq = rq;
+
+- /* Disallow writes to a read-only device */
++ result = rbd_img_request_fill(img_request, OBJ_REQUEST_BIO, rq->bio);
++ if (result)
++ goto err_img_request;
+
+- if (write_request) {
+- result = -EROFS;
+- if (rbd_dev->mapping.read_only)
+- goto end_request;
+- rbd_assert(rbd_dev->spec->snap_id == CEPH_NOSNAP);
+- }
++ result = rbd_img_request_submit(img_request);
++ if (result)
++ goto err_img_request;
+
+- /*
+- * Quit early if the mapped snapshot no longer
+- * exists. It's still possible the snapshot will
+- * have disappeared by the time our request arrives
+- * at the osd, but there's no sense in sending it if
+- * we already know.
+- */
+- if (!test_bit(RBD_DEV_FLAG_EXISTS, &rbd_dev->flags)) {
+- dout("request for non-existent snapshot");
+- rbd_assert(rbd_dev->spec->snap_id != CEPH_NOSNAP);
+- result = -ENXIO;
+- goto end_request;
+- }
++ return;
+
+- result = -EINVAL;
+- if (offset && length > U64_MAX - offset + 1) {
+- rbd_warn(rbd_dev, "bad request range (%llu~%llu)\n",
+- offset, length);
+- goto end_request; /* Shouldn't happen */
+- }
++err_img_request:
++ rbd_img_request_put(img_request);
++err_rq:
++ if (result)
++ rbd_warn(rbd_dev, "%s %llx at %llx result %d",
++ wr ? "write" : "read", length, offset, result);
++ blk_end_request_all(rq, result);
++}
+
+- result = -EIO;
+- if (offset + length > rbd_dev->mapping.size) {
+- rbd_warn(rbd_dev, "beyond EOD (%llu~%llu > %llu)\n",
+- offset, length, rbd_dev->mapping.size);
+- goto end_request;
+- }
++static void rbd_request_workfn(struct work_struct *work)
++{
++ struct rbd_device *rbd_dev =
++ container_of(work, struct rbd_device, rq_work);
++ struct request *rq, *next;
++ LIST_HEAD(requests);
+
+- result = -ENOMEM;
+- img_request = rbd_img_request_create(rbd_dev, offset, length,
+- write_request);
+- if (!img_request)
+- goto end_request;
++ spin_lock_irq(&rbd_dev->lock); /* rq->q->queue_lock */
++ list_splice_init(&rbd_dev->rq_queue, &requests);
++ spin_unlock_irq(&rbd_dev->lock);
+
+- img_request->rq = rq;
++ list_for_each_entry_safe(rq, next, &requests, queuelist) {
++ list_del_init(&rq->queuelist);
++ rbd_handle_request(rbd_dev, rq);
++ }
++}
+
+- result = rbd_img_request_fill(img_request, OBJ_REQUEST_BIO,
+- rq->bio);
+- if (!result)
+- result = rbd_img_request_submit(img_request);
+- if (result)
+- rbd_img_request_put(img_request);
+-end_request:
+- spin_lock_irq(q->queue_lock);
+- if (result < 0) {
+- rbd_warn(rbd_dev, "%s %llx at %llx result %d\n",
+- write_request ? "write" : "read",
+- length, offset, result);
+-
+- __blk_end_request_all(rq, result);
++/*
++ * Called with q->queue_lock held and interrupts disabled, possibly on
++ * the way to schedule(). Do not sleep here!
++ */
++static void rbd_request_fn(struct request_queue *q)
++{
++ struct rbd_device *rbd_dev = q->queuedata;
++ struct request *rq;
++ int queued = 0;
++
++ rbd_assert(rbd_dev);
++
++ while ((rq = blk_fetch_request(q))) {
++ /* Ignore any non-FS requests that filter through. */
++ if (rq->cmd_type != REQ_TYPE_FS) {
++ dout("%s: non-fs request type %d\n", __func__,
++ (int) rq->cmd_type);
++ __blk_end_request_all(rq, 0);
++ continue;
+ }
++
++ list_add_tail(&rq->queuelist, &rbd_dev->rq_queue);
++ queued++;
+ }
++
++ if (queued)
++ queue_work(rbd_dev->rq_wq, &rbd_dev->rq_work);
+ }
+
+ /*
+@@ -3848,6 +3879,8 @@ static struct rbd_device *rbd_dev_create(struct rbd_client *rbdc,
+ return NULL;
+
+ spin_lock_init(&rbd_dev->lock);
++ INIT_LIST_HEAD(&rbd_dev->rq_queue);
++ INIT_WORK(&rbd_dev->rq_work, rbd_request_workfn);
+ rbd_dev->flags = 0;
+ atomic_set(&rbd_dev->parent_ref, 0);
+ INIT_LIST_HEAD(&rbd_dev->node);
+@@ -5066,12 +5099,17 @@ static int rbd_dev_device_setup(struct rbd_device *rbd_dev)
+ ret = rbd_dev_mapping_set(rbd_dev);
+ if (ret)
+ goto err_out_disk;
++
+ set_capacity(rbd_dev->disk, rbd_dev->mapping.size / SECTOR_SIZE);
+ set_disk_ro(rbd_dev->disk, rbd_dev->mapping.read_only);
+
++ rbd_dev->rq_wq = alloc_workqueue(rbd_dev->disk->disk_name, 0, 0);
++ if (!rbd_dev->rq_wq)
++ goto err_out_mapping;
++
+ ret = rbd_bus_add_dev(rbd_dev);
+ if (ret)
+- goto err_out_mapping;
++ goto err_out_workqueue;
+
+ /* Everything's ready. Announce the disk to the world. */
+
+@@ -5083,6 +5121,9 @@ static int rbd_dev_device_setup(struct rbd_device *rbd_dev)
+
+ return ret;
+
++err_out_workqueue:
++ destroy_workqueue(rbd_dev->rq_wq);
++ rbd_dev->rq_wq = NULL;
+ err_out_mapping:
+ rbd_dev_mapping_clear(rbd_dev);
+ err_out_disk:
+@@ -5314,6 +5355,7 @@ static void rbd_dev_device_release(struct device *dev)
+ {
+ struct rbd_device *rbd_dev = dev_to_rbd_dev(dev);
+
++ destroy_workqueue(rbd_dev->rq_wq);
+ rbd_free_disk(rbd_dev);
+ clear_bit(RBD_DEV_FLAG_EXISTS, &rbd_dev->flags);
+ rbd_dev_mapping_clear(rbd_dev);
+diff --git a/drivers/bluetooth/btmrvl_drv.h b/drivers/bluetooth/btmrvl_drv.h
+index dc79f88f8717..54d9f2e73495 100644
+--- a/drivers/bluetooth/btmrvl_drv.h
++++ b/drivers/bluetooth/btmrvl_drv.h
+@@ -68,6 +68,7 @@ struct btmrvl_adapter {
+ u8 hs_state;
+ u8 wakeup_tries;
+ wait_queue_head_t cmd_wait_q;
++ wait_queue_head_t event_hs_wait_q;
+ u8 cmd_complete;
+ bool is_suspended;
+ };
+diff --git a/drivers/bluetooth/btmrvl_main.c b/drivers/bluetooth/btmrvl_main.c
+index e9dbddb0b8f1..3ecba5c979bd 100644
+--- a/drivers/bluetooth/btmrvl_main.c
++++ b/drivers/bluetooth/btmrvl_main.c
+@@ -114,6 +114,7 @@ int btmrvl_process_event(struct btmrvl_private *priv, struct sk_buff *skb)
+ adapter->hs_state = HS_ACTIVATED;
+ if (adapter->psmode)
+ adapter->ps_state = PS_SLEEP;
++ wake_up_interruptible(&adapter->event_hs_wait_q);
+ BT_DBG("HS ACTIVATED!");
+ } else {
+ BT_DBG("HS Enable failed");
+@@ -253,11 +254,31 @@ EXPORT_SYMBOL_GPL(btmrvl_enable_ps);
+
+ int btmrvl_enable_hs(struct btmrvl_private *priv)
+ {
++ struct btmrvl_adapter *adapter = priv->adapter;
+ int ret;
+
+ ret = btmrvl_send_sync_cmd(priv, BT_CMD_HOST_SLEEP_ENABLE, NULL, 0);
+- if (ret)
++ if (ret) {
+ BT_ERR("Host sleep enable command failed\n");
++ return ret;
++ }
++
++ ret = wait_event_interruptible_timeout(adapter->event_hs_wait_q,
++ adapter->hs_state,
++ msecs_to_jiffies(WAIT_UNTIL_HS_STATE_CHANGED));
++ if (ret < 0) {
++ BT_ERR("event_hs_wait_q terminated (%d): %d,%d,%d",
++ ret, adapter->hs_state, adapter->ps_state,
++ adapter->wakeup_tries);
++ } else if (!ret) {
++ BT_ERR("hs_enable timeout: %d,%d,%d", adapter->hs_state,
++ adapter->ps_state, adapter->wakeup_tries);
++ ret = -ETIMEDOUT;
++ } else {
++ BT_DBG("host sleep enabled: %d,%d,%d", adapter->hs_state,
++ adapter->ps_state, adapter->wakeup_tries);
++ ret = 0;
++ }
+
+ return ret;
+ }
+@@ -358,6 +379,7 @@ static void btmrvl_init_adapter(struct btmrvl_private *priv)
+ }
+
+ init_waitqueue_head(&priv->adapter->cmd_wait_q);
++ init_waitqueue_head(&priv->adapter->event_hs_wait_q);
+ }
+
+ static void btmrvl_free_adapter(struct btmrvl_private *priv)
+@@ -666,6 +688,7 @@ int btmrvl_remove_card(struct btmrvl_private *priv)
+ hdev = priv->btmrvl_dev.hcidev;
+
+ wake_up_interruptible(&priv->adapter->cmd_wait_q);
++ wake_up_interruptible(&priv->adapter->event_hs_wait_q);
+
+ kthread_stop(priv->main_thread.task);
+
+diff --git a/drivers/char/tpm/tpm-interface.c b/drivers/char/tpm/tpm-interface.c
+index 62e10fd1e1cb..6af17002a115 100644
+--- a/drivers/char/tpm/tpm-interface.c
++++ b/drivers/char/tpm/tpm-interface.c
+@@ -491,11 +491,10 @@ static int tpm_startup(struct tpm_chip *chip, __be16 startup_type)
+ int tpm_get_timeouts(struct tpm_chip *chip)
+ {
+ struct tpm_cmd_t tpm_cmd;
+- struct timeout_t *timeout_cap;
++ unsigned long new_timeout[4];
++ unsigned long old_timeout[4];
+ struct duration_t *duration_cap;
+ ssize_t rc;
+- u32 timeout;
+- unsigned int scale = 1;
+
+ tpm_cmd.header.in = tpm_getcap_header;
+ tpm_cmd.params.getcap_in.cap = TPM_CAP_PROP;
+@@ -529,25 +528,46 @@ int tpm_get_timeouts(struct tpm_chip *chip)
+ != sizeof(tpm_cmd.header.out) + sizeof(u32) + 4 * sizeof(u32))
+ return -EINVAL;
+
+- timeout_cap = &tpm_cmd.params.getcap_out.cap.timeout;
+- /* Don't overwrite default if value is 0 */
+- timeout = be32_to_cpu(timeout_cap->a);
+- if (timeout && timeout < 1000) {
+- /* timeouts in msec rather usec */
+- scale = 1000;
+- chip->vendor.timeout_adjusted = true;
++ old_timeout[0] = be32_to_cpu(tpm_cmd.params.getcap_out.cap.timeout.a);
++ old_timeout[1] = be32_to_cpu(tpm_cmd.params.getcap_out.cap.timeout.b);
++ old_timeout[2] = be32_to_cpu(tpm_cmd.params.getcap_out.cap.timeout.c);
++ old_timeout[3] = be32_to_cpu(tpm_cmd.params.getcap_out.cap.timeout.d);
++ memcpy(new_timeout, old_timeout, sizeof(new_timeout));
++
++ /*
++ * Provide ability for vendor overrides of timeout values in case
++ * of misreporting.
++ */
++ if (chip->ops->update_timeouts != NULL)
++ chip->vendor.timeout_adjusted =
++ chip->ops->update_timeouts(chip, new_timeout);
++
++ if (!chip->vendor.timeout_adjusted) {
++ /* Don't overwrite default if value is 0 */
++ if (new_timeout[0] != 0 && new_timeout[0] < 1000) {
++ int i;
++
++ /* timeouts in msec rather usec */
++ for (i = 0; i != ARRAY_SIZE(new_timeout); i++)
++ new_timeout[i] *= 1000;
++ chip->vendor.timeout_adjusted = true;
++ }
++ }
++
++ /* Report adjusted timeouts */
++ if (chip->vendor.timeout_adjusted) {
++ dev_info(chip->dev,
++ HW_ERR "Adjusting reported timeouts: A %lu->%luus B %lu->%luus C %lu->%luus D %lu->%luus\n",
++ old_timeout[0], new_timeout[0],
++ old_timeout[1], new_timeout[1],
++ old_timeout[2], new_timeout[2],
++ old_timeout[3], new_timeout[3]);
+ }
+- if (timeout)
+- chip->vendor.timeout_a = usecs_to_jiffies(timeout * scale);
+- timeout = be32_to_cpu(timeout_cap->b);
+- if (timeout)
+- chip->vendor.timeout_b = usecs_to_jiffies(timeout * scale);
+- timeout = be32_to_cpu(timeout_cap->c);
+- if (timeout)
+- chip->vendor.timeout_c = usecs_to_jiffies(timeout * scale);
+- timeout = be32_to_cpu(timeout_cap->d);
+- if (timeout)
+- chip->vendor.timeout_d = usecs_to_jiffies(timeout * scale);
++
++ chip->vendor.timeout_a = usecs_to_jiffies(new_timeout[0]);
++ chip->vendor.timeout_b = usecs_to_jiffies(new_timeout[1]);
++ chip->vendor.timeout_c = usecs_to_jiffies(new_timeout[2]);
++ chip->vendor.timeout_d = usecs_to_jiffies(new_timeout[3]);
+
+ duration:
+ tpm_cmd.header.in = tpm_getcap_header;
+@@ -991,13 +1011,13 @@ int tpm_get_random(u32 chip_num, u8 *out, size_t max)
+ int err, total = 0, retries = 5;
+ u8 *dest = out;
+
++ if (!out || !num_bytes || max > TPM_MAX_RNG_DATA)
++ return -EINVAL;
++
+ chip = tpm_chip_find_get(chip_num);
+ if (chip == NULL)
+ return -ENODEV;
+
+- if (!out || !num_bytes || max > TPM_MAX_RNG_DATA)
+- return -EINVAL;
+-
+ do {
+ tpm_cmd.header.in = tpm_getrandom_header;
+ tpm_cmd.params.getrandom_in.num_bytes = cpu_to_be32(num_bytes);
+@@ -1016,6 +1036,7 @@ int tpm_get_random(u32 chip_num, u8 *out, size_t max)
+ num_bytes -= recd;
+ } while (retries-- && total < max);
+
++ tpm_chip_put(chip);
+ return total ? total : -EIO;
+ }
+ EXPORT_SYMBOL_GPL(tpm_get_random);
+@@ -1095,7 +1116,7 @@ struct tpm_chip *tpm_register_hardware(struct device *dev,
+ goto del_misc;
+
+ if (tpm_add_ppi(&dev->kobj))
+- goto del_misc;
++ goto del_sysfs;
+
+ chip->bios_dir = tpm_bios_log_setup(chip->devname);
+
+@@ -1106,6 +1127,8 @@ struct tpm_chip *tpm_register_hardware(struct device *dev,
+
+ return chip;
+
++del_sysfs:
++ tpm_sysfs_del_device(chip);
+ del_misc:
+ tpm_dev_del_device(chip);
+ put_device:
+diff --git a/drivers/char/tpm/tpm_tis.c b/drivers/char/tpm/tpm_tis.c
+index a9ed2270c25d..2c46734b266d 100644
+--- a/drivers/char/tpm/tpm_tis.c
++++ b/drivers/char/tpm/tpm_tis.c
+@@ -373,6 +373,36 @@ out_err:
+ return rc;
+ }
+
++struct tis_vendor_timeout_override {
++ u32 did_vid;
++ unsigned long timeout_us[4];
++};
++
++static const struct tis_vendor_timeout_override vendor_timeout_overrides[] = {
++ /* Atmel 3204 */
++ { 0x32041114, { (TIS_SHORT_TIMEOUT*1000), (TIS_LONG_TIMEOUT*1000),
++ (TIS_SHORT_TIMEOUT*1000), (TIS_SHORT_TIMEOUT*1000) } },
++};
++
++static bool tpm_tis_update_timeouts(struct tpm_chip *chip,
++ unsigned long *timeout_cap)
++{
++ int i;
++ u32 did_vid;
++
++ did_vid = ioread32(chip->vendor.iobase + TPM_DID_VID(0));
++
++ for (i = 0; i != ARRAY_SIZE(vendor_timeout_overrides); i++) {
++ if (vendor_timeout_overrides[i].did_vid != did_vid)
++ continue;
++ memcpy(timeout_cap, vendor_timeout_overrides[i].timeout_us,
++ sizeof(vendor_timeout_overrides[i].timeout_us));
++ return true;
++ }
++
++ return false;
++}
++
+ /*
+ * Early probing for iTPM with STS_DATA_EXPECT flaw.
+ * Try sending command without itpm flag set and if that
+@@ -437,6 +467,7 @@ static const struct tpm_class_ops tpm_tis = {
+ .recv = tpm_tis_recv,
+ .send = tpm_tis_send,
+ .cancel = tpm_tis_ready,
++ .update_timeouts = tpm_tis_update_timeouts,
+ .req_complete_mask = TPM_STS_DATA_AVAIL | TPM_STS_VALID,
+ .req_complete_val = TPM_STS_DATA_AVAIL | TPM_STS_VALID,
+ .req_canceled = tpm_tis_req_canceled,
+diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c
+index bb1d08dc8cc8..379c0837f5a9 100644
+--- a/drivers/cpufreq/powernv-cpufreq.c
++++ b/drivers/cpufreq/powernv-cpufreq.c
+@@ -28,6 +28,7 @@
+ #include <linux/of.h>
+
+ #include <asm/cputhreads.h>
++#include <asm/firmware.h>
+ #include <asm/reg.h>
+ #include <asm/smp.h> /* Required for cpu_sibling_mask() in UP configs */
+
+@@ -98,7 +99,11 @@ static int init_powernv_pstates(void)
+ return -ENODEV;
+ }
+
+- WARN_ON(len_ids != len_freqs);
++ if (len_ids != len_freqs) {
++ pr_warn("Entries in ibm,pstate-ids and "
++ "ibm,pstate-frequencies-mhz does not match\n");
++ }
++
+ nr_pstates = min(len_ids, len_freqs) / sizeof(u32);
+ if (!nr_pstates) {
+ pr_warn("No PStates found\n");
+@@ -131,7 +136,12 @@ static unsigned int pstate_id_to_freq(int pstate_id)
+ int i;
+
+ i = powernv_pstate_info.max - pstate_id;
+- BUG_ON(i >= powernv_pstate_info.nr_pstates || i < 0);
++ if (i >= powernv_pstate_info.nr_pstates || i < 0) {
++ pr_warn("PState id %d outside of PState table, "
++ "reporting nominal id %d instead\n",
++ pstate_id, powernv_pstate_info.nominal);
++ i = powernv_pstate_info.max - powernv_pstate_info.nominal;
++ }
+
+ return powernv_freqs[i].frequency;
+ }
+@@ -321,6 +331,10 @@ static int __init powernv_cpufreq_init(void)
+ {
+ int rc = 0;
+
++ /* Don't probe on pseries (guest) platforms */
++ if (!firmware_has_feature(FW_FEATURE_OPALv3))
++ return -ENODEV;
++
+ /* Discover pstates from device tree and init */
+ rc = init_powernv_pstates();
+ if (rc) {
+diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c
+index 74f5788d50b1..a64be578dab2 100644
+--- a/drivers/cpuidle/cpuidle-powernv.c
++++ b/drivers/cpuidle/cpuidle-powernv.c
+@@ -160,10 +160,10 @@ static int powernv_cpuidle_driver_init(void)
+ static int powernv_add_idle_states(void)
+ {
+ struct device_node *power_mgt;
+- struct property *prop;
+ int nr_idle_states = 1; /* Snooze */
+ int dt_idle_states;
+- u32 *flags;
++ const __be32 *idle_state_flags;
++ u32 len_flags, flags;
+ int i;
+
+ /* Currently we have snooze statically defined */
+@@ -174,18 +174,18 @@ static int powernv_add_idle_states(void)
+ return nr_idle_states;
+ }
+
+- prop = of_find_property(power_mgt, "ibm,cpu-idle-state-flags", NULL);
+- if (!prop) {
++ idle_state_flags = of_get_property(power_mgt, "ibm,cpu-idle-state-flags", &len_flags);
++ if (!idle_state_flags) {
+ pr_warn("DT-PowerMgmt: missing ibm,cpu-idle-state-flags\n");
+ return nr_idle_states;
+ }
+
+- dt_idle_states = prop->length / sizeof(u32);
+- flags = (u32 *) prop->value;
++ dt_idle_states = len_flags / sizeof(u32);
+
+ for (i = 0; i < dt_idle_states; i++) {
+
+- if (flags[i] & IDLE_USE_INST_NAP) {
++ flags = be32_to_cpu(idle_state_flags[i]);
++ if (flags & IDLE_USE_INST_NAP) {
+ /* Add NAP state */
+ strcpy(powernv_states[nr_idle_states].name, "Nap");
+ strcpy(powernv_states[nr_idle_states].desc, "Nap");
+@@ -196,7 +196,7 @@ static int powernv_add_idle_states(void)
+ nr_idle_states++;
+ }
+
+- if (flags[i] & IDLE_USE_INST_SLEEP) {
++ if (flags & IDLE_USE_INST_SLEEP) {
+ /* Add FASTSLEEP state */
+ strcpy(powernv_states[nr_idle_states].name, "FastSleep");
+ strcpy(powernv_states[nr_idle_states].desc, "FastSleep");
+diff --git a/drivers/firmware/efi/vars.c b/drivers/firmware/efi/vars.c
+index f0a43646a2f3..5abe943e3404 100644
+--- a/drivers/firmware/efi/vars.c
++++ b/drivers/firmware/efi/vars.c
+@@ -481,7 +481,7 @@ EXPORT_SYMBOL_GPL(efivar_entry_remove);
+ */
+ static void efivar_entry_list_del_unlock(struct efivar_entry *entry)
+ {
+- WARN_ON(!spin_is_locked(&__efivars->lock));
++ lockdep_assert_held(&__efivars->lock);
+
+ list_del(&entry->list);
+ spin_unlock_irq(&__efivars->lock);
+@@ -507,7 +507,7 @@ int __efivar_entry_delete(struct efivar_entry *entry)
+ const struct efivar_operations *ops = __efivars->ops;
+ efi_status_t status;
+
+- WARN_ON(!spin_is_locked(&__efivars->lock));
++ lockdep_assert_held(&__efivars->lock);
+
+ status = ops->set_variable(entry->var.VariableName,
+ &entry->var.VendorGuid,
+@@ -667,7 +667,7 @@ struct efivar_entry *efivar_entry_find(efi_char16_t *name, efi_guid_t guid,
+ int strsize1, strsize2;
+ bool found = false;
+
+- WARN_ON(!spin_is_locked(&__efivars->lock));
++ lockdep_assert_held(&__efivars->lock);
+
+ list_for_each_entry_safe(entry, n, head, list) {
+ strsize1 = ucs2_strsize(name, 1024);
+@@ -739,7 +739,7 @@ int __efivar_entry_get(struct efivar_entry *entry, u32 *attributes,
+ const struct efivar_operations *ops = __efivars->ops;
+ efi_status_t status;
+
+- WARN_ON(!spin_is_locked(&__efivars->lock));
++ lockdep_assert_held(&__efivars->lock);
+
+ status = ops->get_variable(entry->var.VariableName,
+ &entry->var.VendorGuid,
+diff --git a/drivers/gpu/drm/nouveau/nouveau_display.c b/drivers/gpu/drm/nouveau/nouveau_display.c
+index 47ad74255bf1..dd469dbeaae1 100644
+--- a/drivers/gpu/drm/nouveau/nouveau_display.c
++++ b/drivers/gpu/drm/nouveau/nouveau_display.c
+@@ -404,6 +404,11 @@ nouveau_display_fini(struct drm_device *dev)
+ {
+ struct nouveau_display *disp = nouveau_display(dev);
+ struct drm_connector *connector;
++ int head;
++
++ /* Make sure that drm and hw vblank irqs get properly disabled. */
++ for (head = 0; head < dev->mode_config.num_crtc; head++)
++ drm_vblank_off(dev, head);
+
+ /* disable hotplug interrupts */
+ list_for_each_entry(connector, &dev->mode_config.connector_list, head) {
+@@ -620,6 +625,8 @@ void
+ nouveau_display_resume(struct drm_device *dev)
+ {
+ struct drm_crtc *crtc;
++ int head;
++
+ nouveau_display_init(dev);
+
+ /* Force CLUT to get re-loaded during modeset */
+@@ -629,6 +636,10 @@ nouveau_display_resume(struct drm_device *dev)
+ nv_crtc->lut.depth = 0;
+ }
+
++ /* Make sure that drm and hw vblank irqs get resumed if needed. */
++ for (head = 0; head < dev->mode_config.num_crtc; head++)
++ drm_vblank_on(dev, head);
++
+ drm_helper_resume_force_mode(dev);
+
+ list_for_each_entry(crtc, &dev->mode_config.crtc_list, head) {
+diff --git a/drivers/gpu/drm/nouveau/nouveau_drm.h b/drivers/gpu/drm/nouveau/nouveau_drm.h
+index 7efbafaf7c1d..b628addcdf69 100644
+--- a/drivers/gpu/drm/nouveau/nouveau_drm.h
++++ b/drivers/gpu/drm/nouveau/nouveau_drm.h
+@@ -10,7 +10,7 @@
+
+ #define DRIVER_MAJOR 1
+ #define DRIVER_MINOR 1
+-#define DRIVER_PATCHLEVEL 1
++#define DRIVER_PATCHLEVEL 2
+
+ /*
+ * 1.1.1:
+@@ -21,6 +21,8 @@
+ * to control registers on the MPs to enable performance counters,
+ * and to control the warp error enable mask (OpenGL requires out of
+ * bounds access to local memory to be silently ignored / return 0).
++ * 1.1.2:
++ * - fixes multiple bugs in flip completion events and timestamping
+ */
+
+ #include <core/client.h>
+diff --git a/drivers/gpu/drm/radeon/cik.c b/drivers/gpu/drm/radeon/cik.c
+index 767f2cc44bd8..65a8cca603a4 100644
+--- a/drivers/gpu/drm/radeon/cik.c
++++ b/drivers/gpu/drm/radeon/cik.c
+@@ -7901,6 +7901,7 @@ restart_ih:
+ static int cik_startup(struct radeon_device *rdev)
+ {
+ struct radeon_ring *ring;
++ u32 nop;
+ int r;
+
+ /* enable pcie gen2/3 link */
+@@ -8034,9 +8035,15 @@ static int cik_startup(struct radeon_device *rdev)
+ }
+ cik_irq_set(rdev);
+
++ if (rdev->family == CHIP_HAWAII) {
++ nop = RADEON_CP_PACKET2;
++ } else {
++ nop = PACKET3(PACKET3_NOP, 0x3FFF);
++ }
++
+ ring = &rdev->ring[RADEON_RING_TYPE_GFX_INDEX];
+ r = radeon_ring_init(rdev, ring, ring->ring_size, RADEON_WB_CP_RPTR_OFFSET,
+- PACKET3(PACKET3_NOP, 0x3FFF));
++ nop);
+ if (r)
+ return r;
+
+@@ -8044,7 +8051,7 @@ static int cik_startup(struct radeon_device *rdev)
+ /* type-2 packets are deprecated on MEC, use type-3 instead */
+ ring = &rdev->ring[CAYMAN_RING_TYPE_CP1_INDEX];
+ r = radeon_ring_init(rdev, ring, ring->ring_size, RADEON_WB_CP1_RPTR_OFFSET,
+- PACKET3(PACKET3_NOP, 0x3FFF));
++ nop);
+ if (r)
+ return r;
+ ring->me = 1; /* first MEC */
+@@ -8055,7 +8062,7 @@ static int cik_startup(struct radeon_device *rdev)
+ /* type-2 packets are deprecated on MEC, use type-3 instead */
+ ring = &rdev->ring[CAYMAN_RING_TYPE_CP2_INDEX];
+ r = radeon_ring_init(rdev, ring, ring->ring_size, RADEON_WB_CP2_RPTR_OFFSET,
+- PACKET3(PACKET3_NOP, 0x3FFF));
++ nop);
+ if (r)
+ return r;
+ /* dGPU only have 1 MEC */
+diff --git a/drivers/infiniband/core/iwcm.c b/drivers/infiniband/core/iwcm.c
+index 3d2e489ab732..ff9163dc1596 100644
+--- a/drivers/infiniband/core/iwcm.c
++++ b/drivers/infiniband/core/iwcm.c
+@@ -46,6 +46,7 @@
+ #include <linux/completion.h>
+ #include <linux/slab.h>
+ #include <linux/module.h>
++#include <linux/sysctl.h>
+
+ #include <rdma/iw_cm.h>
+ #include <rdma/ib_addr.h>
+@@ -65,6 +66,20 @@ struct iwcm_work {
+ struct list_head free_list;
+ };
+
++static unsigned int default_backlog = 256;
++
++static struct ctl_table_header *iwcm_ctl_table_hdr;
++static struct ctl_table iwcm_ctl_table[] = {
++ {
++ .procname = "default_backlog",
++ .data = &default_backlog,
++ .maxlen = sizeof(default_backlog),
++ .mode = 0644,
++ .proc_handler = proc_dointvec,
++ },
++ { }
++};
++
+ /*
+ * The following services provide a mechanism for pre-allocating iwcm_work
+ * elements. The design pre-allocates them based on the cm_id type:
+@@ -425,6 +440,9 @@ int iw_cm_listen(struct iw_cm_id *cm_id, int backlog)
+
+ cm_id_priv = container_of(cm_id, struct iwcm_id_private, id);
+
++ if (!backlog)
++ backlog = default_backlog;
++
+ ret = alloc_work_entries(cm_id_priv, backlog);
+ if (ret)
+ return ret;
+@@ -1030,11 +1048,20 @@ static int __init iw_cm_init(void)
+ if (!iwcm_wq)
+ return -ENOMEM;
+
++ iwcm_ctl_table_hdr = register_net_sysctl(&init_net, "net/iw_cm",
++ iwcm_ctl_table);
++ if (!iwcm_ctl_table_hdr) {
++ pr_err("iw_cm: couldn't register sysctl paths\n");
++ destroy_workqueue(iwcm_wq);
++ return -ENOMEM;
++ }
++
+ return 0;
+ }
+
+ static void __exit iw_cm_cleanup(void)
+ {
++ unregister_net_sysctl_table(iwcm_ctl_table_hdr);
+ destroy_workqueue(iwcm_wq);
+ }
+
+diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
+index e3c2c5b4297f..767000811cf9 100644
+--- a/drivers/infiniband/ulp/srp/ib_srp.c
++++ b/drivers/infiniband/ulp/srp/ib_srp.c
+@@ -130,6 +130,7 @@ static void srp_send_completion(struct ib_cq *cq, void *target_ptr);
+ static int srp_cm_handler(struct ib_cm_id *cm_id, struct ib_cm_event *event);
+
+ static struct scsi_transport_template *ib_srp_transport_template;
++static struct workqueue_struct *srp_remove_wq;
+
+ static struct ib_client srp_client = {
+ .name = "srp",
+@@ -731,7 +732,7 @@ static bool srp_queue_remove_work(struct srp_target_port *target)
+ spin_unlock_irq(&target->lock);
+
+ if (changed)
+- queue_work(system_long_wq, &target->remove_work);
++ queue_work(srp_remove_wq, &target->remove_work);
+
+ return changed;
+ }
+@@ -3261,9 +3262,10 @@ static void srp_remove_one(struct ib_device *device)
+ spin_unlock(&host->target_lock);
+
+ /*
+- * Wait for target port removal tasks.
++ * Wait for tl_err and target port removal tasks.
+ */
+ flush_workqueue(system_long_wq);
++ flush_workqueue(srp_remove_wq);
+
+ kfree(host);
+ }
+@@ -3313,16 +3315,22 @@ static int __init srp_init_module(void)
+ indirect_sg_entries = cmd_sg_entries;
+ }
+
++ srp_remove_wq = create_workqueue("srp_remove");
++ if (IS_ERR(srp_remove_wq)) {
++ ret = PTR_ERR(srp_remove_wq);
++ goto out;
++ }
++
++ ret = -ENOMEM;
+ ib_srp_transport_template =
+ srp_attach_transport(&ib_srp_transport_functions);
+ if (!ib_srp_transport_template)
+- return -ENOMEM;
++ goto destroy_wq;
+
+ ret = class_register(&srp_class);
+ if (ret) {
+ pr_err("couldn't register class infiniband_srp\n");
+- srp_release_transport(ib_srp_transport_template);
+- return ret;
++ goto release_tr;
+ }
+
+ ib_sa_register_client(&srp_sa_client);
+@@ -3330,13 +3338,22 @@ static int __init srp_init_module(void)
+ ret = ib_register_client(&srp_client);
+ if (ret) {
+ pr_err("couldn't register IB client\n");
+- srp_release_transport(ib_srp_transport_template);
+- ib_sa_unregister_client(&srp_sa_client);
+- class_unregister(&srp_class);
+- return ret;
++ goto unreg_sa;
+ }
+
+- return 0;
++out:
++ return ret;
++
++unreg_sa:
++ ib_sa_unregister_client(&srp_sa_client);
++ class_unregister(&srp_class);
++
++release_tr:
++ srp_release_transport(ib_srp_transport_template);
++
++destroy_wq:
++ destroy_workqueue(srp_remove_wq);
++ goto out;
+ }
+
+ static void __exit srp_cleanup_module(void)
+@@ -3345,6 +3362,7 @@ static void __exit srp_cleanup_module(void)
+ ib_sa_unregister_client(&srp_sa_client);
+ class_unregister(&srp_class);
+ srp_release_transport(ib_srp_transport_template);
++ destroy_workqueue(srp_remove_wq);
+ }
+
+ module_init(srp_init_module);
+diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
+index 4aec6a29e316..710ffa1830ae 100644
+--- a/drivers/iommu/amd_iommu.c
++++ b/drivers/iommu/amd_iommu.c
+@@ -3227,14 +3227,16 @@ free_domains:
+
+ static void cleanup_domain(struct protection_domain *domain)
+ {
+- struct iommu_dev_data *dev_data, *next;
++ struct iommu_dev_data *entry;
+ unsigned long flags;
+
+ write_lock_irqsave(&amd_iommu_devtable_lock, flags);
+
+- list_for_each_entry_safe(dev_data, next, &domain->dev_list, list) {
+- __detach_device(dev_data);
+- atomic_set(&dev_data->bind, 0);
++ while (!list_empty(&domain->dev_list)) {
++ entry = list_first_entry(&domain->dev_list,
++ struct iommu_dev_data, list);
++ __detach_device(entry);
++ atomic_set(&entry->bind, 0);
+ }
+
+ write_unlock_irqrestore(&amd_iommu_devtable_lock, flags);
+diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
+index 51b6b77dc3e5..382c1801a8f1 100644
+--- a/drivers/iommu/intel-iommu.c
++++ b/drivers/iommu/intel-iommu.c
+@@ -2523,22 +2523,46 @@ static bool device_has_rmrr(struct device *dev)
+ return false;
+ }
+
++/*
++ * There are a couple cases where we need to restrict the functionality of
++ * devices associated with RMRRs. The first is when evaluating a device for
++ * identity mapping because problems exist when devices are moved in and out
++ * of domains and their respective RMRR information is lost. This means that
++ * a device with associated RMRRs will never be in a "passthrough" domain.
++ * The second is use of the device through the IOMMU API. This interface
++ * expects to have full control of the IOVA space for the device. We cannot
++ * satisfy both the requirement that RMRR access is maintained and have an
++ * unencumbered IOVA space. We also have no ability to quiesce the device's
++ * use of the RMRR space or even inform the IOMMU API user of the restriction.
++ * We therefore prevent devices associated with an RMRR from participating in
++ * the IOMMU API, which eliminates them from device assignment.
++ *
++ * In both cases we assume that PCI USB devices with RMRRs have them largely
++ * for historical reasons and that the RMRR space is not actively used post
++ * boot. This exclusion may change if vendors begin to abuse it.
++ */
++static bool device_is_rmrr_locked(struct device *dev)
++{
++ if (!device_has_rmrr(dev))
++ return false;
++
++ if (dev_is_pci(dev)) {
++ struct pci_dev *pdev = to_pci_dev(dev);
++
++ if ((pdev->class >> 8) == PCI_CLASS_SERIAL_USB)
++ return false;
++ }
++
++ return true;
++}
++
+ static int iommu_should_identity_map(struct device *dev, int startup)
+ {
+
+ if (dev_is_pci(dev)) {
+ struct pci_dev *pdev = to_pci_dev(dev);
+
+- /*
+- * We want to prevent any device associated with an RMRR from
+- * getting placed into the SI Domain. This is done because
+- * problems exist when devices are moved in and out of domains
+- * and their respective RMRR info is lost. We exempt USB devices
+- * from this process due to their usage of RMRRs that are known
+- * to not be needed after BIOS hand-off to OS.
+- */
+- if (device_has_rmrr(dev) &&
+- (pdev->class >> 8) != PCI_CLASS_SERIAL_USB)
++ if (device_is_rmrr_locked(dev))
+ return 0;
+
+ if ((iommu_identity_mapping & IDENTMAP_AZALIA) && IS_AZALIA(pdev))
+@@ -3867,6 +3891,14 @@ static int device_notifier(struct notifier_block *nb,
+ action != BUS_NOTIFY_DEL_DEVICE)
+ return 0;
+
++ /*
++ * If the device is still attached to a device driver we can't
++ * tear down the domain yet as DMA mappings may still be in use.
++ * Wait for the BUS_NOTIFY_UNBOUND_DRIVER event to do that.
++ */
++ if (action == BUS_NOTIFY_DEL_DEVICE && dev->driver != NULL)
++ return 0;
++
+ domain = find_domain(dev);
+ if (!domain)
+ return 0;
+@@ -4202,6 +4234,11 @@ static int intel_iommu_attach_device(struct iommu_domain *domain,
+ int addr_width;
+ u8 bus, devfn;
+
++ if (device_is_rmrr_locked(dev)) {
++ dev_warn(dev, "Device is ineligible for IOMMU domain attach due to platform RMRR requirement. Contact your platform vendor.\n");
++ return -EPERM;
++ }
++
+ /* normally dev is not mapped */
+ if (unlikely(domain_context_mapped(dev))) {
+ struct dmar_domain *old_domain;
+diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
+index 5f59f1e3e5b1..922791009fc5 100644
+--- a/drivers/md/dm-table.c
++++ b/drivers/md/dm-table.c
+@@ -1386,6 +1386,14 @@ static int device_is_not_random(struct dm_target *ti, struct dm_dev *dev,
+ return q && !blk_queue_add_random(q);
+ }
+
++static int queue_supports_sg_merge(struct dm_target *ti, struct dm_dev *dev,
++ sector_t start, sector_t len, void *data)
++{
++ struct request_queue *q = bdev_get_queue(dev->bdev);
++
++ return q && !test_bit(QUEUE_FLAG_NO_SG_MERGE, &q->queue_flags);
++}
++
+ static bool dm_table_all_devices_attribute(struct dm_table *t,
+ iterate_devices_callout_fn func)
+ {
+@@ -1464,6 +1472,11 @@ void dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
+ if (!dm_table_supports_write_same(t))
+ q->limits.max_write_same_sectors = 0;
+
++ if (dm_table_all_devices_attribute(t, queue_supports_sg_merge))
++ queue_flag_clear_unlocked(QUEUE_FLAG_NO_SG_MERGE, q);
++ else
++ queue_flag_set_unlocked(QUEUE_FLAG_NO_SG_MERGE, q);
++
+ dm_table_set_integrity(t);
+
+ /*
+diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
+index 56e24c072b62..d7690f86fdb9 100644
+--- a/drivers/md/raid1.c
++++ b/drivers/md/raid1.c
+@@ -1501,12 +1501,12 @@ static void error(struct mddev *mddev, struct md_rdev *rdev)
+ mddev->degraded++;
+ set_bit(Faulty, &rdev->flags);
+ spin_unlock_irqrestore(&conf->device_lock, flags);
+- /*
+- * if recovery is running, make sure it aborts.
+- */
+- set_bit(MD_RECOVERY_INTR, &mddev->recovery);
+ } else
+ set_bit(Faulty, &rdev->flags);
++ /*
++ * if recovery is running, make sure it aborts.
++ */
++ set_bit(MD_RECOVERY_INTR, &mddev->recovery);
+ set_bit(MD_CHANGE_DEVS, &mddev->flags);
+ printk(KERN_ALERT
+ "md/raid1:%s: Disk failure on %s, disabling device.\n"
+diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
+index cb882aae9e20..a46124ecafc7 100644
+--- a/drivers/md/raid10.c
++++ b/drivers/md/raid10.c
+@@ -1684,13 +1684,12 @@ static void error(struct mddev *mddev, struct md_rdev *rdev)
+ spin_unlock_irqrestore(&conf->device_lock, flags);
+ return;
+ }
+- if (test_and_clear_bit(In_sync, &rdev->flags)) {
++ if (test_and_clear_bit(In_sync, &rdev->flags))
+ mddev->degraded++;
+- /*
+- * if recovery is running, make sure it aborts.
+- */
+- set_bit(MD_RECOVERY_INTR, &mddev->recovery);
+- }
++ /*
++ * If recovery is running, make sure it aborts.
++ */
++ set_bit(MD_RECOVERY_INTR, &mddev->recovery);
+ set_bit(Blocked, &rdev->flags);
+ set_bit(Faulty, &rdev->flags);
+ set_bit(MD_CHANGE_DEVS, &mddev->flags);
+@@ -2954,6 +2953,7 @@ static sector_t sync_request(struct mddev *mddev, sector_t sector_nr,
+ */
+ if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery)) {
+ end_reshape(conf);
++ close_sync(conf);
+ return 0;
+ }
+
+@@ -4411,7 +4411,7 @@ read_more:
+ read_bio->bi_private = r10_bio;
+ read_bio->bi_end_io = end_sync_read;
+ read_bio->bi_rw = READ;
+- read_bio->bi_flags &= ~(BIO_POOL_MASK - 1);
++ read_bio->bi_flags &= (~0UL << BIO_RESET_BITS);
+ read_bio->bi_flags |= 1 << BIO_UPTODATE;
+ read_bio->bi_vcnt = 0;
+ read_bio->bi_iter.bi_size = 0;
+diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
+index 6234b2e84587..183588b11fc1 100644
+--- a/drivers/md/raid5.c
++++ b/drivers/md/raid5.c
+@@ -2922,7 +2922,7 @@ static int fetch_block(struct stripe_head *sh, struct stripe_head_state *s,
+ (!test_bit(R5_Insync, &dev->flags) || test_bit(STRIPE_PREREAD_ACTIVE, &sh->state)) &&
+ !test_bit(R5_OVERWRITE, &fdev[0]->flags)) ||
+ (sh->raid_conf->level == 6 && s->failed && s->to_write &&
+- s->to_write < sh->raid_conf->raid_disks - 2 &&
++ s->to_write - s->non_overwrite < sh->raid_conf->raid_disks - 2 &&
+ (!test_bit(R5_Insync, &dev->flags) || test_bit(STRIPE_PREREAD_ACTIVE, &sh->state))))) {
+ /* we would like to get this block, possibly by computing it,
+ * otherwise read it if the backing disk is insync
+@@ -3817,6 +3817,8 @@ static void handle_stripe(struct stripe_head *sh)
+ set_bit(R5_Wantwrite, &dev->flags);
+ if (prexor)
+ continue;
++ if (s.failed > 1)
++ continue;
+ if (!test_bit(R5_Insync, &dev->flags) ||
+ ((i == sh->pd_idx || i == sh->qd_idx) &&
+ s.failed == 0))
+diff --git a/drivers/media/common/siano/Kconfig b/drivers/media/common/siano/Kconfig
+index f953d33ee151..4bfbd5f463d1 100644
+--- a/drivers/media/common/siano/Kconfig
++++ b/drivers/media/common/siano/Kconfig
+@@ -22,8 +22,7 @@ config SMS_SIANO_DEBUGFS
+ bool "Enable debugfs for smsdvb"
+ depends on SMS_SIANO_MDTV
+ depends on DEBUG_FS
+- depends on SMS_USB_DRV
+- depends on CONFIG_SMS_USB_DRV = CONFIG_SMS_SDIO_DRV
++ depends on SMS_USB_DRV = SMS_SDIO_DRV
+
+ ---help---
+ Choose Y to enable visualizing a dump of the frontend
+diff --git a/drivers/media/i2c/mt9v032.c b/drivers/media/i2c/mt9v032.c
+index 40172b8d8ea2..f04d0bbd9cfd 100644
+--- a/drivers/media/i2c/mt9v032.c
++++ b/drivers/media/i2c/mt9v032.c
+@@ -305,8 +305,8 @@ mt9v032_update_hblank(struct mt9v032 *mt9v032)
+
+ if (mt9v032->version->version == MT9V034_CHIP_ID_REV1)
+ min_hblank += (mt9v032->hratio - 1) * 10;
+- min_hblank = max_t(unsigned int, (int)mt9v032->model->data->min_row_time - crop->width,
+- (int)min_hblank);
++ min_hblank = max_t(int, mt9v032->model->data->min_row_time - crop->width,
++ min_hblank);
+ hblank = max_t(unsigned int, mt9v032->hblank, min_hblank);
+
+ return mt9v032_write(client, MT9V032_HORIZONTAL_BLANKING, hblank);
+diff --git a/drivers/media/media-device.c b/drivers/media/media-device.c
+index 88b97c9e64ac..73a432934bd8 100644
+--- a/drivers/media/media-device.c
++++ b/drivers/media/media-device.c
+@@ -106,8 +106,6 @@ static long media_device_enum_entities(struct media_device *mdev,
+ if (ent->name) {
+ strncpy(u_ent.name, ent->name, sizeof(u_ent.name));
+ u_ent.name[sizeof(u_ent.name) - 1] = '\0';
+- } else {
+- memset(u_ent.name, 0, sizeof(u_ent.name));
+ }
+ u_ent.type = ent->type;
+ u_ent.revision = ent->revision;
+diff --git a/drivers/media/platform/vsp1/vsp1_video.c b/drivers/media/platform/vsp1/vsp1_video.c
+index 8a1253e51f04..677e3aa04eee 100644
+--- a/drivers/media/platform/vsp1/vsp1_video.c
++++ b/drivers/media/platform/vsp1/vsp1_video.c
+@@ -654,8 +654,6 @@ static int vsp1_video_buffer_prepare(struct vb2_buffer *vb)
+ if (vb->num_planes < format->num_planes)
+ return -EINVAL;
+
+- buf->video = video;
+-
+ for (i = 0; i < vb->num_planes; ++i) {
+ buf->addr[i] = vb2_dma_contig_plane_dma_addr(vb, i);
+ buf->length[i] = vb2_plane_size(vb, i);
+diff --git a/drivers/media/platform/vsp1/vsp1_video.h b/drivers/media/platform/vsp1/vsp1_video.h
+index c04d48fa2999..7284320d5433 100644
+--- a/drivers/media/platform/vsp1/vsp1_video.h
++++ b/drivers/media/platform/vsp1/vsp1_video.h
+@@ -90,7 +90,6 @@ static inline struct vsp1_pipeline *to_vsp1_pipeline(struct media_entity *e)
+ }
+
+ struct vsp1_video_buffer {
+- struct vsp1_video *video;
+ struct vb2_buffer buf;
+ struct list_head queue;
+
+diff --git a/drivers/media/tuners/xc4000.c b/drivers/media/tuners/xc4000.c
+index 2018befabb5a..e71decbfd0af 100644
+--- a/drivers/media/tuners/xc4000.c
++++ b/drivers/media/tuners/xc4000.c
+@@ -93,7 +93,7 @@ struct xc4000_priv {
+ struct firmware_description *firm;
+ int firm_size;
+ u32 if_khz;
+- u32 freq_hz;
++ u32 freq_hz, freq_offset;
+ u32 bandwidth;
+ u8 video_standard;
+ u8 rf_mode;
+@@ -1157,14 +1157,14 @@ static int xc4000_set_params(struct dvb_frontend *fe)
+ case SYS_ATSC:
+ dprintk(1, "%s() VSB modulation\n", __func__);
+ priv->rf_mode = XC_RF_MODE_AIR;
+- priv->freq_hz = c->frequency - 1750000;
++ priv->freq_offset = 1750000;
+ priv->video_standard = XC4000_DTV6;
+ type = DTV6;
+ break;
+ case SYS_DVBC_ANNEX_B:
+ dprintk(1, "%s() QAM modulation\n", __func__);
+ priv->rf_mode = XC_RF_MODE_CABLE;
+- priv->freq_hz = c->frequency - 1750000;
++ priv->freq_offset = 1750000;
+ priv->video_standard = XC4000_DTV6;
+ type = DTV6;
+ break;
+@@ -1173,23 +1173,23 @@ static int xc4000_set_params(struct dvb_frontend *fe)
+ dprintk(1, "%s() OFDM\n", __func__);
+ if (bw == 0) {
+ if (c->frequency < 400000000) {
+- priv->freq_hz = c->frequency - 2250000;
++ priv->freq_offset = 2250000;
+ } else {
+- priv->freq_hz = c->frequency - 2750000;
++ priv->freq_offset = 2750000;
+ }
+ priv->video_standard = XC4000_DTV7_8;
+ type = DTV78;
+ } else if (bw <= 6000000) {
+ priv->video_standard = XC4000_DTV6;
+- priv->freq_hz = c->frequency - 1750000;
++ priv->freq_offset = 1750000;
+ type = DTV6;
+ } else if (bw <= 7000000) {
+ priv->video_standard = XC4000_DTV7;
+- priv->freq_hz = c->frequency - 2250000;
++ priv->freq_offset = 2250000;
+ type = DTV7;
+ } else {
+ priv->video_standard = XC4000_DTV8;
+- priv->freq_hz = c->frequency - 2750000;
++ priv->freq_offset = 2750000;
+ type = DTV8;
+ }
+ priv->rf_mode = XC_RF_MODE_AIR;
+@@ -1200,6 +1200,8 @@ static int xc4000_set_params(struct dvb_frontend *fe)
+ goto fail;
+ }
+
++ priv->freq_hz = c->frequency - priv->freq_offset;
++
+ dprintk(1, "%s() frequency=%d (compensated)\n",
+ __func__, priv->freq_hz);
+
+@@ -1520,7 +1522,7 @@ static int xc4000_get_frequency(struct dvb_frontend *fe, u32 *freq)
+ {
+ struct xc4000_priv *priv = fe->tuner_priv;
+
+- *freq = priv->freq_hz;
++ *freq = priv->freq_hz + priv->freq_offset;
+
+ if (debug) {
+ mutex_lock(&priv->lock);
+diff --git a/drivers/media/tuners/xc5000.c b/drivers/media/tuners/xc5000.c
+index 2b3d514be672..3091cf7be7a1 100644
+--- a/drivers/media/tuners/xc5000.c
++++ b/drivers/media/tuners/xc5000.c
+@@ -56,7 +56,7 @@ struct xc5000_priv {
+
+ u32 if_khz;
+ u16 xtal_khz;
+- u32 freq_hz;
++ u32 freq_hz, freq_offset;
+ u32 bandwidth;
+ u8 video_standard;
+ u8 rf_mode;
+@@ -749,13 +749,13 @@ static int xc5000_set_params(struct dvb_frontend *fe)
+ case SYS_ATSC:
+ dprintk(1, "%s() VSB modulation\n", __func__);
+ priv->rf_mode = XC_RF_MODE_AIR;
+- priv->freq_hz = freq - 1750000;
++ priv->freq_offset = 1750000;
+ priv->video_standard = DTV6;
+ break;
+ case SYS_DVBC_ANNEX_B:
+ dprintk(1, "%s() QAM modulation\n", __func__);
+ priv->rf_mode = XC_RF_MODE_CABLE;
+- priv->freq_hz = freq - 1750000;
++ priv->freq_offset = 1750000;
+ priv->video_standard = DTV6;
+ break;
+ case SYS_ISDBT:
+@@ -770,15 +770,15 @@ static int xc5000_set_params(struct dvb_frontend *fe)
+ switch (bw) {
+ case 6000000:
+ priv->video_standard = DTV6;
+- priv->freq_hz = freq - 1750000;
++ priv->freq_offset = 1750000;
+ break;
+ case 7000000:
+ priv->video_standard = DTV7;
+- priv->freq_hz = freq - 2250000;
++ priv->freq_offset = 2250000;
+ break;
+ case 8000000:
+ priv->video_standard = DTV8;
+- priv->freq_hz = freq - 2750000;
++ priv->freq_offset = 2750000;
+ break;
+ default:
+ printk(KERN_ERR "xc5000 bandwidth not set!\n");
+@@ -792,15 +792,15 @@ static int xc5000_set_params(struct dvb_frontend *fe)
+ priv->rf_mode = XC_RF_MODE_CABLE;
+ if (bw <= 6000000) {
+ priv->video_standard = DTV6;
+- priv->freq_hz = freq - 1750000;
++ priv->freq_offset = 1750000;
+ b = 6;
+ } else if (bw <= 7000000) {
+ priv->video_standard = DTV7;
+- priv->freq_hz = freq - 2250000;
++ priv->freq_offset = 2250000;
+ b = 7;
+ } else {
+ priv->video_standard = DTV7_8;
+- priv->freq_hz = freq - 2750000;
++ priv->freq_offset = 2750000;
+ b = 8;
+ }
+ dprintk(1, "%s() Bandwidth %dMHz (%d)\n", __func__,
+@@ -811,6 +811,8 @@ static int xc5000_set_params(struct dvb_frontend *fe)
+ return -EINVAL;
+ }
+
++ priv->freq_hz = freq - priv->freq_offset;
++
+ dprintk(1, "%s() frequency=%d (compensated to %d)\n",
+ __func__, freq, priv->freq_hz);
+
+@@ -1061,7 +1063,7 @@ static int xc5000_get_frequency(struct dvb_frontend *fe, u32 *freq)
+ {
+ struct xc5000_priv *priv = fe->tuner_priv;
+ dprintk(1, "%s()\n", __func__);
+- *freq = priv->freq_hz;
++ *freq = priv->freq_hz + priv->freq_offset;
+ return 0;
+ }
+
+diff --git a/drivers/media/usb/au0828/au0828-video.c b/drivers/media/usb/au0828/au0828-video.c
+index 9038194513c5..49124b76e4cf 100644
+--- a/drivers/media/usb/au0828/au0828-video.c
++++ b/drivers/media/usb/au0828/au0828-video.c
+@@ -787,11 +787,27 @@ static int au0828_i2s_init(struct au0828_dev *dev)
+
+ /*
+ * Auvitek au0828 analog stream enable
+- * Please set interface0 to AS5 before enable the stream
+ */
+ static int au0828_analog_stream_enable(struct au0828_dev *d)
+ {
++ struct usb_interface *iface;
++ int ret;
++
+ dprintk(1, "au0828_analog_stream_enable called\n");
++
++ iface = usb_ifnum_to_if(d->usbdev, 0);
++ if (iface && iface->cur_altsetting->desc.bAlternateSetting != 5) {
++ dprintk(1, "Changing intf#0 to alt 5\n");
++ /* set au0828 interface0 to AS5 here again */
++ ret = usb_set_interface(d->usbdev, 0, 5);
++ if (ret < 0) {
++ printk(KERN_INFO "Au0828 can't set alt setting to 5!\n");
++ return -EBUSY;
++ }
++ }
++
++ /* FIXME: size should be calculated using d->width, d->height */
++
+ au0828_writereg(d, AU0828_SENSORCTRL_VBI_103, 0x00);
+ au0828_writereg(d, 0x106, 0x00);
+ /* set x position */
+@@ -1002,15 +1018,6 @@ static int au0828_v4l2_open(struct file *filp)
+ return -ERESTARTSYS;
+ }
+ if (dev->users == 0) {
+- /* set au0828 interface0 to AS5 here again */
+- ret = usb_set_interface(dev->usbdev, 0, 5);
+- if (ret < 0) {
+- mutex_unlock(&dev->lock);
+- printk(KERN_INFO "Au0828 can't set alternate to 5!\n");
+- kfree(fh);
+- return -EBUSY;
+- }
+-
+ au0828_analog_stream_enable(dev);
+ au0828_analog_stream_reset(dev);
+
+@@ -1252,13 +1259,6 @@ static int au0828_set_format(struct au0828_dev *dev, unsigned int cmd,
+ }
+ }
+
+- /* set au0828 interface0 to AS5 here again */
+- ret = usb_set_interface(dev->usbdev, 0, 5);
+- if (ret < 0) {
+- printk(KERN_INFO "Au0828 can't set alt setting to 5!\n");
+- return -EBUSY;
+- }
+-
+ au0828_analog_stream_enable(dev);
+
+ return 0;
+diff --git a/drivers/media/v4l2-core/videobuf2-core.c b/drivers/media/v4l2-core/videobuf2-core.c
+index 7c4489c42365..1d67e95311d6 100644
+--- a/drivers/media/v4l2-core/videobuf2-core.c
++++ b/drivers/media/v4l2-core/videobuf2-core.c
+@@ -1750,12 +1750,14 @@ static int vb2_start_streaming(struct vb2_queue *q)
+ __enqueue_in_driver(vb);
+
+ /* Tell the driver to start streaming */
++ q->start_streaming_called = 1;
+ ret = call_qop(q, start_streaming, q,
+ atomic_read(&q->owned_by_drv_count));
+- q->start_streaming_called = ret == 0;
+ if (!ret)
+ return 0;
+
++ q->start_streaming_called = 0;
++
+ dprintk(1, "driver refused to start streaming\n");
+ if (WARN_ON(atomic_read(&q->owned_by_drv_count))) {
+ unsigned i;
+diff --git a/drivers/mfd/omap-usb-host.c b/drivers/mfd/omap-usb-host.c
+index b48d80c367f9..33a9234b701c 100644
+--- a/drivers/mfd/omap-usb-host.c
++++ b/drivers/mfd/omap-usb-host.c
+@@ -445,7 +445,7 @@ static unsigned omap_usbhs_rev1_hostconfig(struct usbhs_hcd_omap *omap,
+
+ for (i = 0; i < omap->nports; i++) {
+ if (is_ehci_phy_mode(pdata->port_mode[i])) {
+- reg &= OMAP_UHH_HOSTCONFIG_ULPI_BYPASS;
++ reg &= ~OMAP_UHH_HOSTCONFIG_ULPI_BYPASS;
+ break;
+ }
+ }
+diff --git a/drivers/mfd/rtsx_usb.c b/drivers/mfd/rtsx_usb.c
+index 6352bec8419a..71f387ce8cbd 100644
+--- a/drivers/mfd/rtsx_usb.c
++++ b/drivers/mfd/rtsx_usb.c
+@@ -744,6 +744,7 @@ static struct usb_device_id rtsx_usb_usb_ids[] = {
+ { USB_DEVICE(0x0BDA, 0x0140) },
+ { }
+ };
++MODULE_DEVICE_TABLE(usb, rtsx_usb_usb_ids);
+
+ static struct usb_driver rtsx_usb_driver = {
+ .name = "rtsx_usb",
+diff --git a/drivers/mfd/twl4030-power.c b/drivers/mfd/twl4030-power.c
+index 3bc969a5916b..4d3ff3771491 100644
+--- a/drivers/mfd/twl4030-power.c
++++ b/drivers/mfd/twl4030-power.c
+@@ -724,24 +724,24 @@ static struct twl4030_script *omap3_idle_scripts[] = {
+ * above.
+ */
+ static struct twl4030_resconfig omap3_idle_rconfig[] = {
+- TWL_REMAP_SLEEP(RES_VAUX1, DEV_GRP_NULL, 0, 0),
+- TWL_REMAP_SLEEP(RES_VAUX2, DEV_GRP_NULL, 0, 0),
+- TWL_REMAP_SLEEP(RES_VAUX3, DEV_GRP_NULL, 0, 0),
+- TWL_REMAP_SLEEP(RES_VAUX4, DEV_GRP_NULL, 0, 0),
+- TWL_REMAP_SLEEP(RES_VMMC1, DEV_GRP_NULL, 0, 0),
+- TWL_REMAP_SLEEP(RES_VMMC2, DEV_GRP_NULL, 0, 0),
++ TWL_REMAP_SLEEP(RES_VAUX1, TWL4030_RESCONFIG_UNDEF, 0, 0),
++ TWL_REMAP_SLEEP(RES_VAUX2, TWL4030_RESCONFIG_UNDEF, 0, 0),
++ TWL_REMAP_SLEEP(RES_VAUX3, TWL4030_RESCONFIG_UNDEF, 0, 0),
++ TWL_REMAP_SLEEP(RES_VAUX4, TWL4030_RESCONFIG_UNDEF, 0, 0),
++ TWL_REMAP_SLEEP(RES_VMMC1, TWL4030_RESCONFIG_UNDEF, 0, 0),
++ TWL_REMAP_SLEEP(RES_VMMC2, TWL4030_RESCONFIG_UNDEF, 0, 0),
+ TWL_REMAP_OFF(RES_VPLL1, DEV_GRP_P1, 3, 1),
+ TWL_REMAP_SLEEP(RES_VPLL2, DEV_GRP_P1, 0, 0),
+- TWL_REMAP_SLEEP(RES_VSIM, DEV_GRP_NULL, 0, 0),
+- TWL_REMAP_SLEEP(RES_VDAC, DEV_GRP_NULL, 0, 0),
++ TWL_REMAP_SLEEP(RES_VSIM, TWL4030_RESCONFIG_UNDEF, 0, 0),
++ TWL_REMAP_SLEEP(RES_VDAC, TWL4030_RESCONFIG_UNDEF, 0, 0),
+ TWL_REMAP_SLEEP(RES_VINTANA1, TWL_DEV_GRP_P123, 1, 2),
+ TWL_REMAP_SLEEP(RES_VINTANA2, TWL_DEV_GRP_P123, 0, 2),
+ TWL_REMAP_SLEEP(RES_VINTDIG, TWL_DEV_GRP_P123, 1, 2),
+ TWL_REMAP_SLEEP(RES_VIO, TWL_DEV_GRP_P123, 2, 2),
+ TWL_REMAP_OFF(RES_VDD1, DEV_GRP_P1, 4, 1),
+ TWL_REMAP_OFF(RES_VDD2, DEV_GRP_P1, 3, 1),
+- TWL_REMAP_SLEEP(RES_VUSB_1V5, DEV_GRP_NULL, 0, 0),
+- TWL_REMAP_SLEEP(RES_VUSB_1V8, DEV_GRP_NULL, 0, 0),
++ TWL_REMAP_SLEEP(RES_VUSB_1V5, TWL4030_RESCONFIG_UNDEF, 0, 0),
++ TWL_REMAP_SLEEP(RES_VUSB_1V8, TWL4030_RESCONFIG_UNDEF, 0, 0),
+ TWL_REMAP_SLEEP(RES_VUSB_3V1, TWL_DEV_GRP_P123, 0, 0),
+ /* Resource #20 USB charge pump skipped */
+ TWL_REMAP_SLEEP(RES_REGEN, TWL_DEV_GRP_P123, 2, 1),
+diff --git a/drivers/mtd/ftl.c b/drivers/mtd/ftl.c
+index 19d637266fcd..71e4f6ccae2f 100644
+--- a/drivers/mtd/ftl.c
++++ b/drivers/mtd/ftl.c
+@@ -1075,7 +1075,6 @@ static void ftl_add_mtd(struct mtd_blktrans_ops *tr, struct mtd_info *mtd)
+ return;
+ }
+
+- ftl_freepart(partition);
+ kfree(partition);
+ }
+
+diff --git a/drivers/mtd/nand/omap2.c b/drivers/mtd/nand/omap2.c
+index f0ed92e210a1..e2b9b345177a 100644
+--- a/drivers/mtd/nand/omap2.c
++++ b/drivers/mtd/nand/omap2.c
+@@ -931,7 +931,7 @@ static int omap_calculate_ecc(struct mtd_info *mtd, const u_char *dat,
+ u32 val;
+
+ val = readl(info->reg.gpmc_ecc_config);
+- if (((val >> ECC_CONFIG_CS_SHIFT) & ~CS_MASK) != info->gpmc_cs)
++ if (((val >> ECC_CONFIG_CS_SHIFT) & CS_MASK) != info->gpmc_cs)
+ return -EINVAL;
+
+ /* read ecc result */
+diff --git a/drivers/power/bq2415x_charger.c b/drivers/power/bq2415x_charger.c
+index 79a37f6d3307..e384844a1ae1 100644
+--- a/drivers/power/bq2415x_charger.c
++++ b/drivers/power/bq2415x_charger.c
+@@ -840,8 +840,7 @@ static int bq2415x_notifier_call(struct notifier_block *nb,
+ if (bq->automode < 1)
+ return NOTIFY_OK;
+
+- sysfs_notify(&bq->charger.dev->kobj, NULL, "reported_mode");
+- bq2415x_set_mode(bq, bq->reported_mode);
++ schedule_delayed_work(&bq->work, 0);
+
+ return NOTIFY_OK;
+ }
+@@ -892,6 +891,11 @@ static void bq2415x_timer_work(struct work_struct *work)
+ int error;
+ int boost;
+
++ if (bq->automode > 0 && (bq->reported_mode != bq->mode)) {
++ sysfs_notify(&bq->charger.dev->kobj, NULL, "reported_mode");
++ bq2415x_set_mode(bq, bq->reported_mode);
++ }
++
+ if (!bq->autotimer)
+ return;
+
+diff --git a/drivers/regulator/arizona-ldo1.c b/drivers/regulator/arizona-ldo1.c
+index 04f262a836b2..4c9db589f6c1 100644
+--- a/drivers/regulator/arizona-ldo1.c
++++ b/drivers/regulator/arizona-ldo1.c
+@@ -143,8 +143,6 @@ static struct regulator_ops arizona_ldo1_ops = {
+ .map_voltage = regulator_map_voltage_linear,
+ .get_voltage_sel = regulator_get_voltage_sel_regmap,
+ .set_voltage_sel = regulator_set_voltage_sel_regmap,
+- .get_bypass = regulator_get_bypass_regmap,
+- .set_bypass = regulator_set_bypass_regmap,
+ };
+
+ static const struct regulator_desc arizona_ldo1 = {
+diff --git a/drivers/regulator/tps65218-regulator.c b/drivers/regulator/tps65218-regulator.c
+index 9effe48c605e..8b7a0a9ebdfe 100644
+--- a/drivers/regulator/tps65218-regulator.c
++++ b/drivers/regulator/tps65218-regulator.c
+@@ -68,7 +68,7 @@ static const struct regulator_linear_range ldo1_dcdc3_ranges[] = {
+
+ static const struct regulator_linear_range dcdc4_ranges[] = {
+ REGULATOR_LINEAR_RANGE(1175000, 0x0, 0xf, 25000),
+- REGULATOR_LINEAR_RANGE(1550000, 0x10, 0x34, 50000),
++ REGULATOR_LINEAR_RANGE(1600000, 0x10, 0x34, 50000),
+ };
+
+ static struct tps_info tps65218_pmic_regs[] = {
+diff --git a/drivers/scsi/bfa/bfa_ioc.h b/drivers/scsi/bfa/bfa_ioc.h
+index 2e28392c2fb6..a38aafa030b3 100644
+--- a/drivers/scsi/bfa/bfa_ioc.h
++++ b/drivers/scsi/bfa/bfa_ioc.h
+@@ -72,7 +72,7 @@ struct bfa_sge_s {
+ } while (0)
+
+ #define bfa_swap_words(_x) ( \
+- ((_x) << 32) | ((_x) >> 32))
++ ((u64)(_x) << 32) | ((u64)(_x) >> 32))
+
+ #ifdef __BIG_ENDIAN
+ #define bfa_sge_to_be(_x)
+diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
+index 88d46fe6bf98..769be4d50037 100644
+--- a/drivers/scsi/scsi.c
++++ b/drivers/scsi/scsi.c
+@@ -368,8 +368,8 @@ scsi_alloc_host_cmd_pool(struct Scsi_Host *shost)
+ if (!pool)
+ return NULL;
+
+- pool->cmd_name = kasprintf(GFP_KERNEL, "%s_cmd", hostt->name);
+- pool->sense_name = kasprintf(GFP_KERNEL, "%s_sense", hostt->name);
++ pool->cmd_name = kasprintf(GFP_KERNEL, "%s_cmd", hostt->proc_name);
++ pool->sense_name = kasprintf(GFP_KERNEL, "%s_sense", hostt->proc_name);
+ if (!pool->cmd_name || !pool->sense_name) {
+ scsi_free_host_cmd_pool(pool);
+ return NULL;
+@@ -380,6 +380,10 @@ scsi_alloc_host_cmd_pool(struct Scsi_Host *shost)
+ pool->slab_flags |= SLAB_CACHE_DMA;
+ pool->gfp_mask = __GFP_DMA;
+ }
++
++ if (hostt->cmd_size)
++ hostt->cmd_pool = pool;
++
+ return pool;
+ }
+
+@@ -424,8 +428,10 @@ out:
+ out_free_slab:
+ kmem_cache_destroy(pool->cmd_slab);
+ out_free_pool:
+- if (hostt->cmd_size)
++ if (hostt->cmd_size) {
+ scsi_free_host_cmd_pool(pool);
++ hostt->cmd_pool = NULL;
++ }
+ goto out;
+ }
+
+@@ -447,8 +453,10 @@ static void scsi_put_host_cmd_pool(struct Scsi_Host *shost)
+ if (!--pool->users) {
+ kmem_cache_destroy(pool->cmd_slab);
+ kmem_cache_destroy(pool->sense_slab);
+- if (hostt->cmd_size)
++ if (hostt->cmd_size) {
+ scsi_free_host_cmd_pool(pool);
++ hostt->cmd_pool = NULL;
++ }
+ }
+ mutex_unlock(&host_cmd_pool_mutex);
+ }
+diff --git a/drivers/scsi/scsi_devinfo.c b/drivers/scsi/scsi_devinfo.c
+index f969aca0b54e..49014a143c6a 100644
+--- a/drivers/scsi/scsi_devinfo.c
++++ b/drivers/scsi/scsi_devinfo.c
+@@ -222,6 +222,7 @@ static struct {
+ {"PIONEER", "CD-ROM DRM-602X", NULL, BLIST_FORCELUN | BLIST_SINGLELUN},
+ {"PIONEER", "CD-ROM DRM-604X", NULL, BLIST_FORCELUN | BLIST_SINGLELUN},
+ {"PIONEER", "CD-ROM DRM-624X", NULL, BLIST_FORCELUN | BLIST_SINGLELUN},
++ {"Promise", "VTrak E610f", NULL, BLIST_SPARSELUN | BLIST_NO_RSOC},
+ {"Promise", "", NULL, BLIST_SPARSELUN},
+ {"QUANTUM", "XP34301", "1071", BLIST_NOTQ},
+ {"REGAL", "CDC-4X", NULL, BLIST_MAX5LUN | BLIST_SINGLELUN},
+diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
+index e02b3aab56ce..a299b82e6b09 100644
+--- a/drivers/scsi/scsi_scan.c
++++ b/drivers/scsi/scsi_scan.c
+@@ -922,6 +922,12 @@ static int scsi_add_lun(struct scsi_device *sdev, unsigned char *inq_result,
+ if (*bflags & BLIST_USE_10_BYTE_MS)
+ sdev->use_10_for_ms = 1;
+
++ /* some devices don't like REPORT SUPPORTED OPERATION CODES
++ * and will simply timeout causing sd_mod init to take a very
++ * very long time */
++ if (*bflags & BLIST_NO_RSOC)
++ sdev->no_report_opcodes = 1;
++
+ /* set the device running here so that slave configure
+ * may do I/O */
+ ret = scsi_device_set_state(sdev, SDEV_RUNNING);
+@@ -950,7 +956,9 @@ static int scsi_add_lun(struct scsi_device *sdev, unsigned char *inq_result,
+
+ sdev->eh_timeout = SCSI_DEFAULT_EH_TIMEOUT;
+
+- if (*bflags & BLIST_SKIP_VPD_PAGES)
++ if (*bflags & BLIST_TRY_VPD_PAGES)
++ sdev->try_vpd_pages = 1;
++ else if (*bflags & BLIST_SKIP_VPD_PAGES)
+ sdev->skip_vpd_pages = 1;
+
+ transport_configure_device(&sdev->sdev_gendev);
+@@ -1239,6 +1247,12 @@ static void scsi_sequential_lun_scan(struct scsi_target *starget,
+ max_dev_lun = min(8U, max_dev_lun);
+
+ /*
++ * Stop scanning at 255 unless BLIST_SCSI3LUN
++ */
++ if (!(bflags & BLIST_SCSI3LUN))
++ max_dev_lun = min(256U, max_dev_lun);
++
++ /*
+ * We have already scanned LUN 0, so start at LUN 1. Keep scanning
+ * until we reach the max, or no LUN is found and we are not
+ * sparse_lun.
+diff --git a/drivers/scsi/scsi_transport_srp.c b/drivers/scsi/scsi_transport_srp.c
+index 13e898332e45..a0c5bfdc5366 100644
+--- a/drivers/scsi/scsi_transport_srp.c
++++ b/drivers/scsi/scsi_transport_srp.c
+@@ -473,7 +473,8 @@ static void __srp_start_tl_fail_timers(struct srp_rport *rport)
+ if (delay > 0)
+ queue_delayed_work(system_long_wq, &rport->reconnect_work,
+ 1UL * delay * HZ);
+- if (srp_rport_set_state(rport, SRP_RPORT_BLOCKED) == 0) {
++ if ((fast_io_fail_tmo >= 0 || dev_loss_tmo >= 0) &&
++ srp_rport_set_state(rport, SRP_RPORT_BLOCKED) == 0) {
+ pr_debug("%s new state: %d\n", dev_name(&shost->shost_gendev),
+ rport->state);
+ scsi_target_block(&shost->shost_gendev);
+diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
+index 6825eda1114a..ed2e99eca336 100644
+--- a/drivers/scsi/sd.c
++++ b/drivers/scsi/sd.c
+@@ -2681,6 +2681,11 @@ static void sd_read_write_same(struct scsi_disk *sdkp, unsigned char *buffer)
+
+ static int sd_try_extended_inquiry(struct scsi_device *sdp)
+ {
++ /* Attempt VPD inquiry if the device blacklist explicitly calls
++ * for it.
++ */
++ if (sdp->try_vpd_pages)
++ return 1;
+ /*
+ * Although VPD inquiries can go to SCSI-2 type devices,
+ * some USB ones crash on receiving them, and the pages
+diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
+index 9969fa1ef7c4..ed0f899e8aa5 100644
+--- a/drivers/scsi/storvsc_drv.c
++++ b/drivers/scsi/storvsc_drv.c
+@@ -33,6 +33,7 @@
+ #include <linux/device.h>
+ #include <linux/hyperv.h>
+ #include <linux/mempool.h>
++#include <linux/blkdev.h>
+ #include <scsi/scsi.h>
+ #include <scsi/scsi_cmnd.h>
+ #include <scsi/scsi_host.h>
+@@ -330,17 +331,17 @@ static int storvsc_timeout = 180;
+
+ static void storvsc_on_channel_callback(void *context);
+
+-/*
+- * In Hyper-V, each port/path/target maps to 1 scsi host adapter. In
+- * reality, the path/target is not used (ie always set to 0) so our
+- * scsi host adapter essentially has 1 bus with 1 target that contains
+- * up to 256 luns.
+- */
+-#define STORVSC_MAX_LUNS_PER_TARGET 64
+-#define STORVSC_MAX_TARGETS 1
+-#define STORVSC_MAX_CHANNELS 1
++#define STORVSC_MAX_LUNS_PER_TARGET 255
++#define STORVSC_MAX_TARGETS 2
++#define STORVSC_MAX_CHANNELS 8
+
++#define STORVSC_FC_MAX_LUNS_PER_TARGET 255
++#define STORVSC_FC_MAX_TARGETS 128
++#define STORVSC_FC_MAX_CHANNELS 8
+
++#define STORVSC_IDE_MAX_LUNS_PER_TARGET 64
++#define STORVSC_IDE_MAX_TARGETS 1
++#define STORVSC_IDE_MAX_CHANNELS 1
+
+ struct storvsc_cmd_request {
+ struct list_head entry;
+@@ -1017,6 +1018,13 @@ static void storvsc_handle_error(struct vmscsi_request *vm_srb,
+ case ATA_12:
+ set_host_byte(scmnd, DID_PASSTHROUGH);
+ break;
++ /*
++ * On Some Windows hosts TEST_UNIT_READY command can return
++ * SRB_STATUS_ERROR, let the upper level code deal with it
++ * based on the sense information.
++ */
++ case TEST_UNIT_READY:
++ break;
+ default:
+ set_host_byte(scmnd, DID_TARGET_FAILURE);
+ }
+@@ -1518,6 +1526,16 @@ static int storvsc_host_reset_handler(struct scsi_cmnd *scmnd)
+ return SUCCESS;
+ }
+
++/*
++ * The host guarantees to respond to each command, although I/O latencies might
++ * be unbounded on Azure. Reset the timer unconditionally to give the host a
++ * chance to perform EH.
++ */
++static enum blk_eh_timer_return storvsc_eh_timed_out(struct scsi_cmnd *scmnd)
++{
++ return BLK_EH_RESET_TIMER;
++}
++
+ static bool storvsc_scsi_cmd_ok(struct scsi_cmnd *scmnd)
+ {
+ bool allowed = true;
+@@ -1553,9 +1571,19 @@ static int storvsc_queuecommand(struct Scsi_Host *host, struct scsi_cmnd *scmnd)
+ struct vmscsi_request *vm_srb;
+ struct stor_mem_pools *memp = scmnd->device->hostdata;
+
+- if (!storvsc_scsi_cmd_ok(scmnd)) {
+- scmnd->scsi_done(scmnd);
+- return 0;
++ if (vmstor_current_major <= VMSTOR_WIN8_MAJOR) {
++ /*
++ * On legacy hosts filter unimplemented commands.
++ * Future hosts are expected to correctly handle
++ * unsupported commands. Furthermore, it is
++ * possible that some of the currently
++ * unsupported commands maybe supported in
++ * future versions of the host.
++ */
++ if (!storvsc_scsi_cmd_ok(scmnd)) {
++ scmnd->scsi_done(scmnd);
++ return 0;
++ }
+ }
+
+ request_size = sizeof(struct storvsc_cmd_request);
+@@ -1580,26 +1608,24 @@ static int storvsc_queuecommand(struct Scsi_Host *host, struct scsi_cmnd *scmnd)
+ vm_srb = &cmd_request->vstor_packet.vm_srb;
+ vm_srb->win8_extension.time_out_value = 60;
+
++ vm_srb->win8_extension.srb_flags |=
++ (SRB_FLAGS_QUEUE_ACTION_ENABLE |
++ SRB_FLAGS_DISABLE_SYNCH_TRANSFER);
+
+ /* Build the SRB */
+ switch (scmnd->sc_data_direction) {
+ case DMA_TO_DEVICE:
+ vm_srb->data_in = WRITE_TYPE;
+ vm_srb->win8_extension.srb_flags |= SRB_FLAGS_DATA_OUT;
+- vm_srb->win8_extension.srb_flags |=
+- (SRB_FLAGS_QUEUE_ACTION_ENABLE |
+- SRB_FLAGS_DISABLE_SYNCH_TRANSFER);
+ break;
+ case DMA_FROM_DEVICE:
+ vm_srb->data_in = READ_TYPE;
+ vm_srb->win8_extension.srb_flags |= SRB_FLAGS_DATA_IN;
+- vm_srb->win8_extension.srb_flags |=
+- (SRB_FLAGS_QUEUE_ACTION_ENABLE |
+- SRB_FLAGS_DISABLE_SYNCH_TRANSFER);
+ break;
+ default:
+ vm_srb->data_in = UNKNOWN_TYPE;
+- vm_srb->win8_extension.srb_flags = 0;
++ vm_srb->win8_extension.srb_flags |= (SRB_FLAGS_DATA_IN |
++ SRB_FLAGS_DATA_OUT);
+ break;
+ }
+
+@@ -1687,11 +1713,11 @@ static struct scsi_host_template scsi_driver = {
+ .bios_param = storvsc_get_chs,
+ .queuecommand = storvsc_queuecommand,
+ .eh_host_reset_handler = storvsc_host_reset_handler,
++ .eh_timed_out = storvsc_eh_timed_out,
+ .slave_alloc = storvsc_device_alloc,
+ .slave_destroy = storvsc_device_destroy,
+ .slave_configure = storvsc_device_configure,
+- .cmd_per_lun = 1,
+- /* 64 max_queue * 1 target */
++ .cmd_per_lun = 255,
+ .can_queue = STORVSC_MAX_IO_REQUESTS*STORVSC_MAX_TARGETS,
+ .this_id = -1,
+ /* no use setting to 0 since ll_blk_rw reset it to 1 */
+@@ -1743,19 +1769,25 @@ static int storvsc_probe(struct hv_device *device,
+ * set state to properly communicate with the host.
+ */
+
+- if (vmbus_proto_version == VERSION_WIN8) {
+- sense_buffer_size = POST_WIN7_STORVSC_SENSE_BUFFER_SIZE;
+- vmscsi_size_delta = 0;
+- vmstor_current_major = VMSTOR_WIN8_MAJOR;
+- vmstor_current_minor = VMSTOR_WIN8_MINOR;
+- } else {
++ switch (vmbus_proto_version) {
++ case VERSION_WS2008:
++ case VERSION_WIN7:
+ sense_buffer_size = PRE_WIN8_STORVSC_SENSE_BUFFER_SIZE;
+ vmscsi_size_delta = sizeof(struct vmscsi_win8_extension);
+ vmstor_current_major = VMSTOR_WIN7_MAJOR;
+ vmstor_current_minor = VMSTOR_WIN7_MINOR;
++ break;
++ default:
++ sense_buffer_size = POST_WIN7_STORVSC_SENSE_BUFFER_SIZE;
++ vmscsi_size_delta = 0;
++ vmstor_current_major = VMSTOR_WIN8_MAJOR;
++ vmstor_current_minor = VMSTOR_WIN8_MINOR;
++ break;
+ }
+
+-
++ if (dev_id->driver_data == SFC_GUID)
++ scsi_driver.can_queue = (STORVSC_MAX_IO_REQUESTS *
++ STORVSC_FC_MAX_TARGETS);
+ host = scsi_host_alloc(&scsi_driver,
+ sizeof(struct hv_host_device));
+ if (!host)
+@@ -1789,12 +1821,25 @@ static int storvsc_probe(struct hv_device *device,
+ host_dev->path = stor_device->path_id;
+ host_dev->target = stor_device->target_id;
+
+- /* max # of devices per target */
+- host->max_lun = STORVSC_MAX_LUNS_PER_TARGET;
+- /* max # of targets per channel */
+- host->max_id = STORVSC_MAX_TARGETS;
+- /* max # of channels */
+- host->max_channel = STORVSC_MAX_CHANNELS - 1;
++ switch (dev_id->driver_data) {
++ case SFC_GUID:
++ host->max_lun = STORVSC_FC_MAX_LUNS_PER_TARGET;
++ host->max_id = STORVSC_FC_MAX_TARGETS;
++ host->max_channel = STORVSC_FC_MAX_CHANNELS - 1;
++ break;
++
++ case SCSI_GUID:
++ host->max_lun = STORVSC_MAX_LUNS_PER_TARGET;
++ host->max_id = STORVSC_MAX_TARGETS;
++ host->max_channel = STORVSC_MAX_CHANNELS - 1;
++ break;
++
++ default:
++ host->max_lun = STORVSC_IDE_MAX_LUNS_PER_TARGET;
++ host->max_id = STORVSC_IDE_MAX_TARGETS;
++ host->max_channel = STORVSC_IDE_MAX_CHANNELS - 1;
++ break;
++ }
+ /* max cmd length */
+ host->max_cmd_len = STORVSC_MAX_CMD_LEN;
+
+diff --git a/drivers/spi/spi-omap2-mcspi.c b/drivers/spi/spi-omap2-mcspi.c
+index 4dc77df38864..68441fa448de 100644
+--- a/drivers/spi/spi-omap2-mcspi.c
++++ b/drivers/spi/spi-omap2-mcspi.c
+@@ -149,6 +149,7 @@ struct omap2_mcspi_cs {
+ void __iomem *base;
+ unsigned long phys;
+ int word_len;
++ u16 mode;
+ struct list_head node;
+ /* Context save and restore shadow register */
+ u32 chconf0, chctrl0;
+@@ -926,6 +927,8 @@ static int omap2_mcspi_setup_transfer(struct spi_device *spi,
+
+ mcspi_write_chconf0(spi, l);
+
++ cs->mode = spi->mode;
++
+ dev_dbg(&spi->dev, "setup: speed %d, sample %s edge, clk %s\n",
+ speed_hz,
+ (spi->mode & SPI_CPHA) ? "trailing" : "leading",
+@@ -998,6 +1001,7 @@ static int omap2_mcspi_setup(struct spi_device *spi)
+ return -ENOMEM;
+ cs->base = mcspi->base + spi->chip_select * 0x14;
+ cs->phys = mcspi->phys + spi->chip_select * 0x14;
++ cs->mode = 0;
+ cs->chconf0 = 0;
+ cs->chctrl0 = 0;
+ spi->controller_state = cs;
+@@ -1079,6 +1083,16 @@ static void omap2_mcspi_work(struct omap2_mcspi *mcspi, struct spi_message *m)
+ cs = spi->controller_state;
+ cd = spi->controller_data;
+
++ /*
++ * The slave driver could have changed spi->mode in which case
++ * it will be different from cs->mode (the current hardware setup).
++ * If so, set par_override (even though its not a parity issue) so
++ * omap2_mcspi_setup_transfer will be called to configure the hardware
++ * with the correct mode on the first iteration of the loop below.
++ */
++ if (spi->mode != cs->mode)
++ par_override = 1;
++
+ omap2_mcspi_set_enable(spi, 0);
+ list_for_each_entry(t, &m->transfers, transfer_list) {
+ if (t->tx_buf == NULL && t->rx_buf == NULL && t->len) {
+diff --git a/drivers/spi/spi-orion.c b/drivers/spi/spi-orion.c
+index d018a4aac3a1..c206a4ad83cd 100644
+--- a/drivers/spi/spi-orion.c
++++ b/drivers/spi/spi-orion.c
+@@ -346,8 +346,6 @@ static int orion_spi_probe(struct platform_device *pdev)
+ struct resource *r;
+ unsigned long tclk_hz;
+ int status = 0;
+- const u32 *iprop;
+- int size;
+
+ master = spi_alloc_master(&pdev->dev, sizeof(*spi));
+ if (master == NULL) {
+@@ -358,10 +356,10 @@ static int orion_spi_probe(struct platform_device *pdev)
+ if (pdev->id != -1)
+ master->bus_num = pdev->id;
+ if (pdev->dev.of_node) {
+- iprop = of_get_property(pdev->dev.of_node, "cell-index",
+- &size);
+- if (iprop && size == sizeof(*iprop))
+- master->bus_num = *iprop;
++ u32 cell_index;
++ if (!of_property_read_u32(pdev->dev.of_node, "cell-index",
++ &cell_index))
++ master->bus_num = cell_index;
+ }
+
+ /* we support only mode 0, and no options */
+diff --git a/drivers/spi/spi-pxa2xx.c b/drivers/spi/spi-pxa2xx.c
+index fe792106bdc5..46f45ca2c694 100644
+--- a/drivers/spi/spi-pxa2xx.c
++++ b/drivers/spi/spi-pxa2xx.c
+@@ -1074,6 +1074,7 @@ static struct acpi_device_id pxa2xx_spi_acpi_match[] = {
+ { "INT3430", 0 },
+ { "INT3431", 0 },
+ { "80860F0E", 0 },
++ { "8086228E", 0 },
+ { },
+ };
+ MODULE_DEVICE_TABLE(acpi, pxa2xx_spi_acpi_match);
+diff --git a/drivers/xen/events/events_fifo.c b/drivers/xen/events/events_fifo.c
+index 500713882ad5..48dcb2e97b90 100644
+--- a/drivers/xen/events/events_fifo.c
++++ b/drivers/xen/events/events_fifo.c
+@@ -99,6 +99,25 @@ static unsigned evtchn_fifo_nr_channels(void)
+ return event_array_pages * EVENT_WORDS_PER_PAGE;
+ }
+
++static int init_control_block(int cpu,
++ struct evtchn_fifo_control_block *control_block)
++{
++ struct evtchn_fifo_queue *q = &per_cpu(cpu_queue, cpu);
++ struct evtchn_init_control init_control;
++ unsigned int i;
++
++ /* Reset the control block and the local HEADs. */
++ clear_page(control_block);
++ for (i = 0; i < EVTCHN_FIFO_MAX_QUEUES; i++)
++ q->head[i] = 0;
++
++ init_control.control_gfn = virt_to_mfn(control_block);
++ init_control.offset = 0;
++ init_control.vcpu = cpu;
++
++ return HYPERVISOR_event_channel_op(EVTCHNOP_init_control, &init_control);
++}
++
+ static void free_unused_array_pages(void)
+ {
+ unsigned i;
+@@ -323,7 +342,6 @@ static void evtchn_fifo_resume(void)
+
+ for_each_possible_cpu(cpu) {
+ void *control_block = per_cpu(cpu_control_block, cpu);
+- struct evtchn_init_control init_control;
+ int ret;
+
+ if (!control_block)
+@@ -340,12 +358,7 @@ static void evtchn_fifo_resume(void)
+ continue;
+ }
+
+- init_control.control_gfn = virt_to_mfn(control_block);
+- init_control.offset = 0;
+- init_control.vcpu = cpu;
+-
+- ret = HYPERVISOR_event_channel_op(EVTCHNOP_init_control,
+- &init_control);
++ ret = init_control_block(cpu, control_block);
+ if (ret < 0)
+ BUG();
+ }
+@@ -373,30 +386,25 @@ static const struct evtchn_ops evtchn_ops_fifo = {
+ .resume = evtchn_fifo_resume,
+ };
+
+-static int evtchn_fifo_init_control_block(unsigned cpu)
++static int evtchn_fifo_alloc_control_block(unsigned cpu)
+ {
+- struct page *control_block = NULL;
+- struct evtchn_init_control init_control;
++ void *control_block = NULL;
+ int ret = -ENOMEM;
+
+- control_block = alloc_page(GFP_KERNEL|__GFP_ZERO);
++ control_block = (void *)__get_free_page(GFP_KERNEL);
+ if (control_block == NULL)
+ goto error;
+
+- init_control.control_gfn = virt_to_mfn(page_address(control_block));
+- init_control.offset = 0;
+- init_control.vcpu = cpu;
+-
+- ret = HYPERVISOR_event_channel_op(EVTCHNOP_init_control, &init_control);
++ ret = init_control_block(cpu, control_block);
+ if (ret < 0)
+ goto error;
+
+- per_cpu(cpu_control_block, cpu) = page_address(control_block);
++ per_cpu(cpu_control_block, cpu) = control_block;
+
+ return 0;
+
+ error:
+- __free_page(control_block);
++ free_page((unsigned long)control_block);
+ return ret;
+ }
+
+@@ -410,7 +418,7 @@ static int evtchn_fifo_cpu_notification(struct notifier_block *self,
+ switch (action) {
+ case CPU_UP_PREPARE:
+ if (!per_cpu(cpu_control_block, cpu))
+- ret = evtchn_fifo_init_control_block(cpu);
++ ret = evtchn_fifo_alloc_control_block(cpu);
+ break;
+ default:
+ break;
+@@ -427,7 +435,7 @@ int __init xen_evtchn_fifo_init(void)
+ int cpu = get_cpu();
+ int ret;
+
+- ret = evtchn_fifo_init_control_block(cpu);
++ ret = evtchn_fifo_alloc_control_block(cpu);
+ if (ret < 0)
+ goto out;
+
+diff --git a/fs/cifs/cifsglob.h b/fs/cifs/cifsglob.h
+index de6aed8c78e5..c97fd86cfb1b 100644
+--- a/fs/cifs/cifsglob.h
++++ b/fs/cifs/cifsglob.h
+@@ -70,11 +70,6 @@
+ #define SERVER_NAME_LENGTH 40
+ #define SERVER_NAME_LEN_WITH_NULL (SERVER_NAME_LENGTH + 1)
+
+-/* used to define string lengths for reversing unicode strings */
+-/* (256+1)*2 = 514 */
+-/* (max path length + 1 for null) * 2 for unicode */
+-#define MAX_NAME 514
+-
+ /* SMB echo "timeout" -- FIXME: tunable? */
+ #define SMB_ECHO_INTERVAL (60 * HZ)
+
+@@ -404,6 +399,8 @@ struct smb_version_operations {
+ const struct cifs_fid *, u32 *);
+ int (*set_acl)(struct cifs_ntsd *, __u32, struct inode *, const char *,
+ int);
++ /* check if we need to issue closedir */
++ bool (*dir_needs_close)(struct cifsFileInfo *);
+ };
+
+ struct smb_version_values {
+diff --git a/fs/cifs/file.c b/fs/cifs/file.c
+index e90a1e9aa627..9de08c9dd106 100644
+--- a/fs/cifs/file.c
++++ b/fs/cifs/file.c
+@@ -762,7 +762,7 @@ int cifs_closedir(struct inode *inode, struct file *file)
+
+ cifs_dbg(FYI, "Freeing private data in close dir\n");
+ spin_lock(&cifs_file_list_lock);
+- if (!cfile->srch_inf.endOfSearch && !cfile->invalidHandle) {
++ if (server->ops->dir_needs_close(cfile)) {
+ cfile->invalidHandle = true;
+ spin_unlock(&cifs_file_list_lock);
+ if (server->ops->close_dir)
+@@ -2823,7 +2823,7 @@ cifs_uncached_read_into_pages(struct TCP_Server_Info *server,
+ total_read += result;
+ }
+
+- return total_read > 0 ? total_read : result;
++ return total_read > 0 && result != -EAGAIN ? total_read : result;
+ }
+
+ ssize_t cifs_user_readv(struct kiocb *iocb, struct iov_iter *to)
+@@ -3231,7 +3231,7 @@ cifs_readpages_read_into_pages(struct TCP_Server_Info *server,
+ total_read += result;
+ }
+
+- return total_read > 0 ? total_read : result;
++ return total_read > 0 && result != -EAGAIN ? total_read : result;
+ }
+
+ static int cifs_readpages(struct file *file, struct address_space *mapping,
+diff --git a/fs/cifs/inode.c b/fs/cifs/inode.c
+index a174605f6afa..d322e7d4e123 100644
+--- a/fs/cifs/inode.c
++++ b/fs/cifs/inode.c
+@@ -1710,13 +1710,22 @@ cifs_rename(struct inode *source_dir, struct dentry *source_dentry,
+ unlink_target:
+ /* Try unlinking the target dentry if it's not negative */
+ if (target_dentry->d_inode && (rc == -EACCES || rc == -EEXIST)) {
+- tmprc = cifs_unlink(target_dir, target_dentry);
++ if (d_is_dir(target_dentry))
++ tmprc = cifs_rmdir(target_dir, target_dentry);
++ else
++ tmprc = cifs_unlink(target_dir, target_dentry);
+ if (tmprc)
+ goto cifs_rename_exit;
+ rc = cifs_do_rename(xid, source_dentry, from_name,
+ target_dentry, to_name);
+ }
+
++ /* force revalidate to go get info when needed */
++ CIFS_I(source_dir)->time = CIFS_I(target_dir)->time = 0;
++
++ source_dir->i_ctime = source_dir->i_mtime = target_dir->i_ctime =
++ target_dir->i_mtime = current_fs_time(source_dir->i_sb);
++
+ cifs_rename_exit:
+ kfree(info_buf_source);
+ kfree(from_name);
+diff --git a/fs/cifs/readdir.c b/fs/cifs/readdir.c
+index b15862e0f68c..b334a89d6a66 100644
+--- a/fs/cifs/readdir.c
++++ b/fs/cifs/readdir.c
+@@ -593,11 +593,11 @@ find_cifs_entry(const unsigned int xid, struct cifs_tcon *tcon, loff_t pos,
+ /* close and restart search */
+ cifs_dbg(FYI, "search backing up - close and restart search\n");
+ spin_lock(&cifs_file_list_lock);
+- if (!cfile->srch_inf.endOfSearch && !cfile->invalidHandle) {
++ if (server->ops->dir_needs_close(cfile)) {
+ cfile->invalidHandle = true;
+ spin_unlock(&cifs_file_list_lock);
+- if (server->ops->close)
+- server->ops->close(xid, tcon, &cfile->fid);
++ if (server->ops->close_dir)
++ server->ops->close_dir(xid, tcon, &cfile->fid);
+ } else
+ spin_unlock(&cifs_file_list_lock);
+ if (cfile->srch_inf.ntwrk_buf_start) {
+diff --git a/fs/cifs/smb1ops.c b/fs/cifs/smb1ops.c
+index d1fdfa848703..84ca0a4caaeb 100644
+--- a/fs/cifs/smb1ops.c
++++ b/fs/cifs/smb1ops.c
+@@ -1009,6 +1009,12 @@ cifs_is_read_op(__u32 oplock)
+ return oplock == OPLOCK_READ;
+ }
+
++static bool
++cifs_dir_needs_close(struct cifsFileInfo *cfile)
++{
++ return !cfile->srch_inf.endOfSearch && !cfile->invalidHandle;
++}
++
+ struct smb_version_operations smb1_operations = {
+ .send_cancel = send_nt_cancel,
+ .compare_fids = cifs_compare_fids,
+@@ -1078,6 +1084,7 @@ struct smb_version_operations smb1_operations = {
+ .query_mf_symlink = cifs_query_mf_symlink,
+ .create_mf_symlink = cifs_create_mf_symlink,
+ .is_read_op = cifs_is_read_op,
++ .dir_needs_close = cifs_dir_needs_close,
+ #ifdef CONFIG_CIFS_XATTR
+ .query_all_EAs = CIFSSMBQAllEAs,
+ .set_EA = CIFSSMBSetEA,
+diff --git a/fs/cifs/smb2file.c b/fs/cifs/smb2file.c
+index 3f17b4550831..45992944e238 100644
+--- a/fs/cifs/smb2file.c
++++ b/fs/cifs/smb2file.c
+@@ -50,7 +50,7 @@ smb2_open_file(const unsigned int xid, struct cifs_open_parms *oparms,
+ goto out;
+ }
+
+- smb2_data = kzalloc(sizeof(struct smb2_file_all_info) + MAX_NAME * 2,
++ smb2_data = kzalloc(sizeof(struct smb2_file_all_info) + PATH_MAX * 2,
+ GFP_KERNEL);
+ if (smb2_data == NULL) {
+ rc = -ENOMEM;
+diff --git a/fs/cifs/smb2inode.c b/fs/cifs/smb2inode.c
+index 84c012a6aba0..215f8d3e3e53 100644
+--- a/fs/cifs/smb2inode.c
++++ b/fs/cifs/smb2inode.c
+@@ -131,7 +131,7 @@ smb2_query_path_info(const unsigned int xid, struct cifs_tcon *tcon,
+ *adjust_tz = false;
+ *symlink = false;
+
+- smb2_data = kzalloc(sizeof(struct smb2_file_all_info) + MAX_NAME * 2,
++ smb2_data = kzalloc(sizeof(struct smb2_file_all_info) + PATH_MAX * 2,
+ GFP_KERNEL);
+ if (smb2_data == NULL)
+ return -ENOMEM;
+diff --git a/fs/cifs/smb2maperror.c b/fs/cifs/smb2maperror.c
+index 94bd4fbb13d3..a689514e260f 100644
+--- a/fs/cifs/smb2maperror.c
++++ b/fs/cifs/smb2maperror.c
+@@ -214,7 +214,7 @@ static const struct status_to_posix_error smb2_error_map_table[] = {
+ {STATUS_BREAKPOINT, -EIO, "STATUS_BREAKPOINT"},
+ {STATUS_SINGLE_STEP, -EIO, "STATUS_SINGLE_STEP"},
+ {STATUS_BUFFER_OVERFLOW, -EIO, "STATUS_BUFFER_OVERFLOW"},
+- {STATUS_NO_MORE_FILES, -EIO, "STATUS_NO_MORE_FILES"},
++ {STATUS_NO_MORE_FILES, -ENODATA, "STATUS_NO_MORE_FILES"},
+ {STATUS_WAKE_SYSTEM_DEBUGGER, -EIO, "STATUS_WAKE_SYSTEM_DEBUGGER"},
+ {STATUS_HANDLES_CLOSED, -EIO, "STATUS_HANDLES_CLOSED"},
+ {STATUS_NO_INHERITANCE, -EIO, "STATUS_NO_INHERITANCE"},
+@@ -605,7 +605,7 @@ static const struct status_to_posix_error smb2_error_map_table[] = {
+ {STATUS_MAPPED_FILE_SIZE_ZERO, -EIO, "STATUS_MAPPED_FILE_SIZE_ZERO"},
+ {STATUS_TOO_MANY_OPENED_FILES, -EMFILE, "STATUS_TOO_MANY_OPENED_FILES"},
+ {STATUS_CANCELLED, -EIO, "STATUS_CANCELLED"},
+- {STATUS_CANNOT_DELETE, -EIO, "STATUS_CANNOT_DELETE"},
++ {STATUS_CANNOT_DELETE, -EACCES, "STATUS_CANNOT_DELETE"},
+ {STATUS_INVALID_COMPUTER_NAME, -EIO, "STATUS_INVALID_COMPUTER_NAME"},
+ {STATUS_FILE_DELETED, -EIO, "STATUS_FILE_DELETED"},
+ {STATUS_SPECIAL_ACCOUNT, -EIO, "STATUS_SPECIAL_ACCOUNT"},
+diff --git a/fs/cifs/smb2ops.c b/fs/cifs/smb2ops.c
+index 787844bde384..f325c59e12e6 100644
+--- a/fs/cifs/smb2ops.c
++++ b/fs/cifs/smb2ops.c
+@@ -339,7 +339,7 @@ smb2_query_file_info(const unsigned int xid, struct cifs_tcon *tcon,
+ int rc;
+ struct smb2_file_all_info *smb2_data;
+
+- smb2_data = kzalloc(sizeof(struct smb2_file_all_info) + MAX_NAME * 2,
++ smb2_data = kzalloc(sizeof(struct smb2_file_all_info) + PATH_MAX * 2,
+ GFP_KERNEL);
+ if (smb2_data == NULL)
+ return -ENOMEM;
+@@ -1104,6 +1104,12 @@ smb3_parse_lease_buf(void *buf, unsigned int *epoch)
+ return le32_to_cpu(lc->lcontext.LeaseState);
+ }
+
++static bool
++smb2_dir_needs_close(struct cifsFileInfo *cfile)
++{
++ return !cfile->invalidHandle;
++}
++
+ struct smb_version_operations smb20_operations = {
+ .compare_fids = smb2_compare_fids,
+ .setup_request = smb2_setup_request,
+@@ -1177,6 +1183,7 @@ struct smb_version_operations smb20_operations = {
+ .create_lease_buf = smb2_create_lease_buf,
+ .parse_lease_buf = smb2_parse_lease_buf,
+ .clone_range = smb2_clone_range,
++ .dir_needs_close = smb2_dir_needs_close,
+ };
+
+ struct smb_version_operations smb21_operations = {
+@@ -1252,6 +1259,7 @@ struct smb_version_operations smb21_operations = {
+ .create_lease_buf = smb2_create_lease_buf,
+ .parse_lease_buf = smb2_parse_lease_buf,
+ .clone_range = smb2_clone_range,
++ .dir_needs_close = smb2_dir_needs_close,
+ };
+
+ struct smb_version_operations smb30_operations = {
+@@ -1330,6 +1338,7 @@ struct smb_version_operations smb30_operations = {
+ .parse_lease_buf = smb3_parse_lease_buf,
+ .clone_range = smb2_clone_range,
+ .validate_negotiate = smb3_validate_negotiate,
++ .dir_needs_close = smb2_dir_needs_close,
+ };
+
+ struct smb_version_values smb20_values = {
+diff --git a/fs/cifs/smb2pdu.c b/fs/cifs/smb2pdu.c
+index b0b260dbb19d..87077559a0ab 100644
+--- a/fs/cifs/smb2pdu.c
++++ b/fs/cifs/smb2pdu.c
+@@ -922,7 +922,8 @@ tcon_exit:
+ tcon_error_exit:
+ if (rsp->hdr.Status == STATUS_BAD_NETWORK_NAME) {
+ cifs_dbg(VFS, "BAD_NETWORK_NAME: %s\n", tree);
+- tcon->bad_network_name = true;
++ if (tcon)
++ tcon->bad_network_name = true;
+ }
+ goto tcon_exit;
+ }
+@@ -1545,7 +1546,7 @@ SMB2_query_info(const unsigned int xid, struct cifs_tcon *tcon,
+ {
+ return query_info(xid, tcon, persistent_fid, volatile_fid,
+ FILE_ALL_INFORMATION,
+- sizeof(struct smb2_file_all_info) + MAX_NAME * 2,
++ sizeof(struct smb2_file_all_info) + PATH_MAX * 2,
+ sizeof(struct smb2_file_all_info), data);
+ }
+
+@@ -2141,6 +2142,10 @@ SMB2_query_directory(const unsigned int xid, struct cifs_tcon *tcon,
+ rsp = (struct smb2_query_directory_rsp *)iov[0].iov_base;
+
+ if (rc) {
++ if (rc == -ENODATA && rsp->hdr.Status == STATUS_NO_MORE_FILES) {
++ srch_inf->endOfSearch = true;
++ rc = 0;
++ }
+ cifs_stats_fail_inc(tcon, SMB2_QUERY_DIRECTORY_HE);
+ goto qdir_exit;
+ }
+@@ -2178,11 +2183,6 @@ SMB2_query_directory(const unsigned int xid, struct cifs_tcon *tcon,
+ else
+ cifs_dbg(VFS, "illegal search buffer type\n");
+
+- if (rsp->hdr.Status == STATUS_NO_MORE_FILES)
+- srch_inf->endOfSearch = 1;
+- else
+- srch_inf->endOfSearch = 0;
+-
+ return rc;
+
+ qdir_exit:
+diff --git a/fs/dcache.c b/fs/dcache.c
+index 06f65857a855..e1308c5423ed 100644
+--- a/fs/dcache.c
++++ b/fs/dcache.c
+@@ -106,8 +106,7 @@ static inline struct hlist_bl_head *d_hash(const struct dentry *parent,
+ unsigned int hash)
+ {
+ hash += (unsigned long) parent / L1_CACHE_BYTES;
+- hash = hash + (hash >> d_hash_shift);
+- return dentry_hashtable + (hash & d_hash_mask);
++ return dentry_hashtable + hash_32(hash, d_hash_shift);
+ }
+
+ /* Statistics gathering. */
+diff --git a/fs/namei.c b/fs/namei.c
+index 9eb787e5c167..17ca8b85c308 100644
+--- a/fs/namei.c
++++ b/fs/namei.c
+@@ -34,6 +34,7 @@
+ #include <linux/device_cgroup.h>
+ #include <linux/fs_struct.h>
+ #include <linux/posix_acl.h>
++#include <linux/hash.h>
+ #include <asm/uaccess.h>
+
+ #include "internal.h"
+@@ -1629,8 +1630,7 @@ static inline int nested_symlink(struct path *path, struct nameidata *nd)
+
+ static inline unsigned int fold_hash(unsigned long hash)
+ {
+- hash += hash >> (8*sizeof(int));
+- return hash;
++ return hash_64(hash, 32);
+ }
+
+ #else /* 32-bit case */
+diff --git a/fs/namespace.c b/fs/namespace.c
+index 182bc41cd887..140d17705683 100644
+--- a/fs/namespace.c
++++ b/fs/namespace.c
+@@ -779,6 +779,20 @@ static void attach_mnt(struct mount *mnt,
+ list_add_tail(&mnt->mnt_child, &parent->mnt_mounts);
+ }
+
++static void attach_shadowed(struct mount *mnt,
++ struct mount *parent,
++ struct mount *shadows)
++{
++ if (shadows) {
++ hlist_add_after_rcu(&shadows->mnt_hash, &mnt->mnt_hash);
++ list_add(&mnt->mnt_child, &shadows->mnt_child);
++ } else {
++ hlist_add_head_rcu(&mnt->mnt_hash,
++ m_hash(&parent->mnt, mnt->mnt_mountpoint));
++ list_add_tail(&mnt->mnt_child, &parent->mnt_mounts);
++ }
++}
++
+ /*
+ * vfsmount lock must be held for write
+ */
+@@ -797,12 +811,7 @@ static void commit_tree(struct mount *mnt, struct mount *shadows)
+
+ list_splice(&head, n->list.prev);
+
+- if (shadows)
+- hlist_add_after_rcu(&shadows->mnt_hash, &mnt->mnt_hash);
+- else
+- hlist_add_head_rcu(&mnt->mnt_hash,
+- m_hash(&parent->mnt, mnt->mnt_mountpoint));
+- list_add_tail(&mnt->mnt_child, &parent->mnt_mounts);
++ attach_shadowed(mnt, parent, shadows);
+ touch_mnt_namespace(n);
+ }
+
+@@ -890,8 +899,21 @@ static struct mount *clone_mnt(struct mount *old, struct dentry *root,
+
+ mnt->mnt.mnt_flags = old->mnt.mnt_flags & ~(MNT_WRITE_HOLD|MNT_MARKED);
+ /* Don't allow unprivileged users to change mount flags */
+- if ((flag & CL_UNPRIVILEGED) && (mnt->mnt.mnt_flags & MNT_READONLY))
+- mnt->mnt.mnt_flags |= MNT_LOCK_READONLY;
++ if (flag & CL_UNPRIVILEGED) {
++ mnt->mnt.mnt_flags |= MNT_LOCK_ATIME;
++
++ if (mnt->mnt.mnt_flags & MNT_READONLY)
++ mnt->mnt.mnt_flags |= MNT_LOCK_READONLY;
++
++ if (mnt->mnt.mnt_flags & MNT_NODEV)
++ mnt->mnt.mnt_flags |= MNT_LOCK_NODEV;
++
++ if (mnt->mnt.mnt_flags & MNT_NOSUID)
++ mnt->mnt.mnt_flags |= MNT_LOCK_NOSUID;
++
++ if (mnt->mnt.mnt_flags & MNT_NOEXEC)
++ mnt->mnt.mnt_flags |= MNT_LOCK_NOEXEC;
++ }
+
+ /* Don't allow unprivileged users to reveal what is under a mount */
+ if ((flag & CL_UNPRIVILEGED) && list_empty(&old->mnt_expire))
+@@ -1213,6 +1235,11 @@ static void namespace_unlock(void)
+ head.first->pprev = &head.first;
+ INIT_HLIST_HEAD(&unmounted);
+
++ /* undo decrements we'd done in umount_tree() */
++ hlist_for_each_entry(mnt, &head, mnt_hash)
++ if (mnt->mnt_ex_mountpoint.mnt)
++ mntget(mnt->mnt_ex_mountpoint.mnt);
++
+ up_write(&namespace_sem);
+
+ synchronize_rcu();
+@@ -1249,6 +1276,9 @@ void umount_tree(struct mount *mnt, int how)
+ hlist_add_head(&p->mnt_hash, &tmp_list);
+ }
+
++ hlist_for_each_entry(p, &tmp_list, mnt_hash)
++ list_del_init(&p->mnt_child);
++
+ if (how)
+ propagate_umount(&tmp_list);
+
+@@ -1259,9 +1289,9 @@ void umount_tree(struct mount *mnt, int how)
+ p->mnt_ns = NULL;
+ if (how < 2)
+ p->mnt.mnt_flags |= MNT_SYNC_UMOUNT;
+- list_del_init(&p->mnt_child);
+ if (mnt_has_parent(p)) {
+ put_mountpoint(p->mnt_mp);
++ mnt_add_count(p->mnt_parent, -1);
+ /* move the reference to mountpoint into ->mnt_ex_mountpoint */
+ p->mnt_ex_mountpoint.dentry = p->mnt_mountpoint;
+ p->mnt_ex_mountpoint.mnt = &p->mnt_parent->mnt;
+@@ -1492,6 +1522,7 @@ struct mount *copy_tree(struct mount *mnt, struct dentry *dentry,
+ continue;
+
+ for (s = r; s; s = next_mnt(s, r)) {
++ struct mount *t = NULL;
+ if (!(flag & CL_COPY_UNBINDABLE) &&
+ IS_MNT_UNBINDABLE(s)) {
+ s = skip_mnt_tree(s);
+@@ -1513,7 +1544,14 @@ struct mount *copy_tree(struct mount *mnt, struct dentry *dentry,
+ goto out;
+ lock_mount_hash();
+ list_add_tail(&q->mnt_list, &res->mnt_list);
+- attach_mnt(q, parent, p->mnt_mp);
++ mnt_set_mountpoint(parent, p->mnt_mp, q);
++ if (!list_empty(&parent->mnt_mounts)) {
++ t = list_last_entry(&parent->mnt_mounts,
++ struct mount, mnt_child);
++ if (t->mnt_mp != p->mnt_mp)
++ t = NULL;
++ }
++ attach_shadowed(q, parent, t);
+ unlock_mount_hash();
+ }
+ }
+@@ -1896,9 +1934,6 @@ static int change_mount_flags(struct vfsmount *mnt, int ms_flags)
+ if (readonly_request == __mnt_is_readonly(mnt))
+ return 0;
+
+- if (mnt->mnt_flags & MNT_LOCK_READONLY)
+- return -EPERM;
+-
+ if (readonly_request)
+ error = mnt_make_readonly(real_mount(mnt));
+ else
+@@ -1924,6 +1959,33 @@ static int do_remount(struct path *path, int flags, int mnt_flags,
+ if (path->dentry != path->mnt->mnt_root)
+ return -EINVAL;
+
++ /* Don't allow changing of locked mnt flags.
++ *
++ * No locks need to be held here while testing the various
++ * MNT_LOCK flags because those flags can never be cleared
++ * once they are set.
++ */
++ if ((mnt->mnt.mnt_flags & MNT_LOCK_READONLY) &&
++ !(mnt_flags & MNT_READONLY)) {
++ return -EPERM;
++ }
++ if ((mnt->mnt.mnt_flags & MNT_LOCK_NODEV) &&
++ !(mnt_flags & MNT_NODEV)) {
++ return -EPERM;
++ }
++ if ((mnt->mnt.mnt_flags & MNT_LOCK_NOSUID) &&
++ !(mnt_flags & MNT_NOSUID)) {
++ return -EPERM;
++ }
++ if ((mnt->mnt.mnt_flags & MNT_LOCK_NOEXEC) &&
++ !(mnt_flags & MNT_NOEXEC)) {
++ return -EPERM;
++ }
++ if ((mnt->mnt.mnt_flags & MNT_LOCK_ATIME) &&
++ ((mnt->mnt.mnt_flags & MNT_ATIME_MASK) != (mnt_flags & MNT_ATIME_MASK))) {
++ return -EPERM;
++ }
++
+ err = security_sb_remount(sb, data);
+ if (err)
+ return err;
+@@ -1937,7 +1999,7 @@ static int do_remount(struct path *path, int flags, int mnt_flags,
+ err = do_remount_sb(sb, flags, data, 0);
+ if (!err) {
+ lock_mount_hash();
+- mnt_flags |= mnt->mnt.mnt_flags & MNT_PROPAGATION_MASK;
++ mnt_flags |= mnt->mnt.mnt_flags & ~MNT_USER_SETTABLE_MASK;
+ mnt->mnt.mnt_flags = mnt_flags;
+ touch_mnt_namespace(mnt->mnt_ns);
+ unlock_mount_hash();
+@@ -2122,7 +2184,7 @@ static int do_new_mount(struct path *path, const char *fstype, int flags,
+ */
+ if (!(type->fs_flags & FS_USERNS_DEV_MOUNT)) {
+ flags |= MS_NODEV;
+- mnt_flags |= MNT_NODEV;
++ mnt_flags |= MNT_NODEV | MNT_LOCK_NODEV;
+ }
+ }
+
+@@ -2436,6 +2498,14 @@ long do_mount(const char *dev_name, const char *dir_name,
+ if (flags & MS_RDONLY)
+ mnt_flags |= MNT_READONLY;
+
++ /* The default atime for remount is preservation */
++ if ((flags & MS_REMOUNT) &&
++ ((flags & (MS_NOATIME | MS_NODIRATIME | MS_RELATIME |
++ MS_STRICTATIME)) == 0)) {
++ mnt_flags &= ~MNT_ATIME_MASK;
++ mnt_flags |= path.mnt->mnt_flags & MNT_ATIME_MASK;
++ }
++
+ flags &= ~(MS_NOSUID | MS_NOEXEC | MS_NODEV | MS_ACTIVE | MS_BORN |
+ MS_NOATIME | MS_NODIRATIME | MS_RELATIME| MS_KERNMOUNT |
+ MS_STRICTATIME);
+diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
+index ee9cb3795c2b..7e948ffba461 100644
+--- a/fs/notify/fanotify/fanotify.c
++++ b/fs/notify/fanotify/fanotify.c
+@@ -70,8 +70,15 @@ static int fanotify_get_response(struct fsnotify_group *group,
+ wait_event(group->fanotify_data.access_waitq, event->response ||
+ atomic_read(&group->fanotify_data.bypass_perm));
+
+- if (!event->response) /* bypass_perm set */
++ if (!event->response) { /* bypass_perm set */
++ /*
++ * Event was canceled because group is being destroyed. Remove
++ * it from group's event list because we are responsible for
++ * freeing the permission event.
++ */
++ fsnotify_remove_event(group, &event->fae.fse);
+ return 0;
++ }
+
+ /* userspace responded, convert to something usable */
+ switch (event->response) {
+diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
+index 3fdc8a3e1134..2685bc9ea2c9 100644
+--- a/fs/notify/fanotify/fanotify_user.c
++++ b/fs/notify/fanotify/fanotify_user.c
+@@ -359,6 +359,11 @@ static int fanotify_release(struct inode *ignored, struct file *file)
+ #ifdef CONFIG_FANOTIFY_ACCESS_PERMISSIONS
+ struct fanotify_perm_event_info *event, *next;
+
++ /*
++ * There may be still new events arriving in the notification queue
++ * but since userspace cannot use fanotify fd anymore, no event can
++ * enter or leave access_list by now.
++ */
+ spin_lock(&group->fanotify_data.access_lock);
+
+ atomic_inc(&group->fanotify_data.bypass_perm);
+@@ -373,6 +378,13 @@ static int fanotify_release(struct inode *ignored, struct file *file)
+ }
+ spin_unlock(&group->fanotify_data.access_lock);
+
++ /*
++ * Since bypass_perm is set, newly queued events will not wait for
++ * access response. Wake up the already sleeping ones now.
++ * synchronize_srcu() in fsnotify_destroy_group() will wait for all
++ * processes sleeping in fanotify_handle_event() waiting for access
++ * response and thus also for all permission events to be freed.
++ */
+ wake_up(&group->fanotify_data.access_waitq);
+ #endif
+
+diff --git a/fs/notify/notification.c b/fs/notify/notification.c
+index 1e58402171a5..25a07c70f1c9 100644
+--- a/fs/notify/notification.c
++++ b/fs/notify/notification.c
+@@ -73,7 +73,8 @@ void fsnotify_destroy_event(struct fsnotify_group *group,
+ /* Overflow events are per-group and we don't want to free them */
+ if (!event || event->mask == FS_Q_OVERFLOW)
+ return;
+-
++ /* If the event is still queued, we have a problem... */
++ WARN_ON(!list_empty(&event->list));
+ group->ops->free_event(event);
+ }
+
+@@ -125,6 +126,21 @@ queue:
+ }
+
+ /*
++ * Remove @event from group's notification queue. It is the responsibility of
++ * the caller to destroy the event.
++ */
++void fsnotify_remove_event(struct fsnotify_group *group,
++ struct fsnotify_event *event)
++{
++ mutex_lock(&group->notification_mutex);
++ if (!list_empty(&event->list)) {
++ list_del_init(&event->list);
++ group->q_len--;
++ }
++ mutex_unlock(&group->notification_mutex);
++}
++
++/*
+ * Remove and return the first event from the notification list. It is the
+ * responsibility of the caller to destroy the obtained event
+ */
+diff --git a/fs/ocfs2/ioctl.c b/fs/ocfs2/ioctl.c
+index 6f66b3751ace..53e6c40ed4c6 100644
+--- a/fs/ocfs2/ioctl.c
++++ b/fs/ocfs2/ioctl.c
+@@ -35,9 +35,8 @@
+ copy_to_user((typeof(a) __user *)b, &(a), sizeof(a))
+
+ /*
+- * This call is void because we are already reporting an error that may
+- * be -EFAULT. The error will be returned from the ioctl(2) call. It's
+- * just a best-effort to tell userspace that this request caused the error.
++ * This is just a best-effort to tell userspace that this request
++ * caused the error.
+ */
+ static inline void o2info_set_request_error(struct ocfs2_info_request *kreq,
+ struct ocfs2_info_request __user *req)
+@@ -146,136 +145,105 @@ bail:
+ static int ocfs2_info_handle_blocksize(struct inode *inode,
+ struct ocfs2_info_request __user *req)
+ {
+- int status = -EFAULT;
+ struct ocfs2_info_blocksize oib;
+
+ if (o2info_from_user(oib, req))
+- goto bail;
++ return -EFAULT;
+
+ oib.ib_blocksize = inode->i_sb->s_blocksize;
+
+ o2info_set_request_filled(&oib.ib_req);
+
+ if (o2info_to_user(oib, req))
+- goto bail;
+-
+- status = 0;
+-bail:
+- if (status)
+- o2info_set_request_error(&oib.ib_req, req);
++ return -EFAULT;
+
+- return status;
++ return 0;
+ }
+
+ static int ocfs2_info_handle_clustersize(struct inode *inode,
+ struct ocfs2_info_request __user *req)
+ {
+- int status = -EFAULT;
+ struct ocfs2_info_clustersize oic;
+ struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
+
+ if (o2info_from_user(oic, req))
+- goto bail;
++ return -EFAULT;
+
+ oic.ic_clustersize = osb->s_clustersize;
+
+ o2info_set_request_filled(&oic.ic_req);
+
+ if (o2info_to_user(oic, req))
+- goto bail;
+-
+- status = 0;
+-bail:
+- if (status)
+- o2info_set_request_error(&oic.ic_req, req);
++ return -EFAULT;
+
+- return status;
++ return 0;
+ }
+
+ static int ocfs2_info_handle_maxslots(struct inode *inode,
+ struct ocfs2_info_request __user *req)
+ {
+- int status = -EFAULT;
+ struct ocfs2_info_maxslots oim;
+ struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
+
+ if (o2info_from_user(oim, req))
+- goto bail;
++ return -EFAULT;
+
+ oim.im_max_slots = osb->max_slots;
+
+ o2info_set_request_filled(&oim.im_req);
+
+ if (o2info_to_user(oim, req))
+- goto bail;
++ return -EFAULT;
+
+- status = 0;
+-bail:
+- if (status)
+- o2info_set_request_error(&oim.im_req, req);
+-
+- return status;
++ return 0;
+ }
+
+ static int ocfs2_info_handle_label(struct inode *inode,
+ struct ocfs2_info_request __user *req)
+ {
+- int status = -EFAULT;
+ struct ocfs2_info_label oil;
+ struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
+
+ if (o2info_from_user(oil, req))
+- goto bail;
++ return -EFAULT;
+
+ memcpy(oil.il_label, osb->vol_label, OCFS2_MAX_VOL_LABEL_LEN);
+
+ o2info_set_request_filled(&oil.il_req);
+
+ if (o2info_to_user(oil, req))
+- goto bail;
++ return -EFAULT;
+
+- status = 0;
+-bail:
+- if (status)
+- o2info_set_request_error(&oil.il_req, req);
+-
+- return status;
++ return 0;
+ }
+
+ static int ocfs2_info_handle_uuid(struct inode *inode,
+ struct ocfs2_info_request __user *req)
+ {
+- int status = -EFAULT;
+ struct ocfs2_info_uuid oiu;
+ struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
+
+ if (o2info_from_user(oiu, req))
+- goto bail;
++ return -EFAULT;
+
+ memcpy(oiu.iu_uuid_str, osb->uuid_str, OCFS2_TEXT_UUID_LEN + 1);
+
+ o2info_set_request_filled(&oiu.iu_req);
+
+ if (o2info_to_user(oiu, req))
+- goto bail;
+-
+- status = 0;
+-bail:
+- if (status)
+- o2info_set_request_error(&oiu.iu_req, req);
++ return -EFAULT;
+
+- return status;
++ return 0;
+ }
+
+ static int ocfs2_info_handle_fs_features(struct inode *inode,
+ struct ocfs2_info_request __user *req)
+ {
+- int status = -EFAULT;
+ struct ocfs2_info_fs_features oif;
+ struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
+
+ if (o2info_from_user(oif, req))
+- goto bail;
++ return -EFAULT;
+
+ oif.if_compat_features = osb->s_feature_compat;
+ oif.if_incompat_features = osb->s_feature_incompat;
+@@ -284,39 +252,28 @@ static int ocfs2_info_handle_fs_features(struct inode *inode,
+ o2info_set_request_filled(&oif.if_req);
+
+ if (o2info_to_user(oif, req))
+- goto bail;
++ return -EFAULT;
+
+- status = 0;
+-bail:
+- if (status)
+- o2info_set_request_error(&oif.if_req, req);
+-
+- return status;
++ return 0;
+ }
+
+ static int ocfs2_info_handle_journal_size(struct inode *inode,
+ struct ocfs2_info_request __user *req)
+ {
+- int status = -EFAULT;
+ struct ocfs2_info_journal_size oij;
+ struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
+
+ if (o2info_from_user(oij, req))
+- goto bail;
++ return -EFAULT;
+
+ oij.ij_journal_size = i_size_read(osb->journal->j_inode);
+
+ o2info_set_request_filled(&oij.ij_req);
+
+ if (o2info_to_user(oij, req))
+- goto bail;
++ return -EFAULT;
+
+- status = 0;
+-bail:
+- if (status)
+- o2info_set_request_error(&oij.ij_req, req);
+-
+- return status;
++ return 0;
+ }
+
+ static int ocfs2_info_scan_inode_alloc(struct ocfs2_super *osb,
+@@ -373,7 +330,7 @@ static int ocfs2_info_handle_freeinode(struct inode *inode,
+ u32 i;
+ u64 blkno = -1;
+ char namebuf[40];
+- int status = -EFAULT, type = INODE_ALLOC_SYSTEM_INODE;
++ int status, type = INODE_ALLOC_SYSTEM_INODE;
+ struct ocfs2_info_freeinode *oifi = NULL;
+ struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
+ struct inode *inode_alloc = NULL;
+@@ -385,8 +342,10 @@ static int ocfs2_info_handle_freeinode(struct inode *inode,
+ goto out_err;
+ }
+
+- if (o2info_from_user(*oifi, req))
+- goto bail;
++ if (o2info_from_user(*oifi, req)) {
++ status = -EFAULT;
++ goto out_free;
++ }
+
+ oifi->ifi_slotnum = osb->max_slots;
+
+@@ -424,14 +383,16 @@ static int ocfs2_info_handle_freeinode(struct inode *inode,
+
+ o2info_set_request_filled(&oifi->ifi_req);
+
+- if (o2info_to_user(*oifi, req))
+- goto bail;
++ if (o2info_to_user(*oifi, req)) {
++ status = -EFAULT;
++ goto out_free;
++ }
+
+ status = 0;
+ bail:
+ if (status)
+ o2info_set_request_error(&oifi->ifi_req, req);
+-
++out_free:
+ kfree(oifi);
+ out_err:
+ return status;
+@@ -658,7 +619,7 @@ static int ocfs2_info_handle_freefrag(struct inode *inode,
+ {
+ u64 blkno = -1;
+ char namebuf[40];
+- int status = -EFAULT, type = GLOBAL_BITMAP_SYSTEM_INODE;
++ int status, type = GLOBAL_BITMAP_SYSTEM_INODE;
+
+ struct ocfs2_info_freefrag *oiff;
+ struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
+@@ -671,8 +632,10 @@ static int ocfs2_info_handle_freefrag(struct inode *inode,
+ goto out_err;
+ }
+
+- if (o2info_from_user(*oiff, req))
+- goto bail;
++ if (o2info_from_user(*oiff, req)) {
++ status = -EFAULT;
++ goto out_free;
++ }
+ /*
+ * chunksize from userspace should be power of 2.
+ */
+@@ -711,14 +674,14 @@ static int ocfs2_info_handle_freefrag(struct inode *inode,
+
+ if (o2info_to_user(*oiff, req)) {
+ status = -EFAULT;
+- goto bail;
++ goto out_free;
+ }
+
+ status = 0;
+ bail:
+ if (status)
+ o2info_set_request_error(&oiff->iff_req, req);
+-
++out_free:
+ kfree(oiff);
+ out_err:
+ return status;
+@@ -727,23 +690,17 @@ out_err:
+ static int ocfs2_info_handle_unknown(struct inode *inode,
+ struct ocfs2_info_request __user *req)
+ {
+- int status = -EFAULT;
+ struct ocfs2_info_request oir;
+
+ if (o2info_from_user(oir, req))
+- goto bail;
++ return -EFAULT;
+
+ o2info_clear_request_filled(&oir);
+
+ if (o2info_to_user(oir, req))
+- goto bail;
++ return -EFAULT;
+
+- status = 0;
+-bail:
+- if (status)
+- o2info_set_request_error(&oir, req);
+-
+- return status;
++ return 0;
+ }
+
+ /*
+diff --git a/fs/pnode.c b/fs/pnode.c
+index 302bf22c4a30..aae331a5d03b 100644
+--- a/fs/pnode.c
++++ b/fs/pnode.c
+@@ -381,6 +381,7 @@ static void __propagate_umount(struct mount *mnt)
+ * other children
+ */
+ if (child && list_empty(&child->mnt_mounts)) {
++ list_del_init(&child->mnt_child);
+ hlist_del_init_rcu(&child->mnt_hash);
+ hlist_add_before_rcu(&child->mnt_hash, &mnt->mnt_hash);
+ }
+diff --git a/fs/proc/array.c b/fs/proc/array.c
+index 64db2bceac59..3e1290b0492e 100644
+--- a/fs/proc/array.c
++++ b/fs/proc/array.c
+@@ -297,15 +297,11 @@ static void render_cap_t(struct seq_file *m, const char *header,
+ seq_puts(m, header);
+ CAP_FOR_EACH_U32(__capi) {
+ seq_printf(m, "%08x",
+- a->cap[(_KERNEL_CAPABILITY_U32S-1) - __capi]);
++ a->cap[CAP_LAST_U32 - __capi]);
+ }
+ seq_putc(m, '\n');
+ }
+
+-/* Remove non-existent capabilities */
+-#define NORM_CAPS(v) (v.cap[CAP_TO_INDEX(CAP_LAST_CAP)] &= \
+- CAP_TO_MASK(CAP_LAST_CAP + 1) - 1)
+-
+ static inline void task_cap(struct seq_file *m, struct task_struct *p)
+ {
+ const struct cred *cred;
+@@ -319,11 +315,6 @@ static inline void task_cap(struct seq_file *m, struct task_struct *p)
+ cap_bset = cred->cap_bset;
+ rcu_read_unlock();
+
+- NORM_CAPS(cap_inheritable);
+- NORM_CAPS(cap_permitted);
+- NORM_CAPS(cap_effective);
+- NORM_CAPS(cap_bset);
+-
+ render_cap_t(m, "CapInh:\t", &cap_inheritable);
+ render_cap_t(m, "CapPrm:\t", &cap_permitted);
+ render_cap_t(m, "CapEff:\t", &cap_effective);
+diff --git a/fs/reiserfs/do_balan.c b/fs/reiserfs/do_balan.c
+index 54fdf196bfb2..4d5e5297793f 100644
+--- a/fs/reiserfs/do_balan.c
++++ b/fs/reiserfs/do_balan.c
+@@ -286,12 +286,14 @@ static int balance_leaf_when_delete(struct tree_balance *tb, int flag)
+ return 0;
+ }
+
+-static void balance_leaf_insert_left(struct tree_balance *tb,
+- struct item_head *ih, const char *body)
++static unsigned int balance_leaf_insert_left(struct tree_balance *tb,
++ struct item_head *const ih,
++ const char * const body)
+ {
+ int ret;
+ struct buffer_info bi;
+ int n = B_NR_ITEMS(tb->L[0]);
++ unsigned body_shift_bytes = 0;
+
+ if (tb->item_pos == tb->lnum[0] - 1 && tb->lbytes != -1) {
+ /* part of new item falls into L[0] */
+@@ -329,7 +331,7 @@ static void balance_leaf_insert_left(struct tree_balance *tb,
+
+ put_ih_item_len(ih, new_item_len);
+ if (tb->lbytes > tb->zeroes_num) {
+- body += (tb->lbytes - tb->zeroes_num);
++ body_shift_bytes = tb->lbytes - tb->zeroes_num;
+ tb->zeroes_num = 0;
+ } else
+ tb->zeroes_num -= tb->lbytes;
+@@ -349,11 +351,12 @@ static void balance_leaf_insert_left(struct tree_balance *tb,
+ tb->insert_size[0] = 0;
+ tb->zeroes_num = 0;
+ }
++ return body_shift_bytes;
+ }
+
+ static void balance_leaf_paste_left_shift_dirent(struct tree_balance *tb,
+- struct item_head *ih,
+- const char *body)
++ struct item_head * const ih,
++ const char * const body)
+ {
+ int n = B_NR_ITEMS(tb->L[0]);
+ struct buffer_info bi;
+@@ -413,17 +416,18 @@ static void balance_leaf_paste_left_shift_dirent(struct tree_balance *tb,
+ tb->pos_in_item -= tb->lbytes;
+ }
+
+-static void balance_leaf_paste_left_shift(struct tree_balance *tb,
+- struct item_head *ih,
+- const char *body)
++static unsigned int balance_leaf_paste_left_shift(struct tree_balance *tb,
++ struct item_head * const ih,
++ const char * const body)
+ {
+ struct buffer_head *tbS0 = PATH_PLAST_BUFFER(tb->tb_path);
+ int n = B_NR_ITEMS(tb->L[0]);
+ struct buffer_info bi;
++ int body_shift_bytes = 0;
+
+ if (is_direntry_le_ih(item_head(tbS0, tb->item_pos))) {
+ balance_leaf_paste_left_shift_dirent(tb, ih, body);
+- return;
++ return 0;
+ }
+
+ RFALSE(tb->lbytes <= 0,
+@@ -497,7 +501,7 @@ static void balance_leaf_paste_left_shift(struct tree_balance *tb,
+ * insert_size[0]
+ */
+ if (l_n > tb->zeroes_num) {
+- body += (l_n - tb->zeroes_num);
++ body_shift_bytes = l_n - tb->zeroes_num;
+ tb->zeroes_num = 0;
+ } else
+ tb->zeroes_num -= l_n;
+@@ -526,13 +530,14 @@ static void balance_leaf_paste_left_shift(struct tree_balance *tb,
+ */
+ leaf_shift_left(tb, tb->lnum[0], tb->lbytes);
+ }
++ return body_shift_bytes;
+ }
+
+
+ /* appended item will be in L[0] in whole */
+ static void balance_leaf_paste_left_whole(struct tree_balance *tb,
+- struct item_head *ih,
+- const char *body)
++ struct item_head * const ih,
++ const char * const body)
+ {
+ struct buffer_head *tbS0 = PATH_PLAST_BUFFER(tb->tb_path);
+ int n = B_NR_ITEMS(tb->L[0]);
+@@ -584,39 +589,44 @@ static void balance_leaf_paste_left_whole(struct tree_balance *tb,
+ tb->zeroes_num = 0;
+ }
+
+-static void balance_leaf_paste_left(struct tree_balance *tb,
+- struct item_head *ih, const char *body)
++static unsigned int balance_leaf_paste_left(struct tree_balance *tb,
++ struct item_head * const ih,
++ const char * const body)
+ {
+ /* we must shift the part of the appended item */
+ if (tb->item_pos == tb->lnum[0] - 1 && tb->lbytes != -1)
+- balance_leaf_paste_left_shift(tb, ih, body);
++ return balance_leaf_paste_left_shift(tb, ih, body);
+ else
+ balance_leaf_paste_left_whole(tb, ih, body);
++ return 0;
+ }
+
+ /* Shift lnum[0] items from S[0] to the left neighbor L[0] */
+-static void balance_leaf_left(struct tree_balance *tb, struct item_head *ih,
+- const char *body, int flag)
++static unsigned int balance_leaf_left(struct tree_balance *tb,
++ struct item_head * const ih,
++ const char * const body, int flag)
+ {
+ if (tb->lnum[0] <= 0)
+- return;
++ return 0;
+
+ /* new item or it part falls to L[0], shift it too */
+ if (tb->item_pos < tb->lnum[0]) {
+ BUG_ON(flag != M_INSERT && flag != M_PASTE);
+
+ if (flag == M_INSERT)
+- balance_leaf_insert_left(tb, ih, body);
++ return balance_leaf_insert_left(tb, ih, body);
+ else /* M_PASTE */
+- balance_leaf_paste_left(tb, ih, body);
++ return balance_leaf_paste_left(tb, ih, body);
+ } else
+ /* new item doesn't fall into L[0] */
+ leaf_shift_left(tb, tb->lnum[0], tb->lbytes);
++ return 0;
+ }
+
+
+ static void balance_leaf_insert_right(struct tree_balance *tb,
+- struct item_head *ih, const char *body)
++ struct item_head * const ih,
++ const char * const body)
+ {
+
+ struct buffer_head *tbS0 = PATH_PLAST_BUFFER(tb->tb_path);
+@@ -704,7 +714,8 @@ static void balance_leaf_insert_right(struct tree_balance *tb,
+
+
+ static void balance_leaf_paste_right_shift_dirent(struct tree_balance *tb,
+- struct item_head *ih, const char *body)
++ struct item_head * const ih,
++ const char * const body)
+ {
+ struct buffer_head *tbS0 = PATH_PLAST_BUFFER(tb->tb_path);
+ struct buffer_info bi;
+@@ -754,7 +765,8 @@ static void balance_leaf_paste_right_shift_dirent(struct tree_balance *tb,
+ }
+
+ static void balance_leaf_paste_right_shift(struct tree_balance *tb,
+- struct item_head *ih, const char *body)
++ struct item_head * const ih,
++ const char * const body)
+ {
+ struct buffer_head *tbS0 = PATH_PLAST_BUFFER(tb->tb_path);
+ int n_shift, n_rem, r_zeroes_number, version;
+@@ -831,7 +843,8 @@ static void balance_leaf_paste_right_shift(struct tree_balance *tb,
+ }
+
+ static void balance_leaf_paste_right_whole(struct tree_balance *tb,
+- struct item_head *ih, const char *body)
++ struct item_head * const ih,
++ const char * const body)
+ {
+ struct buffer_head *tbS0 = PATH_PLAST_BUFFER(tb->tb_path);
+ int n = B_NR_ITEMS(tbS0);
+@@ -874,7 +887,8 @@ static void balance_leaf_paste_right_whole(struct tree_balance *tb,
+ }
+
+ static void balance_leaf_paste_right(struct tree_balance *tb,
+- struct item_head *ih, const char *body)
++ struct item_head * const ih,
++ const char * const body)
+ {
+ struct buffer_head *tbS0 = PATH_PLAST_BUFFER(tb->tb_path);
+ int n = B_NR_ITEMS(tbS0);
+@@ -896,8 +910,9 @@ static void balance_leaf_paste_right(struct tree_balance *tb,
+ }
+
+ /* shift rnum[0] items from S[0] to the right neighbor R[0] */
+-static void balance_leaf_right(struct tree_balance *tb, struct item_head *ih,
+- const char *body, int flag)
++static void balance_leaf_right(struct tree_balance *tb,
++ struct item_head * const ih,
++ const char * const body, int flag)
+ {
+ if (tb->rnum[0] <= 0)
+ return;
+@@ -911,8 +926,8 @@ static void balance_leaf_right(struct tree_balance *tb, struct item_head *ih,
+ }
+
+ static void balance_leaf_new_nodes_insert(struct tree_balance *tb,
+- struct item_head *ih,
+- const char *body,
++ struct item_head * const ih,
++ const char * const body,
+ struct item_head *insert_key,
+ struct buffer_head **insert_ptr,
+ int i)
+@@ -1003,8 +1018,8 @@ static void balance_leaf_new_nodes_insert(struct tree_balance *tb,
+
+ /* we append to directory item */
+ static void balance_leaf_new_nodes_paste_dirent(struct tree_balance *tb,
+- struct item_head *ih,
+- const char *body,
++ struct item_head * const ih,
++ const char * const body,
+ struct item_head *insert_key,
+ struct buffer_head **insert_ptr,
+ int i)
+@@ -1058,8 +1073,8 @@ static void balance_leaf_new_nodes_paste_dirent(struct tree_balance *tb,
+ }
+
+ static void balance_leaf_new_nodes_paste_shift(struct tree_balance *tb,
+- struct item_head *ih,
+- const char *body,
++ struct item_head * const ih,
++ const char * const body,
+ struct item_head *insert_key,
+ struct buffer_head **insert_ptr,
+ int i)
+@@ -1131,8 +1146,8 @@ static void balance_leaf_new_nodes_paste_shift(struct tree_balance *tb,
+ }
+
+ static void balance_leaf_new_nodes_paste_whole(struct tree_balance *tb,
+- struct item_head *ih,
+- const char *body,
++ struct item_head * const ih,
++ const char * const body,
+ struct item_head *insert_key,
+ struct buffer_head **insert_ptr,
+ int i)
+@@ -1184,8 +1199,8 @@ static void balance_leaf_new_nodes_paste_whole(struct tree_balance *tb,
+
+ }
+ static void balance_leaf_new_nodes_paste(struct tree_balance *tb,
+- struct item_head *ih,
+- const char *body,
++ struct item_head * const ih,
++ const char * const body,
+ struct item_head *insert_key,
+ struct buffer_head **insert_ptr,
+ int i)
+@@ -1214,8 +1229,8 @@ static void balance_leaf_new_nodes_paste(struct tree_balance *tb,
+
+ /* Fill new nodes that appear in place of S[0] */
+ static void balance_leaf_new_nodes(struct tree_balance *tb,
+- struct item_head *ih,
+- const char *body,
++ struct item_head * const ih,
++ const char * const body,
+ struct item_head *insert_key,
+ struct buffer_head **insert_ptr,
+ int flag)
+@@ -1254,8 +1269,8 @@ static void balance_leaf_new_nodes(struct tree_balance *tb,
+ }
+
+ static void balance_leaf_finish_node_insert(struct tree_balance *tb,
+- struct item_head *ih,
+- const char *body)
++ struct item_head * const ih,
++ const char * const body)
+ {
+ struct buffer_head *tbS0 = PATH_PLAST_BUFFER(tb->tb_path);
+ struct buffer_info bi;
+@@ -1271,8 +1286,8 @@ static void balance_leaf_finish_node_insert(struct tree_balance *tb,
+ }
+
+ static void balance_leaf_finish_node_paste_dirent(struct tree_balance *tb,
+- struct item_head *ih,
+- const char *body)
++ struct item_head * const ih,
++ const char * const body)
+ {
+ struct buffer_head *tbS0 = PATH_PLAST_BUFFER(tb->tb_path);
+ struct item_head *pasted = item_head(tbS0, tb->item_pos);
+@@ -1305,8 +1320,8 @@ static void balance_leaf_finish_node_paste_dirent(struct tree_balance *tb,
+ }
+
+ static void balance_leaf_finish_node_paste(struct tree_balance *tb,
+- struct item_head *ih,
+- const char *body)
++ struct item_head * const ih,
++ const char * const body)
+ {
+ struct buffer_head *tbS0 = PATH_PLAST_BUFFER(tb->tb_path);
+ struct buffer_info bi;
+@@ -1349,8 +1364,8 @@ static void balance_leaf_finish_node_paste(struct tree_balance *tb,
+ * of the affected item which remains in S
+ */
+ static void balance_leaf_finish_node(struct tree_balance *tb,
+- struct item_head *ih,
+- const char *body, int flag)
++ struct item_head * const ih,
++ const char * const body, int flag)
+ {
+ /* if we must insert or append into buffer S[0] */
+ if (0 <= tb->item_pos && tb->item_pos < tb->s0num) {
+@@ -1402,7 +1417,7 @@ static int balance_leaf(struct tree_balance *tb, struct item_head *ih,
+ && is_indirect_le_ih(item_head(tbS0, tb->item_pos)))
+ tb->pos_in_item *= UNFM_P_SIZE;
+
+- balance_leaf_left(tb, ih, body, flag);
++ body += balance_leaf_left(tb, ih, body, flag);
+
+ /* tb->lnum[0] > 0 */
+ /* Calculate new item position */
+diff --git a/fs/reiserfs/journal.c b/fs/reiserfs/journal.c
+index e8870de4627e..a88b1b3e7db3 100644
+--- a/fs/reiserfs/journal.c
++++ b/fs/reiserfs/journal.c
+@@ -1947,8 +1947,6 @@ static int do_journal_release(struct reiserfs_transaction_handle *th,
+ }
+ }
+
+- /* wait for all commits to finish */
+- cancel_delayed_work(&SB_JOURNAL(sb)->j_work);
+
+ /*
+ * We must release the write lock here because
+@@ -1956,8 +1954,14 @@ static int do_journal_release(struct reiserfs_transaction_handle *th,
+ */
+ reiserfs_write_unlock(sb);
+
++ /*
++ * Cancel flushing of old commits. Note that neither of these works
++ * will be requeued because the superblock is being shut down and
++ * doesn't have MS_ACTIVE set.
++ */
+ cancel_delayed_work_sync(&REISERFS_SB(sb)->old_work);
+- flush_workqueue(REISERFS_SB(sb)->commit_wq);
++ /* wait for all commits to finish */
++ cancel_delayed_work_sync(&SB_JOURNAL(sb)->j_work);
+
+ free_journal_ram(sb);
+
+@@ -4292,9 +4296,15 @@ static int do_journal_end(struct reiserfs_transaction_handle *th, int flags)
+ if (flush) {
+ flush_commit_list(sb, jl, 1);
+ flush_journal_list(sb, jl, 1);
+- } else if (!(jl->j_state & LIST_COMMIT_PENDING))
+- queue_delayed_work(REISERFS_SB(sb)->commit_wq,
+- &journal->j_work, HZ / 10);
++ } else if (!(jl->j_state & LIST_COMMIT_PENDING)) {
++ /*
++ * Avoid queueing work when sb is being shut down. Transaction
++ * will be flushed on journal shutdown.
++ */
++ if (sb->s_flags & MS_ACTIVE)
++ queue_delayed_work(REISERFS_SB(sb)->commit_wq,
++ &journal->j_work, HZ / 10);
++ }
+
+ /*
+ * if the next transaction has any chance of wrapping, flush
+diff --git a/fs/reiserfs/lbalance.c b/fs/reiserfs/lbalance.c
+index d6744c8b24e1..3a74d15eb814 100644
+--- a/fs/reiserfs/lbalance.c
++++ b/fs/reiserfs/lbalance.c
+@@ -899,8 +899,9 @@ void leaf_delete_items(struct buffer_info *cur_bi, int last_first,
+
+ /* insert item into the leaf node in position before */
+ void leaf_insert_into_buf(struct buffer_info *bi, int before,
+- struct item_head *inserted_item_ih,
+- const char *inserted_item_body, int zeros_number)
++ struct item_head * const inserted_item_ih,
++ const char * const inserted_item_body,
++ int zeros_number)
+ {
+ struct buffer_head *bh = bi->bi_bh;
+ int nr, free_space;
+diff --git a/fs/reiserfs/reiserfs.h b/fs/reiserfs/reiserfs.h
+index bf53888c7f59..735c2c2b4536 100644
+--- a/fs/reiserfs/reiserfs.h
++++ b/fs/reiserfs/reiserfs.h
+@@ -3216,11 +3216,12 @@ int leaf_shift_right(struct tree_balance *tb, int shift_num, int shift_bytes);
+ void leaf_delete_items(struct buffer_info *cur_bi, int last_first, int first,
+ int del_num, int del_bytes);
+ void leaf_insert_into_buf(struct buffer_info *bi, int before,
+- struct item_head *inserted_item_ih,
+- const char *inserted_item_body, int zeros_number);
+-void leaf_paste_in_buffer(struct buffer_info *bi, int pasted_item_num,
+- int pos_in_item, int paste_size, const char *body,
++ struct item_head * const inserted_item_ih,
++ const char * const inserted_item_body,
+ int zeros_number);
++void leaf_paste_in_buffer(struct buffer_info *bi, int pasted_item_num,
++ int pos_in_item, int paste_size,
++ const char * const body, int zeros_number);
+ void leaf_cut_from_buffer(struct buffer_info *bi, int cut_item_num,
+ int pos_in_item, int cut_size);
+ void leaf_paste_entries(struct buffer_info *bi, int item_num, int before,
+diff --git a/fs/reiserfs/super.c b/fs/reiserfs/super.c
+index a392cef6acc6..5fd8f57e07fc 100644
+--- a/fs/reiserfs/super.c
++++ b/fs/reiserfs/super.c
+@@ -100,7 +100,11 @@ void reiserfs_schedule_old_flush(struct super_block *s)
+ struct reiserfs_sb_info *sbi = REISERFS_SB(s);
+ unsigned long delay;
+
+- if (s->s_flags & MS_RDONLY)
++ /*
++ * Avoid scheduling flush when sb is being shut down. It can race
++ * with journal shutdown, which frees still-queued delayed work.
++ */
++ if (s->s_flags & MS_RDONLY || !(s->s_flags & MS_ACTIVE))
+ return;
+
+ spin_lock(&sbi->old_work_lock);
+diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
+index faaf716e2080..02614349690d 100644
+--- a/fs/xfs/xfs_aops.c
++++ b/fs/xfs/xfs_aops.c
+@@ -1753,11 +1753,72 @@ xfs_vm_readpages(
+ return mpage_readpages(mapping, pages, nr_pages, xfs_get_blocks);
+ }
+
++/*
++ * This is basically a copy of __set_page_dirty_buffers() with one
++ * small tweak: buffers beyond EOF do not get marked dirty. If we mark them
++ * dirty, we'll never be able to clean them because we don't write buffers
++ * beyond EOF, and that means we can't invalidate pages that span EOF
++ * that have been marked dirty. Further, the dirty state can leak into
++ * the file interior if the file is extended, resulting in all sorts of
++ * bad things happening as the state does not match the underlying data.
++ *
++ * XXX: this really indicates that bufferheads in XFS need to die. Warts like
++ * this only exist because of bufferheads and how the generic code manages them.
++ */
++STATIC int
++xfs_vm_set_page_dirty(
++ struct page *page)
++{
++ struct address_space *mapping = page->mapping;
++ struct inode *inode = mapping->host;
++ loff_t end_offset;
++ loff_t offset;
++ int newly_dirty;
++
++ if (unlikely(!mapping))
++ return !TestSetPageDirty(page);
++
++ end_offset = i_size_read(inode);
++ offset = page_offset(page);
++
++ spin_lock(&mapping->private_lock);
++ if (page_has_buffers(page)) {
++ struct buffer_head *head = page_buffers(page);
++ struct buffer_head *bh = head;
++
++ do {
++ if (offset < end_offset)
++ set_buffer_dirty(bh);
++ bh = bh->b_this_page;
++ offset += 1 << inode->i_blkbits;
++ } while (bh != head);
++ }
++ newly_dirty = !TestSetPageDirty(page);
++ spin_unlock(&mapping->private_lock);
++
++ if (newly_dirty) {
++ /* sigh - __set_page_dirty() is static, so copy it here, too */
++ unsigned long flags;
++
++ spin_lock_irqsave(&mapping->tree_lock, flags);
++ if (page->mapping) { /* Race with truncate? */
++ WARN_ON_ONCE(!PageUptodate(page));
++ account_page_dirtied(page, mapping);
++ radix_tree_tag_set(&mapping->page_tree,
++ page_index(page), PAGECACHE_TAG_DIRTY);
++ }
++ spin_unlock_irqrestore(&mapping->tree_lock, flags);
++ __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
++ }
++ return newly_dirty;
++}
++
+ const struct address_space_operations xfs_address_space_operations = {
+ .readpage = xfs_vm_readpage,
+ .readpages = xfs_vm_readpages,
+ .writepage = xfs_vm_writepage,
+ .writepages = xfs_vm_writepages,
++ .set_page_dirty = xfs_vm_set_page_dirty,
+ .releasepage = xfs_vm_releasepage,
+ .invalidatepage = xfs_vm_invalidatepage,
+ .write_begin = xfs_vm_write_begin,
+diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c
+index 3ee0cd43edc0..c9656491d823 100644
+--- a/fs/xfs/xfs_dquot.c
++++ b/fs/xfs/xfs_dquot.c
+@@ -974,7 +974,8 @@ xfs_qm_dqflush(
+ * Get the buffer containing the on-disk dquot
+ */
+ error = xfs_trans_read_buf(mp, NULL, mp->m_ddev_targp, dqp->q_blkno,
+- mp->m_quotainfo->qi_dqchunklen, 0, &bp, NULL);
++ mp->m_quotainfo->qi_dqchunklen, 0, &bp,
++ &xfs_dquot_buf_ops);
+ if (error)
+ goto out_unlock;
+
+diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
+index 1f66779d7a46..055459999660 100644
+--- a/fs/xfs/xfs_file.c
++++ b/fs/xfs/xfs_file.c
+@@ -295,7 +295,16 @@ xfs_file_read_iter(
+ xfs_rw_iunlock(ip, XFS_IOLOCK_EXCL);
+ return ret;
+ }
+- truncate_pagecache_range(VFS_I(ip), pos, -1);
++
++ /*
++ * Invalidate whole pages. This can return an error if
++ * we fail to invalidate a page, but this should never
++ * happen on XFS. Warn if it does fail.
++ */
++ ret = invalidate_inode_pages2_range(VFS_I(ip)->i_mapping,
++ pos >> PAGE_CACHE_SHIFT, -1);
++ WARN_ON_ONCE(ret);
++ ret = 0;
+ }
+ xfs_rw_ilock_demote(ip, XFS_IOLOCK_EXCL);
+ }
+@@ -634,7 +643,15 @@ xfs_file_dio_aio_write(
+ pos, -1);
+ if (ret)
+ goto out;
+- truncate_pagecache_range(VFS_I(ip), pos, -1);
++ /*
++ * Invalidate whole pages. This can return an error if
++ * we fail to invalidate a page, but this should never
++ * happen on XFS. Warn if it does fail.
++ */
++ ret = invalidate_inode_pages2_range(VFS_I(ip)->i_mapping,
++ pos >> PAGE_CACHE_SHIFT, -1);
++ WARN_ON_ONCE(ret);
++ ret = 0;
+ }
+
+ /*
+diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
+index 981af0f6504b..8c962890fe17 100644
+--- a/fs/xfs/xfs_log_recover.c
++++ b/fs/xfs/xfs_log_recover.c
+@@ -2125,6 +2125,17 @@ xlog_recover_validate_buf_type(
+ __uint16_t magic16;
+ __uint16_t magicda;
+
++ /*
++ * We can only do post recovery validation on items on CRC enabled
++ * filesystems as we need to know when the buffer was written to be able
++ * to determine if we should have replayed the item. If we replay old
++ * metadata over a newer buffer, then it will enter a temporarily
++ * inconsistent state resulting in verification failures. Hence for now
++ * just avoid the verification stage for non-crc filesystems
++ */
++ if (!xfs_sb_version_hascrc(&mp->m_sb))
++ return;
++
+ magic32 = be32_to_cpu(*(__be32 *)bp->b_addr);
+ magic16 = be16_to_cpu(*(__be16*)bp->b_addr);
+ magicda = be16_to_cpu(info->magic);
+@@ -2162,8 +2173,6 @@ xlog_recover_validate_buf_type(
+ bp->b_ops = &xfs_agf_buf_ops;
+ break;
+ case XFS_BLFT_AGFL_BUF:
+- if (!xfs_sb_version_hascrc(&mp->m_sb))
+- break;
+ if (magic32 != XFS_AGFL_MAGIC) {
+ xfs_warn(mp, "Bad AGFL block magic!");
+ ASSERT(0);
+@@ -2196,10 +2205,6 @@ xlog_recover_validate_buf_type(
+ #endif
+ break;
+ case XFS_BLFT_DINO_BUF:
+- /*
+- * we get here with inode allocation buffers, not buffers that
+- * track unlinked list changes.
+- */
+ if (magic16 != XFS_DINODE_MAGIC) {
+ xfs_warn(mp, "Bad INODE block magic!");
+ ASSERT(0);
+@@ -2279,8 +2284,6 @@ xlog_recover_validate_buf_type(
+ bp->b_ops = &xfs_attr3_leaf_buf_ops;
+ break;
+ case XFS_BLFT_ATTR_RMT_BUF:
+- if (!xfs_sb_version_hascrc(&mp->m_sb))
+- break;
+ if (magic32 != XFS_ATTR3_RMT_MAGIC) {
+ xfs_warn(mp, "Bad attr remote magic!");
+ ASSERT(0);
+@@ -2387,16 +2390,7 @@ xlog_recover_do_reg_buffer(
+ /* Shouldn't be any more regions */
+ ASSERT(i == item->ri_total);
+
+- /*
+- * We can only do post recovery validation on items on CRC enabled
+- * fielsystems as we need to know when the buffer was written to be able
+- * to determine if we should have replayed the item. If we replay old
+- * metadata over a newer buffer, then it will enter a temporarily
+- * inconsistent state resulting in verification failures. Hence for now
+- * just avoid the verification stage for non-crc filesystems
+- */
+- if (xfs_sb_version_hascrc(&mp->m_sb))
+- xlog_recover_validate_buf_type(mp, bp, buf_f);
++ xlog_recover_validate_buf_type(mp, bp, buf_f);
+ }
+
+ /*
+@@ -2504,12 +2498,29 @@ xlog_recover_buffer_pass2(
+ }
+
+ /*
+- * recover the buffer only if we get an LSN from it and it's less than
++ * Recover the buffer only if we get an LSN from it and it's less than
+ * the lsn of the transaction we are replaying.
++ *
++ * Note that we have to be extremely careful of readahead here.
++ * Readahead does not attach verifiers to the buffers, so if we don't
++ * actually do any replay after readahead because the LSN we found
++ * in the buffer is more recent than the current transaction, then we
++ * need to attach the verifier directly. Failure to do so can lead to
++ * future recovery actions (e.g. EFI and unlinked list recovery)
++ * operating on the buffers without the verifier attached. This
++ * can lead to blocks on disk having the correct content but a stale
++ * CRC.
++ *
++ * It is safe to assume these clean buffers are currently up to date.
++ * If the buffer is dirtied by a later transaction being replayed, then
++ * the verifier will be reset to match whatever recover turns that
++ * buffer into.
+ */
+ lsn = xlog_recover_get_buf_lsn(mp, bp);
+- if (lsn && lsn != -1 && XFS_LSN_CMP(lsn, current_lsn) >= 0)
++ if (lsn && lsn != -1 && XFS_LSN_CMP(lsn, current_lsn) >= 0) {
++ xlog_recover_validate_buf_type(mp, bp, buf_f);
+ goto out_release;
++ }
+
+ if (buf_f->blf_flags & XFS_BLF_INODE_BUF) {
+ error = xlog_recover_do_inode_buffer(mp, item, bp, buf_f);
+diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
+index 6d26759c779a..6c51e2f97c0a 100644
+--- a/fs/xfs/xfs_qm.c
++++ b/fs/xfs/xfs_qm.c
+@@ -1005,6 +1005,12 @@ xfs_qm_dqiter_bufs(
+ if (error)
+ break;
+
++ /*
++ * A corrupt buffer might not have a verifier attached, so
++ * make sure we have the correct one attached before writeback
++ * occurs.
++ */
++ bp->b_ops = &xfs_dquot_buf_ops;
+ xfs_qm_reset_dqcounts(mp, bp, firstid, type);
+ xfs_buf_delwri_queue(bp, buffer_list);
+ xfs_buf_relse(bp);
+@@ -1090,7 +1096,7 @@ xfs_qm_dqiterate(
+ xfs_buf_readahead(mp->m_ddev_targp,
+ XFS_FSB_TO_DADDR(mp, rablkno),
+ mp->m_quotainfo->qi_dqchunklen,
+- NULL);
++ &xfs_dquot_buf_ops);
+ rablkno++;
+ }
+ }
+diff --git a/include/acpi/acpi_bus.h b/include/acpi/acpi_bus.h
+index b5714580801a..0826a4407e8e 100644
+--- a/include/acpi/acpi_bus.h
++++ b/include/acpi/acpi_bus.h
+@@ -246,7 +246,6 @@ struct acpi_device_pnp {
+ acpi_device_name device_name; /* Driver-determined */
+ acpi_device_class device_class; /* " */
+ union acpi_object *str_obj; /* unicode string for _STR method */
+- unsigned long sun; /* _SUN */
+ };
+
+ #define acpi_device_bid(d) ((d)->pnp.bus_id)
+diff --git a/include/linux/capability.h b/include/linux/capability.h
+index 84b13ad67c1c..aa93e5ef594c 100644
+--- a/include/linux/capability.h
++++ b/include/linux/capability.h
+@@ -78,8 +78,11 @@ extern const kernel_cap_t __cap_init_eff_set;
+ # error Fix up hand-coded capability macro initializers
+ #else /* HAND-CODED capability initializers */
+
++#define CAP_LAST_U32 ((_KERNEL_CAPABILITY_U32S) - 1)
++#define CAP_LAST_U32_VALID_MASK (CAP_TO_MASK(CAP_LAST_CAP + 1) -1)
++
+ # define CAP_EMPTY_SET ((kernel_cap_t){{ 0, 0 }})
+-# define CAP_FULL_SET ((kernel_cap_t){{ ~0, ~0 }})
++# define CAP_FULL_SET ((kernel_cap_t){{ ~0, CAP_LAST_U32_VALID_MASK }})
+ # define CAP_FS_SET ((kernel_cap_t){{ CAP_FS_MASK_B0 \
+ | CAP_TO_MASK(CAP_LINUX_IMMUTABLE), \
+ CAP_FS_MASK_B1 } })
+diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
+index fc7718c6bd3e..d2be2526ec48 100644
+--- a/include/linux/fsnotify_backend.h
++++ b/include/linux/fsnotify_backend.h
+@@ -326,6 +326,8 @@ extern int fsnotify_add_notify_event(struct fsnotify_group *group,
+ struct fsnotify_event *event,
+ int (*merge)(struct list_head *,
+ struct fsnotify_event *));
++/* Remove passed event from groups notification queue */
++extern void fsnotify_remove_event(struct fsnotify_group *group, struct fsnotify_event *event);
+ /* true if the group notification queue is empty */
+ extern bool fsnotify_notify_queue_is_empty(struct fsnotify_group *group);
+ /* return, but do not dequeue the first event on the notification queue */
+diff --git a/include/linux/mount.h b/include/linux/mount.h
+index 839bac270904..b0c1e6574e7f 100644
+--- a/include/linux/mount.h
++++ b/include/linux/mount.h
+@@ -42,13 +42,20 @@ struct mnt_namespace;
+ * flag, consider how it interacts with shared mounts.
+ */
+ #define MNT_SHARED_MASK (MNT_UNBINDABLE)
+-#define MNT_PROPAGATION_MASK (MNT_SHARED | MNT_UNBINDABLE)
++#define MNT_USER_SETTABLE_MASK (MNT_NOSUID | MNT_NODEV | MNT_NOEXEC \
++ | MNT_NOATIME | MNT_NODIRATIME | MNT_RELATIME \
++ | MNT_READONLY)
++#define MNT_ATIME_MASK (MNT_NOATIME | MNT_NODIRATIME | MNT_RELATIME )
+
+ #define MNT_INTERNAL_FLAGS (MNT_SHARED | MNT_WRITE_HOLD | MNT_INTERNAL | \
+ MNT_DOOMED | MNT_SYNC_UMOUNT | MNT_MARKED)
+
+ #define MNT_INTERNAL 0x4000
+
++#define MNT_LOCK_ATIME 0x040000
++#define MNT_LOCK_NOEXEC 0x080000
++#define MNT_LOCK_NOSUID 0x100000
++#define MNT_LOCK_NODEV 0x200000
+ #define MNT_LOCK_READONLY 0x400000
+ #define MNT_LOCKED 0x800000
+ #define MNT_DOOMED 0x1000000
+diff --git a/include/linux/tpm.h b/include/linux/tpm.h
+index fff1d0976f80..8350c538b486 100644
+--- a/include/linux/tpm.h
++++ b/include/linux/tpm.h
+@@ -39,6 +39,9 @@ struct tpm_class_ops {
+ int (*send) (struct tpm_chip *chip, u8 *buf, size_t len);
+ void (*cancel) (struct tpm_chip *chip);
+ u8 (*status) (struct tpm_chip *chip);
++ bool (*update_timeouts)(struct tpm_chip *chip,
++ unsigned long *timeout_cap);
++
+ };
+
+ #if defined(CONFIG_TCG_TPM) || defined(CONFIG_TCG_TPM_MODULE)
+diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h
+index 27ab31017f09..758bc9f0f399 100644
+--- a/include/scsi/scsi_device.h
++++ b/include/scsi/scsi_device.h
+@@ -155,6 +155,7 @@ struct scsi_device {
+ unsigned skip_ms_page_8:1; /* do not use MODE SENSE page 0x08 */
+ unsigned skip_ms_page_3f:1; /* do not use MODE SENSE page 0x3f */
+ unsigned skip_vpd_pages:1; /* do not read VPD pages */
++ unsigned try_vpd_pages:1; /* attempt to read VPD pages */
+ unsigned use_192_bytes_for_3f:1; /* ask for 192 bytes from page 0x3f */
+ unsigned no_start_on_add:1; /* do not issue start on add */
+ unsigned allow_restart:1; /* issue START_UNIT in error handler */
+diff --git a/include/scsi/scsi_devinfo.h b/include/scsi/scsi_devinfo.h
+index 447d2d7466fc..183eaab7c380 100644
+--- a/include/scsi/scsi_devinfo.h
++++ b/include/scsi/scsi_devinfo.h
+@@ -32,4 +32,9 @@
+ #define BLIST_ATTACH_PQ3 0x1000000 /* Scan: Attach to PQ3 devices */
+ #define BLIST_NO_DIF 0x2000000 /* Disable T10 PI (DIF) */
+ #define BLIST_SKIP_VPD_PAGES 0x4000000 /* Ignore SBC-3 VPD pages */
++#define BLIST_SCSI3LUN 0x8000000 /* Scan more than 256 LUNs
++ for sequential scan */
++#define BLIST_TRY_VPD_PAGES 0x10000000 /* Attempt to read VPD pages */
++#define BLIST_NO_RSOC 0x20000000 /* don't try to issue RSOC */
++
+ #endif
+diff --git a/include/uapi/rdma/rdma_user_cm.h b/include/uapi/rdma/rdma_user_cm.h
+index 99b80abf360a..3066718eb120 100644
+--- a/include/uapi/rdma/rdma_user_cm.h
++++ b/include/uapi/rdma/rdma_user_cm.h
+@@ -34,6 +34,7 @@
+ #define RDMA_USER_CM_H
+
+ #include <linux/types.h>
++#include <linux/socket.h>
+ #include <linux/in6.h>
+ #include <rdma/ib_user_verbs.h>
+ #include <rdma/ib_user_sa.h>
+diff --git a/kernel/audit.c b/kernel/audit.c
+index 3ef2e0e797e8..ba2ff5a5c600 100644
+--- a/kernel/audit.c
++++ b/kernel/audit.c
+@@ -1677,7 +1677,7 @@ void audit_log_cap(struct audit_buffer *ab, char *prefix, kernel_cap_t *cap)
+ audit_log_format(ab, " %s=", prefix);
+ CAP_FOR_EACH_U32(i) {
+ audit_log_format(ab, "%08x",
+- cap->cap[(_KERNEL_CAPABILITY_U32S-1) - i]);
++ cap->cap[CAP_LAST_U32 - i]);
+ }
+ }
+
+diff --git a/kernel/capability.c b/kernel/capability.c
+index a5cf13c018ce..989f5bfc57dc 100644
+--- a/kernel/capability.c
++++ b/kernel/capability.c
+@@ -258,6 +258,10 @@ SYSCALL_DEFINE2(capset, cap_user_header_t, header, const cap_user_data_t, data)
+ i++;
+ }
+
++ effective.cap[CAP_LAST_U32] &= CAP_LAST_U32_VALID_MASK;
++ permitted.cap[CAP_LAST_U32] &= CAP_LAST_U32_VALID_MASK;
++ inheritable.cap[CAP_LAST_U32] &= CAP_LAST_U32_VALID_MASK;
++
+ new = prepare_creds();
+ if (!new)
+ return -ENOMEM;
+diff --git a/kernel/smp.c b/kernel/smp.c
+index 80c33f8de14f..86e59ee8dd76 100644
+--- a/kernel/smp.c
++++ b/kernel/smp.c
+@@ -661,7 +661,7 @@ void on_each_cpu_cond(bool (*cond_func)(int cpu, void *info),
+ if (cond_func(cpu, info)) {
+ ret = smp_call_function_single(cpu, func,
+ info, wait);
+- WARN_ON_ONCE(!ret);
++ WARN_ON_ONCE(ret);
+ }
+ preempt_enable();
+ }
+diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
+index ff7027199a9a..b95381ebdd5e 100644
+--- a/kernel/trace/ring_buffer.c
++++ b/kernel/trace/ring_buffer.c
+@@ -1984,7 +1984,7 @@ rb_add_time_stamp(struct ring_buffer_event *event, u64 delta)
+
+ /**
+ * rb_update_event - update event type and data
+- * @event: the even to update
++ * @event: the event to update
+ * @type: the type of event
+ * @length: the size of the event field in the ring buffer
+ *
+@@ -3357,21 +3357,16 @@ static void rb_iter_reset(struct ring_buffer_iter *iter)
+ struct ring_buffer_per_cpu *cpu_buffer = iter->cpu_buffer;
+
+ /* Iterator usage is expected to have record disabled */
+- if (list_empty(&cpu_buffer->reader_page->list)) {
+- iter->head_page = rb_set_head_page(cpu_buffer);
+- if (unlikely(!iter->head_page))
+- return;
+- iter->head = iter->head_page->read;
+- } else {
+- iter->head_page = cpu_buffer->reader_page;
+- iter->head = cpu_buffer->reader_page->read;
+- }
++ iter->head_page = cpu_buffer->reader_page;
++ iter->head = cpu_buffer->reader_page->read;
++
++ iter->cache_reader_page = iter->head_page;
++ iter->cache_read = iter->head;
++
+ if (iter->head)
+ iter->read_stamp = cpu_buffer->read_stamp;
+ else
+ iter->read_stamp = iter->head_page->page->time_stamp;
+- iter->cache_reader_page = cpu_buffer->reader_page;
+- iter->cache_read = cpu_buffer->read;
+ }
+
+ /**
+@@ -3764,12 +3759,14 @@ rb_iter_peek(struct ring_buffer_iter *iter, u64 *ts)
+ return NULL;
+
+ /*
+- * We repeat when a time extend is encountered.
+- * Since the time extend is always attached to a data event,
+- * we should never loop more than once.
+- * (We never hit the following condition more than twice).
++ * We repeat when a time extend is encountered or we hit
++ * the end of the page. Since the time extend is always attached
++ * to a data event, we should never loop more than three times.
++ * Once for going to next page, once on time extend, and
++ * finally once to get the event.
++ * (We never hit the following condition more than thrice).
+ */
+- if (RB_WARN_ON(cpu_buffer, ++nr_loops > 2))
++ if (RB_WARN_ON(cpu_buffer, ++nr_loops > 3))
+ return NULL;
+
+ if (rb_per_cpu_empty(cpu_buffer))
+diff --git a/lib/assoc_array.c b/lib/assoc_array.c
+index c0b1007011e1..2404d03e251a 100644
+--- a/lib/assoc_array.c
++++ b/lib/assoc_array.c
+@@ -1723,11 +1723,13 @@ ascend_old_tree:
+ shortcut = assoc_array_ptr_to_shortcut(ptr);
+ slot = shortcut->parent_slot;
+ cursor = shortcut->back_pointer;
++ if (!cursor)
++ goto gc_complete;
+ } else {
+ slot = node->parent_slot;
+ cursor = ptr;
+ }
+- BUG_ON(!ptr);
++ BUG_ON(!cursor);
+ node = assoc_array_ptr_to_node(cursor);
+ slot++;
+ goto continue_node;
+@@ -1735,7 +1737,7 @@ ascend_old_tree:
+ gc_complete:
+ edit->set[0].to = new_root;
+ assoc_array_apply_edit(edit);
+- edit->array->nr_leaves_on_tree = nr_leaves_on_tree;
++ array->nr_leaves_on_tree = nr_leaves_on_tree;
+ return 0;
+
+ enomem:
+diff --git a/mm/filemap.c b/mm/filemap.c
+index 900edfaf6df5..8163e0439493 100644
+--- a/mm/filemap.c
++++ b/mm/filemap.c
+@@ -2584,7 +2584,7 @@ ssize_t __generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
+ * that this differs from normal direct-io semantics, which
+ * will return -EFOO even if some bytes were written.
+ */
+- if (unlikely(status < 0) && !written) {
++ if (unlikely(status < 0)) {
+ err = status;
+ goto out;
+ }
+diff --git a/mm/hugetlb.c b/mm/hugetlb.c
+index 7a0a73d2fcff..7ae54449f252 100644
+--- a/mm/hugetlb.c
++++ b/mm/hugetlb.c
+@@ -1089,6 +1089,9 @@ void dissolve_free_huge_pages(unsigned long start_pfn, unsigned long end_pfn)
+ unsigned long pfn;
+ struct hstate *h;
+
++ if (!hugepages_supported())
++ return;
++
+ /* Set scan step to minimum hugepage size */
+ for_each_hstate(h)
+ if (order > huge_page_order(h))
+diff --git a/net/bluetooth/hci_event.c b/net/bluetooth/hci_event.c
+index 640c54ec1bd2..3787be160c2b 100644
+--- a/net/bluetooth/hci_event.c
++++ b/net/bluetooth/hci_event.c
+@@ -3538,18 +3538,14 @@ static void hci_io_capa_request_evt(struct hci_dev *hdev, struct sk_buff *skb)
+
+ /* If we are initiators, there is no remote information yet */
+ if (conn->remote_auth == 0xff) {
+- cp.authentication = conn->auth_type;
+-
+ /* Request MITM protection if our IO caps allow it
+ * except for the no-bonding case.
+- * conn->auth_type is not updated here since
+- * that might cause the user confirmation to be
+- * rejected in case the remote doesn't have the
+- * IO capabilities for MITM.
+ */
+ if (conn->io_capability != HCI_IO_NO_INPUT_OUTPUT &&
+- cp.authentication != HCI_AT_NO_BONDING)
+- cp.authentication |= 0x01;
++ conn->auth_type != HCI_AT_NO_BONDING)
++ conn->auth_type |= 0x01;
++
++ cp.authentication = conn->auth_type;
+ } else {
+ conn->auth_type = hci_get_auth_req(conn);
+ cp.authentication = conn->auth_type;
+@@ -3621,9 +3617,12 @@ static void hci_user_confirm_request_evt(struct hci_dev *hdev,
+ rem_mitm = (conn->remote_auth & 0x01);
+
+ /* If we require MITM but the remote device can't provide that
+- * (it has NoInputNoOutput) then reject the confirmation request
++ * (it has NoInputNoOutput) then reject the confirmation
++ * request. We check the security level here since it doesn't
++ * necessarily match conn->auth_type.
+ */
+- if (loc_mitm && conn->remote_cap == HCI_IO_NO_INPUT_OUTPUT) {
++ if (conn->pending_sec_level > BT_SECURITY_MEDIUM &&
++ conn->remote_cap == HCI_IO_NO_INPUT_OUTPUT) {
+ BT_DBG("Rejecting request: remote device can't provide MITM");
+ hci_send_cmd(hdev, HCI_OP_USER_CONFIRM_NEG_REPLY,
+ sizeof(ev->bdaddr), &ev->bdaddr);
+@@ -4177,8 +4176,8 @@ static void process_adv_report(struct hci_dev *hdev, u8 type, bdaddr_t *bdaddr,
+ * sending a merged device found event.
+ */
+ mgmt_device_found(hdev, &d->last_adv_addr, LE_LINK,
+- d->last_adv_addr_type, NULL, rssi, 0, 1, data, len,
+- d->last_adv_data, d->last_adv_data_len);
++ d->last_adv_addr_type, NULL, rssi, 0, 1,
++ d->last_adv_data, d->last_adv_data_len, data, len);
+ clear_pending_adv_report(hdev);
+ }
+
+diff --git a/net/bluetooth/l2cap_sock.c b/net/bluetooth/l2cap_sock.c
+index e1378693cc90..d0fd8b04f2e6 100644
+--- a/net/bluetooth/l2cap_sock.c
++++ b/net/bluetooth/l2cap_sock.c
+@@ -1111,7 +1111,8 @@ static int l2cap_sock_shutdown(struct socket *sock, int how)
+ l2cap_chan_close(chan, 0);
+ lock_sock(sk);
+
+- if (sock_flag(sk, SOCK_LINGER) && sk->sk_lingertime)
++ if (sock_flag(sk, SOCK_LINGER) && sk->sk_lingertime &&
++ !(current->flags & PF_EXITING))
+ err = bt_sock_wait_state(sk, BT_CLOSED,
+ sk->sk_lingertime);
+ }
+diff --git a/net/bluetooth/rfcomm/core.c b/net/bluetooth/rfcomm/core.c
+index 754b6fe4f742..881f7de412cc 100644
+--- a/net/bluetooth/rfcomm/core.c
++++ b/net/bluetooth/rfcomm/core.c
+@@ -1909,10 +1909,13 @@ static struct rfcomm_session *rfcomm_process_rx(struct rfcomm_session *s)
+ /* Get data directly from socket receive queue without copying it. */
+ while ((skb = skb_dequeue(&sk->sk_receive_queue))) {
+ skb_orphan(skb);
+- if (!skb_linearize(skb))
++ if (!skb_linearize(skb)) {
+ s = rfcomm_recv_frame(s, skb);
+- else
++ if (!s)
++ break;
++ } else {
+ kfree_skb(skb);
++ }
+ }
+
+ if (s && (sk->sk_state == BT_CLOSED))
+diff --git a/net/bluetooth/rfcomm/sock.c b/net/bluetooth/rfcomm/sock.c
+index c603a5eb4720..8bbbb5ec468c 100644
+--- a/net/bluetooth/rfcomm/sock.c
++++ b/net/bluetooth/rfcomm/sock.c
+@@ -918,7 +918,8 @@ static int rfcomm_sock_shutdown(struct socket *sock, int how)
+ sk->sk_shutdown = SHUTDOWN_MASK;
+ __rfcomm_sock_close(sk);
+
+- if (sock_flag(sk, SOCK_LINGER) && sk->sk_lingertime)
++ if (sock_flag(sk, SOCK_LINGER) && sk->sk_lingertime &&
++ !(current->flags & PF_EXITING))
+ err = bt_sock_wait_state(sk, BT_CLOSED, sk->sk_lingertime);
+ }
+ release_sock(sk);
+diff --git a/net/bluetooth/sco.c b/net/bluetooth/sco.c
+index c06dbd3938e8..dbbbc0292bd0 100644
+--- a/net/bluetooth/sco.c
++++ b/net/bluetooth/sco.c
+@@ -909,7 +909,8 @@ static int sco_sock_shutdown(struct socket *sock, int how)
+ sco_sock_clear_timer(sk);
+ __sco_sock_close(sk);
+
+- if (sock_flag(sk, SOCK_LINGER) && sk->sk_lingertime)
++ if (sock_flag(sk, SOCK_LINGER) && sk->sk_lingertime &&
++ !(current->flags & PF_EXITING))
+ err = bt_sock_wait_state(sk, BT_CLOSED,
+ sk->sk_lingertime);
+ }
+@@ -929,7 +930,8 @@ static int sco_sock_release(struct socket *sock)
+
+ sco_sock_close(sk);
+
+- if (sock_flag(sk, SOCK_LINGER) && sk->sk_lingertime) {
++ if (sock_flag(sk, SOCK_LINGER) && sk->sk_lingertime &&
++ !(current->flags & PF_EXITING)) {
+ lock_sock(sk);
+ err = bt_sock_wait_state(sk, BT_CLOSED, sk->sk_lingertime);
+ release_sock(sk);
+diff --git a/net/ceph/auth_x.c b/net/ceph/auth_x.c
+index 96238ba95f2b..de6662b14e1f 100644
+--- a/net/ceph/auth_x.c
++++ b/net/ceph/auth_x.c
+@@ -13,8 +13,6 @@
+ #include "auth_x.h"
+ #include "auth_x_protocol.h"
+
+-#define TEMP_TICKET_BUF_LEN 256
+-
+ static void ceph_x_validate_tickets(struct ceph_auth_client *ac, int *pneed);
+
+ static int ceph_x_is_authenticated(struct ceph_auth_client *ac)
+@@ -64,7 +62,7 @@ static int ceph_x_encrypt(struct ceph_crypto_key *secret,
+ }
+
+ static int ceph_x_decrypt(struct ceph_crypto_key *secret,
+- void **p, void *end, void *obuf, size_t olen)
++ void **p, void *end, void **obuf, size_t olen)
+ {
+ struct ceph_x_encrypt_header head;
+ size_t head_len = sizeof(head);
+@@ -75,8 +73,14 @@ static int ceph_x_decrypt(struct ceph_crypto_key *secret,
+ return -EINVAL;
+
+ dout("ceph_x_decrypt len %d\n", len);
+- ret = ceph_decrypt2(secret, &head, &head_len, obuf, &olen,
+- *p, len);
++ if (*obuf == NULL) {
++ *obuf = kmalloc(len, GFP_NOFS);
++ if (!*obuf)
++ return -ENOMEM;
++ olen = len;
++ }
++
++ ret = ceph_decrypt2(secret, &head, &head_len, *obuf, &olen, *p, len);
+ if (ret)
+ return ret;
+ if (head.struct_v != 1 || le64_to_cpu(head.magic) != CEPHX_ENC_MAGIC)
+@@ -129,139 +133,120 @@ static void remove_ticket_handler(struct ceph_auth_client *ac,
+ kfree(th);
+ }
+
+-static int ceph_x_proc_ticket_reply(struct ceph_auth_client *ac,
+- struct ceph_crypto_key *secret,
+- void *buf, void *end)
++static int process_one_ticket(struct ceph_auth_client *ac,
++ struct ceph_crypto_key *secret,
++ void **p, void *end)
+ {
+ struct ceph_x_info *xi = ac->private;
+- int num;
+- void *p = buf;
++ int type;
++ u8 tkt_struct_v, blob_struct_v;
++ struct ceph_x_ticket_handler *th;
++ void *dbuf = NULL;
++ void *dp, *dend;
++ int dlen;
++ char is_enc;
++ struct timespec validity;
++ struct ceph_crypto_key old_key;
++ void *ticket_buf = NULL;
++ void *tp, *tpend;
++ struct ceph_timespec new_validity;
++ struct ceph_crypto_key new_session_key;
++ struct ceph_buffer *new_ticket_blob;
++ unsigned long new_expires, new_renew_after;
++ u64 new_secret_id;
+ int ret;
+- char *dbuf;
+- char *ticket_buf;
+- u8 reply_struct_v;
+
+- dbuf = kmalloc(TEMP_TICKET_BUF_LEN, GFP_NOFS);
+- if (!dbuf)
+- return -ENOMEM;
++ ceph_decode_need(p, end, sizeof(u32) + 1, bad);
+
+- ret = -ENOMEM;
+- ticket_buf = kmalloc(TEMP_TICKET_BUF_LEN, GFP_NOFS);
+- if (!ticket_buf)
+- goto out_dbuf;
++ type = ceph_decode_32(p);
++ dout(" ticket type %d %s\n", type, ceph_entity_type_name(type));
+
+- ceph_decode_need(&p, end, 1 + sizeof(u32), bad);
+- reply_struct_v = ceph_decode_8(&p);
+- if (reply_struct_v != 1)
++ tkt_struct_v = ceph_decode_8(p);
++ if (tkt_struct_v != 1)
+ goto bad;
+- num = ceph_decode_32(&p);
+- dout("%d tickets\n", num);
+- while (num--) {
+- int type;
+- u8 tkt_struct_v, blob_struct_v;
+- struct ceph_x_ticket_handler *th;
+- void *dp, *dend;
+- int dlen;
+- char is_enc;
+- struct timespec validity;
+- struct ceph_crypto_key old_key;
+- void *tp, *tpend;
+- struct ceph_timespec new_validity;
+- struct ceph_crypto_key new_session_key;
+- struct ceph_buffer *new_ticket_blob;
+- unsigned long new_expires, new_renew_after;
+- u64 new_secret_id;
+-
+- ceph_decode_need(&p, end, sizeof(u32) + 1, bad);
+-
+- type = ceph_decode_32(&p);
+- dout(" ticket type %d %s\n", type, ceph_entity_type_name(type));
+-
+- tkt_struct_v = ceph_decode_8(&p);
+- if (tkt_struct_v != 1)
+- goto bad;
+-
+- th = get_ticket_handler(ac, type);
+- if (IS_ERR(th)) {
+- ret = PTR_ERR(th);
+- goto out;
+- }
+
+- /* blob for me */
+- dlen = ceph_x_decrypt(secret, &p, end, dbuf,
+- TEMP_TICKET_BUF_LEN);
+- if (dlen <= 0) {
+- ret = dlen;
+- goto out;
+- }
+- dout(" decrypted %d bytes\n", dlen);
+- dend = dbuf + dlen;
+- dp = dbuf;
++ th = get_ticket_handler(ac, type);
++ if (IS_ERR(th)) {
++ ret = PTR_ERR(th);
++ goto out;
++ }
+
+- tkt_struct_v = ceph_decode_8(&dp);
+- if (tkt_struct_v != 1)
+- goto bad;
++ /* blob for me */
++ dlen = ceph_x_decrypt(secret, p, end, &dbuf, 0);
++ if (dlen <= 0) {
++ ret = dlen;
++ goto out;
++ }
++ dout(" decrypted %d bytes\n", dlen);
++ dp = dbuf;
++ dend = dp + dlen;
+
+- memcpy(&old_key, &th->session_key, sizeof(old_key));
+- ret = ceph_crypto_key_decode(&new_session_key, &dp, dend);
+- if (ret)
+- goto out;
++ tkt_struct_v = ceph_decode_8(&dp);
++ if (tkt_struct_v != 1)
++ goto bad;
+
+- ceph_decode_copy(&dp, &new_validity, sizeof(new_validity));
+- ceph_decode_timespec(&validity, &new_validity);
+- new_expires = get_seconds() + validity.tv_sec;
+- new_renew_after = new_expires - (validity.tv_sec / 4);
+- dout(" expires=%lu renew_after=%lu\n", new_expires,
+- new_renew_after);
++ memcpy(&old_key, &th->session_key, sizeof(old_key));
++ ret = ceph_crypto_key_decode(&new_session_key, &dp, dend);
++ if (ret)
++ goto out;
+
+- /* ticket blob for service */
+- ceph_decode_8_safe(&p, end, is_enc, bad);
+- tp = ticket_buf;
+- if (is_enc) {
+- /* encrypted */
+- dout(" encrypted ticket\n");
+- dlen = ceph_x_decrypt(&old_key, &p, end, ticket_buf,
+- TEMP_TICKET_BUF_LEN);
+- if (dlen < 0) {
+- ret = dlen;
+- goto out;
+- }
+- dlen = ceph_decode_32(&tp);
+- } else {
+- /* unencrypted */
+- ceph_decode_32_safe(&p, end, dlen, bad);
+- ceph_decode_need(&p, end, dlen, bad);
+- ceph_decode_copy(&p, ticket_buf, dlen);
++ ceph_decode_copy(&dp, &new_validity, sizeof(new_validity));
++ ceph_decode_timespec(&validity, &new_validity);
++ new_expires = get_seconds() + validity.tv_sec;
++ new_renew_after = new_expires - (validity.tv_sec / 4);
++ dout(" expires=%lu renew_after=%lu\n", new_expires,
++ new_renew_after);
++
++ /* ticket blob for service */
++ ceph_decode_8_safe(p, end, is_enc, bad);
++ if (is_enc) {
++ /* encrypted */
++ dout(" encrypted ticket\n");
++ dlen = ceph_x_decrypt(&old_key, p, end, &ticket_buf, 0);
++ if (dlen < 0) {
++ ret = dlen;
++ goto out;
+ }
+- tpend = tp + dlen;
+- dout(" ticket blob is %d bytes\n", dlen);
+- ceph_decode_need(&tp, tpend, 1 + sizeof(u64), bad);
+- blob_struct_v = ceph_decode_8(&tp);
+- new_secret_id = ceph_decode_64(&tp);
+- ret = ceph_decode_buffer(&new_ticket_blob, &tp, tpend);
+- if (ret)
++ tp = ticket_buf;
++ dlen = ceph_decode_32(&tp);
++ } else {
++ /* unencrypted */
++ ceph_decode_32_safe(p, end, dlen, bad);
++ ticket_buf = kmalloc(dlen, GFP_NOFS);
++ if (!ticket_buf) {
++ ret = -ENOMEM;
+ goto out;
+-
+- /* all is well, update our ticket */
+- ceph_crypto_key_destroy(&th->session_key);
+- if (th->ticket_blob)
+- ceph_buffer_put(th->ticket_blob);
+- th->session_key = new_session_key;
+- th->ticket_blob = new_ticket_blob;
+- th->validity = new_validity;
+- th->secret_id = new_secret_id;
+- th->expires = new_expires;
+- th->renew_after = new_renew_after;
+- dout(" got ticket service %d (%s) secret_id %lld len %d\n",
+- type, ceph_entity_type_name(type), th->secret_id,
+- (int)th->ticket_blob->vec.iov_len);
+- xi->have_keys |= th->service;
++ }
++ tp = ticket_buf;
++ ceph_decode_need(p, end, dlen, bad);
++ ceph_decode_copy(p, ticket_buf, dlen);
+ }
++ tpend = tp + dlen;
++ dout(" ticket blob is %d bytes\n", dlen);
++ ceph_decode_need(&tp, tpend, 1 + sizeof(u64), bad);
++ blob_struct_v = ceph_decode_8(&tp);
++ new_secret_id = ceph_decode_64(&tp);
++ ret = ceph_decode_buffer(&new_ticket_blob, &tp, tpend);
++ if (ret)
++ goto out;
++
++ /* all is well, update our ticket */
++ ceph_crypto_key_destroy(&th->session_key);
++ if (th->ticket_blob)
++ ceph_buffer_put(th->ticket_blob);
++ th->session_key = new_session_key;
++ th->ticket_blob = new_ticket_blob;
++ th->validity = new_validity;
++ th->secret_id = new_secret_id;
++ th->expires = new_expires;
++ th->renew_after = new_renew_after;
++ dout(" got ticket service %d (%s) secret_id %lld len %d\n",
++ type, ceph_entity_type_name(type), th->secret_id,
++ (int)th->ticket_blob->vec.iov_len);
++ xi->have_keys |= th->service;
+
+- ret = 0;
+ out:
+ kfree(ticket_buf);
+-out_dbuf:
+ kfree(dbuf);
+ return ret;
+
+@@ -270,6 +255,34 @@ bad:
+ goto out;
+ }
+
++static int ceph_x_proc_ticket_reply(struct ceph_auth_client *ac,
++ struct ceph_crypto_key *secret,
++ void *buf, void *end)
++{
++ void *p = buf;
++ u8 reply_struct_v;
++ u32 num;
++ int ret;
++
++ ceph_decode_8_safe(&p, end, reply_struct_v, bad);
++ if (reply_struct_v != 1)
++ return -EINVAL;
++
++ ceph_decode_32_safe(&p, end, num, bad);
++ dout("%d tickets\n", num);
++
++ while (num--) {
++ ret = process_one_ticket(ac, secret, &p, end);
++ if (ret)
++ return ret;
++ }
++
++ return 0;
++
++bad:
++ return -EINVAL;
++}
++
+ static int ceph_x_build_authorizer(struct ceph_auth_client *ac,
+ struct ceph_x_ticket_handler *th,
+ struct ceph_x_authorizer *au)
+@@ -583,13 +596,14 @@ static int ceph_x_verify_authorizer_reply(struct ceph_auth_client *ac,
+ struct ceph_x_ticket_handler *th;
+ int ret = 0;
+ struct ceph_x_authorize_reply reply;
++ void *preply = &reply;
+ void *p = au->reply_buf;
+ void *end = p + sizeof(au->reply_buf);
+
+ th = get_ticket_handler(ac, au->service);
+ if (IS_ERR(th))
+ return PTR_ERR(th);
+- ret = ceph_x_decrypt(&th->session_key, &p, end, &reply, sizeof(reply));
++ ret = ceph_x_decrypt(&th->session_key, &p, end, &preply, sizeof(reply));
+ if (ret < 0)
+ return ret;
+ if (ret != sizeof(reply))
+diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
+index 1948d592aa54..3d9ddc2842e1 100644
+--- a/net/ceph/messenger.c
++++ b/net/ceph/messenger.c
+@@ -900,7 +900,7 @@ static void ceph_msg_data_pages_cursor_init(struct ceph_msg_data_cursor *cursor,
+ BUG_ON(page_count > (int)USHRT_MAX);
+ cursor->page_count = (unsigned short)page_count;
+ BUG_ON(length > SIZE_MAX - cursor->page_offset);
+- cursor->last_piece = (size_t)cursor->page_offset + length <= PAGE_SIZE;
++ cursor->last_piece = cursor->page_offset + cursor->resid <= PAGE_SIZE;
+ }
+
+ static struct page *
+diff --git a/net/ceph/mon_client.c b/net/ceph/mon_client.c
+index 067d3af2eaf6..61fcfc304f68 100644
+--- a/net/ceph/mon_client.c
++++ b/net/ceph/mon_client.c
+@@ -1181,7 +1181,15 @@ static struct ceph_msg *mon_alloc_msg(struct ceph_connection *con,
+ if (!m) {
+ pr_info("alloc_msg unknown type %d\n", type);
+ *skip = 1;
++ } else if (front_len > m->front_alloc_len) {
++ pr_warning("mon_alloc_msg front %d > prealloc %d (%u#%llu)\n",
++ front_len, m->front_alloc_len,
++ (unsigned int)con->peer_name.type,
++ le64_to_cpu(con->peer_name.num));
++ ceph_msg_put(m);
++ m = ceph_msg_new(type, front_len, GFP_NOFS, false);
+ }
++
+ return m;
+ }
+
+diff --git a/security/commoncap.c b/security/commoncap.c
+index b9d613e0ef14..963dc5981661 100644
+--- a/security/commoncap.c
++++ b/security/commoncap.c
+@@ -421,6 +421,9 @@ int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data
+ cpu_caps->inheritable.cap[i] = le32_to_cpu(caps.data[i].inheritable);
+ }
+
++ cpu_caps->permitted.cap[CAP_LAST_U32] &= CAP_LAST_U32_VALID_MASK;
++ cpu_caps->inheritable.cap[CAP_LAST_U32] &= CAP_LAST_U32_VALID_MASK;
++
+ return 0;
+ }
+
+diff --git a/sound/soc/blackfin/bf5xx-i2s-pcm.c b/sound/soc/blackfin/bf5xx-i2s-pcm.c
+index a3881c4381c9..bcf591373a7a 100644
+--- a/sound/soc/blackfin/bf5xx-i2s-pcm.c
++++ b/sound/soc/blackfin/bf5xx-i2s-pcm.c
+@@ -290,19 +290,19 @@ static int bf5xx_pcm_silence(struct snd_pcm_substream *substream,
+ unsigned int sample_size = runtime->sample_bits / 8;
+ void *buf = runtime->dma_area;
+ struct bf5xx_i2s_pcm_data *dma_data;
+- unsigned int offset, size;
++ unsigned int offset, samples;
+
+ dma_data = snd_soc_dai_get_dma_data(rtd->cpu_dai, substream);
+
+ if (dma_data->tdm_mode) {
+ offset = pos * 8 * sample_size;
+- size = count * 8 * sample_size;
++ samples = count * 8;
+ } else {
+ offset = frames_to_bytes(runtime, pos);
+- size = frames_to_bytes(runtime, count);
++ samples = count * runtime->channels;
+ }
+
+- snd_pcm_format_set_silence(runtime->format, buf + offset, size);
++ snd_pcm_format_set_silence(runtime->format, buf + offset, samples);
+
+ return 0;
+ }
+diff --git a/sound/soc/codecs/adau1701.c b/sound/soc/codecs/adau1701.c
+index d71c59cf7bdd..370b742117ef 100644
+--- a/sound/soc/codecs/adau1701.c
++++ b/sound/soc/codecs/adau1701.c
+@@ -230,8 +230,10 @@ static int adau1701_reg_read(void *context, unsigned int reg,
+
+ *value = 0;
+
+- for (i = 0; i < size; i++)
+- *value |= recv_buf[i] << (i * 8);
++ for (i = 0; i < size; i++) {
++ *value <<= 8;
++ *value |= recv_buf[i];
++ }
+
+ return 0;
+ }
+diff --git a/sound/soc/codecs/max98090.c b/sound/soc/codecs/max98090.c
+index f5fccc7a8e89..d97f1ce7ff7d 100644
+--- a/sound/soc/codecs/max98090.c
++++ b/sound/soc/codecs/max98090.c
+@@ -2284,7 +2284,7 @@ static int max98090_probe(struct snd_soc_codec *codec)
+ /* Register for interrupts */
+ dev_dbg(codec->dev, "irq = %d\n", max98090->irq);
+
+- ret = request_threaded_irq(max98090->irq, NULL,
++ ret = devm_request_threaded_irq(codec->dev, max98090->irq, NULL,
+ max98090_interrupt, IRQF_TRIGGER_FALLING | IRQF_ONESHOT,
+ "max98090_interrupt", codec);
+ if (ret < 0) {
+diff --git a/sound/soc/codecs/rt5640.c b/sound/soc/codecs/rt5640.c
+index de80e89b5fd8..70679cf14c83 100644
+--- a/sound/soc/codecs/rt5640.c
++++ b/sound/soc/codecs/rt5640.c
+@@ -2059,6 +2059,7 @@ static struct snd_soc_codec_driver soc_codec_dev_rt5640 = {
+ static const struct regmap_config rt5640_regmap = {
+ .reg_bits = 8,
+ .val_bits = 16,
++ .use_single_rw = true,
+
+ .max_register = RT5640_VENDOR_ID2 + 1 + (ARRAY_SIZE(rt5640_ranges) *
+ RT5640_PR_SPACING),
+diff --git a/sound/soc/codecs/tlv320aic31xx.c b/sound/soc/codecs/tlv320aic31xx.c
+index 23419109ecac..1cdae8ccc61b 100644
+--- a/sound/soc/codecs/tlv320aic31xx.c
++++ b/sound/soc/codecs/tlv320aic31xx.c
+@@ -1178,7 +1178,7 @@ static void aic31xx_pdata_from_of(struct aic31xx_priv *aic31xx)
+ }
+ #endif /* CONFIG_OF */
+
+-static void aic31xx_device_init(struct aic31xx_priv *aic31xx)
++static int aic31xx_device_init(struct aic31xx_priv *aic31xx)
+ {
+ int ret, i;
+
+@@ -1197,7 +1197,7 @@ static void aic31xx_device_init(struct aic31xx_priv *aic31xx)
+ "aic31xx-reset-pin");
+ if (ret < 0) {
+ dev_err(aic31xx->dev, "not able to acquire gpio\n");
+- return;
++ return ret;
+ }
+ }
+
+@@ -1210,6 +1210,7 @@ static void aic31xx_device_init(struct aic31xx_priv *aic31xx)
+ if (ret != 0)
+ dev_err(aic31xx->dev, "Failed to request supplies: %d\n", ret);
+
++ return ret;
+ }
+
+ static int aic31xx_i2c_probe(struct i2c_client *i2c,
+@@ -1239,7 +1240,9 @@ static int aic31xx_i2c_probe(struct i2c_client *i2c,
+
+ aic31xx->pdata.codec_type = id->driver_data;
+
+- aic31xx_device_init(aic31xx);
++ ret = aic31xx_device_init(aic31xx);
++ if (ret)
++ return ret;
+
+ return snd_soc_register_codec(&i2c->dev, &soc_codec_driver_aic31xx,
+ aic31xx_dai_driver,
+diff --git a/sound/soc/codecs/wm8994.c b/sound/soc/codecs/wm8994.c
+index 247b39013fba..9719d3ca8e47 100644
+--- a/sound/soc/codecs/wm8994.c
++++ b/sound/soc/codecs/wm8994.c
+@@ -3505,6 +3505,7 @@ static irqreturn_t wm8994_mic_irq(int irq, void *data)
+ return IRQ_HANDLED;
+ }
+
++/* Should be called with accdet_lock held */
+ static void wm1811_micd_stop(struct snd_soc_codec *codec)
+ {
+ struct wm8994_priv *wm8994 = snd_soc_codec_get_drvdata(codec);
+@@ -3512,14 +3513,10 @@ static void wm1811_micd_stop(struct snd_soc_codec *codec)
+ if (!wm8994->jackdet)
+ return;
+
+- mutex_lock(&wm8994->accdet_lock);
+-
+ snd_soc_update_bits(codec, WM8958_MIC_DETECT_1, WM8958_MICD_ENA, 0);
+
+ wm1811_jackdet_set_mode(codec, WM1811_JACKDET_MODE_JACK);
+
+- mutex_unlock(&wm8994->accdet_lock);
+-
+ if (wm8994->wm8994->pdata.jd_ext_cap)
+ snd_soc_dapm_disable_pin(&codec->dapm,
+ "MICBIAS2");
+@@ -3560,10 +3557,10 @@ static void wm8958_open_circuit_work(struct work_struct *work)
+ open_circuit_work.work);
+ struct device *dev = wm8994->wm8994->dev;
+
+- wm1811_micd_stop(wm8994->hubs.codec);
+-
+ mutex_lock(&wm8994->accdet_lock);
+
++ wm1811_micd_stop(wm8994->hubs.codec);
++
+ dev_dbg(dev, "Reporting open circuit\n");
+
+ wm8994->jack_mic = false;
+diff --git a/sound/soc/codecs/wm_adsp.c b/sound/soc/codecs/wm_adsp.c
+index 060027182dcb..2537725dd53f 100644
+--- a/sound/soc/codecs/wm_adsp.c
++++ b/sound/soc/codecs/wm_adsp.c
+@@ -1758,3 +1758,5 @@ int wm_adsp2_init(struct wm_adsp *adsp, bool dvfs)
+ return 0;
+ }
+ EXPORT_SYMBOL_GPL(wm_adsp2_init);
++
++MODULE_LICENSE("GPL v2");
+diff --git a/sound/soc/intel/sst-baytrail-pcm.c b/sound/soc/intel/sst-baytrail-pcm.c
+index 8eab97368ea7..599401c0c655 100644
+--- a/sound/soc/intel/sst-baytrail-pcm.c
++++ b/sound/soc/intel/sst-baytrail-pcm.c
+@@ -32,7 +32,7 @@ static const struct snd_pcm_hardware sst_byt_pcm_hardware = {
+ SNDRV_PCM_INFO_PAUSE |
+ SNDRV_PCM_INFO_RESUME,
+ .formats = SNDRV_PCM_FMTBIT_S16_LE |
+- SNDRV_PCM_FORMAT_S24_LE,
++ SNDRV_PCM_FMTBIT_S24_LE,
+ .period_bytes_min = 384,
+ .period_bytes_max = 48000,
+ .periods_min = 2,
+diff --git a/sound/soc/intel/sst-haswell-pcm.c b/sound/soc/intel/sst-haswell-pcm.c
+index 058efb17c568..61bf6da4bb02 100644
+--- a/sound/soc/intel/sst-haswell-pcm.c
++++ b/sound/soc/intel/sst-haswell-pcm.c
+@@ -80,7 +80,7 @@ static const struct snd_pcm_hardware hsw_pcm_hardware = {
+ SNDRV_PCM_INFO_PAUSE |
+ SNDRV_PCM_INFO_RESUME |
+ SNDRV_PCM_INFO_NO_PERIOD_WAKEUP,
+- .formats = SNDRV_PCM_FMTBIT_S16_LE | SNDRV_PCM_FORMAT_S24_LE |
++ .formats = SNDRV_PCM_FMTBIT_S16_LE | SNDRV_PCM_FMTBIT_S24_LE |
+ SNDRV_PCM_FMTBIT_S32_LE,
+ .period_bytes_min = PAGE_SIZE,
+ .period_bytes_max = (HSW_PCM_PERIODS_MAX / HSW_PCM_PERIODS_MIN) * PAGE_SIZE,
+@@ -400,7 +400,15 @@ static int hsw_pcm_hw_params(struct snd_pcm_substream *substream,
+ sst_hsw_stream_set_valid(hsw, pcm_data->stream, 16);
+ break;
+ case SNDRV_PCM_FORMAT_S24_LE:
+- bits = SST_HSW_DEPTH_24BIT;
++ bits = SST_HSW_DEPTH_32BIT;
++ sst_hsw_stream_set_valid(hsw, pcm_data->stream, 24);
++ break;
++ case SNDRV_PCM_FORMAT_S8:
++ bits = SST_HSW_DEPTH_8BIT;
++ sst_hsw_stream_set_valid(hsw, pcm_data->stream, 8);
++ break;
++ case SNDRV_PCM_FORMAT_S32_LE:
++ bits = SST_HSW_DEPTH_32BIT;
+ sst_hsw_stream_set_valid(hsw, pcm_data->stream, 32);
+ break;
+ default:
+@@ -685,8 +693,9 @@ static int hsw_pcm_new(struct snd_soc_pcm_runtime *rtd)
+ }
+
+ #define HSW_FORMATS \
+- (SNDRV_PCM_FMTBIT_S20_3LE | SNDRV_PCM_FMTBIT_S16_LE |\
+- SNDRV_PCM_FMTBIT_S32_LE)
++ (SNDRV_PCM_FMTBIT_S32_LE | SNDRV_PCM_FMTBIT_S24_LE | \
++ SNDRV_PCM_FMTBIT_S20_3LE | SNDRV_PCM_FMTBIT_S16_LE |\
++ SNDRV_PCM_FMTBIT_S8)
+
+ static struct snd_soc_dai_driver hsw_dais[] = {
+ {
+@@ -696,7 +705,7 @@ static struct snd_soc_dai_driver hsw_dais[] = {
+ .channels_min = 2,
+ .channels_max = 2,
+ .rates = SNDRV_PCM_RATE_48000,
+- .formats = SNDRV_PCM_FMTBIT_S16_LE,
++ .formats = SNDRV_PCM_FMTBIT_S24_LE | SNDRV_PCM_FMTBIT_S16_LE,
+ },
+ },
+ {
+@@ -727,8 +736,8 @@ static struct snd_soc_dai_driver hsw_dais[] = {
+ .stream_name = "Loopback Capture",
+ .channels_min = 2,
+ .channels_max = 2,
+- .rates = SNDRV_PCM_RATE_8000_192000,
+- .formats = HSW_FORMATS,
++ .rates = SNDRV_PCM_RATE_48000,
++ .formats = SNDRV_PCM_FMTBIT_S24_LE | SNDRV_PCM_FMTBIT_S16_LE,
+ },
+ },
+ {
+@@ -737,8 +746,8 @@ static struct snd_soc_dai_driver hsw_dais[] = {
+ .stream_name = "Analog Capture",
+ .channels_min = 2,
+ .channels_max = 2,
+- .rates = SNDRV_PCM_RATE_8000_192000,
+- .formats = HSW_FORMATS,
++ .rates = SNDRV_PCM_RATE_48000,
++ .formats = SNDRV_PCM_FMTBIT_S24_LE | SNDRV_PCM_FMTBIT_S16_LE,
+ },
+ },
+ };
+diff --git a/sound/soc/omap/omap-twl4030.c b/sound/soc/omap/omap-twl4030.c
+index f8a6adc2d81c..4336d1831485 100644
+--- a/sound/soc/omap/omap-twl4030.c
++++ b/sound/soc/omap/omap-twl4030.c
+@@ -260,7 +260,7 @@ static struct snd_soc_dai_link omap_twl4030_dai_links[] = {
+ .stream_name = "TWL4030 Voice",
+ .cpu_dai_name = "omap-mcbsp.3",
+ .codec_dai_name = "twl4030-voice",
+- .platform_name = "omap-mcbsp.2",
++ .platform_name = "omap-mcbsp.3",
+ .codec_name = "twl4030-codec",
+ .dai_fmt = SND_SOC_DAIFMT_DSP_A | SND_SOC_DAIFMT_IB_NF |
+ SND_SOC_DAIFMT_CBM_CFM,
+diff --git a/sound/soc/pxa/pxa-ssp.c b/sound/soc/pxa/pxa-ssp.c
+index 199a8b377553..a8e097433074 100644
+--- a/sound/soc/pxa/pxa-ssp.c
++++ b/sound/soc/pxa/pxa-ssp.c
+@@ -723,7 +723,8 @@ static int pxa_ssp_probe(struct snd_soc_dai *dai)
+ ssp_handle = of_parse_phandle(dev->of_node, "port", 0);
+ if (!ssp_handle) {
+ dev_err(dev, "unable to get 'port' phandle\n");
+- return -ENODEV;
++ ret = -ENODEV;
++ goto err_priv;
+ }
+
+ priv->ssp = pxa_ssp_request_of(ssp_handle, "SoC audio");
+@@ -764,9 +765,7 @@ static int pxa_ssp_remove(struct snd_soc_dai *dai)
+ SNDRV_PCM_RATE_48000 | SNDRV_PCM_RATE_64000 | \
+ SNDRV_PCM_RATE_88200 | SNDRV_PCM_RATE_96000)
+
+-#define PXA_SSP_FORMATS (SNDRV_PCM_FMTBIT_S16_LE |\
+- SNDRV_PCM_FMTBIT_S24_LE | \
+- SNDRV_PCM_FMTBIT_S32_LE)
++#define PXA_SSP_FORMATS (SNDRV_PCM_FMTBIT_S16_LE | SNDRV_PCM_FMTBIT_S32_LE)
+
+ static const struct snd_soc_dai_ops pxa_ssp_dai_ops = {
+ .startup = pxa_ssp_startup,
+diff --git a/sound/soc/samsung/i2s.c b/sound/soc/samsung/i2s.c
+index 2ac76fa3e742..5f9b255a8b38 100644
+--- a/sound/soc/samsung/i2s.c
++++ b/sound/soc/samsung/i2s.c
+@@ -920,11 +920,9 @@ static int i2s_suspend(struct snd_soc_dai *dai)
+ {
+ struct i2s_dai *i2s = to_info(dai);
+
+- if (dai->active) {
+- i2s->suspend_i2smod = readl(i2s->addr + I2SMOD);
+- i2s->suspend_i2scon = readl(i2s->addr + I2SCON);
+- i2s->suspend_i2spsr = readl(i2s->addr + I2SPSR);
+- }
++ i2s->suspend_i2smod = readl(i2s->addr + I2SMOD);
++ i2s->suspend_i2scon = readl(i2s->addr + I2SCON);
++ i2s->suspend_i2spsr = readl(i2s->addr + I2SPSR);
+
+ return 0;
+ }
+@@ -933,11 +931,9 @@ static int i2s_resume(struct snd_soc_dai *dai)
+ {
+ struct i2s_dai *i2s = to_info(dai);
+
+- if (dai->active) {
+- writel(i2s->suspend_i2scon, i2s->addr + I2SCON);
+- writel(i2s->suspend_i2smod, i2s->addr + I2SMOD);
+- writel(i2s->suspend_i2spsr, i2s->addr + I2SPSR);
+- }
++ writel(i2s->suspend_i2scon, i2s->addr + I2SCON);
++ writel(i2s->suspend_i2smod, i2s->addr + I2SMOD);
++ writel(i2s->suspend_i2spsr, i2s->addr + I2SPSR);
+
+ return 0;
+ }
+diff --git a/sound/soc/soc-pcm.c b/sound/soc/soc-pcm.c
+index 54d18f22a33e..4ea656770d65 100644
+--- a/sound/soc/soc-pcm.c
++++ b/sound/soc/soc-pcm.c
+@@ -2069,6 +2069,7 @@ int soc_dpcm_runtime_update(struct snd_soc_card *card)
+ dpcm_be_disconnect(fe, SNDRV_PCM_STREAM_PLAYBACK);
+ }
+
++ dpcm_path_put(&list);
+ capture:
+ /* skip if FE doesn't have capture capability */
+ if (!fe->cpu_dai->driver->capture.channels_min)
+diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
+index e66e710cc595..0a8a9db43d34 100644
+--- a/tools/testing/selftests/Makefile
++++ b/tools/testing/selftests/Makefile
+@@ -4,6 +4,7 @@ TARGETS += efivarfs
+ TARGETS += kcmp
+ TARGETS += memory-hotplug
+ TARGETS += mqueue
++TARGETS += mount
+ TARGETS += net
+ TARGETS += ptrace
+ TARGETS += timers
+diff --git a/tools/testing/selftests/mount/Makefile b/tools/testing/selftests/mount/Makefile
+new file mode 100644
+index 000000000000..337d853c2b72
+--- /dev/null
++++ b/tools/testing/selftests/mount/Makefile
+@@ -0,0 +1,17 @@
++# Makefile for mount selftests.
++
++all: unprivileged-remount-test
++
++unprivileged-remount-test: unprivileged-remount-test.c
++ gcc -Wall -O2 unprivileged-remount-test.c -o unprivileged-remount-test
++
++# Allow specific tests to be selected.
++test_unprivileged_remount: unprivileged-remount-test
++ @if [ -f /proc/self/uid_map ] ; then ./unprivileged-remount-test ; fi
++
++run_tests: all test_unprivileged_remount
++
++clean:
++ rm -f unprivileged-remount-test
++
++.PHONY: all test_unprivileged_remount
+diff --git a/tools/testing/selftests/mount/unprivileged-remount-test.c b/tools/testing/selftests/mount/unprivileged-remount-test.c
+new file mode 100644
+index 000000000000..1b3ff2fda4d0
+--- /dev/null
++++ b/tools/testing/selftests/mount/unprivileged-remount-test.c
+@@ -0,0 +1,242 @@
++#define _GNU_SOURCE
++#include <sched.h>
++#include <stdio.h>
++#include <errno.h>
++#include <string.h>
++#include <sys/types.h>
++#include <sys/mount.h>
++#include <sys/wait.h>
++#include <stdlib.h>
++#include <unistd.h>
++#include <fcntl.h>
++#include <grp.h>
++#include <stdbool.h>
++#include <stdarg.h>
++
++#ifndef CLONE_NEWNS
++# define CLONE_NEWNS 0x00020000
++#endif
++#ifndef CLONE_NEWUTS
++# define CLONE_NEWUTS 0x04000000
++#endif
++#ifndef CLONE_NEWIPC
++# define CLONE_NEWIPC 0x08000000
++#endif
++#ifndef CLONE_NEWNET
++# define CLONE_NEWNET 0x40000000
++#endif
++#ifndef CLONE_NEWUSER
++# define CLONE_NEWUSER 0x10000000
++#endif
++#ifndef CLONE_NEWPID
++# define CLONE_NEWPID 0x20000000
++#endif
++
++#ifndef MS_RELATIME
++#define MS_RELATIME (1 << 21)
++#endif
++#ifndef MS_STRICTATIME
++#define MS_STRICTATIME (1 << 24)
++#endif
++
++static void die(char *fmt, ...)
++{
++ va_list ap;
++ va_start(ap, fmt);
++ vfprintf(stderr, fmt, ap);
++ va_end(ap);
++ exit(EXIT_FAILURE);
++}
++
++static void write_file(char *filename, char *fmt, ...)
++{
++ char buf[4096];
++ int fd;
++ ssize_t written;
++ int buf_len;
++ va_list ap;
++
++ va_start(ap, fmt);
++ buf_len = vsnprintf(buf, sizeof(buf), fmt, ap);
++ va_end(ap);
++ if (buf_len < 0) {
++ die("vsnprintf failed: %s\n",
++ strerror(errno));
++ }
++ if (buf_len >= sizeof(buf)) {
++ die("vsnprintf output truncated\n");
++ }
++
++ fd = open(filename, O_WRONLY);
++ if (fd < 0) {
++ die("open of %s failed: %s\n",
++ filename, strerror(errno));
++ }
++ written = write(fd, buf, buf_len);
++ if (written != buf_len) {
++ if (written >= 0) {
++ die("short write to %s\n", filename);
++ } else {
++ die("write to %s failed: %s\n",
++ filename, strerror(errno));
++ }
++ }
++ if (close(fd) != 0) {
++ die("close of %s failed: %s\n",
++ filename, strerror(errno));
++ }
++}
++
++static void create_and_enter_userns(void)
++{
++ uid_t uid;
++ gid_t gid;
++
++ uid = getuid();
++ gid = getgid();
++
++ if (unshare(CLONE_NEWUSER) !=0) {
++ die("unshare(CLONE_NEWUSER) failed: %s\n",
++ strerror(errno));
++ }
++
++ write_file("/proc/self/uid_map", "0 %d 1", uid);
++ write_file("/proc/self/gid_map", "0 %d 1", gid);
++
++ if (setgroups(0, NULL) != 0) {
++ die("setgroups failed: %s\n",
++ strerror(errno));
++ }
++ if (setgid(0) != 0) {
++ die ("setgid(0) failed %s\n",
++ strerror(errno));
++ }
++ if (setuid(0) != 0) {
++ die("setuid(0) failed %s\n",
++ strerror(errno));
++ }
++}
++
++static
++bool test_unpriv_remount(int mount_flags, int remount_flags, int invalid_flags)
++{
++ pid_t child;
++
++ child = fork();
++ if (child == -1) {
++ die("fork failed: %s\n",
++ strerror(errno));
++ }
++ if (child != 0) { /* parent */
++ pid_t pid;
++ int status;
++ pid = waitpid(child, &status, 0);
++ if (pid == -1) {
++ die("waitpid failed: %s\n",
++ strerror(errno));
++ }
++ if (pid != child) {
++ die("waited for %d got %d\n",
++ child, pid);
++ }
++ if (!WIFEXITED(status)) {
++ die("child did not terminate cleanly\n");
++ }
++ return WEXITSTATUS(status) == EXIT_SUCCESS ? true : false;
++ }
++
++ create_and_enter_userns();
++ if (unshare(CLONE_NEWNS) != 0) {
++ die("unshare(CLONE_NEWNS) failed: %s\n",
++ strerror(errno));
++ }
++
++ if (mount("testing", "/tmp", "ramfs", mount_flags, NULL) != 0) {
++ die("mount of /tmp failed: %s\n",
++ strerror(errno));
++ }
++
++ create_and_enter_userns();
++
++ if (unshare(CLONE_NEWNS) != 0) {
++ die("unshare(CLONE_NEWNS) failed: %s\n",
++ strerror(errno));
++ }
++
++ if (mount("/tmp", "/tmp", "none",
++ MS_REMOUNT | MS_BIND | remount_flags, NULL) != 0) {
++ /* system("cat /proc/self/mounts"); */
++ die("remount of /tmp failed: %s\n",
++ strerror(errno));
++ }
++
++ if (mount("/tmp", "/tmp", "none",
++ MS_REMOUNT | MS_BIND | invalid_flags, NULL) == 0) {
++ /* system("cat /proc/self/mounts"); */
++ die("remount of /tmp with invalid flags "
++ "succeeded unexpectedly\n");
++ }
++ exit(EXIT_SUCCESS);
++}
++
++static bool test_unpriv_remount_simple(int mount_flags)
++{
++ return test_unpriv_remount(mount_flags, mount_flags, 0);
++}
++
++static bool test_unpriv_remount_atime(int mount_flags, int invalid_flags)
++{
++ return test_unpriv_remount(mount_flags, mount_flags, invalid_flags);
++}
++
++int main(int argc, char **argv)
++{
++ if (!test_unpriv_remount_simple(MS_RDONLY|MS_NODEV)) {
++ die("MS_RDONLY malfunctions\n");
++ }
++ if (!test_unpriv_remount_simple(MS_NODEV)) {
++ die("MS_NODEV malfunctions\n");
++ }
++ if (!test_unpriv_remount_simple(MS_NOSUID|MS_NODEV)) {
++ die("MS_NOSUID malfunctions\n");
++ }
++ if (!test_unpriv_remount_simple(MS_NOEXEC|MS_NODEV)) {
++ die("MS_NOEXEC malfunctions\n");
++ }
++ if (!test_unpriv_remount_atime(MS_RELATIME|MS_NODEV,
++ MS_NOATIME|MS_NODEV))
++ {
++ die("MS_RELATIME malfunctions\n");
++ }
++ if (!test_unpriv_remount_atime(MS_STRICTATIME|MS_NODEV,
++ MS_NOATIME|MS_NODEV))
++ {
++ die("MS_STRICTATIME malfunctions\n");
++ }
++ if (!test_unpriv_remount_atime(MS_NOATIME|MS_NODEV,
++ MS_STRICTATIME|MS_NODEV))
++ {
++ die("MS_RELATIME malfunctions\n");
++ }
++ if (!test_unpriv_remount_atime(MS_RELATIME|MS_NODIRATIME|MS_NODEV,
++ MS_NOATIME|MS_NODEV))
++ {
++ die("MS_RELATIME malfunctions\n");
++ }
++ if (!test_unpriv_remount_atime(MS_STRICTATIME|MS_NODIRATIME|MS_NODEV,
++ MS_NOATIME|MS_NODEV))
++ {
++ die("MS_RELATIME malfunctions\n");
++ }
++ if (!test_unpriv_remount_atime(MS_NOATIME|MS_NODIRATIME|MS_NODEV,
++ MS_STRICTATIME|MS_NODEV))
++ {
++ die("MS_RELATIME malfunctions\n");
++ }
++ if (!test_unpriv_remount(MS_STRICTATIME|MS_NODEV, MS_NODEV,
++ MS_NOATIME|MS_NODEV))
++ {
++ die("Default atime malfunctions\n");
++ }
++ return EXIT_SUCCESS;
++}
* [gentoo-commits] proj/linux-patches:3.16 commit in: /
@ 2014-09-22 23:37 Mike Pagano
From: Mike Pagano @ 2014-09-22 23:37 UTC
To: gentoo-commits
commit: 935e025ffecfe6c163188f4f9725352501bf0a6e
Author: Mike Pagano <mpagano <AT> gentoo <DOT> org>
AuthorDate: Mon Sep 22 23:37:15 2014 +0000
Commit: Mike Pagano <mpagano <AT> gentoo <DOT> org>
CommitDate: Mon Sep 22 23:37:15 2014 +0000
URL: http://sources.gentoo.org/gitweb/?p=proj/linux-patches.git;a=commit;h=935e025f
Fix the UDEV auto-selection to add FHANDLE, and remove it from the systemd section. Thanks to Steven Presser. See bug #523126.
---
4567_distro-Gentoo-Kconfig.patch | 19 ++++++++++---------
1 file changed, 10 insertions(+), 9 deletions(-)
diff --git a/4567_distro-Gentoo-Kconfig.patch b/4567_distro-Gentoo-Kconfig.patch
index 652e2a7..71dbf09 100644
--- a/4567_distro-Gentoo-Kconfig.patch
+++ b/4567_distro-Gentoo-Kconfig.patch
@@ -1,15 +1,15 @@
---- a/Kconfig 2014-04-02 09:45:05.389224541 -0400
-+++ b/Kconfig 2014-04-02 09:45:39.269224273 -0400
+--- a/Kconfig 2014-04-02 09:45:05.389224541 -0400
++++ b/Kconfig 2014-04-02 09:45:39.269224273 -0400
@@ -8,4 +8,6 @@ config SRCARCH
- string
- option env="SRCARCH"
-
+ string
+ option env="SRCARCH"
+
+source "distro/Kconfig"
+
source "arch/$SRCARCH/Kconfig"
---- 1969-12-31 19:00:00.000000000 -0500
-+++ b/distro/Kconfig 2014-04-02 09:57:03.539218861 -0400
-@@ -0,0 +1,108 @@
+--- /dev/null 2014-09-22 14:19:24.316977284 -0400
++++ distro/Kconfig 2014-09-22 19:30:35.670959281 -0400
+@@ -0,0 +1,109 @@
+menu "Gentoo Linux"
+
+config GENTOO_LINUX
@@ -34,6 +34,8 @@
+ select DEVTMPFS
+ select TMPFS
+
++ select FHANDLE
++
+ select MMU
+ select SHMEM
+
@@ -89,7 +91,6 @@
+ select CGROUPS
+ select EPOLL
+ select FANOTIFY
-+ select FHANDLE
+ select INOTIFY_USER
+ select NET
+ select NET_NS
* [gentoo-commits] proj/linux-patches:3.16 commit in: /
@ 2014-09-26 19:40 Mike Pagano
From: Mike Pagano @ 2014-09-26 19:40 UTC
To: gentoo-commits
commit: d9d386b72f6c05e68b48912cc93da59331852155
Author: Mike Pagano <mpagano <AT> gentoo <DOT> org>
AuthorDate: Fri Sep 26 19:40:17 2014 +0000
Commit: Mike Pagano <mpagano <AT> gentoo <DOT> org>
CommitDate: Fri Sep 26 19:40:17 2014 +0000
URL: http://sources.gentoo.org/gitweb/?p=proj/linux-patches.git;a=commit;h=d9d386b7
Add multipath-tcp patch. Fix distro config.
---
0000_README | 4 +
2500_multipath-tcp-v3.16-872d7f6c6f4e.patch | 19230 ++++++++++++++++++++++++++
4567_distro-Gentoo-Kconfig.patch | 19 +-
3 files changed, 19243 insertions(+), 10 deletions(-)
diff --git a/0000_README b/0000_README
index 706e53e..d92e6b7 100644
--- a/0000_README
+++ b/0000_README
@@ -58,6 +58,10 @@ Patch: 2400_kcopy-patch-for-infiniband-driver.patch
From: Alexey Shvetsov <alexxy@gentoo.org>
Desc: Zero copy for infiniband psm userspace driver
+Patch: 2500_multipath-tcp-v3.16-872d7f6c6f4e.patch
+From: http://multipath-tcp.org/
+Desc: Patch for simultaneous use of several IP addresses/interfaces in TCP, for better resource utilization, better throughput and smoother reaction to failures.
+
Patch: 2700_ThinkPad-30-brightness-control-fix.patch
From: Seth Forshee <seth.forshee@canonical.com>
Desc: ACPI: Disable Windows 8 compatibility for some Lenovo ThinkPads
diff --git a/2500_multipath-tcp-v3.16-872d7f6c6f4e.patch b/2500_multipath-tcp-v3.16-872d7f6c6f4e.patch
new file mode 100644
index 0000000..3000da3
--- /dev/null
+++ b/2500_multipath-tcp-v3.16-872d7f6c6f4e.patch
@@ -0,0 +1,19230 @@
+diff --git a/drivers/infiniband/hw/cxgb4/cm.c b/drivers/infiniband/hw/cxgb4/cm.c
+index 768a0fb67dd6..5a46d91a8df9 100644
+--- a/drivers/infiniband/hw/cxgb4/cm.c
++++ b/drivers/infiniband/hw/cxgb4/cm.c
+@@ -3432,7 +3432,7 @@ static void build_cpl_pass_accept_req(struct sk_buff *skb, int stid , u8 tos)
+ */
+ memset(&tmp_opt, 0, sizeof(tmp_opt));
+ tcp_clear_options(&tmp_opt);
+- tcp_parse_options(skb, &tmp_opt, 0, NULL);
++ tcp_parse_options(skb, &tmp_opt, NULL, 0, NULL);
+
+ req = (struct cpl_pass_accept_req *)__skb_push(skb, sizeof(*req));
+ memset(req, 0, sizeof(*req));
+diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h
+index 2faef339d8f2..d86c853ffaad 100644
+--- a/include/linux/ipv6.h
++++ b/include/linux/ipv6.h
+@@ -256,16 +256,6 @@ static inline struct ipv6_pinfo * inet6_sk(const struct sock *__sk)
+ return inet_sk(__sk)->pinet6;
+ }
+
+-static inline struct request_sock *inet6_reqsk_alloc(struct request_sock_ops *ops)
+-{
+- struct request_sock *req = reqsk_alloc(ops);
+-
+- if (req)
+- inet_rsk(req)->pktopts = NULL;
+-
+- return req;
+-}
+-
+ static inline struct raw6_sock *raw6_sk(const struct sock *sk)
+ {
+ return (struct raw6_sock *)sk;
+@@ -309,12 +299,6 @@ static inline struct ipv6_pinfo * inet6_sk(const struct sock *__sk)
+ return NULL;
+ }
+
+-static inline struct inet6_request_sock *
+- inet6_rsk(const struct request_sock *rsk)
+-{
+- return NULL;
+-}
+-
+ static inline struct raw6_sock *raw6_sk(const struct sock *sk)
+ {
+ return NULL;
+diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
+index ec89301ada41..99ea4b0e3693 100644
+--- a/include/linux/skbuff.h
++++ b/include/linux/skbuff.h
+@@ -2784,8 +2784,10 @@ static inline bool __skb_checksum_validate_needed(struct sk_buff *skb,
+ bool zero_okay,
+ __sum16 check)
+ {
+- if (skb_csum_unnecessary(skb) || (zero_okay && !check)) {
+- skb->csum_valid = 1;
++ if (skb_csum_unnecessary(skb)) {
++ return false;
++ } else if (zero_okay && !check) {
++ skb->ip_summed = CHECKSUM_UNNECESSARY;
+ return false;
+ }
+
+diff --git a/include/linux/tcp.h b/include/linux/tcp.h
+index a0513210798f..7bc2e078d6ca 100644
+--- a/include/linux/tcp.h
++++ b/include/linux/tcp.h
+@@ -53,7 +53,7 @@ static inline unsigned int tcp_optlen(const struct sk_buff *skb)
+ /* TCP Fast Open */
+ #define TCP_FASTOPEN_COOKIE_MIN 4 /* Min Fast Open Cookie size in bytes */
+ #define TCP_FASTOPEN_COOKIE_MAX 16 /* Max Fast Open Cookie size in bytes */
+-#define TCP_FASTOPEN_COOKIE_SIZE 8 /* the size employed by this impl. */
++#define TCP_FASTOPEN_COOKIE_SIZE 4 /* the size employed by this impl. */
+
+ /* TCP Fast Open Cookie as stored in memory */
+ struct tcp_fastopen_cookie {
+@@ -72,6 +72,51 @@ struct tcp_sack_block {
+ u32 end_seq;
+ };
+
++struct tcp_out_options {
++ u16 options; /* bit field of OPTION_* */
++ u8 ws; /* window scale, 0 to disable */
++ u8 num_sack_blocks;/* number of SACK blocks to include */
++ u8 hash_size; /* bytes in hash_location */
++ u16 mss; /* 0 to disable */
++ __u8 *hash_location; /* temporary pointer, overloaded */
++ __u32 tsval, tsecr; /* need to include OPTION_TS */
++ struct tcp_fastopen_cookie *fastopen_cookie; /* Fast open cookie */
++#ifdef CONFIG_MPTCP
++ u16 mptcp_options; /* bit field of MPTCP related OPTION_* */
++ u8 dss_csum:1,
++ add_addr_v4:1,
++ add_addr_v6:1; /* dss-checksum required? */
++
++ union {
++ struct {
++ __u64 sender_key; /* sender's key for mptcp */
++ __u64 receiver_key; /* receiver's key for mptcp */
++ } mp_capable;
++
++ struct {
++ __u64 sender_truncated_mac;
++ __u32 sender_nonce;
++ /* random number of the sender */
++ __u32 token; /* token for mptcp */
++ u8 low_prio:1;
++ } mp_join_syns;
++ };
++
++ struct {
++ struct in_addr addr;
++ u8 addr_id;
++ } add_addr4;
++
++ struct {
++ struct in6_addr addr;
++ u8 addr_id;
++ } add_addr6;
++
++ u16 remove_addrs; /* list of address id */
++ u8 addr_id; /* address id (mp_join or add_address) */
++#endif /* CONFIG_MPTCP */
++};
++
+ /*These are used to set the sack_ok field in struct tcp_options_received */
+ #define TCP_SACK_SEEN (1 << 0) /*1 = peer is SACK capable, */
+ #define TCP_FACK_ENABLED (1 << 1) /*1 = FACK is enabled locally*/
+@@ -95,6 +140,9 @@ struct tcp_options_received {
+ u16 mss_clamp; /* Maximal mss, negotiated at connection setup */
+ };
+
++struct mptcp_cb;
++struct mptcp_tcp_sock;
++
+ static inline void tcp_clear_options(struct tcp_options_received *rx_opt)
+ {
+ rx_opt->tstamp_ok = rx_opt->sack_ok = 0;
+@@ -111,10 +159,7 @@ struct tcp_request_sock_ops;
+
+ struct tcp_request_sock {
+ struct inet_request_sock req;
+-#ifdef CONFIG_TCP_MD5SIG
+- /* Only used by TCP MD5 Signature so far. */
+ const struct tcp_request_sock_ops *af_specific;
+-#endif
+ struct sock *listener; /* needed for TFO */
+ u32 rcv_isn;
+ u32 snt_isn;
+@@ -130,6 +175,8 @@ static inline struct tcp_request_sock *tcp_rsk(const struct request_sock *req)
+ return (struct tcp_request_sock *)req;
+ }
+
++struct tcp_md5sig_key;
++
+ struct tcp_sock {
+ /* inet_connection_sock has to be the first member of tcp_sock */
+ struct inet_connection_sock inet_conn;
+@@ -326,6 +373,37 @@ struct tcp_sock {
+ * socket. Used to retransmit SYNACKs etc.
+ */
+ struct request_sock *fastopen_rsk;
++
++ /* MPTCP/TCP-specific callbacks */
++ const struct tcp_sock_ops *ops;
++
++ struct mptcp_cb *mpcb;
++ struct sock *meta_sk;
++ /* We keep these flags even if CONFIG_MPTCP is not checked, because
++ * it allows checking MPTCP capability just by checking the mpc flag,
++ * rather than adding ifdefs everywhere.
++ */
++ u16 mpc:1, /* Other end is multipath capable */
++ inside_tk_table:1, /* Is the tcp_sock inside the token-table? */
++ send_mp_fclose:1,
++ request_mptcp:1, /* Did we send out an MP_CAPABLE?
++ * (this speeds up mptcp_doit() in tcp_recvmsg)
++ */
++ mptcp_enabled:1, /* Is MPTCP enabled from the application ? */
++ pf:1, /* Potentially Failed state: when this flag is set, we
++ * stop using the subflow
++ */
++ mp_killed:1, /* Killed with a tcp_done in mptcp? */
++ was_meta_sk:1, /* This was a meta sk (in case of reuse) */
++ is_master_sk,
++ close_it:1, /* Must close socket in mptcp_data_ready? */
++ closing:1;
++ struct mptcp_tcp_sock *mptcp;
++#ifdef CONFIG_MPTCP
++ struct hlist_nulls_node tk_table;
++ u32 mptcp_loc_token;
++ u64 mptcp_loc_key;
++#endif /* CONFIG_MPTCP */
+ };
+
+ enum tsq_flags {
+@@ -337,6 +415,8 @@ enum tsq_flags {
+ TCP_MTU_REDUCED_DEFERRED, /* tcp_v{4|6}_err() could not call
+ * tcp_v{4|6}_mtu_reduced()
+ */
++ MPTCP_PATH_MANAGER, /* MPTCP deferred creation of new subflows */
++ MPTCP_SUB_DEFERRED, /* A subflow got deferred - process them */
+ };
+
+ static inline struct tcp_sock *tcp_sk(const struct sock *sk)
+@@ -355,6 +435,7 @@ struct tcp_timewait_sock {
+ #ifdef CONFIG_TCP_MD5SIG
+ struct tcp_md5sig_key *tw_md5_key;
+ #endif
++ struct mptcp_tw *mptcp_tw;
+ };
+
+ static inline struct tcp_timewait_sock *tcp_twsk(const struct sock *sk)
+diff --git a/include/net/inet6_connection_sock.h b/include/net/inet6_connection_sock.h
+index 74af137304be..83f63033897a 100644
+--- a/include/net/inet6_connection_sock.h
++++ b/include/net/inet6_connection_sock.h
+@@ -27,6 +27,8 @@ int inet6_csk_bind_conflict(const struct sock *sk,
+
+ struct dst_entry *inet6_csk_route_req(struct sock *sk, struct flowi6 *fl6,
+ const struct request_sock *req);
++u32 inet6_synq_hash(const struct in6_addr *raddr, const __be16 rport,
++ const u32 rnd, const u32 synq_hsize);
+
+ struct request_sock *inet6_csk_search_req(const struct sock *sk,
+ struct request_sock ***prevp,
+diff --git a/include/net/inet_common.h b/include/net/inet_common.h
+index fe7994c48b75..780f229f46a8 100644
+--- a/include/net/inet_common.h
++++ b/include/net/inet_common.h
+@@ -1,6 +1,8 @@
+ #ifndef _INET_COMMON_H
+ #define _INET_COMMON_H
+
++#include <net/sock.h>
++
+ extern const struct proto_ops inet_stream_ops;
+ extern const struct proto_ops inet_dgram_ops;
+
+@@ -13,6 +15,8 @@ struct sock;
+ struct sockaddr;
+ struct socket;
+
++int inet_create(struct net *net, struct socket *sock, int protocol, int kern);
++int inet6_create(struct net *net, struct socket *sock, int protocol, int kern);
+ int inet_release(struct socket *sock);
+ int inet_stream_connect(struct socket *sock, struct sockaddr *uaddr,
+ int addr_len, int flags);
+diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
+index 7a4313887568..f62159e39839 100644
+--- a/include/net/inet_connection_sock.h
++++ b/include/net/inet_connection_sock.h
+@@ -30,6 +30,7 @@
+
+ struct inet_bind_bucket;
+ struct tcp_congestion_ops;
++struct tcp_options_received;
+
+ /*
+ * Pointers to address related TCP functions
+@@ -243,6 +244,9 @@ static inline void inet_csk_reset_xmit_timer(struct sock *sk, const int what,
+
+ struct sock *inet_csk_accept(struct sock *sk, int flags, int *err);
+
++u32 inet_synq_hash(const __be32 raddr, const __be16 rport, const u32 rnd,
++ const u32 synq_hsize);
++
+ struct request_sock *inet_csk_search_req(const struct sock *sk,
+ struct request_sock ***prevp,
+ const __be16 rport,
+diff --git a/include/net/inet_sock.h b/include/net/inet_sock.h
+index b1edf17bec01..6a32d8d6b85e 100644
+--- a/include/net/inet_sock.h
++++ b/include/net/inet_sock.h
+@@ -86,10 +86,14 @@ struct inet_request_sock {
+ wscale_ok : 1,
+ ecn_ok : 1,
+ acked : 1,
+- no_srccheck: 1;
++ no_srccheck: 1,
++ mptcp_rqsk : 1,
++ saw_mpc : 1;
+ kmemcheck_bitfield_end(flags);
+- struct ip_options_rcu *opt;
+- struct sk_buff *pktopts;
++ union {
++ struct ip_options_rcu *opt;
++ struct sk_buff *pktopts;
++ };
+ u32 ir_mark;
+ };
+
+diff --git a/include/net/mptcp.h b/include/net/mptcp.h
+new file mode 100644
+index 000000000000..712780fc39e4
+--- /dev/null
++++ b/include/net/mptcp.h
+@@ -0,0 +1,1439 @@
++/*
++ * MPTCP implementation
++ *
++ * Initial Design & Implementation:
++ * Sébastien Barré <sebastien.barre@uclouvain.be>
++ *
++ * Current Maintainer & Author:
++ * Christoph Paasch <christoph.paasch@uclouvain.be>
++ *
++ * Additional authors:
++ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
++ * Gregory Detal <gregory.detal@uclouvain.be>
++ * Fabien Duchêne <fabien.duchene@uclouvain.be>
++ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
++ * Lavkesh Lahngir <lavkesh51@gmail.com>
++ * Andreas Ripke <ripke@neclab.eu>
++ * Vlad Dogaru <vlad.dogaru@intel.com>
++ * Octavian Purdila <octavian.purdila@intel.com>
++ * John Ronan <jronan@tssg.org>
++ * Catalin Nicutar <catalin.nicutar@gmail.com>
++ * Brandon Heller <brandonh@stanford.edu>
++ *
++ *
++ * This program is free software; you can redistribute it and/or
++ * modify it under the terms of the GNU General Public License
++ * as published by the Free Software Foundation; either version
++ * 2 of the License, or (at your option) any later version.
++ */
++
++#ifndef _MPTCP_H
++#define _MPTCP_H
++
++#include <linux/inetdevice.h>
++#include <linux/ipv6.h>
++#include <linux/list.h>
++#include <linux/net.h>
++#include <linux/netpoll.h>
++#include <linux/skbuff.h>
++#include <linux/socket.h>
++#include <linux/tcp.h>
++#include <linux/kernel.h>
++
++#include <asm/byteorder.h>
++#include <asm/unaligned.h>
++#include <crypto/hash.h>
++#include <net/tcp.h>
++
++#if defined(__LITTLE_ENDIAN_BITFIELD)
++ #define ntohll(x) be64_to_cpu(x)
++ #define htonll(x) cpu_to_be64(x)
++#elif defined(__BIG_ENDIAN_BITFIELD)
++ #define ntohll(x) (x)
++ #define htonll(x) (x)
++#endif
++
++struct mptcp_loc4 {
++ u8 loc4_id;
++ u8 low_prio:1;
++ struct in_addr addr;
++};
++
++struct mptcp_rem4 {
++ u8 rem4_id;
++ __be16 port;
++ struct in_addr addr;
++};
++
++struct mptcp_loc6 {
++ u8 loc6_id;
++ u8 low_prio:1;
++ struct in6_addr addr;
++};
++
++struct mptcp_rem6 {
++ u8 rem6_id;
++ __be16 port;
++ struct in6_addr addr;
++};
++
++struct mptcp_request_sock {
++ struct tcp_request_sock req;
++ /* hlist-nulls entry to the hash-table. Depending on whether this is a
++ * a new MPTCP connection or an additional subflow, the request-socket
++ * is either in the mptcp_reqsk_tk_htb or mptcp_reqsk_htb.
++ */
++ struct hlist_nulls_node hash_entry;
++
++ union {
++ struct {
++ /* Only on initial subflows */
++ u64 mptcp_loc_key;
++ u64 mptcp_rem_key;
++ u32 mptcp_loc_token;
++ };
++
++ struct {
++ /* Only on additional subflows */
++ struct mptcp_cb *mptcp_mpcb;
++ u32 mptcp_rem_nonce;
++ u32 mptcp_loc_nonce;
++ u64 mptcp_hash_tmac;
++ };
++ };
++
++ u8 loc_id;
++ u8 rem_id; /* Address-id in the MP_JOIN */
++ u8 dss_csum:1,
++ is_sub:1, /* Is this a new subflow? */
++ low_prio:1, /* Interface set to low-prio? */
++ rcv_low_prio:1;
++};
++
++struct mptcp_options_received {
++ u16 saw_mpc:1,
++ dss_csum:1,
++ drop_me:1,
++
++ is_mp_join:1,
++ join_ack:1,
++
++ saw_low_prio:2, /* 0x1 - low-prio set for this subflow
++ * 0x2 - low-prio set for another subflow
++ */
++ low_prio:1,
++
++ saw_add_addr:2, /* Saw at least one add_addr option:
++ * 0x1: IPv4 - 0x2: IPv6
++ */
++ more_add_addr:1, /* Saw one more add-addr. */
++
++ saw_rem_addr:1, /* Saw at least one rem_addr option */
++ more_rem_addr:1, /* Saw one more rem-addr. */
++
++ mp_fail:1,
++ mp_fclose:1;
++ u8 rem_id; /* Address-id in the MP_JOIN */
++ u8 prio_addr_id; /* Address-id in the MP_PRIO */
++
++ const unsigned char *add_addr_ptr; /* Pointer to add-address option */
++ const unsigned char *rem_addr_ptr; /* Pointer to rem-address option */
++
++ u32 data_ack;
++ u32 data_seq;
++ u16 data_len;
++
++ u32 mptcp_rem_token;/* Remote token */
++
++ /* Key inside the option (from mp_capable or fast_close) */
++ u64 mptcp_key;
++
++ u32 mptcp_recv_nonce;
++ u64 mptcp_recv_tmac;
++ u8 mptcp_recv_mac[20];
++};
++
++struct mptcp_tcp_sock {
++ struct tcp_sock *next; /* Next subflow socket */
++ struct hlist_node cb_list;
++ struct mptcp_options_received rx_opt;
++
++ /* Those three fields record the current mapping */
++ u64 map_data_seq;
++ u32 map_subseq;
++ u16 map_data_len;
++ u16 slave_sk:1,
++ fully_established:1,
++ establish_increased:1,
++ second_packet:1,
++ attached:1,
++ send_mp_fail:1,
++ include_mpc:1,
++ mapping_present:1,
++ map_data_fin:1,
++ low_prio:1, /* use this socket as backup */
++ rcv_low_prio:1, /* Peer sent low-prio option to us */
++ send_mp_prio:1, /* Trigger to send mp_prio on this socket */
++ pre_established:1; /* State between sending 3rd ACK and
++ * receiving the fourth ack of new subflows.
++ */
++
++ /* isn: needed to translate abs to relative subflow seqnums */
++ u32 snt_isn;
++ u32 rcv_isn;
++ u8 path_index;
++ u8 loc_id;
++ u8 rem_id;
++
++#define MPTCP_SCHED_SIZE 4
++ u8 mptcp_sched[MPTCP_SCHED_SIZE] __aligned(8);
++
++ struct sk_buff *shortcut_ofoqueue; /* Shortcut to the current modified
++ * skb in the ofo-queue.
++ */
++
++ int init_rcv_wnd;
++ u32 infinite_cutoff_seq;
++ struct delayed_work work;
++ u32 mptcp_loc_nonce;
++ struct tcp_sock *tp; /* Where is my daddy? */
++ u32 last_end_data_seq;
++
++ /* MP_JOIN subflow: timer for retransmitting the 3rd ack */
++ struct timer_list mptcp_ack_timer;
++
++ /* HMAC of the third ack */
++ char sender_mac[20];
++};
++
++struct mptcp_tw {
++ struct list_head list;
++ u64 loc_key;
++ u64 rcv_nxt;
++ struct mptcp_cb __rcu *mpcb;
++ u8 meta_tw:1,
++ in_list:1;
++};
++
++#define MPTCP_PM_NAME_MAX 16
++struct mptcp_pm_ops {
++ struct list_head list;
++
++ /* Signal the creation of a new MPTCP-session. */
++ void (*new_session)(const struct sock *meta_sk);
++ void (*release_sock)(struct sock *meta_sk);
++ void (*fully_established)(struct sock *meta_sk);
++ void (*new_remote_address)(struct sock *meta_sk);
++ int (*get_local_id)(sa_family_t family, union inet_addr *addr,
++ struct net *net, bool *low_prio);
++ void (*addr_signal)(struct sock *sk, unsigned *size,
++ struct tcp_out_options *opts, struct sk_buff *skb);
++ void (*add_raddr)(struct mptcp_cb *mpcb, const union inet_addr *addr,
++ sa_family_t family, __be16 port, u8 id);
++ void (*rem_raddr)(struct mptcp_cb *mpcb, u8 rem_id);
++ void (*init_subsocket_v4)(struct sock *sk, struct in_addr addr);
++ void (*init_subsocket_v6)(struct sock *sk, struct in6_addr addr);
++
++ char name[MPTCP_PM_NAME_MAX];
++ struct module *owner;
++};
++
++#define MPTCP_SCHED_NAME_MAX 16
++struct mptcp_sched_ops {
++ struct list_head list;
++
++ struct sock * (*get_subflow)(struct sock *meta_sk,
++ struct sk_buff *skb,
++ bool zero_wnd_test);
++ struct sk_buff * (*next_segment)(struct sock *meta_sk,
++ int *reinject,
++ struct sock **subsk,
++ unsigned int *limit);
++ void (*init)(struct sock *sk);
++
++ char name[MPTCP_SCHED_NAME_MAX];
++ struct module *owner;
++};
++
++struct mptcp_cb {
++ /* list of sockets in this multipath connection */
++ struct tcp_sock *connection_list;
++ /* list of sockets that need a call to release_cb */
++ struct hlist_head callback_list;
++
++ /* High-order bits of 64-bit sequence numbers */
++ u32 snd_high_order[2];
++ u32 rcv_high_order[2];
++
++ u16 send_infinite_mapping:1,
++ in_time_wait:1,
++ list_rcvd:1, /* XXX TO REMOVE */
++ addr_signal:1, /* Path-manager wants us to call addr_signal */
++ dss_csum:1,
++ server_side:1,
++ infinite_mapping_rcv:1,
++ infinite_mapping_snd:1,
++ dfin_combined:1, /* Was the DFIN combined with subflow-fin? */
++ passive_close:1,
++ snd_hiseq_index:1, /* Index in snd_high_order of snd_nxt */
++ rcv_hiseq_index:1; /* Index in rcv_high_order of rcv_nxt */
++
++ /* socket count in this connection */
++ u8 cnt_subflows;
++ u8 cnt_established;
++
++ struct mptcp_sched_ops *sched_ops;
++
++ struct sk_buff_head reinject_queue;
++ /* First cache-line boundary is here minus 8 bytes. But from the
++ * reinject-queue only the next and prev pointers are regularly
++ * accessed. Thus, the whole data-path is on a single cache-line.
++ */
++
++ u64 csum_cutoff_seq;
++
++ /***** Start of fields, used for connection closure */
++ spinlock_t tw_lock;
++ unsigned char mptw_state;
++ u8 dfin_path_index;
++
++ struct list_head tw_list;
++
++ /***** Start of fields, used for subflow establishment and closure */
++ atomic_t mpcb_refcnt;
++
++ /* Mutex needed, because otherwise mptcp_close will complain that the
++ * socket is owned by the user.
++ * E.g., mptcp_sub_close_wq is taking the meta-lock.
++ */
++ struct mutex mpcb_mutex;
++
++ /***** Start of fields, used for subflow establishment */
++ struct sock *meta_sk;
++
++ /* Master socket, also part of the connection_list, this
++ * socket is the one that the application sees.
++ */
++ struct sock *master_sk;
++
++ __u64 mptcp_loc_key;
++ __u64 mptcp_rem_key;
++ __u32 mptcp_loc_token;
++ __u32 mptcp_rem_token;
++
++#define MPTCP_PM_SIZE 608
++ u8 mptcp_pm[MPTCP_PM_SIZE] __aligned(8);
++ struct mptcp_pm_ops *pm_ops;
++
++ u32 path_index_bits;
++ /* Next pi to pick up in case a new path becomes available */
++ u8 next_path_index;
++
++ /* Original snd/rcvbuf of the initial subflow.
++ * Used for the new subflows on the server-side to allow correct
++ * autotuning
++ */
++ int orig_sk_rcvbuf;
++ int orig_sk_sndbuf;
++ u32 orig_window_clamp;
++
++ /* Timer for retransmitting SYN/ACK+MP_JOIN */
++ struct timer_list synack_timer;
++};
++
++#define MPTCP_SUB_CAPABLE 0
++#define MPTCP_SUB_LEN_CAPABLE_SYN 12
++#define MPTCP_SUB_LEN_CAPABLE_SYN_ALIGN 12
++#define MPTCP_SUB_LEN_CAPABLE_ACK 20
++#define MPTCP_SUB_LEN_CAPABLE_ACK_ALIGN 20
++
++#define MPTCP_SUB_JOIN 1
++#define MPTCP_SUB_LEN_JOIN_SYN 12
++#define MPTCP_SUB_LEN_JOIN_SYN_ALIGN 12
++#define MPTCP_SUB_LEN_JOIN_SYNACK 16
++#define MPTCP_SUB_LEN_JOIN_SYNACK_ALIGN 16
++#define MPTCP_SUB_LEN_JOIN_ACK 24
++#define MPTCP_SUB_LEN_JOIN_ACK_ALIGN 24
++
++#define MPTCP_SUB_DSS 2
++#define MPTCP_SUB_LEN_DSS 4
++#define MPTCP_SUB_LEN_DSS_ALIGN 4
++
++/* Lengths for seq and ack are the ones without the generic MPTCP-option header,
++ * as they are part of the DSS-option.
++ * To get the total length, just add the different options together.
++ */
++#define MPTCP_SUB_LEN_SEQ 10
++#define MPTCP_SUB_LEN_SEQ_CSUM 12
++#define MPTCP_SUB_LEN_SEQ_ALIGN 12
++
++#define MPTCP_SUB_LEN_SEQ_64 14
++#define MPTCP_SUB_LEN_SEQ_CSUM_64 16
++#define MPTCP_SUB_LEN_SEQ_64_ALIGN 16
++
++#define MPTCP_SUB_LEN_ACK 4
++#define MPTCP_SUB_LEN_ACK_ALIGN 4
++
++#define MPTCP_SUB_LEN_ACK_64 8
++#define MPTCP_SUB_LEN_ACK_64_ALIGN 8
++
++/* This is the "default" option-length we will send out most often.
++ * MPTCP DSS-header
++ * 32-bit data sequence number
++ * 32-bit data ack
++ *
++ * It is necessary to calculate the effective MSS we will be using when
++ * sending data.
++ */
++#define MPTCP_SUB_LEN_DSM_ALIGN (MPTCP_SUB_LEN_DSS_ALIGN + \
++ MPTCP_SUB_LEN_SEQ_ALIGN + \
++ MPTCP_SUB_LEN_ACK_ALIGN)
++
++#define MPTCP_SUB_ADD_ADDR 3
++#define MPTCP_SUB_LEN_ADD_ADDR4 8
++#define MPTCP_SUB_LEN_ADD_ADDR6 20
++#define MPTCP_SUB_LEN_ADD_ADDR4_ALIGN 8
++#define MPTCP_SUB_LEN_ADD_ADDR6_ALIGN 20
++
++#define MPTCP_SUB_REMOVE_ADDR 4
++#define MPTCP_SUB_LEN_REMOVE_ADDR 4
++
++#define MPTCP_SUB_PRIO 5
++#define MPTCP_SUB_LEN_PRIO 3
++#define MPTCP_SUB_LEN_PRIO_ADDR 4
++#define MPTCP_SUB_LEN_PRIO_ALIGN 4
++
++#define MPTCP_SUB_FAIL 6
++#define MPTCP_SUB_LEN_FAIL 12
++#define MPTCP_SUB_LEN_FAIL_ALIGN 12
++
++#define MPTCP_SUB_FCLOSE 7
++#define MPTCP_SUB_LEN_FCLOSE 12
++#define MPTCP_SUB_LEN_FCLOSE_ALIGN 12
++
++
++#define OPTION_MPTCP (1 << 5)
++
++#ifdef CONFIG_MPTCP
++
++/* Used for checking if the mptcp initialization has been successful */
++extern bool mptcp_init_failed;
++
++/* MPTCP options */
++#define OPTION_TYPE_SYN (1 << 0)
++#define OPTION_TYPE_SYNACK (1 << 1)
++#define OPTION_TYPE_ACK (1 << 2)
++#define OPTION_MP_CAPABLE (1 << 3)
++#define OPTION_DATA_ACK (1 << 4)
++#define OPTION_ADD_ADDR (1 << 5)
++#define OPTION_MP_JOIN (1 << 6)
++#define OPTION_MP_FAIL (1 << 7)
++#define OPTION_MP_FCLOSE (1 << 8)
++#define OPTION_REMOVE_ADDR (1 << 9)
++#define OPTION_MP_PRIO (1 << 10)
++
++/* MPTCP flags: both TX and RX */
++#define MPTCPHDR_SEQ 0x01 /* DSS.M option is present */
++#define MPTCPHDR_FIN 0x02 /* DSS.F option is present */
++#define MPTCPHDR_SEQ64_INDEX 0x04 /* index of seq in mpcb->snd_high_order */
++/* MPTCP flags: RX only */
++#define MPTCPHDR_ACK 0x08
++#define MPTCPHDR_SEQ64_SET 0x10 /* Did we received a 64-bit seq number? */
++#define MPTCPHDR_SEQ64_OFO 0x20 /* Is it not in our circular array? */
++#define MPTCPHDR_DSS_CSUM 0x40
++#define MPTCPHDR_JOIN 0x80
++/* MPTCP flags: TX only */
++#define MPTCPHDR_INF 0x08
++
++struct mptcp_option {
++ __u8 kind;
++ __u8 len;
++#if defined(__LITTLE_ENDIAN_BITFIELD)
++ __u8 ver:4,
++ sub:4;
++#elif defined(__BIG_ENDIAN_BITFIELD)
++ __u8 sub:4,
++ ver:4;
++#else
++#error "Adjust your <asm/byteorder.h> defines"
++#endif
++};
++
++struct mp_capable {
++ __u8 kind;
++ __u8 len;
++#if defined(__LITTLE_ENDIAN_BITFIELD)
++ __u8 ver:4,
++ sub:4;
++ __u8 h:1,
++ rsv:5,
++ b:1,
++ a:1;
++#elif defined(__BIG_ENDIAN_BITFIELD)
++ __u8 sub:4,
++ ver:4;
++ __u8 a:1,
++ b:1,
++ rsv:5,
++ h:1;
++#else
++#error "Adjust your <asm/byteorder.h> defines"
++#endif
++ __u64 sender_key;
++ __u64 receiver_key;
++} __attribute__((__packed__));
++
++struct mp_join {
++ __u8 kind;
++ __u8 len;
++#if defined(__LITTLE_ENDIAN_BITFIELD)
++ __u8 b:1,
++ rsv:3,
++ sub:4;
++#elif defined(__BIG_ENDIAN_BITFIELD)
++ __u8 sub:4,
++ rsv:3,
++ b:1;
++#else
++#error "Adjust your <asm/byteorder.h> defines"
++#endif
++ __u8 addr_id;
++ union {
++ struct {
++ u32 token;
++ u32 nonce;
++ } syn;
++ struct {
++ __u64 mac;
++ u32 nonce;
++ } synack;
++ struct {
++ __u8 mac[20];
++ } ack;
++ } u;
++} __attribute__((__packed__));
++
++struct mp_dss {
++ __u8 kind;
++ __u8 len;
++#if defined(__LITTLE_ENDIAN_BITFIELD)
++ __u16 rsv1:4,
++ sub:4,
++ A:1,
++ a:1,
++ M:1,
++ m:1,
++ F:1,
++ rsv2:3;
++#elif defined(__BIG_ENDIAN_BITFIELD)
++ __u16 sub:4,
++ rsv1:4,
++ rsv2:3,
++ F:1,
++ m:1,
++ M:1,
++ a:1,
++ A:1;
++#else
++#error "Adjust your <asm/byteorder.h> defines"
++#endif
++};
++
++struct mp_add_addr {
++ __u8 kind;
++ __u8 len;
++#if defined(__LITTLE_ENDIAN_BITFIELD)
++ __u8 ipver:4,
++ sub:4;
++#elif defined(__BIG_ENDIAN_BITFIELD)
++ __u8 sub:4,
++ ipver:4;
++#else
++#error "Adjust your <asm/byteorder.h> defines"
++#endif
++ __u8 addr_id;
++ union {
++ struct {
++ struct in_addr addr;
++ __be16 port;
++ } v4;
++ struct {
++ struct in6_addr addr;
++ __be16 port;
++ } v6;
++ } u;
++} __attribute__((__packed__));
++
++struct mp_remove_addr {
++ __u8 kind;
++ __u8 len;
++#if defined(__LITTLE_ENDIAN_BITFIELD)
++ __u8 rsv:4,
++ sub:4;
++#elif defined(__BIG_ENDIAN_BITFIELD)
++ __u8 sub:4,
++ rsv:4;
++#else
++#error "Adjust your <asm/byteorder.h> defines"
++#endif
++ /* list of addr_id */
++ __u8 addrs_id;
++};
++
++struct mp_fail {
++ __u8 kind;
++ __u8 len;
++#if defined(__LITTLE_ENDIAN_BITFIELD)
++ __u16 rsv1:4,
++ sub:4,
++ rsv2:8;
++#elif defined(__BIG_ENDIAN_BITFIELD)
++ __u16 sub:4,
++ rsv1:4,
++ rsv2:8;
++#else
++#error "Adjust your <asm/byteorder.h> defines"
++#endif
++ __be64 data_seq;
++} __attribute__((__packed__));
++
++struct mp_fclose {
++ __u8 kind;
++ __u8 len;
++#if defined(__LITTLE_ENDIAN_BITFIELD)
++ __u16 rsv1:4,
++ sub:4,
++ rsv2:8;
++#elif defined(__BIG_ENDIAN_BITFIELD)
++ __u16 sub:4,
++ rsv1:4,
++ rsv2:8;
++#else
++#error "Adjust your <asm/byteorder.h> defines"
++#endif
++ __u64 key;
++} __attribute__((__packed__));
++
++struct mp_prio {
++ __u8 kind;
++ __u8 len;
++#if defined(__LITTLE_ENDIAN_BITFIELD)
++ __u8 b:1,
++ rsv:3,
++ sub:4;
++#elif defined(__BIG_ENDIAN_BITFIELD)
++ __u8 sub:4,
++ rsv:3,
++ b:1;
++#else
++#error "Adjust your <asm/byteorder.h> defines"
++#endif
++ __u8 addr_id;
++} __attribute__((__packed__));
++
++static inline int mptcp_sub_len_dss(const struct mp_dss *m, const int csum)
++{
++ return 4 + m->A * (4 + m->a * 4) + m->M * (10 + m->m * 4 + csum * 2);
++}
++
++#define MPTCP_APP 2
++
++extern int sysctl_mptcp_enabled;
++extern int sysctl_mptcp_checksum;
++extern int sysctl_mptcp_debug;
++extern int sysctl_mptcp_syn_retries;
++
++extern struct workqueue_struct *mptcp_wq;
++
++#define mptcp_debug(fmt, args...) \
++ do { \
++ if (unlikely(sysctl_mptcp_debug)) \
++ pr_err(__FILE__ ": " fmt, ##args); \
++ } while (0)
++
++/* Iterates over all subflows */
++#define mptcp_for_each_tp(mpcb, tp) \
++ for ((tp) = (mpcb)->connection_list; (tp); (tp) = (tp)->mptcp->next)
++
++#define mptcp_for_each_sk(mpcb, sk) \
++ for ((sk) = (struct sock *)(mpcb)->connection_list; \
++ sk; \
++ sk = (struct sock *)tcp_sk(sk)->mptcp->next)
++
++#define mptcp_for_each_sk_safe(__mpcb, __sk, __temp) \
++ for (__sk = (struct sock *)(__mpcb)->connection_list, \
++ __temp = __sk ? (struct sock *)tcp_sk(__sk)->mptcp->next : NULL; \
++ __sk; \
++ __sk = __temp, \
++ __temp = __sk ? (struct sock *)tcp_sk(__sk)->mptcp->next : NULL)
++
++/* Iterates over all bit set to 1 in a bitset */
++#define mptcp_for_each_bit_set(b, i) \
++ for (i = ffs(b) - 1; i >= 0; i = ffs(b >> (i + 1) << (i + 1)) - 1)
++
++#define mptcp_for_each_bit_unset(b, i) \
++ mptcp_for_each_bit_set(~b, i)
++
++extern struct lock_class_key meta_key;
++extern struct lock_class_key meta_slock_key;
++extern u32 mptcp_secret[MD5_MESSAGE_BYTES / 4];
++
++/* This is needed to ensure that two subsequent key/nonce-generation result in
++ * different keys/nonces if the IPs and ports are the same.
++ */
++extern u32 mptcp_seed;
++
++#define MPTCP_HASH_SIZE 1024
++
++extern struct hlist_nulls_head tk_hashtable[MPTCP_HASH_SIZE];
++
++/* This second hashtable is needed to retrieve request socks
++ * created as a result of a join request. While the SYN contains
++ * the token, the final ack does not, so we need a separate hashtable
++ * to retrieve the mpcb.
++ */
++extern struct hlist_nulls_head mptcp_reqsk_htb[MPTCP_HASH_SIZE];
++extern spinlock_t mptcp_reqsk_hlock; /* hashtable protection */
++
++/* Lock, protecting the two hash-tables that hold the token. Namely,
++ * mptcp_reqsk_tk_htb and tk_hashtable
++ */
++extern spinlock_t mptcp_tk_hashlock; /* hashtable protection */
++
++/* Request-sockets can be hashed in the tk_htb for collision-detection or in
++ * the regular htb for join-connections. We need to define different NULLS
++ * values so that we can correctly detect a request-socket that has been
++ * recycled. See also c25eb3bfb9729.
++ */
++#define MPTCP_REQSK_NULLS_BASE (1U << 29)
++
++
++void mptcp_data_ready(struct sock *sk);
++void mptcp_write_space(struct sock *sk);
++
++void mptcp_add_meta_ofo_queue(const struct sock *meta_sk, struct sk_buff *skb,
++ struct sock *sk);
++void mptcp_ofo_queue(struct sock *meta_sk);
++void mptcp_purge_ofo_queue(struct tcp_sock *meta_tp);
++void mptcp_cleanup_rbuf(struct sock *meta_sk, int copied);
++int mptcp_add_sock(struct sock *meta_sk, struct sock *sk, u8 loc_id, u8 rem_id,
++ gfp_t flags);
++void mptcp_del_sock(struct sock *sk);
++void mptcp_update_metasocket(struct sock *sock, const struct sock *meta_sk);
++void mptcp_reinject_data(struct sock *orig_sk, int clone_it);
++void mptcp_update_sndbuf(const struct tcp_sock *tp);
++void mptcp_send_fin(struct sock *meta_sk);
++void mptcp_send_active_reset(struct sock *meta_sk, gfp_t priority);
++bool mptcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
++ int push_one, gfp_t gfp);
++void tcp_parse_mptcp_options(const struct sk_buff *skb,
++ struct mptcp_options_received *mopt);
++void mptcp_parse_options(const uint8_t *ptr, int opsize,
++ struct mptcp_options_received *mopt,
++ const struct sk_buff *skb);
++void mptcp_syn_options(const struct sock *sk, struct tcp_out_options *opts,
++ unsigned *remaining);
++void mptcp_synack_options(struct request_sock *req,
++ struct tcp_out_options *opts,
++ unsigned *remaining);
++void mptcp_established_options(struct sock *sk, struct sk_buff *skb,
++ struct tcp_out_options *opts, unsigned *size);
++void mptcp_options_write(__be32 *ptr, struct tcp_sock *tp,
++ const struct tcp_out_options *opts,
++ struct sk_buff *skb);
++void mptcp_close(struct sock *meta_sk, long timeout);
++int mptcp_doit(struct sock *sk);
++int mptcp_create_master_sk(struct sock *meta_sk, __u64 remote_key, u32 window);
++int mptcp_check_req_fastopen(struct sock *child, struct request_sock *req);
++int mptcp_check_req_master(struct sock *sk, struct sock *child,
++ struct request_sock *req,
++ struct request_sock **prev);
++struct sock *mptcp_check_req_child(struct sock *sk, struct sock *child,
++ struct request_sock *req,
++ struct request_sock **prev,
++ const struct mptcp_options_received *mopt);
++u32 __mptcp_select_window(struct sock *sk);
++void mptcp_select_initial_window(int __space, __u32 mss, __u32 *rcv_wnd,
++ __u32 *window_clamp, int wscale_ok,
++ __u8 *rcv_wscale, __u32 init_rcv_wnd,
++ const struct sock *sk);
++unsigned int mptcp_current_mss(struct sock *meta_sk);
++int mptcp_select_size(const struct sock *meta_sk, bool sg);
++void mptcp_key_sha1(u64 key, u32 *token, u64 *idsn);
++void mptcp_hmac_sha1(u8 *key_1, u8 *key_2, u8 *rand_1, u8 *rand_2,
++ u32 *hash_out);
++void mptcp_clean_rtx_infinite(const struct sk_buff *skb, struct sock *sk);
++void mptcp_fin(struct sock *meta_sk);
++void mptcp_retransmit_timer(struct sock *meta_sk);
++int mptcp_write_wakeup(struct sock *meta_sk);
++void mptcp_sub_close_wq(struct work_struct *work);
++void mptcp_sub_close(struct sock *sk, unsigned long delay);
++struct sock *mptcp_select_ack_sock(const struct sock *meta_sk);
++void mptcp_fallback_meta_sk(struct sock *meta_sk);
++int mptcp_backlog_rcv(struct sock *meta_sk, struct sk_buff *skb);
++void mptcp_ack_handler(unsigned long);
++int mptcp_check_rtt(const struct tcp_sock *tp, int time);
++int mptcp_check_snd_buf(const struct tcp_sock *tp);
++int mptcp_handle_options(struct sock *sk, const struct tcphdr *th,
++ const struct sk_buff *skb);
++void __init mptcp_init(void);
++int mptcp_trim_head(struct sock *sk, struct sk_buff *skb, u32 len);
++void mptcp_destroy_sock(struct sock *sk);
++int mptcp_rcv_synsent_state_process(struct sock *sk, struct sock **skptr,
++ const struct sk_buff *skb,
++ const struct mptcp_options_received *mopt);
++unsigned int mptcp_xmit_size_goal(const struct sock *meta_sk, u32 mss_now,
++ int large_allowed);
++int mptcp_init_tw_sock(struct sock *sk, struct tcp_timewait_sock *tw);
++void mptcp_twsk_destructor(struct tcp_timewait_sock *tw);
++void mptcp_time_wait(struct sock *sk, int state, int timeo);
++void mptcp_disconnect(struct sock *sk);
++bool mptcp_should_expand_sndbuf(const struct sock *sk);
++int mptcp_retransmit_skb(struct sock *meta_sk, struct sk_buff *skb);
++void mptcp_tsq_flags(struct sock *sk);
++void mptcp_tsq_sub_deferred(struct sock *meta_sk);
++struct mp_join *mptcp_find_join(const struct sk_buff *skb);
++void mptcp_hash_remove_bh(struct tcp_sock *meta_tp);
++void mptcp_hash_remove(struct tcp_sock *meta_tp);
++struct sock *mptcp_hash_find(const struct net *net, const u32 token);
++int mptcp_lookup_join(struct sk_buff *skb, struct inet_timewait_sock *tw);
++int mptcp_do_join_short(struct sk_buff *skb,
++ const struct mptcp_options_received *mopt,
++ struct net *net);
++void mptcp_reqsk_destructor(struct request_sock *req);
++void mptcp_reqsk_new_mptcp(struct request_sock *req,
++ const struct mptcp_options_received *mopt,
++ const struct sk_buff *skb);
++int mptcp_check_req(struct sk_buff *skb, struct net *net);
++void mptcp_connect_init(struct sock *sk);
++void mptcp_sub_force_close(struct sock *sk);
++int mptcp_sub_len_remove_addr_align(u16 bitfield);
++void mptcp_remove_shortcuts(const struct mptcp_cb *mpcb,
++ const struct sk_buff *skb);
++void mptcp_init_buffer_space(struct sock *sk);
++void mptcp_join_reqsk_init(struct mptcp_cb *mpcb, const struct request_sock *req,
++ struct sk_buff *skb);
++void mptcp_reqsk_init(struct request_sock *req, const struct sk_buff *skb);
++int mptcp_conn_request(struct sock *sk, struct sk_buff *skb);
++void mptcp_init_congestion_control(struct sock *sk);
++
++/* MPTCP-path-manager registration/initialization functions */
++int mptcp_register_path_manager(struct mptcp_pm_ops *pm);
++void mptcp_unregister_path_manager(struct mptcp_pm_ops *pm);
++void mptcp_init_path_manager(struct mptcp_cb *mpcb);
++void mptcp_cleanup_path_manager(struct mptcp_cb *mpcb);
++void mptcp_fallback_default(struct mptcp_cb *mpcb);
++void mptcp_get_default_path_manager(char *name);
++int mptcp_set_default_path_manager(const char *name);
++extern struct mptcp_pm_ops mptcp_pm_default;
++
++/* MPTCP-scheduler registration/initialization functions */
++int mptcp_register_scheduler(struct mptcp_sched_ops *sched);
++void mptcp_unregister_scheduler(struct mptcp_sched_ops *sched);
++void mptcp_init_scheduler(struct mptcp_cb *mpcb);
++void mptcp_cleanup_scheduler(struct mptcp_cb *mpcb);
++void mptcp_get_default_scheduler(char *name);
++int mptcp_set_default_scheduler(const char *name);
++extern struct mptcp_sched_ops mptcp_sched_default;
++
++static inline void mptcp_reset_synack_timer(struct sock *meta_sk,
++ unsigned long len)
++{
++ sk_reset_timer(meta_sk, &tcp_sk(meta_sk)->mpcb->synack_timer,
++ jiffies + len);
++}
++
++static inline void mptcp_delete_synack_timer(struct sock *meta_sk)
++{
++ sk_stop_timer(meta_sk, &tcp_sk(meta_sk)->mpcb->synack_timer);
++}
++
++static inline bool is_mptcp_enabled(const struct sock *sk)
++{
++ if (!sysctl_mptcp_enabled || mptcp_init_failed)
++ return false;
++
++ if (sysctl_mptcp_enabled == MPTCP_APP && !tcp_sk(sk)->mptcp_enabled)
++ return false;
++
++ return true;
++}
++
++static inline int mptcp_pi_to_flag(int pi)
++{
++ return 1 << (pi - 1);
++}
++
++static inline
++struct mptcp_request_sock *mptcp_rsk(const struct request_sock *req)
++{
++ return (struct mptcp_request_sock *)req;
++}
++
++static inline
++struct request_sock *rev_mptcp_rsk(const struct mptcp_request_sock *req)
++{
++ return (struct request_sock *)req;
++}
++
++static inline bool mptcp_can_sendpage(struct sock *sk)
++{
++ struct sock *sk_it;
++
++ if (tcp_sk(sk)->mpcb->dss_csum)
++ return false;
++
++ mptcp_for_each_sk(tcp_sk(sk)->mpcb, sk_it) {
++ if (!(sk_it->sk_route_caps & NETIF_F_SG) ||
++ !(sk_it->sk_route_caps & NETIF_F_ALL_CSUM))
++ return false;
++ }
++
++ return true;
++}
++
++static inline void mptcp_push_pending_frames(struct sock *meta_sk)
++{
++ /* We check packets out and send-head here. TCP only checks the
++ * send-head. But, MPTCP also checks packets_out, as this is an
++ * indication that we might want to do opportunistic reinjection.
++ */
++ if (tcp_sk(meta_sk)->packets_out || tcp_send_head(meta_sk)) {
++ struct tcp_sock *tp = tcp_sk(meta_sk);
++
++ /* We don't care about the MSS, because it will be set in
++ * mptcp_write_xmit.
++ */
++ __tcp_push_pending_frames(meta_sk, 0, tp->nonagle);
++ }
++}
++
++static inline void mptcp_send_reset(struct sock *sk)
++{
++ tcp_sk(sk)->ops->send_active_reset(sk, GFP_ATOMIC);
++ mptcp_sub_force_close(sk);
++}
++
++static inline bool mptcp_is_data_seq(const struct sk_buff *skb)
++{
++ return TCP_SKB_CB(skb)->mptcp_flags & MPTCPHDR_SEQ;
++}
++
++static inline bool mptcp_is_data_fin(const struct sk_buff *skb)
++{
++ return TCP_SKB_CB(skb)->mptcp_flags & MPTCPHDR_FIN;
++}
++
++/* Is it a data-fin while in infinite mapping mode?
++ * In infinite mode, a subflow-fin is in fact a data-fin.
++ */
++static inline bool mptcp_is_data_fin2(const struct sk_buff *skb,
++ const struct tcp_sock *tp)
++{
++ return mptcp_is_data_fin(skb) ||
++ (tp->mpcb->infinite_mapping_rcv && tcp_hdr(skb)->fin);
++}
++
++static inline u8 mptcp_get_64_bit(u64 data_seq, struct mptcp_cb *mpcb)
++{
++ u64 data_seq_high = (u32)(data_seq >> 32);
++
++ if (mpcb->rcv_high_order[0] == data_seq_high)
++ return 0;
++ else if (mpcb->rcv_high_order[1] == data_seq_high)
++ return MPTCPHDR_SEQ64_INDEX;
++ else
++ return MPTCPHDR_SEQ64_OFO;
++}
++
++/* Sets the data_seq and returns pointer to the in-skb field of the data_seq.
++ * If the packet has a 64-bit dseq, the pointer points to the last 32 bits.
++ */
++static inline __u32 *mptcp_skb_set_data_seq(const struct sk_buff *skb,
++ u32 *data_seq,
++ struct mptcp_cb *mpcb)
++{
++ __u32 *ptr = (__u32 *)(skb_transport_header(skb) + TCP_SKB_CB(skb)->dss_off);
++
++ if (TCP_SKB_CB(skb)->mptcp_flags & MPTCPHDR_SEQ64_SET) {
++ u64 data_seq64 = get_unaligned_be64(ptr);
++
++ if (mpcb)
++ TCP_SKB_CB(skb)->mptcp_flags |= mptcp_get_64_bit(data_seq64, mpcb);
++
++ *data_seq = (u32)data_seq64;
++ ptr++;
++ } else {
++ *data_seq = get_unaligned_be32(ptr);
++ }
++
++ return ptr;
++}
++
++static inline struct sock *mptcp_meta_sk(const struct sock *sk)
++{
++ return tcp_sk(sk)->meta_sk;
++}
++
++static inline struct tcp_sock *mptcp_meta_tp(const struct tcp_sock *tp)
++{
++ return tcp_sk(tp->meta_sk);
++}
++
++static inline int is_meta_tp(const struct tcp_sock *tp)
++{
++ return tp->mpcb && mptcp_meta_tp(tp) == tp;
++}
++
++static inline int is_meta_sk(const struct sock *sk)
++{
++ return sk->sk_type == SOCK_STREAM && sk->sk_protocol == IPPROTO_TCP &&
++ mptcp(tcp_sk(sk)) && mptcp_meta_sk(sk) == sk;
++}
++
++static inline int is_master_tp(const struct tcp_sock *tp)
++{
++ return !mptcp(tp) || (!tp->mptcp->slave_sk && !is_meta_tp(tp));
++}
++
++static inline void mptcp_hash_request_remove(struct request_sock *req)
++{
++ int in_softirq = 0;
++
++ if (hlist_nulls_unhashed(&mptcp_rsk(req)->hash_entry))
++ return;
++
++ if (in_softirq()) {
++ spin_lock(&mptcp_reqsk_hlock);
++ in_softirq = 1;
++ } else {
++ spin_lock_bh(&mptcp_reqsk_hlock);
++ }
++
++ hlist_nulls_del_init_rcu(&mptcp_rsk(req)->hash_entry);
++
++ if (in_softirq)
++ spin_unlock(&mptcp_reqsk_hlock);
++ else
++ spin_unlock_bh(&mptcp_reqsk_hlock);
++}
++
++static inline void mptcp_init_mp_opt(struct mptcp_options_received *mopt)
++{
++ mopt->saw_mpc = 0;
++ mopt->dss_csum = 0;
++ mopt->drop_me = 0;
++
++ mopt->is_mp_join = 0;
++ mopt->join_ack = 0;
++
++ mopt->saw_low_prio = 0;
++ mopt->low_prio = 0;
++
++ mopt->saw_add_addr = 0;
++ mopt->more_add_addr = 0;
++
++ mopt->saw_rem_addr = 0;
++ mopt->more_rem_addr = 0;
++
++ mopt->mp_fail = 0;
++ mopt->mp_fclose = 0;
++}
++
++static inline void mptcp_reset_mopt(struct tcp_sock *tp)
++{
++ struct mptcp_options_received *mopt = &tp->mptcp->rx_opt;
++
++ mopt->saw_low_prio = 0;
++ mopt->saw_add_addr = 0;
++ mopt->more_add_addr = 0;
++ mopt->saw_rem_addr = 0;
++ mopt->more_rem_addr = 0;
++ mopt->join_ack = 0;
++ mopt->mp_fail = 0;
++ mopt->mp_fclose = 0;
++}
++
++static inline __be32 mptcp_get_highorder_sndbits(const struct sk_buff *skb,
++ const struct mptcp_cb *mpcb)
++{
++ return htonl(mpcb->snd_high_order[(TCP_SKB_CB(skb)->mptcp_flags &
++ MPTCPHDR_SEQ64_INDEX) ? 1 : 0]);
++}
++
++static inline u64 mptcp_get_data_seq_64(const struct mptcp_cb *mpcb, int index,
++ u32 data_seq_32)
++{
++ return ((u64)mpcb->rcv_high_order[index] << 32) | data_seq_32;
++}
++
++static inline u64 mptcp_get_rcv_nxt_64(const struct tcp_sock *meta_tp)
++{
++ struct mptcp_cb *mpcb = meta_tp->mpcb;
++ return mptcp_get_data_seq_64(mpcb, mpcb->rcv_hiseq_index,
++ meta_tp->rcv_nxt);
++}
++
++static inline void mptcp_check_sndseq_wrap(struct tcp_sock *meta_tp, int inc)
++{
++ if (unlikely(meta_tp->snd_nxt > meta_tp->snd_nxt + inc)) {
++ struct mptcp_cb *mpcb = meta_tp->mpcb;
++ mpcb->snd_hiseq_index = mpcb->snd_hiseq_index ? 0 : 1;
++ mpcb->snd_high_order[mpcb->snd_hiseq_index] += 2;
++ }
++}
++
++static inline void mptcp_check_rcvseq_wrap(struct tcp_sock *meta_tp,
++ u32 old_rcv_nxt)
++{
++ if (unlikely(old_rcv_nxt > meta_tp->rcv_nxt)) {
++ struct mptcp_cb *mpcb = meta_tp->mpcb;
++ mpcb->rcv_high_order[mpcb->rcv_hiseq_index] += 2;
++ mpcb->rcv_hiseq_index = mpcb->rcv_hiseq_index ? 0 : 1;
++ }
++}
++
++static inline int mptcp_sk_can_send(const struct sock *sk)
++{
++ return tcp_passive_fastopen(sk) ||
++ ((1 << sk->sk_state) & (TCPF_ESTABLISHED | TCPF_CLOSE_WAIT) &&
++ !tcp_sk(sk)->mptcp->pre_established);
++}
++
++static inline int mptcp_sk_can_recv(const struct sock *sk)
++{
++ return (1 << sk->sk_state) & (TCPF_ESTABLISHED | TCPF_FIN_WAIT1 | TCPF_FIN_WAIT2);
++}
++
++static inline int mptcp_sk_can_send_ack(const struct sock *sk)
++{
++ return !((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV |
++ TCPF_CLOSE | TCPF_LISTEN)) &&
++ !tcp_sk(sk)->mptcp->pre_established;
++}
++
++/* Only support GSO if all subflows supports it */
++static inline bool mptcp_sk_can_gso(const struct sock *meta_sk)
++{
++ struct sock *sk;
++
++ if (tcp_sk(meta_sk)->mpcb->dss_csum)
++ return false;
++
++ mptcp_for_each_sk(tcp_sk(meta_sk)->mpcb, sk) {
++ if (!mptcp_sk_can_send(sk))
++ continue;
++ if (!sk_can_gso(sk))
++ return false;
++ }
++ return true;
++}
++
++static inline bool mptcp_can_sg(const struct sock *meta_sk)
++{
++ struct sock *sk;
++
++ if (tcp_sk(meta_sk)->mpcb->dss_csum)
++ return false;
++
++ mptcp_for_each_sk(tcp_sk(meta_sk)->mpcb, sk) {
++ if (!mptcp_sk_can_send(sk))
++ continue;
++ if (!(sk->sk_route_caps & NETIF_F_SG))
++ return false;
++ }
++ return true;
++}
++
++static inline void mptcp_set_rto(struct sock *sk)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct sock *sk_it;
++ struct inet_connection_sock *micsk = inet_csk(mptcp_meta_sk(sk));
++ __u32 max_rto = 0;
++
++ /* We are in recovery-phase on the MPTCP-level. Do not update the
++ * RTO, because this would kill exponential backoff.
++ */
++ if (micsk->icsk_retransmits)
++ return;
++
++ mptcp_for_each_sk(tp->mpcb, sk_it) {
++ if (mptcp_sk_can_send(sk_it) &&
++ inet_csk(sk_it)->icsk_rto > max_rto)
++ max_rto = inet_csk(sk_it)->icsk_rto;
++ }
++ if (max_rto) {
++ micsk->icsk_rto = max_rto << 1;
++
++ /* A successful rto-measurement - reset backoff counter */
++ micsk->icsk_backoff = 0;
++ }
++}
++
++static inline int mptcp_sysctl_syn_retries(void)
++{
++ return sysctl_mptcp_syn_retries;
++}
++
++static inline void mptcp_sub_close_passive(struct sock *sk)
++{
++ struct sock *meta_sk = mptcp_meta_sk(sk);
++ struct tcp_sock *tp = tcp_sk(sk), *meta_tp = tcp_sk(meta_sk);
++
++ /* Only close, if the app did a send-shutdown (passive close), and we
++ * received the data-ack of the data-fin.
++ */
++ if (tp->mpcb->passive_close && meta_tp->snd_una == meta_tp->write_seq)
++ mptcp_sub_close(sk, 0);
++}
++
++static inline bool mptcp_fallback_infinite(struct sock *sk, int flag)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++
++ /* If data has been acknowledged on the meta-level, fully_established
++ * will have been set before and thus we will not fall back to infinite
++ * mapping.
++ */
++ if (likely(tp->mptcp->fully_established))
++ return false;
++
++ if (!(flag & MPTCP_FLAG_DATA_ACKED))
++ return false;
++
++ /* Don't fallback twice ;) */
++ if (tp->mpcb->infinite_mapping_snd)
++ return false;
++
++ pr_err("%s %#x will fallback - pi %d, src %pI4 dst %pI4 from %pS\n",
++ __func__, tp->mpcb->mptcp_loc_token, tp->mptcp->path_index,
++ &inet_sk(sk)->inet_saddr, &inet_sk(sk)->inet_daddr,
++ __builtin_return_address(0));
++ if (!is_master_tp(tp))
++ return true;
++
++ tp->mpcb->infinite_mapping_snd = 1;
++ tp->mpcb->infinite_mapping_rcv = 1;
++ tp->mptcp->fully_established = 1;
++
++ return false;
++}
++
++/* Find the first index whose bit in the bit-field == 0 */
++static inline u8 mptcp_set_new_pathindex(struct mptcp_cb *mpcb)
++{
++ u8 base = mpcb->next_path_index;
++ int i;
++
++ /* Start at 1, because 0 is reserved for the meta-sk */
++ mptcp_for_each_bit_unset(mpcb->path_index_bits >> base, i) {
++ if (i + base < 1)
++ continue;
++ if (i + base >= sizeof(mpcb->path_index_bits) * 8)
++ break;
++ i += base;
++ mpcb->path_index_bits |= (1 << i);
++ mpcb->next_path_index = i + 1;
++ return i;
++ }
++ mptcp_for_each_bit_unset(mpcb->path_index_bits, i) {
++ if (i >= sizeof(mpcb->path_index_bits) * 8)
++ break;
++ if (i < 1)
++ continue;
++ mpcb->path_index_bits |= (1 << i);
++ mpcb->next_path_index = i + 1;
++ return i;
++ }
++
++ return 0;
++}
++
++static inline bool mptcp_v6_is_v4_mapped(const struct sock *sk)
++{
++ return sk->sk_family == AF_INET6 &&
++ ipv6_addr_type(&inet6_sk(sk)->saddr) == IPV6_ADDR_MAPPED;
++}
++
++/* TCP and MPTCP mpc flag-depending functions */
++u16 mptcp_select_window(struct sock *sk);
++void mptcp_init_buffer_space(struct sock *sk);
++void mptcp_tcp_set_rto(struct sock *sk);
++
++/* TCP and MPTCP flag-depending functions */
++bool mptcp_prune_ofo_queue(struct sock *sk);
++
++#else /* CONFIG_MPTCP */
++#define mptcp_debug(fmt, args...) \
++ do { \
++ } while (0)
++
++/* Without MPTCP, we just do one iteration
++ * over the only socket available. This assumes that
++ * the sk/tp arg is the socket in that case.
++ */
++#define mptcp_for_each_sk(mpcb, sk)
++#define mptcp_for_each_sk_safe(__mpcb, __sk, __temp)
++
++static inline bool mptcp_is_data_fin(const struct sk_buff *skb)
++{
++ return false;
++}
++static inline bool mptcp_is_data_seq(const struct sk_buff *skb)
++{
++ return false;
++}
++static inline struct sock *mptcp_meta_sk(const struct sock *sk)
++{
++ return NULL;
++}
++static inline struct tcp_sock *mptcp_meta_tp(const struct tcp_sock *tp)
++{
++ return NULL;
++}
++static inline int is_meta_sk(const struct sock *sk)
++{
++ return 0;
++}
++static inline int is_master_tp(const struct tcp_sock *tp)
++{
++ return 0;
++}
++static inline void mptcp_purge_ofo_queue(struct tcp_sock *meta_tp) {}
++static inline void mptcp_del_sock(const struct sock *sk) {}
++static inline void mptcp_update_metasocket(struct sock *sock, const struct sock *meta_sk) {}
++static inline void mptcp_reinject_data(struct sock *orig_sk, int clone_it) {}
++static inline void mptcp_update_sndbuf(const struct tcp_sock *tp) {}
++static inline void mptcp_clean_rtx_infinite(const struct sk_buff *skb,
++ const struct sock *sk) {}
++static inline void mptcp_sub_close(struct sock *sk, unsigned long delay) {}
++static inline void mptcp_set_rto(const struct sock *sk) {}
++static inline void mptcp_send_fin(const struct sock *meta_sk) {}
++static inline void mptcp_parse_options(const uint8_t *ptr, const int opsize,
++ const struct mptcp_options_received *mopt,
++ const struct sk_buff *skb) {}
++static inline void mptcp_syn_options(const struct sock *sk,
++ struct tcp_out_options *opts,
++ unsigned *remaining) {}
++static inline void mptcp_synack_options(struct request_sock *req,
++ struct tcp_out_options *opts,
++ unsigned *remaining) {}
++
++static inline void mptcp_established_options(struct sock *sk,
++ struct sk_buff *skb,
++ struct tcp_out_options *opts,
++ unsigned *size) {}
++static inline void mptcp_options_write(__be32 *ptr, struct tcp_sock *tp,
++ const struct tcp_out_options *opts,
++ struct sk_buff *skb) {}
++static inline void mptcp_close(struct sock *meta_sk, long timeout) {}
++static inline int mptcp_doit(struct sock *sk)
++{
++ return 0;
++}
++static inline int mptcp_check_req_fastopen(struct sock *child,
++ struct request_sock *req)
++{
++ return 1;
++}
++static inline int mptcp_check_req_master(const struct sock *sk,
++ const struct sock *child,
++ struct request_sock *req,
++ struct request_sock **prev)
++{
++ return 1;
++}
++static inline struct sock *mptcp_check_req_child(struct sock *sk,
++ struct sock *child,
++ struct request_sock *req,
++ struct request_sock **prev,
++ const struct mptcp_options_received *mopt)
++{
++ return NULL;
++}
++static inline unsigned int mptcp_current_mss(struct sock *meta_sk)
++{
++ return 0;
++}
++static inline int mptcp_select_size(const struct sock *meta_sk, bool sg)
++{
++ return 0;
++}
++static inline void mptcp_sub_close_passive(struct sock *sk) {}
++static inline bool mptcp_fallback_infinite(const struct sock *sk, int flag)
++{
++ return false;
++}
++static inline void mptcp_init_mp_opt(const struct mptcp_options_received *mopt) {}
++static inline int mptcp_check_rtt(const struct tcp_sock *tp, int time)
++{
++ return 0;
++}
++static inline int mptcp_check_snd_buf(const struct tcp_sock *tp)
++{
++ return 0;
++}
++static inline int mptcp_sysctl_syn_retries(void)
++{
++ return 0;
++}
++static inline void mptcp_send_reset(const struct sock *sk) {}
++static inline int mptcp_handle_options(struct sock *sk,
++ const struct tcphdr *th,
++ struct sk_buff *skb)
++{
++ return 0;
++}
++static inline void mptcp_reset_mopt(struct tcp_sock *tp) {}
++static inline void __init mptcp_init(void) {}
++static inline int mptcp_trim_head(struct sock *sk, struct sk_buff *skb, u32 len)
++{
++ return 0;
++}
++static inline bool mptcp_sk_can_gso(const struct sock *sk)
++{
++ return false;
++}
++static inline bool mptcp_can_sg(const struct sock *meta_sk)
++{
++ return false;
++}
++static inline unsigned int mptcp_xmit_size_goal(const struct sock *meta_sk,
++ u32 mss_now, int large_allowed)
++{
++ return 0;
++}
++static inline void mptcp_destroy_sock(struct sock *sk) {}
++static inline int mptcp_rcv_synsent_state_process(struct sock *sk,
++ struct sock **skptr,
++ struct sk_buff *skb,
++ const struct mptcp_options_received *mopt)
++{
++ return 0;
++}
++static inline bool mptcp_can_sendpage(struct sock *sk)
++{
++ return false;
++}
++static inline int mptcp_init_tw_sock(struct sock *sk,
++ struct tcp_timewait_sock *tw)
++{
++ return 0;
++}
++static inline void mptcp_twsk_destructor(struct tcp_timewait_sock *tw) {}
++static inline void mptcp_disconnect(struct sock *sk) {}
++static inline void mptcp_tsq_flags(struct sock *sk) {}
++static inline void mptcp_tsq_sub_deferred(struct sock *meta_sk) {}
++static inline void mptcp_hash_remove_bh(struct tcp_sock *meta_tp) {}
++static inline void mptcp_hash_remove(struct tcp_sock *meta_tp) {}
++static inline void mptcp_reqsk_new_mptcp(struct request_sock *req,
++ const struct tcp_options_received *rx_opt,
++ const struct mptcp_options_received *mopt,
++ const struct sk_buff *skb) {}
++static inline void mptcp_remove_shortcuts(const struct mptcp_cb *mpcb,
++ const struct sk_buff *skb) {}
++static inline void mptcp_delete_synack_timer(struct sock *meta_sk) {}
++#endif /* CONFIG_MPTCP */
++
++#endif /* _MPTCP_H */
+diff --git a/include/net/mptcp_v4.h b/include/net/mptcp_v4.h
+new file mode 100644
+index 000000000000..93ad97c77c5a
+--- /dev/null
++++ b/include/net/mptcp_v4.h
+@@ -0,0 +1,67 @@
++/*
++ * MPTCP implementation
++ *
++ * Initial Design & Implementation:
++ * Sébastien Barré <sebastien.barre@uclouvain.be>
++ *
++ * Current Maintainer & Author:
++ * Christoph Paasch <christoph.paasch@uclouvain.be>
++ *
++ * Additional authors:
++ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
++ * Gregory Detal <gregory.detal@uclouvain.be>
++ * Fabien Duchêne <fabien.duchene@uclouvain.be>
++ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
++ * Lavkesh Lahngir <lavkesh51@gmail.com>
++ * Andreas Ripke <ripke@neclab.eu>
++ * Vlad Dogaru <vlad.dogaru@intel.com>
++ * Octavian Purdila <octavian.purdila@intel.com>
++ * John Ronan <jronan@tssg.org>
++ * Catalin Nicutar <catalin.nicutar@gmail.com>
++ * Brandon Heller <brandonh@stanford.edu>
++ *
++ *
++ * This program is free software; you can redistribute it and/or
++ * modify it under the terms of the GNU General Public License
++ * as published by the Free Software Foundation; either version
++ * 2 of the License, or (at your option) any later version.
++ */
++
++#ifndef MPTCP_V4_H_
++#define MPTCP_V4_H_
++
++
++#include <linux/in.h>
++#include <linux/skbuff.h>
++#include <net/mptcp.h>
++#include <net/request_sock.h>
++#include <net/sock.h>
++
++extern struct request_sock_ops mptcp_request_sock_ops;
++extern const struct inet_connection_sock_af_ops mptcp_v4_specific;
++extern struct tcp_request_sock_ops mptcp_request_sock_ipv4_ops;
++extern struct tcp_request_sock_ops mptcp_join_request_sock_ipv4_ops;
++
++#ifdef CONFIG_MPTCP
++
++int mptcp_v4_do_rcv(struct sock *meta_sk, struct sk_buff *skb);
++struct sock *mptcp_v4_search_req(const __be16 rport, const __be32 raddr,
++ const __be32 laddr, const struct net *net);
++int mptcp_init4_subsockets(struct sock *meta_sk, const struct mptcp_loc4 *loc,
++ struct mptcp_rem4 *rem);
++int mptcp_pm_v4_init(void);
++void mptcp_pm_v4_undo(void);
++u32 mptcp_v4_get_nonce(__be32 saddr, __be32 daddr, __be16 sport, __be16 dport);
++u64 mptcp_v4_get_key(__be32 saddr, __be32 daddr, __be16 sport, __be16 dport);
++
++#else
++
++static inline int mptcp_v4_do_rcv(const struct sock *meta_sk,
++ const struct sk_buff *skb)
++{
++ return 0;
++}
++
++#endif /* CONFIG_MPTCP */
++
++#endif /* MPTCP_V4_H_ */
+diff --git a/include/net/mptcp_v6.h b/include/net/mptcp_v6.h
+new file mode 100644
+index 000000000000..49a4f30ccd4d
+--- /dev/null
++++ b/include/net/mptcp_v6.h
+@@ -0,0 +1,69 @@
++/*
++ * MPTCP implementation
++ *
++ * Initial Design & Implementation:
++ * Sébastien Barré <sebastien.barre@uclouvain.be>
++ *
++ * Current Maintainer & Author:
++ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
++ *
++ * Additional authors:
++ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
++ * Gregory Detal <gregory.detal@uclouvain.be>
++ * Fabien Duchêne <fabien.duchene@uclouvain.be>
++ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
++ * Lavkesh Lahngir <lavkesh51@gmail.com>
++ * Andreas Ripke <ripke@neclab.eu>
++ * Vlad Dogaru <vlad.dogaru@intel.com>
++ * Octavian Purdila <octavian.purdila@intel.com>
++ * John Ronan <jronan@tssg.org>
++ * Catalin Nicutar <catalin.nicutar@gmail.com>
++ * Brandon Heller <brandonh@stanford.edu>
++ *
++ *
++ * This program is free software; you can redistribute it and/or
++ * modify it under the terms of the GNU General Public License
++ * as published by the Free Software Foundation; either version
++ * 2 of the License, or (at your option) any later version.
++ */
++
++#ifndef _MPTCP_V6_H
++#define _MPTCP_V6_H
++
++#include <linux/in6.h>
++#include <net/if_inet6.h>
++
++#include <net/mptcp.h>
++
++
++#ifdef CONFIG_MPTCP
++extern const struct inet_connection_sock_af_ops mptcp_v6_mapped;
++extern const struct inet_connection_sock_af_ops mptcp_v6_specific;
++extern struct request_sock_ops mptcp6_request_sock_ops;
++extern struct tcp_request_sock_ops mptcp_request_sock_ipv6_ops;
++extern struct tcp_request_sock_ops mptcp_join_request_sock_ipv6_ops;
++
++int mptcp_v6_do_rcv(struct sock *meta_sk, struct sk_buff *skb);
++struct sock *mptcp_v6_search_req(const __be16 rport, const struct in6_addr *raddr,
++ const struct in6_addr *laddr, const struct net *net);
++int mptcp_init6_subsockets(struct sock *meta_sk, const struct mptcp_loc6 *loc,
++ struct mptcp_rem6 *rem);
++int mptcp_pm_v6_init(void);
++void mptcp_pm_v6_undo(void);
++__u32 mptcp_v6_get_nonce(const __be32 *saddr, const __be32 *daddr,
++ __be16 sport, __be16 dport);
++u64 mptcp_v6_get_key(const __be32 *saddr, const __be32 *daddr,
++ __be16 sport, __be16 dport);
++
++#else /* CONFIG_MPTCP */
++
++#define mptcp_v6_mapped ipv6_mapped
++
++static inline int mptcp_v6_do_rcv(struct sock *meta_sk, struct sk_buff *skb)
++{
++ return 0;
++}
++
++#endif /* CONFIG_MPTCP */
++
++#endif /* _MPTCP_V6_H */
+diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
+index 361d26077196..bae95a11c531 100644
+--- a/include/net/net_namespace.h
++++ b/include/net/net_namespace.h
+@@ -16,6 +16,7 @@
+ #include <net/netns/packet.h>
+ #include <net/netns/ipv4.h>
+ #include <net/netns/ipv6.h>
++#include <net/netns/mptcp.h>
+ #include <net/netns/ieee802154_6lowpan.h>
+ #include <net/netns/sctp.h>
+ #include <net/netns/dccp.h>
+@@ -92,6 +93,9 @@ struct net {
+ #if IS_ENABLED(CONFIG_IPV6)
+ struct netns_ipv6 ipv6;
+ #endif
++#if IS_ENABLED(CONFIG_MPTCP)
++ struct netns_mptcp mptcp;
++#endif
+ #if IS_ENABLED(CONFIG_IEEE802154_6LOWPAN)
+ struct netns_ieee802154_lowpan ieee802154_lowpan;
+ #endif
+diff --git a/include/net/netns/mptcp.h b/include/net/netns/mptcp.h
+new file mode 100644
+index 000000000000..bad418b04cc8
+--- /dev/null
++++ b/include/net/netns/mptcp.h
+@@ -0,0 +1,44 @@
++/*
++ * MPTCP implementation - MPTCP namespace
++ *
++ * Initial Design & Implementation:
++ * Sébastien Barré <sebastien.barre@uclouvain.be>
++ *
++ * Current Maintainer:
++ * Christoph Paasch <christoph.paasch@uclouvain.be>
++ *
++ * Additional authors:
++ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
++ * Gregory Detal <gregory.detal@uclouvain.be>
++ * Fabien Duchêne <fabien.duchene@uclouvain.be>
++ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
++ * Lavkesh Lahngir <lavkesh51@gmail.com>
++ * Andreas Ripke <ripke@neclab.eu>
++ * Vlad Dogaru <vlad.dogaru@intel.com>
++ * Octavian Purdila <octavian.purdila@intel.com>
++ * John Ronan <jronan@tssg.org>
++ * Catalin Nicutar <catalin.nicutar@gmail.com>
++ * Brandon Heller <brandonh@stanford.edu>
++ *
++ *
++ * This program is free software; you can redistribute it and/or
++ * modify it under the terms of the GNU General Public License
++ * as published by the Free Software Foundation; either version
++ * 2 of the License, or (at your option) any later version.
++ */
++
++#ifndef __NETNS_MPTCP_H__
++#define __NETNS_MPTCP_H__
++
++#include <linux/compiler.h>
++
++enum {
++ MPTCP_PM_FULLMESH = 0,
++ MPTCP_PM_MAX
++};
++
++struct netns_mptcp {
++ void *path_managers[MPTCP_PM_MAX];
++};
++
++#endif /* __NETNS_MPTCP_H__ */
+diff --git a/include/net/request_sock.h b/include/net/request_sock.h
+index 7f830ff67f08..e79e87a8e1a6 100644
+--- a/include/net/request_sock.h
++++ b/include/net/request_sock.h
+@@ -164,7 +164,7 @@ struct request_sock_queue {
+ };
+
+ int reqsk_queue_alloc(struct request_sock_queue *queue,
+- unsigned int nr_table_entries);
++ unsigned int nr_table_entries, gfp_t flags);
+
+ void __reqsk_queue_destroy(struct request_sock_queue *queue);
+ void reqsk_queue_destroy(struct request_sock_queue *queue);
+diff --git a/include/net/sock.h b/include/net/sock.h
+index 156350745700..0e23cae8861f 100644
+--- a/include/net/sock.h
++++ b/include/net/sock.h
+@@ -901,6 +901,16 @@ void sk_clear_memalloc(struct sock *sk);
+
+ int sk_wait_data(struct sock *sk, long *timeo);
+
++/* START - needed for MPTCP */
++struct sock *sk_prot_alloc(struct proto *prot, gfp_t priority, int family);
++void sock_lock_init(struct sock *sk);
++
++extern struct lock_class_key af_callback_keys[AF_MAX];
++extern char *const af_family_clock_key_strings[AF_MAX+1];
++
++#define SK_FLAGS_TIMESTAMP ((1UL << SOCK_TIMESTAMP) | (1UL << SOCK_TIMESTAMPING_RX_SOFTWARE))
++/* END - needed for MPTCP */
++
+ struct request_sock_ops;
+ struct timewait_sock_ops;
+ struct inet_hashinfo;
+diff --git a/include/net/tcp.h b/include/net/tcp.h
+index 7286db80e8b8..ff92e74cd684 100644
+--- a/include/net/tcp.h
++++ b/include/net/tcp.h
+@@ -177,6 +177,7 @@ void tcp_time_wait(struct sock *sk, int state, int timeo);
+ #define TCPOPT_SACK 5 /* SACK Block */
+ #define TCPOPT_TIMESTAMP 8 /* Better RTT estimations/PAWS */
+ #define TCPOPT_MD5SIG 19 /* MD5 Signature (RFC2385) */
++#define TCPOPT_MPTCP 30
+ #define TCPOPT_EXP 254 /* Experimental */
+ /* Magic number to be after the option value for sharing TCP
+ * experimental options. See draft-ietf-tcpm-experimental-options-00.txt
+@@ -229,6 +230,27 @@ void tcp_time_wait(struct sock *sk, int state, int timeo);
+ #define TFO_SERVER_WO_SOCKOPT1 0x400
+ #define TFO_SERVER_WO_SOCKOPT2 0x800
+
++/* Flags from tcp_input.c for tcp_ack */
++#define FLAG_DATA 0x01 /* Incoming frame contained data. */
++#define FLAG_WIN_UPDATE 0x02 /* Incoming ACK was a window update. */
++#define FLAG_DATA_ACKED 0x04 /* This ACK acknowledged new data. */
++#define FLAG_RETRANS_DATA_ACKED 0x08 /* "" "" some of which was retransmitted. */
++#define FLAG_SYN_ACKED 0x10 /* This ACK acknowledged SYN. */
++#define FLAG_DATA_SACKED 0x20 /* New SACK. */
++#define FLAG_ECE 0x40 /* ECE in this ACK */
++#define FLAG_SLOWPATH 0x100 /* Do not skip RFC checks for window update. */
++#define FLAG_ORIG_SACK_ACKED 0x200 /* Never retransmitted data are (s)acked */
++#define FLAG_SND_UNA_ADVANCED 0x400 /* Snd_una was changed (!= FLAG_DATA_ACKED) */
++#define FLAG_DSACKING_ACK 0x800 /* SACK blocks contained D-SACK info */
++#define FLAG_SACK_RENEGING 0x2000 /* snd_una advanced to a sacked seq */
++#define FLAG_UPDATE_TS_RECENT 0x4000 /* tcp_replace_ts_recent() */
++#define MPTCP_FLAG_DATA_ACKED 0x8000
++
++#define FLAG_ACKED (FLAG_DATA_ACKED|FLAG_SYN_ACKED)
++#define FLAG_NOT_DUP (FLAG_DATA|FLAG_WIN_UPDATE|FLAG_ACKED)
++#define FLAG_CA_ALERT (FLAG_DATA_SACKED|FLAG_ECE)
++#define FLAG_FORWARD_PROGRESS (FLAG_ACKED|FLAG_DATA_SACKED)
++
+ extern struct inet_timewait_death_row tcp_death_row;
+
+ /* sysctl variables for tcp */
+@@ -344,6 +366,107 @@ extern struct proto tcp_prot;
+ #define TCP_ADD_STATS_USER(net, field, val) SNMP_ADD_STATS_USER((net)->mib.tcp_statistics, field, val)
+ #define TCP_ADD_STATS(net, field, val) SNMP_ADD_STATS((net)->mib.tcp_statistics, field, val)
+
++/**** START - Exports needed for MPTCP ****/
++extern const struct tcp_request_sock_ops tcp_request_sock_ipv4_ops;
++extern const struct tcp_request_sock_ops tcp_request_sock_ipv6_ops;
++
++struct mptcp_options_received;
++
++void tcp_enter_quickack_mode(struct sock *sk);
++int tcp_close_state(struct sock *sk);
++void tcp_minshall_update(struct tcp_sock *tp, unsigned int mss_now,
++ const struct sk_buff *skb);
++int tcp_xmit_probe_skb(struct sock *sk, int urgent);
++void tcp_event_new_data_sent(struct sock *sk, const struct sk_buff *skb);
++int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
++ gfp_t gfp_mask);
++unsigned int tcp_mss_split_point(const struct sock *sk,
++ const struct sk_buff *skb,
++ unsigned int mss_now,
++ unsigned int max_segs,
++ int nonagle);
++bool tcp_nagle_test(const struct tcp_sock *tp, const struct sk_buff *skb,
++ unsigned int cur_mss, int nonagle);
++bool tcp_snd_wnd_test(const struct tcp_sock *tp, const struct sk_buff *skb,
++ unsigned int cur_mss);
++unsigned int tcp_cwnd_test(const struct tcp_sock *tp, const struct sk_buff *skb);
++int tcp_init_tso_segs(const struct sock *sk, struct sk_buff *skb,
++ unsigned int mss_now);
++void __pskb_trim_head(struct sk_buff *skb, int len);
++void tcp_queue_skb(struct sock *sk, struct sk_buff *skb);
++void tcp_init_nondata_skb(struct sk_buff *skb, u32 seq, u8 flags);
++void tcp_reset(struct sock *sk);
++bool tcp_may_update_window(const struct tcp_sock *tp, const u32 ack,
++ const u32 ack_seq, const u32 nwin);
++bool tcp_urg_mode(const struct tcp_sock *tp);
++void tcp_ack_probe(struct sock *sk);
++void tcp_rearm_rto(struct sock *sk);
++int tcp_write_timeout(struct sock *sk);
++bool retransmits_timed_out(struct sock *sk, unsigned int boundary,
++ unsigned int timeout, bool syn_set);
++void tcp_write_err(struct sock *sk);
++void tcp_adjust_pcount(struct sock *sk, const struct sk_buff *skb, int decr);
++void tcp_set_skb_tso_segs(const struct sock *sk, struct sk_buff *skb,
++ unsigned int mss_now);
++
++int tcp_v4_rtx_synack(struct sock *sk, struct request_sock *req);
++void tcp_v4_reqsk_send_ack(struct sock *sk, struct sk_buff *skb,
++ struct request_sock *req);
++__u32 tcp_v4_init_sequence(const struct sk_buff *skb);
++int tcp_v4_send_synack(struct sock *sk, struct dst_entry *dst,
++ struct flowi *fl,
++ struct request_sock *req,
++ u16 queue_mapping,
++ struct tcp_fastopen_cookie *foc);
++void tcp_v4_send_reset(struct sock *sk, struct sk_buff *skb);
++struct ip_options_rcu *tcp_v4_save_options(struct sk_buff *skb);
++struct sock *tcp_v4_hnd_req(struct sock *sk, struct sk_buff *skb);
++void tcp_v4_reqsk_destructor(struct request_sock *req);
++
++int tcp_v6_rtx_synack(struct sock *sk, struct request_sock *req);
++void tcp_v6_reqsk_send_ack(struct sock *sk, struct sk_buff *skb,
++ struct request_sock *req);
++__u32 tcp_v6_init_sequence(const struct sk_buff *skb);
++int tcp_v6_send_synack(struct sock *sk, struct dst_entry *dst,
++ struct flowi *fl, struct request_sock *req,
++ u16 queue_mapping, struct tcp_fastopen_cookie *foc);
++void tcp_v6_send_reset(struct sock *sk, struct sk_buff *skb);
++int tcp_v6_do_rcv(struct sock *sk, struct sk_buff *skb);
++int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len);
++void tcp_v6_destroy_sock(struct sock *sk);
++void inet6_sk_rx_dst_set(struct sock *sk, const struct sk_buff *skb);
++void tcp_v6_hash(struct sock *sk);
++struct sock *tcp_v6_hnd_req(struct sock *sk, struct sk_buff *skb);
++struct sock *tcp_v6_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
++ struct request_sock *req,
++ struct dst_entry *dst);
++void tcp_v6_reqsk_destructor(struct request_sock *req);
++
++unsigned int tcp_xmit_size_goal(struct sock *sk, u32 mss_now,
++ int large_allowed);
++u32 tcp_tso_acked(struct sock *sk, struct sk_buff *skb);
++
++void skb_clone_fraglist(struct sk_buff *skb);
++void copy_skb_header(struct sk_buff *new, const struct sk_buff *old);
++
++void inet_twsk_free(struct inet_timewait_sock *tw);
++int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb);
++/* These states need RST on ABORT according to RFC793 */
++static inline bool tcp_need_reset(int state)
++{
++ return (1 << state) &
++ (TCPF_ESTABLISHED | TCPF_CLOSE_WAIT | TCPF_FIN_WAIT1 |
++ TCPF_FIN_WAIT2 | TCPF_SYN_RECV);
++}
++
++bool tcp_dma_try_early_copy(struct sock *sk, struct sk_buff *skb,
++ int hlen);
++int __must_check tcp_queue_rcv(struct sock *sk, struct sk_buff *skb, int hdrlen,
++ bool *fragstolen);
++bool tcp_try_coalesce(struct sock *sk, struct sk_buff *to,
++ struct sk_buff *from, bool *fragstolen);
++/**** END - Exports needed for MPTCP ****/
++
+ void tcp_tasklet_init(void);
+
+ void tcp_v4_err(struct sk_buff *skb, u32);
+@@ -440,6 +563,7 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
+ size_t len, int nonblock, int flags, int *addr_len);
+ void tcp_parse_options(const struct sk_buff *skb,
+ struct tcp_options_received *opt_rx,
++ struct mptcp_options_received *mopt_rx,
+ int estab, struct tcp_fastopen_cookie *foc);
+ const u8 *tcp_parse_md5sig_option(const struct tcphdr *th);
+
+@@ -493,14 +617,8 @@ static inline u32 tcp_cookie_time(void)
+
+ u32 __cookie_v4_init_sequence(const struct iphdr *iph, const struct tcphdr *th,
+ u16 *mssp);
+-__u32 cookie_v4_init_sequence(struct sock *sk, struct sk_buff *skb, __u16 *mss);
+-#else
+-static inline __u32 cookie_v4_init_sequence(struct sock *sk,
+- struct sk_buff *skb,
+- __u16 *mss)
+-{
+- return 0;
+-}
++__u32 cookie_v4_init_sequence(struct sock *sk, const struct sk_buff *skb,
++ __u16 *mss);
+ #endif
+
+ __u32 cookie_init_timestamp(struct request_sock *req);
+@@ -516,13 +634,6 @@ u32 __cookie_v6_init_sequence(const struct ipv6hdr *iph,
+ const struct tcphdr *th, u16 *mssp);
+ __u32 cookie_v6_init_sequence(struct sock *sk, const struct sk_buff *skb,
+ __u16 *mss);
+-#else
+-static inline __u32 cookie_v6_init_sequence(struct sock *sk,
+- struct sk_buff *skb,
+- __u16 *mss)
+-{
+- return 0;
+-}
+ #endif
+ /* tcp_output.c */
+
+@@ -551,10 +662,17 @@ void tcp_send_delayed_ack(struct sock *sk);
+ void tcp_send_loss_probe(struct sock *sk);
+ bool tcp_schedule_loss_probe(struct sock *sk);
+
++u16 tcp_select_window(struct sock *sk);
++bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
++ int push_one, gfp_t gfp);
++
+ /* tcp_input.c */
+ void tcp_resume_early_retransmit(struct sock *sk);
+ void tcp_rearm_rto(struct sock *sk);
+ void tcp_reset(struct sock *sk);
++void tcp_set_rto(struct sock *sk);
++bool tcp_should_expand_sndbuf(const struct sock *sk);
++bool tcp_prune_ofo_queue(struct sock *sk);
+
+ /* tcp_timer.c */
+ void tcp_init_xmit_timers(struct sock *);
+@@ -703,14 +821,27 @@ void tcp_send_window_probe(struct sock *sk);
+ */
+ struct tcp_skb_cb {
+ union {
+- struct inet_skb_parm h4;
++ union {
++ struct inet_skb_parm h4;
+ #if IS_ENABLED(CONFIG_IPV6)
+- struct inet6_skb_parm h6;
++ struct inet6_skb_parm h6;
+ #endif
+- } header; /* For incoming frames */
++ } header; /* For incoming frames */
++#ifdef CONFIG_MPTCP
++ union { /* For MPTCP outgoing frames */
++ __u32 path_mask; /* paths that tried to send this skb */
++ __u32 dss[6]; /* DSS options */
++ };
++#endif
++ };
+ __u32 seq; /* Starting sequence number */
+ __u32 end_seq; /* SEQ + FIN + SYN + datalen */
+ __u32 when; /* used to compute rtt's */
++#ifdef CONFIG_MPTCP
++ __u8 mptcp_flags; /* flags for the MPTCP layer */
++ __u8 dss_off; /* Number of 4-byte words until
++ * seq-number */
++#endif
+ __u8 tcp_flags; /* TCP header flags. (tcp[13]) */
+
+ __u8 sacked; /* State flags for SACK/FACK. */
+@@ -1075,7 +1206,8 @@ u32 tcp_default_init_rwnd(u32 mss);
+ /* Determine a window scaling and initial window to offer. */
+ void tcp_select_initial_window(int __space, __u32 mss, __u32 *rcv_wnd,
+ __u32 *window_clamp, int wscale_ok,
+- __u8 *rcv_wscale, __u32 init_rcv_wnd);
++ __u8 *rcv_wscale, __u32 init_rcv_wnd,
++ const struct sock *sk);
+
+ static inline int tcp_win_from_space(int space)
+ {
+@@ -1084,15 +1216,34 @@ static inline int tcp_win_from_space(int space)
+ space - (space>>sysctl_tcp_adv_win_scale);
+ }
+
++#ifdef CONFIG_MPTCP
++extern struct static_key mptcp_static_key;
++static inline bool mptcp(const struct tcp_sock *tp)
++{
++ return static_key_false(&mptcp_static_key) && tp->mpc;
++}
++#else
++static inline bool mptcp(const struct tcp_sock *tp)
++{
++ return false;
++}
++#endif
++
+ /* Note: caller must be prepared to deal with negative returns */
+ static inline int tcp_space(const struct sock *sk)
+ {
++ if (mptcp(tcp_sk(sk)))
++ sk = tcp_sk(sk)->meta_sk;
++
+ return tcp_win_from_space(sk->sk_rcvbuf -
+ atomic_read(&sk->sk_rmem_alloc));
+ }
+
+ static inline int tcp_full_space(const struct sock *sk)
+ {
++ if (mptcp(tcp_sk(sk)))
++ sk = tcp_sk(sk)->meta_sk;
++
+ return tcp_win_from_space(sk->sk_rcvbuf);
+ }
+
+@@ -1115,6 +1266,8 @@ static inline void tcp_openreq_init(struct request_sock *req,
+ ireq->wscale_ok = rx_opt->wscale_ok;
+ ireq->acked = 0;
+ ireq->ecn_ok = 0;
++ ireq->mptcp_rqsk = 0;
++ ireq->saw_mpc = 0;
+ ireq->ir_rmt_port = tcp_hdr(skb)->source;
+ ireq->ir_num = ntohs(tcp_hdr(skb)->dest);
+ }
+@@ -1585,6 +1738,11 @@ int tcp4_proc_init(void);
+ void tcp4_proc_exit(void);
+ #endif
+
++int tcp_rtx_synack(struct sock *sk, struct request_sock *req);
++int tcp_conn_request(struct request_sock_ops *rsk_ops,
++ const struct tcp_request_sock_ops *af_ops,
++ struct sock *sk, struct sk_buff *skb);
++
+ /* TCP af-specific functions */
+ struct tcp_sock_af_ops {
+ #ifdef CONFIG_TCP_MD5SIG
+@@ -1601,7 +1759,32 @@ struct tcp_sock_af_ops {
+ #endif
+ };
+
++/* TCP/MPTCP-specific functions */
++struct tcp_sock_ops {
++ u32 (*__select_window)(struct sock *sk);
++ u16 (*select_window)(struct sock *sk);
++ void (*select_initial_window)(int __space, __u32 mss, __u32 *rcv_wnd,
++ __u32 *window_clamp, int wscale_ok,
++ __u8 *rcv_wscale, __u32 init_rcv_wnd,
++ const struct sock *sk);
++ void (*init_buffer_space)(struct sock *sk);
++ void (*set_rto)(struct sock *sk);
++ bool (*should_expand_sndbuf)(const struct sock *sk);
++ void (*send_fin)(struct sock *sk);
++ bool (*write_xmit)(struct sock *sk, unsigned int mss_now, int nonagle,
++ int push_one, gfp_t gfp);
++ void (*send_active_reset)(struct sock *sk, gfp_t priority);
++ int (*write_wakeup)(struct sock *sk);
++ bool (*prune_ofo_queue)(struct sock *sk);
++ void (*retransmit_timer)(struct sock *sk);
++ void (*time_wait)(struct sock *sk, int state, int timeo);
++ void (*cleanup_rbuf)(struct sock *sk, int copied);
++ void (*init_congestion_control)(struct sock *sk);
++};
++extern const struct tcp_sock_ops tcp_specific;
++
+ struct tcp_request_sock_ops {
++ u16 mss_clamp;
+ #ifdef CONFIG_TCP_MD5SIG
+ struct tcp_md5sig_key *(*md5_lookup) (struct sock *sk,
+ struct request_sock *req);
+@@ -1611,8 +1794,39 @@ struct tcp_request_sock_ops {
+ const struct request_sock *req,
+ const struct sk_buff *skb);
+ #endif
++ int (*init_req)(struct request_sock *req, struct sock *sk,
++ struct sk_buff *skb);
++#ifdef CONFIG_SYN_COOKIES
++ __u32 (*cookie_init_seq)(struct sock *sk, const struct sk_buff *skb,
++ __u16 *mss);
++#endif
++ struct dst_entry *(*route_req)(struct sock *sk, struct flowi *fl,
++ const struct request_sock *req,
++ bool *strict);
++ __u32 (*init_seq)(const struct sk_buff *skb);
++ int (*send_synack)(struct sock *sk, struct dst_entry *dst,
++ struct flowi *fl, struct request_sock *req,
++ u16 queue_mapping, struct tcp_fastopen_cookie *foc);
++ void (*queue_hash_add)(struct sock *sk, struct request_sock *req,
++ const unsigned long timeout);
+ };
+
++#ifdef CONFIG_SYN_COOKIES
++static inline __u32 cookie_init_sequence(const struct tcp_request_sock_ops *ops,
++ struct sock *sk, struct sk_buff *skb,
++ __u16 *mss)
++{
++ return ops->cookie_init_seq(sk, skb, mss);
++}
++#else
++static inline __u32 cookie_init_sequence(const struct tcp_request_sock_ops *ops,
++ struct sock *sk, struct sk_buff *skb,
++ __u16 *mss)
++{
++ return 0;
++}
++#endif
++
+ int tcpv4_offload_init(void);
+
+ void tcp_v4_init(void);
+diff --git a/include/uapi/linux/if.h b/include/uapi/linux/if.h
+index 9cf2394f0bcf..c2634b6ed854 100644
+--- a/include/uapi/linux/if.h
++++ b/include/uapi/linux/if.h
+@@ -109,6 +109,9 @@ enum net_device_flags {
+ #define IFF_DORMANT IFF_DORMANT
+ #define IFF_ECHO IFF_ECHO
+
++#define IFF_NOMULTIPATH 0x80000 /* Disable for MPTCP */
++#define IFF_MPBACKUP 0x100000 /* Use as backup path for MPTCP */
++
+ #define IFF_VOLATILE (IFF_LOOPBACK|IFF_POINTOPOINT|IFF_BROADCAST|IFF_ECHO|\
+ IFF_MASTER|IFF_SLAVE|IFF_RUNNING|IFF_LOWER_UP|IFF_DORMANT)
+
+diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
+index 3b9718328d8b..487475681d84 100644
+--- a/include/uapi/linux/tcp.h
++++ b/include/uapi/linux/tcp.h
+@@ -112,6 +112,7 @@ enum {
+ #define TCP_FASTOPEN 23 /* Enable FastOpen on listeners */
+ #define TCP_TIMESTAMP 24
+ #define TCP_NOTSENT_LOWAT 25 /* limit number of unsent bytes in write queue */
++#define MPTCP_ENABLED 26
+
+ struct tcp_repair_opt {
+ __u32 opt_code;
+diff --git a/net/Kconfig b/net/Kconfig
+index d92afe4204d9..96b58593ad5e 100644
+--- a/net/Kconfig
++++ b/net/Kconfig
+@@ -79,6 +79,7 @@ if INET
+ source "net/ipv4/Kconfig"
+ source "net/ipv6/Kconfig"
+ source "net/netlabel/Kconfig"
++source "net/mptcp/Kconfig"
+
+ endif # if INET
+
+diff --git a/net/Makefile b/net/Makefile
+index cbbbe6d657ca..244bac1435b1 100644
+--- a/net/Makefile
++++ b/net/Makefile
+@@ -20,6 +20,7 @@ obj-$(CONFIG_INET) += ipv4/
+ obj-$(CONFIG_XFRM) += xfrm/
+ obj-$(CONFIG_UNIX) += unix/
+ obj-$(CONFIG_NET) += ipv6/
++obj-$(CONFIG_MPTCP) += mptcp/
+ obj-$(CONFIG_PACKET) += packet/
+ obj-$(CONFIG_NET_KEY) += key/
+ obj-$(CONFIG_BRIDGE) += bridge/
+diff --git a/net/core/dev.c b/net/core/dev.c
+index 367a586d0c8a..215d2757fbf6 100644
+--- a/net/core/dev.c
++++ b/net/core/dev.c
+@@ -5420,7 +5420,7 @@ int __dev_change_flags(struct net_device *dev, unsigned int flags)
+
+ dev->flags = (flags & (IFF_DEBUG | IFF_NOTRAILERS | IFF_NOARP |
+ IFF_DYNAMIC | IFF_MULTICAST | IFF_PORTSEL |
+- IFF_AUTOMEDIA)) |
++ IFF_AUTOMEDIA | IFF_NOMULTIPATH | IFF_MPBACKUP)) |
+ (dev->flags & (IFF_UP | IFF_VOLATILE | IFF_PROMISC |
+ IFF_ALLMULTI));
+
+diff --git a/net/core/request_sock.c b/net/core/request_sock.c
+index 467f326126e0..909dfa13f499 100644
+--- a/net/core/request_sock.c
++++ b/net/core/request_sock.c
+@@ -38,7 +38,8 @@ int sysctl_max_syn_backlog = 256;
+ EXPORT_SYMBOL(sysctl_max_syn_backlog);
+
+ int reqsk_queue_alloc(struct request_sock_queue *queue,
+- unsigned int nr_table_entries)
++ unsigned int nr_table_entries,
++ gfp_t flags)
+ {
+ size_t lopt_size = sizeof(struct listen_sock);
+ struct listen_sock *lopt;
+@@ -48,9 +49,11 @@ int reqsk_queue_alloc(struct request_sock_queue *queue,
+ nr_table_entries = roundup_pow_of_two(nr_table_entries + 1);
+ lopt_size += nr_table_entries * sizeof(struct request_sock *);
+ if (lopt_size > PAGE_SIZE)
+- lopt = vzalloc(lopt_size);
++ lopt = __vmalloc(lopt_size,
++ flags | __GFP_HIGHMEM | __GFP_ZERO,
++ PAGE_KERNEL);
+ else
+- lopt = kzalloc(lopt_size, GFP_KERNEL);
++ lopt = kzalloc(lopt_size, flags);
+ if (lopt == NULL)
+ return -ENOMEM;
+
+diff --git a/net/core/skbuff.c b/net/core/skbuff.c
+index c1a33033cbe2..8abc5d60fbe3 100644
+--- a/net/core/skbuff.c
++++ b/net/core/skbuff.c
+@@ -472,7 +472,7 @@ static inline void skb_drop_fraglist(struct sk_buff *skb)
+ skb_drop_list(&skb_shinfo(skb)->frag_list);
+ }
+
+-static void skb_clone_fraglist(struct sk_buff *skb)
++void skb_clone_fraglist(struct sk_buff *skb)
+ {
+ struct sk_buff *list;
+
+@@ -897,7 +897,7 @@ static void skb_headers_offset_update(struct sk_buff *skb, int off)
+ skb->inner_mac_header += off;
+ }
+
+-static void copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
++void copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
+ {
+ __copy_skb_header(new, old);
+
+diff --git a/net/core/sock.c b/net/core/sock.c
+index 026e01f70274..359295523177 100644
+--- a/net/core/sock.c
++++ b/net/core/sock.c
+@@ -136,6 +136,11 @@
+
+ #include <trace/events/sock.h>
+
++#ifdef CONFIG_MPTCP
++#include <net/mptcp.h>
++#include <net/inet_common.h>
++#endif
++
+ #ifdef CONFIG_INET
+ #include <net/tcp.h>
+ #endif
+@@ -280,7 +285,7 @@ static const char *const af_family_slock_key_strings[AF_MAX+1] = {
+ "slock-AF_IEEE802154", "slock-AF_CAIF" , "slock-AF_ALG" ,
+ "slock-AF_NFC" , "slock-AF_VSOCK" ,"slock-AF_MAX"
+ };
+-static const char *const af_family_clock_key_strings[AF_MAX+1] = {
++char *const af_family_clock_key_strings[AF_MAX+1] = {
+ "clock-AF_UNSPEC", "clock-AF_UNIX" , "clock-AF_INET" ,
+ "clock-AF_AX25" , "clock-AF_IPX" , "clock-AF_APPLETALK",
+ "clock-AF_NETROM", "clock-AF_BRIDGE" , "clock-AF_ATMPVC" ,
+@@ -301,7 +306,7 @@ static const char *const af_family_clock_key_strings[AF_MAX+1] = {
+ * sk_callback_lock locking rules are per-address-family,
+ * so split the lock classes by using a per-AF key:
+ */
+-static struct lock_class_key af_callback_keys[AF_MAX];
++struct lock_class_key af_callback_keys[AF_MAX];
+
+ /* Take into consideration the size of the struct sk_buff overhead in the
+ * determination of these values, since that is non-constant across
+@@ -422,8 +427,6 @@ static void sock_warn_obsolete_bsdism(const char *name)
+ }
+ }
+
+-#define SK_FLAGS_TIMESTAMP ((1UL << SOCK_TIMESTAMP) | (1UL << SOCK_TIMESTAMPING_RX_SOFTWARE))
+-
+ static void sock_disable_timestamp(struct sock *sk, unsigned long flags)
+ {
+ if (sk->sk_flags & flags) {
+@@ -1253,8 +1256,25 @@ lenout:
+ *
+ * (We also register the sk_lock with the lock validator.)
+ */
+-static inline void sock_lock_init(struct sock *sk)
+-{
++void sock_lock_init(struct sock *sk)
++{
++#ifdef CONFIG_MPTCP
++ /* Reclassify the lock-class for subflows */
++ if (sk->sk_type == SOCK_STREAM && sk->sk_protocol == IPPROTO_TCP)
++ if (mptcp(tcp_sk(sk)) || tcp_sk(sk)->is_master_sk) {
++ sock_lock_init_class_and_name(sk, "slock-AF_INET-MPTCP",
++ &meta_slock_key,
++ "sk_lock-AF_INET-MPTCP",
++ &meta_key);
++
++ /* We don't yet have the MPTCP endpoint,
++ * so we still need inet_sock_destruct.
++ */
++ sk->sk_destruct = inet_sock_destruct;
++ return;
++ }
++#endif
++
+ sock_lock_init_class_and_name(sk,
+ af_family_slock_key_strings[sk->sk_family],
+ af_family_slock_keys + sk->sk_family,
+@@ -1301,7 +1321,7 @@ void sk_prot_clear_portaddr_nulls(struct sock *sk, int size)
+ }
+ EXPORT_SYMBOL(sk_prot_clear_portaddr_nulls);
+
+-static struct sock *sk_prot_alloc(struct proto *prot, gfp_t priority,
++struct sock *sk_prot_alloc(struct proto *prot, gfp_t priority,
+ int family)
+ {
+ struct sock *sk;
+diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
+index 4db3c2a1679c..04cb17d4b0ce 100644
+--- a/net/dccp/ipv6.c
++++ b/net/dccp/ipv6.c
+@@ -386,7 +386,7 @@ static int dccp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
+ if (sk_acceptq_is_full(sk) && inet_csk_reqsk_queue_young(sk) > 1)
+ goto drop;
+
+- req = inet6_reqsk_alloc(&dccp6_request_sock_ops);
++ req = inet_reqsk_alloc(&dccp6_request_sock_ops);
+ if (req == NULL)
+ goto drop;
+
+diff --git a/net/ipv4/Kconfig b/net/ipv4/Kconfig
+index 05c57f0fcabe..630434db0085 100644
+--- a/net/ipv4/Kconfig
++++ b/net/ipv4/Kconfig
+@@ -556,6 +556,30 @@ config TCP_CONG_ILLINOIS
+ For further details see:
+ http://www.ews.uiuc.edu/~shaoliu/tcpillinois/index.html
+
++config TCP_CONG_COUPLED
++ tristate "MPTCP COUPLED CONGESTION CONTROL"
++ depends on MPTCP
++ default n
++ ---help---
++ MultiPath TCP Coupled Congestion Control
++ To enable it, just put 'coupled' in tcp_congestion_control
++
++config TCP_CONG_OLIA
++ tristate "MPTCP Opportunistic Linked Increase"
++ depends on MPTCP
++ default n
++ ---help---
++ MultiPath TCP Opportunistic Linked Increase Congestion Control
++ To enable it, just put 'olia' in tcp_congestion_control
++
++config TCP_CONG_WVEGAS
++ tristate "MPTCP WVEGAS CONGESTION CONTROL"
++ depends on MPTCP
++ default n
++ ---help---
++ wVegas congestion control for MPTCP
++ To enable it, just put 'wvegas' in tcp_congestion_control
++
+ choice
+ prompt "Default TCP congestion control"
+ default DEFAULT_CUBIC
+@@ -584,6 +608,15 @@ choice
+ config DEFAULT_WESTWOOD
+ bool "Westwood" if TCP_CONG_WESTWOOD=y
+
++ config DEFAULT_COUPLED
++ bool "Coupled" if TCP_CONG_COUPLED=y
++
++ config DEFAULT_OLIA
++ bool "Olia" if TCP_CONG_OLIA=y
++
++ config DEFAULT_WVEGAS
++ bool "Wvegas" if TCP_CONG_WVEGAS=y
++
+ config DEFAULT_RENO
+ bool "Reno"
+
+@@ -605,6 +638,8 @@ config DEFAULT_TCP_CONG
+ default "vegas" if DEFAULT_VEGAS
+ default "westwood" if DEFAULT_WESTWOOD
+ default "veno" if DEFAULT_VENO
++ default "coupled" if DEFAULT_COUPLED
++ default "wvegas" if DEFAULT_WVEGAS
+ default "reno" if DEFAULT_RENO
+ default "cubic"
+
+diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
+index d156b3c5f363..4afd6d8d9028 100644
+--- a/net/ipv4/af_inet.c
++++ b/net/ipv4/af_inet.c
+@@ -104,6 +104,7 @@
+ #include <net/ip_fib.h>
+ #include <net/inet_connection_sock.h>
+ #include <net/tcp.h>
++#include <net/mptcp.h>
+ #include <net/udp.h>
+ #include <net/udplite.h>
+ #include <net/ping.h>
+@@ -246,8 +247,7 @@ EXPORT_SYMBOL(inet_listen);
+ * Create an inet socket.
+ */
+
+-static int inet_create(struct net *net, struct socket *sock, int protocol,
+- int kern)
++int inet_create(struct net *net, struct socket *sock, int protocol, int kern)
+ {
+ struct sock *sk;
+ struct inet_protosw *answer;
+@@ -676,6 +676,23 @@ int inet_accept(struct socket *sock, struct socket *newsock, int flags)
+ lock_sock(sk2);
+
+ sock_rps_record_flow(sk2);
++
++ if (sk2->sk_protocol == IPPROTO_TCP && mptcp(tcp_sk(sk2))) {
++ struct sock *sk_it = sk2;
++
++ mptcp_for_each_sk(tcp_sk(sk2)->mpcb, sk_it)
++ sock_rps_record_flow(sk_it);
++
++ if (tcp_sk(sk2)->mpcb->master_sk) {
++ sk_it = tcp_sk(sk2)->mpcb->master_sk;
++
++ write_lock_bh(&sk_it->sk_callback_lock);
++ sk_it->sk_wq = newsock->wq;
++ sk_it->sk_socket = newsock;
++ write_unlock_bh(&sk_it->sk_callback_lock);
++ }
++ }
++
+ WARN_ON(!((1 << sk2->sk_state) &
+ (TCPF_ESTABLISHED | TCPF_SYN_RECV |
+ TCPF_CLOSE_WAIT | TCPF_CLOSE)));
+@@ -1763,6 +1780,9 @@ static int __init inet_init(void)
+
+ ip_init();
+
++ /* We must initialize MPTCP before TCP. */
++ mptcp_init();
++
+ tcp_v4_init();
+
+ /* Setup TCP slab cache for open requests. */
+diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
+index 14d02ea905b6..7d734d8af19b 100644
+--- a/net/ipv4/inet_connection_sock.c
++++ b/net/ipv4/inet_connection_sock.c
+@@ -23,6 +23,7 @@
+ #include <net/route.h>
+ #include <net/tcp_states.h>
+ #include <net/xfrm.h>
++#include <net/mptcp.h>
+
+ #ifdef INET_CSK_DEBUG
+ const char inet_csk_timer_bug_msg[] = "inet_csk BUG: unknown timer value\n";
+@@ -465,8 +466,8 @@ no_route:
+ }
+ EXPORT_SYMBOL_GPL(inet_csk_route_child_sock);
+
+-static inline u32 inet_synq_hash(const __be32 raddr, const __be16 rport,
+- const u32 rnd, const u32 synq_hsize)
++u32 inet_synq_hash(const __be32 raddr, const __be16 rport, const u32 rnd,
++ const u32 synq_hsize)
+ {
+ return jhash_2words((__force u32)raddr, (__force u32)rport, rnd) & (synq_hsize - 1);
+ }
+@@ -647,7 +648,7 @@ void inet_csk_reqsk_queue_prune(struct sock *parent,
+
+ lopt->clock_hand = i;
+
+- if (lopt->qlen)
++ if (lopt->qlen && !is_meta_sk(parent))
+ inet_csk_reset_keepalive_timer(parent, interval);
+ }
+ EXPORT_SYMBOL_GPL(inet_csk_reqsk_queue_prune);
+@@ -664,7 +665,9 @@ struct sock *inet_csk_clone_lock(const struct sock *sk,
+ const struct request_sock *req,
+ const gfp_t priority)
+ {
+- struct sock *newsk = sk_clone_lock(sk, priority);
++ struct sock *newsk;
++
++ newsk = sk_clone_lock(sk, priority);
+
+ if (newsk != NULL) {
+ struct inet_connection_sock *newicsk = inet_csk(newsk);
+@@ -743,7 +746,8 @@ int inet_csk_listen_start(struct sock *sk, const int nr_table_entries)
+ {
+ struct inet_sock *inet = inet_sk(sk);
+ struct inet_connection_sock *icsk = inet_csk(sk);
+- int rc = reqsk_queue_alloc(&icsk->icsk_accept_queue, nr_table_entries);
++ int rc = reqsk_queue_alloc(&icsk->icsk_accept_queue, nr_table_entries,
++ GFP_KERNEL);
+
+ if (rc != 0)
+ return rc;
+@@ -801,9 +805,14 @@ void inet_csk_listen_stop(struct sock *sk)
+
+ while ((req = acc_req) != NULL) {
+ struct sock *child = req->sk;
++ bool mutex_taken = false;
+
+ acc_req = req->dl_next;
+
++ if (is_meta_sk(child)) {
++ mutex_lock(&tcp_sk(child)->mpcb->mpcb_mutex);
++ mutex_taken = true;
++ }
+ local_bh_disable();
+ bh_lock_sock(child);
+ WARN_ON(sock_owned_by_user(child));
+@@ -832,6 +841,8 @@ void inet_csk_listen_stop(struct sock *sk)
+
+ bh_unlock_sock(child);
+ local_bh_enable();
++ if (mutex_taken)
++ mutex_unlock(&tcp_sk(child)->mpcb->mpcb_mutex);
+ sock_put(child);
+
+ sk_acceptq_removed(sk);
+diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
+index c86624b36a62..0ff3fe004d62 100644
+--- a/net/ipv4/syncookies.c
++++ b/net/ipv4/syncookies.c
+@@ -170,7 +170,8 @@ u32 __cookie_v4_init_sequence(const struct iphdr *iph, const struct tcphdr *th,
+ }
+ EXPORT_SYMBOL_GPL(__cookie_v4_init_sequence);
+
+-__u32 cookie_v4_init_sequence(struct sock *sk, struct sk_buff *skb, __u16 *mssp)
++__u32 cookie_v4_init_sequence(struct sock *sk, const struct sk_buff *skb,
++ __u16 *mssp)
+ {
+ const struct iphdr *iph = ip_hdr(skb);
+ const struct tcphdr *th = tcp_hdr(skb);
+@@ -284,7 +285,7 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb,
+
+ /* check for timestamp cookie support */
+ memset(&tcp_opt, 0, sizeof(tcp_opt));
+- tcp_parse_options(skb, &tcp_opt, 0, NULL);
++ tcp_parse_options(skb, &tcp_opt, NULL, 0, NULL);
+
+ if (!cookie_check_timestamp(&tcp_opt, sock_net(sk), &ecn_ok))
+ goto out;
+@@ -355,10 +356,10 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb,
+ /* Try to redo what tcp_v4_send_synack did. */
+ req->window_clamp = tp->window_clamp ? :dst_metric(&rt->dst, RTAX_WINDOW);
+
+- tcp_select_initial_window(tcp_full_space(sk), req->mss,
+- &req->rcv_wnd, &req->window_clamp,
+- ireq->wscale_ok, &rcv_wscale,
+- dst_metric(&rt->dst, RTAX_INITRWND));
++ tp->ops->select_initial_window(tcp_full_space(sk), req->mss,
++ &req->rcv_wnd, &req->window_clamp,
++ ireq->wscale_ok, &rcv_wscale,
++ dst_metric(&rt->dst, RTAX_INITRWND), sk);
+
+ ireq->rcv_wscale = rcv_wscale;
+
+diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
+index 9d2118e5fbc7..2cb89f886d45 100644
+--- a/net/ipv4/tcp.c
++++ b/net/ipv4/tcp.c
+@@ -271,6 +271,7 @@
+
+ #include <net/icmp.h>
+ #include <net/inet_common.h>
++#include <net/mptcp.h>
+ #include <net/tcp.h>
+ #include <net/xfrm.h>
+ #include <net/ip.h>
+@@ -371,6 +372,24 @@ static int retrans_to_secs(u8 retrans, int timeout, int rto_max)
+ return period;
+ }
+
++const struct tcp_sock_ops tcp_specific = {
++ .__select_window = __tcp_select_window,
++ .select_window = tcp_select_window,
++ .select_initial_window = tcp_select_initial_window,
++ .init_buffer_space = tcp_init_buffer_space,
++ .set_rto = tcp_set_rto,
++ .should_expand_sndbuf = tcp_should_expand_sndbuf,
++ .init_congestion_control = tcp_init_congestion_control,
++ .send_fin = tcp_send_fin,
++ .write_xmit = tcp_write_xmit,
++ .send_active_reset = tcp_send_active_reset,
++ .write_wakeup = tcp_write_wakeup,
++ .prune_ofo_queue = tcp_prune_ofo_queue,
++ .retransmit_timer = tcp_retransmit_timer,
++ .time_wait = tcp_time_wait,
++ .cleanup_rbuf = tcp_cleanup_rbuf,
++};
++
+ /* Address-family independent initialization for a tcp_sock.
+ *
+ * NOTE: A lot of things set to zero explicitly by call to
+@@ -419,6 +438,8 @@ void tcp_init_sock(struct sock *sk)
+ sk->sk_sndbuf = sysctl_tcp_wmem[1];
+ sk->sk_rcvbuf = sysctl_tcp_rmem[1];
+
++ tp->ops = &tcp_specific;
++
+ local_bh_disable();
+ sock_update_memcg(sk);
+ sk_sockets_allocated_inc(sk);
+@@ -726,6 +747,14 @@ ssize_t tcp_splice_read(struct socket *sock, loff_t *ppos,
+ int ret;
+
+ sock_rps_record_flow(sk);
++
++#ifdef CONFIG_MPTCP
++ if (mptcp(tcp_sk(sk))) {
++ struct sock *sk_it;
++ mptcp_for_each_sk(tcp_sk(sk)->mpcb, sk_it)
++ sock_rps_record_flow(sk_it);
++ }
++#endif
+ /*
+ * We can't seek on a socket input
+ */
+@@ -821,8 +850,7 @@ struct sk_buff *sk_stream_alloc_skb(struct sock *sk, int size, gfp_t gfp)
+ return NULL;
+ }
+
+-static unsigned int tcp_xmit_size_goal(struct sock *sk, u32 mss_now,
+- int large_allowed)
++unsigned int tcp_xmit_size_goal(struct sock *sk, u32 mss_now, int large_allowed)
+ {
+ struct tcp_sock *tp = tcp_sk(sk);
+ u32 xmit_size_goal, old_size_goal;
+@@ -872,8 +900,13 @@ static int tcp_send_mss(struct sock *sk, int *size_goal, int flags)
+ {
+ int mss_now;
+
+- mss_now = tcp_current_mss(sk);
+- *size_goal = tcp_xmit_size_goal(sk, mss_now, !(flags & MSG_OOB));
++ if (mptcp(tcp_sk(sk))) {
++ mss_now = mptcp_current_mss(sk);
++ *size_goal = mptcp_xmit_size_goal(sk, mss_now, !(flags & MSG_OOB));
++ } else {
++ mss_now = tcp_current_mss(sk);
++ *size_goal = tcp_xmit_size_goal(sk, mss_now, !(flags & MSG_OOB));
++ }
+
+ return mss_now;
+ }
+@@ -892,11 +925,32 @@ static ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
+ * is fully established.
+ */
+ if (((1 << sk->sk_state) & ~(TCPF_ESTABLISHED | TCPF_CLOSE_WAIT)) &&
+- !tcp_passive_fastopen(sk)) {
++ !tcp_passive_fastopen(mptcp(tp) && tp->mpcb->master_sk ?
++ tp->mpcb->master_sk : sk)) {
+ if ((err = sk_stream_wait_connect(sk, &timeo)) != 0)
+ goto out_err;
+ }
+
++ if (mptcp(tp)) {
++ struct sock *sk_it = sk;
++
++		/* We must check this with the socket lock held because we iterate
++ * over the subflows.
++ */
++ if (!mptcp_can_sendpage(sk)) {
++ ssize_t ret;
++
++ release_sock(sk);
++ ret = sock_no_sendpage(sk->sk_socket, page, offset,
++ size, flags);
++ lock_sock(sk);
++ return ret;
++ }
++
++ mptcp_for_each_sk(tp->mpcb, sk_it)
++ sock_rps_record_flow(sk_it);
++ }
++
+ clear_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags);
+
+ mss_now = tcp_send_mss(sk, &size_goal, flags);
+@@ -1001,8 +1055,9 @@ int tcp_sendpage(struct sock *sk, struct page *page, int offset,
+ {
+ ssize_t res;
+
+- if (!(sk->sk_route_caps & NETIF_F_SG) ||
+- !(sk->sk_route_caps & NETIF_F_ALL_CSUM))
++ /* If MPTCP is enabled, we check it later after establishment */
++ if (!mptcp(tcp_sk(sk)) && (!(sk->sk_route_caps & NETIF_F_SG) ||
++ !(sk->sk_route_caps & NETIF_F_ALL_CSUM)))
+ return sock_no_sendpage(sk->sk_socket, page, offset, size,
+ flags);
+
+@@ -1018,6 +1073,9 @@ static inline int select_size(const struct sock *sk, bool sg)
+ const struct tcp_sock *tp = tcp_sk(sk);
+ int tmp = tp->mss_cache;
+
++ if (mptcp(tp))
++ return mptcp_select_size(sk, sg);
++
+ if (sg) {
+ if (sk_can_gso(sk)) {
+ /* Small frames wont use a full page:
+@@ -1100,11 +1158,18 @@ int tcp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
+ * is fully established.
+ */
+ if (((1 << sk->sk_state) & ~(TCPF_ESTABLISHED | TCPF_CLOSE_WAIT)) &&
+- !tcp_passive_fastopen(sk)) {
++ !tcp_passive_fastopen(mptcp(tp) && tp->mpcb->master_sk ?
++ tp->mpcb->master_sk : sk)) {
+ if ((err = sk_stream_wait_connect(sk, &timeo)) != 0)
+ goto do_error;
+ }
+
++ if (mptcp(tp)) {
++ struct sock *sk_it = sk;
++ mptcp_for_each_sk(tp->mpcb, sk_it)
++ sock_rps_record_flow(sk_it);
++ }
++
+ if (unlikely(tp->repair)) {
+ if (tp->repair_queue == TCP_RECV_QUEUE) {
+ copied = tcp_send_rcvq(sk, msg, size);
+@@ -1132,7 +1197,10 @@ int tcp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
+ if (sk->sk_err || (sk->sk_shutdown & SEND_SHUTDOWN))
+ goto out_err;
+
+- sg = !!(sk->sk_route_caps & NETIF_F_SG);
++ if (mptcp(tp))
++ sg = mptcp_can_sg(sk);
++ else
++ sg = !!(sk->sk_route_caps & NETIF_F_SG);
+
+ while (--iovlen >= 0) {
+ size_t seglen = iov->iov_len;
+@@ -1183,8 +1251,15 @@ new_segment:
+
+ /*
+ * Check whether we can use HW checksum.
++ *
++ * If dss-csum is enabled, we do not do hw-csum.
++ * In case of non-mptcp we check the
++ * device-capabilities.
++ * In case of mptcp, hw-csum's will be handled
++ * later in mptcp_write_xmit.
+ */
+- if (sk->sk_route_caps & NETIF_F_ALL_CSUM)
++ if (((mptcp(tp) && !tp->mpcb->dss_csum) || !mptcp(tp)) &&
++ (mptcp(tp) || sk->sk_route_caps & NETIF_F_ALL_CSUM))
+ skb->ip_summed = CHECKSUM_PARTIAL;
+
+ skb_entail(sk, skb);
+@@ -1422,7 +1497,7 @@ void tcp_cleanup_rbuf(struct sock *sk, int copied)
+
+ /* Optimize, __tcp_select_window() is not cheap. */
+ if (2*rcv_window_now <= tp->window_clamp) {
+- __u32 new_window = __tcp_select_window(sk);
++ __u32 new_window = tp->ops->__select_window(sk);
+
+ /* Send ACK now, if this read freed lots of space
+ * in our buffer. Certainly, new_window is new window.
+@@ -1587,7 +1662,7 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
+ /* Clean up data we have read: This will do ACK frames. */
+ if (copied > 0) {
+ tcp_recv_skb(sk, seq, &offset);
+- tcp_cleanup_rbuf(sk, copied);
++ tp->ops->cleanup_rbuf(sk, copied);
+ }
+ return copied;
+ }
+@@ -1623,6 +1698,14 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
+
+ lock_sock(sk);
+
++#ifdef CONFIG_MPTCP
++ if (mptcp(tp)) {
++ struct sock *sk_it;
++ mptcp_for_each_sk(tp->mpcb, sk_it)
++ sock_rps_record_flow(sk_it);
++ }
++#endif
++
+ err = -ENOTCONN;
+ if (sk->sk_state == TCP_LISTEN)
+ goto out;
+@@ -1761,7 +1844,7 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
+ }
+ }
+
+- tcp_cleanup_rbuf(sk, copied);
++ tp->ops->cleanup_rbuf(sk, copied);
+
+ if (!sysctl_tcp_low_latency && tp->ucopy.task == user_recv) {
+ /* Install new reader */
+@@ -1813,7 +1896,7 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
+ if (tp->rcv_wnd == 0 &&
+ !skb_queue_empty(&sk->sk_async_wait_queue)) {
+ tcp_service_net_dma(sk, true);
+- tcp_cleanup_rbuf(sk, copied);
++ tp->ops->cleanup_rbuf(sk, copied);
+ } else
+ dma_async_issue_pending(tp->ucopy.dma_chan);
+ }
+@@ -1993,7 +2076,7 @@ skip_copy:
+ */
+
+ /* Clean up data we have read: This will do ACK frames. */
+- tcp_cleanup_rbuf(sk, copied);
++ tp->ops->cleanup_rbuf(sk, copied);
+
+ release_sock(sk);
+ return copied;
+@@ -2070,7 +2153,7 @@ static const unsigned char new_state[16] = {
+ /* TCP_CLOSING */ TCP_CLOSING,
+ };
+
+-static int tcp_close_state(struct sock *sk)
++int tcp_close_state(struct sock *sk)
+ {
+ int next = (int)new_state[sk->sk_state];
+ int ns = next & TCP_STATE_MASK;
+@@ -2100,7 +2183,7 @@ void tcp_shutdown(struct sock *sk, int how)
+ TCPF_SYN_RECV | TCPF_CLOSE_WAIT)) {
+ /* Clear out any half completed packets. FIN if needed. */
+ if (tcp_close_state(sk))
+- tcp_send_fin(sk);
++ tcp_sk(sk)->ops->send_fin(sk);
+ }
+ }
+ EXPORT_SYMBOL(tcp_shutdown);
+@@ -2125,6 +2208,11 @@ void tcp_close(struct sock *sk, long timeout)
+ int data_was_unread = 0;
+ int state;
+
++ if (is_meta_sk(sk)) {
++ mptcp_close(sk, timeout);
++ return;
++ }
++
+ lock_sock(sk);
+ sk->sk_shutdown = SHUTDOWN_MASK;
+
+@@ -2167,7 +2255,7 @@ void tcp_close(struct sock *sk, long timeout)
+ /* Unread data was tossed, zap the connection. */
+ NET_INC_STATS_USER(sock_net(sk), LINUX_MIB_TCPABORTONCLOSE);
+ tcp_set_state(sk, TCP_CLOSE);
+- tcp_send_active_reset(sk, sk->sk_allocation);
++ tcp_sk(sk)->ops->send_active_reset(sk, sk->sk_allocation);
+ } else if (sock_flag(sk, SOCK_LINGER) && !sk->sk_lingertime) {
+ /* Check zero linger _after_ checking for unread data. */
+ sk->sk_prot->disconnect(sk, 0);
+@@ -2247,7 +2335,7 @@ adjudge_to_death:
+ struct tcp_sock *tp = tcp_sk(sk);
+ if (tp->linger2 < 0) {
+ tcp_set_state(sk, TCP_CLOSE);
+- tcp_send_active_reset(sk, GFP_ATOMIC);
++ tp->ops->send_active_reset(sk, GFP_ATOMIC);
+ NET_INC_STATS_BH(sock_net(sk),
+ LINUX_MIB_TCPABORTONLINGER);
+ } else {
+@@ -2257,7 +2345,8 @@ adjudge_to_death:
+ inet_csk_reset_keepalive_timer(sk,
+ tmo - TCP_TIMEWAIT_LEN);
+ } else {
+- tcp_time_wait(sk, TCP_FIN_WAIT2, tmo);
++ tcp_sk(sk)->ops->time_wait(sk, TCP_FIN_WAIT2,
++ tmo);
+ goto out;
+ }
+ }
+@@ -2266,7 +2355,7 @@ adjudge_to_death:
+ sk_mem_reclaim(sk);
+ if (tcp_check_oom(sk, 0)) {
+ tcp_set_state(sk, TCP_CLOSE);
+- tcp_send_active_reset(sk, GFP_ATOMIC);
++ tcp_sk(sk)->ops->send_active_reset(sk, GFP_ATOMIC);
+ NET_INC_STATS_BH(sock_net(sk),
+ LINUX_MIB_TCPABORTONMEMORY);
+ }
+@@ -2291,15 +2380,6 @@ out:
+ }
+ EXPORT_SYMBOL(tcp_close);
+
+-/* These states need RST on ABORT according to RFC793 */
+-
+-static inline bool tcp_need_reset(int state)
+-{
+- return (1 << state) &
+- (TCPF_ESTABLISHED | TCPF_CLOSE_WAIT | TCPF_FIN_WAIT1 |
+- TCPF_FIN_WAIT2 | TCPF_SYN_RECV);
+-}
+-
+ int tcp_disconnect(struct sock *sk, int flags)
+ {
+ struct inet_sock *inet = inet_sk(sk);
+@@ -2322,7 +2402,7 @@ int tcp_disconnect(struct sock *sk, int flags)
+ /* The last check adjusts for discrepancy of Linux wrt. RFC
+ * states
+ */
+- tcp_send_active_reset(sk, gfp_any());
++ tp->ops->send_active_reset(sk, gfp_any());
+ sk->sk_err = ECONNRESET;
+ } else if (old_state == TCP_SYN_SENT)
+ sk->sk_err = ECONNRESET;
+@@ -2340,6 +2420,13 @@ int tcp_disconnect(struct sock *sk, int flags)
+ if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK))
+ inet_reset_saddr(sk);
+
++ if (is_meta_sk(sk)) {
++ mptcp_disconnect(sk);
++ } else {
++ if (tp->inside_tk_table)
++ mptcp_hash_remove_bh(tp);
++ }
++
+ sk->sk_shutdown = 0;
+ sock_reset_flag(sk, SOCK_DONE);
+ tp->srtt_us = 0;
+@@ -2632,6 +2719,12 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
+ break;
+
+ case TCP_DEFER_ACCEPT:
++ /* An established MPTCP-connection (mptcp(tp) only returns true
++ * if the socket is established) should not use DEFER on new
++ * subflows.
++ */
++ if (mptcp(tp))
++ break;
+ /* Translate value in seconds to number of retransmits */
+ icsk->icsk_accept_queue.rskq_defer_accept =
+ secs_to_retrans(val, TCP_TIMEOUT_INIT / HZ,
+@@ -2659,7 +2752,7 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
+ (TCPF_ESTABLISHED | TCPF_CLOSE_WAIT) &&
+ inet_csk_ack_scheduled(sk)) {
+ icsk->icsk_ack.pending |= ICSK_ACK_PUSHED;
+- tcp_cleanup_rbuf(sk, 1);
++ tp->ops->cleanup_rbuf(sk, 1);
+ if (!(val & 1))
+ icsk->icsk_ack.pingpong = 1;
+ }
+@@ -2699,6 +2792,18 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
+ tp->notsent_lowat = val;
+ sk->sk_write_space(sk);
+ break;
++#ifdef CONFIG_MPTCP
++ case MPTCP_ENABLED:
++ if (sk->sk_state == TCP_CLOSE || sk->sk_state == TCP_LISTEN) {
++ if (val)
++ tp->mptcp_enabled = 1;
++ else
++ tp->mptcp_enabled = 0;
++ } else {
++ err = -EPERM;
++ }
++ break;
++#endif
+ default:
+ err = -ENOPROTOOPT;
+ break;
+@@ -2931,6 +3036,11 @@ static int do_tcp_getsockopt(struct sock *sk, int level,
+ case TCP_NOTSENT_LOWAT:
+ val = tp->notsent_lowat;
+ break;
++#ifdef CONFIG_MPTCP
++ case MPTCP_ENABLED:
++ val = tp->mptcp_enabled;
++ break;
++#endif
+ default:
+ return -ENOPROTOOPT;
+ }
+@@ -3120,8 +3230,11 @@ void tcp_done(struct sock *sk)
+ if (sk->sk_state == TCP_SYN_SENT || sk->sk_state == TCP_SYN_RECV)
+ TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_ATTEMPTFAILS);
+
++ WARN_ON(sk->sk_state == TCP_CLOSE);
+ tcp_set_state(sk, TCP_CLOSE);
++
+ tcp_clear_xmit_timers(sk);
++
+ if (req != NULL)
+ reqsk_fastopen_remove(sk, req, false);
+
+diff --git a/net/ipv4/tcp_fastopen.c b/net/ipv4/tcp_fastopen.c
+index 9771563ab564..5c230d96c4c1 100644
+--- a/net/ipv4/tcp_fastopen.c
++++ b/net/ipv4/tcp_fastopen.c
+@@ -7,6 +7,7 @@
+ #include <linux/rculist.h>
+ #include <net/inetpeer.h>
+ #include <net/tcp.h>
++#include <net/mptcp.h>
+
+ int sysctl_tcp_fastopen __read_mostly = TFO_CLIENT_ENABLE;
+
+@@ -133,7 +134,7 @@ static bool tcp_fastopen_create_child(struct sock *sk,
+ {
+ struct tcp_sock *tp;
+ struct request_sock_queue *queue = &inet_csk(sk)->icsk_accept_queue;
+- struct sock *child;
++ struct sock *child, *meta_sk;
+
+ req->num_retrans = 0;
+ req->num_timeout = 0;
+@@ -176,13 +177,6 @@ static bool tcp_fastopen_create_child(struct sock *sk,
+ /* Add the child socket directly into the accept queue */
+ inet_csk_reqsk_queue_add(sk, req, child);
+
+- /* Now finish processing the fastopen child socket. */
+- inet_csk(child)->icsk_af_ops->rebuild_header(child);
+- tcp_init_congestion_control(child);
+- tcp_mtup_init(child);
+- tcp_init_metrics(child);
+- tcp_init_buffer_space(child);
+-
+ /* Queue the data carried in the SYN packet. We need to first
+ * bump skb's refcnt because the caller will attempt to free it.
+ *
+@@ -199,8 +193,24 @@ static bool tcp_fastopen_create_child(struct sock *sk,
+ tp->syn_data_acked = 1;
+ }
+ tcp_rsk(req)->rcv_nxt = tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq;
++
++ meta_sk = child;
++ if (!mptcp_check_req_fastopen(meta_sk, req)) {
++ child = tcp_sk(meta_sk)->mpcb->master_sk;
++ tp = tcp_sk(child);
++ }
++
++ /* Now finish processing the fastopen child socket. */
++ inet_csk(child)->icsk_af_ops->rebuild_header(child);
++ tp->ops->init_congestion_control(child);
++ tcp_mtup_init(child);
++ tcp_init_metrics(child);
++ tp->ops->init_buffer_space(child);
++
+ sk->sk_data_ready(sk);
+- bh_unlock_sock(child);
++ if (mptcp(tcp_sk(child)))
++ bh_unlock_sock(child);
++ bh_unlock_sock(meta_sk);
+ sock_put(child);
+ WARN_ON(req->sk == NULL);
+ return true;
+diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
+index 40639c288dc2..3273bb69f387 100644
+--- a/net/ipv4/tcp_input.c
++++ b/net/ipv4/tcp_input.c
+@@ -74,6 +74,9 @@
+ #include <linux/ipsec.h>
+ #include <asm/unaligned.h>
+ #include <net/netdma.h>
++#include <net/mptcp.h>
++#include <net/mptcp_v4.h>
++#include <net/mptcp_v6.h>
+
+ int sysctl_tcp_timestamps __read_mostly = 1;
+ int sysctl_tcp_window_scaling __read_mostly = 1;
+@@ -99,25 +102,6 @@ int sysctl_tcp_thin_dupack __read_mostly;
+ int sysctl_tcp_moderate_rcvbuf __read_mostly = 1;
+ int sysctl_tcp_early_retrans __read_mostly = 3;
+
+-#define FLAG_DATA 0x01 /* Incoming frame contained data. */
+-#define FLAG_WIN_UPDATE 0x02 /* Incoming ACK was a window update. */
+-#define FLAG_DATA_ACKED 0x04 /* This ACK acknowledged new data. */
+-#define FLAG_RETRANS_DATA_ACKED 0x08 /* "" "" some of which was retransmitted. */
+-#define FLAG_SYN_ACKED 0x10 /* This ACK acknowledged SYN. */
+-#define FLAG_DATA_SACKED 0x20 /* New SACK. */
+-#define FLAG_ECE 0x40 /* ECE in this ACK */
+-#define FLAG_SLOWPATH 0x100 /* Do not skip RFC checks for window update.*/
+-#define FLAG_ORIG_SACK_ACKED 0x200 /* Never retransmitted data are (s)acked */
+-#define FLAG_SND_UNA_ADVANCED 0x400 /* Snd_una was changed (!= FLAG_DATA_ACKED) */
+-#define FLAG_DSACKING_ACK 0x800 /* SACK blocks contained D-SACK info */
+-#define FLAG_SACK_RENEGING 0x2000 /* snd_una advanced to a sacked seq */
+-#define FLAG_UPDATE_TS_RECENT 0x4000 /* tcp_replace_ts_recent() */
+-
+-#define FLAG_ACKED (FLAG_DATA_ACKED|FLAG_SYN_ACKED)
+-#define FLAG_NOT_DUP (FLAG_DATA|FLAG_WIN_UPDATE|FLAG_ACKED)
+-#define FLAG_CA_ALERT (FLAG_DATA_SACKED|FLAG_ECE)
+-#define FLAG_FORWARD_PROGRESS (FLAG_ACKED|FLAG_DATA_SACKED)
+-
+ #define TCP_REMNANT (TCP_FLAG_FIN|TCP_FLAG_URG|TCP_FLAG_SYN|TCP_FLAG_PSH)
+ #define TCP_HP_BITS (~(TCP_RESERVED_BITS|TCP_FLAG_PSH))
+
+@@ -181,7 +165,7 @@ static void tcp_incr_quickack(struct sock *sk)
+ icsk->icsk_ack.quick = min(quickacks, TCP_MAX_QUICKACKS);
+ }
+
+-static void tcp_enter_quickack_mode(struct sock *sk)
++void tcp_enter_quickack_mode(struct sock *sk)
+ {
+ struct inet_connection_sock *icsk = inet_csk(sk);
+ tcp_incr_quickack(sk);
+@@ -283,8 +267,12 @@ static void tcp_sndbuf_expand(struct sock *sk)
+ per_mss = roundup_pow_of_two(per_mss) +
+ SKB_DATA_ALIGN(sizeof(struct sk_buff));
+
+- nr_segs = max_t(u32, TCP_INIT_CWND, tp->snd_cwnd);
+- nr_segs = max_t(u32, nr_segs, tp->reordering + 1);
++ if (mptcp(tp)) {
++ nr_segs = mptcp_check_snd_buf(tp);
++ } else {
++ nr_segs = max_t(u32, TCP_INIT_CWND, tp->snd_cwnd);
++ nr_segs = max_t(u32, nr_segs, tp->reordering + 1);
++ }
+
+ /* Fast Recovery (RFC 5681 3.2) :
+ * Cubic needs 1.7 factor, rounded to 2 to include
+@@ -292,8 +280,16 @@ static void tcp_sndbuf_expand(struct sock *sk)
+ */
+ sndmem = 2 * nr_segs * per_mss;
+
+- if (sk->sk_sndbuf < sndmem)
++ /* MPTCP: after this sndmem is the new contribution of the
++ * current subflow to the aggregated sndbuf */
++ if (sk->sk_sndbuf < sndmem) {
++ int old_sndbuf = sk->sk_sndbuf;
+ sk->sk_sndbuf = min(sndmem, sysctl_tcp_wmem[2]);
++		/* MPTCP: ok, the subflow sndbuf has grown, reflect
++		 * this in the aggregate buffer. */
++ if (mptcp(tp) && old_sndbuf != sk->sk_sndbuf)
++ mptcp_update_sndbuf(tp);
++ }
+ }
+
+ /* 2. Tuning advertised window (window_clamp, rcv_ssthresh)
+@@ -342,10 +338,12 @@ static int __tcp_grow_window(const struct sock *sk, const struct sk_buff *skb)
+ static void tcp_grow_window(struct sock *sk, const struct sk_buff *skb)
+ {
+ struct tcp_sock *tp = tcp_sk(sk);
++ struct sock *meta_sk = mptcp(tp) ? mptcp_meta_sk(sk) : sk;
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
+
+ /* Check #1 */
+- if (tp->rcv_ssthresh < tp->window_clamp &&
+- (int)tp->rcv_ssthresh < tcp_space(sk) &&
++ if (meta_tp->rcv_ssthresh < meta_tp->window_clamp &&
++ (int)meta_tp->rcv_ssthresh < tcp_space(sk) &&
+ !sk_under_memory_pressure(sk)) {
+ int incr;
+
+@@ -353,14 +351,14 @@ static void tcp_grow_window(struct sock *sk, const struct sk_buff *skb)
+ * will fit to rcvbuf in future.
+ */
+ if (tcp_win_from_space(skb->truesize) <= skb->len)
+- incr = 2 * tp->advmss;
++ incr = 2 * meta_tp->advmss;
+ else
+- incr = __tcp_grow_window(sk, skb);
++ incr = __tcp_grow_window(meta_sk, skb);
+
+ if (incr) {
+ incr = max_t(int, incr, 2 * skb->len);
+- tp->rcv_ssthresh = min(tp->rcv_ssthresh + incr,
+- tp->window_clamp);
++ meta_tp->rcv_ssthresh = min(meta_tp->rcv_ssthresh + incr,
++ meta_tp->window_clamp);
+ inet_csk(sk)->icsk_ack.quick |= 1;
+ }
+ }
+@@ -543,7 +541,10 @@ void tcp_rcv_space_adjust(struct sock *sk)
+ int copied;
+
+ time = tcp_time_stamp - tp->rcvq_space.time;
+- if (time < (tp->rcv_rtt_est.rtt >> 3) || tp->rcv_rtt_est.rtt == 0)
++ if (mptcp(tp)) {
++ if (mptcp_check_rtt(tp, time))
++ return;
++ } else if (time < (tp->rcv_rtt_est.rtt >> 3) || tp->rcv_rtt_est.rtt == 0)
+ return;
+
+ /* Number of bytes copied to user in last RTT */
+@@ -761,7 +762,7 @@ static void tcp_update_pacing_rate(struct sock *sk)
+ /* Calculate rto without backoff. This is the second half of Van Jacobson's
+ * routine referred to above.
+ */
+-static void tcp_set_rto(struct sock *sk)
++void tcp_set_rto(struct sock *sk)
+ {
+ const struct tcp_sock *tp = tcp_sk(sk);
+ /* Old crap is replaced with new one. 8)
+@@ -1376,7 +1377,11 @@ static struct sk_buff *tcp_shift_skb_data(struct sock *sk, struct sk_buff *skb,
+ int len;
+ int in_sack;
+
+- if (!sk_can_gso(sk))
++ /* For MPTCP we cannot shift skb-data and remove one skb from the
++	 * send-queue, because this will make us lose the DSS-option (which
++ * is stored in TCP_SKB_CB(skb)->dss) of the skb we are removing.
++ */
++ if (!sk_can_gso(sk) || mptcp(tp))
+ goto fallback;
+
+ /* Normally R but no L won't result in plain S */
+@@ -2915,7 +2920,7 @@ static inline bool tcp_ack_update_rtt(struct sock *sk, const int flag,
+ return false;
+
+ tcp_rtt_estimator(sk, seq_rtt_us);
+- tcp_set_rto(sk);
++ tp->ops->set_rto(sk);
+
+ /* RFC6298: only reset backoff on valid RTT measurement. */
+ inet_csk(sk)->icsk_backoff = 0;
+@@ -3000,7 +3005,7 @@ void tcp_resume_early_retransmit(struct sock *sk)
+ }
+
+ /* If we get here, the whole TSO packet has not been acked. */
+-static u32 tcp_tso_acked(struct sock *sk, struct sk_buff *skb)
++u32 tcp_tso_acked(struct sock *sk, struct sk_buff *skb)
+ {
+ struct tcp_sock *tp = tcp_sk(sk);
+ u32 packets_acked;
+@@ -3095,6 +3100,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, int prior_fackets,
+ */
+ if (!(scb->tcp_flags & TCPHDR_SYN)) {
+ flag |= FLAG_DATA_ACKED;
++ if (mptcp(tp) && mptcp_is_data_seq(skb))
++ flag |= MPTCP_FLAG_DATA_ACKED;
+ } else {
+ flag |= FLAG_SYN_ACKED;
+ tp->retrans_stamp = 0;
+@@ -3189,7 +3196,7 @@ static int tcp_clean_rtx_queue(struct sock *sk, int prior_fackets,
+ return flag;
+ }
+
+-static void tcp_ack_probe(struct sock *sk)
++void tcp_ack_probe(struct sock *sk)
+ {
+ const struct tcp_sock *tp = tcp_sk(sk);
+ struct inet_connection_sock *icsk = inet_csk(sk);
+@@ -3236,9 +3243,8 @@ static inline bool tcp_may_raise_cwnd(const struct sock *sk, const int flag)
+ /* Check that window update is acceptable.
+ * The function assumes that snd_una<=ack<=snd_next.
+ */
+-static inline bool tcp_may_update_window(const struct tcp_sock *tp,
+- const u32 ack, const u32 ack_seq,
+- const u32 nwin)
++bool tcp_may_update_window(const struct tcp_sock *tp, const u32 ack,
++ const u32 ack_seq, const u32 nwin)
+ {
+ return after(ack, tp->snd_una) ||
+ after(ack_seq, tp->snd_wl1) ||
+@@ -3357,7 +3363,7 @@ static void tcp_process_tlp_ack(struct sock *sk, u32 ack, int flag)
+ }
+
+ /* This routine deals with incoming acks, but not outgoing ones. */
+-static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
++static int tcp_ack(struct sock *sk, struct sk_buff *skb, int flag)
+ {
+ struct inet_connection_sock *icsk = inet_csk(sk);
+ struct tcp_sock *tp = tcp_sk(sk);
+@@ -3449,6 +3455,16 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
+ sack_rtt_us);
+ acked -= tp->packets_out;
+
++ if (mptcp(tp)) {
++ if (mptcp_fallback_infinite(sk, flag)) {
++ pr_err("%s resetting flow\n", __func__);
++ mptcp_send_reset(sk);
++ goto invalid_ack;
++ }
++
++ mptcp_clean_rtx_infinite(skb, sk);
++ }
++
+ /* Advance cwnd if state allows */
+ if (tcp_may_raise_cwnd(sk, flag))
+ tcp_cong_avoid(sk, ack, acked);
+@@ -3512,8 +3528,9 @@ old_ack:
+ * the fast version below fails.
+ */
+ void tcp_parse_options(const struct sk_buff *skb,
+- struct tcp_options_received *opt_rx, int estab,
+- struct tcp_fastopen_cookie *foc)
++ struct tcp_options_received *opt_rx,
++ struct mptcp_options_received *mopt,
++ int estab, struct tcp_fastopen_cookie *foc)
+ {
+ const unsigned char *ptr;
+ const struct tcphdr *th = tcp_hdr(skb);
+@@ -3596,6 +3613,9 @@ void tcp_parse_options(const struct sk_buff *skb,
+ */
+ break;
+ #endif
++ case TCPOPT_MPTCP:
++ mptcp_parse_options(ptr - 2, opsize, mopt, skb);
++ break;
+ case TCPOPT_EXP:
+ /* Fast Open option shares code 254 using a
+ * 16 bits magic number. It's valid only in
+@@ -3657,8 +3677,8 @@ static bool tcp_fast_parse_options(const struct sk_buff *skb,
+ if (tcp_parse_aligned_timestamp(tp, th))
+ return true;
+ }
+-
+- tcp_parse_options(skb, &tp->rx_opt, 1, NULL);
++ tcp_parse_options(skb, &tp->rx_opt, mptcp(tp) ? &tp->mptcp->rx_opt : NULL,
++ 1, NULL);
+ if (tp->rx_opt.saw_tstamp && tp->rx_opt.rcv_tsecr)
+ tp->rx_opt.rcv_tsecr -= tp->tsoffset;
+
+@@ -3831,6 +3851,8 @@ static void tcp_fin(struct sock *sk)
+ dst = __sk_dst_get(sk);
+ if (!dst || !dst_metric(dst, RTAX_QUICKACK))
+ inet_csk(sk)->icsk_ack.pingpong = 1;
++ if (mptcp(tp))
++ mptcp_sub_close_passive(sk);
+ break;
+
+ case TCP_CLOSE_WAIT:
+@@ -3852,9 +3874,16 @@ static void tcp_fin(struct sock *sk)
+ tcp_set_state(sk, TCP_CLOSING);
+ break;
+ case TCP_FIN_WAIT2:
++ if (mptcp(tp)) {
++ /* The socket will get closed by mptcp_data_ready.
++ * We first have to process all data-sequences.
++ */
++ tp->close_it = 1;
++ break;
++ }
+ /* Received a FIN -- send ACK and enter TIME_WAIT. */
+ tcp_send_ack(sk);
+- tcp_time_wait(sk, TCP_TIME_WAIT, 0);
++ tp->ops->time_wait(sk, TCP_TIME_WAIT, 0);
+ break;
+ default:
+ /* Only TCP_LISTEN and TCP_CLOSE are left, in these
+@@ -3876,6 +3905,10 @@ static void tcp_fin(struct sock *sk)
+ if (!sock_flag(sk, SOCK_DEAD)) {
+ sk->sk_state_change(sk);
+
++ /* Don't wake up MPTCP-subflows */
++ if (mptcp(tp))
++ return;
++
+ /* Do not send POLL_HUP for half duplex close. */
+ if (sk->sk_shutdown == SHUTDOWN_MASK ||
+ sk->sk_state == TCP_CLOSE)
+@@ -4073,7 +4106,11 @@ static void tcp_ofo_queue(struct sock *sk)
+ tcp_dsack_extend(sk, TCP_SKB_CB(skb)->seq, dsack);
+ }
+
+- if (!after(TCP_SKB_CB(skb)->end_seq, tp->rcv_nxt)) {
++ /* In case of MPTCP, the segment may be empty if it's a
++ * non-data DATA_FIN. (see beginning of tcp_data_queue)
++ */
++ if (!after(TCP_SKB_CB(skb)->end_seq, tp->rcv_nxt) &&
++ !(mptcp(tp) && TCP_SKB_CB(skb)->end_seq == TCP_SKB_CB(skb)->seq)) {
+ SOCK_DEBUG(sk, "ofo packet was already received\n");
+ __skb_unlink(skb, &tp->out_of_order_queue);
+ __kfree_skb(skb);
+@@ -4091,12 +4128,14 @@ static void tcp_ofo_queue(struct sock *sk)
+ }
+ }
+
+-static bool tcp_prune_ofo_queue(struct sock *sk);
+ static int tcp_prune_queue(struct sock *sk);
+
+ static int tcp_try_rmem_schedule(struct sock *sk, struct sk_buff *skb,
+ unsigned int size)
+ {
++ if (mptcp(tcp_sk(sk)))
++ sk = mptcp_meta_sk(sk);
++
+ if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf ||
+ !sk_rmem_schedule(sk, skb, size)) {
+
+@@ -4104,7 +4143,7 @@ static int tcp_try_rmem_schedule(struct sock *sk, struct sk_buff *skb,
+ return -1;
+
+ if (!sk_rmem_schedule(sk, skb, size)) {
+- if (!tcp_prune_ofo_queue(sk))
++ if (!tcp_sk(sk)->ops->prune_ofo_queue(sk))
+ return -1;
+
+ if (!sk_rmem_schedule(sk, skb, size))
+@@ -4127,15 +4166,16 @@ static int tcp_try_rmem_schedule(struct sock *sk, struct sk_buff *skb,
+ * Better try to coalesce them right now to avoid future collapses.
+ * Returns true if caller should free @from instead of queueing it
+ */
+-static bool tcp_try_coalesce(struct sock *sk,
+- struct sk_buff *to,
+- struct sk_buff *from,
+- bool *fragstolen)
++bool tcp_try_coalesce(struct sock *sk, struct sk_buff *to, struct sk_buff *from,
++ bool *fragstolen)
+ {
+ int delta;
+
+ *fragstolen = false;
+
++ if (mptcp(tcp_sk(sk)) && !is_meta_sk(sk))
++ return false;
++
+ if (tcp_hdr(from)->fin)
+ return false;
+
+@@ -4225,7 +4265,9 @@ static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
+
+ /* Do skb overlap to previous one? */
+ if (skb1 && before(seq, TCP_SKB_CB(skb1)->end_seq)) {
+- if (!after(end_seq, TCP_SKB_CB(skb1)->end_seq)) {
++ /* MPTCP allows non-data data-fin to be in the ofo-queue */
++ if (!after(end_seq, TCP_SKB_CB(skb1)->end_seq) &&
++ !(mptcp(tp) && end_seq == seq)) {
+ /* All the bits are present. Drop. */
+ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPOFOMERGE);
+ __kfree_skb(skb);
+@@ -4263,6 +4305,9 @@ static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
+ end_seq);
+ break;
+ }
++ /* MPTCP allows non-data data-fin to be in the ofo-queue */
++ if (mptcp(tp) && TCP_SKB_CB(skb1)->seq == TCP_SKB_CB(skb1)->end_seq)
++ continue;
+ __skb_unlink(skb1, &tp->out_of_order_queue);
+ tcp_dsack_extend(sk, TCP_SKB_CB(skb1)->seq,
+ TCP_SKB_CB(skb1)->end_seq);
+@@ -4280,8 +4325,8 @@ end:
+ }
+ }
+
+-static int __must_check tcp_queue_rcv(struct sock *sk, struct sk_buff *skb, int hdrlen,
+- bool *fragstolen)
++int __must_check tcp_queue_rcv(struct sock *sk, struct sk_buff *skb, int hdrlen,
++ bool *fragstolen)
+ {
+ int eaten;
+ struct sk_buff *tail = skb_peek_tail(&sk->sk_receive_queue);
+@@ -4343,7 +4388,10 @@ static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)
+ int eaten = -1;
+ bool fragstolen = false;
+
+- if (TCP_SKB_CB(skb)->seq == TCP_SKB_CB(skb)->end_seq)
++ /* If no data is present, but a data_fin is in the options, we still
++ * have to call mptcp_queue_skb later on. */
++ if (TCP_SKB_CB(skb)->seq == TCP_SKB_CB(skb)->end_seq &&
++ !(mptcp(tp) && mptcp_is_data_fin(skb)))
+ goto drop;
+
+ skb_dst_drop(skb);
+@@ -4389,7 +4437,7 @@ queue_and_out:
+ eaten = tcp_queue_rcv(sk, skb, 0, &fragstolen);
+ }
+ tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq;
+- if (skb->len)
++ if (skb->len || mptcp_is_data_fin(skb))
+ tcp_event_data_recv(sk, skb);
+ if (th->fin)
+ tcp_fin(sk);
+@@ -4411,7 +4459,11 @@ queue_and_out:
+
+ if (eaten > 0)
+ kfree_skb_partial(skb, fragstolen);
+- if (!sock_flag(sk, SOCK_DEAD))
++ if (!sock_flag(sk, SOCK_DEAD) || mptcp(tp))
++ /* MPTCP: we always have to call data_ready, because
++ * we may be about to receive a data-fin, which still
++ * must get queued.
++ */
+ sk->sk_data_ready(sk);
+ return;
+ }
+@@ -4463,6 +4515,8 @@ static struct sk_buff *tcp_collapse_one(struct sock *sk, struct sk_buff *skb,
+ next = skb_queue_next(list, skb);
+
+ __skb_unlink(skb, list);
++ if (mptcp(tcp_sk(sk)))
++ mptcp_remove_shortcuts(tcp_sk(sk)->mpcb, skb);
+ __kfree_skb(skb);
+ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPRCVCOLLAPSED);
+
+@@ -4630,7 +4684,7 @@ static void tcp_collapse_ofo_queue(struct sock *sk)
+ * Purge the out-of-order queue.
+ * Return true if queue was pruned.
+ */
+-static bool tcp_prune_ofo_queue(struct sock *sk)
++bool tcp_prune_ofo_queue(struct sock *sk)
+ {
+ struct tcp_sock *tp = tcp_sk(sk);
+ bool res = false;
+@@ -4686,7 +4740,7 @@ static int tcp_prune_queue(struct sock *sk)
+ /* Collapsing did not help, destructive actions follow.
+ * This must not ever occur. */
+
+- tcp_prune_ofo_queue(sk);
++ tp->ops->prune_ofo_queue(sk);
+
+ if (atomic_read(&sk->sk_rmem_alloc) <= sk->sk_rcvbuf)
+ return 0;
+@@ -4702,7 +4756,29 @@ static int tcp_prune_queue(struct sock *sk)
+ return -1;
+ }
+
+-static bool tcp_should_expand_sndbuf(const struct sock *sk)
++/* RFC2861, slow part. Adjust cwnd, after it was not full during one rto.
++ * As additional protections, we do not touch cwnd in retransmission phases,
++ * and if application hit its sndbuf limit recently.
++ */
++void tcp_cwnd_application_limited(struct sock *sk)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++
++ if (inet_csk(sk)->icsk_ca_state == TCP_CA_Open &&
++ sk->sk_socket && !test_bit(SOCK_NOSPACE, &sk->sk_socket->flags)) {
++ /* Limited by application or receiver window. */
++ u32 init_win = tcp_init_cwnd(tp, __sk_dst_get(sk));
++ u32 win_used = max(tp->snd_cwnd_used, init_win);
++ if (win_used < tp->snd_cwnd) {
++ tp->snd_ssthresh = tcp_current_ssthresh(sk);
++ tp->snd_cwnd = (tp->snd_cwnd + win_used) >> 1;
++ }
++ tp->snd_cwnd_used = 0;
++ }
++ tp->snd_cwnd_stamp = tcp_time_stamp;
++}
++
++bool tcp_should_expand_sndbuf(const struct sock *sk)
+ {
+ const struct tcp_sock *tp = tcp_sk(sk);
+
+@@ -4737,7 +4813,7 @@ static void tcp_new_space(struct sock *sk)
+ {
+ struct tcp_sock *tp = tcp_sk(sk);
+
+- if (tcp_should_expand_sndbuf(sk)) {
++ if (tp->ops->should_expand_sndbuf(sk)) {
+ tcp_sndbuf_expand(sk);
+ tp->snd_cwnd_stamp = tcp_time_stamp;
+ }
+@@ -4749,8 +4825,9 @@ static void tcp_check_space(struct sock *sk)
+ {
+ if (sock_flag(sk, SOCK_QUEUE_SHRUNK)) {
+ sock_reset_flag(sk, SOCK_QUEUE_SHRUNK);
+- if (sk->sk_socket &&
+- test_bit(SOCK_NOSPACE, &sk->sk_socket->flags))
++ if (mptcp(tcp_sk(sk)) ||
++ (sk->sk_socket &&
++ test_bit(SOCK_NOSPACE, &sk->sk_socket->flags)))
+ tcp_new_space(sk);
+ }
+ }
+@@ -4773,7 +4850,7 @@ static void __tcp_ack_snd_check(struct sock *sk, int ofo_possible)
+ /* ... and right edge of window advances far enough.
+ * (tcp_recvmsg() will send ACK otherwise). Or...
+ */
+- __tcp_select_window(sk) >= tp->rcv_wnd) ||
++ tp->ops->__select_window(sk) >= tp->rcv_wnd) ||
+ /* We ACK each frame or... */
+ tcp_in_quickack_mode(sk) ||
+ /* We have out of order data. */
+@@ -4875,6 +4952,10 @@ static void tcp_urg(struct sock *sk, struct sk_buff *skb, const struct tcphdr *t
+ {
+ struct tcp_sock *tp = tcp_sk(sk);
+
++ /* MPTCP urgent data is not yet supported */
++ if (mptcp(tp))
++ return;
++
+ /* Check if we get a new urgent pointer - normally not. */
+ if (th->urg)
+ tcp_check_urg(sk, th);
+@@ -4942,8 +5023,7 @@ static inline bool tcp_checksum_complete_user(struct sock *sk,
+ }
+
+ #ifdef CONFIG_NET_DMA
+-static bool tcp_dma_try_early_copy(struct sock *sk, struct sk_buff *skb,
+- int hlen)
++bool tcp_dma_try_early_copy(struct sock *sk, struct sk_buff *skb, int hlen)
+ {
+ struct tcp_sock *tp = tcp_sk(sk);
+ int chunk = skb->len - hlen;
+@@ -5052,9 +5132,15 @@ syn_challenge:
+ goto discard;
+ }
+
++ /* If valid: post process the received MPTCP options. */
++ if (mptcp(tp) && mptcp_handle_options(sk, th, skb))
++ goto discard;
++
+ return true;
+
+ discard:
++ if (mptcp(tp))
++ mptcp_reset_mopt(tp);
+ __kfree_skb(skb);
+ return false;
+ }
+@@ -5106,6 +5192,10 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
+
+ tp->rx_opt.saw_tstamp = 0;
+
++ /* MPTCP: force slowpath. */
++ if (mptcp(tp))
++ goto slow_path;
++
+ /* pred_flags is 0xS?10 << 16 + snd_wnd
+ * if header_prediction is to be made
+ * 'S' will always be tp->tcp_header_len >> 2
+@@ -5205,7 +5295,7 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
+ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPHPHITSTOUSER);
+ }
+ if (copied_early)
+- tcp_cleanup_rbuf(sk, skb->len);
++ tp->ops->cleanup_rbuf(sk, skb->len);
+ }
+ if (!eaten) {
+ if (tcp_checksum_complete_user(sk, skb))
+@@ -5313,14 +5403,14 @@ void tcp_finish_connect(struct sock *sk, struct sk_buff *skb)
+
+ tcp_init_metrics(sk);
+
+- tcp_init_congestion_control(sk);
++ tp->ops->init_congestion_control(sk);
+
+ /* Prevent spurious tcp_cwnd_restart() on first data
+ * packet.
+ */
+ tp->lsndtime = tcp_time_stamp;
+
+- tcp_init_buffer_space(sk);
++ tp->ops->init_buffer_space(sk);
+
+ if (sock_flag(sk, SOCK_KEEPOPEN))
+ inet_csk_reset_keepalive_timer(sk, keepalive_time_when(tp));
+@@ -5350,7 +5440,7 @@ static bool tcp_rcv_fastopen_synack(struct sock *sk, struct sk_buff *synack,
+ /* Get original SYNACK MSS value if user MSS sets mss_clamp */
+ tcp_clear_options(&opt);
+ opt.user_mss = opt.mss_clamp = 0;
+- tcp_parse_options(synack, &opt, 0, NULL);
++ tcp_parse_options(synack, &opt, NULL, 0, NULL);
+ mss = opt.mss_clamp;
+ }
+
+@@ -5365,7 +5455,11 @@ static bool tcp_rcv_fastopen_synack(struct sock *sk, struct sk_buff *synack,
+
+ tcp_fastopen_cache_set(sk, mss, cookie, syn_drop);
+
+- if (data) { /* Retransmit unacked data in SYN */
++	/* In the MPTCP case, we do not rely on "retransmit", but instead on
++	 * "transmit", because if the fastopen data is not acked, the retransmission
++	 * becomes the first MPTCP data (see mptcp_rcv_synsent_fastopen).
++	 */
++ if (data && !mptcp(tp)) { /* Retransmit unacked data in SYN */
+ tcp_for_write_queue_from(data, sk) {
+ if (data == tcp_send_head(sk) ||
+ __tcp_retransmit_skb(sk, data))
+@@ -5388,8 +5482,11 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
+ struct tcp_sock *tp = tcp_sk(sk);
+ struct tcp_fastopen_cookie foc = { .len = -1 };
+ int saved_clamp = tp->rx_opt.mss_clamp;
++ struct mptcp_options_received mopt;
++ mptcp_init_mp_opt(&mopt);
+
+- tcp_parse_options(skb, &tp->rx_opt, 0, &foc);
++ tcp_parse_options(skb, &tp->rx_opt,
++ mptcp(tp) ? &tp->mptcp->rx_opt : &mopt, 0, &foc);
+ if (tp->rx_opt.saw_tstamp && tp->rx_opt.rcv_tsecr)
+ tp->rx_opt.rcv_tsecr -= tp->tsoffset;
+
+@@ -5448,6 +5545,30 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
+ tcp_init_wl(tp, TCP_SKB_CB(skb)->seq);
+ tcp_ack(sk, skb, FLAG_SLOWPATH);
+
++ if (tp->request_mptcp || mptcp(tp)) {
++ int ret;
++ ret = mptcp_rcv_synsent_state_process(sk, &sk,
++ skb, &mopt);
++
++ /* May have changed if we support MPTCP */
++ tp = tcp_sk(sk);
++ icsk = inet_csk(sk);
++
++ if (ret == 1)
++ goto reset_and_undo;
++ if (ret == 2)
++ goto discard;
++ }
++
++ if (mptcp(tp) && !is_master_tp(tp)) {
++		/* Timer for repeating the ACK until an answer
++		 * arrives. Used only when establishing an additional
++		 * subflow inside an MPTCP connection.
++		 */
++ sk_reset_timer(sk, &tp->mptcp->mptcp_ack_timer,
++ jiffies + icsk->icsk_rto);
++ }
++
+ /* Ok.. it's good. Set up sequence numbers and
+ * move to established.
+ */
+@@ -5474,6 +5595,11 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
+ tp->tcp_header_len = sizeof(struct tcphdr);
+ }
+
++ if (mptcp(tp)) {
++ tp->tcp_header_len += MPTCP_SUB_LEN_DSM_ALIGN;
++ tp->advmss -= MPTCP_SUB_LEN_DSM_ALIGN;
++ }
++
+ if (tcp_is_sack(tp) && sysctl_tcp_fack)
+ tcp_enable_fack(tp);
+
+@@ -5494,9 +5620,12 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
+ tcp_rcv_fastopen_synack(sk, skb, &foc))
+ return -1;
+
+- if (sk->sk_write_pending ||
++ /* With MPTCP we cannot send data on the third ack due to the
++ * lack of option-space to combine with an MP_CAPABLE.
++ */
++ if (!mptcp(tp) && (sk->sk_write_pending ||
+ icsk->icsk_accept_queue.rskq_defer_accept ||
+- icsk->icsk_ack.pingpong) {
++ icsk->icsk_ack.pingpong)) {
+ /* Save one ACK. Data will be ready after
+ * several ticks, if write_pending is set.
+ *
+@@ -5536,6 +5665,7 @@ discard:
+ tcp_paws_reject(&tp->rx_opt, 0))
+ goto discard_and_undo;
+
++ /* TODO - check this here for MPTCP */
+ if (th->syn) {
+ /* We see SYN without ACK. It is attempt of
+ * simultaneous connect with crossed SYNs.
+@@ -5552,6 +5682,11 @@ discard:
+ tp->tcp_header_len = sizeof(struct tcphdr);
+ }
+
++ if (mptcp(tp)) {
++ tp->tcp_header_len += MPTCP_SUB_LEN_DSM_ALIGN;
++ tp->advmss -= MPTCP_SUB_LEN_DSM_ALIGN;
++ }
++
+ tp->rcv_nxt = TCP_SKB_CB(skb)->seq + 1;
+ tp->rcv_wup = TCP_SKB_CB(skb)->seq + 1;
+
+@@ -5610,6 +5745,7 @@ reset_and_undo:
+
+ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
+ const struct tcphdr *th, unsigned int len)
++ __releases(&sk->sk_lock.slock)
+ {
+ struct tcp_sock *tp = tcp_sk(sk);
+ struct inet_connection_sock *icsk = inet_csk(sk);
+@@ -5661,6 +5797,16 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
+
+ case TCP_SYN_SENT:
+ queued = tcp_rcv_synsent_state_process(sk, skb, th, len);
++ if (is_meta_sk(sk)) {
++ sk = tcp_sk(sk)->mpcb->master_sk;
++ tp = tcp_sk(sk);
++
++ /* Need to call it here, because it will announce new
++ * addresses, which can only be done after the third ack
++ * of the 3-way handshake.
++ */
++ mptcp_update_metasocket(sk, tp->meta_sk);
++ }
+ if (queued >= 0)
+ return queued;
+
+@@ -5668,6 +5814,8 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
+ tcp_urg(sk, skb, th);
+ __kfree_skb(skb);
+ tcp_data_snd_check(sk);
++ if (mptcp(tp) && is_master_tp(tp))
++ bh_unlock_sock(sk);
+ return 0;
+ }
+
+@@ -5706,11 +5854,11 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
+ synack_stamp = tp->lsndtime;
+ /* Make sure socket is routed, for correct metrics. */
+ icsk->icsk_af_ops->rebuild_header(sk);
+- tcp_init_congestion_control(sk);
++ tp->ops->init_congestion_control(sk);
+
+ tcp_mtup_init(sk);
+ tp->copied_seq = tp->rcv_nxt;
+- tcp_init_buffer_space(sk);
++ tp->ops->init_buffer_space(sk);
+ }
+ smp_mb();
+ tcp_set_state(sk, TCP_ESTABLISHED);
+@@ -5730,6 +5878,8 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
+
+ if (tp->rx_opt.tstamp_ok)
+ tp->advmss -= TCPOLEN_TSTAMP_ALIGNED;
++ if (mptcp(tp))
++ tp->advmss -= MPTCP_SUB_LEN_DSM_ALIGN;
+
+ if (req) {
+ /* Re-arm the timer because data may have been sent out.
+@@ -5751,6 +5901,12 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
+
+ tcp_initialize_rcv_mss(sk);
+ tcp_fast_path_on(tp);
++ /* Send an ACK when establishing a new
++ * MPTCP subflow, i.e. using an MP_JOIN
++ * subtype.
++ */
++ if (mptcp(tp) && !is_master_tp(tp))
++ tcp_send_ack(sk);
+ break;
+
+ case TCP_FIN_WAIT1: {
+@@ -5802,7 +5958,8 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
+ tmo = tcp_fin_time(sk);
+ if (tmo > TCP_TIMEWAIT_LEN) {
+ inet_csk_reset_keepalive_timer(sk, tmo - TCP_TIMEWAIT_LEN);
+- } else if (th->fin || sock_owned_by_user(sk)) {
++ } else if (th->fin || mptcp_is_data_fin(skb) ||
++ sock_owned_by_user(sk)) {
+ /* Bad case. We could lose such FIN otherwise.
+ * It is not a big problem, but it looks confusing
+ * and not so rare event. We still can lose it now,
+@@ -5811,7 +5968,7 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
+ */
+ inet_csk_reset_keepalive_timer(sk, tmo);
+ } else {
+- tcp_time_wait(sk, TCP_FIN_WAIT2, tmo);
++ tp->ops->time_wait(sk, TCP_FIN_WAIT2, tmo);
+ goto discard;
+ }
+ break;
+@@ -5819,7 +5976,7 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
+
+ case TCP_CLOSING:
+ if (tp->snd_una == tp->write_seq) {
+- tcp_time_wait(sk, TCP_TIME_WAIT, 0);
++ tp->ops->time_wait(sk, TCP_TIME_WAIT, 0);
+ goto discard;
+ }
+ break;
+@@ -5831,6 +5988,9 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
+ goto discard;
+ }
+ break;
++ case TCP_CLOSE:
++ if (tp->mp_killed)
++ goto discard;
+ }
+
+ /* step 6: check the URG bit */
+@@ -5851,7 +6011,11 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
+ */
+ if (sk->sk_shutdown & RCV_SHUTDOWN) {
+ if (TCP_SKB_CB(skb)->end_seq != TCP_SKB_CB(skb)->seq &&
+- after(TCP_SKB_CB(skb)->end_seq - th->fin, tp->rcv_nxt)) {
++ after(TCP_SKB_CB(skb)->end_seq - th->fin, tp->rcv_nxt) &&
++ !mptcp(tp)) {
++			/* In the MPTCP case, the reset is handled by
++			 * mptcp_rcv_state_process().
++			 */
+ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPABORTONDATA);
+ tcp_reset(sk);
+ return 1;
+@@ -5877,3 +6041,154 @@ discard:
+ return 0;
+ }
+ EXPORT_SYMBOL(tcp_rcv_state_process);
++
++static inline void pr_drop_req(struct request_sock *req, __u16 port, int family)
++{
++ struct inet_request_sock *ireq = inet_rsk(req);
++
++ if (family == AF_INET)
++ LIMIT_NETDEBUG(KERN_DEBUG pr_fmt("drop open request from %pI4/%u\n"),
++ &ireq->ir_rmt_addr, port);
++#if IS_ENABLED(CONFIG_IPV6)
++ else if (family == AF_INET6)
++ LIMIT_NETDEBUG(KERN_DEBUG pr_fmt("drop open request from %pI6/%u\n"),
++ &ireq->ir_v6_rmt_addr, port);
++#endif
++}
++
++int tcp_conn_request(struct request_sock_ops *rsk_ops,
++ const struct tcp_request_sock_ops *af_ops,
++ struct sock *sk, struct sk_buff *skb)
++{
++ struct tcp_options_received tmp_opt;
++ struct request_sock *req;
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct dst_entry *dst = NULL;
++ __u32 isn = TCP_SKB_CB(skb)->when;
++ bool want_cookie = false, fastopen;
++ struct flowi fl;
++ struct tcp_fastopen_cookie foc = { .len = -1 };
++ int err;
++
++
++ /* TW buckets are converted to open requests without
++ * limitations, they conserve resources and peer is
++ * evidently real one.
++ */
++ if ((sysctl_tcp_syncookies == 2 ||
++ inet_csk_reqsk_queue_is_full(sk)) && !isn) {
++ want_cookie = tcp_syn_flood_action(sk, skb, rsk_ops->slab_name);
++ if (!want_cookie)
++ goto drop;
++ }
++
++
++ /* Accept backlog is full. If we have already queued enough
++ * of warm entries in syn queue, drop request. It is better than
++ * clogging syn queue with openreqs with exponentially increasing
++ * timeout.
++ */
++ if (sk_acceptq_is_full(sk) && inet_csk_reqsk_queue_young(sk) > 1) {
++ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENOVERFLOWS);
++ goto drop;
++ }
++
++ req = inet_reqsk_alloc(rsk_ops);
++ if (!req)
++ goto drop;
++
++ tcp_rsk(req)->af_specific = af_ops;
++
++ tcp_clear_options(&tmp_opt);
++ tmp_opt.mss_clamp = af_ops->mss_clamp;
++ tmp_opt.user_mss = tp->rx_opt.user_mss;
++ tcp_parse_options(skb, &tmp_opt, NULL, 0, want_cookie ? NULL : &foc);
++
++ if (want_cookie && !tmp_opt.saw_tstamp)
++ tcp_clear_options(&tmp_opt);
++
++ tmp_opt.tstamp_ok = tmp_opt.saw_tstamp;
++ tcp_openreq_init(req, &tmp_opt, skb);
++
++ if (af_ops->init_req(req, sk, skb))
++ goto drop_and_free;
++
++ if (security_inet_conn_request(sk, skb, req))
++ goto drop_and_free;
++
++ if (!want_cookie || tmp_opt.tstamp_ok)
++ TCP_ECN_create_request(req, skb, sock_net(sk));
++
++ if (want_cookie) {
++ isn = cookie_init_sequence(af_ops, sk, skb, &req->mss);
++ req->cookie_ts = tmp_opt.tstamp_ok;
++ } else if (!isn) {
++ /* VJ's idea. We save last timestamp seen
++ * from the destination in peer table, when entering
++ * state TIME-WAIT, and check against it before
++ * accepting new connection request.
++ *
++ * If "isn" is not zero, this request hit alive
++ * timewait bucket, so that all the necessary checks
++ * are made in the function processing timewait state.
++ */
++ if (tmp_opt.saw_tstamp && tcp_death_row.sysctl_tw_recycle) {
++ bool strict;
++
++ dst = af_ops->route_req(sk, &fl, req, &strict);
++ if (dst && strict &&
++ !tcp_peer_is_proven(req, dst, true)) {
++ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_PAWSPASSIVEREJECTED);
++ goto drop_and_release;
++ }
++ }
++ /* Kill the following clause, if you dislike this way. */
++ else if (!sysctl_tcp_syncookies &&
++ (sysctl_max_syn_backlog - inet_csk_reqsk_queue_len(sk) <
++ (sysctl_max_syn_backlog >> 2)) &&
++ !tcp_peer_is_proven(req, dst, false)) {
++ /* Without syncookies last quarter of
++ * backlog is filled with destinations,
++ * proven to be alive.
++ * It means that we continue to communicate
++ * to destinations, already remembered
++ * to the moment of synflood.
++ */
++ pr_drop_req(req, ntohs(tcp_hdr(skb)->source),
++ rsk_ops->family);
++ goto drop_and_release;
++ }
++
++ isn = af_ops->init_seq(skb);
++ }
++ if (!dst) {
++ dst = af_ops->route_req(sk, &fl, req, NULL);
++ if (!dst)
++ goto drop_and_free;
++ }
++
++ tcp_rsk(req)->snt_isn = isn;
++ tcp_openreq_init_rwin(req, sk, dst);
++ fastopen = !want_cookie &&
++ tcp_try_fastopen(sk, skb, req, &foc, dst);
++ err = af_ops->send_synack(sk, dst, &fl, req,
++ skb_get_queue_mapping(skb), &foc);
++ if (!fastopen) {
++ if (err || want_cookie)
++ goto drop_and_free;
++
++ tcp_rsk(req)->listener = NULL;
++ af_ops->queue_hash_add(sk, req, TCP_TIMEOUT_INIT);
++ }
++
++ return 0;
++
++drop_and_release:
++ dst_release(dst);
++drop_and_free:
++ reqsk_free(req);
++drop:
++ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENDROPS);
++ return 0;
++}
++EXPORT_SYMBOL(tcp_conn_request);
+diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
+index 77cccda1ad0c..c77017f600f1 100644
+--- a/net/ipv4/tcp_ipv4.c
++++ b/net/ipv4/tcp_ipv4.c
+@@ -67,6 +67,8 @@
+ #include <net/icmp.h>
+ #include <net/inet_hashtables.h>
+ #include <net/tcp.h>
++#include <net/mptcp.h>
++#include <net/mptcp_v4.h>
+ #include <net/transp_v6.h>
+ #include <net/ipv6.h>
+ #include <net/inet_common.h>
+@@ -99,7 +101,7 @@ static int tcp_v4_md5_hash_hdr(char *md5_hash, const struct tcp_md5sig_key *key,
+ struct inet_hashinfo tcp_hashinfo;
+ EXPORT_SYMBOL(tcp_hashinfo);
+
+-static inline __u32 tcp_v4_init_sequence(const struct sk_buff *skb)
++__u32 tcp_v4_init_sequence(const struct sk_buff *skb)
+ {
+ return secure_tcp_sequence_number(ip_hdr(skb)->daddr,
+ ip_hdr(skb)->saddr,
+@@ -334,7 +336,7 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
+ struct inet_sock *inet;
+ const int type = icmp_hdr(icmp_skb)->type;
+ const int code = icmp_hdr(icmp_skb)->code;
+- struct sock *sk;
++ struct sock *sk, *meta_sk;
+ struct sk_buff *skb;
+ struct request_sock *fastopen;
+ __u32 seq, snd_una;
+@@ -358,13 +360,19 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
+ return;
+ }
+
+- bh_lock_sock(sk);
++ tp = tcp_sk(sk);
++ if (mptcp(tp))
++ meta_sk = mptcp_meta_sk(sk);
++ else
++ meta_sk = sk;
++
++ bh_lock_sock(meta_sk);
+ /* If too many ICMPs get dropped on busy
+ * servers this needs to be solved differently.
+ * We do take care of PMTU discovery (RFC1191) special case :
+ * we can receive locally generated ICMP messages while socket is held.
+ */
+- if (sock_owned_by_user(sk)) {
++ if (sock_owned_by_user(meta_sk)) {
+ if (!(type == ICMP_DEST_UNREACH && code == ICMP_FRAG_NEEDED))
+ NET_INC_STATS_BH(net, LINUX_MIB_LOCKDROPPEDICMPS);
+ }
+@@ -377,7 +385,6 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
+ }
+
+ icsk = inet_csk(sk);
+- tp = tcp_sk(sk);
+ seq = ntohl(th->seq);
+ /* XXX (TFO) - tp->snd_una should be ISN (tcp_create_openreq_child() */
+ fastopen = tp->fastopen_rsk;
+@@ -411,11 +418,13 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
+ goto out;
+
+ tp->mtu_info = info;
+- if (!sock_owned_by_user(sk)) {
++ if (!sock_owned_by_user(meta_sk)) {
+ tcp_v4_mtu_reduced(sk);
+ } else {
+ if (!test_and_set_bit(TCP_MTU_REDUCED_DEFERRED, &tp->tsq_flags))
+ sock_hold(sk);
++ if (mptcp(tp))
++ mptcp_tsq_flags(sk);
+ }
+ goto out;
+ }
+@@ -429,7 +438,7 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
+ !icsk->icsk_backoff || fastopen)
+ break;
+
+- if (sock_owned_by_user(sk))
++ if (sock_owned_by_user(meta_sk))
+ break;
+
+ icsk->icsk_backoff--;
+@@ -463,7 +472,7 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
+ switch (sk->sk_state) {
+ struct request_sock *req, **prev;
+ case TCP_LISTEN:
+- if (sock_owned_by_user(sk))
++ if (sock_owned_by_user(meta_sk))
+ goto out;
+
+ req = inet_csk_search_req(sk, &prev, th->dest,
+@@ -499,7 +508,7 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
+ if (fastopen && fastopen->sk == NULL)
+ break;
+
+- if (!sock_owned_by_user(sk)) {
++ if (!sock_owned_by_user(meta_sk)) {
+ sk->sk_err = err;
+
+ sk->sk_error_report(sk);
+@@ -528,7 +537,7 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
+ */
+
+ inet = inet_sk(sk);
+- if (!sock_owned_by_user(sk) && inet->recverr) {
++ if (!sock_owned_by_user(meta_sk) && inet->recverr) {
+ sk->sk_err = err;
+ sk->sk_error_report(sk);
+ } else { /* Only an error on timeout */
+@@ -536,7 +545,7 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
+ }
+
+ out:
+- bh_unlock_sock(sk);
++ bh_unlock_sock(meta_sk);
+ sock_put(sk);
+ }
+
+@@ -578,7 +587,7 @@ EXPORT_SYMBOL(tcp_v4_send_check);
+ * Exception: precedence violation. We do not implement it in any case.
+ */
+
+-static void tcp_v4_send_reset(struct sock *sk, struct sk_buff *skb)
++void tcp_v4_send_reset(struct sock *sk, struct sk_buff *skb)
+ {
+ const struct tcphdr *th = tcp_hdr(skb);
+ struct {
+@@ -702,10 +711,10 @@ release_sk1:
+ outside socket context is ugly, certainly. What can I do?
+ */
+
+-static void tcp_v4_send_ack(struct sk_buff *skb, u32 seq, u32 ack,
++static void tcp_v4_send_ack(struct sk_buff *skb, u32 seq, u32 ack, u32 data_ack,
+ u32 win, u32 tsval, u32 tsecr, int oif,
+ struct tcp_md5sig_key *key,
+- int reply_flags, u8 tos)
++ int reply_flags, u8 tos, int mptcp)
+ {
+ const struct tcphdr *th = tcp_hdr(skb);
+ struct {
+@@ -714,6 +723,10 @@ static void tcp_v4_send_ack(struct sk_buff *skb, u32 seq, u32 ack,
+ #ifdef CONFIG_TCP_MD5SIG
+ + (TCPOLEN_MD5SIG_ALIGNED >> 2)
+ #endif
++#ifdef CONFIG_MPTCP
++ + ((MPTCP_SUB_LEN_DSS >> 2) +
++ (MPTCP_SUB_LEN_ACK >> 2))
++#endif
+ ];
+ } rep;
+ struct ip_reply_arg arg;
+@@ -758,6 +771,21 @@ static void tcp_v4_send_ack(struct sk_buff *skb, u32 seq, u32 ack,
+ ip_hdr(skb)->daddr, &rep.th);
+ }
+ #endif
++#ifdef CONFIG_MPTCP
++ if (mptcp) {
++ int offset = (tsecr) ? 3 : 0;
++ /* Construction of 32-bit data_ack */
++ rep.opt[offset++] = htonl((TCPOPT_MPTCP << 24) |
++ ((MPTCP_SUB_LEN_DSS + MPTCP_SUB_LEN_ACK) << 16) |
++ (0x20 << 8) |
++ (0x01));
++ rep.opt[offset] = htonl(data_ack);
++
++ arg.iov[0].iov_len += MPTCP_SUB_LEN_DSS + MPTCP_SUB_LEN_ACK;
++ rep.th.doff = arg.iov[0].iov_len / 4;
++ }
++#endif /* CONFIG_MPTCP */
++
+ arg.flags = reply_flags;
+ arg.csum = csum_tcpudp_nofold(ip_hdr(skb)->daddr,
+ ip_hdr(skb)->saddr, /* XXX */
+@@ -776,36 +804,44 @@ static void tcp_v4_timewait_ack(struct sock *sk, struct sk_buff *skb)
+ {
+ struct inet_timewait_sock *tw = inet_twsk(sk);
+ struct tcp_timewait_sock *tcptw = tcp_twsk(sk);
++ u32 data_ack = 0;
++ int mptcp = 0;
++
++ if (tcptw->mptcp_tw && tcptw->mptcp_tw->meta_tw) {
++ data_ack = (u32)tcptw->mptcp_tw->rcv_nxt;
++ mptcp = 1;
++ }
+
+ tcp_v4_send_ack(skb, tcptw->tw_snd_nxt, tcptw->tw_rcv_nxt,
++ data_ack,
+ tcptw->tw_rcv_wnd >> tw->tw_rcv_wscale,
+ tcp_time_stamp + tcptw->tw_ts_offset,
+ tcptw->tw_ts_recent,
+ tw->tw_bound_dev_if,
+ tcp_twsk_md5_key(tcptw),
+ tw->tw_transparent ? IP_REPLY_ARG_NOSRCCHECK : 0,
+- tw->tw_tos
++ tw->tw_tos, mptcp
+ );
+
+ inet_twsk_put(tw);
+ }
+
+-static void tcp_v4_reqsk_send_ack(struct sock *sk, struct sk_buff *skb,
+- struct request_sock *req)
++void tcp_v4_reqsk_send_ack(struct sock *sk, struct sk_buff *skb,
++ struct request_sock *req)
+ {
+ /* sk->sk_state == TCP_LISTEN -> for regular TCP_SYN_RECV
+ * sk->sk_state == TCP_SYN_RECV -> for Fast Open.
+ */
+ tcp_v4_send_ack(skb, (sk->sk_state == TCP_LISTEN) ?
+ tcp_rsk(req)->snt_isn + 1 : tcp_sk(sk)->snd_nxt,
+- tcp_rsk(req)->rcv_nxt, req->rcv_wnd,
++ tcp_rsk(req)->rcv_nxt, 0, req->rcv_wnd,
+ tcp_time_stamp,
+ req->ts_recent,
+ 0,
+ tcp_md5_do_lookup(sk, (union tcp_md5_addr *)&ip_hdr(skb)->daddr,
+ AF_INET),
+ inet_rsk(req)->no_srccheck ? IP_REPLY_ARG_NOSRCCHECK : 0,
+- ip_hdr(skb)->tos);
++ ip_hdr(skb)->tos, 0);
+ }
+
+ /*
+@@ -813,10 +849,11 @@ static void tcp_v4_reqsk_send_ack(struct sock *sk, struct sk_buff *skb,
+ * This still operates on a request_sock only, not on a big
+ * socket.
+ */
+-static int tcp_v4_send_synack(struct sock *sk, struct dst_entry *dst,
+- struct request_sock *req,
+- u16 queue_mapping,
+- struct tcp_fastopen_cookie *foc)
++int tcp_v4_send_synack(struct sock *sk, struct dst_entry *dst,
++ struct flowi *fl,
++ struct request_sock *req,
++ u16 queue_mapping,
++ struct tcp_fastopen_cookie *foc)
+ {
+ const struct inet_request_sock *ireq = inet_rsk(req);
+ struct flowi4 fl4;
+@@ -844,21 +881,10 @@ static int tcp_v4_send_synack(struct sock *sk, struct dst_entry *dst,
+ return err;
+ }
+
+-static int tcp_v4_rtx_synack(struct sock *sk, struct request_sock *req)
+-{
+- int res = tcp_v4_send_synack(sk, NULL, req, 0, NULL);
+-
+- if (!res) {
+- TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_RETRANSSEGS);
+- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPSYNRETRANS);
+- }
+- return res;
+-}
+-
+ /*
+ * IPv4 request_sock destructor.
+ */
+-static void tcp_v4_reqsk_destructor(struct request_sock *req)
++void tcp_v4_reqsk_destructor(struct request_sock *req)
+ {
+ kfree(inet_rsk(req)->opt);
+ }
+@@ -896,7 +922,7 @@ EXPORT_SYMBOL(tcp_syn_flood_action);
+ /*
+ * Save and compile IPv4 options into the request_sock if needed.
+ */
+-static struct ip_options_rcu *tcp_v4_save_options(struct sk_buff *skb)
++struct ip_options_rcu *tcp_v4_save_options(struct sk_buff *skb)
+ {
+ const struct ip_options *opt = &(IPCB(skb)->opt);
+ struct ip_options_rcu *dopt = NULL;
+@@ -1237,161 +1263,71 @@ static bool tcp_v4_inbound_md5_hash(struct sock *sk, const struct sk_buff *skb)
+
+ #endif
+
++static int tcp_v4_init_req(struct request_sock *req, struct sock *sk,
++ struct sk_buff *skb)
++{
++ struct inet_request_sock *ireq = inet_rsk(req);
++
++ ireq->ir_loc_addr = ip_hdr(skb)->daddr;
++ ireq->ir_rmt_addr = ip_hdr(skb)->saddr;
++ ireq->no_srccheck = inet_sk(sk)->transparent;
++ ireq->opt = tcp_v4_save_options(skb);
++ ireq->ir_mark = inet_request_mark(sk, skb);
++
++ return 0;
++}
++
++static struct dst_entry *tcp_v4_route_req(struct sock *sk, struct flowi *fl,
++ const struct request_sock *req,
++ bool *strict)
++{
++ struct dst_entry *dst = inet_csk_route_req(sk, &fl->u.ip4, req);
++
++ if (strict) {
++ if (fl->u.ip4.daddr == inet_rsk(req)->ir_rmt_addr)
++ *strict = true;
++ else
++ *strict = false;
++ }
++
++ return dst;
++}
++
+ struct request_sock_ops tcp_request_sock_ops __read_mostly = {
+ .family = PF_INET,
+ .obj_size = sizeof(struct tcp_request_sock),
+- .rtx_syn_ack = tcp_v4_rtx_synack,
++ .rtx_syn_ack = tcp_rtx_synack,
+ .send_ack = tcp_v4_reqsk_send_ack,
+ .destructor = tcp_v4_reqsk_destructor,
+ .send_reset = tcp_v4_send_reset,
+ .syn_ack_timeout = tcp_syn_ack_timeout,
+ };
+
++const struct tcp_request_sock_ops tcp_request_sock_ipv4_ops = {
++ .mss_clamp = TCP_MSS_DEFAULT,
+ #ifdef CONFIG_TCP_MD5SIG
+-static const struct tcp_request_sock_ops tcp_request_sock_ipv4_ops = {
+ .md5_lookup = tcp_v4_reqsk_md5_lookup,
+ .calc_md5_hash = tcp_v4_md5_hash_skb,
+-};
+ #endif
++ .init_req = tcp_v4_init_req,
++#ifdef CONFIG_SYN_COOKIES
++ .cookie_init_seq = cookie_v4_init_sequence,
++#endif
++ .route_req = tcp_v4_route_req,
++ .init_seq = tcp_v4_init_sequence,
++ .send_synack = tcp_v4_send_synack,
++ .queue_hash_add = inet_csk_reqsk_queue_hash_add,
++};
+
+ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
+ {
+- struct tcp_options_received tmp_opt;
+- struct request_sock *req;
+- struct inet_request_sock *ireq;
+- struct tcp_sock *tp = tcp_sk(sk);
+- struct dst_entry *dst = NULL;
+- __be32 saddr = ip_hdr(skb)->saddr;
+- __be32 daddr = ip_hdr(skb)->daddr;
+- __u32 isn = TCP_SKB_CB(skb)->when;
+- bool want_cookie = false, fastopen;
+- struct flowi4 fl4;
+- struct tcp_fastopen_cookie foc = { .len = -1 };
+- int err;
+-
+ /* Never answer to SYNs send to broadcast or multicast */
+ if (skb_rtable(skb)->rt_flags & (RTCF_BROADCAST | RTCF_MULTICAST))
+ goto drop;
+
+- /* TW buckets are converted to open requests without
+- * limitations, they conserve resources and peer is
+- * evidently real one.
+- */
+- if ((sysctl_tcp_syncookies == 2 ||
+- inet_csk_reqsk_queue_is_full(sk)) && !isn) {
+- want_cookie = tcp_syn_flood_action(sk, skb, "TCP");
+- if (!want_cookie)
+- goto drop;
+- }
+-
+- /* Accept backlog is full. If we have already queued enough
+- * of warm entries in syn queue, drop request. It is better than
+- * clogging syn queue with openreqs with exponentially increasing
+- * timeout.
+- */
+- if (sk_acceptq_is_full(sk) && inet_csk_reqsk_queue_young(sk) > 1) {
+- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENOVERFLOWS);
+- goto drop;
+- }
+-
+- req = inet_reqsk_alloc(&tcp_request_sock_ops);
+- if (!req)
+- goto drop;
+-
+-#ifdef CONFIG_TCP_MD5SIG
+- tcp_rsk(req)->af_specific = &tcp_request_sock_ipv4_ops;
+-#endif
+-
+- tcp_clear_options(&tmp_opt);
+- tmp_opt.mss_clamp = TCP_MSS_DEFAULT;
+- tmp_opt.user_mss = tp->rx_opt.user_mss;
+- tcp_parse_options(skb, &tmp_opt, 0, want_cookie ? NULL : &foc);
+-
+- if (want_cookie && !tmp_opt.saw_tstamp)
+- tcp_clear_options(&tmp_opt);
+-
+- tmp_opt.tstamp_ok = tmp_opt.saw_tstamp;
+- tcp_openreq_init(req, &tmp_opt, skb);
++ return tcp_conn_request(&tcp_request_sock_ops,
++ &tcp_request_sock_ipv4_ops, sk, skb);
+
+- ireq = inet_rsk(req);
+- ireq->ir_loc_addr = daddr;
+- ireq->ir_rmt_addr = saddr;
+- ireq->no_srccheck = inet_sk(sk)->transparent;
+- ireq->opt = tcp_v4_save_options(skb);
+- ireq->ir_mark = inet_request_mark(sk, skb);
+-
+- if (security_inet_conn_request(sk, skb, req))
+- goto drop_and_free;
+-
+- if (!want_cookie || tmp_opt.tstamp_ok)
+- TCP_ECN_create_request(req, skb, sock_net(sk));
+-
+- if (want_cookie) {
+- isn = cookie_v4_init_sequence(sk, skb, &req->mss);
+- req->cookie_ts = tmp_opt.tstamp_ok;
+- } else if (!isn) {
+- /* VJ's idea. We save last timestamp seen
+- * from the destination in peer table, when entering
+- * state TIME-WAIT, and check against it before
+- * accepting new connection request.
+- *
+- * If "isn" is not zero, this request hit alive
+- * timewait bucket, so that all the necessary checks
+- * are made in the function processing timewait state.
+- */
+- if (tmp_opt.saw_tstamp &&
+- tcp_death_row.sysctl_tw_recycle &&
+- (dst = inet_csk_route_req(sk, &fl4, req)) != NULL &&
+- fl4.daddr == saddr) {
+- if (!tcp_peer_is_proven(req, dst, true)) {
+- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_PAWSPASSIVEREJECTED);
+- goto drop_and_release;
+- }
+- }
+- /* Kill the following clause, if you dislike this way. */
+- else if (!sysctl_tcp_syncookies &&
+- (sysctl_max_syn_backlog - inet_csk_reqsk_queue_len(sk) <
+- (sysctl_max_syn_backlog >> 2)) &&
+- !tcp_peer_is_proven(req, dst, false)) {
+- /* Without syncookies last quarter of
+- * backlog is filled with destinations,
+- * proven to be alive.
+- * It means that we continue to communicate
+- * to destinations, already remembered
+- * to the moment of synflood.
+- */
+- LIMIT_NETDEBUG(KERN_DEBUG pr_fmt("drop open request from %pI4/%u\n"),
+- &saddr, ntohs(tcp_hdr(skb)->source));
+- goto drop_and_release;
+- }
+-
+- isn = tcp_v4_init_sequence(skb);
+- }
+- if (!dst && (dst = inet_csk_route_req(sk, &fl4, req)) == NULL)
+- goto drop_and_free;
+-
+- tcp_rsk(req)->snt_isn = isn;
+- tcp_rsk(req)->snt_synack = tcp_time_stamp;
+- tcp_openreq_init_rwin(req, sk, dst);
+- fastopen = !want_cookie &&
+- tcp_try_fastopen(sk, skb, req, &foc, dst);
+- err = tcp_v4_send_synack(sk, dst, req,
+- skb_get_queue_mapping(skb), &foc);
+- if (!fastopen) {
+- if (err || want_cookie)
+- goto drop_and_free;
+-
+- tcp_rsk(req)->snt_synack = tcp_time_stamp;
+- tcp_rsk(req)->listener = NULL;
+- inet_csk_reqsk_queue_hash_add(sk, req, TCP_TIMEOUT_INIT);
+- }
+-
+- return 0;
+-
+-drop_and_release:
+- dst_release(dst);
+-drop_and_free:
+- reqsk_free(req);
+ drop:
+ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENDROPS);
+ return 0;
+@@ -1497,7 +1433,7 @@ put_and_exit:
+ }
+ EXPORT_SYMBOL(tcp_v4_syn_recv_sock);
+
+-static struct sock *tcp_v4_hnd_req(struct sock *sk, struct sk_buff *skb)
++struct sock *tcp_v4_hnd_req(struct sock *sk, struct sk_buff *skb)
+ {
+ struct tcphdr *th = tcp_hdr(skb);
+ const struct iphdr *iph = ip_hdr(skb);
+@@ -1514,8 +1450,15 @@ static struct sock *tcp_v4_hnd_req(struct sock *sk, struct sk_buff *skb)
+
+ if (nsk) {
+ if (nsk->sk_state != TCP_TIME_WAIT) {
++			/* Don't lock the meta-sk again. It was already
++			 * locked before mptcp_v4_do_rcv().
++			 */
++ if (mptcp(tcp_sk(nsk)) && !is_meta_sk(sk))
++ bh_lock_sock(mptcp_meta_sk(nsk));
+ bh_lock_sock(nsk);
++
+ return nsk;
++
+ }
+ inet_twsk_put(inet_twsk(nsk));
+ return NULL;
+@@ -1550,6 +1493,9 @@ int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
+ goto discard;
+ #endif
+
++ if (is_meta_sk(sk))
++ return mptcp_v4_do_rcv(sk, skb);
++
+ if (sk->sk_state == TCP_ESTABLISHED) { /* Fast path */
+ struct dst_entry *dst = sk->sk_rx_dst;
+
+@@ -1681,7 +1627,7 @@ bool tcp_prequeue(struct sock *sk, struct sk_buff *skb)
+ } else if (skb_queue_len(&tp->ucopy.prequeue) == 1) {
+ wake_up_interruptible_sync_poll(sk_sleep(sk),
+ POLLIN | POLLRDNORM | POLLRDBAND);
+- if (!inet_csk_ack_scheduled(sk))
++ if (!inet_csk_ack_scheduled(sk) && !mptcp(tp))
+ inet_csk_reset_xmit_timer(sk, ICSK_TIME_DACK,
+ (3 * tcp_rto_min(sk)) / 4,
+ TCP_RTO_MAX);
+@@ -1698,7 +1644,7 @@ int tcp_v4_rcv(struct sk_buff *skb)
+ {
+ const struct iphdr *iph;
+ const struct tcphdr *th;
+- struct sock *sk;
++ struct sock *sk, *meta_sk = NULL;
+ int ret;
+ struct net *net = dev_net(skb->dev);
+
+@@ -1732,18 +1678,42 @@ int tcp_v4_rcv(struct sk_buff *skb)
+ TCP_SKB_CB(skb)->end_seq = (TCP_SKB_CB(skb)->seq + th->syn + th->fin +
+ skb->len - th->doff * 4);
+ TCP_SKB_CB(skb)->ack_seq = ntohl(th->ack_seq);
++#ifdef CONFIG_MPTCP
++ TCP_SKB_CB(skb)->mptcp_flags = 0;
++ TCP_SKB_CB(skb)->dss_off = 0;
++#endif
+ TCP_SKB_CB(skb)->when = 0;
+ TCP_SKB_CB(skb)->ip_dsfield = ipv4_get_dsfield(iph);
+ TCP_SKB_CB(skb)->sacked = 0;
+
+ sk = __inet_lookup_skb(&tcp_hashinfo, skb, th->source, th->dest);
+- if (!sk)
+- goto no_tcp_socket;
+
+ process:
+- if (sk->sk_state == TCP_TIME_WAIT)
++ if (sk && sk->sk_state == TCP_TIME_WAIT)
+ goto do_time_wait;
+
++#ifdef CONFIG_MPTCP
++ if (!sk && th->syn && !th->ack) {
++ int ret = mptcp_lookup_join(skb, NULL);
++
++ if (ret < 0) {
++ tcp_v4_send_reset(NULL, skb);
++ goto discard_it;
++ } else if (ret > 0) {
++ return 0;
++ }
++ }
++
++ /* Is there a pending request sock for this segment ? */
++ if ((!sk || sk->sk_state == TCP_LISTEN) && mptcp_check_req(skb, net)) {
++ if (sk)
++ sock_put(sk);
++ return 0;
++ }
++#endif
++ if (!sk)
++ goto no_tcp_socket;
++
+ if (unlikely(iph->ttl < inet_sk(sk)->min_ttl)) {
+ NET_INC_STATS_BH(net, LINUX_MIB_TCPMINTTLDROP);
+ goto discard_and_relse;
+@@ -1759,11 +1729,21 @@ process:
+ sk_mark_napi_id(sk, skb);
+ skb->dev = NULL;
+
+- bh_lock_sock_nested(sk);
++ if (mptcp(tcp_sk(sk))) {
++ meta_sk = mptcp_meta_sk(sk);
++
++ bh_lock_sock_nested(meta_sk);
++ if (sock_owned_by_user(meta_sk))
++ skb->sk = sk;
++ } else {
++ meta_sk = sk;
++ bh_lock_sock_nested(sk);
++ }
++
+ ret = 0;
+- if (!sock_owned_by_user(sk)) {
++ if (!sock_owned_by_user(meta_sk)) {
+ #ifdef CONFIG_NET_DMA
+- struct tcp_sock *tp = tcp_sk(sk);
++ struct tcp_sock *tp = tcp_sk(meta_sk);
+ if (!tp->ucopy.dma_chan && tp->ucopy.pinned_list)
+ tp->ucopy.dma_chan = net_dma_find_channel();
+ if (tp->ucopy.dma_chan)
+@@ -1771,16 +1751,16 @@ process:
+ else
+ #endif
+ {
+- if (!tcp_prequeue(sk, skb))
++ if (!tcp_prequeue(meta_sk, skb))
+ ret = tcp_v4_do_rcv(sk, skb);
+ }
+- } else if (unlikely(sk_add_backlog(sk, skb,
+- sk->sk_rcvbuf + sk->sk_sndbuf))) {
+- bh_unlock_sock(sk);
++ } else if (unlikely(sk_add_backlog(meta_sk, skb,
++ meta_sk->sk_rcvbuf + meta_sk->sk_sndbuf))) {
++ bh_unlock_sock(meta_sk);
+ NET_INC_STATS_BH(net, LINUX_MIB_TCPBACKLOGDROP);
+ goto discard_and_relse;
+ }
+- bh_unlock_sock(sk);
++ bh_unlock_sock(meta_sk);
+
+ sock_put(sk);
+
+@@ -1835,6 +1815,18 @@ do_time_wait:
+ sk = sk2;
+ goto process;
+ }
++#ifdef CONFIG_MPTCP
++ if (th->syn && !th->ack) {
++ int ret = mptcp_lookup_join(skb, inet_twsk(sk));
++
++ if (ret < 0) {
++ tcp_v4_send_reset(NULL, skb);
++ goto discard_it;
++ } else if (ret > 0) {
++ return 0;
++ }
++ }
++#endif
+ /* Fall through to ACK */
+ }
+ case TCP_TW_ACK:
+@@ -1900,7 +1892,12 @@ static int tcp_v4_init_sock(struct sock *sk)
+
+ tcp_init_sock(sk);
+
+- icsk->icsk_af_ops = &ipv4_specific;
++#ifdef CONFIG_MPTCP
++ if (is_mptcp_enabled(sk))
++ icsk->icsk_af_ops = &mptcp_v4_specific;
++ else
++#endif
++ icsk->icsk_af_ops = &ipv4_specific;
+
+ #ifdef CONFIG_TCP_MD5SIG
+ tcp_sk(sk)->af_specific = &tcp_sock_ipv4_specific;
+@@ -1917,6 +1914,11 @@ void tcp_v4_destroy_sock(struct sock *sk)
+
+ tcp_cleanup_congestion_control(sk);
+
++ if (mptcp(tp))
++ mptcp_destroy_sock(sk);
++ if (tp->inside_tk_table)
++ mptcp_hash_remove(tp);
++
+ /* Cleanup up the write buffer. */
+ tcp_write_queue_purge(sk);
+
+@@ -2481,6 +2483,19 @@ void tcp4_proc_exit(void)
+ }
+ #endif /* CONFIG_PROC_FS */
+
++#ifdef CONFIG_MPTCP
++static void tcp_v4_clear_sk(struct sock *sk, int size)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++
++	/* We do not want to clear the tk_table field, because of RCU lookups. */
++ sk_prot_clear_nulls(sk, offsetof(struct tcp_sock, tk_table));
++
++ size -= offsetof(struct tcp_sock, tk_table) + sizeof(tp->tk_table);
++ memset((char *)&tp->tk_table + sizeof(tp->tk_table), 0, size);
++}
++#endif
++
+ struct proto tcp_prot = {
+ .name = "TCP",
+ .owner = THIS_MODULE,
+@@ -2528,6 +2543,9 @@ struct proto tcp_prot = {
+ .destroy_cgroup = tcp_destroy_cgroup,
+ .proto_cgroup = tcp_proto_cgroup,
+ #endif
++#ifdef CONFIG_MPTCP
++ .clear_sk = tcp_v4_clear_sk,
++#endif
+ };
+ EXPORT_SYMBOL(tcp_prot);
+
+diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
+index e68e0d4af6c9..ae6946857dff 100644
+--- a/net/ipv4/tcp_minisocks.c
++++ b/net/ipv4/tcp_minisocks.c
+@@ -18,11 +18,13 @@
+ * Jorge Cwik, <jorge@laser.satlink.net>
+ */
+
++#include <linux/kconfig.h>
+ #include <linux/mm.h>
+ #include <linux/module.h>
+ #include <linux/slab.h>
+ #include <linux/sysctl.h>
+ #include <linux/workqueue.h>
++#include <net/mptcp.h>
+ #include <net/tcp.h>
+ #include <net/inet_common.h>
+ #include <net/xfrm.h>
+@@ -95,10 +97,13 @@ tcp_timewait_state_process(struct inet_timewait_sock *tw, struct sk_buff *skb,
+ struct tcp_options_received tmp_opt;
+ struct tcp_timewait_sock *tcptw = tcp_twsk((struct sock *)tw);
+ bool paws_reject = false;
++ struct mptcp_options_received mopt;
+
+ tmp_opt.saw_tstamp = 0;
+ if (th->doff > (sizeof(*th) >> 2) && tcptw->tw_ts_recent_stamp) {
+- tcp_parse_options(skb, &tmp_opt, 0, NULL);
++ mptcp_init_mp_opt(&mopt);
++
++ tcp_parse_options(skb, &tmp_opt, &mopt, 0, NULL);
+
+ if (tmp_opt.saw_tstamp) {
+ tmp_opt.rcv_tsecr -= tcptw->tw_ts_offset;
+@@ -106,6 +111,11 @@ tcp_timewait_state_process(struct inet_timewait_sock *tw, struct sk_buff *skb,
+ tmp_opt.ts_recent_stamp = tcptw->tw_ts_recent_stamp;
+ paws_reject = tcp_paws_reject(&tmp_opt, th->rst);
+ }
++
++ if (unlikely(mopt.mp_fclose) && tcptw->mptcp_tw) {
++ if (mopt.mptcp_key == tcptw->mptcp_tw->loc_key)
++ goto kill_with_rst;
++ }
+ }
+
+ if (tw->tw_substate == TCP_FIN_WAIT2) {
+@@ -128,6 +138,16 @@ tcp_timewait_state_process(struct inet_timewait_sock *tw, struct sk_buff *skb,
+ if (!th->ack ||
+ !after(TCP_SKB_CB(skb)->end_seq, tcptw->tw_rcv_nxt) ||
+ TCP_SKB_CB(skb)->end_seq == TCP_SKB_CB(skb)->seq) {
++ /* If mptcp_is_data_fin() returns true, we are sure that
++ * mopt has been initialized - otherwise it would not
++ * be a DATA_FIN.
++ */
++ if (tcptw->mptcp_tw && tcptw->mptcp_tw->meta_tw &&
++ mptcp_is_data_fin(skb) &&
++ TCP_SKB_CB(skb)->seq == tcptw->tw_rcv_nxt &&
++ mopt.data_seq + 1 == (u32)tcptw->mptcp_tw->rcv_nxt)
++ return TCP_TW_ACK;
++
+ inet_twsk_put(tw);
+ return TCP_TW_SUCCESS;
+ }
+@@ -290,6 +310,15 @@ void tcp_time_wait(struct sock *sk, int state, int timeo)
+ tcptw->tw_ts_recent_stamp = tp->rx_opt.ts_recent_stamp;
+ tcptw->tw_ts_offset = tp->tsoffset;
+
++ if (mptcp(tp)) {
++ if (mptcp_init_tw_sock(sk, tcptw)) {
++ inet_twsk_free(tw);
++ goto exit;
++ }
++ } else {
++ tcptw->mptcp_tw = NULL;
++ }
++
+ #if IS_ENABLED(CONFIG_IPV6)
+ if (tw->tw_family == PF_INET6) {
+ struct ipv6_pinfo *np = inet6_sk(sk);
+@@ -347,15 +376,18 @@ void tcp_time_wait(struct sock *sk, int state, int timeo)
+ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPTIMEWAITOVERFLOW);
+ }
+
++exit:
+ tcp_update_metrics(sk);
+ tcp_done(sk);
+ }
+
+ void tcp_twsk_destructor(struct sock *sk)
+ {
+-#ifdef CONFIG_TCP_MD5SIG
+ struct tcp_timewait_sock *twsk = tcp_twsk(sk);
+
++ if (twsk->mptcp_tw)
++ mptcp_twsk_destructor(twsk);
++#ifdef CONFIG_TCP_MD5SIG
+ if (twsk->tw_md5_key)
+ kfree_rcu(twsk->tw_md5_key, rcu);
+ #endif
+@@ -382,13 +414,14 @@ void tcp_openreq_init_rwin(struct request_sock *req,
+ req->window_clamp = tcp_full_space(sk);
+
+ /* tcp_full_space because it is guaranteed to be the first packet */
+- tcp_select_initial_window(tcp_full_space(sk),
+- mss - (ireq->tstamp_ok ? TCPOLEN_TSTAMP_ALIGNED : 0),
++ tp->ops->select_initial_window(tcp_full_space(sk),
++ mss - (ireq->tstamp_ok ? TCPOLEN_TSTAMP_ALIGNED : 0) -
++ (ireq->saw_mpc ? MPTCP_SUB_LEN_DSM_ALIGN : 0),
+ &req->rcv_wnd,
+ &req->window_clamp,
+ ireq->wscale_ok,
+ &rcv_wscale,
+- dst_metric(dst, RTAX_INITRWND));
++ dst_metric(dst, RTAX_INITRWND), sk);
+ ireq->rcv_wscale = rcv_wscale;
+ }
+ EXPORT_SYMBOL(tcp_openreq_init_rwin);
+@@ -499,6 +532,8 @@ struct sock *tcp_create_openreq_child(struct sock *sk, struct request_sock *req,
+ newtp->rx_opt.ts_recent_stamp = 0;
+ newtp->tcp_header_len = sizeof(struct tcphdr);
+ }
++ if (ireq->saw_mpc)
++ newtp->tcp_header_len += MPTCP_SUB_LEN_DSM_ALIGN;
+ newtp->tsoffset = 0;
+ #ifdef CONFIG_TCP_MD5SIG
+ newtp->md5sig_info = NULL; /*XXX*/
+@@ -535,16 +570,20 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
+ bool fastopen)
+ {
+ struct tcp_options_received tmp_opt;
++ struct mptcp_options_received mopt;
+ struct sock *child;
+ const struct tcphdr *th = tcp_hdr(skb);
+ __be32 flg = tcp_flag_word(th) & (TCP_FLAG_RST|TCP_FLAG_SYN|TCP_FLAG_ACK);
+ bool paws_reject = false;
+
+- BUG_ON(fastopen == (sk->sk_state == TCP_LISTEN));
++ BUG_ON(!mptcp(tcp_sk(sk)) && fastopen == (sk->sk_state == TCP_LISTEN));
+
+ tmp_opt.saw_tstamp = 0;
++
++ mptcp_init_mp_opt(&mopt);
++
+ if (th->doff > (sizeof(struct tcphdr)>>2)) {
+- tcp_parse_options(skb, &tmp_opt, 0, NULL);
++ tcp_parse_options(skb, &tmp_opt, &mopt, 0, NULL);
+
+ if (tmp_opt.saw_tstamp) {
+ tmp_opt.ts_recent = req->ts_recent;
+@@ -583,7 +622,14 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
+ *
+ * Reset timer after retransmitting SYNACK, similar to
+ * the idea of fast retransmit in recovery.
++ *
++ * Fall back to TCP if MP_CAPABLE is not set.
+ */
++
++ if (inet_rsk(req)->saw_mpc && !mopt.saw_mpc)
++ inet_rsk(req)->saw_mpc = false;
++
++
+ if (!inet_rtx_syn_ack(sk, req))
+ req->expires = min(TCP_TIMEOUT_INIT << req->num_timeout,
+ TCP_RTO_MAX) + jiffies;
+@@ -718,9 +764,21 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
+ * socket is created, wait for troubles.
+ */
+ child = inet_csk(sk)->icsk_af_ops->syn_recv_sock(sk, skb, req, NULL);
++
+ if (child == NULL)
+ goto listen_overflow;
+
++ if (!is_meta_sk(sk)) {
++ int ret = mptcp_check_req_master(sk, child, req, prev);
++ if (ret < 0)
++ goto listen_overflow;
++
++ /* MPTCP-supported */
++ if (!ret)
++ return tcp_sk(child)->mpcb->master_sk;
++ } else {
++ return mptcp_check_req_child(sk, child, req, prev, &mopt);
++ }
+ inet_csk_reqsk_queue_unlink(sk, req, prev);
+ inet_csk_reqsk_queue_removed(sk, req);
+
+@@ -746,7 +804,17 @@ embryonic_reset:
+ tcp_reset(sk);
+ }
+ if (!fastopen) {
+- inet_csk_reqsk_queue_drop(sk, req, prev);
++ if (is_meta_sk(sk)) {
++ /* We want to avoid stopping the keepalive-timer and so
++ * avoid ending up in inet_csk_reqsk_queue_removed ...
++ */
++ inet_csk_reqsk_queue_unlink(sk, req, prev);
++ if (reqsk_queue_removed(&inet_csk(sk)->icsk_accept_queue, req) == 0)
++ mptcp_delete_synack_timer(sk);
++ reqsk_free(req);
++ } else {
++ inet_csk_reqsk_queue_drop(sk, req, prev);
++ }
+ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_EMBRYONICRSTS);
+ }
+ return NULL;
+@@ -770,8 +838,9 @@ int tcp_child_process(struct sock *parent, struct sock *child,
+ {
+ int ret = 0;
+ int state = child->sk_state;
++ struct sock *meta_sk = mptcp(tcp_sk(child)) ? mptcp_meta_sk(child) : child;
+
+- if (!sock_owned_by_user(child)) {
++ if (!sock_owned_by_user(meta_sk)) {
+ ret = tcp_rcv_state_process(child, skb, tcp_hdr(skb),
+ skb->len);
+ /* Wakeup parent, send SIGIO */
+@@ -782,10 +851,14 @@ int tcp_child_process(struct sock *parent, struct sock *child,
+ * in main socket hash table and lock on listening
+ * socket does not protect us more.
+ */
+- __sk_add_backlog(child, skb);
++ if (mptcp(tcp_sk(child)))
++ skb->sk = child;
++ __sk_add_backlog(meta_sk, skb);
+ }
+
+- bh_unlock_sock(child);
++ if (mptcp(tcp_sk(child)))
++ bh_unlock_sock(child);
++ bh_unlock_sock(meta_sk);
+ sock_put(child);
+ return ret;
+ }
+diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
+index 179b51e6bda3..efd31b6c5784 100644
+--- a/net/ipv4/tcp_output.c
++++ b/net/ipv4/tcp_output.c
+@@ -36,6 +36,12 @@
+
+ #define pr_fmt(fmt) "TCP: " fmt
+
++#include <net/mptcp.h>
++#include <net/mptcp_v4.h>
++#if IS_ENABLED(CONFIG_IPV6)
++#include <net/mptcp_v6.h>
++#endif
++#include <net/ipv6.h>
+ #include <net/tcp.h>
+
+ #include <linux/compiler.h>
+@@ -68,11 +74,8 @@ int sysctl_tcp_slow_start_after_idle __read_mostly = 1;
+ unsigned int sysctl_tcp_notsent_lowat __read_mostly = UINT_MAX;
+ EXPORT_SYMBOL(sysctl_tcp_notsent_lowat);
+
+-static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
+- int push_one, gfp_t gfp);
+-
+ /* Account for new data that has been sent to the network. */
+-static void tcp_event_new_data_sent(struct sock *sk, const struct sk_buff *skb)
++void tcp_event_new_data_sent(struct sock *sk, const struct sk_buff *skb)
+ {
+ struct inet_connection_sock *icsk = inet_csk(sk);
+ struct tcp_sock *tp = tcp_sk(sk);
+@@ -214,7 +217,7 @@ u32 tcp_default_init_rwnd(u32 mss)
+ void tcp_select_initial_window(int __space, __u32 mss,
+ __u32 *rcv_wnd, __u32 *window_clamp,
+ int wscale_ok, __u8 *rcv_wscale,
+- __u32 init_rcv_wnd)
++ __u32 init_rcv_wnd, const struct sock *sk)
+ {
+ unsigned int space = (__space < 0 ? 0 : __space);
+
+@@ -269,12 +272,16 @@ EXPORT_SYMBOL(tcp_select_initial_window);
+ * value can be stuffed directly into th->window for an outgoing
+ * frame.
+ */
+-static u16 tcp_select_window(struct sock *sk)
++u16 tcp_select_window(struct sock *sk)
+ {
+ struct tcp_sock *tp = tcp_sk(sk);
+ u32 old_win = tp->rcv_wnd;
+- u32 cur_win = tcp_receive_window(tp);
+- u32 new_win = __tcp_select_window(sk);
++ /* The window must never shrink at the meta-level. At the subflow level
++ * we have to allow this. Otherwise we may announce a window too large
++ * for the current meta-level sk_rcvbuf.
++ */
++ u32 cur_win = tcp_receive_window(mptcp(tp) ? tcp_sk(mptcp_meta_sk(sk)) : tp);
++ u32 new_win = tp->ops->__select_window(sk);
+
+ /* Never shrink the offered window */
+ if (new_win < cur_win) {
+@@ -290,6 +297,7 @@ static u16 tcp_select_window(struct sock *sk)
+ LINUX_MIB_TCPWANTZEROWINDOWADV);
+ new_win = ALIGN(cur_win, 1 << tp->rx_opt.rcv_wscale);
+ }
++
+ tp->rcv_wnd = new_win;
+ tp->rcv_wup = tp->rcv_nxt;
+
+@@ -374,7 +382,7 @@ static inline void TCP_ECN_send(struct sock *sk, struct sk_buff *skb,
+ /* Constructs common control bits of non-data skb. If SYN/FIN is present,
+ * auto increment end seqno.
+ */
+-static void tcp_init_nondata_skb(struct sk_buff *skb, u32 seq, u8 flags)
++void tcp_init_nondata_skb(struct sk_buff *skb, u32 seq, u8 flags)
+ {
+ struct skb_shared_info *shinfo = skb_shinfo(skb);
+
+@@ -394,7 +402,7 @@ static void tcp_init_nondata_skb(struct sk_buff *skb, u32 seq, u8 flags)
+ TCP_SKB_CB(skb)->end_seq = seq;
+ }
+
+-static inline bool tcp_urg_mode(const struct tcp_sock *tp)
++bool tcp_urg_mode(const struct tcp_sock *tp)
+ {
+ return tp->snd_una != tp->snd_up;
+ }
+@@ -404,17 +412,7 @@ static inline bool tcp_urg_mode(const struct tcp_sock *tp)
+ #define OPTION_MD5 (1 << 2)
+ #define OPTION_WSCALE (1 << 3)
+ #define OPTION_FAST_OPEN_COOKIE (1 << 8)
+-
+-struct tcp_out_options {
+- u16 options; /* bit field of OPTION_* */
+- u16 mss; /* 0 to disable */
+- u8 ws; /* window scale, 0 to disable */
+- u8 num_sack_blocks; /* number of SACK blocks to include */
+- u8 hash_size; /* bytes in hash_location */
+- __u8 *hash_location; /* temporary pointer, overloaded */
+- __u32 tsval, tsecr; /* need to include OPTION_TS */
+- struct tcp_fastopen_cookie *fastopen_cookie; /* Fast open cookie */
+-};
++/* Before adding here - take a look at OPTION_MPTCP in include/net/mptcp.h */
+
+ /* Write previously computed TCP options to the packet.
+ *
+@@ -430,7 +428,7 @@ struct tcp_out_options {
+ * (but it may well be that other scenarios fail similarly).
+ */
+ static void tcp_options_write(__be32 *ptr, struct tcp_sock *tp,
+- struct tcp_out_options *opts)
++ struct tcp_out_options *opts, struct sk_buff *skb)
+ {
+ u16 options = opts->options; /* mungable copy */
+
+@@ -513,6 +511,9 @@ static void tcp_options_write(__be32 *ptr, struct tcp_sock *tp,
+ }
+ ptr += (foc->len + 3) >> 2;
+ }
++
++ if (unlikely(OPTION_MPTCP & opts->options))
++ mptcp_options_write(ptr, tp, opts, skb);
+ }
+
+ /* Compute TCP options for SYN packets. This is not the final
+@@ -564,6 +565,8 @@ static unsigned int tcp_syn_options(struct sock *sk, struct sk_buff *skb,
+ if (unlikely(!(OPTION_TS & opts->options)))
+ remaining -= TCPOLEN_SACKPERM_ALIGNED;
+ }
++ if (tp->request_mptcp || mptcp(tp))
++ mptcp_syn_options(sk, opts, &remaining);
+
+ if (fastopen && fastopen->cookie.len >= 0) {
+ u32 need = TCPOLEN_EXP_FASTOPEN_BASE + fastopen->cookie.len;
+@@ -637,6 +640,9 @@ static unsigned int tcp_synack_options(struct sock *sk,
+ }
+ }
+
++ if (ireq->saw_mpc)
++ mptcp_synack_options(req, opts, &remaining);
++
+ return MAX_TCP_OPTION_SPACE - remaining;
+ }
+
+@@ -670,16 +676,22 @@ static unsigned int tcp_established_options(struct sock *sk, struct sk_buff *skb
+ opts->tsecr = tp->rx_opt.ts_recent;
+ size += TCPOLEN_TSTAMP_ALIGNED;
+ }
++ if (mptcp(tp))
++ mptcp_established_options(sk, skb, opts, &size);
+
+ eff_sacks = tp->rx_opt.num_sacks + tp->rx_opt.dsack;
+ if (unlikely(eff_sacks)) {
+- const unsigned int remaining = MAX_TCP_OPTION_SPACE - size;
+- opts->num_sack_blocks =
+- min_t(unsigned int, eff_sacks,
+- (remaining - TCPOLEN_SACK_BASE_ALIGNED) /
+- TCPOLEN_SACK_PERBLOCK);
+- size += TCPOLEN_SACK_BASE_ALIGNED +
+- opts->num_sack_blocks * TCPOLEN_SACK_PERBLOCK;
++ const unsigned remaining = MAX_TCP_OPTION_SPACE - size;
++ if (remaining < TCPOLEN_SACK_BASE_ALIGNED)
++ opts->num_sack_blocks = 0;
++ else
++ opts->num_sack_blocks =
++ min_t(unsigned int, eff_sacks,
++ (remaining - TCPOLEN_SACK_BASE_ALIGNED) /
++ TCPOLEN_SACK_PERBLOCK);
++ if (opts->num_sack_blocks)
++ size += TCPOLEN_SACK_BASE_ALIGNED +
++ opts->num_sack_blocks * TCPOLEN_SACK_PERBLOCK;
+ }
+
+ return size;
+@@ -711,8 +723,8 @@ static void tcp_tsq_handler(struct sock *sk)
+ if ((1 << sk->sk_state) &
+ (TCPF_ESTABLISHED | TCPF_FIN_WAIT1 | TCPF_CLOSING |
+ TCPF_CLOSE_WAIT | TCPF_LAST_ACK))
+- tcp_write_xmit(sk, tcp_current_mss(sk), tcp_sk(sk)->nonagle,
+- 0, GFP_ATOMIC);
++ tcp_sk(sk)->ops->write_xmit(sk, tcp_current_mss(sk),
++ tcp_sk(sk)->nonagle, 0, GFP_ATOMIC);
+ }
+ /*
+ * One tasklet per cpu tries to send more skbs.
+@@ -727,7 +739,7 @@ static void tcp_tasklet_func(unsigned long data)
+ unsigned long flags;
+ struct list_head *q, *n;
+ struct tcp_sock *tp;
+- struct sock *sk;
++ struct sock *sk, *meta_sk;
+
+ local_irq_save(flags);
+ list_splice_init(&tsq->head, &list);
+@@ -738,15 +750,25 @@ static void tcp_tasklet_func(unsigned long data)
+ list_del(&tp->tsq_node);
+
+ sk = (struct sock *)tp;
+- bh_lock_sock(sk);
++ meta_sk = mptcp(tp) ? mptcp_meta_sk(sk) : sk;
++ bh_lock_sock(meta_sk);
+
+- if (!sock_owned_by_user(sk)) {
++ if (!sock_owned_by_user(meta_sk)) {
+ tcp_tsq_handler(sk);
++ if (mptcp(tp))
++ tcp_tsq_handler(meta_sk);
+ } else {
++ if (mptcp(tp) && sk->sk_state == TCP_CLOSE)
++ goto exit;
++
+ /* defer the work to tcp_release_cb() */
+ set_bit(TCP_TSQ_DEFERRED, &tp->tsq_flags);
++
++ if (mptcp(tp))
++ mptcp_tsq_flags(sk);
+ }
+- bh_unlock_sock(sk);
++exit:
++ bh_unlock_sock(meta_sk);
+
+ clear_bit(TSQ_QUEUED, &tp->tsq_flags);
+ sk_free(sk);
+@@ -756,7 +778,10 @@ static void tcp_tasklet_func(unsigned long data)
+ #define TCP_DEFERRED_ALL ((1UL << TCP_TSQ_DEFERRED) | \
+ (1UL << TCP_WRITE_TIMER_DEFERRED) | \
+ (1UL << TCP_DELACK_TIMER_DEFERRED) | \
+- (1UL << TCP_MTU_REDUCED_DEFERRED))
++ (1UL << TCP_MTU_REDUCED_DEFERRED) | \
++ (1UL << MPTCP_PATH_MANAGER) | \
++ (1UL << MPTCP_SUB_DEFERRED))
++
+ /**
+ * tcp_release_cb - tcp release_sock() callback
+ * @sk: socket
+@@ -803,6 +828,13 @@ void tcp_release_cb(struct sock *sk)
+ sk->sk_prot->mtu_reduced(sk);
+ __sock_put(sk);
+ }
++ if (flags & (1UL << MPTCP_PATH_MANAGER)) {
++ if (tcp_sk(sk)->mpcb->pm_ops->release_sock)
++ tcp_sk(sk)->mpcb->pm_ops->release_sock(sk);
++ __sock_put(sk);
++ }
++ if (flags & (1UL << MPTCP_SUB_DEFERRED))
++ mptcp_tsq_sub_deferred(sk);
+ }
+ EXPORT_SYMBOL(tcp_release_cb);
+
+@@ -862,8 +894,8 @@ void tcp_wfree(struct sk_buff *skb)
+ * We are working here with either a clone of the original
+ * SKB, or a fresh unique copy made by the retransmit engine.
+ */
+-static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
+- gfp_t gfp_mask)
++int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
++ gfp_t gfp_mask)
+ {
+ const struct inet_connection_sock *icsk = inet_csk(sk);
+ struct inet_sock *inet;
+@@ -933,7 +965,7 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
+ */
+ th->window = htons(min(tp->rcv_wnd, 65535U));
+ } else {
+- th->window = htons(tcp_select_window(sk));
++ th->window = htons(tp->ops->select_window(sk));
+ }
+ th->check = 0;
+ th->urg_ptr = 0;
+@@ -949,7 +981,7 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
+ }
+ }
+
+- tcp_options_write((__be32 *)(th + 1), tp, &opts);
++ tcp_options_write((__be32 *)(th + 1), tp, &opts, skb);
+ if (likely((tcb->tcp_flags & TCPHDR_SYN) == 0))
+ TCP_ECN_send(sk, skb, tcp_header_size);
+
+@@ -988,7 +1020,7 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
+ * NOTE: probe0 timer is not checked, do not forget tcp_push_pending_frames,
+ * otherwise socket can stall.
+ */
+-static void tcp_queue_skb(struct sock *sk, struct sk_buff *skb)
++void tcp_queue_skb(struct sock *sk, struct sk_buff *skb)
+ {
+ struct tcp_sock *tp = tcp_sk(sk);
+
+@@ -1001,15 +1033,16 @@ static void tcp_queue_skb(struct sock *sk, struct sk_buff *skb)
+ }
+
+ /* Initialize TSO segments for a packet. */
+-static void tcp_set_skb_tso_segs(const struct sock *sk, struct sk_buff *skb,
+- unsigned int mss_now)
++void tcp_set_skb_tso_segs(const struct sock *sk, struct sk_buff *skb,
++ unsigned int mss_now)
+ {
+ struct skb_shared_info *shinfo = skb_shinfo(skb);
+
+ /* Make sure we own this skb before messing gso_size/gso_segs */
+ WARN_ON_ONCE(skb_cloned(skb));
+
+- if (skb->len <= mss_now || skb->ip_summed == CHECKSUM_NONE) {
++ if (skb->len <= mss_now || (is_meta_sk(sk) && !mptcp_sk_can_gso(sk)) ||
++ (!is_meta_sk(sk) && !sk_can_gso(sk)) || skb->ip_summed == CHECKSUM_NONE) {
+ /* Avoid the costly divide in the normal
+ * non-TSO case.
+ */
+@@ -1041,7 +1074,7 @@ static void tcp_adjust_fackets_out(struct sock *sk, const struct sk_buff *skb,
+ /* Pcount in the middle of the write queue got changed, we need to do various
+ * tweaks to fix counters
+ */
+-static void tcp_adjust_pcount(struct sock *sk, const struct sk_buff *skb, int decr)
++void tcp_adjust_pcount(struct sock *sk, const struct sk_buff *skb, int decr)
+ {
+ struct tcp_sock *tp = tcp_sk(sk);
+
+@@ -1164,7 +1197,7 @@ int tcp_fragment(struct sock *sk, struct sk_buff *skb, u32 len,
+ * eventually). The difference is that pulled data not copied, but
+ * immediately discarded.
+ */
+-static void __pskb_trim_head(struct sk_buff *skb, int len)
++void __pskb_trim_head(struct sk_buff *skb, int len)
+ {
+ struct skb_shared_info *shinfo;
+ int i, k, eat;
+@@ -1205,6 +1238,9 @@ static void __pskb_trim_head(struct sk_buff *skb, int len)
+ /* Remove acked data from a packet in the transmit queue. */
+ int tcp_trim_head(struct sock *sk, struct sk_buff *skb, u32 len)
+ {
++ if (mptcp(tcp_sk(sk)) && !is_meta_sk(sk) && mptcp_is_data_seq(skb))
++ return mptcp_trim_head(sk, skb, len);
++
+ if (skb_unclone(skb, GFP_ATOMIC))
+ return -ENOMEM;
+
+@@ -1222,6 +1258,15 @@ int tcp_trim_head(struct sock *sk, struct sk_buff *skb, u32 len)
+ if (tcp_skb_pcount(skb) > 1)
+ tcp_set_skb_tso_segs(sk, skb, tcp_skb_mss(skb));
+
++#ifdef CONFIG_MPTCP
++ /* Some data got acked - we assume that the seq-number reached the dest.
++ * Anyway, our MPTCP-option has been trimmed above - we lost it here.
++ * Only remove the SEQ if the call does not come from a meta retransmit.
++ */
++ if (mptcp(tcp_sk(sk)) && !is_meta_sk(sk))
++ TCP_SKB_CB(skb)->mptcp_flags &= ~MPTCPHDR_SEQ;
++#endif
++
+ return 0;
+ }
+
+@@ -1379,6 +1424,7 @@ unsigned int tcp_current_mss(struct sock *sk)
+
+ return mss_now;
+ }
++EXPORT_SYMBOL(tcp_current_mss);
+
+ /* RFC2861, slow part. Adjust cwnd, after it was not full during one rto.
+ * As additional protections, we do not touch cwnd in retransmission phases,
+@@ -1446,8 +1492,8 @@ static bool tcp_minshall_check(const struct tcp_sock *tp)
+ * But we can avoid doing the divide again given we already have
+ * skb_pcount = skb->len / mss_now
+ */
+-static void tcp_minshall_update(struct tcp_sock *tp, unsigned int mss_now,
+- const struct sk_buff *skb)
++void tcp_minshall_update(struct tcp_sock *tp, unsigned int mss_now,
++ const struct sk_buff *skb)
+ {
+ if (skb->len < tcp_skb_pcount(skb) * mss_now)
+ tp->snd_sml = TCP_SKB_CB(skb)->end_seq;
+@@ -1468,11 +1514,11 @@ static bool tcp_nagle_check(bool partial, const struct tcp_sock *tp,
+ (!nonagle && tp->packets_out && tcp_minshall_check(tp)));
+ }
+ /* Returns the portion of skb which can be sent right away */
+-static unsigned int tcp_mss_split_point(const struct sock *sk,
+- const struct sk_buff *skb,
+- unsigned int mss_now,
+- unsigned int max_segs,
+- int nonagle)
++unsigned int tcp_mss_split_point(const struct sock *sk,
++ const struct sk_buff *skb,
++ unsigned int mss_now,
++ unsigned int max_segs,
++ int nonagle)
+ {
+ const struct tcp_sock *tp = tcp_sk(sk);
+ u32 partial, needed, window, max_len;
+@@ -1502,13 +1548,14 @@ static unsigned int tcp_mss_split_point(const struct sock *sk,
+ /* Can at least one segment of SKB be sent right now, according to the
+ * congestion window rules? If so, return how many segments are allowed.
+ */
+-static inline unsigned int tcp_cwnd_test(const struct tcp_sock *tp,
+- const struct sk_buff *skb)
++unsigned int tcp_cwnd_test(const struct tcp_sock *tp,
++ const struct sk_buff *skb)
+ {
+ u32 in_flight, cwnd;
+
+ /* Don't be strict about the congestion window for the final FIN. */
+- if ((TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN) &&
++ if (skb &&
++ (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN) &&
+ tcp_skb_pcount(skb) == 1)
+ return 1;
+
+@@ -1524,8 +1571,8 @@ static inline unsigned int tcp_cwnd_test(const struct tcp_sock *tp,
+ * This must be invoked the first time we consider transmitting
+ * SKB onto the wire.
+ */
+-static int tcp_init_tso_segs(const struct sock *sk, struct sk_buff *skb,
+- unsigned int mss_now)
++int tcp_init_tso_segs(const struct sock *sk, struct sk_buff *skb,
++ unsigned int mss_now)
+ {
+ int tso_segs = tcp_skb_pcount(skb);
+
+@@ -1540,8 +1587,8 @@ static int tcp_init_tso_segs(const struct sock *sk, struct sk_buff *skb,
+ /* Return true if the Nagle test allows this packet to be
+ * sent now.
+ */
+-static inline bool tcp_nagle_test(const struct tcp_sock *tp, const struct sk_buff *skb,
+- unsigned int cur_mss, int nonagle)
++bool tcp_nagle_test(const struct tcp_sock *tp, const struct sk_buff *skb,
++ unsigned int cur_mss, int nonagle)
+ {
+ /* Nagle rule does not apply to frames, which sit in the middle of the
+ * write_queue (they have no chances to get new data).
+@@ -1553,7 +1600,8 @@ static inline bool tcp_nagle_test(const struct tcp_sock *tp, const struct sk_buf
+ return true;
+
+ /* Don't use the nagle rule for urgent data (or for the final FIN). */
+- if (tcp_urg_mode(tp) || (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN))
++ if (tcp_urg_mode(tp) || (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN) ||
++ mptcp_is_data_fin(skb))
+ return true;
+
+ if (!tcp_nagle_check(skb->len < cur_mss, tp, nonagle))
+@@ -1563,9 +1611,8 @@ static inline bool tcp_nagle_test(const struct tcp_sock *tp, const struct sk_buf
+ }
+
+ /* Does at least the first segment of SKB fit into the send window? */
+-static bool tcp_snd_wnd_test(const struct tcp_sock *tp,
+- const struct sk_buff *skb,
+- unsigned int cur_mss)
++bool tcp_snd_wnd_test(const struct tcp_sock *tp, const struct sk_buff *skb,
++ unsigned int cur_mss)
+ {
+ u32 end_seq = TCP_SKB_CB(skb)->end_seq;
+
+@@ -1676,7 +1723,7 @@ static bool tcp_tso_should_defer(struct sock *sk, struct sk_buff *skb,
+ u32 send_win, cong_win, limit, in_flight;
+ int win_divisor;
+
+- if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN)
++ if ((TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN) || mptcp_is_data_fin(skb))
+ goto send_now;
+
+ if (icsk->icsk_ca_state != TCP_CA_Open)
+@@ -1888,7 +1935,7 @@ static int tcp_mtu_probe(struct sock *sk)
+ * Returns true, if no segments are in flight and we have queued segments,
+ * but cannot send anything now because of SWS or another problem.
+ */
+-static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
++bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
+ int push_one, gfp_t gfp)
+ {
+ struct tcp_sock *tp = tcp_sk(sk);
+@@ -1900,7 +1947,11 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
+
+ sent_pkts = 0;
+
+- if (!push_one) {
++ /* PMTU probing is not yet supported with MPTCP. It should be possible
++ * by exiting the loop inside tcp_mtu_probe early, making sure that
++ * only a single DSS-mapping gets probed.
++ */
++ if (!push_one && !mptcp(tp)) {
+ /* Do MTU probing. */
+ result = tcp_mtu_probe(sk);
+ if (!result) {
+@@ -2099,7 +2150,8 @@ void tcp_send_loss_probe(struct sock *sk)
+ int err = -1;
+
+ if (tcp_send_head(sk) != NULL) {
+- err = tcp_write_xmit(sk, mss, TCP_NAGLE_OFF, 2, GFP_ATOMIC);
++ err = tp->ops->write_xmit(sk, mss, TCP_NAGLE_OFF, 2,
++ GFP_ATOMIC);
+ goto rearm_timer;
+ }
+
+@@ -2159,8 +2211,8 @@ void __tcp_push_pending_frames(struct sock *sk, unsigned int cur_mss,
+ if (unlikely(sk->sk_state == TCP_CLOSE))
+ return;
+
+- if (tcp_write_xmit(sk, cur_mss, nonagle, 0,
+- sk_gfp_atomic(sk, GFP_ATOMIC)))
++ if (tcp_sk(sk)->ops->write_xmit(sk, cur_mss, nonagle, 0,
++ sk_gfp_atomic(sk, GFP_ATOMIC)))
+ tcp_check_probe_timer(sk);
+ }
+
+@@ -2173,7 +2225,8 @@ void tcp_push_one(struct sock *sk, unsigned int mss_now)
+
+ BUG_ON(!skb || skb->len < mss_now);
+
+- tcp_write_xmit(sk, mss_now, TCP_NAGLE_PUSH, 1, sk->sk_allocation);
++ tcp_sk(sk)->ops->write_xmit(sk, mss_now, TCP_NAGLE_PUSH, 1,
++ sk->sk_allocation);
+ }
+
+ /* This function returns the amount that we can raise the
+@@ -2386,6 +2439,10 @@ static void tcp_retrans_try_collapse(struct sock *sk, struct sk_buff *to,
+ if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_SYN)
+ return;
+
++ /* Currently not supported for MPTCP - but it should be possible */
++ if (mptcp(tp))
++ return;
++
+ tcp_for_write_queue_from_safe(skb, tmp, sk) {
+ if (!tcp_can_collapse(sk, skb))
+ break;
+@@ -2843,7 +2900,7 @@ struct sk_buff *tcp_make_synack(struct sock *sk, struct dst_entry *dst,
+
+ /* RFC1323: The window in SYN & SYN/ACK segments is never scaled. */
+ th->window = htons(min(req->rcv_wnd, 65535U));
+- tcp_options_write((__be32 *)(th + 1), tp, &opts);
++ tcp_options_write((__be32 *)(th + 1), tp, &opts, skb);
+ th->doff = (tcp_header_size >> 2);
+ TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_OUTSEGS);
+
+@@ -2897,13 +2954,13 @@ static void tcp_connect_init(struct sock *sk)
+ (tp->window_clamp > tcp_full_space(sk) || tp->window_clamp == 0))
+ tp->window_clamp = tcp_full_space(sk);
+
+- tcp_select_initial_window(tcp_full_space(sk),
+- tp->advmss - (tp->rx_opt.ts_recent_stamp ? tp->tcp_header_len - sizeof(struct tcphdr) : 0),
+- &tp->rcv_wnd,
+- &tp->window_clamp,
+- sysctl_tcp_window_scaling,
+- &rcv_wscale,
+- dst_metric(dst, RTAX_INITRWND));
++ tp->ops->select_initial_window(tcp_full_space(sk),
++ tp->advmss - (tp->rx_opt.ts_recent_stamp ? tp->tcp_header_len - sizeof(struct tcphdr) : 0),
++ &tp->rcv_wnd,
++ &tp->window_clamp,
++ sysctl_tcp_window_scaling,
++ &rcv_wscale,
++ dst_metric(dst, RTAX_INITRWND), sk);
+
+ tp->rx_opt.rcv_wscale = rcv_wscale;
+ tp->rcv_ssthresh = tp->rcv_wnd;
+@@ -2927,6 +2984,36 @@ static void tcp_connect_init(struct sock *sk)
+ inet_csk(sk)->icsk_rto = TCP_TIMEOUT_INIT;
+ inet_csk(sk)->icsk_retransmits = 0;
+ tcp_clear_retrans(tp);
++
++#ifdef CONFIG_MPTCP
++ if (sysctl_mptcp_enabled && mptcp_doit(sk)) {
++ if (is_master_tp(tp)) {
++ tp->request_mptcp = 1;
++ mptcp_connect_init(sk);
++ } else if (tp->mptcp) {
++ struct inet_sock *inet = inet_sk(sk);
++
++ tp->mptcp->snt_isn = tp->write_seq;
++ tp->mptcp->init_rcv_wnd = tp->rcv_wnd;
++
++ /* Set nonce for new subflows */
++ if (sk->sk_family == AF_INET)
++ tp->mptcp->mptcp_loc_nonce = mptcp_v4_get_nonce(
++ inet->inet_saddr,
++ inet->inet_daddr,
++ inet->inet_sport,
++ inet->inet_dport);
++#if IS_ENABLED(CONFIG_IPV6)
++ else
++ tp->mptcp->mptcp_loc_nonce = mptcp_v6_get_nonce(
++ inet6_sk(sk)->saddr.s6_addr32,
++ sk->sk_v6_daddr.s6_addr32,
++ inet->inet_sport,
++ inet->inet_dport);
++#endif
++ }
++ }
++#endif
+ }
+
+ static void tcp_connect_queue_skb(struct sock *sk, struct sk_buff *skb)
+@@ -3176,6 +3263,7 @@ void tcp_send_ack(struct sock *sk)
+ TCP_SKB_CB(buff)->when = tcp_time_stamp;
+ tcp_transmit_skb(sk, buff, 0, sk_gfp_atomic(sk, GFP_ATOMIC));
+ }
++EXPORT_SYMBOL(tcp_send_ack);
+
+ /* This routine sends a packet with an out of date sequence
+ * number. It assumes the other end will try to ack it.
+@@ -3188,7 +3276,7 @@ void tcp_send_ack(struct sock *sk)
+ * one is with SEG.SEQ=SND.UNA to deliver urgent pointer, another is
+ * out-of-date with SND.UNA-1 to probe window.
+ */
+-static int tcp_xmit_probe_skb(struct sock *sk, int urgent)
++int tcp_xmit_probe_skb(struct sock *sk, int urgent)
+ {
+ struct tcp_sock *tp = tcp_sk(sk);
+ struct sk_buff *skb;
+@@ -3270,7 +3358,7 @@ void tcp_send_probe0(struct sock *sk)
+ struct tcp_sock *tp = tcp_sk(sk);
+ int err;
+
+- err = tcp_write_wakeup(sk);
++ err = tp->ops->write_wakeup(sk);
+
+ if (tp->packets_out || !tcp_send_head(sk)) {
+ /* Cancel probe timer, if it is not required. */
+@@ -3301,3 +3389,18 @@ void tcp_send_probe0(struct sock *sk)
+ TCP_RTO_MAX);
+ }
+ }
++
++int tcp_rtx_synack(struct sock *sk, struct request_sock *req)
++{
++ const struct tcp_request_sock_ops *af_ops = tcp_rsk(req)->af_specific;
++ struct flowi fl;
++ int res;
++
++ res = af_ops->send_synack(sk, NULL, &fl, req, 0, NULL);
++ if (!res) {
++ TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_RETRANSSEGS);
++ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPSYNRETRANS);
++ }
++ return res;
++}
++EXPORT_SYMBOL(tcp_rtx_synack);
+diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
+index 286227abed10..966b873cbf3e 100644
+--- a/net/ipv4/tcp_timer.c
++++ b/net/ipv4/tcp_timer.c
+@@ -20,6 +20,7 @@
+
+ #include <linux/module.h>
+ #include <linux/gfp.h>
++#include <net/mptcp.h>
+ #include <net/tcp.h>
+
+ int sysctl_tcp_syn_retries __read_mostly = TCP_SYN_RETRIES;
+@@ -32,7 +33,7 @@ int sysctl_tcp_retries2 __read_mostly = TCP_RETR2;
+ int sysctl_tcp_orphan_retries __read_mostly;
+ int sysctl_tcp_thin_linear_timeouts __read_mostly;
+
+-static void tcp_write_err(struct sock *sk)
++void tcp_write_err(struct sock *sk)
+ {
+ sk->sk_err = sk->sk_err_soft ? : ETIMEDOUT;
+ sk->sk_error_report(sk);
+@@ -74,7 +75,7 @@ static int tcp_out_of_resources(struct sock *sk, int do_reset)
+ (!tp->snd_wnd && !tp->packets_out))
+ do_reset = 1;
+ if (do_reset)
+- tcp_send_active_reset(sk, GFP_ATOMIC);
++ tp->ops->send_active_reset(sk, GFP_ATOMIC);
+ tcp_done(sk);
+ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPABORTONMEMORY);
+ return 1;
+@@ -124,10 +125,8 @@ static void tcp_mtu_probing(struct inet_connection_sock *icsk, struct sock *sk)
+ * retransmissions with an initial RTO of TCP_RTO_MIN or TCP_TIMEOUT_INIT if
+ * syn_set flag is set.
+ */
+-static bool retransmits_timed_out(struct sock *sk,
+- unsigned int boundary,
+- unsigned int timeout,
+- bool syn_set)
++bool retransmits_timed_out(struct sock *sk, unsigned int boundary,
++ unsigned int timeout, bool syn_set)
+ {
+ unsigned int linear_backoff_thresh, start_ts;
+ unsigned int rto_base = syn_set ? TCP_TIMEOUT_INIT : TCP_RTO_MIN;
+@@ -153,7 +152,7 @@ static bool retransmits_timed_out(struct sock *sk,
+ }
+
+ /* A write timeout has occurred. Process the after effects. */
+-static int tcp_write_timeout(struct sock *sk)
++int tcp_write_timeout(struct sock *sk)
+ {
+ struct inet_connection_sock *icsk = inet_csk(sk);
+ struct tcp_sock *tp = tcp_sk(sk);
+@@ -171,6 +170,10 @@ static int tcp_write_timeout(struct sock *sk)
+ }
+ retry_until = icsk->icsk_syn_retries ? : sysctl_tcp_syn_retries;
+ syn_set = true;
++ /* Stop retransmitting MP_CAPABLE options in SYN if timed out. */
++ if (tcp_sk(sk)->request_mptcp &&
++ icsk->icsk_retransmits >= mptcp_sysctl_syn_retries())
++ tcp_sk(sk)->request_mptcp = 0;
+ } else {
+ if (retransmits_timed_out(sk, sysctl_tcp_retries1, 0, 0)) {
+ /* Black hole detection */
+@@ -251,18 +254,22 @@ out:
+ static void tcp_delack_timer(unsigned long data)
+ {
+ struct sock *sk = (struct sock *)data;
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct sock *meta_sk = mptcp(tp) ? mptcp_meta_sk(sk) : sk;
+
+- bh_lock_sock(sk);
+- if (!sock_owned_by_user(sk)) {
++ bh_lock_sock(meta_sk);
++ if (!sock_owned_by_user(meta_sk)) {
+ tcp_delack_timer_handler(sk);
+ } else {
+ inet_csk(sk)->icsk_ack.blocked = 1;
+- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_DELAYEDACKLOCKED);
++ NET_INC_STATS_BH(sock_net(meta_sk), LINUX_MIB_DELAYEDACKLOCKED);
+ /* deleguate our work to tcp_release_cb() */
+ if (!test_and_set_bit(TCP_DELACK_TIMER_DEFERRED, &tcp_sk(sk)->tsq_flags))
+ sock_hold(sk);
++ if (mptcp(tp))
++ mptcp_tsq_flags(sk);
+ }
+- bh_unlock_sock(sk);
++ bh_unlock_sock(meta_sk);
+ sock_put(sk);
+ }
+
+@@ -479,6 +486,10 @@ out_reset_timer:
+ __sk_dst_reset(sk);
+
+ out:;
++ if (mptcp(tp)) {
++ mptcp_reinject_data(sk, 1);
++ mptcp_set_rto(sk);
++ }
+ }
+
+ void tcp_write_timer_handler(struct sock *sk)
+@@ -505,7 +516,7 @@ void tcp_write_timer_handler(struct sock *sk)
+ break;
+ case ICSK_TIME_RETRANS:
+ icsk->icsk_pending = 0;
+- tcp_retransmit_timer(sk);
++ tcp_sk(sk)->ops->retransmit_timer(sk);
+ break;
+ case ICSK_TIME_PROBE0:
+ icsk->icsk_pending = 0;
+@@ -520,16 +531,19 @@ out:
+ static void tcp_write_timer(unsigned long data)
+ {
+ struct sock *sk = (struct sock *)data;
++ struct sock *meta_sk = mptcp(tcp_sk(sk)) ? mptcp_meta_sk(sk) : sk;
+
+- bh_lock_sock(sk);
+- if (!sock_owned_by_user(sk)) {
++ bh_lock_sock(meta_sk);
++ if (!sock_owned_by_user(meta_sk)) {
+ tcp_write_timer_handler(sk);
+ } else {
+ /* deleguate our work to tcp_release_cb() */
+ if (!test_and_set_bit(TCP_WRITE_TIMER_DEFERRED, &tcp_sk(sk)->tsq_flags))
+ sock_hold(sk);
++ if (mptcp(tcp_sk(sk)))
++ mptcp_tsq_flags(sk);
+ }
+- bh_unlock_sock(sk);
++ bh_unlock_sock(meta_sk);
+ sock_put(sk);
+ }
+
+@@ -566,11 +580,12 @@ static void tcp_keepalive_timer (unsigned long data)
+ struct sock *sk = (struct sock *) data;
+ struct inet_connection_sock *icsk = inet_csk(sk);
+ struct tcp_sock *tp = tcp_sk(sk);
++ struct sock *meta_sk = mptcp(tp) ? mptcp_meta_sk(sk) : sk;
+ u32 elapsed;
+
+ /* Only process if socket is not in use. */
+- bh_lock_sock(sk);
+- if (sock_owned_by_user(sk)) {
++ bh_lock_sock(meta_sk);
++ if (sock_owned_by_user(meta_sk)) {
+ /* Try again later. */
+ inet_csk_reset_keepalive_timer (sk, HZ/20);
+ goto out;
+@@ -581,16 +596,38 @@ static void tcp_keepalive_timer (unsigned long data)
+ goto out;
+ }
+
++ if (tp->send_mp_fclose) {
++ /* MUST do this before tcp_write_timeout, because retrans_stamp
++ * may have been set to 0 in another part while we are
++ * retransmitting MP_FASTCLOSE. Then, we would crash, because
++ * retransmits_timed_out accesses the meta-write-queue.
++ *
++ * We make sure that the timestamp is != 0.
++ */
++ if (!tp->retrans_stamp)
++ tp->retrans_stamp = tcp_time_stamp ? : 1;
++
++ if (tcp_write_timeout(sk))
++ goto out;
++
++ tcp_send_ack(sk);
++ icsk->icsk_retransmits++;
++
++ icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX);
++ elapsed = icsk->icsk_rto;
++ goto resched;
++ }
++
+ if (sk->sk_state == TCP_FIN_WAIT2 && sock_flag(sk, SOCK_DEAD)) {
+ if (tp->linger2 >= 0) {
+ const int tmo = tcp_fin_time(sk) - TCP_TIMEWAIT_LEN;
+
+ if (tmo > 0) {
+- tcp_time_wait(sk, TCP_FIN_WAIT2, tmo);
++ tp->ops->time_wait(sk, TCP_FIN_WAIT2, tmo);
+ goto out;
+ }
+ }
+- tcp_send_active_reset(sk, GFP_ATOMIC);
++ tp->ops->send_active_reset(sk, GFP_ATOMIC);
+ goto death;
+ }
+
+@@ -614,11 +651,11 @@ static void tcp_keepalive_timer (unsigned long data)
+ icsk->icsk_probes_out > 0) ||
+ (icsk->icsk_user_timeout == 0 &&
+ icsk->icsk_probes_out >= keepalive_probes(tp))) {
+- tcp_send_active_reset(sk, GFP_ATOMIC);
++ tp->ops->send_active_reset(sk, GFP_ATOMIC);
+ tcp_write_err(sk);
+ goto out;
+ }
+- if (tcp_write_wakeup(sk) <= 0) {
++ if (tp->ops->write_wakeup(sk) <= 0) {
+ icsk->icsk_probes_out++;
+ elapsed = keepalive_intvl_when(tp);
+ } else {
+@@ -642,7 +679,7 @@ death:
+ tcp_done(sk);
+
+ out:
+- bh_unlock_sock(sk);
++ bh_unlock_sock(meta_sk);
+ sock_put(sk);
+ }
+
+diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
+index 5667b3003af9..7139c2973fd2 100644
+--- a/net/ipv6/addrconf.c
++++ b/net/ipv6/addrconf.c
+@@ -760,6 +760,7 @@ void inet6_ifa_finish_destroy(struct inet6_ifaddr *ifp)
+
+ kfree_rcu(ifp, rcu);
+ }
++EXPORT_SYMBOL(inet6_ifa_finish_destroy);
+
+ static void
+ ipv6_link_dev_addr(struct inet6_dev *idev, struct inet6_ifaddr *ifp)
+diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
+index 7cb4392690dd..7057afbca4df 100644
+--- a/net/ipv6/af_inet6.c
++++ b/net/ipv6/af_inet6.c
+@@ -97,8 +97,7 @@ static __inline__ struct ipv6_pinfo *inet6_sk_generic(struct sock *sk)
+ return (struct ipv6_pinfo *)(((u8 *)sk) + offset);
+ }
+
+-static int inet6_create(struct net *net, struct socket *sock, int protocol,
+- int kern)
++int inet6_create(struct net *net, struct socket *sock, int protocol, int kern)
+ {
+ struct inet_sock *inet;
+ struct ipv6_pinfo *np;
+diff --git a/net/ipv6/inet6_connection_sock.c b/net/ipv6/inet6_connection_sock.c
+index a245e5ddffbd..99c892b8992d 100644
+--- a/net/ipv6/inet6_connection_sock.c
++++ b/net/ipv6/inet6_connection_sock.c
+@@ -96,8 +96,8 @@ struct dst_entry *inet6_csk_route_req(struct sock *sk,
+ /*
+ * request_sock (formerly open request) hash tables.
+ */
+-static u32 inet6_synq_hash(const struct in6_addr *raddr, const __be16 rport,
+- const u32 rnd, const u32 synq_hsize)
++u32 inet6_synq_hash(const struct in6_addr *raddr, const __be16 rport,
++ const u32 rnd, const u32 synq_hsize)
+ {
+ u32 c;
+
+diff --git a/net/ipv6/ipv6_sockglue.c b/net/ipv6/ipv6_sockglue.c
+index edb58aff4ae7..ea4d9fda0927 100644
+--- a/net/ipv6/ipv6_sockglue.c
++++ b/net/ipv6/ipv6_sockglue.c
+@@ -48,6 +48,8 @@
+ #include <net/addrconf.h>
+ #include <net/inet_common.h>
+ #include <net/tcp.h>
++#include <net/mptcp.h>
++#include <net/mptcp_v4.h>
+ #include <net/udp.h>
+ #include <net/udplite.h>
+ #include <net/xfrm.h>
+@@ -196,7 +198,12 @@ static int do_ipv6_setsockopt(struct sock *sk, int level, int optname,
+ sock_prot_inuse_add(net, &tcp_prot, 1);
+ local_bh_enable();
+ sk->sk_prot = &tcp_prot;
+- icsk->icsk_af_ops = &ipv4_specific;
++#ifdef CONFIG_MPTCP
++ if (is_mptcp_enabled(sk))
++ icsk->icsk_af_ops = &mptcp_v4_specific;
++ else
++#endif
++ icsk->icsk_af_ops = &ipv4_specific;
+ sk->sk_socket->ops = &inet_stream_ops;
+ sk->sk_family = PF_INET;
+ tcp_sync_mss(sk, icsk->icsk_pmtu_cookie);
+diff --git a/net/ipv6/syncookies.c b/net/ipv6/syncookies.c
+index a822b880689b..b2b38869d795 100644
+--- a/net/ipv6/syncookies.c
++++ b/net/ipv6/syncookies.c
+@@ -181,13 +181,13 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
+
+ /* check for timestamp cookie support */
+ memset(&tcp_opt, 0, sizeof(tcp_opt));
+- tcp_parse_options(skb, &tcp_opt, 0, NULL);
++ tcp_parse_options(skb, &tcp_opt, NULL, 0, NULL);
+
+ if (!cookie_check_timestamp(&tcp_opt, sock_net(sk), &ecn_ok))
+ goto out;
+
+ ret = NULL;
+- req = inet6_reqsk_alloc(&tcp6_request_sock_ops);
++ req = inet_reqsk_alloc(&tcp6_request_sock_ops);
+ if (!req)
+ goto out;
+
+@@ -255,10 +255,10 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
+ }
+
+ req->window_clamp = tp->window_clamp ? :dst_metric(dst, RTAX_WINDOW);
+- tcp_select_initial_window(tcp_full_space(sk), req->mss,
+- &req->rcv_wnd, &req->window_clamp,
+- ireq->wscale_ok, &rcv_wscale,
+- dst_metric(dst, RTAX_INITRWND));
++ tp->ops->select_initial_window(tcp_full_space(sk), req->mss,
++ &req->rcv_wnd, &req->window_clamp,
++ ireq->wscale_ok, &rcv_wscale,
++ dst_metric(dst, RTAX_INITRWND), sk);
+
+ ireq->rcv_wscale = rcv_wscale;
+
+diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
+index 229239ad96b1..fda94d71666e 100644
+--- a/net/ipv6/tcp_ipv6.c
++++ b/net/ipv6/tcp_ipv6.c
+@@ -63,6 +63,8 @@
+ #include <net/inet_common.h>
+ #include <net/secure_seq.h>
+ #include <net/tcp_memcontrol.h>
++#include <net/mptcp.h>
++#include <net/mptcp_v6.h>
+ #include <net/busy_poll.h>
+
+ #include <linux/proc_fs.h>
+@@ -71,12 +73,6 @@
+ #include <linux/crypto.h>
+ #include <linux/scatterlist.h>
+
+-static void tcp_v6_send_reset(struct sock *sk, struct sk_buff *skb);
+-static void tcp_v6_reqsk_send_ack(struct sock *sk, struct sk_buff *skb,
+- struct request_sock *req);
+-
+-static int tcp_v6_do_rcv(struct sock *sk, struct sk_buff *skb);
+-
+ static const struct inet_connection_sock_af_ops ipv6_mapped;
+ static const struct inet_connection_sock_af_ops ipv6_specific;
+ #ifdef CONFIG_TCP_MD5SIG
+@@ -90,7 +86,7 @@ static struct tcp_md5sig_key *tcp_v6_md5_do_lookup(struct sock *sk,
+ }
+ #endif
+
+-static void inet6_sk_rx_dst_set(struct sock *sk, const struct sk_buff *skb)
++void inet6_sk_rx_dst_set(struct sock *sk, const struct sk_buff *skb)
+ {
+ struct dst_entry *dst = skb_dst(skb);
+ const struct rt6_info *rt = (const struct rt6_info *)dst;
+@@ -102,10 +98,11 @@ static void inet6_sk_rx_dst_set(struct sock *sk, const struct sk_buff *skb)
+ inet6_sk(sk)->rx_dst_cookie = rt->rt6i_node->fn_sernum;
+ }
+
+-static void tcp_v6_hash(struct sock *sk)
++void tcp_v6_hash(struct sock *sk)
+ {
+ if (sk->sk_state != TCP_CLOSE) {
+- if (inet_csk(sk)->icsk_af_ops == &ipv6_mapped) {
++ if (inet_csk(sk)->icsk_af_ops == &ipv6_mapped ||
++ inet_csk(sk)->icsk_af_ops == &mptcp_v6_mapped) {
+ tcp_prot.hash(sk);
+ return;
+ }
+@@ -115,7 +112,7 @@ static void tcp_v6_hash(struct sock *sk)
+ }
+ }
+
+-static __u32 tcp_v6_init_sequence(const struct sk_buff *skb)
++__u32 tcp_v6_init_sequence(const struct sk_buff *skb)
+ {
+ return secure_tcpv6_sequence_number(ipv6_hdr(skb)->daddr.s6_addr32,
+ ipv6_hdr(skb)->saddr.s6_addr32,
+@@ -123,7 +120,7 @@ static __u32 tcp_v6_init_sequence(const struct sk_buff *skb)
+ tcp_hdr(skb)->source);
+ }
+
+-static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
++int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
+ int addr_len)
+ {
+ struct sockaddr_in6 *usin = (struct sockaddr_in6 *) uaddr;
+@@ -215,7 +212,12 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
+ sin.sin_port = usin->sin6_port;
+ sin.sin_addr.s_addr = usin->sin6_addr.s6_addr32[3];
+
+- icsk->icsk_af_ops = &ipv6_mapped;
++#ifdef CONFIG_MPTCP
++ if (is_mptcp_enabled(sk))
++ icsk->icsk_af_ops = &mptcp_v6_mapped;
++ else
++#endif
++ icsk->icsk_af_ops = &ipv6_mapped;
+ sk->sk_backlog_rcv = tcp_v4_do_rcv;
+ #ifdef CONFIG_TCP_MD5SIG
+ tp->af_specific = &tcp_sock_ipv6_mapped_specific;
+@@ -225,7 +227,12 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
+
+ if (err) {
+ icsk->icsk_ext_hdr_len = exthdrlen;
+- icsk->icsk_af_ops = &ipv6_specific;
++#ifdef CONFIG_MPTCP
++ if (is_mptcp_enabled(sk))
++ icsk->icsk_af_ops = &mptcp_v6_specific;
++ else
++#endif
++ icsk->icsk_af_ops = &ipv6_specific;
+ sk->sk_backlog_rcv = tcp_v6_do_rcv;
+ #ifdef CONFIG_TCP_MD5SIG
+ tp->af_specific = &tcp_sock_ipv6_specific;
+@@ -337,7 +344,7 @@ static void tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
+ const struct ipv6hdr *hdr = (const struct ipv6hdr *)skb->data;
+ const struct tcphdr *th = (struct tcphdr *)(skb->data+offset);
+ struct ipv6_pinfo *np;
+- struct sock *sk;
++ struct sock *sk, *meta_sk;
+ int err;
+ struct tcp_sock *tp;
+ struct request_sock *fastopen;
+@@ -358,8 +365,14 @@ static void tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
+ return;
+ }
+
+- bh_lock_sock(sk);
+- if (sock_owned_by_user(sk) && type != ICMPV6_PKT_TOOBIG)
++ tp = tcp_sk(sk);
++ if (mptcp(tp))
++ meta_sk = mptcp_meta_sk(sk);
++ else
++ meta_sk = sk;
++
++ bh_lock_sock(meta_sk);
++ if (sock_owned_by_user(meta_sk) && type != ICMPV6_PKT_TOOBIG)
+ NET_INC_STATS_BH(net, LINUX_MIB_LOCKDROPPEDICMPS);
+
+ if (sk->sk_state == TCP_CLOSE)
+@@ -370,7 +383,6 @@ static void tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
+ goto out;
+ }
+
+- tp = tcp_sk(sk);
+ seq = ntohl(th->seq);
+ /* XXX (TFO) - tp->snd_una should be ISN (tcp_create_openreq_child() */
+ fastopen = tp->fastopen_rsk;
+@@ -403,11 +415,15 @@ static void tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
+ goto out;
+
+ tp->mtu_info = ntohl(info);
+- if (!sock_owned_by_user(sk))
++ if (!sock_owned_by_user(meta_sk))
+ tcp_v6_mtu_reduced(sk);
+- else if (!test_and_set_bit(TCP_MTU_REDUCED_DEFERRED,
++ else {
++ if (!test_and_set_bit(TCP_MTU_REDUCED_DEFERRED,
+ &tp->tsq_flags))
+- sock_hold(sk);
++ sock_hold(sk);
++ if (mptcp(tp))
++ mptcp_tsq_flags(sk);
++ }
+ goto out;
+ }
+
+@@ -417,7 +433,7 @@ static void tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
+ switch (sk->sk_state) {
+ struct request_sock *req, **prev;
+ case TCP_LISTEN:
+- if (sock_owned_by_user(sk))
++ if (sock_owned_by_user(meta_sk))
+ goto out;
+
+ req = inet6_csk_search_req(sk, &prev, th->dest, &hdr->daddr,
+@@ -447,7 +463,7 @@ static void tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
+ if (fastopen && fastopen->sk == NULL)
+ break;
+
+- if (!sock_owned_by_user(sk)) {
++ if (!sock_owned_by_user(meta_sk)) {
+ sk->sk_err = err;
+ sk->sk_error_report(sk); /* Wake people up to see the error (see connect in sock.c) */
+
+@@ -457,26 +473,27 @@ static void tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
+ goto out;
+ }
+
+- if (!sock_owned_by_user(sk) && np->recverr) {
++ if (!sock_owned_by_user(meta_sk) && np->recverr) {
+ sk->sk_err = err;
+ sk->sk_error_report(sk);
+ } else
+ sk->sk_err_soft = err;
+
+ out:
+- bh_unlock_sock(sk);
++ bh_unlock_sock(meta_sk);
+ sock_put(sk);
+ }
+
+
+-static int tcp_v6_send_synack(struct sock *sk, struct dst_entry *dst,
+- struct flowi6 *fl6,
+- struct request_sock *req,
+- u16 queue_mapping,
+- struct tcp_fastopen_cookie *foc)
++int tcp_v6_send_synack(struct sock *sk, struct dst_entry *dst,
++ struct flowi *fl,
++ struct request_sock *req,
++ u16 queue_mapping,
++ struct tcp_fastopen_cookie *foc)
+ {
+ struct inet_request_sock *ireq = inet_rsk(req);
+ struct ipv6_pinfo *np = inet6_sk(sk);
++ struct flowi6 *fl6 = &fl->u.ip6;
+ struct sk_buff *skb;
+ int err = -ENOMEM;
+
+@@ -497,18 +514,21 @@ static int tcp_v6_send_synack(struct sock *sk, struct dst_entry *dst,
+ skb_set_queue_mapping(skb, queue_mapping);
+ err = ip6_xmit(sk, skb, fl6, np->opt, np->tclass);
+ err = net_xmit_eval(err);
++ if (!tcp_rsk(req)->snt_synack && !err)
++ tcp_rsk(req)->snt_synack = tcp_time_stamp;
+ }
+
+ done:
+ return err;
+ }
+
+-static int tcp_v6_rtx_synack(struct sock *sk, struct request_sock *req)
++int tcp_v6_rtx_synack(struct sock *sk, struct request_sock *req)
+ {
+- struct flowi6 fl6;
++ const struct tcp_request_sock_ops *af_ops = tcp_rsk(req)->af_specific;
++ struct flowi fl;
+ int res;
+
+- res = tcp_v6_send_synack(sk, NULL, &fl6, req, 0, NULL);
++ res = af_ops->send_synack(sk, NULL, &fl, req, 0, NULL);
+ if (!res) {
+ TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_RETRANSSEGS);
+ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPSYNRETRANS);
+@@ -516,7 +536,7 @@ static int tcp_v6_rtx_synack(struct sock *sk, struct request_sock *req)
+ return res;
+ }
+
+-static void tcp_v6_reqsk_destructor(struct request_sock *req)
++void tcp_v6_reqsk_destructor(struct request_sock *req)
+ {
+ kfree_skb(inet_rsk(req)->pktopts);
+ }
+@@ -718,27 +738,74 @@ static int tcp_v6_inbound_md5_hash(struct sock *sk, const struct sk_buff *skb)
+ }
+ #endif
+
++static int tcp_v6_init_req(struct request_sock *req, struct sock *sk,
++ struct sk_buff *skb)
++{
++ struct inet_request_sock *ireq = inet_rsk(req);
++ struct ipv6_pinfo *np = inet6_sk(sk);
++
++ ireq->ir_v6_rmt_addr = ipv6_hdr(skb)->saddr;
++ ireq->ir_v6_loc_addr = ipv6_hdr(skb)->daddr;
++
++ ireq->ir_iif = sk->sk_bound_dev_if;
++ ireq->ir_mark = inet_request_mark(sk, skb);
++
++ /* So that link locals have meaning */
++ if (!sk->sk_bound_dev_if &&
++ ipv6_addr_type(&ireq->ir_v6_rmt_addr) & IPV6_ADDR_LINKLOCAL)
++ ireq->ir_iif = inet6_iif(skb);
++
++ if (!TCP_SKB_CB(skb)->when &&
++ (ipv6_opt_accepted(sk, skb) || np->rxopt.bits.rxinfo ||
++ np->rxopt.bits.rxoinfo || np->rxopt.bits.rxhlim ||
++ np->rxopt.bits.rxohlim || np->repflow)) {
++ atomic_inc(&skb->users);
++ ireq->pktopts = skb;
++ }
++
++ return 0;
++}
++
++static struct dst_entry *tcp_v6_route_req(struct sock *sk, struct flowi *fl,
++ const struct request_sock *req,
++ bool *strict)
++{
++ if (strict)
++ *strict = true;
++ return inet6_csk_route_req(sk, &fl->u.ip6, req);
++}
++
+ struct request_sock_ops tcp6_request_sock_ops __read_mostly = {
+ .family = AF_INET6,
+ .obj_size = sizeof(struct tcp6_request_sock),
+- .rtx_syn_ack = tcp_v6_rtx_synack,
++ .rtx_syn_ack = tcp_rtx_synack,
+ .send_ack = tcp_v6_reqsk_send_ack,
+ .destructor = tcp_v6_reqsk_destructor,
+ .send_reset = tcp_v6_send_reset,
+ .syn_ack_timeout = tcp_syn_ack_timeout,
+ };
+
++const struct tcp_request_sock_ops tcp_request_sock_ipv6_ops = {
++ .mss_clamp = IPV6_MIN_MTU - sizeof(struct tcphdr) -
++ sizeof(struct ipv6hdr),
+ #ifdef CONFIG_TCP_MD5SIG
+-static const struct tcp_request_sock_ops tcp_request_sock_ipv6_ops = {
+ .md5_lookup = tcp_v6_reqsk_md5_lookup,
+ .calc_md5_hash = tcp_v6_md5_hash_skb,
+-};
+ #endif
++ .init_req = tcp_v6_init_req,
++#ifdef CONFIG_SYN_COOKIES
++ .cookie_init_seq = cookie_v6_init_sequence,
++#endif
++ .route_req = tcp_v6_route_req,
++ .init_seq = tcp_v6_init_sequence,
++ .send_synack = tcp_v6_send_synack,
++ .queue_hash_add = inet6_csk_reqsk_queue_hash_add,
++};
+
+-static void tcp_v6_send_response(struct sk_buff *skb, u32 seq, u32 ack, u32 win,
+- u32 tsval, u32 tsecr, int oif,
+- struct tcp_md5sig_key *key, int rst, u8 tclass,
+- u32 label)
++static void tcp_v6_send_response(struct sk_buff *skb, u32 seq, u32 ack,
++ u32 data_ack, u32 win, u32 tsval, u32 tsecr,
++ int oif, struct tcp_md5sig_key *key, int rst,
++ u8 tclass, u32 label, int mptcp)
+ {
+ const struct tcphdr *th = tcp_hdr(skb);
+ struct tcphdr *t1;
+@@ -756,7 +823,10 @@ static void tcp_v6_send_response(struct sk_buff *skb, u32 seq, u32 ack, u32 win,
+ if (key)
+ tot_len += TCPOLEN_MD5SIG_ALIGNED;
+ #endif
+-
++#ifdef CONFIG_MPTCP
++ if (mptcp)
++ tot_len += MPTCP_SUB_LEN_DSS + MPTCP_SUB_LEN_ACK;
++#endif
+ buff = alloc_skb(MAX_HEADER + sizeof(struct ipv6hdr) + tot_len,
+ GFP_ATOMIC);
+ if (buff == NULL)
+@@ -794,6 +864,17 @@ static void tcp_v6_send_response(struct sk_buff *skb, u32 seq, u32 ack, u32 win,
+ tcp_v6_md5_hash_hdr((__u8 *)topt, key,
+ &ipv6_hdr(skb)->saddr,
+ &ipv6_hdr(skb)->daddr, t1);
++ topt += 4;
++ }
++#endif
++#ifdef CONFIG_MPTCP
++ if (mptcp) {
++ /* Construction of 32-bit data_ack */
++ *topt++ = htonl((TCPOPT_MPTCP << 24) |
++ ((MPTCP_SUB_LEN_DSS + MPTCP_SUB_LEN_ACK) << 16) |
++ (0x20 << 8) |
++ (0x01));
++ *topt++ = htonl(data_ack);
+ }
+ #endif
+
+@@ -834,7 +915,7 @@ static void tcp_v6_send_response(struct sk_buff *skb, u32 seq, u32 ack, u32 win,
+ kfree_skb(buff);
+ }
+
+-static void tcp_v6_send_reset(struct sock *sk, struct sk_buff *skb)
++void tcp_v6_send_reset(struct sock *sk, struct sk_buff *skb)
+ {
+ const struct tcphdr *th = tcp_hdr(skb);
+ u32 seq = 0, ack_seq = 0;
+@@ -891,7 +972,7 @@ static void tcp_v6_send_reset(struct sock *sk, struct sk_buff *skb)
+ (th->doff << 2);
+
+ oif = sk ? sk->sk_bound_dev_if : 0;
+- tcp_v6_send_response(skb, seq, ack_seq, 0, 0, 0, oif, key, 1, 0, 0);
++ tcp_v6_send_response(skb, seq, ack_seq, 0, 0, 0, 0, oif, key, 1, 0, 0, 0);
+
+ #ifdef CONFIG_TCP_MD5SIG
+ release_sk1:
+@@ -902,45 +983,52 @@ release_sk1:
+ #endif
+ }
+
+-static void tcp_v6_send_ack(struct sk_buff *skb, u32 seq, u32 ack,
++static void tcp_v6_send_ack(struct sk_buff *skb, u32 seq, u32 ack, u32 data_ack,
+ u32 win, u32 tsval, u32 tsecr, int oif,
+ struct tcp_md5sig_key *key, u8 tclass,
+- u32 label)
++ u32 label, int mptcp)
+ {
+- tcp_v6_send_response(skb, seq, ack, win, tsval, tsecr, oif, key, 0, tclass,
+- label);
++ tcp_v6_send_response(skb, seq, ack, data_ack, win, tsval, tsecr, oif,
++ key, 0, tclass, label, mptcp);
+ }
+
+ static void tcp_v6_timewait_ack(struct sock *sk, struct sk_buff *skb)
+ {
+ struct inet_timewait_sock *tw = inet_twsk(sk);
+ struct tcp_timewait_sock *tcptw = tcp_twsk(sk);
++ u32 data_ack = 0;
++ int mptcp = 0;
+
++ if (tcptw->mptcp_tw && tcptw->mptcp_tw->meta_tw) {
++ data_ack = (u32)tcptw->mptcp_tw->rcv_nxt;
++ mptcp = 1;
++ }
+ tcp_v6_send_ack(skb, tcptw->tw_snd_nxt, tcptw->tw_rcv_nxt,
++ data_ack,
+ tcptw->tw_rcv_wnd >> tw->tw_rcv_wscale,
+ tcp_time_stamp + tcptw->tw_ts_offset,
+ tcptw->tw_ts_recent, tw->tw_bound_dev_if, tcp_twsk_md5_key(tcptw),
+- tw->tw_tclass, (tw->tw_flowlabel << 12));
++ tw->tw_tclass, (tw->tw_flowlabel << 12), mptcp);
+
+ inet_twsk_put(tw);
+ }
+
+-static void tcp_v6_reqsk_send_ack(struct sock *sk, struct sk_buff *skb,
+- struct request_sock *req)
++void tcp_v6_reqsk_send_ack(struct sock *sk, struct sk_buff *skb,
++ struct request_sock *req)
+ {
+ /* sk->sk_state == TCP_LISTEN -> for regular TCP_SYN_RECV
+ * sk->sk_state == TCP_SYN_RECV -> for Fast Open.
+ */
+ tcp_v6_send_ack(skb, (sk->sk_state == TCP_LISTEN) ?
+ tcp_rsk(req)->snt_isn + 1 : tcp_sk(sk)->snd_nxt,
+- tcp_rsk(req)->rcv_nxt,
++ tcp_rsk(req)->rcv_nxt, 0,
+ req->rcv_wnd, tcp_time_stamp, req->ts_recent, sk->sk_bound_dev_if,
+ tcp_v6_md5_do_lookup(sk, &ipv6_hdr(skb)->daddr),
+- 0, 0);
++ 0, 0, 0);
+ }
+
+
+-static struct sock *tcp_v6_hnd_req(struct sock *sk, struct sk_buff *skb)
++struct sock *tcp_v6_hnd_req(struct sock *sk, struct sk_buff *skb)
+ {
+ struct request_sock *req, **prev;
+ const struct tcphdr *th = tcp_hdr(skb);
+@@ -959,7 +1047,13 @@ static struct sock *tcp_v6_hnd_req(struct sock *sk, struct sk_buff *skb)
+
+ if (nsk) {
+ if (nsk->sk_state != TCP_TIME_WAIT) {
++ /* Don't lock again the meta-sk. It has been locked
++ * before mptcp_v6_do_rcv.
++ */
++ if (mptcp(tcp_sk(nsk)) && !is_meta_sk(sk))
++ bh_lock_sock(mptcp_meta_sk(nsk));
+ bh_lock_sock(nsk);
++
+ return nsk;
+ }
+ inet_twsk_put(inet_twsk(nsk));
+@@ -973,161 +1067,25 @@ static struct sock *tcp_v6_hnd_req(struct sock *sk, struct sk_buff *skb)
+ return sk;
+ }
+
+-/* FIXME: this is substantially similar to the ipv4 code.
+- * Can some kind of merge be done? -- erics
+- */
+-static int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
++int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
+ {
+- struct tcp_options_received tmp_opt;
+- struct request_sock *req;
+- struct inet_request_sock *ireq;
+- struct ipv6_pinfo *np = inet6_sk(sk);
+- struct tcp_sock *tp = tcp_sk(sk);
+- __u32 isn = TCP_SKB_CB(skb)->when;
+- struct dst_entry *dst = NULL;
+- struct tcp_fastopen_cookie foc = { .len = -1 };
+- bool want_cookie = false, fastopen;
+- struct flowi6 fl6;
+- int err;
+-
+ if (skb->protocol == htons(ETH_P_IP))
+ return tcp_v4_conn_request(sk, skb);
+
+ if (!ipv6_unicast_destination(skb))
+ goto drop;
+
+- if ((sysctl_tcp_syncookies == 2 ||
+- inet_csk_reqsk_queue_is_full(sk)) && !isn) {
+- want_cookie = tcp_syn_flood_action(sk, skb, "TCPv6");
+- if (!want_cookie)
+- goto drop;
+- }
+-
+- if (sk_acceptq_is_full(sk) && inet_csk_reqsk_queue_young(sk) > 1) {
+- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENOVERFLOWS);
+- goto drop;
+- }
+-
+- req = inet6_reqsk_alloc(&tcp6_request_sock_ops);
+- if (req == NULL)
+- goto drop;
+-
+-#ifdef CONFIG_TCP_MD5SIG
+- tcp_rsk(req)->af_specific = &tcp_request_sock_ipv6_ops;
+-#endif
+-
+- tcp_clear_options(&tmp_opt);
+- tmp_opt.mss_clamp = IPV6_MIN_MTU - sizeof(struct tcphdr) - sizeof(struct ipv6hdr);
+- tmp_opt.user_mss = tp->rx_opt.user_mss;
+- tcp_parse_options(skb, &tmp_opt, 0, want_cookie ? NULL : &foc);
+-
+- if (want_cookie && !tmp_opt.saw_tstamp)
+- tcp_clear_options(&tmp_opt);
++ return tcp_conn_request(&tcp6_request_sock_ops,
++ &tcp_request_sock_ipv6_ops, sk, skb);
+
+- tmp_opt.tstamp_ok = tmp_opt.saw_tstamp;
+- tcp_openreq_init(req, &tmp_opt, skb);
+-
+- ireq = inet_rsk(req);
+- ireq->ir_v6_rmt_addr = ipv6_hdr(skb)->saddr;
+- ireq->ir_v6_loc_addr = ipv6_hdr(skb)->daddr;
+- if (!want_cookie || tmp_opt.tstamp_ok)
+- TCP_ECN_create_request(req, skb, sock_net(sk));
+-
+- ireq->ir_iif = sk->sk_bound_dev_if;
+- ireq->ir_mark = inet_request_mark(sk, skb);
+-
+- /* So that link locals have meaning */
+- if (!sk->sk_bound_dev_if &&
+- ipv6_addr_type(&ireq->ir_v6_rmt_addr) & IPV6_ADDR_LINKLOCAL)
+- ireq->ir_iif = inet6_iif(skb);
+-
+- if (!isn) {
+- if (ipv6_opt_accepted(sk, skb) ||
+- np->rxopt.bits.rxinfo || np->rxopt.bits.rxoinfo ||
+- np->rxopt.bits.rxhlim || np->rxopt.bits.rxohlim ||
+- np->repflow) {
+- atomic_inc(&skb->users);
+- ireq->pktopts = skb;
+- }
+-
+- if (want_cookie) {
+- isn = cookie_v6_init_sequence(sk, skb, &req->mss);
+- req->cookie_ts = tmp_opt.tstamp_ok;
+- goto have_isn;
+- }
+-
+- /* VJ's idea. We save last timestamp seen
+- * from the destination in peer table, when entering
+- * state TIME-WAIT, and check against it before
+- * accepting new connection request.
+- *
+- * If "isn" is not zero, this request hit alive
+- * timewait bucket, so that all the necessary checks
+- * are made in the function processing timewait state.
+- */
+- if (tmp_opt.saw_tstamp &&
+- tcp_death_row.sysctl_tw_recycle &&
+- (dst = inet6_csk_route_req(sk, &fl6, req)) != NULL) {
+- if (!tcp_peer_is_proven(req, dst, true)) {
+- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_PAWSPASSIVEREJECTED);
+- goto drop_and_release;
+- }
+- }
+- /* Kill the following clause, if you dislike this way. */
+- else if (!sysctl_tcp_syncookies &&
+- (sysctl_max_syn_backlog - inet_csk_reqsk_queue_len(sk) <
+- (sysctl_max_syn_backlog >> 2)) &&
+- !tcp_peer_is_proven(req, dst, false)) {
+- /* Without syncookies last quarter of
+- * backlog is filled with destinations,
+- * proven to be alive.
+- * It means that we continue to communicate
+- * to destinations, already remembered
+- * to the moment of synflood.
+- */
+- LIMIT_NETDEBUG(KERN_DEBUG "TCP: drop open request from %pI6/%u\n",
+- &ireq->ir_v6_rmt_addr, ntohs(tcp_hdr(skb)->source));
+- goto drop_and_release;
+- }
+-
+- isn = tcp_v6_init_sequence(skb);
+- }
+-have_isn:
+-
+- if (security_inet_conn_request(sk, skb, req))
+- goto drop_and_release;
+-
+- if (!dst && (dst = inet6_csk_route_req(sk, &fl6, req)) == NULL)
+- goto drop_and_free;
+-
+- tcp_rsk(req)->snt_isn = isn;
+- tcp_rsk(req)->snt_synack = tcp_time_stamp;
+- tcp_openreq_init_rwin(req, sk, dst);
+- fastopen = !want_cookie &&
+- tcp_try_fastopen(sk, skb, req, &foc, dst);
+- err = tcp_v6_send_synack(sk, dst, &fl6, req,
+- skb_get_queue_mapping(skb), &foc);
+- if (!fastopen) {
+- if (err || want_cookie)
+- goto drop_and_free;
+-
+- tcp_rsk(req)->listener = NULL;
+- inet6_csk_reqsk_queue_hash_add(sk, req, TCP_TIMEOUT_INIT);
+- }
+- return 0;
+-
+-drop_and_release:
+- dst_release(dst);
+-drop_and_free:
+- reqsk_free(req);
+ drop:
+ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENDROPS);
+ return 0; /* don't send reset */
+ }
+
+-static struct sock *tcp_v6_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
+- struct request_sock *req,
+- struct dst_entry *dst)
++struct sock *tcp_v6_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
++ struct request_sock *req,
++ struct dst_entry *dst)
+ {
+ struct inet_request_sock *ireq;
+ struct ipv6_pinfo *newnp, *np = inet6_sk(sk);
+@@ -1165,7 +1123,12 @@ static struct sock *tcp_v6_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
+
+ newsk->sk_v6_rcv_saddr = newnp->saddr;
+
+- inet_csk(newsk)->icsk_af_ops = &ipv6_mapped;
++#ifdef CONFIG_MPTCP
++ if (is_mptcp_enabled(newsk))
++ inet_csk(newsk)->icsk_af_ops = &mptcp_v6_mapped;
++ else
++#endif
++ inet_csk(newsk)->icsk_af_ops = &ipv6_mapped;
+ newsk->sk_backlog_rcv = tcp_v4_do_rcv;
+ #ifdef CONFIG_TCP_MD5SIG
+ newtp->af_specific = &tcp_sock_ipv6_mapped_specific;
+@@ -1329,7 +1292,7 @@ out:
+ * This is because we cannot sleep with the original spinlock
+ * held.
+ */
+-static int tcp_v6_do_rcv(struct sock *sk, struct sk_buff *skb)
++int tcp_v6_do_rcv(struct sock *sk, struct sk_buff *skb)
+ {
+ struct ipv6_pinfo *np = inet6_sk(sk);
+ struct tcp_sock *tp;
+@@ -1351,6 +1314,9 @@ static int tcp_v6_do_rcv(struct sock *sk, struct sk_buff *skb)
+ goto discard;
+ #endif
+
++ if (is_meta_sk(sk))
++ return mptcp_v6_do_rcv(sk, skb);
++
+ if (sk_filter(sk, skb))
+ goto discard;
+
+@@ -1472,7 +1438,7 @@ static int tcp_v6_rcv(struct sk_buff *skb)
+ {
+ const struct tcphdr *th;
+ const struct ipv6hdr *hdr;
+- struct sock *sk;
++ struct sock *sk, *meta_sk = NULL;
+ int ret;
+ struct net *net = dev_net(skb->dev);
+
+@@ -1503,18 +1469,43 @@ static int tcp_v6_rcv(struct sk_buff *skb)
+ TCP_SKB_CB(skb)->end_seq = (TCP_SKB_CB(skb)->seq + th->syn + th->fin +
+ skb->len - th->doff*4);
+ TCP_SKB_CB(skb)->ack_seq = ntohl(th->ack_seq);
++#ifdef CONFIG_MPTCP
++ TCP_SKB_CB(skb)->mptcp_flags = 0;
++ TCP_SKB_CB(skb)->dss_off = 0;
++#endif
+ TCP_SKB_CB(skb)->when = 0;
+ TCP_SKB_CB(skb)->ip_dsfield = ipv6_get_dsfield(hdr);
+ TCP_SKB_CB(skb)->sacked = 0;
+
+ sk = __inet6_lookup_skb(&tcp_hashinfo, skb, th->source, th->dest);
+- if (!sk)
+- goto no_tcp_socket;
+
+ process:
+- if (sk->sk_state == TCP_TIME_WAIT)
++ if (sk && sk->sk_state == TCP_TIME_WAIT)
+ goto do_time_wait;
+
++#ifdef CONFIG_MPTCP
++ if (!sk && th->syn && !th->ack) {
++ int ret = mptcp_lookup_join(skb, NULL);
++
++ if (ret < 0) {
++ tcp_v6_send_reset(NULL, skb);
++ goto discard_it;
++ } else if (ret > 0) {
++ return 0;
++ }
++ }
++
++ /* Is there a pending request sock for this segment ? */
++ if ((!sk || sk->sk_state == TCP_LISTEN) && mptcp_check_req(skb, net)) {
++ if (sk)
++ sock_put(sk);
++ return 0;
++ }
++#endif
++
++ if (!sk)
++ goto no_tcp_socket;
++
+ if (hdr->hop_limit < inet6_sk(sk)->min_hopcount) {
+ NET_INC_STATS_BH(net, LINUX_MIB_TCPMINTTLDROP);
+ goto discard_and_relse;
+@@ -1529,11 +1520,21 @@ process:
+ sk_mark_napi_id(sk, skb);
+ skb->dev = NULL;
+
+- bh_lock_sock_nested(sk);
++ if (mptcp(tcp_sk(sk))) {
++ meta_sk = mptcp_meta_sk(sk);
++
++ bh_lock_sock_nested(meta_sk);
++ if (sock_owned_by_user(meta_sk))
++ skb->sk = sk;
++ } else {
++ meta_sk = sk;
++ bh_lock_sock_nested(sk);
++ }
++
+ ret = 0;
+- if (!sock_owned_by_user(sk)) {
++ if (!sock_owned_by_user(meta_sk)) {
+ #ifdef CONFIG_NET_DMA
+- struct tcp_sock *tp = tcp_sk(sk);
++ struct tcp_sock *tp = tcp_sk(meta_sk);
+ if (!tp->ucopy.dma_chan && tp->ucopy.pinned_list)
+ tp->ucopy.dma_chan = net_dma_find_channel();
+ if (tp->ucopy.dma_chan)
+@@ -1541,16 +1542,17 @@ process:
+ else
+ #endif
+ {
+- if (!tcp_prequeue(sk, skb))
++ if (!tcp_prequeue(meta_sk, skb))
+ ret = tcp_v6_do_rcv(sk, skb);
+ }
+- } else if (unlikely(sk_add_backlog(sk, skb,
+- sk->sk_rcvbuf + sk->sk_sndbuf))) {
+- bh_unlock_sock(sk);
++ } else if (unlikely(sk_add_backlog(meta_sk, skb,
++ meta_sk->sk_rcvbuf + meta_sk->sk_sndbuf))) {
++ bh_unlock_sock(meta_sk);
+ NET_INC_STATS_BH(net, LINUX_MIB_TCPBACKLOGDROP);
+ goto discard_and_relse;
+ }
+- bh_unlock_sock(sk);
++
++ bh_unlock_sock(meta_sk);
+
+ sock_put(sk);
+ return ret ? -1 : 0;
+@@ -1607,6 +1609,18 @@ do_time_wait:
+ sk = sk2;
+ goto process;
+ }
++#ifdef CONFIG_MPTCP
++ if (th->syn && !th->ack) {
++ int ret = mptcp_lookup_join(skb, inet_twsk(sk));
++
++ if (ret < 0) {
++ tcp_v6_send_reset(NULL, skb);
++ goto discard_it;
++ } else if (ret > 0) {
++ return 0;
++ }
++ }
++#endif
+ /* Fall through to ACK */
+ }
+ case TCP_TW_ACK:
+@@ -1657,7 +1671,7 @@ static void tcp_v6_early_demux(struct sk_buff *skb)
+ }
+ }
+
+-static struct timewait_sock_ops tcp6_timewait_sock_ops = {
++struct timewait_sock_ops tcp6_timewait_sock_ops = {
+ .twsk_obj_size = sizeof(struct tcp6_timewait_sock),
+ .twsk_unique = tcp_twsk_unique,
+ .twsk_destructor = tcp_twsk_destructor,
+@@ -1730,7 +1744,12 @@ static int tcp_v6_init_sock(struct sock *sk)
+
+ tcp_init_sock(sk);
+
+- icsk->icsk_af_ops = &ipv6_specific;
++#ifdef CONFIG_MPTCP
++ if (is_mptcp_enabled(sk))
++ icsk->icsk_af_ops = &mptcp_v6_specific;
++ else
++#endif
++ icsk->icsk_af_ops = &ipv6_specific;
+
+ #ifdef CONFIG_TCP_MD5SIG
+ tcp_sk(sk)->af_specific = &tcp_sock_ipv6_specific;
+@@ -1739,7 +1758,7 @@ static int tcp_v6_init_sock(struct sock *sk)
+ return 0;
+ }
+
+-static void tcp_v6_destroy_sock(struct sock *sk)
++void tcp_v6_destroy_sock(struct sock *sk)
+ {
+ tcp_v4_destroy_sock(sk);
+ inet6_destroy_sock(sk);
+@@ -1924,12 +1943,28 @@ void tcp6_proc_exit(struct net *net)
+ static void tcp_v6_clear_sk(struct sock *sk, int size)
+ {
+ struct inet_sock *inet = inet_sk(sk);
++#ifdef CONFIG_MPTCP
++ struct tcp_sock *tp = tcp_sk(sk);
++ /* size_tk_table goes from the end of tk_table to the end of sk */
++ int size_tk_table = size - offsetof(struct tcp_sock, tk_table) -
++ sizeof(tp->tk_table);
++#endif
+
+ /* we do not want to clear pinet6 field, because of RCU lookups */
+ sk_prot_clear_nulls(sk, offsetof(struct inet_sock, pinet6));
+
+ size -= offsetof(struct inet_sock, pinet6) + sizeof(inet->pinet6);
++
++#ifdef CONFIG_MPTCP
++ /* We zero out only from pinet6 to tk_table */
++ size -= size_tk_table + sizeof(tp->tk_table);
++#endif
+ memset(&inet->pinet6 + 1, 0, size);
++
++#ifdef CONFIG_MPTCP
++ memset((char *)&tp->tk_table + sizeof(tp->tk_table), 0, size_tk_table);
++#endif
++
+ }
+
+ struct proto tcpv6_prot = {
+diff --git a/net/mptcp/Kconfig b/net/mptcp/Kconfig
+new file mode 100644
+index 000000000000..cdfc03adabf8
+--- /dev/null
++++ b/net/mptcp/Kconfig
+@@ -0,0 +1,115 @@
++#
++# MPTCP configuration
++#
++config MPTCP
++ bool "MPTCP protocol"
++ depends on (IPV6=y || IPV6=n)
++ ---help---
++ This replaces the normal TCP stack with a Multipath TCP stack,
++ able to use several paths at once.
++
++menuconfig MPTCP_PM_ADVANCED
++ bool "MPTCP: advanced path-manager control"
++ depends on MPTCP=y
++ ---help---
++ Support for selection of different path-managers. You should choose 'Y' here,
++ because otherwise you will not actively create new MPTCP-subflows.
++
++if MPTCP_PM_ADVANCED
++
++config MPTCP_FULLMESH
++ tristate "MPTCP Full-Mesh Path-Manager"
++ depends on MPTCP=y
++ ---help---
++ This path-management module will create a full-mesh among all IP-addresses.
++
++config MPTCP_NDIFFPORTS
++ tristate "MPTCP ndiff-ports"
++ depends on MPTCP=y
++ ---help---
++ This path-management module will create multiple subflows between the same
++ pair of IP-addresses, modifying the source-port. You can set the number
++ of subflows via the mptcp_ndiffports-sysctl.
++
++config MPTCP_BINDER
++ tristate "MPTCP Binder"
++ depends on (MPTCP=y)
++ ---help---
++ This path-management module works like ndiffports, and adds the sysctl
++ option to set the gateway (and/or path to) per each additional subflow
++ via Loose Source Routing (IPv4 only).
++
++choice
++ prompt "Default MPTCP Path-Manager"
++ default DEFAULT
++ help
++ Select the Path-Manager of your choice
++
++ config DEFAULT_FULLMESH
++ bool "Full mesh" if MPTCP_FULLMESH=y
++
++ config DEFAULT_NDIFFPORTS
++ bool "ndiff-ports" if MPTCP_NDIFFPORTS=y
++
++ config DEFAULT_BINDER
++ bool "binder" if MPTCP_BINDER=y
++
++ config DEFAULT_DUMMY
++ bool "Default"
++
++endchoice
++
++endif
++
++config DEFAULT_MPTCP_PM
++ string
++ default "default" if DEFAULT_DUMMY
++ default "fullmesh" if DEFAULT_FULLMESH
++ default "ndiffports" if DEFAULT_NDIFFPORTS
++ default "binder" if DEFAULT_BINDER
++ default "default"
++
++menuconfig MPTCP_SCHED_ADVANCED
++ bool "MPTCP: advanced scheduler control"
++ depends on MPTCP=y
++ ---help---
++ Support for selection of different schedulers. You should choose 'Y' here,
++ if you want to choose a different scheduler than the default one.
++
++if MPTCP_SCHED_ADVANCED
++
++config MPTCP_ROUNDROBIN
++ tristate "MPTCP Round-Robin"
++ depends on (MPTCP=y)
++ ---help---
++	  This is a very simple round-robin scheduler. It probably has bad
++	  performance, but it might be interesting for researchers.
++
++choice
++ prompt "Default MPTCP Scheduler"
++ default DEFAULT
++ help
++ Select the Scheduler of your choice
++
++ config DEFAULT_SCHEDULER
++ bool "Default"
++ ---help---
++ This is the default scheduler, sending first on the subflow
++ with the lowest RTT.
++
++ config DEFAULT_ROUNDROBIN
++ bool "Round-Robin" if MPTCP_ROUNDROBIN=y
++ ---help---
++	    This is the round-robin scheduler, sending in a round-robin
++	    fashion.
++
++endchoice
++endif
++
++config DEFAULT_MPTCP_SCHED
++ string
++ depends on (MPTCP=y)
++ default "default" if DEFAULT_SCHEDULER
++ default "roundrobin" if DEFAULT_ROUNDROBIN
++ default "default"
++
+diff --git a/net/mptcp/Makefile b/net/mptcp/Makefile
+new file mode 100644
+index 000000000000..35561a7012e3
+--- /dev/null
++++ b/net/mptcp/Makefile
+@@ -0,0 +1,20 @@
++#
++## Makefile for MultiPath TCP support code.
++#
++#
++
++obj-$(CONFIG_MPTCP) += mptcp.o
++
++mptcp-y := mptcp_ctrl.o mptcp_ipv4.o mptcp_ofo_queue.o mptcp_pm.o \
++ mptcp_output.o mptcp_input.o mptcp_sched.o
++
++obj-$(CONFIG_TCP_CONG_COUPLED) += mptcp_coupled.o
++obj-$(CONFIG_TCP_CONG_OLIA) += mptcp_olia.o
++obj-$(CONFIG_TCP_CONG_WVEGAS) += mptcp_wvegas.o
++obj-$(CONFIG_MPTCP_FULLMESH) += mptcp_fullmesh.o
++obj-$(CONFIG_MPTCP_NDIFFPORTS) += mptcp_ndiffports.o
++obj-$(CONFIG_MPTCP_BINDER) += mptcp_binder.o
++obj-$(CONFIG_MPTCP_ROUNDROBIN) += mptcp_rr.o
++
++mptcp-$(subst m,y,$(CONFIG_IPV6)) += mptcp_ipv6.o
++
+diff --git a/net/mptcp/mptcp_binder.c b/net/mptcp/mptcp_binder.c
+new file mode 100644
+index 000000000000..95d8da560715
+--- /dev/null
++++ b/net/mptcp/mptcp_binder.c
+@@ -0,0 +1,487 @@
++#include <linux/module.h>
++
++#include <net/mptcp.h>
++#include <net/mptcp_v4.h>
++
++#include <linux/route.h>
++#include <linux/inet.h>
++#include <linux/mroute.h>
++#include <linux/spinlock_types.h>
++#include <net/inet_ecn.h>
++#include <net/route.h>
++#include <net/xfrm.h>
++#include <net/compat.h>
++#include <linux/slab.h>
++
++#define MPTCP_GW_MAX_LISTS 10
++#define MPTCP_GW_LIST_MAX_LEN 6
++#define MPTCP_GW_SYSCTL_MAX_LEN (15 * MPTCP_GW_LIST_MAX_LEN * \
++ MPTCP_GW_MAX_LISTS)
++
++struct mptcp_gw_list {
++ struct in_addr list[MPTCP_GW_MAX_LISTS][MPTCP_GW_LIST_MAX_LEN];
++ u8 len[MPTCP_GW_MAX_LISTS];
++};
++
++struct binder_priv {
++ /* Worker struct for subflow establishment */
++ struct work_struct subflow_work;
++
++ struct mptcp_cb *mpcb;
++
++ /* Prevent multiple sub-sockets concurrently iterating over sockets */
++ spinlock_t *flow_lock;
++};
++
++static struct mptcp_gw_list *mptcp_gws;
++static rwlock_t mptcp_gws_lock;
++
++static int mptcp_binder_ndiffports __read_mostly = 1;
++
++static char sysctl_mptcp_binder_gateways[MPTCP_GW_SYSCTL_MAX_LEN] __read_mostly;
++
++static int mptcp_get_avail_list_ipv4(struct sock *sk)
++{
++ int i, j, list_taken, opt_ret, opt_len;
++ unsigned char *opt_ptr, *opt_end_ptr, opt[MAX_IPOPTLEN];
++
++ for (i = 0; i < MPTCP_GW_MAX_LISTS; ++i) {
++ if (mptcp_gws->len[i] == 0)
++ goto error;
++
++ mptcp_debug("mptcp_get_avail_list_ipv4: List %i\n", i);
++ list_taken = 0;
++
++ /* Loop through all sub-sockets in this connection */
++ mptcp_for_each_sk(tcp_sk(sk)->mpcb, sk) {
++ mptcp_debug("mptcp_get_avail_list_ipv4: Next sock\n");
++
++ /* Reset length and options buffer, then retrieve
++ * from socket
++ */
++ opt_len = MAX_IPOPTLEN;
++ memset(opt, 0, MAX_IPOPTLEN);
++ opt_ret = ip_getsockopt(sk, IPPROTO_IP,
++ IP_OPTIONS, opt, &opt_len);
++ if (opt_ret < 0) {
++ mptcp_debug(KERN_ERR "%s: MPTCP subsocket getsockopt() IP_OPTIONS failed, error %d\n",
++ __func__, opt_ret);
++ goto error;
++ }
++
++ /* If socket has no options, it has no stake in this list */
++ if (opt_len <= 0)
++ continue;
++
++ /* Iterate options buffer */
++ for (opt_ptr = &opt[0]; opt_ptr < &opt[opt_len]; opt_ptr++) {
++ if (*opt_ptr == IPOPT_LSRR) {
++ mptcp_debug("mptcp_get_avail_list_ipv4: LSRR options found\n");
++ goto sock_lsrr;
++ }
++ }
++ continue;
++
++sock_lsrr:
++ /* Pointer to the 2nd to last address */
++ opt_end_ptr = opt_ptr+(*(opt_ptr+1))-4;
++
++ /* Addresses start 3 bytes after type offset */
++ opt_ptr += 3;
++ j = 0;
++
++ /* Different length lists cannot be the same */
++ if ((opt_end_ptr-opt_ptr)/4 != mptcp_gws->len[i])
++ continue;
++
++ /* Iterate if we are still inside options list
++ * and sysctl list
++ */
++ while (opt_ptr < opt_end_ptr && j < mptcp_gws->len[i]) {
++ /* If there is a different address, this list must
++ * not be set on this socket
++ */
++ if (memcmp(&mptcp_gws->list[i][j], opt_ptr, 4))
++ break;
++
++ /* Jump 4 bytes to next address */
++ opt_ptr += 4;
++ j++;
++ }
++
++ /* Reached the end without a differing address, lists
++ * are therefore identical.
++ */
++ if (j == mptcp_gws->len[i]) {
++ mptcp_debug("mptcp_get_avail_list_ipv4: List already used\n");
++ list_taken = 1;
++ break;
++ }
++ }
++
++ /* Free list found if not taken by a socket */
++ if (!list_taken) {
++ mptcp_debug("mptcp_get_avail_list_ipv4: List free\n");
++ break;
++ }
++ }
++
++ if (i >= MPTCP_GW_MAX_LISTS)
++ goto error;
++
++ return i;
++error:
++ return -1;
++}
++
++/* The list of addresses is parsed each time a new connection is opened,
++ * to make sure it's up to date. In case of error, all the lists are
++ * marked as unavailable and the subflow's fingerprint is set to 0.
++ */
++static void mptcp_v4_add_lsrr(struct sock *sk, struct in_addr addr)
++{
++ int i, j, ret;
++ unsigned char opt[MAX_IPOPTLEN] = {0};
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct binder_priv *fmp = (struct binder_priv *)&tp->mpcb->mptcp_pm[0];
++
++ /* Read lock: multiple sockets can read LSRR addresses at the same
++ * time, but writes are done in mutual exclusion.
++ * Spin lock: must search for free list for one socket at a time, or
++ * multiple sockets could take the same list.
++ */
++ read_lock(&mptcp_gws_lock);
++ spin_lock(fmp->flow_lock);
++
++ i = mptcp_get_avail_list_ipv4(sk);
++
++ /* Execution enters here only if a free path is found.
++ */
++ if (i >= 0) {
++ opt[0] = IPOPT_NOP;
++ opt[1] = IPOPT_LSRR;
++ opt[2] = sizeof(mptcp_gws->list[i][0].s_addr) *
++ (mptcp_gws->len[i] + 1) + 3;
++ opt[3] = IPOPT_MINOFF;
++ for (j = 0; j < mptcp_gws->len[i]; ++j)
++ memcpy(opt + 4 +
++ (j * sizeof(mptcp_gws->list[i][0].s_addr)),
++ &mptcp_gws->list[i][j].s_addr,
++ sizeof(mptcp_gws->list[i][0].s_addr));
++ /* Final destination must be part of IP_OPTIONS parameter. */
++ memcpy(opt + 4 + (j * sizeof(addr.s_addr)), &addr.s_addr,
++ sizeof(addr.s_addr));
++
++ /* setsockopt must be inside the lock, otherwise another
++ * subflow could fail to see that we have taken a list.
++ */
++ ret = ip_setsockopt(sk, IPPROTO_IP, IP_OPTIONS, opt,
++ 4 + sizeof(mptcp_gws->list[i][0].s_addr)
++ * (mptcp_gws->len[i] + 1));
++
++ if (ret < 0) {
++ mptcp_debug(KERN_ERR "%s: MPTCP subsock setsockopt() IP_OPTIONS failed, error %d\n",
++ __func__, ret);
++ }
++ }
++
++ spin_unlock(fmp->flow_lock);
++ read_unlock(&mptcp_gws_lock);
++
++ return;
++}
++
++/* Parses gateways string for a list of paths to different
++ * gateways, and stores them for use with the Loose Source Routing (LSRR)
++ * socket option. Each list must have "," separated addresses, and the lists
++ * themselves must be separated by "-". Returns -1 in case one or more of the
++ * addresses is not a valid ipv4/6 address.
++ */
++static int mptcp_parse_gateway_ipv4(char *gateways)
++{
++ int i, j, k, ret;
++ char *tmp_string = NULL;
++ struct in_addr tmp_addr;
++
++ tmp_string = kzalloc(16, GFP_KERNEL);
++ if (tmp_string == NULL)
++ return -ENOMEM;
++
++ write_lock(&mptcp_gws_lock);
++
++ memset(mptcp_gws, 0, sizeof(struct mptcp_gw_list));
++
++	/* A temporary string is used since in4_pton needs a null-terminated string
++ * but we do not want to modify the sysctl for obvious reasons.
++ * i will iterate over the SYSCTL string, j will iterate over the
++ * temporary string where each IP is copied into, k will iterate over
++ * the IPs in each list.
++ */
++ for (i = j = k = 0;
++ i < MPTCP_GW_SYSCTL_MAX_LEN && k < MPTCP_GW_MAX_LISTS;
++ ++i) {
++ if (gateways[i] == '-' || gateways[i] == ',' || gateways[i] == '\0') {
++ /* If the temp IP is empty and the current list is
++ * empty, we are done.
++ */
++ if (j == 0 && mptcp_gws->len[k] == 0)
++ break;
++
++ /* Terminate the temp IP string, then if it is
++ * non-empty parse the IP and copy it.
++ */
++ tmp_string[j] = '\0';
++ if (j > 0) {
++ mptcp_debug("mptcp_parse_gateway_list tmp: %s i: %d\n", tmp_string, i);
++
++ ret = in4_pton(tmp_string, strlen(tmp_string),
++ (u8 *)&tmp_addr.s_addr, '\0',
++ NULL);
++
++ if (ret) {
++ mptcp_debug("mptcp_parse_gateway_list ret: %d s_addr: %pI4\n",
++ ret,
++ &tmp_addr.s_addr);
++ memcpy(&mptcp_gws->list[k][mptcp_gws->len[k]].s_addr,
++ &tmp_addr.s_addr,
++ sizeof(tmp_addr.s_addr));
++ mptcp_gws->len[k]++;
++ j = 0;
++ tmp_string[j] = '\0';
++ /* Since we can't impose a limit to
++ * what the user can input, make sure
++ * there are not too many IPs in the
++ * SYSCTL string.
++ */
++ if (mptcp_gws->len[k] > MPTCP_GW_LIST_MAX_LEN) {
++ mptcp_debug("mptcp_parse_gateway_list too many members in list %i: max %i\n",
++ k,
++ MPTCP_GW_LIST_MAX_LEN);
++ goto error;
++ }
++ } else {
++ goto error;
++ }
++ }
++
++ if (gateways[i] == '-' || gateways[i] == '\0')
++ ++k;
++ } else {
++ tmp_string[j] = gateways[i];
++ ++j;
++ }
++ }
++
++ /* Number of flows is number of gateway lists plus master flow */
++ mptcp_binder_ndiffports = k+1;
++
++ write_unlock(&mptcp_gws_lock);
++ kfree(tmp_string);
++
++ return 0;
++
++error:
++ memset(mptcp_gws, 0, sizeof(struct mptcp_gw_list));
++ memset(gateways, 0, sizeof(char) * MPTCP_GW_SYSCTL_MAX_LEN);
++ write_unlock(&mptcp_gws_lock);
++ kfree(tmp_string);
++ return -1;
++}
++
++/**
++ * Create all new subflows, by calling mptcp_initX_subsockets
++ *
++ * This function uses a goto next_subflow so the lock can be released between
++ * subflow creations, giving other processes a chance to do some work on the
++ * socket and potentially finish the communication.
++ **/
++static void create_subflow_worker(struct work_struct *work)
++{
++ const struct binder_priv *pm_priv = container_of(work,
++ struct binder_priv,
++ subflow_work);
++ struct mptcp_cb *mpcb = pm_priv->mpcb;
++ struct sock *meta_sk = mpcb->meta_sk;
++ int iter = 0;
++
++next_subflow:
++ if (iter) {
++ release_sock(meta_sk);
++ mutex_unlock(&mpcb->mpcb_mutex);
++
++ cond_resched();
++ }
++ mutex_lock(&mpcb->mpcb_mutex);
++ lock_sock_nested(meta_sk, SINGLE_DEPTH_NESTING);
++
++ iter++;
++
++ if (sock_flag(meta_sk, SOCK_DEAD))
++ goto exit;
++
++ if (mpcb->master_sk &&
++ !tcp_sk(mpcb->master_sk)->mptcp->fully_established)
++ goto exit;
++
++ if (mptcp_binder_ndiffports > iter &&
++ mptcp_binder_ndiffports > mpcb->cnt_subflows) {
++ struct mptcp_loc4 loc;
++ struct mptcp_rem4 rem;
++
++ loc.addr.s_addr = inet_sk(meta_sk)->inet_saddr;
++ loc.loc4_id = 0;
++ loc.low_prio = 0;
++
++ rem.addr.s_addr = inet_sk(meta_sk)->inet_daddr;
++ rem.port = inet_sk(meta_sk)->inet_dport;
++ rem.rem4_id = 0; /* Default 0 */
++
++ mptcp_init4_subsockets(meta_sk, &loc, &rem);
++
++ goto next_subflow;
++ }
++
++exit:
++ release_sock(meta_sk);
++ mutex_unlock(&mpcb->mpcb_mutex);
++ sock_put(meta_sk);
++}
++
++static void binder_new_session(const struct sock *meta_sk)
++{
++ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct binder_priv *fmp = (struct binder_priv *)&mpcb->mptcp_pm[0];
++ static DEFINE_SPINLOCK(flow_lock);
++
++#if IS_ENABLED(CONFIG_IPV6)
++ if (meta_sk->sk_family == AF_INET6 &&
++ !mptcp_v6_is_v4_mapped(meta_sk)) {
++ mptcp_fallback_default(mpcb);
++ return;
++ }
++#endif
++
++ /* Initialize workqueue-struct */
++ INIT_WORK(&fmp->subflow_work, create_subflow_worker);
++ fmp->mpcb = mpcb;
++
++ fmp->flow_lock = &flow_lock;
++}
++
++static void binder_create_subflows(struct sock *meta_sk)
++{
++ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct binder_priv *pm_priv = (struct binder_priv *)&mpcb->mptcp_pm[0];
++
++ if (mpcb->infinite_mapping_snd || mpcb->infinite_mapping_rcv ||
++ mpcb->send_infinite_mapping ||
++ mpcb->server_side || sock_flag(meta_sk, SOCK_DEAD))
++ return;
++
++ if (!work_pending(&pm_priv->subflow_work)) {
++ sock_hold(meta_sk);
++ queue_work(mptcp_wq, &pm_priv->subflow_work);
++ }
++}
++
++static int binder_get_local_id(sa_family_t family, union inet_addr *addr,
++ struct net *net, bool *low_prio)
++{
++ return 0;
++}
++
++/* Callback functions, executed when sysctl mptcp.mptcp_gateways is updated.
++ * Inspired from proc_tcp_congestion_control().
++ */
++static int proc_mptcp_gateways(ctl_table *ctl, int write,
++ void __user *buffer, size_t *lenp,
++ loff_t *ppos)
++{
++ int ret;
++ ctl_table tbl = {
++ .maxlen = MPTCP_GW_SYSCTL_MAX_LEN,
++ };
++
++ if (write) {
++ tbl.data = kzalloc(MPTCP_GW_SYSCTL_MAX_LEN, GFP_KERNEL);
++ if (tbl.data == NULL)
++ return -1;
++ ret = proc_dostring(&tbl, write, buffer, lenp, ppos);
++ if (ret == 0) {
++ ret = mptcp_parse_gateway_ipv4(tbl.data);
++ memcpy(ctl->data, tbl.data, MPTCP_GW_SYSCTL_MAX_LEN);
++ }
++ kfree(tbl.data);
++ } else {
++ ret = proc_dostring(ctl, write, buffer, lenp, ppos);
++ }
++
++
++ return ret;
++}
++
++static struct mptcp_pm_ops binder __read_mostly = {
++ .new_session = binder_new_session,
++ .fully_established = binder_create_subflows,
++ .get_local_id = binder_get_local_id,
++ .init_subsocket_v4 = mptcp_v4_add_lsrr,
++ .name = "binder",
++ .owner = THIS_MODULE,
++};
++
++static struct ctl_table binder_table[] = {
++ {
++ .procname = "mptcp_binder_gateways",
++ .data = &sysctl_mptcp_binder_gateways,
++ .maxlen = sizeof(char) * MPTCP_GW_SYSCTL_MAX_LEN,
++ .mode = 0644,
++ .proc_handler = &proc_mptcp_gateways
++ },
++ { }
++};
++
++struct ctl_table_header *mptcp_sysctl_binder;
++
++/* General initialization of MPTCP_PM */
++static int __init binder_register(void)
++{
++ mptcp_gws = kzalloc(sizeof(*mptcp_gws), GFP_KERNEL);
++ if (!mptcp_gws)
++ return -ENOMEM;
++
++ rwlock_init(&mptcp_gws_lock);
++
++ BUILD_BUG_ON(sizeof(struct binder_priv) > MPTCP_PM_SIZE);
++
++ mptcp_sysctl_binder = register_net_sysctl(&init_net, "net/mptcp",
++ binder_table);
++ if (!mptcp_sysctl_binder)
++ goto sysctl_fail;
++
++ if (mptcp_register_path_manager(&binder))
++ goto pm_failed;
++
++ return 0;
++
++pm_failed:
++ unregister_net_sysctl_table(mptcp_sysctl_binder);
++sysctl_fail:
++ kfree(mptcp_gws);
++
++ return -1;
++}
++
++static void binder_unregister(void)
++{
++ mptcp_unregister_path_manager(&binder);
++ unregister_net_sysctl_table(mptcp_sysctl_binder);
++ kfree(mptcp_gws);
++}
++
++module_init(binder_register);
++module_exit(binder_unregister);
++
++MODULE_AUTHOR("Luca Boccassi, Duncan Eastoe, Christoph Paasch (ndiffports)");
++MODULE_LICENSE("GPL");
++MODULE_DESCRIPTION("BINDER MPTCP");
++MODULE_VERSION("0.1");
+diff --git a/net/mptcp/mptcp_coupled.c b/net/mptcp/mptcp_coupled.c
+new file mode 100644
+index 000000000000..5d761164eb85
+--- /dev/null
++++ b/net/mptcp/mptcp_coupled.c
+@@ -0,0 +1,270 @@
++/*
++ * MPTCP implementation - Linked Increase congestion control Algorithm (LIA)
++ *
++ * Initial Design & Implementation:
++ * Sébastien Barré <sebastien.barre@uclouvain.be>
++ *
++ * Current Maintainer & Author:
++ * Christoph Paasch <christoph.paasch@uclouvain.be>
++ *
++ * Additional authors:
++ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
++ * Gregory Detal <gregory.detal@uclouvain.be>
++ * Fabien Duchêne <fabien.duchene@uclouvain.be>
++ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
++ * Lavkesh Lahngir <lavkesh51@gmail.com>
++ * Andreas Ripke <ripke@neclab.eu>
++ * Vlad Dogaru <vlad.dogaru@intel.com>
++ * Octavian Purdila <octavian.purdila@intel.com>
++ * John Ronan <jronan@tssg.org>
++ * Catalin Nicutar <catalin.nicutar@gmail.com>
++ * Brandon Heller <brandonh@stanford.edu>
++ *
++ *
++ * This program is free software; you can redistribute it and/or
++ * modify it under the terms of the GNU General Public License
++ * as published by the Free Software Foundation; either version
++ * 2 of the License, or (at your option) any later version.
++ */
++#include <net/tcp.h>
++#include <net/mptcp.h>
++
++#include <linux/module.h>
++
++/* Scaling is done in the numerator with alpha_scale_num and in the denominator
++ * with alpha_scale_den.
++ *
++ * To downscale, we just need to use alpha_scale.
++ *
++ * We have: alpha_scale = alpha_scale_num / (alpha_scale_den ^ 2)
++ */
++static int alpha_scale_den = 10;
++static int alpha_scale_num = 32;
++static int alpha_scale = 12;
++
++struct mptcp_ccc {
++ u64 alpha;
++ bool forced_update;
++};
++
++static inline int mptcp_ccc_sk_can_send(const struct sock *sk)
++{
++ return mptcp_sk_can_send(sk) && tcp_sk(sk)->srtt_us;
++}
++
++static inline u64 mptcp_get_alpha(const struct sock *meta_sk)
++{
++ return ((struct mptcp_ccc *)inet_csk_ca(meta_sk))->alpha;
++}
++
++static inline void mptcp_set_alpha(const struct sock *meta_sk, u64 alpha)
++{
++ ((struct mptcp_ccc *)inet_csk_ca(meta_sk))->alpha = alpha;
++}
++
++static inline u64 mptcp_ccc_scale(u32 val, int scale)
++{
++ return (u64) val << scale;
++}
++
++static inline bool mptcp_get_forced(const struct sock *meta_sk)
++{
++ return ((struct mptcp_ccc *)inet_csk_ca(meta_sk))->forced_update;
++}
++
++static inline void mptcp_set_forced(const struct sock *meta_sk, bool force)
++{
++ ((struct mptcp_ccc *)inet_csk_ca(meta_sk))->forced_update = force;
++}
++
++static void mptcp_ccc_recalc_alpha(const struct sock *sk)
++{
++ const struct mptcp_cb *mpcb = tcp_sk(sk)->mpcb;
++ const struct sock *sub_sk;
++ int best_cwnd = 0, best_rtt = 0, can_send = 0;
++ u64 max_numerator = 0, sum_denominator = 0, alpha = 1;
++
++ if (!mpcb)
++ return;
++
++ /* Only one subflow left - fall back to normal reno-behavior
++ * (set alpha to 1)
++ */
++ if (mpcb->cnt_established <= 1)
++ goto exit;
++
++ /* Do regular alpha-calculation for multiple subflows */
++
++ /* Find the max numerator of the alpha-calculation */
++ mptcp_for_each_sk(mpcb, sub_sk) {
++ struct tcp_sock *sub_tp = tcp_sk(sub_sk);
++ u64 tmp;
++
++ if (!mptcp_ccc_sk_can_send(sub_sk))
++ continue;
++
++ can_send++;
++
++ /* We need to look for the path, that provides the max-value.
++ * Integer-overflow is not possible here, because
++ * tmp will be in u64.
++ */
++ tmp = div64_u64(mptcp_ccc_scale(sub_tp->snd_cwnd,
++ alpha_scale_num), (u64)sub_tp->srtt_us * sub_tp->srtt_us);
++
++ if (tmp >= max_numerator) {
++ max_numerator = tmp;
++ best_cwnd = sub_tp->snd_cwnd;
++ best_rtt = sub_tp->srtt_us;
++ }
++ }
++
++ /* No subflow is able to send - we don't care anymore */
++ if (unlikely(!can_send))
++ goto exit;
++
++ /* Calculate the denominator */
++ mptcp_for_each_sk(mpcb, sub_sk) {
++ struct tcp_sock *sub_tp = tcp_sk(sub_sk);
++
++ if (!mptcp_ccc_sk_can_send(sub_sk))
++ continue;
++
++ sum_denominator += div_u64(
++ mptcp_ccc_scale(sub_tp->snd_cwnd,
++ alpha_scale_den) * best_rtt,
++ sub_tp->srtt_us);
++ }
++ sum_denominator *= sum_denominator;
++ if (unlikely(!sum_denominator)) {
++ pr_err("%s: sum_denominator == 0, cnt_established:%d\n",
++ __func__, mpcb->cnt_established);
++ mptcp_for_each_sk(mpcb, sub_sk) {
++ struct tcp_sock *sub_tp = tcp_sk(sub_sk);
++			pr_err("%s: pi:%d, state:%d, rtt:%u, cwnd: %u\n",
++ __func__, sub_tp->mptcp->path_index,
++ sub_sk->sk_state, sub_tp->srtt_us,
++ sub_tp->snd_cwnd);
++ }
++ }
++
++ alpha = div64_u64(mptcp_ccc_scale(best_cwnd, alpha_scale_num), sum_denominator);
++
++ if (unlikely(!alpha))
++ alpha = 1;
++
++exit:
++ mptcp_set_alpha(mptcp_meta_sk(sk), alpha);
++}
++
++static void mptcp_ccc_init(struct sock *sk)
++{
++ if (mptcp(tcp_sk(sk))) {
++ mptcp_set_forced(mptcp_meta_sk(sk), 0);
++ mptcp_set_alpha(mptcp_meta_sk(sk), 1);
++ }
++ /* If we do not mptcp, behave like reno: return */
++}
++
++static void mptcp_ccc_cwnd_event(struct sock *sk, enum tcp_ca_event event)
++{
++ if (event == CA_EVENT_LOSS)
++ mptcp_ccc_recalc_alpha(sk);
++}
++
++static void mptcp_ccc_set_state(struct sock *sk, u8 ca_state)
++{
++ if (!mptcp(tcp_sk(sk)))
++ return;
++
++ mptcp_set_forced(mptcp_meta_sk(sk), 1);
++}
++
++static void mptcp_ccc_cong_avoid(struct sock *sk, u32 ack, u32 acked)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ const struct mptcp_cb *mpcb = tp->mpcb;
++ int snd_cwnd;
++
++ if (!mptcp(tp)) {
++ tcp_reno_cong_avoid(sk, ack, acked);
++ return;
++ }
++
++ if (!tcp_is_cwnd_limited(sk))
++ return;
++
++ if (tp->snd_cwnd <= tp->snd_ssthresh) {
++ /* In "safe" area, increase. */
++ tcp_slow_start(tp, acked);
++ mptcp_ccc_recalc_alpha(sk);
++ return;
++ }
++
++ if (mptcp_get_forced(mptcp_meta_sk(sk))) {
++ mptcp_ccc_recalc_alpha(sk);
++ mptcp_set_forced(mptcp_meta_sk(sk), 0);
++ }
++
++ if (mpcb->cnt_established > 1) {
++ u64 alpha = mptcp_get_alpha(mptcp_meta_sk(sk));
++
++ /* This may happen, if at the initialization, the mpcb
++ * was not yet attached to the sock, and thus
++ * initializing alpha failed.
++ */
++ if (unlikely(!alpha))
++ alpha = 1;
++
++ snd_cwnd = (int) div_u64 ((u64) mptcp_ccc_scale(1, alpha_scale),
++ alpha);
++
++ /* snd_cwnd_cnt >= max (scale * tot_cwnd / alpha, cwnd)
++ * Thus, we select here the max value.
++ */
++ if (snd_cwnd < tp->snd_cwnd)
++ snd_cwnd = tp->snd_cwnd;
++ } else {
++ snd_cwnd = tp->snd_cwnd;
++ }
++
++ if (tp->snd_cwnd_cnt >= snd_cwnd) {
++ if (tp->snd_cwnd < tp->snd_cwnd_clamp) {
++ tp->snd_cwnd++;
++ mptcp_ccc_recalc_alpha(sk);
++ }
++
++ tp->snd_cwnd_cnt = 0;
++ } else {
++ tp->snd_cwnd_cnt++;
++ }
++}
++
++static struct tcp_congestion_ops mptcp_ccc = {
++ .init = mptcp_ccc_init,
++ .ssthresh = tcp_reno_ssthresh,
++ .cong_avoid = mptcp_ccc_cong_avoid,
++ .cwnd_event = mptcp_ccc_cwnd_event,
++ .set_state = mptcp_ccc_set_state,
++ .owner = THIS_MODULE,
++ .name = "lia",
++};
++
++static int __init mptcp_ccc_register(void)
++{
++ BUILD_BUG_ON(sizeof(struct mptcp_ccc) > ICSK_CA_PRIV_SIZE);
++ return tcp_register_congestion_control(&mptcp_ccc);
++}
++
++static void __exit mptcp_ccc_unregister(void)
++{
++ tcp_unregister_congestion_control(&mptcp_ccc);
++}
++
++module_init(mptcp_ccc_register);
++module_exit(mptcp_ccc_unregister);
++
++MODULE_AUTHOR("Christoph Paasch, Sébastien Barré");
++MODULE_LICENSE("GPL");
++MODULE_DESCRIPTION("MPTCP LINKED INCREASE CONGESTION CONTROL ALGORITHM");
++MODULE_VERSION("0.1");
+diff --git a/net/mptcp/mptcp_ctrl.c b/net/mptcp/mptcp_ctrl.c
+new file mode 100644
+index 000000000000..28dfa0479f5e
+--- /dev/null
++++ b/net/mptcp/mptcp_ctrl.c
+@@ -0,0 +1,2401 @@
++/*
++ * MPTCP implementation - MPTCP-control
++ *
++ * Initial Design & Implementation:
++ * Sébastien Barré <sebastien.barre@uclouvain.be>
++ *
++ * Current Maintainer & Author:
++ * Christoph Paasch <christoph.paasch@uclouvain.be>
++ *
++ * Additional authors:
++ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
++ * Gregory Detal <gregory.detal@uclouvain.be>
++ * Fabien Duchêne <fabien.duchene@uclouvain.be>
++ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
++ * Lavkesh Lahngir <lavkesh51@gmail.com>
++ * Andreas Ripke <ripke@neclab.eu>
++ * Vlad Dogaru <vlad.dogaru@intel.com>
++ * Octavian Purdila <octavian.purdila@intel.com>
++ * John Ronan <jronan@tssg.org>
++ * Catalin Nicutar <catalin.nicutar@gmail.com>
++ * Brandon Heller <brandonh@stanford.edu>
++ *
++ *
++ * This program is free software; you can redistribute it and/or
++ * modify it under the terms of the GNU General Public License
++ * as published by the Free Software Foundation; either version
++ * 2 of the License, or (at your option) any later version.
++ */
++
++#include <net/inet_common.h>
++#include <net/inet6_hashtables.h>
++#include <net/ipv6.h>
++#include <net/ip6_checksum.h>
++#include <net/mptcp.h>
++#include <net/mptcp_v4.h>
++#if IS_ENABLED(CONFIG_IPV6)
++#include <net/ip6_route.h>
++#include <net/mptcp_v6.h>
++#endif
++#include <net/sock.h>
++#include <net/tcp.h>
++#include <net/tcp_states.h>
++#include <net/transp_v6.h>
++#include <net/xfrm.h>
++
++#include <linux/cryptohash.h>
++#include <linux/kconfig.h>
++#include <linux/module.h>
++#include <linux/netpoll.h>
++#include <linux/list.h>
++#include <linux/jhash.h>
++#include <linux/tcp.h>
++#include <linux/net.h>
++#include <linux/in.h>
++#include <linux/random.h>
++#include <linux/inetdevice.h>
++#include <linux/workqueue.h>
++#include <linux/atomic.h>
++#include <linux/sysctl.h>
++
++static struct kmem_cache *mptcp_sock_cache __read_mostly;
++static struct kmem_cache *mptcp_cb_cache __read_mostly;
++static struct kmem_cache *mptcp_tw_cache __read_mostly;
++
++int sysctl_mptcp_enabled __read_mostly = 1;
++int sysctl_mptcp_checksum __read_mostly = 1;
++int sysctl_mptcp_debug __read_mostly;
++EXPORT_SYMBOL(sysctl_mptcp_debug);
++int sysctl_mptcp_syn_retries __read_mostly = 3;
++
++bool mptcp_init_failed __read_mostly;
++
++struct static_key mptcp_static_key = STATIC_KEY_INIT_FALSE;
++EXPORT_SYMBOL(mptcp_static_key);
++
++static int proc_mptcp_path_manager(ctl_table *ctl, int write,
++ void __user *buffer, size_t *lenp,
++ loff_t *ppos)
++{
++ char val[MPTCP_PM_NAME_MAX];
++ ctl_table tbl = {
++ .data = val,
++ .maxlen = MPTCP_PM_NAME_MAX,
++ };
++ int ret;
++
++ mptcp_get_default_path_manager(val);
++
++ ret = proc_dostring(&tbl, write, buffer, lenp, ppos);
++ if (write && ret == 0)
++ ret = mptcp_set_default_path_manager(val);
++ return ret;
++}
++
++static int proc_mptcp_scheduler(ctl_table *ctl, int write,
++ void __user *buffer, size_t *lenp,
++ loff_t *ppos)
++{
++ char val[MPTCP_SCHED_NAME_MAX];
++ ctl_table tbl = {
++ .data = val,
++ .maxlen = MPTCP_SCHED_NAME_MAX,
++ };
++ int ret;
++
++ mptcp_get_default_scheduler(val);
++
++ ret = proc_dostring(&tbl, write, buffer, lenp, ppos);
++ if (write && ret == 0)
++ ret = mptcp_set_default_scheduler(val);
++ return ret;
++}
++
++static struct ctl_table mptcp_table[] = {
++ {
++ .procname = "mptcp_enabled",
++ .data = &sysctl_mptcp_enabled,
++ .maxlen = sizeof(int),
++ .mode = 0644,
++ .proc_handler = &proc_dointvec
++ },
++ {
++ .procname = "mptcp_checksum",
++ .data = &sysctl_mptcp_checksum,
++ .maxlen = sizeof(int),
++ .mode = 0644,
++ .proc_handler = &proc_dointvec
++ },
++ {
++ .procname = "mptcp_debug",
++ .data = &sysctl_mptcp_debug,
++ .maxlen = sizeof(int),
++ .mode = 0644,
++ .proc_handler = &proc_dointvec
++ },
++ {
++ .procname = "mptcp_syn_retries",
++ .data = &sysctl_mptcp_syn_retries,
++ .maxlen = sizeof(int),
++ .mode = 0644,
++ .proc_handler = &proc_dointvec
++ },
++ {
++ .procname = "mptcp_path_manager",
++ .mode = 0644,
++ .maxlen = MPTCP_PM_NAME_MAX,
++ .proc_handler = proc_mptcp_path_manager,
++ },
++ {
++ .procname = "mptcp_scheduler",
++ .mode = 0644,
++ .maxlen = MPTCP_SCHED_NAME_MAX,
++ .proc_handler = proc_mptcp_scheduler,
++ },
++ { }
++};
++
++static inline u32 mptcp_hash_tk(u32 token)
++{
++ return token % MPTCP_HASH_SIZE;
++}
++
++struct hlist_nulls_head tk_hashtable[MPTCP_HASH_SIZE];
++EXPORT_SYMBOL(tk_hashtable);
++
++/* This second hashtable is needed to retrieve request socks
++ * created as a result of a join request. While the SYN contains
++ * the token, the final ack does not, so we need a separate hashtable
++ * to retrieve the mpcb.
++ */
++struct hlist_nulls_head mptcp_reqsk_htb[MPTCP_HASH_SIZE];
++spinlock_t mptcp_reqsk_hlock; /* hashtable protection */
++
++/* The following hash table is used to avoid collision of token */
++static struct hlist_nulls_head mptcp_reqsk_tk_htb[MPTCP_HASH_SIZE];
++spinlock_t mptcp_tk_hashlock; /* hashtable protection */
++
++static bool mptcp_reqsk_find_tk(const u32 token)
++{
++ const u32 hash = mptcp_hash_tk(token);
++ const struct mptcp_request_sock *mtreqsk;
++ const struct hlist_nulls_node *node;
++
++begin:
++ hlist_nulls_for_each_entry_rcu(mtreqsk, node,
++ &mptcp_reqsk_tk_htb[hash], hash_entry) {
++ if (token == mtreqsk->mptcp_loc_token)
++ return true;
++ }
++ /* A request-socket is destroyed by RCU. So, it might have been recycled
++ * and put into another hash-table list. So, after the lookup we may
++ * end up in a different list. So, we may need to restart.
++ *
++ * See also the comment in __inet_lookup_established.
++ */
++ if (get_nulls_value(node) != hash)
++ goto begin;
++ return false;
++}
++
++static void mptcp_reqsk_insert_tk(struct request_sock *reqsk, const u32 token)
++{
++ u32 hash = mptcp_hash_tk(token);
++
++ hlist_nulls_add_head_rcu(&mptcp_rsk(reqsk)->hash_entry,
++ &mptcp_reqsk_tk_htb[hash]);
++}
++
++static void mptcp_reqsk_remove_tk(const struct request_sock *reqsk)
++{
++ rcu_read_lock();
++ spin_lock(&mptcp_tk_hashlock);
++ hlist_nulls_del_init_rcu(&mptcp_rsk(reqsk)->hash_entry);
++ spin_unlock(&mptcp_tk_hashlock);
++ rcu_read_unlock();
++}
++
++void mptcp_reqsk_destructor(struct request_sock *req)
++{
++ if (!mptcp_rsk(req)->is_sub) {
++ if (in_softirq()) {
++ mptcp_reqsk_remove_tk(req);
++ } else {
++ rcu_read_lock_bh();
++ spin_lock(&mptcp_tk_hashlock);
++ hlist_nulls_del_init_rcu(&mptcp_rsk(req)->hash_entry);
++ spin_unlock(&mptcp_tk_hashlock);
++ rcu_read_unlock_bh();
++ }
++ } else {
++ mptcp_hash_request_remove(req);
++ }
++}
++
++static void __mptcp_hash_insert(struct tcp_sock *meta_tp, const u32 token)
++{
++ u32 hash = mptcp_hash_tk(token);
++ hlist_nulls_add_head_rcu(&meta_tp->tk_table, &tk_hashtable[hash]);
++ meta_tp->inside_tk_table = 1;
++}
++
++static bool mptcp_find_token(u32 token)
++{
++ const u32 hash = mptcp_hash_tk(token);
++ const struct tcp_sock *meta_tp;
++ const struct hlist_nulls_node *node;
++
++begin:
++ hlist_nulls_for_each_entry_rcu(meta_tp, node, &tk_hashtable[hash], tk_table) {
++ if (token == meta_tp->mptcp_loc_token)
++ return true;
++ }
++ /* A TCP-socket is destroyed by RCU. So, it might have been recycled
++ * and put into another hash-table list. So, after the lookup we may
++ * end up in a different list. So, we may need to restart.
++ *
++ * See also the comment in __inet_lookup_established.
++ */
++ if (get_nulls_value(node) != hash)
++ goto begin;
++ return false;
++}
++
++static void mptcp_set_key_reqsk(struct request_sock *req,
++ const struct sk_buff *skb)
++{
++ const struct inet_request_sock *ireq = inet_rsk(req);
++ struct mptcp_request_sock *mtreq = mptcp_rsk(req);
++
++ if (skb->protocol == htons(ETH_P_IP)) {
++ mtreq->mptcp_loc_key = mptcp_v4_get_key(ip_hdr(skb)->saddr,
++ ip_hdr(skb)->daddr,
++ htons(ireq->ir_num),
++ ireq->ir_rmt_port);
++#if IS_ENABLED(CONFIG_IPV6)
++ } else {
++ mtreq->mptcp_loc_key = mptcp_v6_get_key(ipv6_hdr(skb)->saddr.s6_addr32,
++ ipv6_hdr(skb)->daddr.s6_addr32,
++ htons(ireq->ir_num),
++ ireq->ir_rmt_port);
++#endif
++ }
++
++ mptcp_key_sha1(mtreq->mptcp_loc_key, &mtreq->mptcp_loc_token, NULL);
++}
++
++/* New MPTCP-connection request, prepare a new token for the meta-socket that
++ * will be created in mptcp_check_req_master(), and store the received token.
++ */
++void mptcp_reqsk_new_mptcp(struct request_sock *req,
++ const struct mptcp_options_received *mopt,
++ const struct sk_buff *skb)
++{
++ struct mptcp_request_sock *mtreq = mptcp_rsk(req);
++
++ inet_rsk(req)->saw_mpc = 1;
++
++ rcu_read_lock();
++ spin_lock(&mptcp_tk_hashlock);
++ do {
++ mptcp_set_key_reqsk(req, skb);
++ } while (mptcp_reqsk_find_tk(mtreq->mptcp_loc_token) ||
++ mptcp_find_token(mtreq->mptcp_loc_token));
++
++ mptcp_reqsk_insert_tk(req, mtreq->mptcp_loc_token);
++ spin_unlock(&mptcp_tk_hashlock);
++ rcu_read_unlock();
++ mtreq->mptcp_rem_key = mopt->mptcp_key;
++}
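The do/while loop above keeps re-deriving a key until the resulting token collides with neither the request-socket table nor the established-socket table. A toy userspace sketch of that retry pattern (the "hash" and table here are stand-ins, not the kernel's SHA-1 derivation or nulls-lists):

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for key -> token derivation (kernel uses SHA-1). */
static uint32_t toy_token_from_key(uint64_t key)
{
    return (uint32_t)(key * 2654435761u);
}

#define TABLE_SZ 8
static uint32_t table[TABLE_SZ];
static int table_len;

static bool token_in_use(uint32_t tok)
{
    for (int i = 0; i < table_len; i++)
        if (table[i] == tok)
            return true;
    return false;
}

/* Mirrors the loop in mptcp_reqsk_new_mptcp(): derive a token from a
 * fresh key, retry while it collides, then insert the winner. */
static uint32_t pick_unique_token(uint64_t *key)
{
    uint32_t tok;

    do {
        (*key)++;                     /* stand-in for picking a new key */
        tok = toy_token_from_key(*key);
    } while (token_in_use(tok));

    table[table_len++] = tok;
    return tok;
}
```

The kernel version runs this under `mptcp_tk_hashlock` so the uniqueness check and the insertion are atomic with respect to other connection setups.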
++
++static void mptcp_set_key_sk(const struct sock *sk)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ const struct inet_sock *isk = inet_sk(sk);
++
++ if (sk->sk_family == AF_INET)
++ tp->mptcp_loc_key = mptcp_v4_get_key(isk->inet_saddr,
++ isk->inet_daddr,
++ isk->inet_sport,
++ isk->inet_dport);
++#if IS_ENABLED(CONFIG_IPV6)
++ else
++ tp->mptcp_loc_key = mptcp_v6_get_key(inet6_sk(sk)->saddr.s6_addr32,
++ sk->sk_v6_daddr.s6_addr32,
++ isk->inet_sport,
++ isk->inet_dport);
++#endif
++
++ mptcp_key_sha1(tp->mptcp_loc_key,
++ &tp->mptcp_loc_token, NULL);
++}
++
++void mptcp_connect_init(struct sock *sk)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++
++ rcu_read_lock_bh();
++ spin_lock(&mptcp_tk_hashlock);
++ do {
++ mptcp_set_key_sk(sk);
++ } while (mptcp_reqsk_find_tk(tp->mptcp_loc_token) ||
++ mptcp_find_token(tp->mptcp_loc_token));
++
++ __mptcp_hash_insert(tp, tp->mptcp_loc_token);
++ spin_unlock(&mptcp_tk_hashlock);
++ rcu_read_unlock_bh();
++}
++
++/**
++ * This function takes a reference on the meta-socket it returns.
++ * It is the responsibility of the caller to drop that reference
++ * when releasing the structure.
++ */
++struct sock *mptcp_hash_find(const struct net *net, const u32 token)
++{
++ const u32 hash = mptcp_hash_tk(token);
++ const struct tcp_sock *meta_tp;
++ struct sock *meta_sk = NULL;
++ const struct hlist_nulls_node *node;
++
++ rcu_read_lock();
++begin:
++ hlist_nulls_for_each_entry_rcu(meta_tp, node, &tk_hashtable[hash],
++ tk_table) {
++ meta_sk = (struct sock *)meta_tp;
++ if (token == meta_tp->mptcp_loc_token &&
++ net_eq(net, sock_net(meta_sk))) {
++ if (unlikely(!atomic_inc_not_zero(&meta_sk->sk_refcnt)))
++ goto out;
++ if (unlikely(token != meta_tp->mptcp_loc_token ||
++ !net_eq(net, sock_net(meta_sk)))) {
++ sock_gen_put(meta_sk);
++ goto begin;
++ }
++ goto found;
++ }
++ }
++ /* A TCP-socket is destroyed by RCU. So, it might have been recycled
++ * and put into another hash-table list. So, after the lookup we may
++ * end up in a different list. So, we may need to restart.
++ *
++ * See also the comment in __inet_lookup_established.
++ */
++ if (get_nulls_value(node) != hash)
++ goto begin;
++out:
++ meta_sk = NULL;
++found:
++ rcu_read_unlock();
++ return meta_sk;
++}
++
++void mptcp_hash_remove_bh(struct tcp_sock *meta_tp)
++{
++ /* remove from the token hashtable */
++ rcu_read_lock_bh();
++ spin_lock(&mptcp_tk_hashlock);
++ hlist_nulls_del_init_rcu(&meta_tp->tk_table);
++ meta_tp->inside_tk_table = 0;
++ spin_unlock(&mptcp_tk_hashlock);
++ rcu_read_unlock_bh();
++}
++
++void mptcp_hash_remove(struct tcp_sock *meta_tp)
++{
++ rcu_read_lock();
++ spin_lock(&mptcp_tk_hashlock);
++ hlist_nulls_del_init_rcu(&meta_tp->tk_table);
++ meta_tp->inside_tk_table = 0;
++ spin_unlock(&mptcp_tk_hashlock);
++ rcu_read_unlock();
++}
++
++struct sock *mptcp_select_ack_sock(const struct sock *meta_sk)
++{
++ const struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct sock *sk, *rttsk = NULL, *lastsk = NULL;
++ u32 min_time = 0, last_active = 0;
++
++ mptcp_for_each_sk(meta_tp->mpcb, sk) {
++ struct tcp_sock *tp = tcp_sk(sk);
++ u32 elapsed;
++
++ if (!mptcp_sk_can_send_ack(sk) || tp->pf)
++ continue;
++
++ elapsed = keepalive_time_elapsed(tp);
++
++ /* We take the one with the lowest RTT within a reasonable
++ * (meta-RTO)-timeframe
++ */
++ if (elapsed < inet_csk(meta_sk)->icsk_rto) {
++ if (!min_time || tp->srtt_us < min_time) {
++ min_time = tp->srtt_us;
++ rttsk = sk;
++ }
++ continue;
++ }
++
++ /* Otherwise, we just take the most recent active */
++ if (!rttsk && (!last_active || elapsed < last_active)) {
++ last_active = elapsed;
++ lastsk = sk;
++ }
++ }
++
++ if (rttsk)
++ return rttsk;
++
++ return lastsk;
++}
++EXPORT_SYMBOL(mptcp_select_ack_sock);
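The selection policy above can be restated without any socket plumbing: prefer the lowest-RTT subflow that was active within the meta-RTO window; only if none qualifies, fall back to the most recently active one. A self-contained sketch under those assumptions (struct fields are illustrative):

```c
#include <stdint.h>

struct sub {
    uint32_t srtt_us;  /* smoothed RTT, microseconds */
    uint32_t elapsed;  /* time since last activity */
};

/* Mirrors mptcp_select_ack_sock()'s policy. Returns an index into subs,
 * or -1 if the array is empty. */
static int select_ack_sub(const struct sub *subs, int n, uint32_t meta_rto)
{
    int rtt_idx = -1, last_idx = -1;
    uint32_t min_rtt = 0, last_active = 0;

    for (int i = 0; i < n; i++) {
        /* Recently active: compete on RTT. */
        if (subs[i].elapsed < meta_rto) {
            if (rtt_idx < 0 || subs[i].srtt_us < min_rtt) {
                min_rtt = subs[i].srtt_us;
                rtt_idx = i;
            }
            continue;
        }
        /* Fallback: most recently active of the stale ones. */
        if (rtt_idx < 0 && (last_idx < 0 || subs[i].elapsed < last_active)) {
            last_active = subs[i].elapsed;
            last_idx = i;
        }
    }
    return rtt_idx >= 0 ? rtt_idx : last_idx;
}
```

Note the fallback candidate is only tracked while no RTT candidate exists, matching the `!rttsk` guard in the kernel loop.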
++
++static void mptcp_sock_def_error_report(struct sock *sk)
++{
++ const struct mptcp_cb *mpcb = tcp_sk(sk)->mpcb;
++
++ if (!sock_flag(sk, SOCK_DEAD))
++ mptcp_sub_close(sk, 0);
++
++ if (mpcb->infinite_mapping_rcv || mpcb->infinite_mapping_snd ||
++ mpcb->send_infinite_mapping) {
++ struct sock *meta_sk = mptcp_meta_sk(sk);
++
++ meta_sk->sk_err = sk->sk_err;
++ meta_sk->sk_err_soft = sk->sk_err_soft;
++
++ if (!sock_flag(meta_sk, SOCK_DEAD))
++ meta_sk->sk_error_report(meta_sk);
++
++ tcp_done(meta_sk);
++ }
++
++ sk->sk_err = 0;
++ return;
++}
++
++static void mptcp_mpcb_put(struct mptcp_cb *mpcb)
++{
++ if (atomic_dec_and_test(&mpcb->mpcb_refcnt)) {
++ mptcp_cleanup_path_manager(mpcb);
++ mptcp_cleanup_scheduler(mpcb);
++ kmem_cache_free(mptcp_cb_cache, mpcb);
++ }
++}
++
++static void mptcp_sock_destruct(struct sock *sk)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++
++ inet_sock_destruct(sk);
++
++ if (!is_meta_sk(sk) && !tp->was_meta_sk) {
++ BUG_ON(!hlist_unhashed(&tp->mptcp->cb_list));
++
++ kmem_cache_free(mptcp_sock_cache, tp->mptcp);
++ tp->mptcp = NULL;
++
++ /* Taken when mpcb pointer was set */
++ sock_put(mptcp_meta_sk(sk));
++ mptcp_mpcb_put(tp->mpcb);
++ } else {
++ struct mptcp_cb *mpcb = tp->mpcb;
++ struct mptcp_tw *mptw;
++
++ /* The mpcb is disappearing - we can make the final
++ * update to the rcv_nxt of the time-wait-sock and remove
++ * its reference to the mpcb.
++ */
++ spin_lock_bh(&mpcb->tw_lock);
++ list_for_each_entry_rcu(mptw, &mpcb->tw_list, list) {
++ list_del_rcu(&mptw->list);
++ mptw->in_list = 0;
++ mptcp_mpcb_put(mpcb);
++ rcu_assign_pointer(mptw->mpcb, NULL);
++ }
++ spin_unlock_bh(&mpcb->tw_lock);
++
++ mptcp_mpcb_put(mpcb);
++
++ mptcp_debug("%s destroying meta-sk\n", __func__);
++ }
++
++ WARN_ON(!static_key_false(&mptcp_static_key));
++ /* Must be the last call, because is_meta_sk() above still needs the
++ * static key
++ */
++ static_key_slow_dec(&mptcp_static_key);
++}
++
++void mptcp_destroy_sock(struct sock *sk)
++{
++ if (is_meta_sk(sk)) {
++ struct sock *sk_it, *tmpsk;
++
++ __skb_queue_purge(&tcp_sk(sk)->mpcb->reinject_queue);
++ mptcp_purge_ofo_queue(tcp_sk(sk));
++
++ /* We have to close all remaining subflows. Normally, they
++ * should all be about to get closed. But, if the kernel is
++ * forcing a closure (e.g., tcp_write_err), the subflows might
++ * not have been closed properly (as we are waiting for the
++ * DATA_ACK of the DATA_FIN).
++ */
++ mptcp_for_each_sk_safe(tcp_sk(sk)->mpcb, sk_it, tmpsk) {
++ /* Already called tcp_close - waiting for graceful
++ * closure, or we are retransmitting fast-close on
++ * the subflow. The reset (or timeout) will kill the
++ * subflow.
++ */
++ if (tcp_sk(sk_it)->closing ||
++ tcp_sk(sk_it)->send_mp_fclose)
++ continue;
++
++ /* Allow the delayed work first to prevent time-wait state */
++ if (delayed_work_pending(&tcp_sk(sk_it)->mptcp->work))
++ continue;
++
++ mptcp_sub_close(sk_it, 0);
++ }
++
++ mptcp_delete_synack_timer(sk);
++ } else {
++ mptcp_del_sock(sk);
++ }
++}
++
++static void mptcp_set_state(struct sock *sk)
++{
++ struct sock *meta_sk = mptcp_meta_sk(sk);
++
++ /* Meta is not yet established - wake up the application */
++ if ((1 << meta_sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV) &&
++ sk->sk_state == TCP_ESTABLISHED) {
++ tcp_set_state(meta_sk, TCP_ESTABLISHED);
++
++ if (!sock_flag(meta_sk, SOCK_DEAD)) {
++ meta_sk->sk_state_change(meta_sk);
++ sk_wake_async(meta_sk, SOCK_WAKE_IO, POLL_OUT);
++ }
++ }
++
++ if (sk->sk_state == TCP_ESTABLISHED) {
++ tcp_sk(sk)->mptcp->establish_increased = 1;
++ tcp_sk(sk)->mpcb->cnt_established++;
++ }
++}
++
++void mptcp_init_congestion_control(struct sock *sk)
++{
++ struct inet_connection_sock *icsk = inet_csk(sk);
++ struct inet_connection_sock *meta_icsk = inet_csk(mptcp_meta_sk(sk));
++ const struct tcp_congestion_ops *ca = meta_icsk->icsk_ca_ops;
++
++ /* If the application didn't set a congestion control to use,
++ * fall back to the default one.
++ */
++ if (ca == &tcp_init_congestion_ops)
++ goto use_default;
++
++ /* Use the same congestion control as set by the user. If the
++ * module is not available fallback to the default one.
++ */
++ if (!try_module_get(ca->owner)) {
++ pr_warn("%s: fallback to the system default CC\n", __func__);
++ goto use_default;
++ }
++
++ icsk->icsk_ca_ops = ca;
++ if (icsk->icsk_ca_ops->init)
++ icsk->icsk_ca_ops->init(sk);
++
++ return;
++
++use_default:
++ icsk->icsk_ca_ops = &tcp_init_congestion_ops;
++ tcp_init_congestion_control(sk);
++}
++
++u32 mptcp_secret[MD5_MESSAGE_BYTES / 4] ____cacheline_aligned;
++u32 mptcp_seed = 0;
++
++void mptcp_key_sha1(u64 key, u32 *token, u64 *idsn)
++{
++ u32 workspace[SHA_WORKSPACE_WORDS];
++ u32 mptcp_hashed_key[SHA_DIGEST_WORDS];
++ u8 input[64];
++ int i;
++
++ memset(workspace, 0, sizeof(workspace));
++
++ /* Initialize input with appropriate padding */
++ memset(&input[9], 0, sizeof(input) - 10); /* -10, because the last byte
++ * is explicitly set too
++ */
++ memcpy(input, &key, sizeof(key)); /* Copy key to the msg beginning */
++ input[8] = 0x80; /* Padding: First bit after message = 1 */
++ input[63] = 0x40; /* Padding: Length of the message = 64 bits */
++
++ sha_init(mptcp_hashed_key);
++ sha_transform(mptcp_hashed_key, input, workspace);
++
++ for (i = 0; i < 5; i++)
++ mptcp_hashed_key[i] = cpu_to_be32(mptcp_hashed_key[i]);
++
++ if (token)
++ *token = mptcp_hashed_key[0];
++ if (idsn)
++ *idsn = *((u64 *)&mptcp_hashed_key[3]);
++}
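The hand-rolled padding above follows the SHA-1 block layout: 8 message bytes, a `0x80` terminator bit, zero fill, and the 64-bit big-endian bit-length of the message in the last bytes (64 bits = `0x40`). A userspace sketch of just the block construction (no hashing):

```c
#include <stdint.h>
#include <string.h>

/* Build the single 64-byte SHA-1 input block for an 8-byte key, the way
 * mptcp_key_sha1() lays it out by hand. */
static void build_sha1_block(uint64_t key, uint8_t out[64])
{
    memset(out, 0, 64);
    memcpy(out, &key, sizeof(key)); /* 8 message bytes at the start */
    out[8]  = 0x80;                 /* padding: first bit after message = 1 */
    out[63] = 0x40;                 /* length field: message is 64 bits */
}
```

Because the message plus padding fits in one 512-bit block, a single `sha_transform()` call suffices, which is why the kernel code avoids the generic crypto API here.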
++
++void mptcp_hmac_sha1(u8 *key_1, u8 *key_2, u8 *rand_1, u8 *rand_2,
++ u32 *hash_out)
++{
++ u32 workspace[SHA_WORKSPACE_WORDS];
++ u8 input[128]; /* 2 512-bit blocks */
++ int i;
++
++ memset(workspace, 0, sizeof(workspace));
++
++ /* Generate key xored with ipad */
++ memset(input, 0x36, 64);
++ for (i = 0; i < 8; i++)
++ input[i] ^= key_1[i];
++ for (i = 0; i < 8; i++)
++ input[i + 8] ^= key_2[i];
++
++ memcpy(&input[64], rand_1, 4);
++ memcpy(&input[68], rand_2, 4);
++ input[72] = 0x80; /* Padding: First bit after message = 1 */
++ memset(&input[73], 0, 53);
++
++ /* Padding: Length of the message = 512 + 64 bits */
++ input[126] = 0x02;
++ input[127] = 0x40;
++
++ sha_init(hash_out);
++ sha_transform(hash_out, input, workspace);
++ memset(workspace, 0, sizeof(workspace));
++
++ sha_transform(hash_out, &input[64], workspace);
++ memset(workspace, 0, sizeof(workspace));
++
++ for (i = 0; i < 5; i++)
++ hash_out[i] = cpu_to_be32(hash_out[i]);
++
++ /* Prepare second part of hmac */
++ memset(input, 0x5C, 64);
++ for (i = 0; i < 8; i++)
++ input[i] ^= key_1[i];
++ for (i = 0; i < 8; i++)
++ input[i + 8] ^= key_2[i];
++
++ memcpy(&input[64], hash_out, 20);
++ input[84] = 0x80;
++ memset(&input[85], 0, 41);
++
++ /* Padding: Length of the message = 512 + 160 bits */
++ input[126] = 0x02;
++ input[127] = 0xA0;
++
++ sha_init(hash_out);
++ sha_transform(hash_out, input, workspace);
++ memset(workspace, 0, sizeof(workspace));
++
++ sha_transform(hash_out, &input[64], workspace);
++
++ for (i = 0; i < 5; i++)
++ hash_out[i] = cpu_to_be32(hash_out[i]);
++}
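The two hard-coded length fields above (`0x02 0x40` and `0x02 0xA0`) are the big-endian bit counts of each HMAC pass: one full 512-bit key block plus the message tail. A small sketch showing where those constants come from:

```c
#include <stdint.h>

/* SHA-1 length field for a message consisting of one full 512-bit block
 * (the xored key) plus msg_bits of payload, as in mptcp_hmac_sha1(). */
static void sha1_len_field(uint32_t msg_bits, uint8_t *hi, uint8_t *lo)
{
    uint32_t total = 512 + msg_bits;

    *hi = (uint8_t)((total >> 8) & 0xff);
    *lo = (uint8_t)(total & 0xff);
}
```

The inner hash covers 64 message bits (two 32-bit nonces), the outer one 160 bits (the 20-byte inner digest), giving 576 and 672 total bits respectively.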
++
++static void mptcp_mpcb_inherit_sockopts(struct sock *meta_sk, struct sock *master_sk)
++{
++ /* Socket-options handled by sk_clone_lock while creating the meta-sk.
++ * ======
++ * SO_SNDBUF, SO_SNDBUFFORCE, SO_RCVBUF, SO_RCVBUFFORCE, SO_RCVLOWAT,
++ * SO_RCVTIMEO, SO_SNDTIMEO, SO_ATTACH_FILTER, SO_DETACH_FILTER,
++ * TCP_NODELAY, TCP_CORK
++ *
++ * Socket-options handled in this function here
++ * ======
++ * TCP_DEFER_ACCEPT
++ * SO_KEEPALIVE
++ *
++ * Socket-options on the todo-list
++ * ======
++ * SO_BINDTODEVICE - should probably prevent creation of new subsocks
++ * across other devices. - what about the api-draft?
++ * SO_DEBUG
++ * SO_REUSEADDR - probably we don't care about this
++ * SO_DONTROUTE, SO_BROADCAST
++ * SO_OOBINLINE
++ * SO_LINGER
++ * SO_TIMESTAMP* - I don't think this is of concern for a SOCK_STREAM
++ * SO_PASSSEC - I don't think this is of concern for a SOCK_STREAM
++ * SO_RXQ_OVFL
++ * TCP_COOKIE_TRANSACTIONS
++ * TCP_MAXSEG
++ * TCP_THIN_* - Handled by sk_clone_lock, but we need to support this
++ * in mptcp_retransmit_timer. AND we need to check what is
++ * about the subsockets.
++ * TCP_LINGER2
++ * TCP_WINDOW_CLAMP
++ * TCP_USER_TIMEOUT
++ * TCP_MD5SIG
++ *
++ * Socket-options of no concern for the meta-socket (but for the subsocket)
++ * ======
++ * SO_PRIORITY
++ * SO_MARK
++ * TCP_CONGESTION
++ * TCP_SYNCNT
++ * TCP_QUICKACK
++ */
++
++ /* DEFER_ACCEPT should not be set on the meta, as we want to accept new subflows directly */
++ inet_csk(meta_sk)->icsk_accept_queue.rskq_defer_accept = 0;
++
++ /* Keepalives are handled entirely at the MPTCP-layer */
++ if (sock_flag(meta_sk, SOCK_KEEPOPEN)) {
++ inet_csk_reset_keepalive_timer(meta_sk,
++ keepalive_time_when(tcp_sk(meta_sk)));
++ sock_reset_flag(master_sk, SOCK_KEEPOPEN);
++ inet_csk_delete_keepalive_timer(master_sk);
++ }
++
++ /* Do not propagate subflow-errors up to the MPTCP-layer */
++ inet_sk(master_sk)->recverr = 0;
++}
++
++static void mptcp_sub_inherit_sockopts(const struct sock *meta_sk, struct sock *sub_sk)
++{
++ /* IP_TOS also goes to the subflow. */
++ if (inet_sk(sub_sk)->tos != inet_sk(meta_sk)->tos) {
++ inet_sk(sub_sk)->tos = inet_sk(meta_sk)->tos;
++ sub_sk->sk_priority = meta_sk->sk_priority;
++ sk_dst_reset(sub_sk);
++ }
++
++ /* Inherit SO_REUSEADDR */
++ sub_sk->sk_reuse = meta_sk->sk_reuse;
++
++ /* Inherit snd/rcv-buffer locks */
++ sub_sk->sk_userlocks = meta_sk->sk_userlocks & ~SOCK_BINDPORT_LOCK;
++
++ /* Nagle/Cork is forced off on the subflows. It is handled at the meta-layer */
++ tcp_sk(sub_sk)->nonagle = TCP_NAGLE_OFF|TCP_NAGLE_PUSH;
++
++ /* Keepalives are handled entirely at the MPTCP-layer */
++ if (sock_flag(sub_sk, SOCK_KEEPOPEN)) {
++ sock_reset_flag(sub_sk, SOCK_KEEPOPEN);
++ inet_csk_delete_keepalive_timer(sub_sk);
++ }
++
++ /* Do not propagate subflow-errors up to the MPTCP-layer */
++ inet_sk(sub_sk)->recverr = 0;
++}
++
++int mptcp_backlog_rcv(struct sock *meta_sk, struct sk_buff *skb)
++{
++ /* skb->sk may be NULL if we receive a packet immediately after the
++ * SYN/ACK + MP_CAPABLE.
++ */
++ struct sock *sk = skb->sk ? skb->sk : meta_sk;
++ int ret = 0;
++
++ skb->sk = NULL;
++
++ if (unlikely(!atomic_inc_not_zero(&sk->sk_refcnt))) {
++ kfree_skb(skb);
++ return 0;
++ }
++
++ if (sk->sk_family == AF_INET)
++ ret = tcp_v4_do_rcv(sk, skb);
++#if IS_ENABLED(CONFIG_IPV6)
++ else
++ ret = tcp_v6_do_rcv(sk, skb);
++#endif
++
++ sock_put(sk);
++ return ret;
++}
++
++struct lock_class_key meta_key;
++struct lock_class_key meta_slock_key;
++
++static void mptcp_synack_timer_handler(unsigned long data)
++{
++ struct sock *meta_sk = (struct sock *) data;
++ struct listen_sock *lopt = inet_csk(meta_sk)->icsk_accept_queue.listen_opt;
++
++ /* Only process if socket is not in use. */
++ bh_lock_sock(meta_sk);
++
++ if (sock_owned_by_user(meta_sk)) {
++ /* Try again later. */
++ mptcp_reset_synack_timer(meta_sk, HZ/20);
++ goto out;
++ }
++
++ /* May happen if the queue got destructed in mptcp_close */
++ if (!lopt)
++ goto out;
++
++ inet_csk_reqsk_queue_prune(meta_sk, TCP_SYNQ_INTERVAL,
++ TCP_TIMEOUT_INIT, TCP_RTO_MAX);
++
++ if (lopt->qlen)
++ mptcp_reset_synack_timer(meta_sk, TCP_SYNQ_INTERVAL);
++
++out:
++ bh_unlock_sock(meta_sk);
++ sock_put(meta_sk);
++}
++
++static const struct tcp_sock_ops mptcp_meta_specific = {
++ .__select_window = __mptcp_select_window,
++ .select_window = mptcp_select_window,
++ .select_initial_window = mptcp_select_initial_window,
++ .init_buffer_space = mptcp_init_buffer_space,
++ .set_rto = mptcp_tcp_set_rto,
++ .should_expand_sndbuf = mptcp_should_expand_sndbuf,
++ .init_congestion_control = mptcp_init_congestion_control,
++ .send_fin = mptcp_send_fin,
++ .write_xmit = mptcp_write_xmit,
++ .send_active_reset = mptcp_send_active_reset,
++ .write_wakeup = mptcp_write_wakeup,
++ .prune_ofo_queue = mptcp_prune_ofo_queue,
++ .retransmit_timer = mptcp_retransmit_timer,
++ .time_wait = mptcp_time_wait,
++ .cleanup_rbuf = mptcp_cleanup_rbuf,
++};
++
++static const struct tcp_sock_ops mptcp_sub_specific = {
++ .__select_window = __mptcp_select_window,
++ .select_window = mptcp_select_window,
++ .select_initial_window = mptcp_select_initial_window,
++ .init_buffer_space = mptcp_init_buffer_space,
++ .set_rto = mptcp_tcp_set_rto,
++ .should_expand_sndbuf = mptcp_should_expand_sndbuf,
++ .init_congestion_control = mptcp_init_congestion_control,
++ .send_fin = tcp_send_fin,
++ .write_xmit = tcp_write_xmit,
++ .send_active_reset = tcp_send_active_reset,
++ .write_wakeup = tcp_write_wakeup,
++ .prune_ofo_queue = tcp_prune_ofo_queue,
++ .retransmit_timer = tcp_retransmit_timer,
++ .time_wait = tcp_time_wait,
++ .cleanup_rbuf = tcp_cleanup_rbuf,
++};
++
++static int mptcp_alloc_mpcb(struct sock *meta_sk, __u64 remote_key, u32 window)
++{
++ struct mptcp_cb *mpcb;
++ struct sock *master_sk;
++ struct inet_connection_sock *master_icsk, *meta_icsk = inet_csk(meta_sk);
++ struct tcp_sock *master_tp, *meta_tp = tcp_sk(meta_sk);
++ u64 idsn;
++
++ dst_release(meta_sk->sk_rx_dst);
++ meta_sk->sk_rx_dst = NULL;
++ /* This flag is set to announce sock_lock_init to
++ * reclassify the lock-class of the master socket.
++ */
++ meta_tp->is_master_sk = 1;
++ master_sk = sk_clone_lock(meta_sk, GFP_ATOMIC | __GFP_ZERO);
++ meta_tp->is_master_sk = 0;
++ if (!master_sk)
++ return -ENOBUFS;
++
++ master_tp = tcp_sk(master_sk);
++ master_icsk = inet_csk(master_sk);
++
++ mpcb = kmem_cache_zalloc(mptcp_cb_cache, GFP_ATOMIC);
++ if (!mpcb) {
++ /* sk_free (and __sk_free) requires wmem_alloc to be 1.
++ * All the rest is set to 0 thanks to __GFP_ZERO above.
++ */
++ atomic_set(&master_sk->sk_wmem_alloc, 1);
++ sk_free(master_sk);
++ return -ENOBUFS;
++ }
++
++#if IS_ENABLED(CONFIG_IPV6)
++ if (meta_icsk->icsk_af_ops == &mptcp_v6_mapped) {
++ struct ipv6_pinfo *newnp, *np = inet6_sk(meta_sk);
++
++ inet_sk(master_sk)->pinet6 = &((struct tcp6_sock *)master_sk)->inet6;
++
++ newnp = inet6_sk(master_sk);
++ memcpy(newnp, np, sizeof(struct ipv6_pinfo));
++
++ newnp->ipv6_mc_list = NULL;
++ newnp->ipv6_ac_list = NULL;
++ newnp->ipv6_fl_list = NULL;
++ newnp->opt = NULL;
++ newnp->pktoptions = NULL;
++ (void)xchg(&newnp->rxpmtu, NULL);
++ } else if (meta_sk->sk_family == AF_INET6) {
++ struct ipv6_pinfo *newnp, *np = inet6_sk(meta_sk);
++
++ inet_sk(master_sk)->pinet6 = &((struct tcp6_sock *)master_sk)->inet6;
++
++ newnp = inet6_sk(master_sk);
++ memcpy(newnp, np, sizeof(struct ipv6_pinfo));
++
++ newnp->hop_limit = -1;
++ newnp->mcast_hops = IPV6_DEFAULT_MCASTHOPS;
++ newnp->mc_loop = 1;
++ newnp->pmtudisc = IPV6_PMTUDISC_WANT;
++ newnp->ipv6only = sock_net(master_sk)->ipv6.sysctl.bindv6only;
++ }
++#endif
++
++ meta_tp->mptcp = NULL;
++
++ /* Store the keys and generate the peer's token */
++ mpcb->mptcp_loc_key = meta_tp->mptcp_loc_key;
++ mpcb->mptcp_loc_token = meta_tp->mptcp_loc_token;
++
++ /* Generate Initial data-sequence-numbers */
++ mptcp_key_sha1(mpcb->mptcp_loc_key, NULL, &idsn);
++ idsn = ntohll(idsn) + 1;
++ mpcb->snd_high_order[0] = idsn >> 32;
++ mpcb->snd_high_order[1] = mpcb->snd_high_order[0] - 1;
++
++ meta_tp->write_seq = (u32)idsn;
++ meta_tp->snd_sml = meta_tp->write_seq;
++ meta_tp->snd_una = meta_tp->write_seq;
++ meta_tp->snd_nxt = meta_tp->write_seq;
++ meta_tp->pushed_seq = meta_tp->write_seq;
++ meta_tp->snd_up = meta_tp->write_seq;
++
++ mpcb->mptcp_rem_key = remote_key;
++ mptcp_key_sha1(mpcb->mptcp_rem_key, &mpcb->mptcp_rem_token, &idsn);
++ idsn = ntohll(idsn) + 1;
++ mpcb->rcv_high_order[0] = idsn >> 32;
++ mpcb->rcv_high_order[1] = mpcb->rcv_high_order[0] + 1;
++ meta_tp->copied_seq = (u32) idsn;
++ meta_tp->rcv_nxt = (u32) idsn;
++ meta_tp->rcv_wup = (u32) idsn;
++
++ meta_tp->snd_wl1 = meta_tp->rcv_nxt - 1;
++ meta_tp->snd_wnd = window;
++ meta_tp->retrans_stamp = 0; /* Set in tcp_connect() */
++
++ meta_tp->packets_out = 0;
++ meta_icsk->icsk_probes_out = 0;
++
++ /* Set mptcp-pointers */
++ master_tp->mpcb = mpcb;
++ master_tp->meta_sk = meta_sk;
++ meta_tp->mpcb = mpcb;
++ meta_tp->meta_sk = meta_sk;
++ mpcb->meta_sk = meta_sk;
++ mpcb->master_sk = master_sk;
++
++ meta_tp->was_meta_sk = 0;
++
++ /* Initialize the queues */
++ skb_queue_head_init(&mpcb->reinject_queue);
++ skb_queue_head_init(&master_tp->out_of_order_queue);
++ tcp_prequeue_init(master_tp);
++ INIT_LIST_HEAD(&master_tp->tsq_node);
++
++ master_tp->tsq_flags = 0;
++
++ mutex_init(&mpcb->mpcb_mutex);
++
++ /* Init the accept_queue structure, we support a queue of 32 pending
++ * connections, it does not need to be huge, since we only store here
++ * pending subflow creations.
++ */
++ if (reqsk_queue_alloc(&meta_icsk->icsk_accept_queue, 32, GFP_ATOMIC)) {
++ inet_put_port(master_sk);
++ kmem_cache_free(mptcp_cb_cache, mpcb);
++ sk_free(master_sk);
++ return -ENOMEM;
++ }
++
++ /* Redefine function-pointers as the meta-sk is now fully ready */
++ static_key_slow_inc(&mptcp_static_key);
++ meta_tp->mpc = 1;
++ meta_tp->ops = &mptcp_meta_specific;
++
++ meta_sk->sk_backlog_rcv = mptcp_backlog_rcv;
++ meta_sk->sk_destruct = mptcp_sock_destruct;
++
++ /* Meta-level retransmit timer */
++ meta_icsk->icsk_rto *= 2; /* Double the initial RTO */
++
++ tcp_init_xmit_timers(master_sk);
++ /* Has been set for sending out the SYN */
++ inet_csk_clear_xmit_timer(meta_sk, ICSK_TIME_RETRANS);
++
++ if (!meta_tp->inside_tk_table) {
++ /* Adding the meta_tp in the token hashtable - coming from server-side */
++ rcu_read_lock();
++ spin_lock(&mptcp_tk_hashlock);
++
++ __mptcp_hash_insert(meta_tp, mpcb->mptcp_loc_token);
++
++ spin_unlock(&mptcp_tk_hashlock);
++ rcu_read_unlock();
++ }
++ master_tp->inside_tk_table = 0;
++
++ /* Init time-wait stuff */
++ INIT_LIST_HEAD(&mpcb->tw_list);
++ spin_lock_init(&mpcb->tw_lock);
++
++ INIT_HLIST_HEAD(&mpcb->callback_list);
++
++ mptcp_mpcb_inherit_sockopts(meta_sk, master_sk);
++
++ mpcb->orig_sk_rcvbuf = meta_sk->sk_rcvbuf;
++ mpcb->orig_sk_sndbuf = meta_sk->sk_sndbuf;
++ mpcb->orig_window_clamp = meta_tp->window_clamp;
++
++ /* The meta is directly linked - set refcnt to 1 */
++ atomic_set(&mpcb->mpcb_refcnt, 1);
++
++ mptcp_init_path_manager(mpcb);
++ mptcp_init_scheduler(mpcb);
++
++ setup_timer(&mpcb->synack_timer, mptcp_synack_timer_handler,
++ (unsigned long)meta_sk);
++
++ mptcp_debug("%s: created mpcb with token %#x\n",
++ __func__, mpcb->mptcp_loc_token);
++
++ return 0;
++}
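The IDSN handling above splits a 64-bit initial data-sequence number into a 32-bit starting sequence plus the "high order" words used to extend 32-bit sequence arithmetic back to 64 bits. A sketch of the send-side split (field names are illustrative):

```c
#include <stdint.h>

/* Mirrors how mptcp_alloc_mpcb() seeds the send side: IDSN + 1 becomes
 * the 32-bit write_seq, and snd_high_order holds the current and
 * previous upper-32-bit words. */
static void split_idsn(uint64_t idsn, uint32_t *seq, uint32_t high[2])
{
    idsn += 1;
    high[0] = (uint32_t)(idsn >> 32);
    high[1] = high[0] - 1;      /* previous wrap of the upper word */
    *seq = (uint32_t)idsn;
}
```

The receive side in the patch does the mirror image (`rcv_high_order[1] = rcv_high_order[0] + 1`), since the peer's sequence space advances toward the next wrap rather than away from the last one.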
++
++void mptcp_fallback_meta_sk(struct sock *meta_sk)
++{
++ kfree(inet_csk(meta_sk)->icsk_accept_queue.listen_opt);
++ kmem_cache_free(mptcp_cb_cache, tcp_sk(meta_sk)->mpcb);
++}
++
++int mptcp_add_sock(struct sock *meta_sk, struct sock *sk, u8 loc_id, u8 rem_id,
++ gfp_t flags)
++{
++ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct tcp_sock *tp = tcp_sk(sk);
++
++ tp->mptcp = kmem_cache_zalloc(mptcp_sock_cache, flags);
++ if (!tp->mptcp)
++ return -ENOMEM;
++
++ tp->mptcp->path_index = mptcp_set_new_pathindex(mpcb);
++ /* No more space for more subflows? */
++ if (!tp->mptcp->path_index) {
++ kmem_cache_free(mptcp_sock_cache, tp->mptcp);
++ return -EPERM;
++ }
++
++ INIT_HLIST_NODE(&tp->mptcp->cb_list);
++
++ tp->mptcp->tp = tp;
++ tp->mpcb = mpcb;
++ tp->meta_sk = meta_sk;
++
++ static_key_slow_inc(&mptcp_static_key);
++ tp->mpc = 1;
++ tp->ops = &mptcp_sub_specific;
++
++ tp->mptcp->loc_id = loc_id;
++ tp->mptcp->rem_id = rem_id;
++ if (mpcb->sched_ops->init)
++ mpcb->sched_ops->init(sk);
++
++ /* The corresponding sock_put is in mptcp_sock_destruct(). It cannot be
++ * included in mptcp_del_sock(), because the mpcb must remain alive
++ * until the last subsocket is completely destroyed.
++ */
++ sock_hold(meta_sk);
++ atomic_inc(&mpcb->mpcb_refcnt);
++
++ tp->mptcp->next = mpcb->connection_list;
++ mpcb->connection_list = tp;
++ tp->mptcp->attached = 1;
++
++ mpcb->cnt_subflows++;
++ atomic_add(atomic_read(&((struct sock *)tp)->sk_rmem_alloc),
++ &meta_sk->sk_rmem_alloc);
++
++ mptcp_sub_inherit_sockopts(meta_sk, sk);
++ INIT_DELAYED_WORK(&tp->mptcp->work, mptcp_sub_close_wq);
++
++ /* As we successfully allocated the mptcp_tcp_sock, we have to
++ * change the function-pointers here (for sk_destruct to work correctly)
++ */
++ sk->sk_error_report = mptcp_sock_def_error_report;
++ sk->sk_data_ready = mptcp_data_ready;
++ sk->sk_write_space = mptcp_write_space;
++ sk->sk_state_change = mptcp_set_state;
++ sk->sk_destruct = mptcp_sock_destruct;
++
++ if (sk->sk_family == AF_INET)
++ mptcp_debug("%s: token %#x pi %d, src_addr:%pI4:%d dst_addr:%pI4:%d, cnt_subflows now %d\n",
++ __func__ , mpcb->mptcp_loc_token,
++ tp->mptcp->path_index,
++ &((struct inet_sock *)tp)->inet_saddr,
++ ntohs(((struct inet_sock *)tp)->inet_sport),
++ &((struct inet_sock *)tp)->inet_daddr,
++ ntohs(((struct inet_sock *)tp)->inet_dport),
++ mpcb->cnt_subflows);
++#if IS_ENABLED(CONFIG_IPV6)
++ else
++ mptcp_debug("%s: token %#x pi %d, src_addr:%pI6:%d dst_addr:%pI6:%d, cnt_subflows now %d\n",
++ __func__ , mpcb->mptcp_loc_token,
++ tp->mptcp->path_index, &inet6_sk(sk)->saddr,
++ ntohs(((struct inet_sock *)tp)->inet_sport),
++ &sk->sk_v6_daddr,
++ ntohs(((struct inet_sock *)tp)->inet_dport),
++ mpcb->cnt_subflows);
++#endif
++
++ return 0;
++}
++
++void mptcp_del_sock(struct sock *sk)
++{
++ struct tcp_sock *tp = tcp_sk(sk), *tp_prev;
++ struct mptcp_cb *mpcb;
++
++ if (!tp->mptcp || !tp->mptcp->attached)
++ return;
++
++ mpcb = tp->mpcb;
++ tp_prev = mpcb->connection_list;
++
++ mptcp_debug("%s: Removing subsock tok %#x pi:%d state %d is_meta? %d\n",
++ __func__, mpcb->mptcp_loc_token, tp->mptcp->path_index,
++ sk->sk_state, is_meta_sk(sk));
++
++ if (tp_prev == tp) {
++ mpcb->connection_list = tp->mptcp->next;
++ } else {
++ for (; tp_prev && tp_prev->mptcp->next; tp_prev = tp_prev->mptcp->next) {
++ if (tp_prev->mptcp->next == tp) {
++ tp_prev->mptcp->next = tp->mptcp->next;
++ break;
++ }
++ }
++ }
++ mpcb->cnt_subflows--;
++ if (tp->mptcp->establish_increased)
++ mpcb->cnt_established--;
++
++ tp->mptcp->next = NULL;
++ tp->mptcp->attached = 0;
++ mpcb->path_index_bits &= ~(1 << tp->mptcp->path_index);
++
++ if (!skb_queue_empty(&sk->sk_write_queue))
++ mptcp_reinject_data(sk, 0);
++
++ if (is_master_tp(tp))
++ mpcb->master_sk = NULL;
++ else if (tp->mptcp->pre_established)
++ sk_stop_timer(sk, &tp->mptcp->mptcp_ack_timer);
++
++ rcu_assign_pointer(inet_sk(sk)->inet_opt, NULL);
++}
++
++/* Updates the metasocket ULID/port data, based on the given sock.
++ * The argument sock must be the sock accessible to the application.
++ * In this function, we update the meta socket info, based on the changes
++ * in the application socket (bind, address allocation, ...)
++ */
++void mptcp_update_metasocket(struct sock *sk, const struct sock *meta_sk)
++{
++ if (tcp_sk(sk)->mpcb->pm_ops->new_session)
++ tcp_sk(sk)->mpcb->pm_ops->new_session(meta_sk);
++
++ tcp_sk(sk)->mptcp->send_mp_prio = tcp_sk(sk)->mptcp->low_prio;
++}
++
++/* Clean up the receive buffer for full frames taken by the user,
++ * then send an ACK if necessary. COPIED is the number of bytes
++ * tcp_recvmsg has given to the user so far, it speeds up the
++ * calculation of whether or not we must ACK for the sake of
++ * a window update.
++ */
++void mptcp_cleanup_rbuf(struct sock *meta_sk, int copied)
++{
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct sock *sk;
++ __u32 rcv_window_now = 0;
++
++ if (copied > 0 && !(meta_sk->sk_shutdown & RCV_SHUTDOWN)) {
++ rcv_window_now = tcp_receive_window(meta_tp);
++
++ if (2 * rcv_window_now > meta_tp->window_clamp)
++ rcv_window_now = 0;
++ }
++
++ mptcp_for_each_sk(meta_tp->mpcb, sk) {
++ struct tcp_sock *tp = tcp_sk(sk);
++ const struct inet_connection_sock *icsk = inet_csk(sk);
++
++ if (!mptcp_sk_can_send_ack(sk))
++ continue;
++
++ if (!inet_csk_ack_scheduled(sk))
++ goto second_part;
++ /* Delayed ACKs frequently hit locked sockets during bulk
++ * receive.
++ */
++ if (icsk->icsk_ack.blocked ||
++ /* Once-per-two-segments ACK was not sent by tcp_input.c */
++ tp->rcv_nxt - tp->rcv_wup > icsk->icsk_ack.rcv_mss ||
++ /* If this read emptied read buffer, we send ACK, if
++ * connection is not bidirectional, user drained
++ * receive buffer and there was a small segment
++ * in queue.
++ */
++ (copied > 0 &&
++ ((icsk->icsk_ack.pending & ICSK_ACK_PUSHED2) ||
++ ((icsk->icsk_ack.pending & ICSK_ACK_PUSHED) &&
++ !icsk->icsk_ack.pingpong)) &&
++ !atomic_read(&meta_sk->sk_rmem_alloc))) {
++ tcp_send_ack(sk);
++ continue;
++ }
++
++second_part:
++ /* This here is the second part of tcp_cleanup_rbuf */
++ if (rcv_window_now) {
++ __u32 new_window = tp->ops->__select_window(sk);
++
++ /* Send ACK now, if this read freed lots of space
++ * in our buffer. new_window is the window we could
++ * advertise now; do so if it is not less than the
++ * current one.
++ * "Lots" means "at least twice" here.
++ */
++ if (new_window && new_window >= 2 * rcv_window_now)
++ tcp_send_ack(sk);
++ }
++ }
++}
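The window-update decision in the second part above reduces to a simple predicate: only wake the peer with an ACK if the freshly computed window at least doubled the one currently advertised. As a standalone sketch:

```c
#include <stdbool.h>
#include <stdint.h>

/* Mirrors the "lots of space" test in mptcp_cleanup_rbuf(): advertise
 * only when the new window is nonzero and at least twice the current
 * receive window (and a window update is warranted at all). */
static bool should_ack_for_window(uint32_t new_window, uint32_t rcv_window_now)
{
    return rcv_window_now && new_window && new_window >= 2 * rcv_window_now;
}
```

The `rcv_window_now` guard corresponds to the early check in the function: if the current window already exceeds half the clamp, no update is needed on any subflow.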
++
++static int mptcp_sub_send_fin(struct sock *sk)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct sk_buff *skb = tcp_write_queue_tail(sk);
++ int mss_now;
++
++ /* Optimization, tack on the FIN if we have a queue of
++ * unsent frames. But be careful about outgoing SACKS
++ * and IP options.
++ */
++ mss_now = tcp_current_mss(sk);
++
++ if (tcp_send_head(sk) != NULL) {
++ TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_FIN;
++ TCP_SKB_CB(skb)->end_seq++;
++ tp->write_seq++;
++ } else {
++ skb = alloc_skb_fclone(MAX_TCP_HEADER, GFP_ATOMIC);
++ if (!skb)
++ return 1;
++
++ /* Reserve space for headers and prepare control bits. */
++ skb_reserve(skb, MAX_TCP_HEADER);
++ /* FIN eats a sequence byte, write_seq advanced by tcp_queue_skb(). */
++ tcp_init_nondata_skb(skb, tp->write_seq,
++ TCPHDR_ACK | TCPHDR_FIN);
++ tcp_queue_skb(sk, skb);
++ }
++ __tcp_push_pending_frames(sk, mss_now, TCP_NAGLE_OFF);
++
++ return 0;
++}
++
++void mptcp_sub_close_wq(struct work_struct *work)
++{
++ struct tcp_sock *tp = container_of(work, struct mptcp_tcp_sock, work.work)->tp;
++ struct sock *sk = (struct sock *)tp;
++ struct sock *meta_sk = mptcp_meta_sk(sk);
++
++ mutex_lock(&tp->mpcb->mpcb_mutex);
++ lock_sock_nested(meta_sk, SINGLE_DEPTH_NESTING);
++
++ if (sock_flag(sk, SOCK_DEAD))
++ goto exit;
++
++ /* We come from tcp_disconnect. We are sure that meta_sk is set */
++ if (!mptcp(tp)) {
++ tp->closing = 1;
++ sock_rps_reset_flow(sk);
++ tcp_close(sk, 0);
++ goto exit;
++ }
++
++ if (meta_sk->sk_shutdown == SHUTDOWN_MASK || sk->sk_state == TCP_CLOSE) {
++ tp->closing = 1;
++ sock_rps_reset_flow(sk);
++ tcp_close(sk, 0);
++ } else if (tcp_close_state(sk)) {
++ sk->sk_shutdown |= SEND_SHUTDOWN;
++ tcp_send_fin(sk);
++ }
++
++exit:
++ release_sock(meta_sk);
++ mutex_unlock(&tp->mpcb->mpcb_mutex);
++ sock_put(sk);
++}
++
++void mptcp_sub_close(struct sock *sk, unsigned long delay)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct delayed_work *work = &tcp_sk(sk)->mptcp->work;
++
++ /* We are already closing - e.g., call from sock_def_error_report upon
++ * tcp_disconnect in tcp_close.
++ */
++ if (tp->closing)
++ return;
++
++ /* Work already scheduled? */
++ if (work_pending(&work->work)) {
++ /* Work present - who will be first? */
++ if (jiffies + delay > work->timer.expires)
++ return;
++
++ /* Try canceling - if it fails, work will be executed soon */
++ if (!cancel_delayed_work(work))
++ return;
++ sock_put(sk);
++ }
++
++ if (!delay) {
++ unsigned char old_state = sk->sk_state;
++
++ /* If we are in user-context we can directly do the closing
++ * procedure. No need to schedule a work-queue.
++ */
++ if (!in_softirq()) {
++ if (sock_flag(sk, SOCK_DEAD))
++ return;
++
++ if (!mptcp(tp)) {
++ tp->closing = 1;
++ sock_rps_reset_flow(sk);
++ tcp_close(sk, 0);
++ return;
++ }
++
++ if (mptcp_meta_sk(sk)->sk_shutdown == SHUTDOWN_MASK ||
++ sk->sk_state == TCP_CLOSE) {
++ tp->closing = 1;
++ sock_rps_reset_flow(sk);
++ tcp_close(sk, 0);
++ } else if (tcp_close_state(sk)) {
++ sk->sk_shutdown |= SEND_SHUTDOWN;
++ tcp_send_fin(sk);
++ }
++
++ return;
++ }
++
++ /* We send the FIN directly, because it may take quite a long
++ * time until the work-queue gets scheduled...
++ *
++ * If mptcp_sub_send_fin returns 1, it failed and thus we reset
++ * the old state so that tcp_close will finally send the FIN
++ * in user-context.
++ */
++ if (!sk->sk_err && old_state != TCP_CLOSE &&
++ tcp_close_state(sk) && mptcp_sub_send_fin(sk)) {
++ if (old_state == TCP_ESTABLISHED)
++ TCP_INC_STATS(sock_net(sk), TCP_MIB_CURRESTAB);
++ sk->sk_state = old_state;
++ }
++ }
++
++ sock_hold(sk);
++ queue_delayed_work(mptcp_wq, work, delay);
++}
++
++void mptcp_sub_force_close(struct sock *sk)
++{
++ /* The below tcp_done may have freed the socket, if it is already dead.
++ * Thus, we are not allowed to access it afterwards. That's why
++ * we have to store the dead-state in this local variable.
++ */
++ int sock_is_dead = sock_flag(sk, SOCK_DEAD);
++
++ tcp_sk(sk)->mp_killed = 1;
++
++ if (sk->sk_state != TCP_CLOSE)
++ tcp_done(sk);
++
++ if (!sock_is_dead)
++ mptcp_sub_close(sk, 0);
++}
++EXPORT_SYMBOL(mptcp_sub_force_close);
++
++/* Update the mpcb send window, based on the contributions
++ * of each subflow
++ */
++void mptcp_update_sndbuf(const struct tcp_sock *tp)
++{
++ struct sock *meta_sk = tp->meta_sk, *sk;
++ int new_sndbuf = 0, old_sndbuf = meta_sk->sk_sndbuf;
++
++ mptcp_for_each_sk(tp->mpcb, sk) {
++ if (!mptcp_sk_can_send(sk))
++ continue;
++
++ new_sndbuf += sk->sk_sndbuf;
++
++ if (new_sndbuf > sysctl_tcp_wmem[2] || new_sndbuf < 0) {
++ new_sndbuf = sysctl_tcp_wmem[2];
++ break;
++ }
++ }
++ meta_sk->sk_sndbuf = max(min(new_sndbuf, sysctl_tcp_wmem[2]), meta_sk->sk_sndbuf);
++
++ /* The subflow's call to sk_write_space in tcp_new_space ends up in
++ * mptcp_write_space.
++ * It has nothing to do with waking up the application.
++ * So, we do it here.
++ */
++ if (old_sndbuf != meta_sk->sk_sndbuf)
++ meta_sk->sk_write_space(meta_sk);
++}
++
++void mptcp_close(struct sock *meta_sk, long timeout)
++{
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct sock *sk_it, *tmpsk;
++ struct mptcp_cb *mpcb = meta_tp->mpcb;
++ struct sk_buff *skb;
++ int data_was_unread = 0;
++ int state;
++
++ mptcp_debug("%s: Close of meta_sk with tok %#x\n",
++ __func__, mpcb->mptcp_loc_token);
++
++ mutex_lock(&mpcb->mpcb_mutex);
++ lock_sock(meta_sk);
++
++ if (meta_tp->inside_tk_table) {
++ /* Detach the mpcb from the token hashtable */
++ mptcp_hash_remove_bh(meta_tp);
++ reqsk_queue_destroy(&inet_csk(meta_sk)->icsk_accept_queue);
++ }
++
++ meta_sk->sk_shutdown = SHUTDOWN_MASK;
++ /* We need to flush the recv. buffs. We do this only on the
++ * descriptor close, not protocol-sourced closes, because the
++ * reader process may not have drained the data yet!
++ */
++ while ((skb = __skb_dequeue(&meta_sk->sk_receive_queue)) != NULL) {
++ u32 len = TCP_SKB_CB(skb)->end_seq - TCP_SKB_CB(skb)->seq -
++ tcp_hdr(skb)->fin;
++ data_was_unread += len;
++ __kfree_skb(skb);
++ }
++
++ sk_mem_reclaim(meta_sk);
++
++ /* If socket has been already reset (e.g. in tcp_reset()) - kill it. */
++ if (meta_sk->sk_state == TCP_CLOSE) {
++ mptcp_for_each_sk_safe(mpcb, sk_it, tmpsk) {
++ if (tcp_sk(sk_it)->send_mp_fclose)
++ continue;
++ mptcp_sub_close(sk_it, 0);
++ }
++ goto adjudge_to_death;
++ }
++
++ if (data_was_unread) {
++ /* Unread data was tossed, zap the connection. */
++ NET_INC_STATS_USER(sock_net(meta_sk), LINUX_MIB_TCPABORTONCLOSE);
++ tcp_set_state(meta_sk, TCP_CLOSE);
++ tcp_sk(meta_sk)->ops->send_active_reset(meta_sk,
++ meta_sk->sk_allocation);
++ } else if (sock_flag(meta_sk, SOCK_LINGER) && !meta_sk->sk_lingertime) {
++ /* Check zero linger _after_ checking for unread data. */
++ meta_sk->sk_prot->disconnect(meta_sk, 0);
++ NET_INC_STATS_USER(sock_net(meta_sk), LINUX_MIB_TCPABORTONDATA);
++ } else if (tcp_close_state(meta_sk)) {
++ mptcp_send_fin(meta_sk);
++ } else if (meta_tp->snd_una == meta_tp->write_seq) {
++ /* The DATA_FIN has been sent and acknowledged
++ * (e.g., by sk_shutdown). Close all the other subflows
++ */
++ mptcp_for_each_sk_safe(mpcb, sk_it, tmpsk) {
++ unsigned long delay = 0;
++ /* If we are the passive closer, don't trigger the
++ * subflow FIN until the subflow has been FINed
++ * by the peer - thus we add a delay
++ */
++ if (mpcb->passive_close &&
++ sk_it->sk_state == TCP_ESTABLISHED)
++ delay = inet_csk(sk_it)->icsk_rto << 3;
++
++ mptcp_sub_close(sk_it, delay);
++ }
++ }
++
++ sk_stream_wait_close(meta_sk, timeout);
++
++adjudge_to_death:
++ state = meta_sk->sk_state;
++ sock_hold(meta_sk);
++ sock_orphan(meta_sk);
++
++ /* socket will be freed after mptcp_close - we have to prevent
++ * access from the subflows.
++ */
++ mptcp_for_each_sk(mpcb, sk_it) {
++ /* Similar to sock_orphan, but we don't set it DEAD, because
++ * the callbacks are still set and must be called.
++ */
++ write_lock_bh(&sk_it->sk_callback_lock);
++ sk_set_socket(sk_it, NULL);
++ sk_it->sk_wq = NULL;
++ write_unlock_bh(&sk_it->sk_callback_lock);
++ }
++
++ /* It is the last release_sock in its life. It will remove backlog. */
++ release_sock(meta_sk);
++
++ /* Now socket is owned by kernel and we acquire BH lock
++ * to finish close. No need to check for user refs.
++ */
++ local_bh_disable();
++ bh_lock_sock(meta_sk);
++ WARN_ON(sock_owned_by_user(meta_sk));
++
++ percpu_counter_inc(meta_sk->sk_prot->orphan_count);
++
++ /* Have we already been destroyed by a softirq or backlog? */
++ if (state != TCP_CLOSE && meta_sk->sk_state == TCP_CLOSE)
++ goto out;
++
++ /* This is a (useful) BSD violating of the RFC. There is a
++ * problem with TCP as specified in that the other end could
++ * keep a socket open forever with no application left this end.
++ * We use a 3 minute timeout (about the same as BSD) then kill
++ * our end. If they send after that then tough - BUT: long enough
++ * that we won't make the old 4*rto = almost no time - whoops
++ * reset mistake.
++ *
++ * Nope, it was not mistake. It is really desired behaviour
++ * f.e. on http servers, when such sockets are useless, but
++ * consume significant resources. Let's do it with special
++ * linger2 option. --ANK
++ */
++
++ if (meta_sk->sk_state == TCP_FIN_WAIT2) {
++ if (meta_tp->linger2 < 0) {
++ tcp_set_state(meta_sk, TCP_CLOSE);
++ meta_tp->ops->send_active_reset(meta_sk, GFP_ATOMIC);
++ NET_INC_STATS_BH(sock_net(meta_sk),
++ LINUX_MIB_TCPABORTONLINGER);
++ } else {
++ const int tmo = tcp_fin_time(meta_sk);
++
++ if (tmo > TCP_TIMEWAIT_LEN) {
++ inet_csk_reset_keepalive_timer(meta_sk,
++ tmo - TCP_TIMEWAIT_LEN);
++ } else {
++ meta_tp->ops->time_wait(meta_sk, TCP_FIN_WAIT2,
++ tmo);
++ goto out;
++ }
++ }
++ }
++ if (meta_sk->sk_state != TCP_CLOSE) {
++ sk_mem_reclaim(meta_sk);
++ if (tcp_too_many_orphans(meta_sk, 0)) {
++ if (net_ratelimit())
++ pr_info("MPTCP: too many orphaned sockets\n");
++ tcp_set_state(meta_sk, TCP_CLOSE);
++ meta_tp->ops->send_active_reset(meta_sk, GFP_ATOMIC);
++ NET_INC_STATS_BH(sock_net(meta_sk),
++ LINUX_MIB_TCPABORTONMEMORY);
++ }
++ }
++
++
++ if (meta_sk->sk_state == TCP_CLOSE)
++ inet_csk_destroy_sock(meta_sk);
++ /* Otherwise, socket is reprieved until protocol close. */
++
++out:
++ bh_unlock_sock(meta_sk);
++ local_bh_enable();
++ mutex_unlock(&mpcb->mpcb_mutex);
++ sock_put(meta_sk); /* Taken by sock_hold */
++}
++
++void mptcp_disconnect(struct sock *sk)
++{
++ struct sock *subsk, *tmpsk;
++ struct tcp_sock *tp = tcp_sk(sk);
++
++ mptcp_delete_synack_timer(sk);
++
++ __skb_queue_purge(&tp->mpcb->reinject_queue);
++
++ if (tp->inside_tk_table) {
++ mptcp_hash_remove_bh(tp);
++ reqsk_queue_destroy(&inet_csk(tp->meta_sk)->icsk_accept_queue);
++ }
++
++ local_bh_disable();
++ mptcp_for_each_sk_safe(tp->mpcb, subsk, tmpsk) {
++ /* The socket will get removed from the subsocket-list
++ * and made non-mptcp by setting mpc to 0.
++ *
++ * This is necessary, because tcp_disconnect assumes
++ * that the connection is completely dead afterwards.
++ * Thus we need to do a mptcp_del_sock. Due to this call
++ * we have to make it non-mptcp.
++ *
++ * We have to lock the socket, because we set mpc to 0.
++ * An incoming packet would take the subsocket's lock
++ * and go on into the receive-path.
++ * This would be a race.
++ */
++
++ bh_lock_sock(subsk);
++ mptcp_del_sock(subsk);
++ tcp_sk(subsk)->mpc = 0;
++ tcp_sk(subsk)->ops = &tcp_specific;
++ mptcp_sub_force_close(subsk);
++ bh_unlock_sock(subsk);
++ }
++ local_bh_enable();
++
++ tp->was_meta_sk = 1;
++ tp->mpc = 0;
++ tp->ops = &tcp_specific;
++}
++
++
++/* Returns 1 if we should enable MPTCP for that socket. */
++int mptcp_doit(struct sock *sk)
++{
++ /* Do not allow MPTCP enabling if the MPTCP initialization failed */
++ if (mptcp_init_failed)
++ return 0;
++
++ if (sysctl_mptcp_enabled == MPTCP_APP && !tcp_sk(sk)->mptcp_enabled)
++ return 0;
++
++ /* Socket may already be established (e.g., called from tcp_recvmsg) */
++ if (mptcp(tcp_sk(sk)) || tcp_sk(sk)->request_mptcp)
++ return 1;
++
++ /* Don't do mptcp over loopback */
++ if (sk->sk_family == AF_INET &&
++ (ipv4_is_loopback(inet_sk(sk)->inet_daddr) ||
++ ipv4_is_loopback(inet_sk(sk)->inet_saddr)))
++ return 0;
++#if IS_ENABLED(CONFIG_IPV6)
++ if (sk->sk_family == AF_INET6 &&
++ (ipv6_addr_loopback(&sk->sk_v6_daddr) ||
++ ipv6_addr_loopback(&inet6_sk(sk)->saddr)))
++ return 0;
++#endif
++ if (mptcp_v6_is_v4_mapped(sk) &&
++ ipv4_is_loopback(inet_sk(sk)->inet_saddr))
++ return 0;
++
++#ifdef CONFIG_TCP_MD5SIG
++ /* If TCP_MD5SIG is enabled, do not do MPTCP - there is no Option-Space */
++ if (tcp_sk(sk)->af_specific->md5_lookup(sk, sk))
++ return 0;
++#endif
++
++ return 1;
++}
++
++int mptcp_create_master_sk(struct sock *meta_sk, __u64 remote_key, u32 window)
++{
++ struct tcp_sock *master_tp;
++ struct sock *master_sk;
++
++ if (mptcp_alloc_mpcb(meta_sk, remote_key, window))
++ goto err_alloc_mpcb;
++
++ master_sk = tcp_sk(meta_sk)->mpcb->master_sk;
++ master_tp = tcp_sk(master_sk);
++
++ if (mptcp_add_sock(meta_sk, master_sk, 0, 0, GFP_ATOMIC))
++ goto err_add_sock;
++
++ if (__inet_inherit_port(meta_sk, master_sk) < 0)
++ goto err_add_sock;
++
++ meta_sk->sk_prot->unhash(meta_sk);
++
++ if (master_sk->sk_family == AF_INET || mptcp_v6_is_v4_mapped(master_sk))
++ __inet_hash_nolisten(master_sk, NULL);
++#if IS_ENABLED(CONFIG_IPV6)
++ else
++ __inet6_hash(master_sk, NULL);
++#endif
++
++ master_tp->mptcp->init_rcv_wnd = master_tp->rcv_wnd;
++
++ return 0;
++
++err_add_sock:
++ mptcp_fallback_meta_sk(meta_sk);
++
++ inet_csk_prepare_forced_close(master_sk);
++ tcp_done(master_sk);
++ inet_csk_prepare_forced_close(meta_sk);
++ tcp_done(meta_sk);
++
++err_alloc_mpcb:
++ return -ENOBUFS;
++}
++
++static int __mptcp_check_req_master(struct sock *child,
++ struct request_sock *req)
++{
++ struct tcp_sock *child_tp = tcp_sk(child);
++ struct sock *meta_sk = child;
++ struct mptcp_cb *mpcb;
++ struct mptcp_request_sock *mtreq;
++
++ /* Never contained an MP_CAPABLE */
++ if (!inet_rsk(req)->mptcp_rqsk)
++ return 1;
++
++ if (!inet_rsk(req)->saw_mpc) {
++ /* Fallback to regular TCP, because we saw one SYN without
++ * MP_CAPABLE. In tcp_check_req we continue the regular path.
++ * But, the socket has been added to the reqsk_tk_htb, so we
++ * must still remove it.
++ */
++ mptcp_reqsk_remove_tk(req);
++ return 1;
++ }
++
++ /* Just set these values to pass them to mptcp_alloc_mpcb */
++ mtreq = mptcp_rsk(req);
++ child_tp->mptcp_loc_key = mtreq->mptcp_loc_key;
++ child_tp->mptcp_loc_token = mtreq->mptcp_loc_token;
++
++ if (mptcp_create_master_sk(meta_sk, mtreq->mptcp_rem_key,
++ child_tp->snd_wnd))
++ return -ENOBUFS;
++
++ child = tcp_sk(child)->mpcb->master_sk;
++ child_tp = tcp_sk(child);
++ mpcb = child_tp->mpcb;
++
++ child_tp->mptcp->snt_isn = tcp_rsk(req)->snt_isn;
++ child_tp->mptcp->rcv_isn = tcp_rsk(req)->rcv_isn;
++
++ mpcb->dss_csum = mtreq->dss_csum;
++ mpcb->server_side = 1;
++
++ /* Will be moved to ESTABLISHED by tcp_rcv_state_process() */
++ mptcp_update_metasocket(child, meta_sk);
++
++ /* Needs to be done here additionally, because when accepting a
++ * new connection we pass by __reqsk_free and not reqsk_free.
++ */
++ mptcp_reqsk_remove_tk(req);
++
++ /* Hold when creating the meta-sk in tcp_vX_syn_recv_sock. */
++ sock_put(meta_sk);
++
++ return 0;
++}
++
++int mptcp_check_req_fastopen(struct sock *child, struct request_sock *req)
++{
++ struct sock *meta_sk = child, *master_sk;
++ struct sk_buff *skb;
++ u32 new_mapping;
++ int ret;
++
++ ret = __mptcp_check_req_master(child, req);
++ if (ret)
++ return ret;
++
++ master_sk = tcp_sk(meta_sk)->mpcb->master_sk;
++
++ /* We need to rewind copied_seq as it is set to IDSN + 1 and as we have
++ * pre-MPTCP data in the receive queue.
++ */
++ tcp_sk(meta_sk)->copied_seq -= tcp_sk(master_sk)->rcv_nxt -
++ tcp_rsk(req)->rcv_isn - 1;
++
++ /* Map subflow sequence number to data sequence numbers. We need to map
++ * these data to [IDSN - len - 1, IDSN[.
++ */
++ new_mapping = tcp_sk(meta_sk)->copied_seq - tcp_rsk(req)->rcv_isn - 1;
++
++ /* There should be only one skb: the SYN + data. */
++ skb_queue_walk(&meta_sk->sk_receive_queue, skb) {
++ TCP_SKB_CB(skb)->seq += new_mapping;
++ TCP_SKB_CB(skb)->end_seq += new_mapping;
++ }
++
++ /* With fastopen we change the semantics of the relative subflow
++ * sequence numbers to deal with middleboxes that could add/remove
++ * multiple bytes in the SYN. We chose to start counting at rcv_nxt - 1
++ * instead of the regular TCP ISN.
++ */
++ tcp_sk(master_sk)->mptcp->rcv_isn = tcp_sk(master_sk)->rcv_nxt - 1;
++
++ /* We need to update copied_seq of the master_sk to account for the
++ * already moved data to the meta receive queue.
++ */
++ tcp_sk(master_sk)->copied_seq = tcp_sk(master_sk)->rcv_nxt;
++
++ /* Handled by the master_sk */
++ tcp_sk(meta_sk)->fastopen_rsk = NULL;
++
++ return 0;
++}
++
++int mptcp_check_req_master(struct sock *sk, struct sock *child,
++ struct request_sock *req,
++ struct request_sock **prev)
++{
++ struct sock *meta_sk = child;
++ int ret;
++
++ ret = __mptcp_check_req_master(child, req);
++ if (ret)
++ return ret;
++
++ inet_csk_reqsk_queue_unlink(sk, req, prev);
++ inet_csk_reqsk_queue_removed(sk, req);
++ inet_csk_reqsk_queue_add(sk, req, meta_sk);
++
++ return 0;
++}
++
++struct sock *mptcp_check_req_child(struct sock *meta_sk, struct sock *child,
++ struct request_sock *req,
++ struct request_sock **prev,
++ const struct mptcp_options_received *mopt)
++{
++ struct tcp_sock *child_tp = tcp_sk(child);
++ struct mptcp_request_sock *mtreq = mptcp_rsk(req);
++ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ u8 hash_mac_check[20];
++
++ child_tp->inside_tk_table = 0;
++
++ if (!mopt->join_ack)
++ goto teardown;
++
++ mptcp_hmac_sha1((u8 *)&mpcb->mptcp_rem_key,
++ (u8 *)&mpcb->mptcp_loc_key,
++ (u8 *)&mtreq->mptcp_rem_nonce,
++ (u8 *)&mtreq->mptcp_loc_nonce,
++ (u32 *)hash_mac_check);
++
++ if (memcmp(hash_mac_check, (char *)&mopt->mptcp_recv_mac, 20))
++ goto teardown;
++
++ /* Point it to the same struct socket and wq as the meta_sk */
++ sk_set_socket(child, meta_sk->sk_socket);
++ child->sk_wq = meta_sk->sk_wq;
++
++ if (mptcp_add_sock(meta_sk, child, mtreq->loc_id, mtreq->rem_id, GFP_ATOMIC)) {
++ /* Has been inherited, but now child_tp->mptcp is NULL */
++ child_tp->mpc = 0;
++ child_tp->ops = &tcp_specific;
++
++ /* TODO when we support acking the third ack for new subflows,
++ * we should silently discard this third ack, by returning NULL.
++ *
++ * Maybe, at the retransmission we will have enough memory to
++ * fully add the socket to the meta-sk.
++ */
++ goto teardown;
++ }
++
++ /* The child is a clone of the meta socket, we must now reset
++ * some of the fields
++ */
++ child_tp->mptcp->rcv_low_prio = mtreq->rcv_low_prio;
++
++ /* We should allow proper increase of the snd/rcv-buffers. Thus, we
++ * use the original values instead of the bloated up ones from the
++ * clone.
++ */
++ child->sk_sndbuf = mpcb->orig_sk_sndbuf;
++ child->sk_rcvbuf = mpcb->orig_sk_rcvbuf;
++
++ child_tp->mptcp->slave_sk = 1;
++ child_tp->mptcp->snt_isn = tcp_rsk(req)->snt_isn;
++ child_tp->mptcp->rcv_isn = tcp_rsk(req)->rcv_isn;
++ child_tp->mptcp->init_rcv_wnd = req->rcv_wnd;
++
++ child_tp->tsq_flags = 0;
++
++ /* Subflows do not use the accept queue, as they
++ * are attached immediately to the mpcb.
++ */
++ inet_csk_reqsk_queue_unlink(meta_sk, req, prev);
++ reqsk_queue_removed(&inet_csk(meta_sk)->icsk_accept_queue, req);
++ reqsk_free(req);
++ return child;
++
++teardown:
++ /* Drop this request - sock creation failed. */
++ inet_csk_reqsk_queue_unlink(meta_sk, req, prev);
++ reqsk_queue_removed(&inet_csk(meta_sk)->icsk_accept_queue, req);
++ reqsk_free(req);
++ inet_csk_prepare_forced_close(child);
++ tcp_done(child);
++ return meta_sk;
++}
++
++int mptcp_init_tw_sock(struct sock *sk, struct tcp_timewait_sock *tw)
++{
++ struct mptcp_tw *mptw;
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct mptcp_cb *mpcb = tp->mpcb;
++
++ /* A subsocket in tw can only receive data. So, if we are in
++ * infinite-receive, then we should not reply with a data-ack or act
++ * upon general MPTCP-signaling. We prevent this by simply not creating
++ * the mptcp_tw_sock.
++ */
++ if (mpcb->infinite_mapping_rcv) {
++ tw->mptcp_tw = NULL;
++ return 0;
++ }
++
++ /* Alloc MPTCP-tw-sock */
++ mptw = kmem_cache_alloc(mptcp_tw_cache, GFP_ATOMIC);
++ if (!mptw)
++ return -ENOBUFS;
++
++ atomic_inc(&mpcb->mpcb_refcnt);
++
++ tw->mptcp_tw = mptw;
++ mptw->loc_key = mpcb->mptcp_loc_key;
++ mptw->meta_tw = mpcb->in_time_wait;
++ if (mptw->meta_tw) {
++ mptw->rcv_nxt = mptcp_get_rcv_nxt_64(mptcp_meta_tp(tp));
++ if (mpcb->mptw_state != TCP_TIME_WAIT)
++ mptw->rcv_nxt++;
++ }
++ rcu_assign_pointer(mptw->mpcb, mpcb);
++
++ spin_lock(&mpcb->tw_lock);
++ list_add_rcu(&mptw->list, &tp->mpcb->tw_list);
++ mptw->in_list = 1;
++ spin_unlock(&mpcb->tw_lock);
++
++ return 0;
++}
++
++void mptcp_twsk_destructor(struct tcp_timewait_sock *tw)
++{
++ struct mptcp_cb *mpcb;
++
++ rcu_read_lock();
++ mpcb = rcu_dereference(tw->mptcp_tw->mpcb);
++
++ /* If we are still holding a ref to the mpcb, we have to remove ourselves
++ * from the list and drop the ref properly.
++ */
++ if (mpcb && atomic_inc_not_zero(&mpcb->mpcb_refcnt)) {
++ spin_lock(&mpcb->tw_lock);
++ if (tw->mptcp_tw->in_list) {
++ list_del_rcu(&tw->mptcp_tw->list);
++ tw->mptcp_tw->in_list = 0;
++ }
++ spin_unlock(&mpcb->tw_lock);
++
++ /* Twice, because we increased it above */
++ mptcp_mpcb_put(mpcb);
++ mptcp_mpcb_put(mpcb);
++ }
++
++ rcu_read_unlock();
++
++ kmem_cache_free(mptcp_tw_cache, tw->mptcp_tw);
++}
++
++/* Updates the rcv_nxt of the time-wait-socks and allows them to ack a
++ * data-fin.
++ */
++void mptcp_time_wait(struct sock *sk, int state, int timeo)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct mptcp_tw *mptw;
++
++ /* Used for sockets that go into tw after the meta
++ * (see mptcp_init_tw_sock())
++ */
++ tp->mpcb->in_time_wait = 1;
++ tp->mpcb->mptw_state = state;
++
++ /* Update the time-wait-sock's information */
++ rcu_read_lock_bh();
++ list_for_each_entry_rcu(mptw, &tp->mpcb->tw_list, list) {
++ mptw->meta_tw = 1;
++ mptw->rcv_nxt = mptcp_get_rcv_nxt_64(tp);
++
++ /* We want to ack a DATA_FIN, but are still in FIN_WAIT_2 -
++ * pretend as if the DATA_FIN has already reached us, so that
++ * the checks in tcp_timewait_state_process succeed when the
++ * DATA_FIN comes in.
++ */
++ if (state != TCP_TIME_WAIT)
++ mptw->rcv_nxt++;
++ }
++ rcu_read_unlock_bh();
++
++ tcp_done(sk);
++}
++
++void mptcp_tsq_flags(struct sock *sk)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct sock *meta_sk = mptcp_meta_sk(sk);
++
++ /* It will be handled as a regular deferred-call */
++ if (is_meta_sk(sk))
++ return;
++
++ if (hlist_unhashed(&tp->mptcp->cb_list)) {
++ hlist_add_head(&tp->mptcp->cb_list, &tp->mpcb->callback_list);
++ /* We need to hold the socket here, as the sock_hold is not
++ * guaranteed by release_sock as it is in regular TCP.
++ *
++ * The subsocket may get inet_csk_destroy'd while it is inside
++ * the callback_list.
++ */
++ sock_hold(sk);
++ }
++
++ if (!test_and_set_bit(MPTCP_SUB_DEFERRED, &tcp_sk(meta_sk)->tsq_flags))
++ sock_hold(meta_sk);
++}
++
++void mptcp_tsq_sub_deferred(struct sock *meta_sk)
++{
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct mptcp_tcp_sock *mptcp;
++ struct hlist_node *tmp;
++
++ BUG_ON(!is_meta_sk(meta_sk) && !meta_tp->was_meta_sk);
++
++ __sock_put(meta_sk);
++ hlist_for_each_entry_safe(mptcp, tmp, &meta_tp->mpcb->callback_list, cb_list) {
++ struct tcp_sock *tp = mptcp->tp;
++ struct sock *sk = (struct sock *)tp;
++
++ hlist_del_init(&mptcp->cb_list);
++ sk->sk_prot->release_cb(sk);
++ /* Final sock_put (cfr. mptcp_tsq_flags) */
++ sock_put(sk);
++ }
++}
++
++void mptcp_join_reqsk_init(struct mptcp_cb *mpcb, const struct request_sock *req,
++ struct sk_buff *skb)
++{
++ struct mptcp_request_sock *mtreq = mptcp_rsk(req);
++ struct mptcp_options_received mopt;
++ u8 mptcp_hash_mac[20];
++
++ mptcp_init_mp_opt(&mopt);
++ tcp_parse_mptcp_options(skb, &mopt);
++
++ mtreq = mptcp_rsk(req);
++ mtreq->mptcp_mpcb = mpcb;
++ mtreq->is_sub = 1;
++ inet_rsk(req)->mptcp_rqsk = 1;
++
++ mtreq->mptcp_rem_nonce = mopt.mptcp_recv_nonce;
++
++ mptcp_hmac_sha1((u8 *)&mpcb->mptcp_loc_key,
++ (u8 *)&mpcb->mptcp_rem_key,
++ (u8 *)&mtreq->mptcp_loc_nonce,
++ (u8 *)&mtreq->mptcp_rem_nonce, (u32 *)mptcp_hash_mac);
++ mtreq->mptcp_hash_tmac = *(u64 *)mptcp_hash_mac;
++
++ mtreq->rem_id = mopt.rem_id;
++ mtreq->rcv_low_prio = mopt.low_prio;
++ inet_rsk(req)->saw_mpc = 1;
++}
++
++void mptcp_reqsk_init(struct request_sock *req, const struct sk_buff *skb)
++{
++ struct mptcp_options_received mopt;
++ struct mptcp_request_sock *mreq = mptcp_rsk(req);
++
++ mptcp_init_mp_opt(&mopt);
++ tcp_parse_mptcp_options(skb, &mopt);
++
++ mreq->is_sub = 0;
++ inet_rsk(req)->mptcp_rqsk = 1;
++ mreq->dss_csum = mopt.dss_csum;
++ mreq->hash_entry.pprev = NULL;
++
++ mptcp_reqsk_new_mptcp(req, &mopt, skb);
++}
++
++int mptcp_conn_request(struct sock *sk, struct sk_buff *skb)
++{
++ struct mptcp_options_received mopt;
++ const struct tcp_sock *tp = tcp_sk(sk);
++ __u32 isn = TCP_SKB_CB(skb)->when;
++ bool want_cookie = false;
++
++ if ((sysctl_tcp_syncookies == 2 ||
++ inet_csk_reqsk_queue_is_full(sk)) && !isn) {
++ want_cookie = tcp_syn_flood_action(sk, skb,
++ mptcp_request_sock_ops.slab_name);
++ if (!want_cookie)
++ goto drop;
++ }
++
++ mptcp_init_mp_opt(&mopt);
++ tcp_parse_mptcp_options(skb, &mopt);
++
++ if (mopt.is_mp_join)
++ return mptcp_do_join_short(skb, &mopt, sock_net(sk));
++ if (mopt.drop_me)
++ goto drop;
++
++ if (sysctl_mptcp_enabled == MPTCP_APP && !tp->mptcp_enabled)
++ mopt.saw_mpc = 0;
++
++ if (skb->protocol == htons(ETH_P_IP)) {
++ if (mopt.saw_mpc && !want_cookie) {
++ if (skb_rtable(skb)->rt_flags &
++ (RTCF_BROADCAST | RTCF_MULTICAST))
++ goto drop;
++
++ return tcp_conn_request(&mptcp_request_sock_ops,
++ &mptcp_request_sock_ipv4_ops,
++ sk, skb);
++ }
++
++ return tcp_v4_conn_request(sk, skb);
++#if IS_ENABLED(CONFIG_IPV6)
++ } else {
++ if (mopt.saw_mpc && !want_cookie) {
++ if (!ipv6_unicast_destination(skb))
++ goto drop;
++
++ return tcp_conn_request(&mptcp6_request_sock_ops,
++ &mptcp_request_sock_ipv6_ops,
++ sk, skb);
++ }
++
++ return tcp_v6_conn_request(sk, skb);
++#endif
++ }
++drop:
++ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENDROPS);
++ return 0;
++}
++
++struct workqueue_struct *mptcp_wq;
++EXPORT_SYMBOL(mptcp_wq);
++
++/* Output /proc/net/mptcp */
++static int mptcp_pm_seq_show(struct seq_file *seq, void *v)
++{
++ struct tcp_sock *meta_tp;
++ const struct net *net = seq->private;
++ int i, n = 0;
++
++ seq_printf(seq, " sl loc_tok rem_tok v6 local_address remote_address st ns tx_queue rx_queue inode");
++ seq_putc(seq, '\n');
++
++ for (i = 0; i < MPTCP_HASH_SIZE; i++) {
++ struct hlist_nulls_node *node;
++ rcu_read_lock_bh();
++ hlist_nulls_for_each_entry_rcu(meta_tp, node,
++ &tk_hashtable[i], tk_table) {
++ struct mptcp_cb *mpcb = meta_tp->mpcb;
++ struct sock *meta_sk = (struct sock *)meta_tp;
++ struct inet_sock *isk = inet_sk(meta_sk);
++
++ if (!mptcp(meta_tp) || !net_eq(net, sock_net(meta_sk)))
++ continue;
++
++ if (capable(CAP_NET_ADMIN)) {
++ seq_printf(seq, "%4d: %04X %04X ", n++,
++ mpcb->mptcp_loc_token,
++ mpcb->mptcp_rem_token);
++ } else {
++ seq_printf(seq, "%4d: %04X %04X ", n++, -1, -1);
++ }
++ if (meta_sk->sk_family == AF_INET ||
++ mptcp_v6_is_v4_mapped(meta_sk)) {
++ seq_printf(seq, " 0 %08X:%04X %08X:%04X ",
++ isk->inet_rcv_saddr,
++ ntohs(isk->inet_sport),
++ isk->inet_daddr,
++ ntohs(isk->inet_dport));
++#if IS_ENABLED(CONFIG_IPV6)
++ } else if (meta_sk->sk_family == AF_INET6) {
++ struct in6_addr *src = &meta_sk->sk_v6_rcv_saddr;
++ struct in6_addr *dst = &meta_sk->sk_v6_daddr;
++ seq_printf(seq, " 1 %08X%08X%08X%08X:%04X %08X%08X%08X%08X:%04X",
++ src->s6_addr32[0], src->s6_addr32[1],
++ src->s6_addr32[2], src->s6_addr32[3],
++ ntohs(isk->inet_sport),
++ dst->s6_addr32[0], dst->s6_addr32[1],
++ dst->s6_addr32[2], dst->s6_addr32[3],
++ ntohs(isk->inet_dport));
++#endif
++ }
++ seq_printf(seq, " %02X %02X %08X:%08X %lu",
++ meta_sk->sk_state, mpcb->cnt_subflows,
++ meta_tp->write_seq - meta_tp->snd_una,
++ max_t(int, meta_tp->rcv_nxt -
++ meta_tp->copied_seq, 0),
++ sock_i_ino(meta_sk));
++ seq_putc(seq, '\n');
++ }
++
++ rcu_read_unlock_bh();
++ }
++
++ return 0;
++}
++
++static int mptcp_pm_seq_open(struct inode *inode, struct file *file)
++{
++ return single_open_net(inode, file, mptcp_pm_seq_show);
++}
++
++static const struct file_operations mptcp_pm_seq_fops = {
++ .owner = THIS_MODULE,
++ .open = mptcp_pm_seq_open,
++ .read = seq_read,
++ .llseek = seq_lseek,
++ .release = single_release_net,
++};
++
++static int mptcp_pm_init_net(struct net *net)
++{
++ if (!proc_create("mptcp", S_IRUGO, net->proc_net, &mptcp_pm_seq_fops))
++ return -ENOMEM;
++
++ return 0;
++}
++
++static void mptcp_pm_exit_net(struct net *net)
++{
++ remove_proc_entry("mptcp", net->proc_net);
++}
++
++static struct pernet_operations mptcp_pm_proc_ops = {
++ .init = mptcp_pm_init_net,
++ .exit = mptcp_pm_exit_net,
++};
++
++/* General initialization of mptcp */
++void __init mptcp_init(void)
++{
++ int i;
++ struct ctl_table_header *mptcp_sysctl;
++
++ mptcp_sock_cache = kmem_cache_create("mptcp_sock",
++ sizeof(struct mptcp_tcp_sock),
++ 0, SLAB_HWCACHE_ALIGN,
++ NULL);
++ if (!mptcp_sock_cache)
++ goto mptcp_sock_cache_failed;
++
++ mptcp_cb_cache = kmem_cache_create("mptcp_cb", sizeof(struct mptcp_cb),
++ 0, SLAB_DESTROY_BY_RCU|SLAB_HWCACHE_ALIGN,
++ NULL);
++ if (!mptcp_cb_cache)
++ goto mptcp_cb_cache_failed;
++
++ mptcp_tw_cache = kmem_cache_create("mptcp_tw", sizeof(struct mptcp_tw),
++ 0, SLAB_DESTROY_BY_RCU|SLAB_HWCACHE_ALIGN,
++ NULL);
++ if (!mptcp_tw_cache)
++ goto mptcp_tw_cache_failed;
++
++ get_random_bytes(mptcp_secret, sizeof(mptcp_secret));
++
++ mptcp_wq = alloc_workqueue("mptcp_wq", WQ_UNBOUND | WQ_MEM_RECLAIM, 8);
++ if (!mptcp_wq)
++ goto alloc_workqueue_failed;
++
++ for (i = 0; i < MPTCP_HASH_SIZE; i++) {
++ INIT_HLIST_NULLS_HEAD(&tk_hashtable[i], i);
++ INIT_HLIST_NULLS_HEAD(&mptcp_reqsk_htb[i],
++ i + MPTCP_REQSK_NULLS_BASE);
++ INIT_HLIST_NULLS_HEAD(&mptcp_reqsk_tk_htb[i], i);
++ }
++
++ spin_lock_init(&mptcp_reqsk_hlock);
++ spin_lock_init(&mptcp_tk_hashlock);
++
++ if (register_pernet_subsys(&mptcp_pm_proc_ops))
++ goto pernet_failed;
++
++#if IS_ENABLED(CONFIG_IPV6)
++ if (mptcp_pm_v6_init())
++ goto mptcp_pm_v6_failed;
++#endif
++ if (mptcp_pm_v4_init())
++ goto mptcp_pm_v4_failed;
++
++ mptcp_sysctl = register_net_sysctl(&init_net, "net/mptcp", mptcp_table);
++ if (!mptcp_sysctl)
++ goto register_sysctl_failed;
++
++ if (mptcp_register_path_manager(&mptcp_pm_default))
++ goto register_pm_failed;
++
++ if (mptcp_register_scheduler(&mptcp_sched_default))
++ goto register_sched_failed;
++
++ pr_info("MPTCP: Stable release v0.89.0-rc\n");
++
++ mptcp_init_failed = false;
++
++ return;
++
++register_sched_failed:
++ mptcp_unregister_path_manager(&mptcp_pm_default);
++register_pm_failed:
++ unregister_net_sysctl_table(mptcp_sysctl);
++register_sysctl_failed:
++ mptcp_pm_v4_undo();
++mptcp_pm_v4_failed:
++#if IS_ENABLED(CONFIG_IPV6)
++ mptcp_pm_v6_undo();
++mptcp_pm_v6_failed:
++#endif
++ unregister_pernet_subsys(&mptcp_pm_proc_ops);
++pernet_failed:
++ destroy_workqueue(mptcp_wq);
++alloc_workqueue_failed:
++ kmem_cache_destroy(mptcp_tw_cache);
++mptcp_tw_cache_failed:
++ kmem_cache_destroy(mptcp_cb_cache);
++mptcp_cb_cache_failed:
++ kmem_cache_destroy(mptcp_sock_cache);
++mptcp_sock_cache_failed:
++ mptcp_init_failed = true;
++}
+diff --git a/net/mptcp/mptcp_fullmesh.c b/net/mptcp/mptcp_fullmesh.c
+new file mode 100644
+index 000000000000..3a54413ce25b
+--- /dev/null
++++ b/net/mptcp/mptcp_fullmesh.c
+@@ -0,0 +1,1722 @@
++#include <linux/module.h>
++
++#include <net/mptcp.h>
++#include <net/mptcp_v4.h>
++
++#if IS_ENABLED(CONFIG_IPV6)
++#include <net/mptcp_v6.h>
++#include <net/addrconf.h>
++#endif
++
++enum {
++ MPTCP_EVENT_ADD = 1,
++ MPTCP_EVENT_DEL,
++ MPTCP_EVENT_MOD,
++};
++
++#define MPTCP_SUBFLOW_RETRY_DELAY 1000
++
++/* Max number of local or remote addresses we can store.
++ * When changing, see the bitfield below in fullmesh_rem4/6.
++ */
++#define MPTCP_MAX_ADDR 8
++
++struct fullmesh_rem4 {
++ u8 rem4_id;
++ u8 bitfield;
++ u8 retry_bitfield;
++ __be16 port;
++ struct in_addr addr;
++};
++
++struct fullmesh_rem6 {
++ u8 rem6_id;
++ u8 bitfield;
++ u8 retry_bitfield;
++ __be16 port;
++ struct in6_addr addr;
++};
++
++struct mptcp_loc_addr {
++ struct mptcp_loc4 locaddr4[MPTCP_MAX_ADDR];
++ u8 loc4_bits;
++ u8 next_v4_index;
++
++ struct mptcp_loc6 locaddr6[MPTCP_MAX_ADDR];
++ u8 loc6_bits;
++ u8 next_v6_index;
++};
++
++struct mptcp_addr_event {
++ struct list_head list;
++ unsigned short family;
++ u8 code:7,
++ low_prio:1;
++ union inet_addr addr;
++};
++
++struct fullmesh_priv {
++ /* Worker struct for subflow establishment */
++ struct work_struct subflow_work;
++ /* Delayed worker, when the routing-tables are not yet ready. */
++ struct delayed_work subflow_retry_work;
++
++ /* Remote addresses */
++ struct fullmesh_rem4 remaddr4[MPTCP_MAX_ADDR];
++ struct fullmesh_rem6 remaddr6[MPTCP_MAX_ADDR];
++
++ struct mptcp_cb *mpcb;
++
++ u16 remove_addrs; /* Addresses to remove */
++ u8 announced_addrs_v4; /* IPv4 Addresses we did announce */
++ u8 announced_addrs_v6; /* IPv6 Addresses we did announce */
++
++ u8 add_addr; /* Are we sending an add_addr? */
++
++ u8 rem4_bits;
++ u8 rem6_bits;
++};
++
++struct mptcp_fm_ns {
++ struct mptcp_loc_addr __rcu *local;
++ spinlock_t local_lock; /* Protecting the above pointer */
++ struct list_head events;
++ struct delayed_work address_worker;
++
++ struct net *net;
++};
++
++static struct mptcp_pm_ops full_mesh __read_mostly;
++
++static void full_mesh_create_subflows(struct sock *meta_sk);
++
++static struct mptcp_fm_ns *fm_get_ns(const struct net *net)
++{
++ return (struct mptcp_fm_ns *)net->mptcp.path_managers[MPTCP_PM_FULLMESH];
++}
++
++static struct fullmesh_priv *fullmesh_get_priv(const struct mptcp_cb *mpcb)
++{
++ return (struct fullmesh_priv *)&mpcb->mptcp_pm[0];
++}
++
++/* Find the first free index in the bitfield */
++static int __mptcp_find_free_index(u8 bitfield, u8 base)
++{
++ int i;
++
++ /* There are no free bits anyway... */
++ if (bitfield == 0xff)
++ goto exit;
++
++ i = ffs(~(bitfield >> base)) - 1;
++ if (i < 0)
++ goto exit;
++
++ /* No free bits when starting at base, try from 0 on */
++ if (i + base >= sizeof(bitfield) * 8)
++ return __mptcp_find_free_index(bitfield, 0);
++
++ return i + base;
++exit:
++ return -1;
++}
++
++static int mptcp_find_free_index(u8 bitfield)
++{
++ return __mptcp_find_free_index(bitfield, 0);
++}
++
++static void mptcp_addv4_raddr(struct mptcp_cb *mpcb,
++ const struct in_addr *addr,
++ __be16 port, u8 id)
++{
++ int i;
++ struct fullmesh_rem4 *rem4;
++ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
++
++ mptcp_for_each_bit_set(fmp->rem4_bits, i) {
++ rem4 = &fmp->remaddr4[i];
++
++ /* Address is already in the list --- continue */
++ if (rem4->rem4_id == id &&
++ rem4->addr.s_addr == addr->s_addr && rem4->port == port)
++ return;
++
++ /* This can happen when the peer is behind a NAT: it is
++ * trying to JOIN, thus sending the JOIN with a certain ID,
++ * but the src_addr of the IP packet has been rewritten. We
++ * update the addr in the list, because this is the address
++ * as our host sees it.
++ */
++ if (rem4->rem4_id == id && rem4->addr.s_addr != addr->s_addr) {
++ /* update the address */
++ mptcp_debug("%s: updating old addr:%pI4 to addr %pI4 with id:%d\n",
++ __func__, &rem4->addr.s_addr,
++ &addr->s_addr, id);
++ rem4->addr.s_addr = addr->s_addr;
++ rem4->port = port;
++ mpcb->list_rcvd = 1;
++ return;
++ }
++ }
++
++ i = mptcp_find_free_index(fmp->rem4_bits);
++ /* Do we already have the maximum number of local/remote addresses? */
++ if (i < 0) {
++ mptcp_debug("%s: At max num of remote addresses: %d --- not adding address: %pI4\n",
++ __func__, MPTCP_MAX_ADDR, &addr->s_addr);
++ return;
++ }
++
++ rem4 = &fmp->remaddr4[i];
++
++ /* Address is not known yet, store it */
++ rem4->addr.s_addr = addr->s_addr;
++ rem4->port = port;
++ rem4->bitfield = 0;
++ rem4->retry_bitfield = 0;
++ rem4->rem4_id = id;
++ mpcb->list_rcvd = 1;
++ fmp->rem4_bits |= (1 << i);
++
++ return;
++}
++
++static void mptcp_addv6_raddr(struct mptcp_cb *mpcb,
++ const struct in6_addr *addr,
++ __be16 port, u8 id)
++{
++ int i;
++ struct fullmesh_rem6 *rem6;
++ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
++
++ mptcp_for_each_bit_set(fmp->rem6_bits, i) {
++ rem6 = &fmp->remaddr6[i];
++
++ /* Address is already in the list --- continue */
++ if (rem6->rem6_id == id &&
++ ipv6_addr_equal(&rem6->addr, addr) && rem6->port == port)
++ return;
++
++ /* This can happen when the peer is behind a NAT: it is
++ * trying to JOIN, thus sending the JOIN with a certain ID,
++ * but the src_addr of the IP packet has been rewritten. We
++ * update the addr in the list, because this is the address
++ * as our host sees it.
++ */
++ if (rem6->rem6_id == id) {
++ /* update the address */
++ mptcp_debug("%s: updating old addr: %pI6 to addr %pI6 with id:%d\n",
++ __func__, &rem6->addr, addr, id);
++ rem6->addr = *addr;
++ rem6->port = port;
++ mpcb->list_rcvd = 1;
++ return;
++ }
++ }
++
++ i = mptcp_find_free_index(fmp->rem6_bits);
++ /* Do we already have the maximum number of local/remote addresses? */
++ if (i < 0) {
++ mptcp_debug("%s: At max num of remote addresses: %d --- not adding address: %pI6\n",
++ __func__, MPTCP_MAX_ADDR, addr);
++ return;
++ }
++
++ rem6 = &fmp->remaddr6[i];
++
++ /* Address is not known yet, store it */
++ rem6->addr = *addr;
++ rem6->port = port;
++ rem6->bitfield = 0;
++ rem6->retry_bitfield = 0;
++ rem6->rem6_id = id;
++ mpcb->list_rcvd = 1;
++ fmp->rem6_bits |= (1 << i);
++
++ return;
++}
++
++static void mptcp_v4_rem_raddress(struct mptcp_cb *mpcb, u8 id)
++{
++ int i;
++ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
++
++ mptcp_for_each_bit_set(fmp->rem4_bits, i) {
++ if (fmp->remaddr4[i].rem4_id == id) {
++ /* remove address from bitfield */
++ fmp->rem4_bits &= ~(1 << i);
++
++ break;
++ }
++ }
++}
++
++static void mptcp_v6_rem_raddress(const struct mptcp_cb *mpcb, u8 id)
++{
++ int i;
++ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
++
++ mptcp_for_each_bit_set(fmp->rem6_bits, i) {
++ if (fmp->remaddr6[i].rem6_id == id) {
++ /* remove address from bitfield */
++ fmp->rem6_bits &= ~(1 << i);
++
++ break;
++ }
++ }
++}
++
++/* Sets the bitfield of the remote-address field */
++static void mptcp_v4_set_init_addr_bit(const struct mptcp_cb *mpcb,
++ const struct in_addr *addr, u8 index)
++{
++ int i;
++ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
++
++ mptcp_for_each_bit_set(fmp->rem4_bits, i) {
++ if (fmp->remaddr4[i].addr.s_addr == addr->s_addr) {
++ fmp->remaddr4[i].bitfield |= (1 << index);
++ return;
++ }
++ }
++}
++
++/* Sets the bitfield of the remote-address field */
++static void mptcp_v6_set_init_addr_bit(struct mptcp_cb *mpcb,
++ const struct in6_addr *addr, u8 index)
++{
++ int i;
++ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
++
++ mptcp_for_each_bit_set(fmp->rem6_bits, i) {
++ if (ipv6_addr_equal(&fmp->remaddr6[i].addr, addr)) {
++ fmp->remaddr6[i].bitfield |= (1 << index);
++ return;
++ }
++ }
++}
++
++static void mptcp_set_init_addr_bit(struct mptcp_cb *mpcb,
++ const union inet_addr *addr,
++ sa_family_t family, u8 id)
++{
++ if (family == AF_INET)
++ mptcp_v4_set_init_addr_bit(mpcb, &addr->in, id);
++ else
++ mptcp_v6_set_init_addr_bit(mpcb, &addr->in6, id);
++}
++
++static void retry_subflow_worker(struct work_struct *work)
++{
++ struct delayed_work *delayed_work = container_of(work,
++ struct delayed_work,
++ work);
++ struct fullmesh_priv *fmp = container_of(delayed_work,
++ struct fullmesh_priv,
++ subflow_retry_work);
++ struct mptcp_cb *mpcb = fmp->mpcb;
++ struct sock *meta_sk = mpcb->meta_sk;
++ struct mptcp_loc_addr *mptcp_local;
++ struct mptcp_fm_ns *fm_ns = fm_get_ns(sock_net(meta_sk));
++ int iter = 0, i;
++
++ /* We need a local (stable) copy of the address-list. It is not a
++ * big deal if the address-list is not 100% up-to-date.
++ */
++ rcu_read_lock_bh();
++ mptcp_local = rcu_dereference_bh(fm_ns->local);
++ mptcp_local = kmemdup(mptcp_local, sizeof(*mptcp_local), GFP_ATOMIC);
++ rcu_read_unlock_bh();
++
++ if (!mptcp_local)
++ return;
++
++next_subflow:
++ if (iter) {
++ release_sock(meta_sk);
++ mutex_unlock(&mpcb->mpcb_mutex);
++
++ cond_resched();
++ }
++ mutex_lock(&mpcb->mpcb_mutex);
++ lock_sock_nested(meta_sk, SINGLE_DEPTH_NESTING);
++
++ iter++;
++
++ if (sock_flag(meta_sk, SOCK_DEAD))
++ goto exit;
++
++ mptcp_for_each_bit_set(fmp->rem4_bits, i) {
++ struct fullmesh_rem4 *rem = &fmp->remaddr4[i];
++ /* Do we need to retry establishing a subflow? */
++ if (rem->retry_bitfield) {
++ int i = mptcp_find_free_index(~rem->retry_bitfield);
++ struct mptcp_rem4 rem4;
++
++ rem->bitfield |= (1 << i);
++ rem->retry_bitfield &= ~(1 << i);
++
++ rem4.addr = rem->addr;
++ rem4.port = rem->port;
++ rem4.rem4_id = rem->rem4_id;
++
++ mptcp_init4_subsockets(meta_sk, &mptcp_local->locaddr4[i], &rem4);
++ goto next_subflow;
++ }
++ }
++
++#if IS_ENABLED(CONFIG_IPV6)
++ mptcp_for_each_bit_set(fmp->rem6_bits, i) {
++ struct fullmesh_rem6 *rem = &fmp->remaddr6[i];
++
++ /* Do we need to retry establishing a subflow? */
++ if (rem->retry_bitfield) {
++ int i = mptcp_find_free_index(~rem->retry_bitfield);
++ struct mptcp_rem6 rem6;
++
++ rem->bitfield |= (1 << i);
++ rem->retry_bitfield &= ~(1 << i);
++
++ rem6.addr = rem->addr;
++ rem6.port = rem->port;
++ rem6.rem6_id = rem->rem6_id;
++
++ mptcp_init6_subsockets(meta_sk, &mptcp_local->locaddr6[i], &rem6);
++ goto next_subflow;
++ }
++ }
++#endif
++
++exit:
++ kfree(mptcp_local);
++ release_sock(meta_sk);
++ mutex_unlock(&mpcb->mpcb_mutex);
++ sock_put(meta_sk);
++}
++
++/**
++ * Create all new subflows by calling mptcp_initX_subsockets.
++ *
++ * This function uses a goto next_subflow to release the lock between
++ * new subflows, giving other processes a chance to do some work on the
++ * socket and potentially finish the communication.
++ **/
++static void create_subflow_worker(struct work_struct *work)
++{
++ struct fullmesh_priv *fmp = container_of(work, struct fullmesh_priv,
++ subflow_work);
++ struct mptcp_cb *mpcb = fmp->mpcb;
++ struct sock *meta_sk = mpcb->meta_sk;
++ struct mptcp_loc_addr *mptcp_local;
++ const struct mptcp_fm_ns *fm_ns = fm_get_ns(sock_net(meta_sk));
++ int iter = 0, retry = 0;
++ int i;
++
++ /* We need a local (stable) copy of the address-list. It is not a
++ * big deal if the address-list is not 100% up-to-date.
++ */
++ rcu_read_lock_bh();
++ mptcp_local = rcu_dereference_bh(fm_ns->local);
++ mptcp_local = kmemdup(mptcp_local, sizeof(*mptcp_local), GFP_ATOMIC);
++ rcu_read_unlock_bh();
++
++ if (!mptcp_local)
++ return;
++
++next_subflow:
++ if (iter) {
++ release_sock(meta_sk);
++ mutex_unlock(&mpcb->mpcb_mutex);
++
++ cond_resched();
++ }
++ mutex_lock(&mpcb->mpcb_mutex);
++ lock_sock_nested(meta_sk, SINGLE_DEPTH_NESTING);
++
++ iter++;
++
++ if (sock_flag(meta_sk, SOCK_DEAD))
++ goto exit;
++
++ if (mpcb->master_sk &&
++ !tcp_sk(mpcb->master_sk)->mptcp->fully_established)
++ goto exit;
++
++ mptcp_for_each_bit_set(fmp->rem4_bits, i) {
++ struct fullmesh_rem4 *rem;
++ u8 remaining_bits;
++
++ rem = &fmp->remaddr4[i];
++ remaining_bits = ~(rem->bitfield) & mptcp_local->loc4_bits;
++
++ /* Are there still combinations to handle? */
++ if (remaining_bits) {
++ int i = mptcp_find_free_index(~remaining_bits);
++ struct mptcp_rem4 rem4;
++
++ rem->bitfield |= (1 << i);
++
++ rem4.addr = rem->addr;
++ rem4.port = rem->port;
++ rem4.rem4_id = rem->rem4_id;
++
++ /* If a route is not yet available then retry once */
++ if (mptcp_init4_subsockets(meta_sk, &mptcp_local->locaddr4[i],
++ &rem4) == -ENETUNREACH)
++ retry = rem->retry_bitfield |= (1 << i);
++ goto next_subflow;
++ }
++ }
++
++#if IS_ENABLED(CONFIG_IPV6)
++ mptcp_for_each_bit_set(fmp->rem6_bits, i) {
++ struct fullmesh_rem6 *rem;
++ u8 remaining_bits;
++
++ rem = &fmp->remaddr6[i];
++ remaining_bits = ~(rem->bitfield) & mptcp_local->loc6_bits;
++
++ /* Are there still combinations to handle? */
++ if (remaining_bits) {
++ int i = mptcp_find_free_index(~remaining_bits);
++ struct mptcp_rem6 rem6;
++
++ rem->bitfield |= (1 << i);
++
++ rem6.addr = rem->addr;
++ rem6.port = rem->port;
++ rem6.rem6_id = rem->rem6_id;
++
++ /* If a route is not yet available then retry once */
++ if (mptcp_init6_subsockets(meta_sk, &mptcp_local->locaddr6[i],
++ &rem6) == -ENETUNREACH)
++ retry = rem->retry_bitfield |= (1 << i);
++ goto next_subflow;
++ }
++ }
++#endif
++
++ if (retry && !delayed_work_pending(&fmp->subflow_retry_work)) {
++ sock_hold(meta_sk);
++ queue_delayed_work(mptcp_wq, &fmp->subflow_retry_work,
++ msecs_to_jiffies(MPTCP_SUBFLOW_RETRY_DELAY));
++ }
++
++exit:
++ kfree(mptcp_local);
++ release_sock(meta_sk);
++ mutex_unlock(&mpcb->mpcb_mutex);
++ sock_put(meta_sk);
++}
++
++static void announce_remove_addr(u8 addr_id, struct sock *meta_sk)
++{
++ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
++ struct sock *sk = mptcp_select_ack_sock(meta_sk);
++
++ fmp->remove_addrs |= (1 << addr_id);
++ mpcb->addr_signal = 1;
++
++ if (sk)
++ tcp_send_ack(sk);
++}
++
++static void update_addr_bitfields(struct sock *meta_sk,
++ const struct mptcp_loc_addr *mptcp_local)
++{
++ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
++ int i;
++
++ /* The bits in announced_addrs_* always match with loc*_bits, so a
++ * simple & operation unsets the correct bits, because these go from
++ * announced to non-announced.
++ */
++ fmp->announced_addrs_v4 &= mptcp_local->loc4_bits;
++
++ mptcp_for_each_bit_set(fmp->rem4_bits, i) {
++ fmp->remaddr4[i].bitfield &= mptcp_local->loc4_bits;
++ fmp->remaddr4[i].retry_bitfield &= mptcp_local->loc4_bits;
++ }
++
++ fmp->announced_addrs_v6 &= mptcp_local->loc6_bits;
++
++ mptcp_for_each_bit_set(fmp->rem6_bits, i) {
++ fmp->remaddr6[i].bitfield &= mptcp_local->loc6_bits;
++ fmp->remaddr6[i].retry_bitfield &= mptcp_local->loc6_bits;
++ }
++}
++
++static int mptcp_find_address(const struct mptcp_loc_addr *mptcp_local,
++ sa_family_t family, const union inet_addr *addr)
++{
++ int i;
++ u8 loc_bits;
++ bool found = false;
++
++ if (family == AF_INET)
++ loc_bits = mptcp_local->loc4_bits;
++ else
++ loc_bits = mptcp_local->loc6_bits;
++
++ mptcp_for_each_bit_set(loc_bits, i) {
++ if (family == AF_INET &&
++ mptcp_local->locaddr4[i].addr.s_addr == addr->in.s_addr) {
++ found = true;
++ break;
++ }
++ if (family == AF_INET6 &&
++ ipv6_addr_equal(&mptcp_local->locaddr6[i].addr,
++ &addr->in6)) {
++ found = true;
++ break;
++ }
++ }
++
++ if (!found)
++ return -1;
++
++ return i;
++}
++
++static void mptcp_address_worker(struct work_struct *work)
++{
++ const struct delayed_work *delayed_work = container_of(work,
++ struct delayed_work,
++ work);
++ struct mptcp_fm_ns *fm_ns = container_of(delayed_work,
++ struct mptcp_fm_ns,
++ address_worker);
++ struct net *net = fm_ns->net;
++ struct mptcp_addr_event *event = NULL;
++ struct mptcp_loc_addr *mptcp_local, *old;
++ int i, id = -1; /* id is used in the socket-code on a delete-event */
++ bool success; /* Used to indicate if we succeeded handling the event */
++
++next_event:
++ success = false;
++ kfree(event);
++
++ /* First, let's dequeue an event from our event-list */
++ rcu_read_lock_bh();
++ spin_lock(&fm_ns->local_lock);
++
++ event = list_first_entry_or_null(&fm_ns->events,
++ struct mptcp_addr_event, list);
++ if (!event) {
++ spin_unlock(&fm_ns->local_lock);
++ rcu_read_unlock_bh();
++ return;
++ }
++
++ list_del(&event->list);
++
++ mptcp_local = rcu_dereference_bh(fm_ns->local);
++
++ if (event->code == MPTCP_EVENT_DEL) {
++ id = mptcp_find_address(mptcp_local, event->family, &event->addr);
++
++ /* Not in the list - so we don't care */
++ if (id < 0) {
++ mptcp_debug("%s could not find id\n", __func__);
++ goto duno;
++ }
++
++ old = mptcp_local;
++ mptcp_local = kmemdup(mptcp_local, sizeof(*mptcp_local),
++ GFP_ATOMIC);
++ if (!mptcp_local)
++ goto duno;
++
++ if (event->family == AF_INET)
++ mptcp_local->loc4_bits &= ~(1 << id);
++ else
++ mptcp_local->loc6_bits &= ~(1 << id);
++
++ rcu_assign_pointer(fm_ns->local, mptcp_local);
++ kfree(old);
++ } else {
++ int i = mptcp_find_address(mptcp_local, event->family, &event->addr);
++ int j = i;
++
++ if (j < 0) {
++ /* Not in the list, so we have to find an empty slot */
++ if (event->family == AF_INET)
++ i = __mptcp_find_free_index(mptcp_local->loc4_bits,
++ mptcp_local->next_v4_index);
++ if (event->family == AF_INET6)
++ i = __mptcp_find_free_index(mptcp_local->loc6_bits,
++ mptcp_local->next_v6_index);
++
++ if (i < 0) {
++ mptcp_debug("%s no more space\n", __func__);
++ goto duno;
++ }
++
++ /* It might have been a MOD-event. */
++ event->code = MPTCP_EVENT_ADD;
++ } else {
++ /* Let's check if anything changes */
++ if (event->family == AF_INET &&
++ event->low_prio == mptcp_local->locaddr4[i].low_prio)
++ goto duno;
++
++ if (event->family == AF_INET6 &&
++ event->low_prio == mptcp_local->locaddr6[i].low_prio)
++ goto duno;
++ }
++
++ old = mptcp_local;
++ mptcp_local = kmemdup(mptcp_local, sizeof(*mptcp_local),
++ GFP_ATOMIC);
++ if (!mptcp_local)
++ goto duno;
++
++ if (event->family == AF_INET) {
++ mptcp_local->locaddr4[i].addr.s_addr = event->addr.in.s_addr;
++ mptcp_local->locaddr4[i].loc4_id = i + 1;
++ mptcp_local->locaddr4[i].low_prio = event->low_prio;
++ } else {
++ mptcp_local->locaddr6[i].addr = event->addr.in6;
++ mptcp_local->locaddr6[i].loc6_id = i + MPTCP_MAX_ADDR;
++ mptcp_local->locaddr6[i].low_prio = event->low_prio;
++ }
++
++ if (j < 0) {
++ if (event->family == AF_INET) {
++ mptcp_local->loc4_bits |= (1 << i);
++ mptcp_local->next_v4_index = i + 1;
++ } else {
++ mptcp_local->loc6_bits |= (1 << i);
++ mptcp_local->next_v6_index = i + 1;
++ }
++ }
++
++ rcu_assign_pointer(fm_ns->local, mptcp_local);
++ kfree(old);
++ }
++ success = true;
++
++duno:
++ spin_unlock(&fm_ns->local_lock);
++ rcu_read_unlock_bh();
++
++ if (!success)
++ goto next_event;
++
++ /* Now we iterate over the MPTCP-sockets and apply the event. */
++ for (i = 0; i < MPTCP_HASH_SIZE; i++) {
++ const struct hlist_nulls_node *node;
++ struct tcp_sock *meta_tp;
++
++ rcu_read_lock_bh();
++ hlist_nulls_for_each_entry_rcu(meta_tp, node, &tk_hashtable[i],
++ tk_table) {
++ struct mptcp_cb *mpcb = meta_tp->mpcb;
++ struct sock *meta_sk = (struct sock *)meta_tp, *sk;
++ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
++ bool meta_v4 = meta_sk->sk_family == AF_INET;
++
++ if (sock_net(meta_sk) != net)
++ continue;
++
++ if (meta_v4) {
++ /* skip IPv6 events if meta is IPv4 */
++ if (event->family == AF_INET6)
++ continue;
++ }
++ /* skip IPv4 events if IPV6_V6ONLY is set */
++ else if (event->family == AF_INET &&
++ inet6_sk(meta_sk)->ipv6only)
++ continue;
++
++ if (unlikely(!atomic_inc_not_zero(&meta_sk->sk_refcnt)))
++ continue;
++
++ bh_lock_sock(meta_sk);
++
++ if (!mptcp(meta_tp) || !is_meta_sk(meta_sk) ||
++ mpcb->infinite_mapping_snd ||
++ mpcb->infinite_mapping_rcv ||
++ mpcb->send_infinite_mapping)
++ goto next;
++
++ /* The path-manager may have changed in the meantime */
++ if (mpcb->pm_ops != &full_mesh)
++ goto next;
++
++ if (sock_owned_by_user(meta_sk)) {
++ if (!test_and_set_bit(MPTCP_PATH_MANAGER,
++ &meta_tp->tsq_flags))
++ sock_hold(meta_sk);
++
++ goto next;
++ }
++
++ if (event->code == MPTCP_EVENT_ADD) {
++ fmp->add_addr++;
++ mpcb->addr_signal = 1;
++
++ sk = mptcp_select_ack_sock(meta_sk);
++ if (sk)
++ tcp_send_ack(sk);
++
++ full_mesh_create_subflows(meta_sk);
++ }
++
++ if (event->code == MPTCP_EVENT_DEL) {
++ struct sock *sk, *tmpsk;
++ struct mptcp_loc_addr *mptcp_local;
++ bool found = false;
++
++ mptcp_local = rcu_dereference_bh(fm_ns->local);
++
++ /* In any case, we need to update our bitfields */
++ if (id >= 0)
++ update_addr_bitfields(meta_sk, mptcp_local);
++
++ /* Look for the socket and remove it */
++ mptcp_for_each_sk_safe(mpcb, sk, tmpsk) {
++ if ((event->family == AF_INET6 &&
++ (sk->sk_family == AF_INET ||
++ mptcp_v6_is_v4_mapped(sk))) ||
++ (event->family == AF_INET &&
++ (sk->sk_family == AF_INET6 &&
++ !mptcp_v6_is_v4_mapped(sk))))
++ continue;
++
++ if (event->family == AF_INET &&
++ (sk->sk_family == AF_INET ||
++ mptcp_v6_is_v4_mapped(sk)) &&
++ inet_sk(sk)->inet_saddr != event->addr.in.s_addr)
++ continue;
++
++ if (event->family == AF_INET6 &&
++ sk->sk_family == AF_INET6 &&
++ !ipv6_addr_equal(&inet6_sk(sk)->saddr, &event->addr.in6))
++ continue;
++
++ /* Reinject, so that pf = 1 and so we
++ * won't select this one as the
++ * ack-sock.
++ */
++ mptcp_reinject_data(sk, 0);
++
++ /* We announce the removal of this id */
++ announce_remove_addr(tcp_sk(sk)->mptcp->loc_id, meta_sk);
++
++ mptcp_sub_force_close(sk);
++ found = true;
++ }
++
++ if (found)
++ goto next;
++
++ /* The id may have been given by the event,
++ * matching on a local address, and may not
++ * have matched any of the above sockets
++ * because the client never created a subflow.
++ * So, we still have to remove it here.
++ */
++ if (id > 0)
++ announce_remove_addr(id, meta_sk);
++ }
++
++ if (event->code == MPTCP_EVENT_MOD) {
++ struct sock *sk;
++
++ mptcp_for_each_sk(mpcb, sk) {
++ struct tcp_sock *tp = tcp_sk(sk);
++ if (event->family == AF_INET &&
++ (sk->sk_family == AF_INET ||
++ mptcp_v6_is_v4_mapped(sk)) &&
++ inet_sk(sk)->inet_saddr == event->addr.in.s_addr) {
++ if (event->low_prio != tp->mptcp->low_prio) {
++ tp->mptcp->send_mp_prio = 1;
++ tp->mptcp->low_prio = event->low_prio;
++
++ tcp_send_ack(sk);
++ }
++ }
++
++ if (event->family == AF_INET6 &&
++ sk->sk_family == AF_INET6 &&
++ !ipv6_addr_equal(&inet6_sk(sk)->saddr, &event->addr.in6)) {
++ if (event->low_prio != tp->mptcp->low_prio) {
++ tp->mptcp->send_mp_prio = 1;
++ tp->mptcp->low_prio = event->low_prio;
++
++ tcp_send_ack(sk);
++ }
++ }
++ }
++ }
++next:
++ bh_unlock_sock(meta_sk);
++ sock_put(meta_sk);
++ }
++ rcu_read_unlock_bh();
++ }
++ goto next_event;
++}
++
++static struct mptcp_addr_event *lookup_similar_event(const struct net *net,
++ const struct mptcp_addr_event *event)
++{
++ struct mptcp_addr_event *eventq;
++ struct mptcp_fm_ns *fm_ns = fm_get_ns(net);
++
++ list_for_each_entry(eventq, &fm_ns->events, list) {
++ if (eventq->family != event->family)
++ continue;
++ if (event->family == AF_INET) {
++ if (eventq->addr.in.s_addr == event->addr.in.s_addr)
++ return eventq;
++ } else {
++ if (ipv6_addr_equal(&eventq->addr.in6, &event->addr.in6))
++ return eventq;
++ }
++ }
++ return NULL;
++}
++
++/* We already hold the net-namespace MPTCP-lock */
++static void add_pm_event(struct net *net, const struct mptcp_addr_event *event)
++{
++ struct mptcp_addr_event *eventq = lookup_similar_event(net, event);
++ struct mptcp_fm_ns *fm_ns = fm_get_ns(net);
++
++ if (eventq) {
++ switch (event->code) {
++ case MPTCP_EVENT_DEL:
++ mptcp_debug("%s del old_code %u\n", __func__, eventq->code);
++ list_del(&eventq->list);
++ kfree(eventq);
++ break;
++ case MPTCP_EVENT_ADD:
++ mptcp_debug("%s add old_code %u\n", __func__, eventq->code);
++ eventq->low_prio = event->low_prio;
++ eventq->code = MPTCP_EVENT_ADD;
++ return;
++ case MPTCP_EVENT_MOD:
++ mptcp_debug("%s mod old_code %u\n", __func__, eventq->code);
++ eventq->low_prio = event->low_prio;
++ eventq->code = MPTCP_EVENT_MOD;
++ return;
++ }
++ }
++
++ /* OK, we have to add the new address to the wait queue */
++ eventq = kmemdup(event, sizeof(struct mptcp_addr_event), GFP_ATOMIC);
++ if (!eventq)
++ return;
++
++ list_add_tail(&eventq->list, &fm_ns->events);
++
++ /* Queue the address-worker, unless it is already pending */
++ if (!delayed_work_pending(&fm_ns->address_worker))
++ queue_delayed_work(mptcp_wq, &fm_ns->address_worker,
++ msecs_to_jiffies(500));
++}
++
++static void addr4_event_handler(const struct in_ifaddr *ifa, unsigned long event,
++ struct net *net)
++{
++ const struct net_device *netdev = ifa->ifa_dev->dev;
++ struct mptcp_fm_ns *fm_ns = fm_get_ns(net);
++ struct mptcp_addr_event mpevent;
++
++ if (ifa->ifa_scope > RT_SCOPE_LINK ||
++ ipv4_is_loopback(ifa->ifa_local))
++ return;
++
++ spin_lock_bh(&fm_ns->local_lock);
++
++ mpevent.family = AF_INET;
++ mpevent.addr.in.s_addr = ifa->ifa_local;
++ mpevent.low_prio = (netdev->flags & IFF_MPBACKUP) ? 1 : 0;
++
++ if (event == NETDEV_DOWN || !netif_running(netdev) ||
++ (netdev->flags & IFF_NOMULTIPATH) || !(netdev->flags & IFF_UP))
++ mpevent.code = MPTCP_EVENT_DEL;
++ else if (event == NETDEV_UP)
++ mpevent.code = MPTCP_EVENT_ADD;
++ else if (event == NETDEV_CHANGE)
++ mpevent.code = MPTCP_EVENT_MOD;
++
++ mptcp_debug("%s created event for %pI4, code %u prio %u\n", __func__,
++ &ifa->ifa_local, mpevent.code, mpevent.low_prio);
++ add_pm_event(net, &mpevent);
++
++ spin_unlock_bh(&fm_ns->local_lock);
++ return;
++}
++
++/* React on IPv4-addr add/rem-events */
++static int mptcp_pm_inetaddr_event(struct notifier_block *this,
++ unsigned long event, void *ptr)
++{
++ const struct in_ifaddr *ifa = (struct in_ifaddr *)ptr;
++ struct net *net = dev_net(ifa->ifa_dev->dev);
++
++ if (!(event == NETDEV_UP || event == NETDEV_DOWN ||
++ event == NETDEV_CHANGE))
++ return NOTIFY_DONE;
++
++ addr4_event_handler(ifa, event, net);
++
++ return NOTIFY_DONE;
++}
++
++static struct notifier_block mptcp_pm_inetaddr_notifier = {
++ .notifier_call = mptcp_pm_inetaddr_event,
++};
++
++#if IS_ENABLED(CONFIG_IPV6)
++
++/* IPV6-related address/interface watchers */
++struct mptcp_dad_data {
++ struct timer_list timer;
++ struct inet6_ifaddr *ifa;
++};
++
++static void dad_callback(unsigned long arg);
++static int inet6_addr_event(struct notifier_block *this,
++ unsigned long event, void *ptr);
++
++static int ipv6_is_in_dad_state(const struct inet6_ifaddr *ifa)
++{
++ return (ifa->flags & IFA_F_TENTATIVE) &&
++ ifa->state == INET6_IFADDR_STATE_DAD;
++}
++
++static void dad_init_timer(struct mptcp_dad_data *data,
++ struct inet6_ifaddr *ifa)
++{
++ data->ifa = ifa;
++ data->timer.data = (unsigned long)data;
++ data->timer.function = dad_callback;
++ if (ifa->idev->cnf.rtr_solicit_delay)
++ data->timer.expires = jiffies + ifa->idev->cnf.rtr_solicit_delay;
++ else
++ data->timer.expires = jiffies + (HZ/10);
++}
++
++static void dad_callback(unsigned long arg)
++{
++ struct mptcp_dad_data *data = (struct mptcp_dad_data *)arg;
++
++ if (ipv6_is_in_dad_state(data->ifa)) {
++ dad_init_timer(data, data->ifa);
++ add_timer(&data->timer);
++ } else {
++ inet6_addr_event(NULL, NETDEV_UP, data->ifa);
++ in6_ifa_put(data->ifa);
++ kfree(data);
++ }
++}
++
++static inline void dad_setup_timer(struct inet6_ifaddr *ifa)
++{
++ struct mptcp_dad_data *data;
++
++ data = kmalloc(sizeof(*data), GFP_ATOMIC);
++
++ if (!data)
++ return;
++
++ init_timer(&data->timer);
++ dad_init_timer(data, ifa);
++ add_timer(&data->timer);
++ in6_ifa_hold(ifa);
++}
++
++static void addr6_event_handler(const struct inet6_ifaddr *ifa, unsigned long event,
++ struct net *net)
++{
++ const struct net_device *netdev = ifa->idev->dev;
++ int addr_type = ipv6_addr_type(&ifa->addr);
++ struct mptcp_fm_ns *fm_ns = fm_get_ns(net);
++ struct mptcp_addr_event mpevent;
++
++ if (ifa->scope > RT_SCOPE_LINK ||
++ addr_type == IPV6_ADDR_ANY ||
++ (addr_type & IPV6_ADDR_LOOPBACK) ||
++ (addr_type & IPV6_ADDR_LINKLOCAL))
++ return;
++
++ spin_lock_bh(&fm_ns->local_lock);
++
++ mpevent.family = AF_INET6;
++ mpevent.addr.in6 = ifa->addr;
++ mpevent.low_prio = (netdev->flags & IFF_MPBACKUP) ? 1 : 0;
++
++ if (event == NETDEV_DOWN || !netif_running(netdev) ||
++ (netdev->flags & IFF_NOMULTIPATH) || !(netdev->flags & IFF_UP))
++ mpevent.code = MPTCP_EVENT_DEL;
++ else if (event == NETDEV_UP)
++ mpevent.code = MPTCP_EVENT_ADD;
++ else if (event == NETDEV_CHANGE)
++ mpevent.code = MPTCP_EVENT_MOD;
++
++ mptcp_debug("%s created event for %pI6, code %u prio %u\n", __func__,
++ &ifa->addr, mpevent.code, mpevent.low_prio);
++ add_pm_event(net, &mpevent);
++
++ spin_unlock_bh(&fm_ns->local_lock);
++ return;
++}
++
++/* React on IPv6-addr add/rem-events */
++static int inet6_addr_event(struct notifier_block *this, unsigned long event,
++ void *ptr)
++{
++ struct inet6_ifaddr *ifa6 = (struct inet6_ifaddr *)ptr;
++ struct net *net = dev_net(ifa6->idev->dev);
++
++ if (!(event == NETDEV_UP || event == NETDEV_DOWN ||
++ event == NETDEV_CHANGE))
++ return NOTIFY_DONE;
++
++ if (ipv6_is_in_dad_state(ifa6))
++ dad_setup_timer(ifa6);
++ else
++ addr6_event_handler(ifa6, event, net);
++
++ return NOTIFY_DONE;
++}
++
++static struct notifier_block inet6_addr_notifier = {
++ .notifier_call = inet6_addr_event,
++};
++
++#endif
++
++/* React on ifup/down-events */
++static int netdev_event(struct notifier_block *this, unsigned long event,
++ void *ptr)
++{
++ const struct net_device *dev = netdev_notifier_info_to_dev(ptr);
++ struct in_device *in_dev;
++#if IS_ENABLED(CONFIG_IPV6)
++ struct inet6_dev *in6_dev;
++#endif
++
++ if (!(event == NETDEV_UP || event == NETDEV_DOWN ||
++ event == NETDEV_CHANGE))
++ return NOTIFY_DONE;
++
++ rcu_read_lock();
++ in_dev = __in_dev_get_rtnl(dev);
++
++ if (in_dev) {
++ for_ifa(in_dev) {
++ mptcp_pm_inetaddr_event(NULL, event, ifa);
++ } endfor_ifa(in_dev);
++ }
++
++#if IS_ENABLED(CONFIG_IPV6)
++ in6_dev = __in6_dev_get(dev);
++
++ if (in6_dev) {
++ struct inet6_ifaddr *ifa6;
++ list_for_each_entry(ifa6, &in6_dev->addr_list, if_list)
++ inet6_addr_event(NULL, event, ifa6);
++ }
++#endif
++
++ rcu_read_unlock();
++ return NOTIFY_DONE;
++}
++
++static struct notifier_block mptcp_pm_netdev_notifier = {
++ .notifier_call = netdev_event,
++};
++
++static void full_mesh_add_raddr(struct mptcp_cb *mpcb,
++ const union inet_addr *addr,
++ sa_family_t family, __be16 port, u8 id)
++{
++ if (family == AF_INET)
++ mptcp_addv4_raddr(mpcb, &addr->in, port, id);
++ else
++ mptcp_addv6_raddr(mpcb, &addr->in6, port, id);
++}
++
++static void full_mesh_new_session(const struct sock *meta_sk)
++{
++ struct mptcp_loc_addr *mptcp_local;
++ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
++ const struct mptcp_fm_ns *fm_ns = fm_get_ns(sock_net(meta_sk));
++ int i, index;
++ union inet_addr saddr, daddr;
++ sa_family_t family;
++ bool meta_v4 = meta_sk->sk_family == AF_INET;
++
++ /* Init local variables necessary for the rest */
++ if (meta_sk->sk_family == AF_INET || mptcp_v6_is_v4_mapped(meta_sk)) {
++ saddr.ip = inet_sk(meta_sk)->inet_saddr;
++ daddr.ip = inet_sk(meta_sk)->inet_daddr;
++ family = AF_INET;
++#if IS_ENABLED(CONFIG_IPV6)
++ } else {
++ saddr.in6 = inet6_sk(meta_sk)->saddr;
++ daddr.in6 = meta_sk->sk_v6_daddr;
++ family = AF_INET6;
++#endif
++ }
++
++ rcu_read_lock();
++ mptcp_local = rcu_dereference(fm_ns->local);
++
++ index = mptcp_find_address(mptcp_local, family, &saddr);
++ if (index < 0)
++ goto fallback;
++
++ full_mesh_add_raddr(mpcb, &daddr, family, 0, 0);
++ mptcp_set_init_addr_bit(mpcb, &daddr, family, index);
++
++ /* Initialize workqueue-struct */
++ INIT_WORK(&fmp->subflow_work, create_subflow_worker);
++ INIT_DELAYED_WORK(&fmp->subflow_retry_work, retry_subflow_worker);
++ fmp->mpcb = mpcb;
++
++ if (!meta_v4 && inet6_sk(meta_sk)->ipv6only)
++ goto skip_ipv4;
++
++ /* Look for the address among the local addresses */
++ mptcp_for_each_bit_set(mptcp_local->loc4_bits, i) {
++ __be32 ifa_address = mptcp_local->locaddr4[i].addr.s_addr;
++
++ /* We do not need to announce the initial subflow's address again */
++ if (family == AF_INET && saddr.ip == ifa_address)
++ continue;
++
++ fmp->add_addr++;
++ mpcb->addr_signal = 1;
++ }
++
++skip_ipv4:
++#if IS_ENABLED(CONFIG_IPV6)
++ /* skip IPv6 addresses if meta-socket is IPv4 */
++ if (meta_v4)
++ goto skip_ipv6;
++
++ mptcp_for_each_bit_set(mptcp_local->loc6_bits, i) {
++ const struct in6_addr *ifa6 = &mptcp_local->locaddr6[i].addr;
++
++ /* We do not need to announce the initial subflow's address again */
++ if (family == AF_INET6 && ipv6_addr_equal(&saddr.in6, ifa6))
++ continue;
++
++ fmp->add_addr++;
++ mpcb->addr_signal = 1;
++ }
++
++skip_ipv6:
++#endif
++
++ rcu_read_unlock();
++
++ if (family == AF_INET)
++ fmp->announced_addrs_v4 |= (1 << index);
++ else
++ fmp->announced_addrs_v6 |= (1 << index);
++
++ for (i = fmp->add_addr; i && fmp->add_addr; i--)
++ tcp_send_ack(mpcb->master_sk);
++
++ return;
++
++fallback:
++ rcu_read_unlock();
++ mptcp_fallback_default(mpcb);
++ return;
++}
++
++static void full_mesh_create_subflows(struct sock *meta_sk)
++{
++ const struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
++
++ if (mpcb->infinite_mapping_snd || mpcb->infinite_mapping_rcv ||
++ mpcb->send_infinite_mapping ||
++ mpcb->server_side || sock_flag(meta_sk, SOCK_DEAD))
++ return;
++
++ if (mpcb->master_sk &&
++ !tcp_sk(mpcb->master_sk)->mptcp->fully_established)
++ return;
++
++ if (!work_pending(&fmp->subflow_work)) {
++ sock_hold(meta_sk);
++ queue_work(mptcp_wq, &fmp->subflow_work);
++ }
++}
++
++/* Called upon release_sock, if the socket was owned by the user during
++ * a path-management event.
++ */
++static void full_mesh_release_sock(struct sock *meta_sk)
++{
++ struct mptcp_loc_addr *mptcp_local;
++ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
++ const struct mptcp_fm_ns *fm_ns = fm_get_ns(sock_net(meta_sk));
++ struct sock *sk, *tmpsk;
++ bool meta_v4 = meta_sk->sk_family == AF_INET;
++ int i;
++
++ rcu_read_lock();
++ mptcp_local = rcu_dereference(fm_ns->local);
++
++ if (!meta_v4 && inet6_sk(meta_sk)->ipv6only)
++ goto skip_ipv4;
++
++ /* First, detect modifications or additions */
++ mptcp_for_each_bit_set(mptcp_local->loc4_bits, i) {
++ struct in_addr ifa = mptcp_local->locaddr4[i].addr;
++ bool found = false;
++
++ mptcp_for_each_sk(mpcb, sk) {
++ struct tcp_sock *tp = tcp_sk(sk);
++
++ if (sk->sk_family == AF_INET6 &&
++ !mptcp_v6_is_v4_mapped(sk))
++ continue;
++
++ if (inet_sk(sk)->inet_saddr != ifa.s_addr)
++ continue;
++
++ found = true;
++
++ if (mptcp_local->locaddr4[i].low_prio != tp->mptcp->low_prio) {
++ tp->mptcp->send_mp_prio = 1;
++ tp->mptcp->low_prio = mptcp_local->locaddr4[i].low_prio;
++
++ tcp_send_ack(sk);
++ }
++ }
++
++ if (!found) {
++ fmp->add_addr++;
++ mpcb->addr_signal = 1;
++
++ sk = mptcp_select_ack_sock(meta_sk);
++ if (sk)
++ tcp_send_ack(sk);
++ full_mesh_create_subflows(meta_sk);
++ }
++ }
++
++skip_ipv4:
++#if IS_ENABLED(CONFIG_IPV6)
++ /* skip IPv6 addresses if meta-socket is IPv4 */
++ if (meta_v4)
++ goto removal;
++
++ mptcp_for_each_bit_set(mptcp_local->loc6_bits, i) {
++ struct in6_addr ifa = mptcp_local->locaddr6[i].addr;
++ bool found = false;
++
++ mptcp_for_each_sk(mpcb, sk) {
++ struct tcp_sock *tp = tcp_sk(sk);
++
++ if (sk->sk_family == AF_INET ||
++ mptcp_v6_is_v4_mapped(sk))
++ continue;
++
++ if (!ipv6_addr_equal(&inet6_sk(sk)->saddr, &ifa))
++ continue;
++
++ found = true;
++
++ if (mptcp_local->locaddr6[i].low_prio != tp->mptcp->low_prio) {
++ tp->mptcp->send_mp_prio = 1;
++ tp->mptcp->low_prio = mptcp_local->locaddr6[i].low_prio;
++
++ tcp_send_ack(sk);
++ }
++ }
++
++ if (!found) {
++ fmp->add_addr++;
++ mpcb->addr_signal = 1;
++
++ sk = mptcp_select_ack_sock(meta_sk);
++ if (sk)
++ tcp_send_ack(sk);
++ full_mesh_create_subflows(meta_sk);
++ }
++ }
++
++removal:
++#endif
++
++ /* Now, detect address-removals */
++ mptcp_for_each_sk_safe(mpcb, sk, tmpsk) {
++ bool shall_remove = true;
++
++ if (sk->sk_family == AF_INET || mptcp_v6_is_v4_mapped(sk)) {
++ mptcp_for_each_bit_set(mptcp_local->loc4_bits, i) {
++ if (inet_sk(sk)->inet_saddr == mptcp_local->locaddr4[i].addr.s_addr) {
++ shall_remove = false;
++ break;
++ }
++ }
++ } else {
++ mptcp_for_each_bit_set(mptcp_local->loc6_bits, i) {
++ if (ipv6_addr_equal(&inet6_sk(sk)->saddr, &mptcp_local->locaddr6[i].addr)) {
++ shall_remove = false;
++ break;
++ }
++ }
++ }
++
++ if (shall_remove) {
++ /* Reinject, so that pf = 1 and so we
++ * won't select this one as the
++ * ack-sock.
++ */
++ mptcp_reinject_data(sk, 0);
++
++ announce_remove_addr(tcp_sk(sk)->mptcp->loc_id,
++ meta_sk);
++
++ mptcp_sub_force_close(sk);
++ }
++ }
++
++ /* Just call it optimistically. It actually cannot do any harm */
++ update_addr_bitfields(meta_sk, mptcp_local);
++
++ rcu_read_unlock();
++}
++
++static int full_mesh_get_local_id(sa_family_t family, union inet_addr *addr,
++ struct net *net, bool *low_prio)
++{
++ struct mptcp_loc_addr *mptcp_local;
++ const struct mptcp_fm_ns *fm_ns = fm_get_ns(net);
++ int index, id = -1;
++
++ /* Handle the backup-flows */
++ rcu_read_lock();
++ mptcp_local = rcu_dereference(fm_ns->local);
++
++ index = mptcp_find_address(mptcp_local, family, addr);
++
++ if (index != -1) {
++ if (family == AF_INET) {
++ id = mptcp_local->locaddr4[index].loc4_id;
++ *low_prio = mptcp_local->locaddr4[index].low_prio;
++ } else {
++ id = mptcp_local->locaddr6[index].loc6_id;
++ *low_prio = mptcp_local->locaddr6[index].low_prio;
++ }
++ }
++
++
++ rcu_read_unlock();
++
++ return id;
++}
++
++static void full_mesh_addr_signal(struct sock *sk, unsigned *size,
++ struct tcp_out_options *opts,
++ struct sk_buff *skb)
++{
++ const struct tcp_sock *tp = tcp_sk(sk);
++ struct mptcp_cb *mpcb = tp->mpcb;
++ struct sock *meta_sk = mpcb->meta_sk;
++ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
++ struct mptcp_loc_addr *mptcp_local;
++ struct mptcp_fm_ns *fm_ns = fm_get_ns(sock_net(sk));
++ int remove_addr_len;
++ u8 unannouncedv4 = 0, unannouncedv6 = 0;
++ bool meta_v4 = meta_sk->sk_family == AF_INET;
++
++ mpcb->addr_signal = 0;
++
++ if (likely(!fmp->add_addr))
++ goto remove_addr;
++
++ rcu_read_lock();
++ mptcp_local = rcu_dereference(fm_ns->local);
++
++ if (!meta_v4 && inet6_sk(meta_sk)->ipv6only)
++ goto skip_ipv4;
++
++ /* IPv4 */
++ unannouncedv4 = (~fmp->announced_addrs_v4) & mptcp_local->loc4_bits;
++ if (unannouncedv4 &&
++ MAX_TCP_OPTION_SPACE - *size >= MPTCP_SUB_LEN_ADD_ADDR4_ALIGN) {
++ int ind = mptcp_find_free_index(~unannouncedv4);
++
++ opts->options |= OPTION_MPTCP;
++ opts->mptcp_options |= OPTION_ADD_ADDR;
++ opts->add_addr4.addr_id = mptcp_local->locaddr4[ind].loc4_id;
++ opts->add_addr4.addr = mptcp_local->locaddr4[ind].addr;
++ opts->add_addr_v4 = 1;
++
++ if (skb) {
++ fmp->announced_addrs_v4 |= (1 << ind);
++ fmp->add_addr--;
++ }
++ *size += MPTCP_SUB_LEN_ADD_ADDR4_ALIGN;
++ }
++
++ if (meta_v4)
++ goto skip_ipv6;
++
++skip_ipv4:
++ /* IPv6 */
++ unannouncedv6 = (~fmp->announced_addrs_v6) & mptcp_local->loc6_bits;
++ if (unannouncedv6 &&
++ MAX_TCP_OPTION_SPACE - *size >= MPTCP_SUB_LEN_ADD_ADDR6_ALIGN) {
++ int ind = mptcp_find_free_index(~unannouncedv6);
++
++ opts->options |= OPTION_MPTCP;
++ opts->mptcp_options |= OPTION_ADD_ADDR;
++ opts->add_addr6.addr_id = mptcp_local->locaddr6[ind].loc6_id;
++ opts->add_addr6.addr = mptcp_local->locaddr6[ind].addr;
++ opts->add_addr_v6 = 1;
++
++ if (skb) {
++ fmp->announced_addrs_v6 |= (1 << ind);
++ fmp->add_addr--;
++ }
++ *size += MPTCP_SUB_LEN_ADD_ADDR6_ALIGN;
++ }
++
++skip_ipv6:
++ rcu_read_unlock();
++
++ if (!unannouncedv4 && !unannouncedv6 && skb)
++ fmp->add_addr--;
++
++remove_addr:
++ if (likely(!fmp->remove_addrs))
++ goto exit;
++
++ remove_addr_len = mptcp_sub_len_remove_addr_align(fmp->remove_addrs);
++ if (MAX_TCP_OPTION_SPACE - *size < remove_addr_len)
++ goto exit;
++
++ opts->options |= OPTION_MPTCP;
++ opts->mptcp_options |= OPTION_REMOVE_ADDR;
++ opts->remove_addrs = fmp->remove_addrs;
++ *size += remove_addr_len;
++ if (skb)
++ fmp->remove_addrs = 0;
++
++exit:
++ mpcb->addr_signal = !!(fmp->add_addr || fmp->remove_addrs);
++}
++
++static void full_mesh_rem_raddr(struct mptcp_cb *mpcb, u8 rem_id)
++{
++ mptcp_v4_rem_raddress(mpcb, rem_id);
++ mptcp_v6_rem_raddress(mpcb, rem_id);
++}
++
++/* Output /proc/net/mptcp_fullmesh */
++static int mptcp_fm_seq_show(struct seq_file *seq, void *v)
++{
++ const struct net *net = seq->private;
++ struct mptcp_loc_addr *mptcp_local;
++ const struct mptcp_fm_ns *fm_ns = fm_get_ns(net);
++ int i;
++
++ seq_printf(seq, "Index, Address-ID, Backup, IP-address\n");
++
++ rcu_read_lock_bh();
++ mptcp_local = rcu_dereference(fm_ns->local);
++
++ seq_printf(seq, "IPv4, next v4-index: %u\n", mptcp_local->next_v4_index);
++
++ mptcp_for_each_bit_set(mptcp_local->loc4_bits, i) {
++ struct mptcp_loc4 *loc4 = &mptcp_local->locaddr4[i];
++
++ seq_printf(seq, "%u, %u, %u, %pI4\n", i, loc4->loc4_id,
++ loc4->low_prio, &loc4->addr);
++ }
++
++ seq_printf(seq, "IPv6, next v6-index: %u\n", mptcp_local->next_v6_index);
++
++ mptcp_for_each_bit_set(mptcp_local->loc6_bits, i) {
++ struct mptcp_loc6 *loc6 = &mptcp_local->locaddr6[i];
++
++ seq_printf(seq, "%u, %u, %u, %pI6\n", i, loc6->loc6_id,
++ loc6->low_prio, &loc6->addr);
++ }
++ rcu_read_unlock_bh();
++
++ return 0;
++}
++
++static int mptcp_fm_seq_open(struct inode *inode, struct file *file)
++{
++ return single_open_net(inode, file, mptcp_fm_seq_show);
++}
++
++static const struct file_operations mptcp_fm_seq_fops = {
++ .owner = THIS_MODULE,
++ .open = mptcp_fm_seq_open,
++ .read = seq_read,
++ .llseek = seq_lseek,
++ .release = single_release_net,
++};
++
++static int mptcp_fm_init_net(struct net *net)
++{
++ struct mptcp_loc_addr *mptcp_local;
++ struct mptcp_fm_ns *fm_ns;
++ int err = 0;
++
++ fm_ns = kzalloc(sizeof(*fm_ns), GFP_KERNEL);
++ if (!fm_ns)
++ return -ENOBUFS;
++
++ mptcp_local = kzalloc(sizeof(*mptcp_local), GFP_KERNEL);
++ if (!mptcp_local) {
++ err = -ENOBUFS;
++ goto err_mptcp_local;
++ }
++
++ if (!proc_create("mptcp_fullmesh", S_IRUGO, net->proc_net,
++ &mptcp_fm_seq_fops)) {
++ err = -ENOMEM;
++ goto err_seq_fops;
++ }
++
++ mptcp_local->next_v4_index = 1;
++
++ rcu_assign_pointer(fm_ns->local, mptcp_local);
++ INIT_DELAYED_WORK(&fm_ns->address_worker, mptcp_address_worker);
++ INIT_LIST_HEAD(&fm_ns->events);
++ spin_lock_init(&fm_ns->local_lock);
++ fm_ns->net = net;
++ net->mptcp.path_managers[MPTCP_PM_FULLMESH] = fm_ns;
++
++ return 0;
++err_seq_fops:
++ kfree(mptcp_local);
++err_mptcp_local:
++ kfree(fm_ns);
++ return err;
++}
++
++static void mptcp_fm_exit_net(struct net *net)
++{
++ struct mptcp_addr_event *eventq, *tmp;
++ struct mptcp_fm_ns *fm_ns;
++ struct mptcp_loc_addr *mptcp_local;
++
++ fm_ns = fm_get_ns(net);
++ cancel_delayed_work_sync(&fm_ns->address_worker);
++
++ rcu_read_lock_bh();
++
++ mptcp_local = rcu_dereference_bh(fm_ns->local);
++ kfree(mptcp_local);
++
++ spin_lock(&fm_ns->local_lock);
++ list_for_each_entry_safe(eventq, tmp, &fm_ns->events, list) {
++ list_del(&eventq->list);
++ kfree(eventq);
++ }
++ spin_unlock(&fm_ns->local_lock);
++
++ rcu_read_unlock_bh();
++
++ remove_proc_entry("mptcp_fullmesh", net->proc_net);
++
++ kfree(fm_ns);
++}
++
++static struct pernet_operations full_mesh_net_ops = {
++ .init = mptcp_fm_init_net,
++ .exit = mptcp_fm_exit_net,
++};
++
++static struct mptcp_pm_ops full_mesh __read_mostly = {
++ .new_session = full_mesh_new_session,
++ .release_sock = full_mesh_release_sock,
++ .fully_established = full_mesh_create_subflows,
++ .new_remote_address = full_mesh_create_subflows,
++ .get_local_id = full_mesh_get_local_id,
++ .addr_signal = full_mesh_addr_signal,
++ .add_raddr = full_mesh_add_raddr,
++ .rem_raddr = full_mesh_rem_raddr,
++ .name = "fullmesh",
++ .owner = THIS_MODULE,
++};
++
++/* General initialization of MPTCP_PM */
++static int __init full_mesh_register(void)
++{
++ int ret;
++
++ BUILD_BUG_ON(sizeof(struct fullmesh_priv) > MPTCP_PM_SIZE);
++
++ ret = register_pernet_subsys(&full_mesh_net_ops);
++ if (ret)
++ goto out;
++
++ ret = register_inetaddr_notifier(&mptcp_pm_inetaddr_notifier);
++ if (ret)
++ goto err_reg_inetaddr;
++ ret = register_netdevice_notifier(&mptcp_pm_netdev_notifier);
++ if (ret)
++ goto err_reg_netdev;
++
++#if IS_ENABLED(CONFIG_IPV6)
++ ret = register_inet6addr_notifier(&inet6_addr_notifier);
++ if (ret)
++ goto err_reg_inet6addr;
++#endif
++
++ ret = mptcp_register_path_manager(&full_mesh);
++ if (ret)
++ goto err_reg_pm;
++
++out:
++ return ret;
++
++
++err_reg_pm:
++#if IS_ENABLED(CONFIG_IPV6)
++ unregister_inet6addr_notifier(&inet6_addr_notifier);
++err_reg_inet6addr:
++#endif
++ unregister_netdevice_notifier(&mptcp_pm_netdev_notifier);
++err_reg_netdev:
++ unregister_inetaddr_notifier(&mptcp_pm_inetaddr_notifier);
++err_reg_inetaddr:
++ unregister_pernet_subsys(&full_mesh_net_ops);
++ goto out;
++}
++
++static void full_mesh_unregister(void)
++{
++#if IS_ENABLED(CONFIG_IPV6)
++ unregister_inet6addr_notifier(&inet6_addr_notifier);
++#endif
++ unregister_netdevice_notifier(&mptcp_pm_netdev_notifier);
++ unregister_inetaddr_notifier(&mptcp_pm_inetaddr_notifier);
++ unregister_pernet_subsys(&full_mesh_net_ops);
++ mptcp_unregister_path_manager(&full_mesh);
++}
++
++module_init(full_mesh_register);
++module_exit(full_mesh_unregister);
++
++MODULE_AUTHOR("Christoph Paasch");
++MODULE_LICENSE("GPL");
++MODULE_DESCRIPTION("Full-Mesh MPTCP");
++MODULE_VERSION("0.88");
+diff --git a/net/mptcp/mptcp_input.c b/net/mptcp/mptcp_input.c
+new file mode 100644
+index 000000000000..43704ccb639e
+--- /dev/null
++++ b/net/mptcp/mptcp_input.c
+@@ -0,0 +1,2405 @@
++/*
++ * MPTCP implementation - Sending side
++ *
++ * Initial Design & Implementation:
++ * Sébastien Barré <sebastien.barre@uclouvain.be>
++ *
++ * Current Maintainer & Author:
++ * Christoph Paasch <christoph.paasch@uclouvain.be>
++ *
++ * Additional authors:
++ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
++ * Gregory Detal <gregory.detal@uclouvain.be>
++ * Fabien Duchêne <fabien.duchene@uclouvain.be>
++ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
++ * Lavkesh Lahngir <lavkesh51@gmail.com>
++ * Andreas Ripke <ripke@neclab.eu>
++ * Vlad Dogaru <vlad.dogaru@intel.com>
++ * Octavian Purdila <octavian.purdila@intel.com>
++ * John Ronan <jronan@tssg.org>
++ * Catalin Nicutar <catalin.nicutar@gmail.com>
++ * Brandon Heller <brandonh@stanford.edu>
++ *
++ *
++ * This program is free software; you can redistribute it and/or
++ * modify it under the terms of the GNU General Public License
++ * as published by the Free Software Foundation; either version
++ * 2 of the License, or (at your option) any later version.
++ */
++
++#include <asm/unaligned.h>
++
++#include <net/mptcp.h>
++#include <net/mptcp_v4.h>
++#include <net/mptcp_v6.h>
++
++#include <linux/kconfig.h>
++
++/* is seq1 < seq2 ? */
++static inline bool before64(const u64 seq1, const u64 seq2)
++{
++ return (s64)(seq1 - seq2) < 0;
++}
++
++/* is seq1 > seq2 ? */
++#define after64(seq1, seq2) before64(seq2, seq1)
++
++static inline void mptcp_become_fully_estab(struct sock *sk)
++{
++ tcp_sk(sk)->mptcp->fully_established = 1;
++
++ if (is_master_tp(tcp_sk(sk)) &&
++ tcp_sk(sk)->mpcb->pm_ops->fully_established)
++ tcp_sk(sk)->mpcb->pm_ops->fully_established(mptcp_meta_sk(sk));
++}
++
++/* Similar to tcp_tso_acked without any memory accounting */
++static inline int mptcp_tso_acked_reinject(const struct sock *meta_sk,
++ struct sk_buff *skb)
++{
++ const struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ u32 packets_acked, len;
++
++ BUG_ON(!after(TCP_SKB_CB(skb)->end_seq, meta_tp->snd_una));
++
++ packets_acked = tcp_skb_pcount(skb);
++
++ if (skb_unclone(skb, GFP_ATOMIC))
++ return 0;
++
++ len = meta_tp->snd_una - TCP_SKB_CB(skb)->seq;
++ __pskb_trim_head(skb, len);
++
++ TCP_SKB_CB(skb)->seq += len;
++ skb->ip_summed = CHECKSUM_PARTIAL;
++ skb->truesize -= len;
++
++ /* Any change of skb->len requires recalculation of tso factor. */
++ if (tcp_skb_pcount(skb) > 1)
++ tcp_set_skb_tso_segs(meta_sk, skb, tcp_skb_mss(skb));
++ packets_acked -= tcp_skb_pcount(skb);
++
++ if (packets_acked) {
++ BUG_ON(tcp_skb_pcount(skb) == 0);
++ BUG_ON(!before(TCP_SKB_CB(skb)->seq, TCP_SKB_CB(skb)->end_seq));
++ }
++
++ return packets_acked;
++}
++
++/**
++ * Cleans the meta-socket retransmission queue and the reinject-queue.
++ * @meta_sk must be the meta-socket.
++ */
++static void mptcp_clean_rtx_queue(struct sock *meta_sk, u32 prior_snd_una)
++{
++ struct sk_buff *skb, *tmp;
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct mptcp_cb *mpcb = meta_tp->mpcb;
++ bool acked = false;
++ u32 acked_pcount;
++
++ while ((skb = tcp_write_queue_head(meta_sk)) &&
++ skb != tcp_send_head(meta_sk)) {
++ bool fully_acked = true;
++
++ if (before(meta_tp->snd_una, TCP_SKB_CB(skb)->end_seq)) {
++ if (tcp_skb_pcount(skb) == 1 ||
++ !after(meta_tp->snd_una, TCP_SKB_CB(skb)->seq))
++ break;
++
++ acked_pcount = tcp_tso_acked(meta_sk, skb);
++ if (!acked_pcount)
++ break;
++
++ fully_acked = false;
++ } else {
++ acked_pcount = tcp_skb_pcount(skb);
++ }
++
++ acked = true;
++ meta_tp->packets_out -= acked_pcount;
++ meta_tp->retrans_stamp = 0;
++
++ if (!fully_acked)
++ break;
++
++ tcp_unlink_write_queue(skb, meta_sk);
++
++ if (mptcp_is_data_fin(skb)) {
++ struct sock *sk_it;
++
++ /* DATA_FIN has been acknowledged - now we can close
++ * the subflows
++ */
++ mptcp_for_each_sk(mpcb, sk_it) {
++ unsigned long delay = 0;
++
++ /* If we are the passive closer, don't trigger
++ * subflow-fin until the subflow has been finned
++ * by the peer - thus we add a delay.
++ */
++ if (mpcb->passive_close &&
++ sk_it->sk_state == TCP_ESTABLISHED)
++ delay = inet_csk(sk_it)->icsk_rto << 3;
++
++ mptcp_sub_close(sk_it, delay);
++ }
++ }
++ sk_wmem_free_skb(meta_sk, skb);
++ }
++ /* Remove acknowledged data from the reinject queue */
++ skb_queue_walk_safe(&mpcb->reinject_queue, skb, tmp) {
++ if (before(meta_tp->snd_una, TCP_SKB_CB(skb)->end_seq)) {
++ if (tcp_skb_pcount(skb) == 1 ||
++ !after(meta_tp->snd_una, TCP_SKB_CB(skb)->seq))
++ break;
++
++ mptcp_tso_acked_reinject(meta_sk, skb);
++ break;
++ }
++
++ __skb_unlink(skb, &mpcb->reinject_queue);
++ __kfree_skb(skb);
++ }
++
++ if (likely(between(meta_tp->snd_up, prior_snd_una, meta_tp->snd_una)))
++ meta_tp->snd_up = meta_tp->snd_una;
++
++ if (acked) {
++ tcp_rearm_rto(meta_sk);
++ /* Normally this is done in tcp_try_undo_loss - but MPTCP
++ * does not call this function.
++ */
++ inet_csk(meta_sk)->icsk_retransmits = 0;
++ }
++}
++
++/* Inspired by tcp_rcv_state_process */
++static int mptcp_rcv_state_process(struct sock *meta_sk, struct sock *sk,
++ const struct sk_buff *skb, u32 data_seq,
++ u16 data_len)
++{
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk), *tp = tcp_sk(sk);
++ const struct tcphdr *th = tcp_hdr(skb);
++
++	/* State-machine handling if FIN has been enqueued and it has
++	 * been acked (snd_una == write_seq) - it's important that this
++	 * happens after sk_wmem_free_skb because otherwise
++	 * sk_forward_alloc is wrong upon inet_csk_destroy_sock()
++	 */
++ switch (meta_sk->sk_state) {
++ case TCP_FIN_WAIT1: {
++ struct dst_entry *dst;
++ int tmo;
++
++ if (meta_tp->snd_una != meta_tp->write_seq)
++ break;
++
++ tcp_set_state(meta_sk, TCP_FIN_WAIT2);
++ meta_sk->sk_shutdown |= SEND_SHUTDOWN;
++
++ dst = __sk_dst_get(sk);
++ if (dst)
++ dst_confirm(dst);
++
++ if (!sock_flag(meta_sk, SOCK_DEAD)) {
++ /* Wake up lingering close() */
++ meta_sk->sk_state_change(meta_sk);
++ break;
++ }
++
++ if (meta_tp->linger2 < 0 ||
++ (data_len &&
++ after(data_seq + data_len - (mptcp_is_data_fin2(skb, tp) ? 1 : 0),
++ meta_tp->rcv_nxt))) {
++ mptcp_send_active_reset(meta_sk, GFP_ATOMIC);
++ tcp_done(meta_sk);
++ NET_INC_STATS_BH(sock_net(meta_sk), LINUX_MIB_TCPABORTONDATA);
++ return 1;
++ }
++
++ tmo = tcp_fin_time(meta_sk);
++ if (tmo > TCP_TIMEWAIT_LEN) {
++ inet_csk_reset_keepalive_timer(meta_sk, tmo - TCP_TIMEWAIT_LEN);
++ } else if (mptcp_is_data_fin2(skb, tp) || sock_owned_by_user(meta_sk)) {
++			/* Bad case. We could lose such a FIN otherwise.
++			 * It is not a big problem, but it looks confusing
++			 * and is not so rare an event. We still can lose it now,
++			 * if it spins in bh_lock_sock(), but it is really a
++			 * marginal case.
++			 */
++ inet_csk_reset_keepalive_timer(meta_sk, tmo);
++ } else {
++ meta_tp->ops->time_wait(meta_sk, TCP_FIN_WAIT2, tmo);
++ }
++ break;
++ }
++ case TCP_CLOSING:
++ case TCP_LAST_ACK:
++ if (meta_tp->snd_una == meta_tp->write_seq) {
++ tcp_done(meta_sk);
++ return 1;
++ }
++ break;
++ }
++
++ /* step 7: process the segment text */
++ switch (meta_sk->sk_state) {
++ case TCP_FIN_WAIT1:
++ case TCP_FIN_WAIT2:
++ /* RFC 793 says to queue data in these states,
++ * RFC 1122 says we MUST send a reset.
++ * BSD 4.4 also does reset.
++ */
++ if (meta_sk->sk_shutdown & RCV_SHUTDOWN) {
++ if (TCP_SKB_CB(skb)->end_seq != TCP_SKB_CB(skb)->seq &&
++ after(TCP_SKB_CB(skb)->end_seq - th->fin, tp->rcv_nxt) &&
++ !mptcp_is_data_fin2(skb, tp)) {
++ NET_INC_STATS_BH(sock_net(meta_sk), LINUX_MIB_TCPABORTONDATA);
++ mptcp_send_active_reset(meta_sk, GFP_ATOMIC);
++ tcp_reset(meta_sk);
++ return 1;
++ }
++ }
++ break;
++ }
++
++ return 0;
++}
++
++/**
++ * @return:
++ * i) 1: Everything's fine.
++ * ii) -1: A reset has been sent on the subflow - csum-failure
++ * iii) 0: csum-failure but no reset sent, because it's the last subflow.
++ * Last packet should not be destroyed by the caller because it has
++ * been done here.
++ */
++static int mptcp_verif_dss_csum(struct sock *sk)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct sk_buff *tmp, *tmp1, *last = NULL;
++ __wsum csum_tcp = 0; /* cumulative checksum of pld + mptcp-header */
++ int ans = 1, overflowed = 0, offset = 0, dss_csum_added = 0;
++ int iter = 0;
++
++ skb_queue_walk_safe(&sk->sk_receive_queue, tmp, tmp1) {
++ unsigned int csum_len;
++
++ if (before(tp->mptcp->map_subseq + tp->mptcp->map_data_len, TCP_SKB_CB(tmp)->end_seq))
++ /* Mapping ends in the middle of the packet -
++ * csum only these bytes
++ */
++ csum_len = tp->mptcp->map_subseq + tp->mptcp->map_data_len - TCP_SKB_CB(tmp)->seq;
++ else
++ csum_len = tmp->len;
++
++ offset = 0;
++ if (overflowed) {
++ char first_word[4];
++ first_word[0] = 0;
++ first_word[1] = 0;
++ first_word[2] = 0;
++ first_word[3] = *(tmp->data);
++ csum_tcp = csum_partial(first_word, 4, csum_tcp);
++ offset = 1;
++ csum_len--;
++ overflowed = 0;
++ }
++
++ csum_tcp = skb_checksum(tmp, offset, csum_len, csum_tcp);
++
++		/* Was it an odd length? Then we have to merge the next byte
++		 * correctly (see above)
++		 */
++ if (csum_len != (csum_len & (~1)))
++ overflowed = 1;
++
++ if (mptcp_is_data_seq(tmp) && !dss_csum_added) {
++ __be32 data_seq = htonl((u32)(tp->mptcp->map_data_seq >> 32));
++
++ /* If a 64-bit dss is present, we increase the offset
++ * by 4 bytes, as the high-order 64-bits will be added
++ * in the final csum_partial-call.
++ */
++ u32 offset = skb_transport_offset(tmp) +
++ TCP_SKB_CB(tmp)->dss_off;
++ if (TCP_SKB_CB(tmp)->mptcp_flags & MPTCPHDR_SEQ64_SET)
++ offset += 4;
++
++ csum_tcp = skb_checksum(tmp, offset,
++ MPTCP_SUB_LEN_SEQ_CSUM,
++ csum_tcp);
++
++ csum_tcp = csum_partial(&data_seq,
++ sizeof(data_seq), csum_tcp);
++
++ dss_csum_added = 1; /* Just do it once */
++ }
++ last = tmp;
++ iter++;
++
++ if (!skb_queue_is_last(&sk->sk_receive_queue, tmp) &&
++ !before(TCP_SKB_CB(tmp1)->seq,
++ tp->mptcp->map_subseq + tp->mptcp->map_data_len))
++ break;
++ }
++
++ /* Now, checksum must be 0 */
++ if (unlikely(csum_fold(csum_tcp))) {
++ pr_err("%s csum is wrong: %#x data_seq %u dss_csum_added %d overflowed %d iterations %d\n",
++ __func__, csum_fold(csum_tcp), TCP_SKB_CB(last)->seq,
++ dss_csum_added, overflowed, iter);
++
++ tp->mptcp->send_mp_fail = 1;
++
++ /* map_data_seq is the data-seq number of the
++ * mapping we are currently checking
++ */
++ tp->mpcb->csum_cutoff_seq = tp->mptcp->map_data_seq;
++
++ if (tp->mpcb->cnt_subflows > 1) {
++ mptcp_send_reset(sk);
++ ans = -1;
++ } else {
++ tp->mpcb->send_infinite_mapping = 1;
++
++ /* Need to purge the rcv-queue as it's no more valid */
++ while ((tmp = __skb_dequeue(&sk->sk_receive_queue)) != NULL) {
++ tp->copied_seq = TCP_SKB_CB(tmp)->end_seq;
++ kfree_skb(tmp);
++ }
++
++ ans = 0;
++ }
++ }
++
++ return ans;
++}
++
++static inline void mptcp_prepare_skb(struct sk_buff *skb,
++ const struct sock *sk)
++{
++ const struct tcp_sock *tp = tcp_sk(sk);
++ struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
++ u32 inc = 0;
++
++ /* If skb is the end of this mapping (end is always at mapping-boundary
++ * thanks to the splitting/trimming), then we need to increase
++ * data-end-seq by 1 if this here is a data-fin.
++ *
++ * We need to do -1 because end_seq includes the subflow-FIN.
++ */
++ if (tp->mptcp->map_data_fin &&
++ (tcb->end_seq - (tcp_hdr(skb)->fin ? 1 : 0)) ==
++ (tp->mptcp->map_subseq + tp->mptcp->map_data_len)) {
++ inc = 1;
++
++ /* We manually set the fin-flag if it is a data-fin. For easy
++ * processing in tcp_recvmsg.
++ */
++ tcp_hdr(skb)->fin = 1;
++ } else {
++ /* We may have a subflow-fin with data but without data-fin */
++ tcp_hdr(skb)->fin = 0;
++ }
++
++ /* Adapt data-seq's to the packet itself. We kinda transform the
++ * dss-mapping to a per-packet granularity. This is necessary to
++ * correctly handle overlapping mappings coming from different
++ * subflows. Otherwise it would be a complete mess.
++ */
++ tcb->seq = ((u32)tp->mptcp->map_data_seq) + tcb->seq - tp->mptcp->map_subseq;
++ tcb->end_seq = tcb->seq + skb->len + inc;
++}
++
++/**
++ * @return: 1 if the segment has been eaten and can be suppressed,
++ * otherwise 0.
++ */
++static inline int mptcp_direct_copy(const struct sk_buff *skb,
++ struct sock *meta_sk)
++{
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ int chunk = min_t(unsigned int, skb->len, meta_tp->ucopy.len);
++ int eaten = 0;
++
++ __set_current_state(TASK_RUNNING);
++
++ local_bh_enable();
++ if (!skb_copy_datagram_iovec(skb, 0, meta_tp->ucopy.iov, chunk)) {
++ meta_tp->ucopy.len -= chunk;
++ meta_tp->copied_seq += chunk;
++ eaten = (chunk == skb->len);
++ tcp_rcv_space_adjust(meta_sk);
++ }
++ local_bh_disable();
++ return eaten;
++}
++
++static inline void mptcp_reset_mapping(struct tcp_sock *tp)
++{
++ tp->mptcp->map_data_len = 0;
++ tp->mptcp->map_data_seq = 0;
++ tp->mptcp->map_subseq = 0;
++ tp->mptcp->map_data_fin = 0;
++ tp->mptcp->mapping_present = 0;
++}
++
++/* The DSS-mapping received on the sk only covers the second half of the skb
++ * (cut at seq). We trim the head from the skb.
++ * Data will be freed upon kfree().
++ *
++ * Inspired by tcp_trim_head().
++ */
++static void mptcp_skb_trim_head(struct sk_buff *skb, struct sock *sk, u32 seq)
++{
++ int len = seq - TCP_SKB_CB(skb)->seq;
++ u32 new_seq = TCP_SKB_CB(skb)->seq + len;
++
++ if (len < skb_headlen(skb))
++ __skb_pull(skb, len);
++ else
++ __pskb_trim_head(skb, len - skb_headlen(skb));
++
++ TCP_SKB_CB(skb)->seq = new_seq;
++
++ skb->truesize -= len;
++ atomic_sub(len, &sk->sk_rmem_alloc);
++ sk_mem_uncharge(sk, len);
++}
++
++/* The DSS-mapping received on the sk only covers the first half of the skb
++ * (cut at seq). We create a second skb (@return), and queue it in the rcv-queue
++ * as further packets may resolve the mapping of the second half of data.
++ *
++ * Inspired by tcp_fragment().
++ */
++static int mptcp_skb_split_tail(struct sk_buff *skb, struct sock *sk, u32 seq)
++{
++ struct sk_buff *buff;
++ int nsize;
++ int nlen, len;
++
++ len = seq - TCP_SKB_CB(skb)->seq;
++ nsize = skb_headlen(skb) - len + tcp_sk(sk)->tcp_header_len;
++ if (nsize < 0)
++ nsize = 0;
++
++ /* Get a new skb... force flag on. */
++ buff = alloc_skb(nsize, GFP_ATOMIC);
++ if (buff == NULL)
++ return -ENOMEM;
++
++ skb_reserve(buff, tcp_sk(sk)->tcp_header_len);
++ skb_reset_transport_header(buff);
++
++ tcp_hdr(buff)->fin = tcp_hdr(skb)->fin;
++ tcp_hdr(skb)->fin = 0;
++
++	/* We absolutely need to call skb_set_owner_r before refreshing the
++	 * truesize of buff, otherwise the moved data will account twice.
++	 */
++ skb_set_owner_r(buff, sk);
++ nlen = skb->len - len - nsize;
++ buff->truesize += nlen;
++ skb->truesize -= nlen;
++
++ /* Correct the sequence numbers. */
++ TCP_SKB_CB(buff)->seq = TCP_SKB_CB(skb)->seq + len;
++ TCP_SKB_CB(buff)->end_seq = TCP_SKB_CB(skb)->end_seq;
++ TCP_SKB_CB(skb)->end_seq = TCP_SKB_CB(buff)->seq;
++
++ skb_split(skb, buff, len);
++
++ __skb_queue_after(&sk->sk_receive_queue, skb, buff);
++
++ return 0;
++}
++
++/* @return: 0 everything is fine. Just continue processing
++ * 1 subflow is broken stop everything
++ * -1 this packet was broken - continue with the next one.
++ */
++static int mptcp_prevalidate_skb(struct sock *sk, struct sk_buff *skb)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++
++ /* If we are in infinite mode, the subflow-fin is in fact a data-fin. */
++ if (!skb->len && tcp_hdr(skb)->fin && !mptcp_is_data_fin(skb) &&
++ !tp->mpcb->infinite_mapping_rcv) {
++ /* Remove a pure subflow-fin from the queue and increase
++ * copied_seq.
++ */
++ tp->copied_seq = TCP_SKB_CB(skb)->end_seq;
++ __skb_unlink(skb, &sk->sk_receive_queue);
++ __kfree_skb(skb);
++ return -1;
++ }
++
++ /* If we are not yet fully established and do not know the mapping for
++ * this segment, this path has to fallback to infinite or be torn down.
++ */
++ if (!tp->mptcp->fully_established && !mptcp_is_data_seq(skb) &&
++ !tp->mptcp->mapping_present && !tp->mpcb->infinite_mapping_rcv) {
++ pr_err("%s %#x will fallback - pi %d from %pS, seq %u\n",
++ __func__, tp->mpcb->mptcp_loc_token,
++ tp->mptcp->path_index, __builtin_return_address(0),
++ TCP_SKB_CB(skb)->seq);
++
++ if (!is_master_tp(tp)) {
++ mptcp_send_reset(sk);
++ return 1;
++ }
++
++ tp->mpcb->infinite_mapping_snd = 1;
++ tp->mpcb->infinite_mapping_rcv = 1;
++		/* We do a seamless fallback and should not send an infinite mapping. */
++ tp->mpcb->send_infinite_mapping = 0;
++ tp->mptcp->fully_established = 1;
++ }
++
++ /* Receiver-side becomes fully established when a whole rcv-window has
++ * been received without the need to fallback due to the previous
++ * condition.
++ */
++ if (!tp->mptcp->fully_established) {
++ tp->mptcp->init_rcv_wnd -= skb->len;
++ if (tp->mptcp->init_rcv_wnd < 0)
++ mptcp_become_fully_estab(sk);
++ }
++
++ return 0;
++}
++
++/* @return: 0 everything is fine. Just continue processing
++ * 1 subflow is broken stop everything
++ * -1 this packet was broken - continue with the next one.
++ */
++static int mptcp_detect_mapping(struct sock *sk, struct sk_buff *skb)
++{
++ struct tcp_sock *tp = tcp_sk(sk), *meta_tp = mptcp_meta_tp(tp);
++ struct mptcp_cb *mpcb = tp->mpcb;
++ struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
++ u32 *ptr;
++ u32 data_seq, sub_seq, data_len, tcp_end_seq;
++
++ /* If we are in infinite-mapping-mode, the subflow is guaranteed to be
++ * in-order at the data-level. Thus data-seq-numbers can be inferred
++ * from what is expected at the data-level.
++ */
++ if (mpcb->infinite_mapping_rcv) {
++ tp->mptcp->map_data_seq = mptcp_get_rcv_nxt_64(meta_tp);
++ tp->mptcp->map_subseq = tcb->seq;
++ tp->mptcp->map_data_len = skb->len;
++ tp->mptcp->map_data_fin = tcp_hdr(skb)->fin;
++ tp->mptcp->mapping_present = 1;
++ return 0;
++ }
++
++ /* No mapping here? Exit - it is either already set or still on its way */
++ if (!mptcp_is_data_seq(skb)) {
++ /* Too many packets without a mapping - this subflow is broken */
++ if (!tp->mptcp->mapping_present &&
++ tp->rcv_nxt - tp->copied_seq > 65536) {
++ mptcp_send_reset(sk);
++ return 1;
++ }
++
++ return 0;
++ }
++
++ ptr = mptcp_skb_set_data_seq(skb, &data_seq, mpcb);
++ ptr++;
++ sub_seq = get_unaligned_be32(ptr) + tp->mptcp->rcv_isn;
++ ptr++;
++ data_len = get_unaligned_be16(ptr);
++
++ /* If it's an empty skb with DATA_FIN, sub_seq must get fixed.
++ * The draft sets it to 0, but we really would like to have the
++ * real value, to have an easy handling afterwards here in this
++ * function.
++ */
++ if (mptcp_is_data_fin(skb) && skb->len == 0)
++ sub_seq = TCP_SKB_CB(skb)->seq;
++
++ /* If there is already a mapping - we check if it maps with the current
++ * one. If not - we reset.
++ */
++ if (tp->mptcp->mapping_present &&
++ (data_seq != (u32)tp->mptcp->map_data_seq ||
++ sub_seq != tp->mptcp->map_subseq ||
++ data_len != tp->mptcp->map_data_len + tp->mptcp->map_data_fin ||
++ mptcp_is_data_fin(skb) != tp->mptcp->map_data_fin)) {
++ /* Mapping in packet is different from what we want */
++ pr_err("%s Mappings do not match!\n", __func__);
++ pr_err("%s dseq %u mdseq %u, sseq %u msseq %u dlen %u mdlen %u dfin %d mdfin %d\n",
++ __func__, data_seq, (u32)tp->mptcp->map_data_seq,
++ sub_seq, tp->mptcp->map_subseq, data_len,
++ tp->mptcp->map_data_len, mptcp_is_data_fin(skb),
++ tp->mptcp->map_data_fin);
++ mptcp_send_reset(sk);
++ return 1;
++ }
++
++ /* If the previous check was good, the current mapping is valid and we exit. */
++ if (tp->mptcp->mapping_present)
++ return 0;
++
++ /* Mapping not yet set on this subflow - we set it here! */
++
++ if (!data_len) {
++ mpcb->infinite_mapping_rcv = 1;
++ tp->mptcp->fully_established = 1;
++ /* We need to repeat mp_fail's until the sender fell
++ * back to infinite-mapping - here we stop repeating it.
++ */
++ tp->mptcp->send_mp_fail = 0;
++
++ /* We have to fixup data_len - it must be the same as skb->len */
++ data_len = skb->len + (mptcp_is_data_fin(skb) ? 1 : 0);
++ sub_seq = tcb->seq;
++
++ /* TODO kill all other subflows than this one */
++ /* data_seq and so on are set correctly */
++
++ /* At this point, the meta-ofo-queue has to be emptied,
++ * as the following data is guaranteed to be in-order at
++ * the data and subflow-level
++ */
++ mptcp_purge_ofo_queue(meta_tp);
++ }
++
++ /* We are sending mp-fail's and thus are in fallback mode.
++ * Ignore packets which do not announce the fallback and still
++ * want to provide a mapping.
++ */
++ if (tp->mptcp->send_mp_fail) {
++ tp->copied_seq = TCP_SKB_CB(skb)->end_seq;
++ __skb_unlink(skb, &sk->sk_receive_queue);
++ __kfree_skb(skb);
++ return -1;
++ }
++
++ /* FIN increased the mapping-length by 1 */
++ if (mptcp_is_data_fin(skb))
++ data_len--;
++
++ /* The subflow-sequences of the packet must be
++ * (at least partially) part of the DSS-mapping's
++ * subflow-sequence-space.
++ *
++ * Basically the mapping is not valid, if either of the
++ * following conditions is true:
++ *
++ * 1. It's not a data_fin and
++ * MPTCP-sub_seq >= TCP-end_seq
++ *
++ * 2. It's a data_fin and TCP-end_seq > TCP-seq and
++ * MPTCP-sub_seq >= TCP-end_seq
++ *
++ * The previous two can be merged into:
++ * TCP-end_seq > TCP-seq and MPTCP-sub_seq >= TCP-end_seq
++ * Because if it's not a data-fin, TCP-end_seq > TCP-seq
++ *
++ * 3. It's a data_fin and skb->len == 0 and
++ * MPTCP-sub_seq > TCP-end_seq
++ *
++ * 4. It's not a data_fin and TCP-end_seq > TCP-seq and
++ * MPTCP-sub_seq + MPTCP-data_len <= TCP-seq
++ *
++ * 5. MPTCP-sub_seq is prior to what we already copied (copied_seq)
++ */
++
++ /* subflow-fin is not part of the mapping - ignore it here ! */
++ tcp_end_seq = tcb->end_seq - tcp_hdr(skb)->fin;
++ if ((!before(sub_seq, tcb->end_seq) && after(tcp_end_seq, tcb->seq)) ||
++ (mptcp_is_data_fin(skb) && skb->len == 0 && after(sub_seq, tcb->end_seq)) ||
++ (!after(sub_seq + data_len, tcb->seq) && after(tcp_end_seq, tcb->seq)) ||
++ before(sub_seq, tp->copied_seq)) {
++ /* The packet's subflow-sequences differ from what is in the
++ * packet's dss-mapping. The peer is misbehaving - reset
++ */
++ pr_err("%s Packet's mapping does not map to the DSS sub_seq %u "
++ "end_seq %u, tcp_end_seq %u seq %u dfin %u len %u data_len %u "
++ "copied_seq %u\n", __func__, sub_seq, tcb->end_seq, tcp_end_seq, tcb->seq, mptcp_is_data_fin(skb),
++ skb->len, data_len, tp->copied_seq);
++ mptcp_send_reset(sk);
++ return 1;
++ }
++
++ /* Did the DSS have 64-bit seqnums? */
++ if (!(tcb->mptcp_flags & MPTCPHDR_SEQ64_SET)) {
++ /* Wrapped around? */
++ if (unlikely(after(data_seq, meta_tp->rcv_nxt) && data_seq < meta_tp->rcv_nxt)) {
++ tp->mptcp->map_data_seq = mptcp_get_data_seq_64(mpcb, !mpcb->rcv_hiseq_index, data_seq);
++ } else {
++ /* Else, access the default high-order bits */
++ tp->mptcp->map_data_seq = mptcp_get_data_seq_64(mpcb, mpcb->rcv_hiseq_index, data_seq);
++ }
++ } else {
++ tp->mptcp->map_data_seq = mptcp_get_data_seq_64(mpcb, (tcb->mptcp_flags & MPTCPHDR_SEQ64_INDEX) ? 1 : 0, data_seq);
++
++ if (unlikely(tcb->mptcp_flags & MPTCPHDR_SEQ64_OFO)) {
++ /* We make sure that the data_seq is invalid.
++ * It will be dropped later.
++ */
++ tp->mptcp->map_data_seq += 0xFFFFFFFF;
++ tp->mptcp->map_data_seq += 0xFFFFFFFF;
++ }
++ }
++
++ tp->mptcp->map_data_len = data_len;
++ tp->mptcp->map_subseq = sub_seq;
++ tp->mptcp->map_data_fin = mptcp_is_data_fin(skb) ? 1 : 0;
++ tp->mptcp->mapping_present = 1;
++
++ return 0;
++}
++
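The mapping checks above lean heavily on the kernel's wrap-around-safe sequence comparisons. A standalone sketch of the `before()`/`after()` idiom (the real helpers live in `include/net/tcp.h`; `seq_before`/`seq_after` are illustrative names for this sketch):

```c
#include <stdint.h>

/* Re-implementation, for illustration only, of the kernel's before()/after()
 * helpers: casting the unsigned difference to a signed 32-bit value makes the
 * comparison correct even when the sequence space has wrapped. */
static inline int seq_before(uint32_t seq1, uint32_t seq2)
{
	return (int32_t)(seq1 - seq2) < 0;
}

static inline int seq_after(uint32_t seq2, uint32_t seq1)
{
	return seq_before(seq1, seq2);
}
```

This is why checks like `before(sub_seq, tp->copied_seq)` stay valid across the 2^32 boundary, while a plain `<` would not.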
++/* Similar to tcp_sequence(...) */
++static inline bool mptcp_sequence(const struct tcp_sock *meta_tp,
++ u64 data_seq, u64 end_data_seq)
++{
++ const struct mptcp_cb *mpcb = meta_tp->mpcb;
++ u64 rcv_wup64;
++
++ /* Wrap-around? */
++ if (meta_tp->rcv_wup > meta_tp->rcv_nxt) {
++ rcv_wup64 = ((u64)(mpcb->rcv_high_order[mpcb->rcv_hiseq_index] - 1) << 32) |
++ meta_tp->rcv_wup;
++ } else {
++ rcv_wup64 = mptcp_get_data_seq_64(mpcb, mpcb->rcv_hiseq_index,
++ meta_tp->rcv_wup);
++ }
++
++ return !before64(end_data_seq, rcv_wup64) &&
++ !after64(data_seq, mptcp_get_rcv_nxt_64(meta_tp) + tcp_receive_window(meta_tp));
++}
++
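`mptcp_get_data_seq_64()` is not part of this hunk; based on how `rcv_high_order[]` and `rcv_hiseq_index` are used in `mptcp_sequence()` above, it presumably extends a 32-bit DSS sequence number with the tracked high-order word. A minimal sketch under that assumption (`data_seq_64` is a hypothetical name):

```c
#include <stdint.h>

/* Assumed layout: extend a 32-bit DSS sequence number to 64 bits by
 * prepending the high-order word tracked for the given index. */
static inline uint64_t data_seq_64(const uint32_t high_order[2], int index,
				   uint32_t data_seq32)
{
	return ((uint64_t)high_order[index] << 32) | data_seq32;
}
```

Under this reading, the wrap-around branch in `mptcp_sequence()` simply picks `high_order[index] - 1` for `rcv_wup`, because `rcv_wup` still belongs to the previous 32-bit epoch.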
++/* @return: 0 everything is fine. Just continue processing
++ * -1 this packet was broken - continue with the next one.
++ */
++static int mptcp_validate_mapping(struct sock *sk, struct sk_buff *skb)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct sk_buff *tmp, *tmp1;
++ u32 tcp_end_seq;
++
++ if (!tp->mptcp->mapping_present)
++ return 0;
++
++ /* Either the new skb gave us the mapping and the first segment
++ * in the sub-rcv-queue has to be trimmed ...
++ */
++ tmp = skb_peek(&sk->sk_receive_queue);
++ if (before(TCP_SKB_CB(tmp)->seq, tp->mptcp->map_subseq) &&
++ after(TCP_SKB_CB(tmp)->end_seq, tp->mptcp->map_subseq))
++ mptcp_skb_trim_head(tmp, sk, tp->mptcp->map_subseq);
++
++ /* ... or the new skb (tail) has to be split at the end. */
++ tcp_end_seq = TCP_SKB_CB(skb)->end_seq - (tcp_hdr(skb)->fin ? 1 : 0);
++ if (after(tcp_end_seq, tp->mptcp->map_subseq + tp->mptcp->map_data_len)) {
++ u32 seq = tp->mptcp->map_subseq + tp->mptcp->map_data_len;
++ if (mptcp_skb_split_tail(skb, sk, seq)) { /* Allocation failed */
++ /* TODO : maybe handle this here better.
++ * We now just force meta-retransmission.
++ */
++ tp->copied_seq = TCP_SKB_CB(skb)->end_seq;
++ __skb_unlink(skb, &sk->sk_receive_queue);
++ __kfree_skb(skb);
++ return -1;
++ }
++ }
++
++ /* Now, remove old sk_buff's from the receive-queue.
++ * This may happen if the mapping has been lost for these segments and
++ * the next mapping has already been received.
++ */
++ if (before(TCP_SKB_CB(skb_peek(&sk->sk_receive_queue))->seq, tp->mptcp->map_subseq)) {
++ skb_queue_walk_safe(&sk->sk_receive_queue, tmp1, tmp) {
++ if (!before(TCP_SKB_CB(tmp1)->seq, tp->mptcp->map_subseq))
++ break;
++
++ tp->copied_seq = TCP_SKB_CB(tmp1)->end_seq;
++ __skb_unlink(tmp1, &sk->sk_receive_queue);
++
++ /* It is impossible that we could free the skb here, because its
++ * mapping is known to be valid from the previous checks
++ */
++ __kfree_skb(tmp1);
++ }
++ }
++
++ return 0;
++}
++
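The head-trim/tail-split decisions in `mptcp_validate_mapping()` reduce to two wrap-around-safe interval checks of the segment `[seq, end_seq)` against the mapping `[map_subseq, map_subseq + map_data_len)`. A hypothetical helper showing just the arithmetic:

```c
#include <stdint.h>

/* Illustrative only: report whether a segment's head lies before the
 * mapping start (trim it) or its tail runs past the mapping end (split
 * it off). Signed-difference comparisons handle sequence wrap. */
static void map_boundaries(uint32_t seq, uint32_t end_seq,
			   uint32_t map_subseq, uint32_t map_len,
			   int *trim_head, int *split_tail)
{
	uint32_t map_end = map_subseq + map_len;

	*trim_head = (int32_t)(seq - map_subseq) < 0 &&
		     (int32_t)(end_seq - map_subseq) > 0;
	*split_tail = (int32_t)(end_seq - map_end) > 0;
}
```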
++/* @return: 0 everything is fine. Just continue processing
++ * 1 the subflow is broken - stop everything
++ * -1 this mapping has been put in the meta-receive-queue
++ * -2 this mapping has been eaten by the application
++ */
++static int mptcp_queue_skb(struct sock *sk)
++{
++ struct tcp_sock *tp = tcp_sk(sk), *meta_tp = mptcp_meta_tp(tp);
++ struct sock *meta_sk = mptcp_meta_sk(sk);
++ struct mptcp_cb *mpcb = tp->mpcb;
++ struct sk_buff *tmp, *tmp1;
++ u64 rcv_nxt64 = mptcp_get_rcv_nxt_64(meta_tp);
++ bool data_queued = false;
++
++ /* Have we not yet received the full mapping? */
++ if (!tp->mptcp->mapping_present ||
++ before(tp->rcv_nxt, tp->mptcp->map_subseq + tp->mptcp->map_data_len))
++ return 0;
++
++ /* Is this an overlapping mapping? rcv_nxt >= end_data_seq
++ * OR
++ * This mapping is out of window
++ */
++ if (!before64(rcv_nxt64, tp->mptcp->map_data_seq + tp->mptcp->map_data_len + tp->mptcp->map_data_fin) ||
++ !mptcp_sequence(meta_tp, tp->mptcp->map_data_seq,
++ tp->mptcp->map_data_seq + tp->mptcp->map_data_len + tp->mptcp->map_data_fin)) {
++ skb_queue_walk_safe(&sk->sk_receive_queue, tmp1, tmp) {
++ __skb_unlink(tmp1, &sk->sk_receive_queue);
++ tp->copied_seq = TCP_SKB_CB(tmp1)->end_seq;
++ __kfree_skb(tmp1);
++
++ if (!skb_queue_empty(&sk->sk_receive_queue) &&
++ !before(TCP_SKB_CB(tmp)->seq,
++ tp->mptcp->map_subseq + tp->mptcp->map_data_len))
++ break;
++ }
++
++ mptcp_reset_mapping(tp);
++
++ return -1;
++ }
++
++ /* Record it, because we want to send our data_fin on the same path */
++ if (tp->mptcp->map_data_fin) {
++ mpcb->dfin_path_index = tp->mptcp->path_index;
++ mpcb->dfin_combined = !!(sk->sk_shutdown & RCV_SHUTDOWN);
++ }
++
++ /* Verify the checksum */
++ if (mpcb->dss_csum && !mpcb->infinite_mapping_rcv) {
++ int ret = mptcp_verif_dss_csum(sk);
++
++ if (ret <= 0) {
++ mptcp_reset_mapping(tp);
++ return 1;
++ }
++ }
++
++ if (before64(rcv_nxt64, tp->mptcp->map_data_seq)) {
++ /* Seg's have to go to the meta-ofo-queue */
++ skb_queue_walk_safe(&sk->sk_receive_queue, tmp1, tmp) {
++ tp->copied_seq = TCP_SKB_CB(tmp1)->end_seq;
++ mptcp_prepare_skb(tmp1, sk);
++ __skb_unlink(tmp1, &sk->sk_receive_queue);
++ /* MUST be done here, because fragstolen may be true later.
++ * Then, kfree_skb_partial will not account the memory.
++ */
++ skb_orphan(tmp1);
++
++ if (!mpcb->in_time_wait) /* In time-wait, do not receive data */
++ mptcp_add_meta_ofo_queue(meta_sk, tmp1, sk);
++ else
++ __kfree_skb(tmp1);
++
++ if (!skb_queue_empty(&sk->sk_receive_queue) &&
++ !before(TCP_SKB_CB(tmp)->seq,
++ tp->mptcp->map_subseq + tp->mptcp->map_data_len))
++ break;
++ }
++ tcp_enter_quickack_mode(sk);
++ } else {
++ /* Ready for the meta-rcv-queue */
++ skb_queue_walk_safe(&sk->sk_receive_queue, tmp1, tmp) {
++ int eaten = 0;
++ bool copied_early = false;
++ bool fragstolen = false;
++ u32 old_rcv_nxt = meta_tp->rcv_nxt;
++
++ tp->copied_seq = TCP_SKB_CB(tmp1)->end_seq;
++ mptcp_prepare_skb(tmp1, sk);
++ __skb_unlink(tmp1, &sk->sk_receive_queue);
++ /* MUST be done here, because fragstolen may be true.
++ * Then, kfree_skb_partial will not account the memory.
++ */
++ skb_orphan(tmp1);
++
++ /* This segment has already been received */
++ if (!after(TCP_SKB_CB(tmp1)->end_seq, meta_tp->rcv_nxt)) {
++ __kfree_skb(tmp1);
++ goto next;
++ }
++
++#ifdef CONFIG_NET_DMA
++ if (TCP_SKB_CB(tmp1)->seq == meta_tp->rcv_nxt &&
++ meta_tp->ucopy.task == current &&
++ meta_tp->copied_seq == meta_tp->rcv_nxt &&
++ tmp1->len <= meta_tp->ucopy.len &&
++ sock_owned_by_user(meta_sk) &&
++ tcp_dma_try_early_copy(meta_sk, tmp1, 0)) {
++ copied_early = true;
++ eaten = 1;
++ }
++#endif
++
++ /* Is direct copy possible ? */
++ if (TCP_SKB_CB(tmp1)->seq == meta_tp->rcv_nxt &&
++ meta_tp->ucopy.task == current &&
++ meta_tp->copied_seq == meta_tp->rcv_nxt &&
++ meta_tp->ucopy.len && sock_owned_by_user(meta_sk) &&
++ !copied_early)
++ eaten = mptcp_direct_copy(tmp1, meta_sk);
++
++ if (mpcb->in_time_wait) /* In time-wait, do not receive data */
++ eaten = 1;
++
++ if (!eaten)
++ eaten = tcp_queue_rcv(meta_sk, tmp1, 0, &fragstolen);
++
++ meta_tp->rcv_nxt = TCP_SKB_CB(tmp1)->end_seq;
++ mptcp_check_rcvseq_wrap(meta_tp, old_rcv_nxt);
++
++#ifdef CONFIG_NET_DMA
++ if (copied_early)
++ meta_tp->cleanup_rbuf(meta_sk, tmp1->len);
++#endif
++
++ if (tcp_hdr(tmp1)->fin && !mpcb->in_time_wait)
++ mptcp_fin(meta_sk);
++
++ /* Check if this fills a gap in the ofo queue */
++ if (!skb_queue_empty(&meta_tp->out_of_order_queue))
++ mptcp_ofo_queue(meta_sk);
++
++#ifdef CONFIG_NET_DMA
++ if (copied_early)
++ __skb_queue_tail(&meta_sk->sk_async_wait_queue,
++ tmp1);
++ else
++#endif
++ if (eaten)
++ kfree_skb_partial(tmp1, fragstolen);
++
++ data_queued = true;
++next:
++ if (!skb_queue_empty(&sk->sk_receive_queue) &&
++ !before(TCP_SKB_CB(tmp)->seq,
++ tp->mptcp->map_subseq + tp->mptcp->map_data_len))
++ break;
++ }
++ }
++
++ inet_csk(meta_sk)->icsk_ack.lrcvtime = tcp_time_stamp;
++ mptcp_reset_mapping(tp);
++
++ return data_queued ? -1 : -2;
++}
++
++void mptcp_data_ready(struct sock *sk)
++{
++ struct sock *meta_sk = mptcp_meta_sk(sk);
++ struct sk_buff *skb, *tmp;
++ int queued = 0;
++
++ /* restart before the check, because mptcp_fin might have changed the
++ * state.
++ */
++restart:
++ /* If the meta cannot receive data, there is no point in pushing data.
++ * If we are in time-wait, we may still be waiting for the final FIN.
++ * So, we should proceed with the processing.
++ */
++ if (!mptcp_sk_can_recv(meta_sk) && !tcp_sk(sk)->mpcb->in_time_wait) {
++ skb_queue_purge(&sk->sk_receive_queue);
++ tcp_sk(sk)->copied_seq = tcp_sk(sk)->rcv_nxt;
++ goto exit;
++ }
++
++ /* Iterate over all segments, detect their mapping (if we don't have
++ * one yet), validate them and push everything one level higher.
++ */
++ skb_queue_walk_safe(&sk->sk_receive_queue, skb, tmp) {
++ int ret;
++ /* Pre-validation - e.g., early fallback */
++ ret = mptcp_prevalidate_skb(sk, skb);
++ if (ret < 0)
++ goto restart;
++ else if (ret > 0)
++ break;
++
++ /* Set the current mapping */
++ ret = mptcp_detect_mapping(sk, skb);
++ if (ret < 0)
++ goto restart;
++ else if (ret > 0)
++ break;
++
++ /* Validation */
++ if (mptcp_validate_mapping(sk, skb) < 0)
++ goto restart;
++
++ /* Push a level higher */
++ ret = mptcp_queue_skb(sk);
++ if (ret < 0) {
++ if (ret == -1)
++ queued = ret;
++ goto restart;
++ } else if (ret == 0) {
++ continue;
++ } else { /* ret == 1 */
++ break;
++ }
++ }
++
++exit:
++ if (tcp_sk(sk)->close_it) {
++ tcp_send_ack(sk);
++ tcp_sk(sk)->ops->time_wait(sk, TCP_TIME_WAIT, 0);
++ }
++
++ if (queued == -1 && !sock_flag(meta_sk, SOCK_DEAD))
++ meta_sk->sk_data_ready(meta_sk);
++}
++
++
++int mptcp_check_req(struct sk_buff *skb, struct net *net)
++{
++ const struct tcphdr *th = tcp_hdr(skb);
++ struct sock *meta_sk = NULL;
++
++ /* MPTCP structures not initialized */
++ if (mptcp_init_failed)
++ return 0;
++
++ if (skb->protocol == htons(ETH_P_IP))
++ meta_sk = mptcp_v4_search_req(th->source, ip_hdr(skb)->saddr,
++ ip_hdr(skb)->daddr, net);
++#if IS_ENABLED(CONFIG_IPV6)
++ else /* IPv6 */
++ meta_sk = mptcp_v6_search_req(th->source, &ipv6_hdr(skb)->saddr,
++ &ipv6_hdr(skb)->daddr, net);
++#endif /* CONFIG_IPV6 */
++
++ if (!meta_sk)
++ return 0;
++
++ TCP_SKB_CB(skb)->mptcp_flags |= MPTCPHDR_JOIN;
++
++ bh_lock_sock_nested(meta_sk);
++ if (sock_owned_by_user(meta_sk)) {
++ skb->sk = meta_sk;
++ if (unlikely(sk_add_backlog(meta_sk, skb,
++ meta_sk->sk_rcvbuf + meta_sk->sk_sndbuf))) {
++ bh_unlock_sock(meta_sk);
++ NET_INC_STATS_BH(net, LINUX_MIB_TCPBACKLOGDROP);
++ sock_put(meta_sk); /* Taken by mptcp_search_req */
++ kfree_skb(skb);
++ return 1;
++ }
++ } else if (skb->protocol == htons(ETH_P_IP)) {
++ tcp_v4_do_rcv(meta_sk, skb);
++#if IS_ENABLED(CONFIG_IPV6)
++ } else { /* IPv6 */
++ tcp_v6_do_rcv(meta_sk, skb);
++#endif /* CONFIG_IPV6 */
++ }
++ bh_unlock_sock(meta_sk);
++ sock_put(meta_sk); /* Taken by mptcp_vX_search_req */
++ return 1;
++}
++
++struct mp_join *mptcp_find_join(const struct sk_buff *skb)
++{
++ const struct tcphdr *th = tcp_hdr(skb);
++ unsigned char *ptr;
++ int length = (th->doff * 4) - sizeof(struct tcphdr);
++
++ /* Jump through the options to check whether JOIN is there */
++ ptr = (unsigned char *)(th + 1);
++ while (length > 0) {
++ int opcode = *ptr++;
++ int opsize;
++
++ switch (opcode) {
++ case TCPOPT_EOL:
++ return NULL;
++ case TCPOPT_NOP: /* Ref: RFC 793 section 3.1 */
++ length--;
++ continue;
++ default:
++ opsize = *ptr++;
++ if (opsize < 2) /* "silly options" */
++ return NULL;
++ if (opsize > length)
++ return NULL; /* don't parse partial options */
++ if (opcode == TCPOPT_MPTCP &&
++ ((struct mptcp_option *)(ptr - 2))->sub == MPTCP_SUB_JOIN) {
++ return (struct mp_join *)(ptr - 2);
++ }
++ ptr += opsize - 2;
++ length -= opsize;
++ }
++ }
++ return NULL;
++}
++
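The option walk in `mptcp_find_join()` is the standard RFC 793 TCP option scan. A self-contained version over a bare byte buffer (`find_tcp_option` is an illustrative name; it adds a truncation guard before reading the length byte):

```c
#include <stddef.h>

/* Sketch of the option walk: return a pointer to the start of the first
 * option with the wanted kind, or NULL. Kinds follow RFC 793:
 * 0 = EOL (end of option list), 1 = NOP (single-byte padding). */
static const unsigned char *find_tcp_option(const unsigned char *ptr,
					    int length, int wanted_kind)
{
	while (length > 0) {
		int opcode = *ptr++;
		int opsize;

		switch (opcode) {
		case 0:			/* TCPOPT_EOL */
			return NULL;
		case 1:			/* TCPOPT_NOP */
			length--;
			continue;
		default:
			if (length < 2)
				return NULL;
			opsize = *ptr++;
			if (opsize < 2 || opsize > length)
				return NULL; /* silly or partial option */
			if (opcode == wanted_kind)
				return ptr - 2;
			ptr += opsize - 2;
			length -= opsize;
		}
	}
	return NULL;
}
```

The kernel version additionally dereferences the match as `struct mp_join *` after checking the MPTCP subtype field.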
++int mptcp_lookup_join(struct sk_buff *skb, struct inet_timewait_sock *tw)
++{
++ const struct mptcp_cb *mpcb;
++ struct sock *meta_sk;
++ u32 token;
++ bool meta_v4;
++ struct mp_join *join_opt = mptcp_find_join(skb);
++ if (!join_opt)
++ return 0;
++
++ /* MPTCP structures were not initialized, so return error */
++ if (mptcp_init_failed)
++ return -1;
++
++ token = join_opt->u.syn.token;
++ meta_sk = mptcp_hash_find(dev_net(skb_dst(skb)->dev), token);
++ if (!meta_sk) {
++ mptcp_debug("%s:mpcb not found:%x\n", __func__, token);
++ return -1;
++ }
++
++ meta_v4 = meta_sk->sk_family == AF_INET;
++ if (meta_v4) {
++ if (skb->protocol == htons(ETH_P_IPV6)) {
++ mptcp_debug("SYN+MP_JOIN with IPV6 address on pure IPV4 meta\n");
++ sock_put(meta_sk); /* Taken by mptcp_hash_find */
++ return -1;
++ }
++ } else if (skb->protocol == htons(ETH_P_IP) &&
++ inet6_sk(meta_sk)->ipv6only) {
++ mptcp_debug("SYN+MP_JOIN with IPV4 address on IPV6_V6ONLY meta\n");
++ sock_put(meta_sk); /* Taken by mptcp_hash_find */
++ return -1;
++ }
++
++ mpcb = tcp_sk(meta_sk)->mpcb;
++ if (mpcb->infinite_mapping_rcv || mpcb->send_infinite_mapping) {
++ /* We are in fallback-mode on the reception-side -
++ * no new subflows!
++ */
++ sock_put(meta_sk); /* Taken by mptcp_hash_find */
++ return -1;
++ }
++
++ /* Coming from time-wait-sock processing in tcp_v4_rcv.
++ * We have to deschedule it before continuing, because otherwise
++ * mptcp_v4_do_rcv will hit again on it inside tcp_v4_hnd_req.
++ */
++ if (tw) {
++ inet_twsk_deschedule(tw, &tcp_death_row);
++ inet_twsk_put(tw);
++ }
++
++ TCP_SKB_CB(skb)->mptcp_flags |= MPTCPHDR_JOIN;
++ /* OK, this is a new syn/join, let's create a new open request and
++ * send syn+ack
++ */
++ bh_lock_sock_nested(meta_sk);
++ if (sock_owned_by_user(meta_sk)) {
++ skb->sk = meta_sk;
++ if (unlikely(sk_add_backlog(meta_sk, skb,
++ meta_sk->sk_rcvbuf + meta_sk->sk_sndbuf))) {
++ bh_unlock_sock(meta_sk);
++ NET_INC_STATS_BH(sock_net(meta_sk),
++ LINUX_MIB_TCPBACKLOGDROP);
++ sock_put(meta_sk); /* Taken by mptcp_hash_find */
++ kfree_skb(skb);
++ return 1;
++ }
++ } else if (skb->protocol == htons(ETH_P_IP)) {
++ tcp_v4_do_rcv(meta_sk, skb);
++#if IS_ENABLED(CONFIG_IPV6)
++ } else {
++ tcp_v6_do_rcv(meta_sk, skb);
++#endif /* CONFIG_IPV6 */
++ }
++ bh_unlock_sock(meta_sk);
++ sock_put(meta_sk); /* Taken by mptcp_hash_find */
++ return 1;
++}
++
++int mptcp_do_join_short(struct sk_buff *skb,
++ const struct mptcp_options_received *mopt,
++ struct net *net)
++{
++ struct sock *meta_sk;
++ u32 token;
++ bool meta_v4;
++
++ token = mopt->mptcp_rem_token;
++ meta_sk = mptcp_hash_find(net, token);
++ if (!meta_sk) {
++ mptcp_debug("%s:mpcb not found:%x\n", __func__, token);
++ return -1;
++ }
++
++ meta_v4 = meta_sk->sk_family == AF_INET;
++ if (meta_v4) {
++ if (skb->protocol == htons(ETH_P_IPV6)) {
++ mptcp_debug("SYN+MP_JOIN with IPV6 address on pure IPV4 meta\n");
++ sock_put(meta_sk); /* Taken by mptcp_hash_find */
++ return -1;
++ }
++ } else if (skb->protocol == htons(ETH_P_IP) &&
++ inet6_sk(meta_sk)->ipv6only) {
++ mptcp_debug("SYN+MP_JOIN with IPV4 address on IPV6_V6ONLY meta\n");
++ sock_put(meta_sk); /* Taken by mptcp_hash_find */
++ return -1;
++ }
++
++ TCP_SKB_CB(skb)->mptcp_flags |= MPTCPHDR_JOIN;
++
++ /* OK, this is a new syn/join, let's create a new open request and
++ * send syn+ack
++ */
++ bh_lock_sock(meta_sk);
++
++ /* This check is also done in mptcp_vX_do_rcv. But, there we cannot
++ * call tcp_vX_send_reset, because we already hold two socket-locks.
++ * (the listener and the meta from above)
++ *
++ * And the send-reset will try to take yet another one (ip_send_reply).
++ * Thus, we propagate the reset up to tcp_rcv_state_process.
++ */
++ if (tcp_sk(meta_sk)->mpcb->infinite_mapping_rcv ||
++ tcp_sk(meta_sk)->mpcb->send_infinite_mapping ||
++ meta_sk->sk_state == TCP_CLOSE || !tcp_sk(meta_sk)->inside_tk_table) {
++ bh_unlock_sock(meta_sk);
++ sock_put(meta_sk); /* Taken by mptcp_hash_find */
++ return -1;
++ }
++
++ if (sock_owned_by_user(meta_sk)) {
++ skb->sk = meta_sk;
++ if (unlikely(sk_add_backlog(meta_sk, skb,
++ meta_sk->sk_rcvbuf + meta_sk->sk_sndbuf)))
++ NET_INC_STATS_BH(net, LINUX_MIB_TCPBACKLOGDROP);
++ else
++ /* Must make sure that upper layers won't free the
++ * skb if it is added to the backlog-queue.
++ */
++ skb_get(skb);
++ } else {
++ /* mptcp_v4_do_rcv tries to free the skb - we prevent this, as
++ * the skb will finally be freed by tcp_v4_do_rcv (where we are
++ * coming from)
++ */
++ skb_get(skb);
++ if (skb->protocol == htons(ETH_P_IP)) {
++ tcp_v4_do_rcv(meta_sk, skb);
++#if IS_ENABLED(CONFIG_IPV6)
++ } else { /* IPv6 */
++ tcp_v6_do_rcv(meta_sk, skb);
++#endif /* CONFIG_IPV6 */
++ }
++ }
++
++ bh_unlock_sock(meta_sk);
++ sock_put(meta_sk); /* Taken by mptcp_hash_find */
++ return 0;
++}
++
++/**
++ * Equivalent of tcp_fin() for MPTCP
++ * Can be called only when the FIN is validly part
++ * of the data seqnum space - not before, while we still have holes.
++ */
++void mptcp_fin(struct sock *meta_sk)
++{
++ struct sock *sk = NULL, *sk_it;
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct mptcp_cb *mpcb = meta_tp->mpcb;
++
++ mptcp_for_each_sk(mpcb, sk_it) {
++ if (tcp_sk(sk_it)->mptcp->path_index == mpcb->dfin_path_index) {
++ sk = sk_it;
++ break;
++ }
++ }
++
++ if (!sk || sk->sk_state == TCP_CLOSE)
++ sk = mptcp_select_ack_sock(meta_sk);
++
++ inet_csk_schedule_ack(sk);
++
++ meta_sk->sk_shutdown |= RCV_SHUTDOWN;
++ sock_set_flag(meta_sk, SOCK_DONE);
++
++ switch (meta_sk->sk_state) {
++ case TCP_SYN_RECV:
++ case TCP_ESTABLISHED:
++ /* Move to CLOSE_WAIT */
++ tcp_set_state(meta_sk, TCP_CLOSE_WAIT);
++ inet_csk(sk)->icsk_ack.pingpong = 1;
++ break;
++
++ case TCP_CLOSE_WAIT:
++ case TCP_CLOSING:
++ /* Received a retransmission of the FIN, do
++ * nothing.
++ */
++ break;
++ case TCP_LAST_ACK:
++ /* RFC793: Remain in the LAST-ACK state. */
++ break;
++
++ case TCP_FIN_WAIT1:
++ /* This case occurs when a simultaneous close
++ * happens, we must ack the received FIN and
++ * enter the CLOSING state.
++ */
++ tcp_send_ack(sk);
++ tcp_set_state(meta_sk, TCP_CLOSING);
++ break;
++ case TCP_FIN_WAIT2:
++ /* Received a FIN -- send ACK and enter TIME_WAIT. */
++ tcp_send_ack(sk);
++ meta_tp->ops->time_wait(meta_sk, TCP_TIME_WAIT, 0);
++ break;
++ default:
++ /* Only TCP_LISTEN and TCP_CLOSE are left, in these
++ * cases we should never reach this piece of code.
++ */
++ pr_err("%s: Impossible, meta_sk->sk_state=%d\n", __func__,
++ meta_sk->sk_state);
++ break;
++ }
++
++ /* It _is_ possible that we have something out-of-order _after_ the FIN.
++ * Probably we should reset in this case. For now, drop them.
++ */
++ mptcp_purge_ofo_queue(meta_tp);
++ sk_mem_reclaim(meta_sk);
++
++ if (!sock_flag(meta_sk, SOCK_DEAD)) {
++ meta_sk->sk_state_change(meta_sk);
++
++ /* Do not send POLL_HUP for half duplex close. */
++ if (meta_sk->sk_shutdown == SHUTDOWN_MASK ||
++ meta_sk->sk_state == TCP_CLOSE)
++ sk_wake_async(meta_sk, SOCK_WAKE_WAITD, POLL_HUP);
++ else
++ sk_wake_async(meta_sk, SOCK_WAKE_WAITD, POLL_IN);
++ }
++
++ return;
++}
++
++static void mptcp_xmit_retransmit_queue(struct sock *meta_sk)
++{
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct sk_buff *skb;
++
++ if (!meta_tp->packets_out)
++ return;
++
++ tcp_for_write_queue(skb, meta_sk) {
++ if (skb == tcp_send_head(meta_sk))
++ break;
++
++ if (mptcp_retransmit_skb(meta_sk, skb))
++ return;
++
++ if (skb == tcp_write_queue_head(meta_sk))
++ inet_csk_reset_xmit_timer(meta_sk, ICSK_TIME_RETRANS,
++ inet_csk(meta_sk)->icsk_rto,
++ TCP_RTO_MAX);
++ }
++}
++
++/* Handle the DATA_ACK */
++static void mptcp_data_ack(struct sock *sk, const struct sk_buff *skb)
++{
++ struct sock *meta_sk = mptcp_meta_sk(sk);
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk), *tp = tcp_sk(sk);
++ struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
++ u32 prior_snd_una = meta_tp->snd_una;
++ int prior_packets;
++ u32 nwin, data_ack, data_seq;
++ u16 data_len = 0;
++
++ /* A valid packet came in - subflow is operational again */
++ tp->pf = 0;
++
++ /* Even if there is no data-ack, we stop retransmitting.
++ * Except if this is a SYN/ACK. Then it is just a retransmission
++ */
++ if (tp->mptcp->pre_established && !tcp_hdr(skb)->syn) {
++ tp->mptcp->pre_established = 0;
++ sk_stop_timer(sk, &tp->mptcp->mptcp_ack_timer);
++ }
++
++ /* If we are in infinite mapping mode, rx_opt.data_ack has been
++ * set by mptcp_clean_rtx_infinite.
++ */
++ if (!(tcb->mptcp_flags & MPTCPHDR_ACK) && !tp->mpcb->infinite_mapping_snd)
++ goto exit;
++
++ data_ack = tp->mptcp->rx_opt.data_ack;
++
++ if (unlikely(!tp->mptcp->fully_established) &&
++ tp->mptcp->snt_isn + 1 != TCP_SKB_CB(skb)->ack_seq)
++ /* As soon as a subflow-data-ack (not acking syn, thus snt_isn + 1)
++ * includes a data-ack, we are fully established
++ */
++ mptcp_become_fully_estab(sk);
++
++ /* Get the data_seq */
++ if (mptcp_is_data_seq(skb)) {
++ data_seq = tp->mptcp->rx_opt.data_seq;
++ data_len = tp->mptcp->rx_opt.data_len;
++ } else {
++ data_seq = meta_tp->snd_wl1;
++ }
++
++ /* If the ack is older than previous acks
++ * then we can probably ignore it.
++ */
++ if (before(data_ack, prior_snd_una))
++ goto exit;
++
++ /* If the ack includes data we haven't sent yet, discard
++ * this segment (RFC793 Section 3.9).
++ */
++ if (after(data_ack, meta_tp->snd_nxt))
++ goto exit;
++
++ /*** Now, update the window - inspired by tcp_ack_update_window ***/
++ nwin = ntohs(tcp_hdr(skb)->window);
++
++ if (likely(!tcp_hdr(skb)->syn))
++ nwin <<= tp->rx_opt.snd_wscale;
++
++ if (tcp_may_update_window(meta_tp, data_ack, data_seq, nwin)) {
++ tcp_update_wl(meta_tp, data_seq);
++
++ /* Draft v09, Section 3.3.5:
++ * [...] It should only update its local receive window values
++ * when the largest sequence number allowed (i.e. DATA_ACK +
++ * receive window) increases. [...]
++ */
++ if (meta_tp->snd_wnd != nwin &&
++ !before(data_ack + nwin, tcp_wnd_end(meta_tp))) {
++ meta_tp->snd_wnd = nwin;
++
++ if (nwin > meta_tp->max_window)
++ meta_tp->max_window = nwin;
++ }
++ }
++ /*** Done, update the window ***/
++
++ /* We passed data and got it acked, remove any soft error
++ * log. Something worked...
++ */
++ sk->sk_err_soft = 0;
++ inet_csk(meta_sk)->icsk_probes_out = 0;
++ meta_tp->rcv_tstamp = tcp_time_stamp;
++ prior_packets = meta_tp->packets_out;
++ if (!prior_packets)
++ goto no_queue;
++
++ meta_tp->snd_una = data_ack;
++
++ mptcp_clean_rtx_queue(meta_sk, prior_snd_una);
++
++ /* We are in loss-state, and something got acked, retransmit the whole
++ * queue now!
++ */
++ if (inet_csk(meta_sk)->icsk_ca_state == TCP_CA_Loss &&
++ after(data_ack, prior_snd_una)) {
++ mptcp_xmit_retransmit_queue(meta_sk);
++ inet_csk(meta_sk)->icsk_ca_state = TCP_CA_Open;
++ }
++
++ /* Simplified version of tcp_new_space, because the snd-buffer
++ * is handled by all the subflows.
++ */
++ if (sock_flag(meta_sk, SOCK_QUEUE_SHRUNK)) {
++ sock_reset_flag(meta_sk, SOCK_QUEUE_SHRUNK);
++ if (meta_sk->sk_socket &&
++ test_bit(SOCK_NOSPACE, &meta_sk->sk_socket->flags))
++ meta_sk->sk_write_space(meta_sk);
++ }
++
++ if (meta_sk->sk_state != TCP_ESTABLISHED &&
++ mptcp_rcv_state_process(meta_sk, sk, skb, data_seq, data_len))
++ return;
++
++exit:
++ mptcp_push_pending_frames(meta_sk);
++
++ return;
++
++no_queue:
++ if (tcp_send_head(meta_sk))
++ tcp_ack_probe(meta_sk);
++
++ mptcp_push_pending_frames(meta_sk);
++
++ return;
++}
++
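The window update in `mptcp_data_ack()` reuses `tcp_may_update_window()` from net/ipv4/tcp_input.c. A sketch of that acceptance test with the wrap-around-safe comparisons written out (a standalone re-statement, not the kernel code):

```c
#include <stdint.h>

/* Accept the advertised window if the ACK advances snd_una, or carries a
 * newer sequence than snd_wl1, or re-announces a larger window for the
 * same data_seq. */
static int may_update_window(uint32_t snd_una, uint32_t snd_wl1,
			     uint32_t snd_wnd, uint32_t ack,
			     uint32_t ack_seq, uint32_t nwin)
{
	return (int32_t)(ack - snd_una) > 0 ||
	       (int32_t)(ack_seq - snd_wl1) > 0 ||
	       (ack_seq == snd_wl1 && nwin > snd_wnd);
}
```

Note that `nwin` is the already-scaled window: as in the hunk above, the raw 16-bit header field is shifted by `snd_wscale` for everything except SYN segments.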
++void mptcp_clean_rtx_infinite(const struct sk_buff *skb, struct sock *sk)
++{
++ struct tcp_sock *tp = tcp_sk(sk), *meta_tp = tcp_sk(mptcp_meta_sk(sk));
++
++ if (!tp->mpcb->infinite_mapping_snd)
++ return;
++
++ /* The difference between both write_seq's represents the offset between
++ * data-sequence and subflow-sequence. As we are infinite, this must
++ * match.
++ *
++ * Thus, from this difference we can infer the meta snd_una.
++ */
++ tp->mptcp->rx_opt.data_ack = meta_tp->snd_nxt - tp->snd_nxt +
++ tp->snd_una;
++
++ mptcp_data_ack(sk, skb);
++}
++
++/**** static functions used by mptcp_parse_options */
++
++static void mptcp_send_reset_rem_id(const struct mptcp_cb *mpcb, u8 rem_id)
++{
++ struct sock *sk_it, *tmpsk;
++
++ mptcp_for_each_sk_safe(mpcb, sk_it, tmpsk) {
++ if (tcp_sk(sk_it)->mptcp->rem_id == rem_id) {
++ mptcp_reinject_data(sk_it, 0);
++ sk_it->sk_err = ECONNRESET;
++ if (tcp_need_reset(sk_it->sk_state))
++ tcp_sk(sk_it)->ops->send_active_reset(sk_it,
++ GFP_ATOMIC);
++ mptcp_sub_force_close(sk_it);
++ }
++ }
++}
++
++void mptcp_parse_options(const uint8_t *ptr, int opsize,
++ struct mptcp_options_received *mopt,
++ const struct sk_buff *skb)
++{
++ const struct mptcp_option *mp_opt = (struct mptcp_option *)ptr;
++
++ /* If the socket is mp-capable we would have a mopt. */
++ if (!mopt)
++ return;
++
++ switch (mp_opt->sub) {
++ case MPTCP_SUB_CAPABLE:
++ {
++ const struct mp_capable *mpcapable = (struct mp_capable *)ptr;
++
++ if (opsize != MPTCP_SUB_LEN_CAPABLE_SYN &&
++ opsize != MPTCP_SUB_LEN_CAPABLE_ACK) {
++ mptcp_debug("%s: mp_capable: bad option size %d\n",
++ __func__, opsize);
++ break;
++ }
++
++ if (!sysctl_mptcp_enabled)
++ break;
++
++ /* We only support MPTCP version 0 */
++ if (mpcapable->ver != 0)
++ break;
++
++ /* MPTCP-RFC 6824:
++ * "If receiving a message with the 'B' flag set to 1, and this
++ * is not understood, then this SYN MUST be silently ignored;
++ */
++ if (mpcapable->b) {
++ mopt->drop_me = 1;
++ break;
++ }
++
++ /* MPTCP-RFC 6824:
++ * "An implementation that only supports this method MUST set
++ * bit "H" to 1, and bits "C" through "G" to 0."
++ */
++ if (!mpcapable->h)
++ break;
++
++ mopt->saw_mpc = 1;
++ mopt->dss_csum = sysctl_mptcp_checksum || mpcapable->a;
++
++ if (opsize >= MPTCP_SUB_LEN_CAPABLE_SYN)
++ mopt->mptcp_key = mpcapable->sender_key;
++
++ break;
++ }
++ case MPTCP_SUB_JOIN:
++ {
++ const struct mp_join *mpjoin = (struct mp_join *)ptr;
++
++ if (opsize != MPTCP_SUB_LEN_JOIN_SYN &&
++ opsize != MPTCP_SUB_LEN_JOIN_SYNACK &&
++ opsize != MPTCP_SUB_LEN_JOIN_ACK) {
++ mptcp_debug("%s: mp_join: bad option size %d\n",
++ __func__, opsize);
++ break;
++ }
++
++ /* saw_mpc must be set, because in tcp_check_req we assume that
++ * it is set to support falling back to reg. TCP if a rexmitted
++ * SYN has no MP_CAPABLE or MP_JOIN
++ */
++ switch (opsize) {
++ case MPTCP_SUB_LEN_JOIN_SYN:
++ mopt->is_mp_join = 1;
++ mopt->saw_mpc = 1;
++ mopt->low_prio = mpjoin->b;
++ mopt->rem_id = mpjoin->addr_id;
++ mopt->mptcp_rem_token = mpjoin->u.syn.token;
++ mopt->mptcp_recv_nonce = mpjoin->u.syn.nonce;
++ break;
++ case MPTCP_SUB_LEN_JOIN_SYNACK:
++ mopt->saw_mpc = 1;
++ mopt->low_prio = mpjoin->b;
++ mopt->rem_id = mpjoin->addr_id;
++ mopt->mptcp_recv_tmac = mpjoin->u.synack.mac;
++ mopt->mptcp_recv_nonce = mpjoin->u.synack.nonce;
++ break;
++ case MPTCP_SUB_LEN_JOIN_ACK:
++ mopt->saw_mpc = 1;
++ mopt->join_ack = 1;
++ memcpy(mopt->mptcp_recv_mac, mpjoin->u.ack.mac, 20);
++ break;
++ }
++ break;
++ }
++ case MPTCP_SUB_DSS:
++ {
++ const struct mp_dss *mdss = (struct mp_dss *)ptr;
++ struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
++
++ /* We check opsize for the csum and non-csum case. We do this,
++ * because the draft says that the csum SHOULD be ignored if
++ * it has not been negotiated in the MP_CAPABLE but still is
++ * present in the data.
++ *
++ * It will get ignored later in mptcp_queue_skb.
++ */
++ if (opsize != mptcp_sub_len_dss(mdss, 0) &&
++ opsize != mptcp_sub_len_dss(mdss, 1)) {
++ mptcp_debug("%s: mp_dss: bad option size %d\n",
++ __func__, opsize);
++ break;
++ }
++
++ ptr += 4;
++
++ if (mdss->A) {
++ tcb->mptcp_flags |= MPTCPHDR_ACK;
++
++ if (mdss->a) {
++ mopt->data_ack = (u32) get_unaligned_be64(ptr);
++ ptr += MPTCP_SUB_LEN_ACK_64;
++ } else {
++ mopt->data_ack = get_unaligned_be32(ptr);
++ ptr += MPTCP_SUB_LEN_ACK;
++ }
++ }
++
++ tcb->dss_off = (ptr - skb_transport_header(skb));
++
++ if (mdss->M) {
++ if (mdss->m) {
++ u64 data_seq64 = get_unaligned_be64(ptr);
++
++ tcb->mptcp_flags |= MPTCPHDR_SEQ64_SET;
++ mopt->data_seq = (u32) data_seq64;
++
++ ptr += 12; /* 64-bit dseq + subseq */
++ } else {
++ mopt->data_seq = get_unaligned_be32(ptr);
++ ptr += 8; /* 32-bit dseq + subseq */
++ }
++ mopt->data_len = get_unaligned_be16(ptr);
++
++ tcb->mptcp_flags |= MPTCPHDR_SEQ;
++
++ /* Is a checksum present? */
++ if (opsize == mptcp_sub_len_dss(mdss, 1))
++ tcb->mptcp_flags |= MPTCPHDR_DSS_CSUM;
++
++ /* DATA_FIN only possible with DSS-mapping */
++ if (mdss->F)
++ tcb->mptcp_flags |= MPTCPHDR_FIN;
++ }
++
++ break;
++ }
++ case MPTCP_SUB_ADD_ADDR:
++ {
++#if IS_ENABLED(CONFIG_IPV6)
++ const struct mp_add_addr *mpadd = (struct mp_add_addr *)ptr;
++
++ if ((mpadd->ipver == 4 && opsize != MPTCP_SUB_LEN_ADD_ADDR4 &&
++ opsize != MPTCP_SUB_LEN_ADD_ADDR4 + 2) ||
++ (mpadd->ipver == 6 && opsize != MPTCP_SUB_LEN_ADD_ADDR6 &&
++ opsize != MPTCP_SUB_LEN_ADD_ADDR6 + 2)) {
++#else
++ if (opsize != MPTCP_SUB_LEN_ADD_ADDR4 &&
++ opsize != MPTCP_SUB_LEN_ADD_ADDR4 + 2) {
++#endif /* CONFIG_IPV6 */
++ mptcp_debug("%s: mp_add_addr: bad option size %d\n",
++ __func__, opsize);
++ break;
++ }
++
++ /* We have to manually parse the options if we got two of them. */
++ if (mopt->saw_add_addr) {
++ mopt->more_add_addr = 1;
++ break;
++ }
++ mopt->saw_add_addr = 1;
++ mopt->add_addr_ptr = ptr;
++ break;
++ }
++ case MPTCP_SUB_REMOVE_ADDR:
++ if ((opsize - MPTCP_SUB_LEN_REMOVE_ADDR) < 0) {
++ mptcp_debug("%s: mp_remove_addr: bad option size %d\n",
++ __func__, opsize);
++ break;
++ }
++
++ if (mopt->saw_rem_addr) {
++ mopt->more_rem_addr = 1;
++ break;
++ }
++ mopt->saw_rem_addr = 1;
++ mopt->rem_addr_ptr = ptr;
++ break;
++ case MPTCP_SUB_PRIO:
++ {
++ const struct mp_prio *mpprio = (struct mp_prio *)ptr;
++
++ if (opsize != MPTCP_SUB_LEN_PRIO &&
++ opsize != MPTCP_SUB_LEN_PRIO_ADDR) {
++ mptcp_debug("%s: mp_prio: bad option size %d\n",
++ __func__, opsize);
++ break;
++ }
++
++ mopt->saw_low_prio = 1;
++ mopt->low_prio = mpprio->b;
++
++ if (opsize == MPTCP_SUB_LEN_PRIO_ADDR) {
++ mopt->saw_low_prio = 2;
++ mopt->prio_addr_id = mpprio->addr_id;
++ }
++ break;
++ }
++ case MPTCP_SUB_FAIL:
++ if (opsize != MPTCP_SUB_LEN_FAIL) {
++ mptcp_debug("%s: mp_fail: bad option size %d\n",
++ __func__, opsize);
++ break;
++ }
++ mopt->mp_fail = 1;
++ break;
++ case MPTCP_SUB_FCLOSE:
++ if (opsize != MPTCP_SUB_LEN_FCLOSE) {
++ mptcp_debug("%s: mp_fclose: bad option size %d\n",
++ __func__, opsize);
++ break;
++ }
++
++ mopt->mp_fclose = 1;
++ mopt->mptcp_key = ((struct mp_fclose *)ptr)->key;
++
++ break;
++ default:
++ mptcp_debug("%s: Received unknown subtype: %d\n",
++ __func__, mp_opt->sub);
++ break;
++ }
++}
++
++/** Parse only MPTCP options */
++void tcp_parse_mptcp_options(const struct sk_buff *skb,
++ struct mptcp_options_received *mopt)
++{
++ const struct tcphdr *th = tcp_hdr(skb);
++ int length = (th->doff * 4) - sizeof(struct tcphdr);
++ const unsigned char *ptr = (const unsigned char *)(th + 1);
++
++ while (length > 0) {
++ int opcode = *ptr++;
++ int opsize;
++
++ switch (opcode) {
++ case TCPOPT_EOL:
++ return;
++ case TCPOPT_NOP: /* Ref: RFC 793 section 3.1 */
++ length--;
++ continue;
++ default:
++ opsize = *ptr++;
++ if (opsize < 2) /* "silly options" */
++ return;
++ if (opsize > length)
++ return; /* don't parse partial options */
++ if (opcode == TCPOPT_MPTCP)
++ mptcp_parse_options(ptr - 2, opsize, mopt, skb);
++ }
++ ptr += opsize - 2;
++ length -= opsize;
++ }
++}
++
++int mptcp_check_rtt(const struct tcp_sock *tp, int time)
++{
++ struct mptcp_cb *mpcb = tp->mpcb;
++ struct sock *sk;
++ u32 rtt_max = 0;
++
++ /* In MPTCP, we take the max delay across all flows,
++ * in order to take into account meta-reordering buffers.
++ */
++ mptcp_for_each_sk(mpcb, sk) {
++ if (!mptcp_sk_can_recv(sk))
++ continue;
++
++ if (rtt_max < tcp_sk(sk)->rcv_rtt_est.rtt)
++ rtt_max = tcp_sk(sk)->rcv_rtt_est.rtt;
++ }
++ if (time < (rtt_max >> 3) || !rtt_max)
++ return 1;
++
++ return 0;
++}
++
++static void mptcp_handle_add_addr(const unsigned char *ptr, struct sock *sk)
++{
++ struct mp_add_addr *mpadd = (struct mp_add_addr *)ptr;
++ struct mptcp_cb *mpcb = tcp_sk(sk)->mpcb;
++ __be16 port = 0;
++ union inet_addr addr;
++ sa_family_t family;
++
++ if (mpadd->ipver == 4) {
++ if (mpadd->len == MPTCP_SUB_LEN_ADD_ADDR4 + 2)
++ port = mpadd->u.v4.port;
++ family = AF_INET;
++ addr.in = mpadd->u.v4.addr;
++#if IS_ENABLED(CONFIG_IPV6)
++ } else if (mpadd->ipver == 6) {
++ if (mpadd->len == MPTCP_SUB_LEN_ADD_ADDR6 + 2)
++ port = mpadd->u.v6.port;
++ family = AF_INET6;
++ addr.in6 = mpadd->u.v6.addr;
++#endif /* CONFIG_IPV6 */
++ } else {
++ return;
++ }
++
++ if (mpcb->pm_ops->add_raddr)
++ mpcb->pm_ops->add_raddr(mpcb, &addr, family, port, mpadd->addr_id);
++}
++
++static void mptcp_handle_rem_addr(const unsigned char *ptr, struct sock *sk)
++{
++ struct mp_remove_addr *mprem = (struct mp_remove_addr *)ptr;
++ int i;
++ u8 rem_id;
++ struct mptcp_cb *mpcb = tcp_sk(sk)->mpcb;
++
++ for (i = 0; i <= mprem->len - MPTCP_SUB_LEN_REMOVE_ADDR; i++) {
++ rem_id = (&mprem->addrs_id)[i];
++
++ if (mpcb->pm_ops->rem_raddr)
++ mpcb->pm_ops->rem_raddr(mpcb, rem_id);
++ mptcp_send_reset_rem_id(mpcb, rem_id);
++ }
++}
++
++static void mptcp_parse_addropt(const struct sk_buff *skb, struct sock *sk)
++{
++ struct tcphdr *th = tcp_hdr(skb);
++ unsigned char *ptr;
++ int length = (th->doff * 4) - sizeof(struct tcphdr);
++
++ /* Jump through the options to check whether ADD_ADDR is there */
++ ptr = (unsigned char *)(th + 1);
++ while (length > 0) {
++ int opcode = *ptr++;
++ int opsize;
++
++ switch (opcode) {
++ case TCPOPT_EOL:
++ return;
++ case TCPOPT_NOP:
++ length--;
++ continue;
++ default:
++ opsize = *ptr++;
++ if (opsize < 2)
++ return;
++ if (opsize > length)
++ return; /* don't parse partial options */
++ if (opcode == TCPOPT_MPTCP &&
++ ((struct mptcp_option *)ptr)->sub == MPTCP_SUB_ADD_ADDR) {
++#if IS_ENABLED(CONFIG_IPV6)
++ struct mp_add_addr *mpadd = (struct mp_add_addr *)ptr;
++ if ((mpadd->ipver == 4 && opsize != MPTCP_SUB_LEN_ADD_ADDR4 &&
++ opsize != MPTCP_SUB_LEN_ADD_ADDR4 + 2) ||
++ (mpadd->ipver == 6 && opsize != MPTCP_SUB_LEN_ADD_ADDR6 &&
++ opsize != MPTCP_SUB_LEN_ADD_ADDR6 + 2))
++#else
++ if (opsize != MPTCP_SUB_LEN_ADD_ADDR4 &&
++ opsize != MPTCP_SUB_LEN_ADD_ADDR4 + 2)
++#endif /* CONFIG_IPV6 */
++ goto cont;
++
++ mptcp_handle_add_addr(ptr, sk);
++ }
++ if (opcode == TCPOPT_MPTCP &&
++ ((struct mptcp_option *)ptr)->sub == MPTCP_SUB_REMOVE_ADDR) {
++ if ((opsize - MPTCP_SUB_LEN_REMOVE_ADDR) < 0)
++ goto cont;
++
++ mptcp_handle_rem_addr(ptr, sk);
++ }
++cont:
++ ptr += opsize - 2;
++ length -= opsize;
++ }
++ }
++ return;
++}
++
++static inline int mptcp_mp_fail_rcvd(struct sock *sk, const struct tcphdr *th)
++{
++ struct mptcp_tcp_sock *mptcp = tcp_sk(sk)->mptcp;
++ struct sock *meta_sk = mptcp_meta_sk(sk);
++ struct mptcp_cb *mpcb = tcp_sk(sk)->mpcb;
++
++ if (unlikely(mptcp->rx_opt.mp_fail)) {
++ mptcp->rx_opt.mp_fail = 0;
++
++ if (!th->rst && !mpcb->infinite_mapping_snd) {
++ struct sock *sk_it;
++
++ mpcb->send_infinite_mapping = 1;
++ /* We resend everything that has not been acknowledged */
++ meta_sk->sk_send_head = tcp_write_queue_head(meta_sk);
++
++ /* We artificially restart the whole send-queue. Thus,
++ * it is as if no packets are in flight
++ */
++ tcp_sk(meta_sk)->packets_out = 0;
++
++ /* If the snd_nxt already wrapped around, we have to
++ * undo the wrapping, as we are restarting from snd_una
++ * on.
++ */
++ if (tcp_sk(meta_sk)->snd_nxt < tcp_sk(meta_sk)->snd_una) {
++ mpcb->snd_high_order[mpcb->snd_hiseq_index] -= 2;
++ mpcb->snd_hiseq_index = mpcb->snd_hiseq_index ? 0 : 1;
++ }
++ tcp_sk(meta_sk)->snd_nxt = tcp_sk(meta_sk)->snd_una;
++
++ /* Trigger a sending on the meta. */
++ mptcp_push_pending_frames(meta_sk);
++
++ mptcp_for_each_sk(mpcb, sk_it) {
++ if (sk != sk_it)
++ mptcp_sub_force_close(sk_it);
++ }
++ }
++
++ return 0;
++ }
++
++ if (unlikely(mptcp->rx_opt.mp_fclose)) {
++ struct sock *sk_it, *tmpsk;
++
++ mptcp->rx_opt.mp_fclose = 0;
++ if (mptcp->rx_opt.mptcp_key != mpcb->mptcp_loc_key)
++ return 0;
++
++ if (tcp_need_reset(sk->sk_state))
++ tcp_sk(sk)->ops->send_active_reset(sk, GFP_ATOMIC);
++
++ mptcp_for_each_sk_safe(mpcb, sk_it, tmpsk)
++ mptcp_sub_force_close(sk_it);
++
++ tcp_reset(meta_sk);
++
++ return 1;
++ }
++
++ return 0;
++}
++
++static inline void mptcp_path_array_check(struct sock *meta_sk)
++{
++ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++
++ if (unlikely(mpcb->list_rcvd)) {
++ mpcb->list_rcvd = 0;
++ if (mpcb->pm_ops->new_remote_address)
++ mpcb->pm_ops->new_remote_address(meta_sk);
++ }
++}
++
++int mptcp_handle_options(struct sock *sk, const struct tcphdr *th,
++ const struct sk_buff *skb)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct mptcp_options_received *mopt = &tp->mptcp->rx_opt;
++
++ if (tp->mpcb->infinite_mapping_rcv || tp->mpcb->infinite_mapping_snd)
++ return 0;
++
++ if (mptcp_mp_fail_rcvd(sk, th))
++ return 1;
++
++ /* RFC 6824, Section 3.3:
++ * If a checksum is not present when its use has been negotiated, the
++ * receiver MUST close the subflow with a RST as it is considered broken.
++ */
++ if (mptcp_is_data_seq(skb) && tp->mpcb->dss_csum &&
++ !(TCP_SKB_CB(skb)->mptcp_flags & MPTCPHDR_DSS_CSUM)) {
++ if (tcp_need_reset(sk->sk_state))
++ tp->ops->send_active_reset(sk, GFP_ATOMIC);
++
++ mptcp_sub_force_close(sk);
++ return 1;
++ }
++
++ /* We have to acknowledge retransmissions of the third
++ * ack.
++ */
++ if (mopt->join_ack) {
++ tcp_send_delayed_ack(sk);
++ mopt->join_ack = 0;
++ }
++
++ if (mopt->saw_add_addr || mopt->saw_rem_addr) {
++ if (mopt->more_add_addr || mopt->more_rem_addr) {
++ mptcp_parse_addropt(skb, sk);
++ } else {
++ if (mopt->saw_add_addr)
++ mptcp_handle_add_addr(mopt->add_addr_ptr, sk);
++ if (mopt->saw_rem_addr)
++ mptcp_handle_rem_addr(mopt->rem_addr_ptr, sk);
++ }
++
++ mopt->more_add_addr = 0;
++ mopt->saw_add_addr = 0;
++ mopt->more_rem_addr = 0;
++ mopt->saw_rem_addr = 0;
++ }
++ if (mopt->saw_low_prio) {
++ if (mopt->saw_low_prio == 1) {
++ tp->mptcp->rcv_low_prio = mopt->low_prio;
++ } else {
++ struct sock *sk_it;
++ mptcp_for_each_sk(tp->mpcb, sk_it) {
++ struct mptcp_tcp_sock *mptcp = tcp_sk(sk_it)->mptcp;
++ if (mptcp->rem_id == mopt->prio_addr_id)
++ mptcp->rcv_low_prio = mopt->low_prio;
++ }
++ }
++ mopt->saw_low_prio = 0;
++ }
++
++ mptcp_data_ack(sk, skb);
++
++ mptcp_path_array_check(mptcp_meta_sk(sk));
++ /* Socket may have been mp_killed by a REMOVE_ADDR */
++ if (tp->mp_killed)
++ return 1;
++
++ return 0;
++}
++
++/* In case of fastopen, some data can already be in the write queue.
++ * We need to update the sequence number of the segments as they
++ * were initially TCP sequence numbers.
++ */
++static void mptcp_rcv_synsent_fastopen(struct sock *meta_sk)
++{
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct tcp_sock *master_tp = tcp_sk(meta_tp->mpcb->master_sk);
++ struct sk_buff *skb;
++ u32 new_mapping = meta_tp->write_seq - master_tp->snd_una;
++
++ /* There should only be one skb in write queue: the data not
++ * acknowledged in the SYN+ACK. In this case, we need to map
++ * this data to data sequence numbers.
++ */
++ skb_queue_walk(&meta_sk->sk_write_queue, skb) {
++ /* If the server only partially acknowledged the data sent in
++ * the SYN, we need to trim the acknowledged part, because
++ * we don't want to retransmit already-received data.
++ * When we reach this point, tcp_ack() has already cleaned up
++ * fully acked segments. However, tcp trims partially acked
++ * segments only when retransmitting. Since MPTCP comes into
++ * play only now, we will fake an initial transmit, and
++ * retransmit_skb() will not be called. The following fragment
++ * comes from __tcp_retransmit_skb().
++ */
++ if (before(TCP_SKB_CB(skb)->seq, master_tp->snd_una)) {
++ BUG_ON(before(TCP_SKB_CB(skb)->end_seq,
++ master_tp->snd_una));
++ /* tcp_trim_head can only return ENOMEM if skb is
++ * cloned. It is not the case here (see
++ * tcp_send_syn_data).
++ */
++ BUG_ON(tcp_trim_head(meta_sk, skb, master_tp->snd_una -
++ TCP_SKB_CB(skb)->seq));
++ }
++
++ TCP_SKB_CB(skb)->seq += new_mapping;
++ TCP_SKB_CB(skb)->end_seq += new_mapping;
++ }
++
++ /* We can advance write_seq by the number of bytes unacknowledged
++ * and that were mapped in the previous loop.
++ */
++ meta_tp->write_seq += master_tp->write_seq - master_tp->snd_una;
++
++ /* The packets from the master_sk will be transferred to it later.
++ * Until that time, its write queue is empty, and
++ * write_seq must align with snd_una.
++ */
++ master_tp->snd_nxt = master_tp->write_seq = master_tp->snd_una;
++ master_tp->packets_out = 0;
++
++ /* Although this data has already been sent over the subsk,
++ * it has never been sent over the meta_sk, so we rewind
++ * the send_head so that tcp considers it an initial send
++ * (instead of a retransmit).
++ */
++ meta_sk->sk_send_head = tcp_write_queue_head(meta_sk);
++}
++
++/* The skptr is needed, because if we become MPTCP-capable, we have to switch
++ * from meta-socket to master-socket.
++ *
++ * @return: 1 - we want to reset this connection
++ * 2 - we want to discard the received syn/ack
++ * 0 - everything is fine - continue
++ */
++int mptcp_rcv_synsent_state_process(struct sock *sk, struct sock **skptr,
++ const struct sk_buff *skb,
++ const struct mptcp_options_received *mopt)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++
++ if (mptcp(tp)) {
++ u8 hash_mac_check[20];
++ struct mptcp_cb *mpcb = tp->mpcb;
++
++ mptcp_hmac_sha1((u8 *)&mpcb->mptcp_rem_key,
++ (u8 *)&mpcb->mptcp_loc_key,
++ (u8 *)&tp->mptcp->rx_opt.mptcp_recv_nonce,
++ (u8 *)&tp->mptcp->mptcp_loc_nonce,
++ (u32 *)hash_mac_check);
++ if (memcmp(hash_mac_check,
++ (char *)&tp->mptcp->rx_opt.mptcp_recv_tmac, 8)) {
++ mptcp_sub_force_close(sk);
++ return 1;
++ }
++
++ /* Set this flag in order to postpone data sending
++ * until the 4th ack arrives.
++ */
++ tp->mptcp->pre_established = 1;
++ tp->mptcp->rcv_low_prio = tp->mptcp->rx_opt.low_prio;
++
++ mptcp_hmac_sha1((u8 *)&mpcb->mptcp_loc_key,
++ (u8 *)&mpcb->mptcp_rem_key,
++ (u8 *)&tp->mptcp->mptcp_loc_nonce,
++ (u8 *)&tp->mptcp->rx_opt.mptcp_recv_nonce,
++ (u32 *)&tp->mptcp->sender_mac[0]);
++
++ } else if (mopt->saw_mpc) {
++ struct sock *meta_sk = sk;
++
++ if (mptcp_create_master_sk(sk, mopt->mptcp_key,
++ ntohs(tcp_hdr(skb)->window)))
++ return 2;
++
++ sk = tcp_sk(sk)->mpcb->master_sk;
++ *skptr = sk;
++ tp = tcp_sk(sk);
++
++ /* If fastopen was used data might be in the send queue. We
++ * need to update their sequence number to MPTCP-level seqno.
++ * Note that it can happen in rare cases that fastopen_req is
++ * NULL and syn_data is 0 but fastopen indeed occurred and
++ * data has been queued in the write queue (but not sent).
++ * Example of such rare cases: connect is non-blocking and
++ * TFO is configured to work without cookies.
++ */
++ if (!skb_queue_empty(&meta_sk->sk_write_queue))
++ mptcp_rcv_synsent_fastopen(meta_sk);
++
++ /* -1, because the SYN consumed 1 byte. In case of TFO, we
++ * start the subflow-sequence number as if the data of the SYN
++ * is not part of any mapping.
++ */
++ tp->mptcp->snt_isn = tp->snd_una - 1;
++ tp->mpcb->dss_csum = mopt->dss_csum;
++ tp->mptcp->include_mpc = 1;
++
++ /* Ensure that fastopen is handled at the meta-level. */
++ tp->fastopen_req = NULL;
++
++ sk_set_socket(sk, mptcp_meta_sk(sk)->sk_socket);
++ sk->sk_wq = mptcp_meta_sk(sk)->sk_wq;
++
++ /* hold in sk_clone_lock due to initialization to 2 */
++ sock_put(sk);
++ } else {
++ tp->request_mptcp = 0;
++
++ if (tp->inside_tk_table)
++ mptcp_hash_remove(tp);
++ }
++
++ if (mptcp(tp))
++ tp->mptcp->rcv_isn = TCP_SKB_CB(skb)->seq;
++
++ return 0;
++}
++
++bool mptcp_should_expand_sndbuf(const struct sock *sk)
++{
++ const struct sock *sk_it;
++ const struct sock *meta_sk = mptcp_meta_sk(sk);
++ const struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ int cnt_backups = 0;
++ int backup_available = 0;
++
++ /* We circumvent this check in tcp_check_space, because we want to
++ * always call sk_write_space. So, we reproduce the check here.
++ */
++ if (!meta_sk->sk_socket ||
++ !test_bit(SOCK_NOSPACE, &meta_sk->sk_socket->flags))
++ return false;
++
++ /* If the user specified a specific send buffer setting, do
++ * not modify it.
++ */
++ if (meta_sk->sk_userlocks & SOCK_SNDBUF_LOCK)
++ return false;
++
++ /* If we are under global TCP memory pressure, do not expand. */
++ if (sk_under_memory_pressure(meta_sk))
++ return false;
++
++ /* If we are under soft global TCP memory pressure, do not expand. */
++ if (sk_memory_allocated(meta_sk) >= sk_prot_mem_limits(meta_sk, 0))
++ return false;
++
++
++ /* For MPTCP we look for a subsocket that could send data.
++ * If we found one, then we update the send-buffer.
++ */
++ mptcp_for_each_sk(meta_tp->mpcb, sk_it) {
++ struct tcp_sock *tp_it = tcp_sk(sk_it);
++
++ if (!mptcp_sk_can_send(sk_it))
++ continue;
++
++ /* Backup-flows have to be counted - if there is no other
++ * subflow we take the backup-flow into account.
++ */
++ if (tp_it->mptcp->rcv_low_prio || tp_it->mptcp->low_prio)
++ cnt_backups++;
++
++ if (tp_it->packets_out < tp_it->snd_cwnd) {
++ if (tp_it->mptcp->rcv_low_prio || tp_it->mptcp->low_prio) {
++ backup_available = 1;
++ continue;
++ }
++ return true;
++ }
++ }
++
++ /* Backup-flow is available for sending - update send-buffer */
++ if (meta_tp->mpcb->cnt_established == cnt_backups && backup_available)
++ return true;
++ return false;
++}
++
++void mptcp_init_buffer_space(struct sock *sk)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct sock *meta_sk = mptcp_meta_sk(sk);
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ int space;
++
++ tcp_init_buffer_space(sk);
++
++ if (is_master_tp(tp)) {
++ meta_tp->rcvq_space.space = meta_tp->rcv_wnd;
++ meta_tp->rcvq_space.time = tcp_time_stamp;
++ meta_tp->rcvq_space.seq = meta_tp->copied_seq;
++
++ /* If there is only one subflow, we just use regular TCP
++ * autotuning. User-locks are handled already by
++ * tcp_init_buffer_space
++ */
++ meta_tp->window_clamp = tp->window_clamp;
++ meta_tp->rcv_ssthresh = tp->rcv_ssthresh;
++ meta_sk->sk_rcvbuf = sk->sk_rcvbuf;
++ meta_sk->sk_sndbuf = sk->sk_sndbuf;
++
++ return;
++ }
++
++ if (meta_sk->sk_userlocks & SOCK_RCVBUF_LOCK)
++ goto snd_buf;
++
++ /* Adding a new subflow to the rcv-buffer space. We make a simple
++ * addition, to give some space to allow traffic on the new subflow.
++ * Autotuning will increase it further later on.
++ */
++ space = min(meta_sk->sk_rcvbuf + sk->sk_rcvbuf, sysctl_tcp_rmem[2]);
++ if (space > meta_sk->sk_rcvbuf) {
++ meta_tp->window_clamp += tp->window_clamp;
++ meta_tp->rcv_ssthresh += tp->rcv_ssthresh;
++ meta_sk->sk_rcvbuf = space;
++ }
++
++snd_buf:
++ if (meta_sk->sk_userlocks & SOCK_SNDBUF_LOCK)
++ return;
++
++ /* Adding a new subflow to the send-buffer space. We make a simple
++ * addition, to give some space to allow traffic on the new subflow.
++ * Autotuning will increase it further later on.
++ */
++ space = min(meta_sk->sk_sndbuf + sk->sk_sndbuf, sysctl_tcp_wmem[2]);
++ if (space > meta_sk->sk_sndbuf) {
++ meta_sk->sk_sndbuf = space;
++ meta_sk->sk_write_space(meta_sk);
++ }
++}
++
++void mptcp_tcp_set_rto(struct sock *sk)
++{
++ tcp_set_rto(sk);
++ mptcp_set_rto(sk);
++}
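The option-parsing loops above (tcp_parse_mptcp_options() and mptcp_parse_addropt()) both follow the standard TCP option TLV walk: EOL ends parsing, NOP occupies a single byte, and every other option carries a kind byte followed by a length byte. A rough userspace sketch of that walk, outside the kernel (the function name count_mptcp_options is hypothetical; only the EOL/NOP/length handling mirrors the patch):

```c
/* Sketch of the TCP option walk used by tcp_parse_mptcp_options().
 * This is a userspace illustration, not kernel code. */
#define TCPOPT_EOL   0
#define TCPOPT_NOP   1
#define TCPOPT_MPTCP 30

/* Count well-formed MPTCP options in an option block of `length` bytes. */
int count_mptcp_options(const unsigned char *ptr, int length)
{
	int found = 0;

	while (length > 0) {
		int opcode = *ptr++;
		int opsize;

		switch (opcode) {
		case TCPOPT_EOL:		/* end of option list */
			return found;
		case TCPOPT_NOP:		/* single-byte padding */
			length--;
			continue;
		default:
			opsize = *ptr++;
			if (opsize < 2)		/* "silly options" */
				return found;
			if (opsize > length)	/* don't parse partial options */
				return found;
			if (opcode == TCPOPT_MPTCP)
				found++;
		}
		ptr += opsize - 2;		/* skip the option payload */
		length -= opsize;
	}
	return found;
}
```

The two guards (opsize < 2 and opsize > length) are what keep a malformed or truncated option from running the parser off the end of the header, which is why both kernel loops bail out rather than skipping such an option.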
+diff --git a/net/mptcp/mptcp_ipv4.c b/net/mptcp/mptcp_ipv4.c
+new file mode 100644
+index 000000000000..1183d1305d35
+--- /dev/null
++++ b/net/mptcp/mptcp_ipv4.c
+@@ -0,0 +1,483 @@
++/*
++ * MPTCP implementation - IPv4-specific functions
++ *
++ * Initial Design & Implementation:
++ * Sébastien Barré <sebastien.barre@uclouvain.be>
++ *
++ * Current Maintainer:
++ * Christoph Paasch <christoph.paasch@uclouvain.be>
++ *
++ * Additional authors:
++ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
++ * Gregory Detal <gregory.detal@uclouvain.be>
++ * Fabien Duchêne <fabien.duchene@uclouvain.be>
++ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
++ * Lavkesh Lahngir <lavkesh51@gmail.com>
++ * Andreas Ripke <ripke@neclab.eu>
++ * Vlad Dogaru <vlad.dogaru@intel.com>
++ * Octavian Purdila <octavian.purdila@intel.com>
++ * John Ronan <jronan@tssg.org>
++ * Catalin Nicutar <catalin.nicutar@gmail.com>
++ * Brandon Heller <brandonh@stanford.edu>
++ *
++ *
++ * This program is free software; you can redistribute it and/or
++ * modify it under the terms of the GNU General Public License
++ * as published by the Free Software Foundation; either version
++ * 2 of the License, or (at your option) any later version.
++ */
++
++#include <linux/export.h>
++#include <linux/ip.h>
++#include <linux/list.h>
++#include <linux/skbuff.h>
++#include <linux/spinlock.h>
++#include <linux/tcp.h>
++
++#include <net/inet_common.h>
++#include <net/inet_connection_sock.h>
++#include <net/mptcp.h>
++#include <net/mptcp_v4.h>
++#include <net/request_sock.h>
++#include <net/tcp.h>
++
++u32 mptcp_v4_get_nonce(__be32 saddr, __be32 daddr, __be16 sport, __be16 dport)
++{
++ u32 hash[MD5_DIGEST_WORDS];
++
++ hash[0] = (__force u32)saddr;
++ hash[1] = (__force u32)daddr;
++ hash[2] = ((__force u16)sport << 16) + (__force u16)dport;
++ hash[3] = mptcp_seed++;
++
++ md5_transform(hash, mptcp_secret);
++
++ return hash[0];
++}
++
++u64 mptcp_v4_get_key(__be32 saddr, __be32 daddr, __be16 sport, __be16 dport)
++{
++ u32 hash[MD5_DIGEST_WORDS];
++
++ hash[0] = (__force u32)saddr;
++ hash[1] = (__force u32)daddr;
++ hash[2] = ((__force u16)sport << 16) + (__force u16)dport;
++ hash[3] = mptcp_seed++;
++
++ md5_transform(hash, mptcp_secret);
++
++ return *((u64 *)hash);
++}
++
++
++static void mptcp_v4_reqsk_destructor(struct request_sock *req)
++{
++ mptcp_reqsk_destructor(req);
++
++ tcp_v4_reqsk_destructor(req);
++}
++
++static int mptcp_v4_init_req(struct request_sock *req, struct sock *sk,
++ struct sk_buff *skb)
++{
++ tcp_request_sock_ipv4_ops.init_req(req, sk, skb);
++ mptcp_reqsk_init(req, skb);
++
++ return 0;
++}
++
++static int mptcp_v4_join_init_req(struct request_sock *req, struct sock *sk,
++ struct sk_buff *skb)
++{
++ struct mptcp_request_sock *mtreq = mptcp_rsk(req);
++ struct mptcp_cb *mpcb = tcp_sk(sk)->mpcb;
++ union inet_addr addr;
++ int loc_id;
++ bool low_prio = false;
++
++ /* We need to do this as early as possible. Because, if we fail later
++ * (e.g., get_local_id), then reqsk_free tries to remove the
++ * request-socket from the htb in mptcp_hash_request_remove as pprev
++ * may be different from NULL.
++ */
++ mtreq->hash_entry.pprev = NULL;
++
++ tcp_request_sock_ipv4_ops.init_req(req, sk, skb);
++
++ mtreq->mptcp_loc_nonce = mptcp_v4_get_nonce(ip_hdr(skb)->saddr,
++ ip_hdr(skb)->daddr,
++ tcp_hdr(skb)->source,
++ tcp_hdr(skb)->dest);
++ addr.ip = inet_rsk(req)->ir_loc_addr;
++ loc_id = mpcb->pm_ops->get_local_id(AF_INET, &addr, sock_net(sk), &low_prio);
++ if (loc_id == -1)
++ return -1;
++ mtreq->loc_id = loc_id;
++ mtreq->low_prio = low_prio;
++
++ mptcp_join_reqsk_init(mpcb, req, skb);
++
++ return 0;
++}
++
++/* Similar to tcp_request_sock_ops */
++struct request_sock_ops mptcp_request_sock_ops __read_mostly = {
++ .family = PF_INET,
++ .obj_size = sizeof(struct mptcp_request_sock),
++ .rtx_syn_ack = tcp_rtx_synack,
++ .send_ack = tcp_v4_reqsk_send_ack,
++ .destructor = mptcp_v4_reqsk_destructor,
++ .send_reset = tcp_v4_send_reset,
++ .syn_ack_timeout = tcp_syn_ack_timeout,
++};
++
++static void mptcp_v4_reqsk_queue_hash_add(struct sock *meta_sk,
++ struct request_sock *req,
++ const unsigned long timeout)
++{
++ const u32 h1 = inet_synq_hash(inet_rsk(req)->ir_rmt_addr,
++ inet_rsk(req)->ir_rmt_port,
++ 0, MPTCP_HASH_SIZE);
++ /* We cannot call inet_csk_reqsk_queue_hash_add(), because we do not
++ * want to reset the keepalive-timer (responsible for retransmitting
++ * SYN/ACKs). We do not retransmit SYN/ACKs+MP_JOINs, because we cannot
++ * overload the keepalive timer. Also, it's not a big deal, because the
++ * third ACK of the MP_JOIN-handshake is sent in a reliable manner. So,
++ * if the third ACK gets lost, the client will handle the retransmission
++ * anyways. If our SYN/ACK gets lost, the client will retransmit the
++ * SYN.
++ */
++ struct inet_connection_sock *meta_icsk = inet_csk(meta_sk);
++ struct listen_sock *lopt = meta_icsk->icsk_accept_queue.listen_opt;
++ const u32 h2 = inet_synq_hash(inet_rsk(req)->ir_rmt_addr,
++ inet_rsk(req)->ir_rmt_port,
++ lopt->hash_rnd, lopt->nr_table_entries);
++
++ reqsk_queue_hash_req(&meta_icsk->icsk_accept_queue, h2, req, timeout);
++ if (reqsk_queue_added(&meta_icsk->icsk_accept_queue) == 0)
++ mptcp_reset_synack_timer(meta_sk, timeout);
++
++ rcu_read_lock();
++ spin_lock(&mptcp_reqsk_hlock);
++ hlist_nulls_add_head_rcu(&mptcp_rsk(req)->hash_entry, &mptcp_reqsk_htb[h1]);
++ spin_unlock(&mptcp_reqsk_hlock);
++ rcu_read_unlock();
++}
++
++/* Similar to tcp_v4_conn_request */
++static int mptcp_v4_join_request(struct sock *meta_sk, struct sk_buff *skb)
++{
++ return tcp_conn_request(&mptcp_request_sock_ops,
++ &mptcp_join_request_sock_ipv4_ops,
++ meta_sk, skb);
++}
++
++/* We only process join requests here. (either the SYN or the final ACK) */
++int mptcp_v4_do_rcv(struct sock *meta_sk, struct sk_buff *skb)
++{
++ const struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct sock *child, *rsk = NULL;
++ int ret;
++
++ if (!(TCP_SKB_CB(skb)->mptcp_flags & MPTCPHDR_JOIN)) {
++ struct tcphdr *th = tcp_hdr(skb);
++ const struct iphdr *iph = ip_hdr(skb);
++ struct sock *sk;
++
++ sk = inet_lookup_established(sock_net(meta_sk), &tcp_hashinfo,
++ iph->saddr, th->source, iph->daddr,
++ th->dest, inet_iif(skb));
++
++ if (!sk) {
++ kfree_skb(skb);
++ return 0;
++ }
++ if (is_meta_sk(sk)) {
++ WARN("%s Did not find a sub-sk - found the meta instead!\n", __func__);
++ kfree_skb(skb);
++ sock_put(sk);
++ return 0;
++ }
++
++ if (sk->sk_state == TCP_TIME_WAIT) {
++ inet_twsk_put(inet_twsk(sk));
++ kfree_skb(skb);
++ return 0;
++ }
++
++ ret = tcp_v4_do_rcv(sk, skb);
++ sock_put(sk);
++
++ return ret;
++ }
++ TCP_SKB_CB(skb)->mptcp_flags = 0;
++
++ /* Has been removed from the tk-table. Thus, no new subflows.
++ *
++ * Check for close-state is necessary, because we may have been closed
++ * without passing by mptcp_close().
++ *
++ * When falling back, no new subflows are allowed either.
++ */
++ if (meta_sk->sk_state == TCP_CLOSE || !tcp_sk(meta_sk)->inside_tk_table ||
++ mpcb->infinite_mapping_rcv || mpcb->send_infinite_mapping)
++ goto reset_and_discard;
++
++ child = tcp_v4_hnd_req(meta_sk, skb);
++
++ if (!child)
++ goto discard;
++
++ if (child != meta_sk) {
++ sock_rps_save_rxhash(child, skb);
++ /* We don't call tcp_child_process here, because we hold
++ * already the meta-sk-lock and are sure that it is not owned
++ * by the user.
++ */
++ ret = tcp_rcv_state_process(child, skb, tcp_hdr(skb), skb->len);
++ bh_unlock_sock(child);
++ sock_put(child);
++ if (ret) {
++ rsk = child;
++ goto reset_and_discard;
++ }
++ } else {
++ if (tcp_hdr(skb)->syn) {
++ mptcp_v4_join_request(meta_sk, skb);
++ goto discard;
++ }
++ goto reset_and_discard;
++ }
++ return 0;
++
++reset_and_discard:
++ if (reqsk_queue_len(&inet_csk(meta_sk)->icsk_accept_queue)) {
++ const struct tcphdr *th = tcp_hdr(skb);
++ const struct iphdr *iph = ip_hdr(skb);
++ struct request_sock **prev, *req;
++ /* If we end up here, it means we should not have matched on the
++ * request-socket. But, because the request-sock queue is only
++ * destroyed in mptcp_close, the socket may actually already be
++ * in close-state (e.g., through shutdown()) while still having
++ * pending request sockets.
++ */
++ req = inet_csk_search_req(meta_sk, &prev, th->source,
++ iph->saddr, iph->daddr);
++ if (req) {
++ inet_csk_reqsk_queue_unlink(meta_sk, req, prev);
++ reqsk_queue_removed(&inet_csk(meta_sk)->icsk_accept_queue,
++ req);
++ reqsk_free(req);
++ }
++ }
++
++ tcp_v4_send_reset(rsk, skb);
++discard:
++ kfree_skb(skb);
++ return 0;
++}
++
++/* After this, the ref count of the meta_sk associated with the request_sock
++ * is incremented. Thus it is the responsibility of the caller
++ * to call sock_put() when the reference is not needed anymore.
++ */
++struct sock *mptcp_v4_search_req(const __be16 rport, const __be32 raddr,
++ const __be32 laddr, const struct net *net)
++{
++ const struct mptcp_request_sock *mtreq;
++ struct sock *meta_sk = NULL;
++ const struct hlist_nulls_node *node;
++ const u32 hash = inet_synq_hash(raddr, rport, 0, MPTCP_HASH_SIZE);
++
++ rcu_read_lock();
++begin:
++ hlist_nulls_for_each_entry_rcu(mtreq, node, &mptcp_reqsk_htb[hash],
++ hash_entry) {
++ struct inet_request_sock *ireq = inet_rsk(rev_mptcp_rsk(mtreq));
++ meta_sk = mtreq->mptcp_mpcb->meta_sk;
++
++ if (ireq->ir_rmt_port == rport &&
++ ireq->ir_rmt_addr == raddr &&
++ ireq->ir_loc_addr == laddr &&
++ rev_mptcp_rsk(mtreq)->rsk_ops->family == AF_INET &&
++ net_eq(net, sock_net(meta_sk)))
++ goto found;
++ meta_sk = NULL;
++ }
++ /* A request-socket is destroyed by RCU. So, it might have been recycled
++ * and put into another hash-table list. So, after the lookup we may
++ * end up in a different list. So, we may need to restart.
++ *
++ * See also the comment in __inet_lookup_established.
++ */
++ if (get_nulls_value(node) != hash + MPTCP_REQSK_NULLS_BASE)
++ goto begin;
++
++found:
++ if (meta_sk && unlikely(!atomic_inc_not_zero(&meta_sk->sk_refcnt)))
++ meta_sk = NULL;
++ rcu_read_unlock();
++
++ return meta_sk;
++}
++
++/* Create a new IPv4 subflow.
++ *
++ * We are in user-context and the meta-sock-lock is held.
++ */
++int mptcp_init4_subsockets(struct sock *meta_sk, const struct mptcp_loc4 *loc,
++ struct mptcp_rem4 *rem)
++{
++ struct tcp_sock *tp;
++ struct sock *sk;
++ struct sockaddr_in loc_in, rem_in;
++ struct socket sock;
++ int ret;
++
++ /** First, create and prepare the new socket */
++
++ sock.type = meta_sk->sk_socket->type;
++ sock.state = SS_UNCONNECTED;
++ sock.wq = meta_sk->sk_socket->wq;
++ sock.file = meta_sk->sk_socket->file;
++ sock.ops = NULL;
++
++ ret = inet_create(sock_net(meta_sk), &sock, IPPROTO_TCP, 1);
++ if (unlikely(ret < 0)) {
++ mptcp_debug("%s inet_create failed ret: %d\n", __func__, ret);
++ return ret;
++ }
++
++ sk = sock.sk;
++ tp = tcp_sk(sk);
++
++ /* All subsockets need the MPTCP-lock-class */
++ lockdep_set_class_and_name(&(sk)->sk_lock.slock, &meta_slock_key, "slock-AF_INET-MPTCP");
++ lockdep_init_map(&(sk)->sk_lock.dep_map, "sk_lock-AF_INET-MPTCP", &meta_key, 0);
++
++ if (mptcp_add_sock(meta_sk, sk, loc->loc4_id, rem->rem4_id, GFP_KERNEL))
++ goto error;
++
++ tp->mptcp->slave_sk = 1;
++ tp->mptcp->low_prio = loc->low_prio;
++
++ /* Initializing the timer for an MPTCP subflow */
++ setup_timer(&tp->mptcp->mptcp_ack_timer, mptcp_ack_handler, (unsigned long)sk);
++
++ /** Then, connect the socket to the peer */
++ loc_in.sin_family = AF_INET;
++ rem_in.sin_family = AF_INET;
++ loc_in.sin_port = 0;
++ if (rem->port)
++ rem_in.sin_port = rem->port;
++ else
++ rem_in.sin_port = inet_sk(meta_sk)->inet_dport;
++ loc_in.sin_addr = loc->addr;
++ rem_in.sin_addr = rem->addr;
++
++ ret = sock.ops->bind(&sock, (struct sockaddr *)&loc_in, sizeof(struct sockaddr_in));
++ if (ret < 0) {
++ mptcp_debug("%s: MPTCP subsocket bind() failed, error %d\n",
++ __func__, ret);
++ goto error;
++ }
++
++ mptcp_debug("%s: token %#x pi %d src_addr:%pI4:%d dst_addr:%pI4:%d\n",
++ __func__, tcp_sk(meta_sk)->mpcb->mptcp_loc_token,
++ tp->mptcp->path_index, &loc_in.sin_addr,
++ ntohs(loc_in.sin_port), &rem_in.sin_addr,
++ ntohs(rem_in.sin_port));
++
++ if (tcp_sk(meta_sk)->mpcb->pm_ops->init_subsocket_v4)
++ tcp_sk(meta_sk)->mpcb->pm_ops->init_subsocket_v4(sk, rem->addr);
++
++ ret = sock.ops->connect(&sock, (struct sockaddr *)&rem_in,
++ sizeof(struct sockaddr_in), O_NONBLOCK);
++ if (ret < 0 && ret != -EINPROGRESS) {
++ mptcp_debug("%s: MPTCP subsocket connect() failed, error %d\n",
++ __func__, ret);
++ goto error;
++ }
++
++ sk_set_socket(sk, meta_sk->sk_socket);
++ sk->sk_wq = meta_sk->sk_wq;
++
++ return 0;
++
++error:
++ /* May happen if mptcp_add_sock fails first */
++ if (!mptcp(tp)) {
++ tcp_close(sk, 0);
++ } else {
++ local_bh_disable();
++ mptcp_sub_force_close(sk);
++ local_bh_enable();
++ }
++ return ret;
++}
++EXPORT_SYMBOL(mptcp_init4_subsockets);
++
++const struct inet_connection_sock_af_ops mptcp_v4_specific = {
++ .queue_xmit = ip_queue_xmit,
++ .send_check = tcp_v4_send_check,
++ .rebuild_header = inet_sk_rebuild_header,
++ .sk_rx_dst_set = inet_sk_rx_dst_set,
++ .conn_request = mptcp_conn_request,
++ .syn_recv_sock = tcp_v4_syn_recv_sock,
++ .net_header_len = sizeof(struct iphdr),
++ .setsockopt = ip_setsockopt,
++ .getsockopt = ip_getsockopt,
++ .addr2sockaddr = inet_csk_addr2sockaddr,
++ .sockaddr_len = sizeof(struct sockaddr_in),
++ .bind_conflict = inet_csk_bind_conflict,
++#ifdef CONFIG_COMPAT
++ .compat_setsockopt = compat_ip_setsockopt,
++ .compat_getsockopt = compat_ip_getsockopt,
++#endif
++};
++
++struct tcp_request_sock_ops mptcp_request_sock_ipv4_ops;
++struct tcp_request_sock_ops mptcp_join_request_sock_ipv4_ops;
++
++/* General initialization of IPv4 for MPTCP */
++int mptcp_pm_v4_init(void)
++{
++ int ret = 0;
++ struct request_sock_ops *ops = &mptcp_request_sock_ops;
++
++ mptcp_request_sock_ipv4_ops = tcp_request_sock_ipv4_ops;
++ mptcp_request_sock_ipv4_ops.init_req = mptcp_v4_init_req;
++
++ mptcp_join_request_sock_ipv4_ops = tcp_request_sock_ipv4_ops;
++ mptcp_join_request_sock_ipv4_ops.init_req = mptcp_v4_join_init_req;
++ mptcp_join_request_sock_ipv4_ops.queue_hash_add = mptcp_v4_reqsk_queue_hash_add;
++
++ ops->slab_name = kasprintf(GFP_KERNEL, "request_sock_%s", "MPTCP");
++ if (ops->slab_name == NULL) {
++ ret = -ENOMEM;
++ goto out;
++ }
++
++ ops->slab = kmem_cache_create(ops->slab_name, ops->obj_size, 0,
++ SLAB_DESTROY_BY_RCU|SLAB_HWCACHE_ALIGN,
++ NULL);
++
++ if (ops->slab == NULL) {
++ ret = -ENOMEM;
++ goto err_reqsk_create;
++ }
++
++out:
++ return ret;
++
++err_reqsk_create:
++ kfree(ops->slab_name);
++ ops->slab_name = NULL;
++ goto out;
++}
++
++void mptcp_pm_v4_undo(void)
++{
++ kmem_cache_destroy(mptcp_request_sock_ops.slab);
++ kfree(mptcp_request_sock_ops.slab_name);
++}
+diff --git a/net/mptcp/mptcp_ipv6.c b/net/mptcp/mptcp_ipv6.c
+new file mode 100644
+index 000000000000..1036973aa855
+--- /dev/null
++++ b/net/mptcp/mptcp_ipv6.c
+@@ -0,0 +1,518 @@
++/*
++ * MPTCP implementation - IPv6-specific functions
++ *
++ * Initial Design & Implementation:
++ * Sébastien Barré <sebastien.barre@uclouvain.be>
++ *
++ * Current Maintainer:
++ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
++ *
++ * Additional authors:
++ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
++ * Gregory Detal <gregory.detal@uclouvain.be>
++ * Fabien Duchêne <fabien.duchene@uclouvain.be>
++ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
++ * Lavkesh Lahngir <lavkesh51@gmail.com>
++ * Andreas Ripke <ripke@neclab.eu>
++ * Vlad Dogaru <vlad.dogaru@intel.com>
++ * Octavian Purdila <octavian.purdila@intel.com>
++ * John Ronan <jronan@tssg.org>
++ * Catalin Nicutar <catalin.nicutar@gmail.com>
++ * Brandon Heller <brandonh@stanford.edu>
++ *
++ *
++ * This program is free software; you can redistribute it and/or
++ * modify it under the terms of the GNU General Public License
++ * as published by the Free Software Foundation; either version
++ * 2 of the License, or (at your option) any later version.
++ */
++
++#include <linux/export.h>
++#include <linux/in6.h>
++#include <linux/kernel.h>
++
++#include <net/addrconf.h>
++#include <net/flow.h>
++#include <net/inet6_connection_sock.h>
++#include <net/inet6_hashtables.h>
++#include <net/inet_common.h>
++#include <net/ipv6.h>
++#include <net/ip6_checksum.h>
++#include <net/ip6_route.h>
++#include <net/mptcp.h>
++#include <net/mptcp_v6.h>
++#include <net/tcp.h>
++#include <net/transp_v6.h>
++
++__u32 mptcp_v6_get_nonce(const __be32 *saddr, const __be32 *daddr,
++ __be16 sport, __be16 dport)
++{
++ u32 secret[MD5_MESSAGE_BYTES / 4];
++ u32 hash[MD5_DIGEST_WORDS];
++ u32 i;
++
++ memcpy(hash, saddr, 16);
++ for (i = 0; i < 4; i++)
++ secret[i] = mptcp_secret[i] + (__force u32)daddr[i];
++ secret[4] = mptcp_secret[4] +
++ (((__force u16)sport << 16) + (__force u16)dport);
++ secret[5] = mptcp_seed++;
++ for (i = 6; i < MD5_MESSAGE_BYTES / 4; i++)
++ secret[i] = mptcp_secret[i];
++
++ md5_transform(hash, secret);
++
++ return hash[0];
++}
++
++u64 mptcp_v6_get_key(const __be32 *saddr, const __be32 *daddr,
++ __be16 sport, __be16 dport)
++{
++ u32 secret[MD5_MESSAGE_BYTES / 4];
++ u32 hash[MD5_DIGEST_WORDS];
++ u32 i;
++
++ memcpy(hash, saddr, 16);
++ for (i = 0; i < 4; i++)
++ secret[i] = mptcp_secret[i] + (__force u32)daddr[i];
++ secret[4] = mptcp_secret[4] +
++ (((__force u16)sport << 16) + (__force u16)dport);
++ secret[5] = mptcp_seed++;
++ for (i = 6; i < MD5_MESSAGE_BYTES / 4; i++)
++ secret[i] = mptcp_secret[i];
++
++ md5_transform(hash, secret);
++
++ return *((u64 *)hash);
++}
++
++static void mptcp_v6_reqsk_destructor(struct request_sock *req)
++{
++ mptcp_reqsk_destructor(req);
++
++ tcp_v6_reqsk_destructor(req);
++}
++
++static int mptcp_v6_init_req(struct request_sock *req, struct sock *sk,
++ struct sk_buff *skb)
++{
++ tcp_request_sock_ipv6_ops.init_req(req, sk, skb);
++ mptcp_reqsk_init(req, skb);
++
++ return 0;
++}
++
++static int mptcp_v6_join_init_req(struct request_sock *req, struct sock *sk,
++ struct sk_buff *skb)
++{
++ struct mptcp_request_sock *mtreq = mptcp_rsk(req);
++ struct mptcp_cb *mpcb = tcp_sk(sk)->mpcb;
++ union inet_addr addr;
++ int loc_id;
++ bool low_prio = false;
++
++ /* We need to do this as early as possible: if we fail later
++ * (e.g., in get_local_id), reqsk_free tries to remove the
++ * request-socket from the htb in mptcp_hash_request_remove,
++ * since pprev may be different from NULL.
++ */
++ mtreq->hash_entry.pprev = NULL;
++
++ tcp_request_sock_ipv6_ops.init_req(req, sk, skb);
++
++ mtreq->mptcp_loc_nonce = mptcp_v6_get_nonce(ipv6_hdr(skb)->saddr.s6_addr32,
++ ipv6_hdr(skb)->daddr.s6_addr32,
++ tcp_hdr(skb)->source,
++ tcp_hdr(skb)->dest);
++ addr.in6 = inet_rsk(req)->ir_v6_loc_addr;
++ loc_id = mpcb->pm_ops->get_local_id(AF_INET6, &addr, sock_net(sk), &low_prio);
++ if (loc_id == -1)
++ return -1;
++ mtreq->loc_id = loc_id;
++ mtreq->low_prio = low_prio;
++
++ mptcp_join_reqsk_init(mpcb, req, skb);
++
++ return 0;
++}
++
++/* Similar to tcp6_request_sock_ops */
++struct request_sock_ops mptcp6_request_sock_ops __read_mostly = {
++ .family = AF_INET6,
++ .obj_size = sizeof(struct mptcp_request_sock),
++ .rtx_syn_ack = tcp_v6_rtx_synack,
++ .send_ack = tcp_v6_reqsk_send_ack,
++ .destructor = mptcp_v6_reqsk_destructor,
++ .send_reset = tcp_v6_send_reset,
++ .syn_ack_timeout = tcp_syn_ack_timeout,
++};
++
++static void mptcp_v6_reqsk_queue_hash_add(struct sock *meta_sk,
++ struct request_sock *req,
++ const unsigned long timeout)
++{
++ const u32 h1 = inet6_synq_hash(&inet_rsk(req)->ir_v6_rmt_addr,
++ inet_rsk(req)->ir_rmt_port,
++ 0, MPTCP_HASH_SIZE);
++ /* We cannot call inet6_csk_reqsk_queue_hash_add(), because we do not
++ * want to reset the keepalive-timer (responsible for retransmitting
++ * SYN/ACKs). We do not retransmit SYN/ACKs+MP_JOINs, because we cannot
++ * overload the keepalive timer. Also, it's not a big deal, because the
++ * third ACK of the MP_JOIN-handshake is sent in a reliable manner. So,
++ * if the third ACK gets lost, the client will handle the retransmission
++ * anyway. If our SYN/ACK gets lost, the client will retransmit the
++ * SYN.
++ */
++ struct inet_connection_sock *meta_icsk = inet_csk(meta_sk);
++ struct listen_sock *lopt = meta_icsk->icsk_accept_queue.listen_opt;
++ const u32 h2 = inet6_synq_hash(&inet_rsk(req)->ir_v6_rmt_addr,
++ inet_rsk(req)->ir_rmt_port,
++ lopt->hash_rnd, lopt->nr_table_entries);
++
++ reqsk_queue_hash_req(&meta_icsk->icsk_accept_queue, h2, req, timeout);
++ if (reqsk_queue_added(&meta_icsk->icsk_accept_queue) == 0)
++ mptcp_reset_synack_timer(meta_sk, timeout);
++
++ rcu_read_lock();
++ spin_lock(&mptcp_reqsk_hlock);
++ hlist_nulls_add_head_rcu(&mptcp_rsk(req)->hash_entry, &mptcp_reqsk_htb[h1]);
++ spin_unlock(&mptcp_reqsk_hlock);
++ rcu_read_unlock();
++}
++
++static int mptcp_v6_join_request(struct sock *meta_sk, struct sk_buff *skb)
++{
++ return tcp_conn_request(&mptcp6_request_sock_ops,
++ &mptcp_join_request_sock_ipv6_ops,
++ meta_sk, skb);
++}
++
++int mptcp_v6_do_rcv(struct sock *meta_sk, struct sk_buff *skb)
++{
++ const struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct sock *child, *rsk = NULL;
++ int ret;
++
++ if (!(TCP_SKB_CB(skb)->mptcp_flags & MPTCPHDR_JOIN)) {
++ struct tcphdr *th = tcp_hdr(skb);
++ const struct ipv6hdr *ip6h = ipv6_hdr(skb);
++ struct sock *sk;
++
++ sk = __inet6_lookup_established(sock_net(meta_sk),
++ &tcp_hashinfo,
++ &ip6h->saddr, th->source,
++ &ip6h->daddr, ntohs(th->dest),
++ inet6_iif(skb));
++
++ if (!sk) {
++ kfree_skb(skb);
++ return 0;
++ }
++ if (is_meta_sk(sk)) {
++ WARN("%s Did not find a sub-sk!\n", __func__);
++ kfree_skb(skb);
++ sock_put(sk);
++ return 0;
++ }
++
++ if (sk->sk_state == TCP_TIME_WAIT) {
++ inet_twsk_put(inet_twsk(sk));
++ kfree_skb(skb);
++ return 0;
++ }
++
++ ret = tcp_v6_do_rcv(sk, skb);
++ sock_put(sk);
++
++ return ret;
++ }
++ TCP_SKB_CB(skb)->mptcp_flags = 0;
++
++ /* Has been removed from the tk-table. Thus, no new subflows.
++ *
++ * Check for close-state is necessary, because we may have been closed
++ * without passing by mptcp_close().
++ *
++ * When falling back, no new subflows are allowed either.
++ */
++ if (meta_sk->sk_state == TCP_CLOSE || !tcp_sk(meta_sk)->inside_tk_table ||
++ mpcb->infinite_mapping_rcv || mpcb->send_infinite_mapping)
++ goto reset_and_discard;
++
++ child = tcp_v6_hnd_req(meta_sk, skb);
++
++ if (!child)
++ goto discard;
++
++ if (child != meta_sk) {
++ sock_rps_save_rxhash(child, skb);
++ /* We don't call tcp_child_process here, because we already
++ * hold the meta-sk-lock and are sure that it is not owned
++ * by the user.
++ */
++ ret = tcp_rcv_state_process(child, skb, tcp_hdr(skb), skb->len);
++ bh_unlock_sock(child);
++ sock_put(child);
++ if (ret) {
++ rsk = child;
++ goto reset_and_discard;
++ }
++ } else {
++ if (tcp_hdr(skb)->syn) {
++ mptcp_v6_join_request(meta_sk, skb);
++ goto discard;
++ }
++ goto reset_and_discard;
++ }
++ return 0;
++
++reset_and_discard:
++ if (reqsk_queue_len(&inet_csk(meta_sk)->icsk_accept_queue)) {
++ const struct tcphdr *th = tcp_hdr(skb);
++ struct request_sock **prev, *req;
++ /* If we end up here, it means we should not have matched on the
++ * request-socket. But, because the request-sock queue is only
++ * destroyed in mptcp_close, the socket may actually already be
++ * in close-state (e.g., through shutdown()) while still having
++ * pending request sockets.
++ */
++ req = inet6_csk_search_req(meta_sk, &prev, th->source,
++ &ipv6_hdr(skb)->saddr,
++ &ipv6_hdr(skb)->daddr, inet6_iif(skb));
++ if (req) {
++ inet_csk_reqsk_queue_unlink(meta_sk, req, prev);
++ reqsk_queue_removed(&inet_csk(meta_sk)->icsk_accept_queue,
++ req);
++ reqsk_free(req);
++ }
++ }
++
++ tcp_v6_send_reset(rsk, skb);
++discard:
++ kfree_skb(skb);
++ return 0;
++}
++
++/* After this, the ref count of the meta_sk associated with the request_sock
++ * is incremented. Thus it is the responsibility of the caller
++ * to call sock_put() when the reference is not needed anymore.
++ */
++struct sock *mptcp_v6_search_req(const __be16 rport, const struct in6_addr *raddr,
++ const struct in6_addr *laddr, const struct net *net)
++{
++ const struct mptcp_request_sock *mtreq;
++ struct sock *meta_sk = NULL;
++ const struct hlist_nulls_node *node;
++ const u32 hash = inet6_synq_hash(raddr, rport, 0, MPTCP_HASH_SIZE);
++
++ rcu_read_lock();
++begin:
++ hlist_nulls_for_each_entry_rcu(mtreq, node, &mptcp_reqsk_htb[hash],
++ hash_entry) {
++ struct inet_request_sock *treq = inet_rsk(rev_mptcp_rsk(mtreq));
++ meta_sk = mtreq->mptcp_mpcb->meta_sk;
++
++ if (inet_rsk(rev_mptcp_rsk(mtreq))->ir_rmt_port == rport &&
++ rev_mptcp_rsk(mtreq)->rsk_ops->family == AF_INET6 &&
++ ipv6_addr_equal(&treq->ir_v6_rmt_addr, raddr) &&
++ ipv6_addr_equal(&treq->ir_v6_loc_addr, laddr) &&
++ net_eq(net, sock_net(meta_sk)))
++ goto found;
++ meta_sk = NULL;
++ }
++ /* A request-socket is destroyed by RCU, so it might have been
++ * recycled and put into another hash-table list. In that case the
++ * lookup may end up in a different list and we need to restart.
++ *
++ * See also the comment in __inet_lookup_established.
++ */
++ if (get_nulls_value(node) != hash + MPTCP_REQSK_NULLS_BASE)
++ goto begin;
++
++found:
++ if (meta_sk && unlikely(!atomic_inc_not_zero(&meta_sk->sk_refcnt)))
++ meta_sk = NULL;
++ rcu_read_unlock();
++
++ return meta_sk;
++}
++
++/* Create a new IPv6 subflow.
++ *
++ * We are in user-context and the meta-sock-lock is held.
++ */
++int mptcp_init6_subsockets(struct sock *meta_sk, const struct mptcp_loc6 *loc,
++ struct mptcp_rem6 *rem)
++{
++ struct tcp_sock *tp;
++ struct sock *sk;
++ struct sockaddr_in6 loc_in, rem_in;
++ struct socket sock;
++ int ret;
++
++ /** First, create and prepare the new socket */
++
++ sock.type = meta_sk->sk_socket->type;
++ sock.state = SS_UNCONNECTED;
++ sock.wq = meta_sk->sk_socket->wq;
++ sock.file = meta_sk->sk_socket->file;
++ sock.ops = NULL;
++
++ ret = inet6_create(sock_net(meta_sk), &sock, IPPROTO_TCP, 1);
++ if (unlikely(ret < 0)) {
++ mptcp_debug("%s inet6_create failed ret: %d\n", __func__, ret);
++ return ret;
++ }
++
++ sk = sock.sk;
++ tp = tcp_sk(sk);
++
++ /* All subsockets need the MPTCP-lock-class */
++ lockdep_set_class_and_name(&(sk)->sk_lock.slock, &meta_slock_key, "slock-AF_INET-MPTCP");
++ lockdep_init_map(&(sk)->sk_lock.dep_map, "sk_lock-AF_INET-MPTCP", &meta_key, 0);
++
++ if (mptcp_add_sock(meta_sk, sk, loc->loc6_id, rem->rem6_id, GFP_KERNEL))
++ goto error;
++
++ tp->mptcp->slave_sk = 1;
++ tp->mptcp->low_prio = loc->low_prio;
++
++ /* Initializing the timer for an MPTCP subflow */
++ setup_timer(&tp->mptcp->mptcp_ack_timer, mptcp_ack_handler, (unsigned long)sk);
++
++ /** Then, connect the socket to the peer */
++ loc_in.sin6_family = AF_INET6;
++ rem_in.sin6_family = AF_INET6;
++ loc_in.sin6_port = 0;
++ if (rem->port)
++ rem_in.sin6_port = rem->port;
++ else
++ rem_in.sin6_port = inet_sk(meta_sk)->inet_dport;
++ loc_in.sin6_addr = loc->addr;
++ rem_in.sin6_addr = rem->addr;
++
++ ret = sock.ops->bind(&sock, (struct sockaddr *)&loc_in, sizeof(struct sockaddr_in6));
++ if (ret < 0) {
++ mptcp_debug("%s: MPTCP subsocket bind() failed, error %d\n",
++ __func__, ret);
++ goto error;
++ }
++
++ mptcp_debug("%s: token %#x pi %d src_addr:%pI6:%d dst_addr:%pI6:%d\n",
++ __func__, tcp_sk(meta_sk)->mpcb->mptcp_loc_token,
++ tp->mptcp->path_index, &loc_in.sin6_addr,
++ ntohs(loc_in.sin6_port), &rem_in.sin6_addr,
++ ntohs(rem_in.sin6_port));
++
++ if (tcp_sk(meta_sk)->mpcb->pm_ops->init_subsocket_v6)
++ tcp_sk(meta_sk)->mpcb->pm_ops->init_subsocket_v6(sk, rem->addr);
++
++ ret = sock.ops->connect(&sock, (struct sockaddr *)&rem_in,
++ sizeof(struct sockaddr_in6), O_NONBLOCK);
++ if (ret < 0 && ret != -EINPROGRESS) {
++ mptcp_debug("%s: MPTCP subsocket connect() failed, error %d\n",
++ __func__, ret);
++ goto error;
++ }
++
++ sk_set_socket(sk, meta_sk->sk_socket);
++ sk->sk_wq = meta_sk->sk_wq;
++
++ return 0;
++
++error:
++ /* May happen if mptcp_add_sock fails first */
++ if (!mptcp(tp)) {
++ tcp_close(sk, 0);
++ } else {
++ local_bh_disable();
++ mptcp_sub_force_close(sk);
++ local_bh_enable();
++ }
++ return ret;
++}
++EXPORT_SYMBOL(mptcp_init6_subsockets);
++
++const struct inet_connection_sock_af_ops mptcp_v6_specific = {
++ .queue_xmit = inet6_csk_xmit,
++ .send_check = tcp_v6_send_check,
++ .rebuild_header = inet6_sk_rebuild_header,
++ .sk_rx_dst_set = inet6_sk_rx_dst_set,
++ .conn_request = mptcp_conn_request,
++ .syn_recv_sock = tcp_v6_syn_recv_sock,
++ .net_header_len = sizeof(struct ipv6hdr),
++ .net_frag_header_len = sizeof(struct frag_hdr),
++ .setsockopt = ipv6_setsockopt,
++ .getsockopt = ipv6_getsockopt,
++ .addr2sockaddr = inet6_csk_addr2sockaddr,
++ .sockaddr_len = sizeof(struct sockaddr_in6),
++ .bind_conflict = inet6_csk_bind_conflict,
++#ifdef CONFIG_COMPAT
++ .compat_setsockopt = compat_ipv6_setsockopt,
++ .compat_getsockopt = compat_ipv6_getsockopt,
++#endif
++};
++
++const struct inet_connection_sock_af_ops mptcp_v6_mapped = {
++ .queue_xmit = ip_queue_xmit,
++ .send_check = tcp_v4_send_check,
++ .rebuild_header = inet_sk_rebuild_header,
++ .sk_rx_dst_set = inet_sk_rx_dst_set,
++ .conn_request = mptcp_conn_request,
++ .syn_recv_sock = tcp_v6_syn_recv_sock,
++ .net_header_len = sizeof(struct iphdr),
++ .setsockopt = ipv6_setsockopt,
++ .getsockopt = ipv6_getsockopt,
++ .addr2sockaddr = inet6_csk_addr2sockaddr,
++ .sockaddr_len = sizeof(struct sockaddr_in6),
++ .bind_conflict = inet6_csk_bind_conflict,
++#ifdef CONFIG_COMPAT
++ .compat_setsockopt = compat_ipv6_setsockopt,
++ .compat_getsockopt = compat_ipv6_getsockopt,
++#endif
++};
++
++struct tcp_request_sock_ops mptcp_request_sock_ipv6_ops;
++struct tcp_request_sock_ops mptcp_join_request_sock_ipv6_ops;
++
++int mptcp_pm_v6_init(void)
++{
++ int ret = 0;
++ struct request_sock_ops *ops = &mptcp6_request_sock_ops;
++
++ mptcp_request_sock_ipv6_ops = tcp_request_sock_ipv6_ops;
++ mptcp_request_sock_ipv6_ops.init_req = mptcp_v6_init_req;
++
++ mptcp_join_request_sock_ipv6_ops = tcp_request_sock_ipv6_ops;
++ mptcp_join_request_sock_ipv6_ops.init_req = mptcp_v6_join_init_req;
++ mptcp_join_request_sock_ipv6_ops.queue_hash_add = mptcp_v6_reqsk_queue_hash_add;
++
++ ops->slab_name = kasprintf(GFP_KERNEL, "request_sock_%s", "MPTCP6");
++ if (ops->slab_name == NULL) {
++ ret = -ENOMEM;
++ goto out;
++ }
++
++ ops->slab = kmem_cache_create(ops->slab_name, ops->obj_size, 0,
++ SLAB_DESTROY_BY_RCU|SLAB_HWCACHE_ALIGN,
++ NULL);
++
++ if (ops->slab == NULL) {
++ ret = -ENOMEM;
++ goto err_reqsk_create;
++ }
++
++out:
++ return ret;
++
++err_reqsk_create:
++ kfree(ops->slab_name);
++ ops->slab_name = NULL;
++ goto out;
++}
++
++void mptcp_pm_v6_undo(void)
++{
++ kmem_cache_destroy(mptcp6_request_sock_ops.slab);
++ kfree(mptcp6_request_sock_ops.slab_name);
++}
+diff --git a/net/mptcp/mptcp_ndiffports.c b/net/mptcp/mptcp_ndiffports.c
+new file mode 100644
+index 000000000000..6f5087983175
+--- /dev/null
++++ b/net/mptcp/mptcp_ndiffports.c
+@@ -0,0 +1,161 @@
++#include <linux/module.h>
++
++#include <net/mptcp.h>
++#include <net/mptcp_v4.h>
++
++#if IS_ENABLED(CONFIG_IPV6)
++#include <net/mptcp_v6.h>
++#endif
++
++struct ndiffports_priv {
++ /* Worker struct for subflow establishment */
++ struct work_struct subflow_work;
++
++ struct mptcp_cb *mpcb;
++};
++
++static int num_subflows __read_mostly = 2;
++module_param(num_subflows, int, 0644);
++MODULE_PARM_DESC(num_subflows, "choose the number of subflows per MPTCP connection");
++
++/**
++ * Create all new subflows by calling mptcp_initX_subsockets().
++ *
++ * This function uses a goto next_subflow to release the lock between
++ * new subflows, giving other processes a chance to do some work on the
++ * socket and potentially finish the communication.
++ **/
++static void create_subflow_worker(struct work_struct *work)
++{
++ const struct ndiffports_priv *pm_priv = container_of(work,
++ struct ndiffports_priv,
++ subflow_work);
++ struct mptcp_cb *mpcb = pm_priv->mpcb;
++ struct sock *meta_sk = mpcb->meta_sk;
++ int iter = 0;
++
++next_subflow:
++ if (iter) {
++ release_sock(meta_sk);
++ mutex_unlock(&mpcb->mpcb_mutex);
++
++ cond_resched();
++ }
++ mutex_lock(&mpcb->mpcb_mutex);
++ lock_sock_nested(meta_sk, SINGLE_DEPTH_NESTING);
++
++ iter++;
++
++ if (sock_flag(meta_sk, SOCK_DEAD))
++ goto exit;
++
++ if (mpcb->master_sk &&
++ !tcp_sk(mpcb->master_sk)->mptcp->fully_established)
++ goto exit;
++
++ if (num_subflows > iter && num_subflows > mpcb->cnt_subflows) {
++ if (meta_sk->sk_family == AF_INET ||
++ mptcp_v6_is_v4_mapped(meta_sk)) {
++ struct mptcp_loc4 loc;
++ struct mptcp_rem4 rem;
++
++ loc.addr.s_addr = inet_sk(meta_sk)->inet_saddr;
++ loc.loc4_id = 0;
++ loc.low_prio = 0;
++
++ rem.addr.s_addr = inet_sk(meta_sk)->inet_daddr;
++ rem.port = inet_sk(meta_sk)->inet_dport;
++ rem.rem4_id = 0; /* Default 0 */
++
++ mptcp_init4_subsockets(meta_sk, &loc, &rem);
++ } else {
++#if IS_ENABLED(CONFIG_IPV6)
++ struct mptcp_loc6 loc;
++ struct mptcp_rem6 rem;
++
++ loc.addr = inet6_sk(meta_sk)->saddr;
++ loc.loc6_id = 0;
++ loc.low_prio = 0;
++
++ rem.addr = meta_sk->sk_v6_daddr;
++ rem.port = inet_sk(meta_sk)->inet_dport;
++ rem.rem6_id = 0; /* Default 0 */
++
++ mptcp_init6_subsockets(meta_sk, &loc, &rem);
++#endif
++ }
++ goto next_subflow;
++ }
++
++exit:
++ release_sock(meta_sk);
++ mutex_unlock(&mpcb->mpcb_mutex);
++ sock_put(meta_sk);
++}
++
++static void ndiffports_new_session(const struct sock *meta_sk)
++{
++ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct ndiffports_priv *fmp = (struct ndiffports_priv *)&mpcb->mptcp_pm[0];
++
++ /* Initialize workqueue-struct */
++ INIT_WORK(&fmp->subflow_work, create_subflow_worker);
++ fmp->mpcb = mpcb;
++}
++
++static void ndiffports_create_subflows(struct sock *meta_sk)
++{
++ const struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct ndiffports_priv *pm_priv = (struct ndiffports_priv *)&mpcb->mptcp_pm[0];
++
++ if (mpcb->infinite_mapping_snd || mpcb->infinite_mapping_rcv ||
++ mpcb->send_infinite_mapping ||
++ mpcb->server_side || sock_flag(meta_sk, SOCK_DEAD))
++ return;
++
++ if (!work_pending(&pm_priv->subflow_work)) {
++ sock_hold(meta_sk);
++ queue_work(mptcp_wq, &pm_priv->subflow_work);
++ }
++}
++
++static int ndiffports_get_local_id(sa_family_t family, union inet_addr *addr,
++ struct net *net, bool *low_prio)
++{
++ return 0;
++}
++
++static struct mptcp_pm_ops ndiffports __read_mostly = {
++ .new_session = ndiffports_new_session,
++ .fully_established = ndiffports_create_subflows,
++ .get_local_id = ndiffports_get_local_id,
++ .name = "ndiffports",
++ .owner = THIS_MODULE,
++};
++
++/* General initialization of MPTCP_PM */
++static int __init ndiffports_register(void)
++{
++ BUILD_BUG_ON(sizeof(struct ndiffports_priv) > MPTCP_PM_SIZE);
++
++ if (mptcp_register_path_manager(&ndiffports))
++ goto exit;
++
++ return 0;
++
++exit:
++ return -1;
++}
++
++static void ndiffports_unregister(void)
++{
++ mptcp_unregister_path_manager(&ndiffports);
++}
++
++module_init(ndiffports_register);
++module_exit(ndiffports_unregister);
++
++MODULE_AUTHOR("Christoph Paasch");
++MODULE_LICENSE("GPL");
++MODULE_DESCRIPTION("NDIFF-PORTS MPTCP");
++MODULE_VERSION("0.88");
+diff --git a/net/mptcp/mptcp_ofo_queue.c b/net/mptcp/mptcp_ofo_queue.c
+new file mode 100644
+index 000000000000..ec4e98622637
+--- /dev/null
++++ b/net/mptcp/mptcp_ofo_queue.c
+@@ -0,0 +1,295 @@
++/*
++ * MPTCP implementation - Fast algorithm for MPTCP meta-reordering
++ *
++ * Initial Design & Implementation:
++ * Sébastien Barré <sebastien.barre@uclouvain.be>
++ *
++ * Current Maintainer & Author:
++ * Christoph Paasch <christoph.paasch@uclouvain.be>
++ *
++ * Additional authors:
++ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
++ * Gregory Detal <gregory.detal@uclouvain.be>
++ * Fabien Duchêne <fabien.duchene@uclouvain.be>
++ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
++ * Lavkesh Lahngir <lavkesh51@gmail.com>
++ * Andreas Ripke <ripke@neclab.eu>
++ * Vlad Dogaru <vlad.dogaru@intel.com>
++ * Octavian Purdila <octavian.purdila@intel.com>
++ * John Ronan <jronan@tssg.org>
++ * Catalin Nicutar <catalin.nicutar@gmail.com>
++ * Brandon Heller <brandonh@stanford.edu>
++ *
++ * This program is free software; you can redistribute it and/or
++ * modify it under the terms of the GNU General Public License
++ * as published by the Free Software Foundation; either version
++ * 2 of the License, or (at your option) any later version.
++ */
++
++#include <linux/skbuff.h>
++#include <linux/slab.h>
++#include <net/tcp.h>
++#include <net/mptcp.h>
++
++void mptcp_remove_shortcuts(const struct mptcp_cb *mpcb,
++ const struct sk_buff *skb)
++{
++ struct tcp_sock *tp;
++
++ mptcp_for_each_tp(mpcb, tp) {
++ if (tp->mptcp->shortcut_ofoqueue == skb) {
++ tp->mptcp->shortcut_ofoqueue = NULL;
++ return;
++ }
++ }
++}
++
++/* Does 'skb' fit after 'here' in the queue 'head'?
++ * If yes, we queue it and return 1.
++ */
++static int mptcp_ofo_queue_after(struct sk_buff_head *head,
++ struct sk_buff *skb, struct sk_buff *here,
++ const struct tcp_sock *tp)
++{
++ struct sock *meta_sk = tp->meta_sk;
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ u32 seq = TCP_SKB_CB(skb)->seq;
++ u32 end_seq = TCP_SKB_CB(skb)->end_seq;
++
++ /* We want to queue skb after here, thus seq >= here->end_seq */
++ if (before(seq, TCP_SKB_CB(here)->end_seq))
++ return 0;
++
++ if (seq == TCP_SKB_CB(here)->end_seq) {
++ bool fragstolen = false;
++
++ if (!tcp_try_coalesce(meta_sk, here, skb, &fragstolen)) {
++ __skb_queue_after(&meta_tp->out_of_order_queue, here, skb);
++ return 1;
++ } else {
++ kfree_skb_partial(skb, fragstolen);
++ return -1;
++ }
++ }
++
++ /* If here is the last one, we can always queue it */
++ if (skb_queue_is_last(head, here)) {
++ __skb_queue_after(head, here, skb);
++ return 1;
++ } else {
++ struct sk_buff *skb1 = skb_queue_next(head, here);
++ /* It's not the last one, but does it fit between 'here' and
++ * the one after 'here'? That is, does end_seq <= after_here->seq?
++ */
++ if (!after(end_seq, TCP_SKB_CB(skb1)->seq)) {
++ __skb_queue_after(head, here, skb);
++ return 1;
++ }
++ }
++
++ return 0;
++}
++
++static void try_shortcut(struct sk_buff *shortcut, struct sk_buff *skb,
++ struct sk_buff_head *head, struct tcp_sock *tp)
++{
++ struct sock *meta_sk = tp->meta_sk;
++ struct tcp_sock *tp_it, *meta_tp = tcp_sk(meta_sk);
++ struct mptcp_cb *mpcb = meta_tp->mpcb;
++ struct sk_buff *skb1, *best_shortcut = NULL;
++ u32 seq = TCP_SKB_CB(skb)->seq;
++ u32 end_seq = TCP_SKB_CB(skb)->end_seq;
++ u32 distance = 0xffffffff;
++
++ /* First, check the tp's shortcut */
++ if (!shortcut) {
++ if (skb_queue_empty(head)) {
++ __skb_queue_head(head, skb);
++ goto end;
++ }
++ } else {
++ /* Is the tp's shortcut a hit? If yes, we insert. */
++ int ret = mptcp_ofo_queue_after(head, skb, shortcut, tp);
++
++ if (ret) {
++ skb = (ret > 0) ? skb : NULL;
++ goto end;
++ }
++ }
++
++ /* Check the shortcuts of the other subsockets. */
++ mptcp_for_each_tp(mpcb, tp_it) {
++ shortcut = tp_it->mptcp->shortcut_ofoqueue;
++ /* Can we queue it here? If yes, do so! */
++ if (shortcut) {
++ int ret = mptcp_ofo_queue_after(head, skb, shortcut, tp);
++
++ if (ret) {
++ skb = (ret > 0) ? skb : NULL;
++ goto end;
++ }
++ }
++
++ /* Could not queue it, check if we are close.
++ * We are looking for a shortcut, close enough to seq to
++ * set skb1 prematurely and thus improve the subsequent lookup,
++ * which tries to find a skb1 so that skb1->seq <= seq.
++ *
++ * So, here we only take shortcuts whose shortcut->seq > seq,
++ * and minimize the distance between shortcut->seq and seq and
++ * set best_shortcut to this one with the minimal distance.
++ *
++ * That way, the subsequent while-loop is shortest.
++ */
++ if (shortcut && after(TCP_SKB_CB(shortcut)->seq, seq)) {
++ /* Are we closer than the current best shortcut? */
++ if ((u32)(TCP_SKB_CB(shortcut)->seq - seq) < distance) {
++ distance = (u32)(TCP_SKB_CB(shortcut)->seq - seq);
++ best_shortcut = shortcut;
++ }
++ }
++ }
++
++ if (best_shortcut)
++ skb1 = best_shortcut;
++ else
++ skb1 = skb_peek_tail(head);
++
++ if (seq == TCP_SKB_CB(skb1)->end_seq) {
++ bool fragstolen = false;
++
++ if (!tcp_try_coalesce(meta_sk, skb1, skb, &fragstolen)) {
++ __skb_queue_after(&meta_tp->out_of_order_queue, skb1, skb);
++ } else {
++ kfree_skb_partial(skb, fragstolen);
++ skb = NULL;
++ }
++
++ goto end;
++ }
++
++ /* Find the insertion point, starting from best_shortcut if available.
++ *
++ * Inspired from tcp_data_queue_ofo.
++ */
++ while (1) {
++ /* skb1->seq <= seq */
++ if (!after(TCP_SKB_CB(skb1)->seq, seq))
++ break;
++ if (skb_queue_is_first(head, skb1)) {
++ skb1 = NULL;
++ break;
++ }
++ skb1 = skb_queue_prev(head, skb1);
++ }
++
++ /* Does skb overlap the previous one? */
++ if (skb1 && before(seq, TCP_SKB_CB(skb1)->end_seq)) {
++ if (!after(end_seq, TCP_SKB_CB(skb1)->end_seq)) {
++ /* All the bits are present. */
++ __kfree_skb(skb);
++ skb = NULL;
++ goto end;
++ }
++ if (seq == TCP_SKB_CB(skb1)->seq) {
++ if (skb_queue_is_first(head, skb1))
++ skb1 = NULL;
++ else
++ skb1 = skb_queue_prev(head, skb1);
++ }
++ }
++ if (!skb1)
++ __skb_queue_head(head, skb);
++ else
++ __skb_queue_after(head, skb1, skb);
++
++ /* And clean segments covered by new one as whole. */
++ while (!skb_queue_is_last(head, skb)) {
++ skb1 = skb_queue_next(head, skb);
++
++ if (!after(end_seq, TCP_SKB_CB(skb1)->seq))
++ break;
++
++ __skb_unlink(skb1, head);
++ mptcp_remove_shortcuts(mpcb, skb1);
++ __kfree_skb(skb1);
++ }
++
++end:
++ if (skb) {
++ skb_set_owner_r(skb, meta_sk);
++ tp->mptcp->shortcut_ofoqueue = skb;
++ }
++
++ return;
++}
++
++/**
++ * @sk: the subflow that received this skb.
++ */
++void mptcp_add_meta_ofo_queue(const struct sock *meta_sk, struct sk_buff *skb,
++ struct sock *sk)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++
++ try_shortcut(tp->mptcp->shortcut_ofoqueue, skb,
++ &tcp_sk(meta_sk)->out_of_order_queue, tp);
++}
++
++bool mptcp_prune_ofo_queue(struct sock *sk)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ bool res = false;
++
++ if (!skb_queue_empty(&tp->out_of_order_queue)) {
++ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_OFOPRUNED);
++ mptcp_purge_ofo_queue(tp);
++
++ /* No sack at the mptcp-level */
++ sk_mem_reclaim(sk);
++ res = true;
++ }
++
++ return res;
++}
++
++void mptcp_ofo_queue(struct sock *meta_sk)
++{
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct sk_buff *skb;
++
++ while ((skb = skb_peek(&meta_tp->out_of_order_queue)) != NULL) {
++ u32 old_rcv_nxt = meta_tp->rcv_nxt;
++ if (after(TCP_SKB_CB(skb)->seq, meta_tp->rcv_nxt))
++ break;
++
++ if (!after(TCP_SKB_CB(skb)->end_seq, meta_tp->rcv_nxt)) {
++ __skb_unlink(skb, &meta_tp->out_of_order_queue);
++ mptcp_remove_shortcuts(meta_tp->mpcb, skb);
++ __kfree_skb(skb);
++ continue;
++ }
++
++ __skb_unlink(skb, &meta_tp->out_of_order_queue);
++ mptcp_remove_shortcuts(meta_tp->mpcb, skb);
++
++ __skb_queue_tail(&meta_sk->sk_receive_queue, skb);
++ meta_tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq;
++ mptcp_check_rcvseq_wrap(meta_tp, old_rcv_nxt);
++
++ if (tcp_hdr(skb)->fin)
++ mptcp_fin(meta_sk);
++ }
++}
++
++void mptcp_purge_ofo_queue(struct tcp_sock *meta_tp)
++{
++ struct sk_buff_head *head = &meta_tp->out_of_order_queue;
++ struct sk_buff *skb, *tmp;
++
++ skb_queue_walk_safe(head, skb, tmp) {
++ __skb_unlink(skb, head);
++ mptcp_remove_shortcuts(meta_tp->mpcb, skb);
++ kfree_skb(skb);
++ }
++}
+diff --git a/net/mptcp/mptcp_olia.c b/net/mptcp/mptcp_olia.c
+new file mode 100644
+index 000000000000..53f5c43bb488
+--- /dev/null
++++ b/net/mptcp/mptcp_olia.c
+@@ -0,0 +1,311 @@
++/*
++ * MPTCP implementation - OPPORTUNISTIC LINKED INCREASES CONGESTION CONTROL:
++ *
++ * Algorithm design:
++ * Ramin Khalili <ramin.khalili@epfl.ch>
++ * Nicolas Gast <nicolas.gast@epfl.ch>
++ * Jean-Yves Le Boudec <jean-yves.leboudec@epfl.ch>
++ *
++ * Implementation:
++ * Ramin Khalili <ramin.khalili@epfl.ch>
++ *
++ * Ported to the official MPTCP-kernel:
++ * Christoph Paasch <christoph.paasch@uclouvain.be>
++ *
++ * This program is free software; you can redistribute it and/or
++ * modify it under the terms of the GNU General Public License
++ * as published by the Free Software Foundation; either version
++ * 2 of the License, or (at your option) any later version.
++ */
++
++
++#include <net/tcp.h>
++#include <net/mptcp.h>
++
++#include <linux/module.h>
++
++static int scale = 10;
++
++struct mptcp_olia {
++ u32 mptcp_loss1;
++ u32 mptcp_loss2;
++ u32 mptcp_loss3;
++ int epsilon_num;
++ u32 epsilon_den;
++ int mptcp_snd_cwnd_cnt;
++};
++
++static inline int mptcp_olia_sk_can_send(const struct sock *sk)
++{
++ return mptcp_sk_can_send(sk) && tcp_sk(sk)->srtt_us;
++}
++
++static inline u64 mptcp_olia_scale(u64 val, int scale)
++{
++ return (u64) val << scale;
++}
++
++/* account for the artificial inflation of cwnd (see RFC5681)
++ * during the fast-retransmit phase
++ */
++static u32 mptcp_get_crt_cwnd(struct sock *sk)
++{
++ const struct inet_connection_sock *icsk = inet_csk(sk);
++
++ if (icsk->icsk_ca_state == TCP_CA_Recovery)
++ return tcp_sk(sk)->snd_ssthresh;
++ else
++ return tcp_sk(sk)->snd_cwnd;
++}
++
++/* return the denominator of the first term of the increase term */
++static u64 mptcp_get_rate(const struct mptcp_cb *mpcb , u32 path_rtt)
++{
++ struct sock *sk;
++ u64 rate = 1; /* We have to avoid a zero-rate because it is used as a divisor */
++
++ mptcp_for_each_sk(mpcb, sk) {
++ struct tcp_sock *tp = tcp_sk(sk);
++ u64 scaled_num;
++ u32 tmp_cwnd;
++
++ if (!mptcp_olia_sk_can_send(sk))
++ continue;
++
++ tmp_cwnd = mptcp_get_crt_cwnd(sk);
++ scaled_num = mptcp_olia_scale(tmp_cwnd, scale) * path_rtt;
++ rate += div_u64(scaled_num , tp->srtt_us);
++ }
++ rate *= rate;
++ return rate;
++}
++
++/* find the maximum cwnd, used to find set M */
++static u32 mptcp_get_max_cwnd(const struct mptcp_cb *mpcb)
++{
++ struct sock *sk;
++ u32 best_cwnd = 0;
++
++ mptcp_for_each_sk(mpcb, sk) {
++ u32 tmp_cwnd;
++
++ if (!mptcp_olia_sk_can_send(sk))
++ continue;
++
++ tmp_cwnd = mptcp_get_crt_cwnd(sk);
++ if (tmp_cwnd > best_cwnd)
++ best_cwnd = tmp_cwnd;
++ }
++ return best_cwnd;
++}
++
++static void mptcp_get_epsilon(const struct mptcp_cb *mpcb)
++{
++ struct mptcp_olia *ca;
++ struct tcp_sock *tp;
++ struct sock *sk;
++ u64 tmp_int, tmp_rtt, best_int = 0, best_rtt = 1;
++ u32 max_cwnd = 1, best_cwnd = 1, tmp_cwnd;
++ u8 M = 0, B_not_M = 0;
++
++ /* TODO - integrate this in the following loop - we just want to iterate once */
++
++ max_cwnd = mptcp_get_max_cwnd(mpcb);
++
++ /* find the best path */
++ mptcp_for_each_sk(mpcb, sk) {
++ tp = tcp_sk(sk);
++ ca = inet_csk_ca(sk);
++
++ if (!mptcp_olia_sk_can_send(sk))
++ continue;
++
++ tmp_rtt = (u64)tp->srtt_us * tp->srtt_us;
++ /* TODO - check here and rename variables */
++ tmp_int = max(ca->mptcp_loss3 - ca->mptcp_loss2,
++ ca->mptcp_loss2 - ca->mptcp_loss1);
++
++ tmp_cwnd = mptcp_get_crt_cwnd(sk);
++ if ((u64)tmp_int * best_rtt >= (u64)best_int * tmp_rtt) {
++ best_rtt = tmp_rtt;
++ best_int = tmp_int;
++ best_cwnd = tmp_cwnd;
++ }
++ }
++
++ /* TODO - integrate this here in mptcp_get_max_cwnd and in the previous loop */
++ /* find the size of M and B_not_M */
++ mptcp_for_each_sk(mpcb, sk) {
++ tp = tcp_sk(sk);
++ ca = inet_csk_ca(sk);
++
++ if (!mptcp_olia_sk_can_send(sk))
++ continue;
++
++ tmp_cwnd = mptcp_get_crt_cwnd(sk);
++ if (tmp_cwnd == max_cwnd) {
++ M++;
++ } else {
++ tmp_rtt = (u64)tp->srtt_us * tp->srtt_us;
++ tmp_int = max(ca->mptcp_loss3 - ca->mptcp_loss2,
++ ca->mptcp_loss2 - ca->mptcp_loss1);
++
++ if ((u64)tmp_int * best_rtt == (u64)best_int * tmp_rtt)
++ B_not_M++;
++ }
++ }
++
++ /* check if the path is in M or B_not_M and set the value of epsilon accordingly */
++ mptcp_for_each_sk(mpcb, sk) {
++ tp = tcp_sk(sk);
++ ca = inet_csk_ca(sk);
++
++ if (!mptcp_olia_sk_can_send(sk))
++ continue;
++
++ if (B_not_M == 0) {
++ ca->epsilon_num = 0;
++ ca->epsilon_den = 1;
++ } else {
++ tmp_rtt = (u64)tp->srtt_us * tp->srtt_us;
++ tmp_int = max(ca->mptcp_loss3 - ca->mptcp_loss2,
++ ca->mptcp_loss2 - ca->mptcp_loss1);
++ tmp_cwnd = mptcp_get_crt_cwnd(sk);
++
++ if (tmp_cwnd < max_cwnd &&
++ (u64)tmp_int * best_rtt == (u64)best_int * tmp_rtt) {
++ ca->epsilon_num = 1;
++ ca->epsilon_den = mpcb->cnt_established * B_not_M;
++ } else if (tmp_cwnd == max_cwnd) {
++ ca->epsilon_num = -1;
++ ca->epsilon_den = mpcb->cnt_established * M;
++ } else {
++ ca->epsilon_num = 0;
++ ca->epsilon_den = 1;
++ }
++ }
++ }
++}
++
++/* setting the initial values */
++static void mptcp_olia_init(struct sock *sk)
++{
++ const struct tcp_sock *tp = tcp_sk(sk);
++ struct mptcp_olia *ca = inet_csk_ca(sk);
++
++ if (mptcp(tp)) {
++ ca->mptcp_loss1 = tp->snd_una;
++ ca->mptcp_loss2 = tp->snd_una;
++ ca->mptcp_loss3 = tp->snd_una;
++ ca->mptcp_snd_cwnd_cnt = 0;
++ ca->epsilon_num = 0;
++ ca->epsilon_den = 1;
++ }
++}
++
++/* updating inter-loss distance and ssthresh */
++static void mptcp_olia_set_state(struct sock *sk, u8 new_state)
++{
++ if (!mptcp(tcp_sk(sk)))
++ return;
++
++ if (new_state == TCP_CA_Loss ||
++ new_state == TCP_CA_Recovery || new_state == TCP_CA_CWR) {
++ struct mptcp_olia *ca = inet_csk_ca(sk);
++
++ if (ca->mptcp_loss3 != ca->mptcp_loss2 &&
++ !inet_csk(sk)->icsk_retransmits) {
++ ca->mptcp_loss1 = ca->mptcp_loss2;
++ ca->mptcp_loss2 = ca->mptcp_loss3;
++ }
++ }
++}
++
++/* main algorithm */
++static void mptcp_olia_cong_avoid(struct sock *sk, u32 ack, u32 acked)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct mptcp_olia *ca = inet_csk_ca(sk);
++ const struct mptcp_cb *mpcb = tp->mpcb;
++
++ u64 inc_num, inc_den, rate, cwnd_scaled;
++
++ if (!mptcp(tp)) {
++ tcp_reno_cong_avoid(sk, ack, acked);
++ return;
++ }
++
++ ca->mptcp_loss3 = tp->snd_una;
++
++ if (!tcp_is_cwnd_limited(sk))
++ return;
++
++ /* slow start if it is in the safe area */
++ if (tp->snd_cwnd <= tp->snd_ssthresh) {
++ tcp_slow_start(tp, acked);
++ return;
++ }
++
++ mptcp_get_epsilon(mpcb);
++ rate = mptcp_get_rate(mpcb, tp->srtt_us);
++ cwnd_scaled = mptcp_olia_scale(tp->snd_cwnd, scale);
++ inc_den = ca->epsilon_den * tp->snd_cwnd * rate ? : 1;
++
++ /* calculate the increase term; scaling is used to reduce rounding error */
++ if (ca->epsilon_num == -1) {
++ if (ca->epsilon_den * cwnd_scaled * cwnd_scaled < rate) {
++ inc_num = rate - ca->epsilon_den *
++ cwnd_scaled * cwnd_scaled;
++ ca->mptcp_snd_cwnd_cnt -= div64_u64(
++ mptcp_olia_scale(inc_num , scale) , inc_den);
++ } else {
++ inc_num = ca->epsilon_den *
++ cwnd_scaled * cwnd_scaled - rate;
++ ca->mptcp_snd_cwnd_cnt += div64_u64(
++ mptcp_olia_scale(inc_num , scale) , inc_den);
++ }
++ } else {
++ inc_num = ca->epsilon_num * rate +
++ ca->epsilon_den * cwnd_scaled * cwnd_scaled;
++ ca->mptcp_snd_cwnd_cnt += div64_u64(
++ mptcp_olia_scale(inc_num , scale) , inc_den);
++ }
++
++
++ if (ca->mptcp_snd_cwnd_cnt >= (1 << scale) - 1) {
++ if (tp->snd_cwnd < tp->snd_cwnd_clamp)
++ tp->snd_cwnd++;
++ ca->mptcp_snd_cwnd_cnt = 0;
++ } else if (ca->mptcp_snd_cwnd_cnt <= 0 - (1 << scale) + 1) {
++ tp->snd_cwnd = max((int) 1 , (int) tp->snd_cwnd - 1);
++ ca->mptcp_snd_cwnd_cnt = 0;
++ }
++}
++
++static struct tcp_congestion_ops mptcp_olia = {
++ .init = mptcp_olia_init,
++ .ssthresh = tcp_reno_ssthresh,
++ .cong_avoid = mptcp_olia_cong_avoid,
++ .set_state = mptcp_olia_set_state,
++ .owner = THIS_MODULE,
++ .name = "olia",
++};
++
++static int __init mptcp_olia_register(void)
++{
++ BUILD_BUG_ON(sizeof(struct mptcp_olia) > ICSK_CA_PRIV_SIZE);
++ return tcp_register_congestion_control(&mptcp_olia);
++}
++
++static void __exit mptcp_olia_unregister(void)
++{
++ tcp_unregister_congestion_control(&mptcp_olia);
++}
++
++module_init(mptcp_olia_register);
++module_exit(mptcp_olia_unregister);
++
++MODULE_AUTHOR("Ramin Khalili, Nicolas Gast, Jean-Yves Le Boudec");
++MODULE_LICENSE("GPL");
++MODULE_DESCRIPTION("MPTCP COUPLED CONGESTION CONTROL");
++MODULE_VERSION("0.1");
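The core of `mptcp_olia_cong_avoid` above is a fixed-point accumulator: the per-ACK increase term is scaled by `1 << scale`, summed into `mptcp_snd_cwnd_cnt`, and a full unit of credit moves `snd_cwnd` by one segment in either direction. A sketch of that arithmetic, under the assumption that `rate` and the epsilon values have already been computed as in the functions above (names and the integer-only math are illustrative):

```python
SCALE = 10  # fixed-point shift, mirrors `scale` in the module

def olia_cwnd_update(cwnd, cwnd_cnt, eps_num, eps_den, rate, cwnd_clamp):
    """One OLIA congestion-avoidance step: accumulate the scaled
    increase term and move cwnd once a full unit of credit builds up."""
    cwnd_scaled = cwnd << SCALE
    inc_den = (eps_den * cwnd * rate) or 1   # avoid a zero divisor

    if eps_num == -1:
        # path in the "max-cwnd" set M: the increase term can be negative
        if eps_den * cwnd_scaled * cwnd_scaled < rate:
            inc_num = rate - eps_den * cwnd_scaled * cwnd_scaled
            cwnd_cnt -= (inc_num << SCALE) // inc_den
        else:
            inc_num = eps_den * cwnd_scaled * cwnd_scaled - rate
            cwnd_cnt += (inc_num << SCALE) // inc_den
    else:
        inc_num = eps_num * rate + eps_den * cwnd_scaled * cwnd_scaled
        cwnd_cnt += (inc_num << SCALE) // inc_den

    # a full (1 << SCALE) of credit moves cwnd by one segment
    if cwnd_cnt >= (1 << SCALE) - 1:
        if cwnd < cwnd_clamp:
            cwnd += 1
        cwnd_cnt = 0
    elif cwnd_cnt <= -(1 << SCALE) + 1:
        cwnd = max(1, cwnd - 1)
        cwnd_cnt = 0
    return cwnd, cwnd_cnt
```

With `eps_num = 0`, `eps_den = 1` and `rate == cwnd_scaled**2` (a single path), each ACK adds roughly `(1 << SCALE) / cwnd` of credit, i.e. Reno-style 1/cwnd growth.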
+diff --git a/net/mptcp/mptcp_output.c b/net/mptcp/mptcp_output.c
+new file mode 100644
+index 000000000000..400ea254c078
+--- /dev/null
++++ b/net/mptcp/mptcp_output.c
+@@ -0,0 +1,1743 @@
++/*
++ * MPTCP implementation - Sending side
++ *
++ * Initial Design & Implementation:
++ * Sébastien Barré <sebastien.barre@uclouvain.be>
++ *
++ * Current Maintainer & Author:
++ * Christoph Paasch <christoph.paasch@uclouvain.be>
++ *
++ * Additional authors:
++ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
++ * Gregory Detal <gregory.detal@uclouvain.be>
++ * Fabien Duchêne <fabien.duchene@uclouvain.be>
++ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
++ * Lavkesh Lahngir <lavkesh51@gmail.com>
++ * Andreas Ripke <ripke@neclab.eu>
++ * Vlad Dogaru <vlad.dogaru@intel.com>
++ * Octavian Purdila <octavian.purdila@intel.com>
++ * John Ronan <jronan@tssg.org>
++ * Catalin Nicutar <catalin.nicutar@gmail.com>
++ * Brandon Heller <brandonh@stanford.edu>
++ *
++ *
++ * This program is free software; you can redistribute it and/or
++ * modify it under the terms of the GNU General Public License
++ * as published by the Free Software Foundation; either version
++ * 2 of the License, or (at your option) any later version.
++ */
++
++#include <linux/kconfig.h>
++#include <linux/skbuff.h>
++#include <linux/tcp.h>
++
++#include <net/mptcp.h>
++#include <net/mptcp_v4.h>
++#include <net/mptcp_v6.h>
++#include <net/sock.h>
++
++static const int mptcp_dss_len = MPTCP_SUB_LEN_DSS_ALIGN +
++ MPTCP_SUB_LEN_ACK_ALIGN +
++ MPTCP_SUB_LEN_SEQ_ALIGN;
++
++static inline int mptcp_sub_len_remove_addr(u16 bitfield)
++{
++ unsigned int c;
++ for (c = 0; bitfield; c++)
++ bitfield &= bitfield - 1;
++ return MPTCP_SUB_LEN_REMOVE_ADDR + c - 1;
++}
++
++int mptcp_sub_len_remove_addr_align(u16 bitfield)
++{
++ return ALIGN(mptcp_sub_len_remove_addr(bitfield), 4);
++}
++EXPORT_SYMBOL(mptcp_sub_len_remove_addr_align);
++
++/* get the data-seq and end-data-seq and store them again in the
++ * tcp_skb_cb
++ */
++static int mptcp_reconstruct_mapping(struct sk_buff *skb)
++{
++ const struct mp_dss *mpdss = (struct mp_dss *)TCP_SKB_CB(skb)->dss;
++ u32 *p32;
++ u16 *p16;
++
++ if (!mpdss->M)
++ return 1;
++
++ /* Move the pointer to the data-seq */
++ p32 = (u32 *)mpdss;
++ p32++;
++ if (mpdss->A) {
++ p32++;
++ if (mpdss->a)
++ p32++;
++ }
++
++ TCP_SKB_CB(skb)->seq = ntohl(*p32);
++
++ /* Get the data_len to calculate the end_data_seq */
++ p32++;
++ p32++;
++ p16 = (u16 *)p32;
++ TCP_SKB_CB(skb)->end_seq = ntohs(*p16) + TCP_SKB_CB(skb)->seq;
++
++ return 0;
++}
++
++static void mptcp_find_and_set_pathmask(const struct sock *meta_sk, struct sk_buff *skb)
++{
++ struct sk_buff *skb_it;
++
++ skb_it = tcp_write_queue_head(meta_sk);
++
++ tcp_for_write_queue_from(skb_it, meta_sk) {
++ if (skb_it == tcp_send_head(meta_sk))
++ break;
++
++ if (TCP_SKB_CB(skb_it)->seq == TCP_SKB_CB(skb)->seq) {
++ TCP_SKB_CB(skb)->path_mask = TCP_SKB_CB(skb_it)->path_mask;
++ break;
++ }
++ }
++}
++
++/* Reinject data from one TCP subflow to the meta_sk. If sk == NULL, we are
++ * coming from the meta-retransmit-timer
++ */
++static void __mptcp_reinject_data(struct sk_buff *orig_skb, struct sock *meta_sk,
++ struct sock *sk, int clone_it)
++{
++ struct sk_buff *skb, *skb1;
++ const struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct mptcp_cb *mpcb = meta_tp->mpcb;
++ u32 seq, end_seq;
++
++ if (clone_it) {
++ /* pskb_copy is necessary here, because the TCP/IP-headers
++ * will be changed when it's going to be reinjected on another
++ * subflow.
++ */
++ skb = pskb_copy_for_clone(orig_skb, GFP_ATOMIC);
++ } else {
++ __skb_unlink(orig_skb, &sk->sk_write_queue);
++ sock_set_flag(sk, SOCK_QUEUE_SHRUNK);
++ sk->sk_wmem_queued -= orig_skb->truesize;
++ sk_mem_uncharge(sk, orig_skb->truesize);
++ skb = orig_skb;
++ }
++ if (unlikely(!skb))
++ return;
++
++ if (sk && mptcp_reconstruct_mapping(skb)) {
++ __kfree_skb(skb);
++ return;
++ }
++
++ skb->sk = meta_sk;
++
++ /* If it already reached the destination, we don't have to reinject it */
++ if (!after(TCP_SKB_CB(skb)->end_seq, meta_tp->snd_una)) {
++ __kfree_skb(skb);
++ return;
++ }
++
++ /* Only reinject segments that are fully covered by the mapping */
++ if (skb->len + (mptcp_is_data_fin(skb) ? 1 : 0) !=
++ TCP_SKB_CB(skb)->end_seq - TCP_SKB_CB(skb)->seq) {
++ u32 seq = TCP_SKB_CB(skb)->seq;
++ u32 end_seq = TCP_SKB_CB(skb)->end_seq;
++
++ __kfree_skb(skb);
++
++ /* Ok, now we have to look for the full mapping in the meta
++ * send-queue :S
++ */
++ tcp_for_write_queue(skb, meta_sk) {
++ /* Not yet at the mapping? */
++ if (before(TCP_SKB_CB(skb)->seq, seq))
++ continue;
++ /* We have passed by the mapping */
++ if (after(TCP_SKB_CB(skb)->end_seq, end_seq))
++ return;
++
++ __mptcp_reinject_data(skb, meta_sk, NULL, 1);
++ }
++ return;
++ }
++
++ /* Segment goes back to the MPTCP-layer. So, we need to zero the
++ * path_mask/dss.
++ */
++ memset(TCP_SKB_CB(skb)->dss, 0 , mptcp_dss_len);
++
++ /* We need to find out the path-mask from the meta-write-queue
++ * to properly select a subflow.
++ */
++ mptcp_find_and_set_pathmask(meta_sk, skb);
++
++ /* If it's empty, just add */
++ if (skb_queue_empty(&mpcb->reinject_queue)) {
++ skb_queue_head(&mpcb->reinject_queue, skb);
++ return;
++ }
++
++ /* Find the place to insert skb - or we may even 'drop' it, as the
++ * data is already covered by other skb's in the reinject-queue.
++ *
++ * This is inspired by code from tcp_data_queue.
++ */
++
++ skb1 = skb_peek_tail(&mpcb->reinject_queue);
++ seq = TCP_SKB_CB(skb)->seq;
++ while (1) {
++ if (!after(TCP_SKB_CB(skb1)->seq, seq))
++ break;
++ if (skb_queue_is_first(&mpcb->reinject_queue, skb1)) {
++ skb1 = NULL;
++ break;
++ }
++ skb1 = skb_queue_prev(&mpcb->reinject_queue, skb1);
++ }
++
++ /* Does skb overlap the previous one? */
++ end_seq = TCP_SKB_CB(skb)->end_seq;
++ if (skb1 && before(seq, TCP_SKB_CB(skb1)->end_seq)) {
++ if (!after(end_seq, TCP_SKB_CB(skb1)->end_seq)) {
++ /* All the bits are present. Don't reinject */
++ __kfree_skb(skb);
++ return;
++ }
++ if (seq == TCP_SKB_CB(skb1)->seq) {
++ if (skb_queue_is_first(&mpcb->reinject_queue, skb1))
++ skb1 = NULL;
++ else
++ skb1 = skb_queue_prev(&mpcb->reinject_queue, skb1);
++ }
++ }
++ if (!skb1)
++ __skb_queue_head(&mpcb->reinject_queue, skb);
++ else
++ __skb_queue_after(&mpcb->reinject_queue, skb1, skb);
++
++ /* And clean segments covered by new one as whole. */
++ while (!skb_queue_is_last(&mpcb->reinject_queue, skb)) {
++ skb1 = skb_queue_next(&mpcb->reinject_queue, skb);
++
++ if (!after(end_seq, TCP_SKB_CB(skb1)->seq))
++ break;
++
++ __skb_unlink(skb1, &mpcb->reinject_queue);
++ __kfree_skb(skb1);
++ }
++ return;
++}
++
++/* Inserts data into the reinject queue */
++void mptcp_reinject_data(struct sock *sk, int clone_it)
++{
++ struct sk_buff *skb_it, *tmp;
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct sock *meta_sk = tp->meta_sk;
++
++ /* It has already been closed - there is really no point in reinjecting */
++ if (meta_sk->sk_state == TCP_CLOSE)
++ return;
++
++ skb_queue_walk_safe(&sk->sk_write_queue, skb_it, tmp) {
++ struct tcp_skb_cb *tcb = TCP_SKB_CB(skb_it);
++ /* Subflow SYNs and FINs are not reinjected.
++ *
++ * Neither are empty subflow-FINs with a data-fin;
++ * those are reinjected below (without the subflow-FIN flag).
++ */
++ if (tcb->tcp_flags & TCPHDR_SYN ||
++ (tcb->tcp_flags & TCPHDR_FIN && !mptcp_is_data_fin(skb_it)) ||
++ (tcb->tcp_flags & TCPHDR_FIN && mptcp_is_data_fin(skb_it) && !skb_it->len))
++ continue;
++
++ __mptcp_reinject_data(skb_it, meta_sk, sk, clone_it);
++ }
++
++ skb_it = tcp_write_queue_tail(meta_sk);
++ /* If sk has sent the empty data-fin, we have to reinject it too. */
++ if (skb_it && mptcp_is_data_fin(skb_it) && skb_it->len == 0 &&
++ TCP_SKB_CB(skb_it)->path_mask & mptcp_pi_to_flag(tp->mptcp->path_index)) {
++ __mptcp_reinject_data(skb_it, meta_sk, NULL, 1);
++ }
++
++ mptcp_push_pending_frames(meta_sk);
++
++ tp->pf = 1;
++}
++EXPORT_SYMBOL(mptcp_reinject_data);
++
++static void mptcp_combine_dfin(const struct sk_buff *skb, const struct sock *meta_sk,
++ struct sock *subsk)
++{
++ const struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct mptcp_cb *mpcb = meta_tp->mpcb;
++ struct sock *sk_it;
++ int all_empty = 1, all_acked;
++
++ /* In infinite mapping we always try to combine */
++ if (mpcb->infinite_mapping_snd && tcp_close_state(subsk)) {
++ subsk->sk_shutdown |= SEND_SHUTDOWN;
++ TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_FIN;
++ return;
++ }
++
++ /* Don't combine, if they didn't combine - otherwise we end up in
++ * TIME_WAIT, even if our app is smart enough to avoid it
++ */
++ if (meta_sk->sk_shutdown & RCV_SHUTDOWN) {
++ if (!mpcb->dfin_combined)
++ return;
++ }
++
++ /* If no other subflow has data to send, we can combine */
++ mptcp_for_each_sk(mpcb, sk_it) {
++ if (!mptcp_sk_can_send(sk_it))
++ continue;
++
++ if (!tcp_write_queue_empty(sk_it))
++ all_empty = 0;
++ }
++
++ /* If all data has been DATA_ACKed, we can combine.
++ * -1, because the data_fin consumed one byte
++ */
++ all_acked = (meta_tp->snd_una == (meta_tp->write_seq - 1));
++
++ if ((all_empty || all_acked) && tcp_close_state(subsk)) {
++ subsk->sk_shutdown |= SEND_SHUTDOWN;
++ TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_FIN;
++ }
++}
++
++static int mptcp_write_dss_mapping(const struct tcp_sock *tp, const struct sk_buff *skb,
++ __be32 *ptr)
++{
++ const struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
++ __be32 *start = ptr;
++ __u16 data_len;
++
++ *ptr++ = htonl(tcb->seq); /* data_seq */
++
++ /* If it's a non-data DATA_FIN, we set subseq to 0 (draft v7) */
++ if (mptcp_is_data_fin(skb) && skb->len == 0)
++ *ptr++ = 0; /* subseq */
++ else
++ *ptr++ = htonl(tp->write_seq - tp->mptcp->snt_isn); /* subseq */
++
++ if (tcb->mptcp_flags & MPTCPHDR_INF)
++ data_len = 0;
++ else
++ data_len = tcb->end_seq - tcb->seq;
++
++ if (tp->mpcb->dss_csum && data_len) {
++ __be16 *p16 = (__be16 *)ptr;
++ __be32 hdseq = mptcp_get_highorder_sndbits(skb, tp->mpcb);
++ __wsum csum;
++
++ *ptr = htonl(((data_len) << 16) |
++ (TCPOPT_EOL << 8) |
++ (TCPOPT_EOL));
++ csum = csum_partial(ptr - 2, 12, skb->csum);
++ p16++;
++ *p16++ = csum_fold(csum_partial(&hdseq, sizeof(hdseq), csum));
++ } else {
++ *ptr++ = htonl(((data_len) << 16) |
++ (TCPOPT_NOP << 8) |
++ (TCPOPT_NOP));
++ }
++
++ return ptr - start;
++}
++
++static int mptcp_write_dss_data_ack(const struct tcp_sock *tp, const struct sk_buff *skb,
++ __be32 *ptr)
++{
++ struct mp_dss *mdss = (struct mp_dss *)ptr;
++ __be32 *start = ptr;
++
++ mdss->kind = TCPOPT_MPTCP;
++ mdss->sub = MPTCP_SUB_DSS;
++ mdss->rsv1 = 0;
++ mdss->rsv2 = 0;
++ mdss->F = mptcp_is_data_fin(skb) ? 1 : 0;
++ mdss->m = 0;
++ mdss->M = mptcp_is_data_seq(skb) ? 1 : 0;
++ mdss->a = 0;
++ mdss->A = 1;
++ mdss->len = mptcp_sub_len_dss(mdss, tp->mpcb->dss_csum);
++ ptr++;
++
++ *ptr++ = htonl(mptcp_meta_tp(tp)->rcv_nxt);
++
++ return ptr - start;
++}
++
++/* RFC6824 states that once a particular subflow mapping has been sent
++ * out it must never be changed. However, packets may be split while
++ * they are in the retransmission queue (due to SACK or ACKs) and that
++ * arguably means that we would change the mapping (e.g. it splits it,
++ * or sends out a subset of the initial mapping).
++ *
++ * Furthermore, the skb checksum is not always preserved across splits
++ * (e.g. mptcp_fragment) which would mean that we need to recompute
++ * the DSS checksum in this case.
++ *
++ * To avoid this we save the initial DSS mapping which allows us to
++ * send the same DSS mapping even for fragmented retransmits.
++ */
++static void mptcp_save_dss_data_seq(const struct tcp_sock *tp, struct sk_buff *skb)
++{
++ struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
++ __be32 *ptr = (__be32 *)tcb->dss;
++
++ tcb->mptcp_flags |= MPTCPHDR_SEQ;
++
++ ptr += mptcp_write_dss_data_ack(tp, skb, ptr);
++ ptr += mptcp_write_dss_mapping(tp, skb, ptr);
++}
++
++/* Write the saved DSS mapping to the header */
++static int mptcp_write_dss_data_seq(const struct tcp_sock *tp, struct sk_buff *skb,
++ __be32 *ptr)
++{
++ __be32 *start = ptr;
++
++ memcpy(ptr, TCP_SKB_CB(skb)->dss, mptcp_dss_len);
++
++ /* update the data_ack */
++ start[1] = htonl(mptcp_meta_tp(tp)->rcv_nxt);
++
++ /* dss is in a union with inet_skb_parm and
++ * the IP layer expects zeroed IPCB fields.
++ */
++ memset(TCP_SKB_CB(skb)->dss, 0 , mptcp_dss_len);
++
++ return mptcp_dss_len/sizeof(*ptr);
++}
++
++static bool mptcp_skb_entail(struct sock *sk, struct sk_buff *skb, int reinject)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ const struct sock *meta_sk = mptcp_meta_sk(sk);
++ const struct mptcp_cb *mpcb = tp->mpcb;
++ struct tcp_skb_cb *tcb;
++ struct sk_buff *subskb = NULL;
++
++ if (!reinject)
++ TCP_SKB_CB(skb)->mptcp_flags |= (mpcb->snd_hiseq_index ?
++ MPTCPHDR_SEQ64_INDEX : 0);
++
++ subskb = pskb_copy_for_clone(skb, GFP_ATOMIC);
++ if (!subskb)
++ return false;
++
++ /* At the subflow level we need to call tcp_init_tso_segs again. We
++ * force this by setting gso_segs to 0. It has been set to 1 prior to
++ * the call to mptcp_skb_entail.
++ */
++ skb_shinfo(subskb)->gso_segs = 0;
++
++ TCP_SKB_CB(skb)->path_mask |= mptcp_pi_to_flag(tp->mptcp->path_index);
++
++ if (!(sk->sk_route_caps & NETIF_F_ALL_CSUM) &&
++ skb->ip_summed == CHECKSUM_PARTIAL) {
++ subskb->csum = skb->csum = skb_checksum(skb, 0, skb->len, 0);
++ subskb->ip_summed = skb->ip_summed = CHECKSUM_NONE;
++ }
++
++ tcb = TCP_SKB_CB(subskb);
++
++ if (tp->mpcb->send_infinite_mapping &&
++ !tp->mpcb->infinite_mapping_snd &&
++ !before(tcb->seq, mptcp_meta_tp(tp)->snd_nxt)) {
++ tp->mptcp->fully_established = 1;
++ tp->mpcb->infinite_mapping_snd = 1;
++ tp->mptcp->infinite_cutoff_seq = tp->write_seq;
++ tcb->mptcp_flags |= MPTCPHDR_INF;
++ }
++
++ if (mptcp_is_data_fin(subskb))
++ mptcp_combine_dfin(subskb, meta_sk, sk);
++
++ mptcp_save_dss_data_seq(tp, subskb);
++
++ tcb->seq = tp->write_seq;
++ tcb->sacked = 0; /* reset the sacked field: from the point of view
++ * of this subflow, we are sending a brand new
++ * segment
++ */
++ /* Take into account seg len */
++ tp->write_seq += subskb->len + ((tcb->tcp_flags & TCPHDR_FIN) ? 1 : 0);
++ tcb->end_seq = tp->write_seq;
++
++ /* If it's a non-payload DATA_FIN (also no subflow-fin), the
++ * segment is not part of the subflow but on a meta-only-level.
++ */
++ if (!mptcp_is_data_fin(subskb) || tcb->end_seq != tcb->seq) {
++ tcp_add_write_queue_tail(sk, subskb);
++ sk->sk_wmem_queued += subskb->truesize;
++ sk_mem_charge(sk, subskb->truesize);
++ } else {
++ int err;
++
++ /* Necessary to initialize for tcp_transmit_skb. mss of 1, as
++ * skb->len = 0 will force tso_segs to 1.
++ */
++ tcp_init_tso_segs(sk, subskb, 1);
++ /* Empty data-fins are sent immediately on the subflow */
++ TCP_SKB_CB(subskb)->when = tcp_time_stamp;
++ err = tcp_transmit_skb(sk, subskb, 1, GFP_ATOMIC);
++
++ /* It has not been queued, we can free it now. */
++ kfree_skb(subskb);
++
++ if (err)
++ return false;
++ }
++
++ if (!tp->mptcp->fully_established) {
++ tp->mptcp->second_packet = 1;
++ tp->mptcp->last_end_data_seq = TCP_SKB_CB(skb)->end_seq;
++ }
++
++ return true;
++}
++
++/* Fragment an skb and update the mptcp meta-data. Due to reinject, we
++ * might need to undo some operations done by tcp_fragment.
++ */
++static int mptcp_fragment(struct sock *meta_sk, struct sk_buff *skb, u32 len,
++ gfp_t gfp, int reinject)
++{
++ int ret, diff, old_factor;
++ struct sk_buff *buff;
++ u8 flags;
++
++ if (skb_headlen(skb) < len)
++ diff = skb->len - len;
++ else
++ diff = skb->data_len;
++ old_factor = tcp_skb_pcount(skb);
++
++ /* The mss_now in tcp_fragment is used to set the tso_segs of the skb.
++ * At the MPTCP-level we do not care about the absolute value. All we
++ * care about is that it is set to 1 for accurate packets_out
++ * accounting.
++ */
++ ret = tcp_fragment(meta_sk, skb, len, UINT_MAX, gfp);
++ if (ret)
++ return ret;
++
++ buff = skb->next;
++
++ flags = TCP_SKB_CB(skb)->mptcp_flags;
++ TCP_SKB_CB(skb)->mptcp_flags = flags & ~(MPTCPHDR_FIN);
++ TCP_SKB_CB(buff)->mptcp_flags = flags;
++ TCP_SKB_CB(buff)->path_mask = TCP_SKB_CB(skb)->path_mask;
++
++ /* If reinject == 1, the buff will be added to the reinject
++ * queue, which is currently not part of memory accounting. So
++ * undo the changes done by tcp_fragment and update the
++ * reinject queue. Also, undo changes to the packet counters.
++ */
++ if (reinject == 1) {
++ int undo = buff->truesize - diff;
++ meta_sk->sk_wmem_queued -= undo;
++ sk_mem_uncharge(meta_sk, undo);
++
++ tcp_sk(meta_sk)->mpcb->reinject_queue.qlen++;
++ meta_sk->sk_write_queue.qlen--;
++
++ if (!before(tcp_sk(meta_sk)->snd_nxt, TCP_SKB_CB(buff)->end_seq)) {
++ undo = old_factor - tcp_skb_pcount(skb) -
++ tcp_skb_pcount(buff);
++ if (undo)
++ tcp_adjust_pcount(meta_sk, skb, -undo);
++ }
++ }
++
++ return 0;
++}
++
++/* Inspired by tcp_write_wakeup */
++int mptcp_write_wakeup(struct sock *meta_sk)
++{
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct sk_buff *skb;
++ struct sock *sk_it;
++ int ans = 0;
++
++ if (meta_sk->sk_state == TCP_CLOSE)
++ return -1;
++
++ skb = tcp_send_head(meta_sk);
++ if (skb &&
++ before(TCP_SKB_CB(skb)->seq, tcp_wnd_end(meta_tp))) {
++ unsigned int mss;
++ unsigned int seg_size = tcp_wnd_end(meta_tp) - TCP_SKB_CB(skb)->seq;
++ struct sock *subsk = meta_tp->mpcb->sched_ops->get_subflow(meta_sk, skb, true);
++ struct tcp_sock *subtp;
++ if (!subsk)
++ goto window_probe;
++ subtp = tcp_sk(subsk);
++ mss = tcp_current_mss(subsk);
++
++ seg_size = min(tcp_wnd_end(meta_tp) - TCP_SKB_CB(skb)->seq,
++ tcp_wnd_end(subtp) - subtp->write_seq);
++
++ if (before(meta_tp->pushed_seq, TCP_SKB_CB(skb)->end_seq))
++ meta_tp->pushed_seq = TCP_SKB_CB(skb)->end_seq;
++
++ /* We are probing the opening of a window
++ * but the window size is != 0;
++ * this must have been a result of SWS avoidance (sender)
++ */
++ if (seg_size < TCP_SKB_CB(skb)->end_seq - TCP_SKB_CB(skb)->seq ||
++ skb->len > mss) {
++ seg_size = min(seg_size, mss);
++ TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_PSH;
++ if (mptcp_fragment(meta_sk, skb, seg_size,
++ GFP_ATOMIC, 0))
++ return -1;
++ } else if (!tcp_skb_pcount(skb)) {
++ /* see mptcp_write_xmit on why we use UINT_MAX */
++ tcp_set_skb_tso_segs(meta_sk, skb, UINT_MAX);
++ }
++
++ TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_PSH;
++ if (!mptcp_skb_entail(subsk, skb, 0))
++ return -1;
++ TCP_SKB_CB(skb)->when = tcp_time_stamp;
++
++ mptcp_check_sndseq_wrap(meta_tp, TCP_SKB_CB(skb)->end_seq -
++ TCP_SKB_CB(skb)->seq);
++ tcp_event_new_data_sent(meta_sk, skb);
++
++ __tcp_push_pending_frames(subsk, mss, TCP_NAGLE_PUSH);
++
++ return 0;
++ } else {
++window_probe:
++ if (between(meta_tp->snd_up, meta_tp->snd_una + 1,
++ meta_tp->snd_una + 0xFFFF)) {
++ mptcp_for_each_sk(meta_tp->mpcb, sk_it) {
++ if (mptcp_sk_can_send_ack(sk_it))
++ tcp_xmit_probe_skb(sk_it, 1);
++ }
++ }
++
++ /* At least one of the tcp_xmit_probe_skb's has to succeed */
++ mptcp_for_each_sk(meta_tp->mpcb, sk_it) {
++ int ret;
++
++ if (!mptcp_sk_can_send_ack(sk_it))
++ continue;
++
++ ret = tcp_xmit_probe_skb(sk_it, 0);
++ if (unlikely(ret > 0))
++ ans = ret;
++ }
++ return ans;
++ }
++}
++
++bool mptcp_write_xmit(struct sock *meta_sk, unsigned int mss_now, int nonagle,
++ int push_one, gfp_t gfp)
++{
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk), *subtp;
++ struct sock *subsk = NULL;
++ struct mptcp_cb *mpcb = meta_tp->mpcb;
++ struct sk_buff *skb;
++ unsigned int sent_pkts;
++ int reinject = 0;
++ unsigned int sublimit;
++
++ sent_pkts = 0;
++
++ while ((skb = mpcb->sched_ops->next_segment(meta_sk, &reinject, &subsk,
++ &sublimit))) {
++ unsigned int limit;
++
++ subtp = tcp_sk(subsk);
++ mss_now = tcp_current_mss(subsk);
++
++ if (reinject == 1) {
++ if (!after(TCP_SKB_CB(skb)->end_seq, meta_tp->snd_una)) {
++ /* Segment already reached the peer, take the next one */
++ __skb_unlink(skb, &mpcb->reinject_queue);
++ __kfree_skb(skb);
++ continue;
++ }
++ }
++
++ /* If the segment was cloned (e.g. a meta retransmission),
++ * the header must be expanded/copied so that there is no
++ * corruption of TSO information.
++ */
++ if (skb_unclone(skb, GFP_ATOMIC))
++ break;
++
++ if (unlikely(!tcp_snd_wnd_test(meta_tp, skb, mss_now)))
++ break;
++
++ /* Force tso_segs to 1 by using UINT_MAX.
++ * We actually don't care about the exact number of segments
++ * emitted on the subflow. We need just to set tso_segs, because
++ * we still need an accurate packets_out count in
++ * tcp_event_new_data_sent.
++ */
++ tcp_set_skb_tso_segs(meta_sk, skb, UINT_MAX);
++
++ /* Check for nagle, regardless of tso_segs. If the segment is
++ * actually larger than mss_now (TSO segment), then
++ * tcp_nagle_check will have partial == false and always trigger
++ * the transmission.
++ * tcp_write_xmit has a TSO-level nagle check which is not
++ * subject to the MPTCP-level. It is based on the properties of
++ * the subflow, not the MPTCP-level.
++ */
++ if (unlikely(!tcp_nagle_test(meta_tp, skb, mss_now,
++ (tcp_skb_is_last(meta_sk, skb) ?
++ nonagle : TCP_NAGLE_PUSH))))
++ break;
++
++ limit = mss_now;
++ /* skb->len > mss_now is the equivalent of tso_segs > 1 in
++ * tcp_write_xmit. Otherwise split-point would return 0.
++ */
++ if (skb->len > mss_now && !tcp_urg_mode(meta_tp))
++ /* We limit the size of the skb so that it fits into the
++ * window. Call tcp_mss_split_point to avoid duplicating
++ * code.
++ * We really only care about fitting the skb into the
++ * window. That's why we use UINT_MAX. If the skb does
++ * not fit into the cwnd_quota or the NIC's max-segs
++ * limitation, it will be split by the subflow's
++ * tcp_write_xmit which does the appropriate call to
++ * tcp_mss_split_point.
++ */
++ limit = tcp_mss_split_point(meta_sk, skb, mss_now,
++ UINT_MAX / mss_now,
++ nonagle);
++
++ if (sublimit)
++ limit = min(limit, sublimit);
++
++ if (skb->len > limit &&
++ unlikely(mptcp_fragment(meta_sk, skb, limit, gfp, reinject)))
++ break;
++
++ if (!mptcp_skb_entail(subsk, skb, reinject))
++ break;
++ /* Nagle is handled at the MPTCP-layer, so
++ * always push on the subflow
++ */
++ __tcp_push_pending_frames(subsk, mss_now, TCP_NAGLE_PUSH);
++ TCP_SKB_CB(skb)->when = tcp_time_stamp;
++
++ if (!reinject) {
++ mptcp_check_sndseq_wrap(meta_tp,
++ TCP_SKB_CB(skb)->end_seq -
++ TCP_SKB_CB(skb)->seq);
++ tcp_event_new_data_sent(meta_sk, skb);
++ }
++
++ tcp_minshall_update(meta_tp, mss_now, skb);
++ sent_pkts += tcp_skb_pcount(skb);
++
++ if (reinject > 0) {
++ __skb_unlink(skb, &mpcb->reinject_queue);
++ kfree_skb(skb);
++ }
++
++ if (push_one)
++ break;
++ }
++
++ return !meta_tp->packets_out && tcp_send_head(meta_sk);
++}
++
++void mptcp_write_space(struct sock *sk)
++{
++ mptcp_push_pending_frames(mptcp_meta_sk(sk));
++}
++
++u32 __mptcp_select_window(struct sock *sk)
++{
++ struct inet_connection_sock *icsk = inet_csk(sk);
++ struct tcp_sock *tp = tcp_sk(sk), *meta_tp = mptcp_meta_tp(tp);
++ int mss, free_space, full_space, window;
++
++ /* MSS for the peer's data. Previous versions used mss_clamp
++ * here. I don't know if the value based on our guesses
++ * of peer's MSS is better for the performance. It's more correct
++ * but may be worse for the performance because of rcv_mss
++ * fluctuations. --SAW 1998/11/1
++ */
++ mss = icsk->icsk_ack.rcv_mss;
++ free_space = tcp_space(sk);
++ full_space = min_t(int, meta_tp->window_clamp,
++ tcp_full_space(sk));
++
++ if (mss > full_space)
++ mss = full_space;
++
++ if (free_space < (full_space >> 1)) {
++ icsk->icsk_ack.quick = 0;
++
++ if (tcp_memory_pressure)
++ /* TODO this has to be adapted when we support different
++ * MSS's among the subflows.
++ */
++ meta_tp->rcv_ssthresh = min(meta_tp->rcv_ssthresh,
++ 4U * meta_tp->advmss);
++
++ if (free_space < mss)
++ return 0;
++ }
++
++ if (free_space > meta_tp->rcv_ssthresh)
++ free_space = meta_tp->rcv_ssthresh;
++
++ /* Don't do rounding if we are using window scaling, since the
++ * scaled window will not line up with the MSS boundary anyway.
++ */
++ window = meta_tp->rcv_wnd;
++ if (tp->rx_opt.rcv_wscale) {
++ window = free_space;
++
++ /* Advertise enough space so that it won't get scaled away.
++ * Important case: prevent zero window announcement if
++ * 1<<rcv_wscale > mss.
++ */
++ if (((window >> tp->rx_opt.rcv_wscale) << tp->
++ rx_opt.rcv_wscale) != window)
++ window = (((window >> tp->rx_opt.rcv_wscale) + 1)
++ << tp->rx_opt.rcv_wscale);
++ } else {
++ /* Get the largest window that is a nice multiple of mss.
++ * Window clamp already applied above.
++ * If our current window offering is within 1 mss of the
++ * free space we just keep it. This prevents the divide
++ * and multiply from happening most of the time.
++ * We also don't do any window rounding when the free space
++ * is too small.
++ */
++ if (window <= free_space - mss || window > free_space)
++ window = (free_space / mss) * mss;
++ else if (mss == full_space &&
++ free_space > window + (full_space >> 1))
++ window = free_space;
++ }
++
++ return window;
++}
++
++void mptcp_syn_options(const struct sock *sk, struct tcp_out_options *opts,
++ unsigned *remaining)
++{
++ const struct tcp_sock *tp = tcp_sk(sk);
++
++ opts->options |= OPTION_MPTCP;
++ if (is_master_tp(tp)) {
++ opts->mptcp_options |= OPTION_MP_CAPABLE | OPTION_TYPE_SYN;
++ *remaining -= MPTCP_SUB_LEN_CAPABLE_SYN_ALIGN;
++ opts->mp_capable.sender_key = tp->mptcp_loc_key;
++ opts->dss_csum = !!sysctl_mptcp_checksum;
++ } else {
++ const struct mptcp_cb *mpcb = tp->mpcb;
++
++ opts->mptcp_options |= OPTION_MP_JOIN | OPTION_TYPE_SYN;
++ *remaining -= MPTCP_SUB_LEN_JOIN_SYN_ALIGN;
++ opts->mp_join_syns.token = mpcb->mptcp_rem_token;
++ opts->mp_join_syns.low_prio = tp->mptcp->low_prio;
++ opts->addr_id = tp->mptcp->loc_id;
++ opts->mp_join_syns.sender_nonce = tp->mptcp->mptcp_loc_nonce;
++ }
++}
++
++void mptcp_synack_options(struct request_sock *req,
++ struct tcp_out_options *opts, unsigned *remaining)
++{
++ struct mptcp_request_sock *mtreq;
++ mtreq = mptcp_rsk(req);
++
++ opts->options |= OPTION_MPTCP;
++ /* MPCB not yet set - thus it's a new MPTCP-session */
++ if (!mtreq->is_sub) {
++ opts->mptcp_options |= OPTION_MP_CAPABLE | OPTION_TYPE_SYNACK;
++ opts->mp_capable.sender_key = mtreq->mptcp_loc_key;
++ opts->dss_csum = !!sysctl_mptcp_checksum || mtreq->dss_csum;
++ *remaining -= MPTCP_SUB_LEN_CAPABLE_SYN_ALIGN;
++ } else {
++ opts->mptcp_options |= OPTION_MP_JOIN | OPTION_TYPE_SYNACK;
++ opts->mp_join_syns.sender_truncated_mac =
++ mtreq->mptcp_hash_tmac;
++ opts->mp_join_syns.sender_nonce = mtreq->mptcp_loc_nonce;
++ opts->mp_join_syns.low_prio = mtreq->low_prio;
++ opts->addr_id = mtreq->loc_id;
++ *remaining -= MPTCP_SUB_LEN_JOIN_SYNACK_ALIGN;
++ }
++}
++
++void mptcp_established_options(struct sock *sk, struct sk_buff *skb,
++ struct tcp_out_options *opts, unsigned *size)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct mptcp_cb *mpcb = tp->mpcb;
++ const struct tcp_skb_cb *tcb = skb ? TCP_SKB_CB(skb) : NULL;
++
++ /* We are coming from tcp_current_mss with the meta_sk as an argument.
++ * It does not make sense to check for the options, because when the
++ * segment gets sent, another subflow will be chosen.
++ */
++ if (!skb && is_meta_sk(sk))
++ return;
++
++ /* In fallback mp_fail-mode, we have to repeat it until the fallback
++ * has been done by the sender
++ */
++ if (unlikely(tp->mptcp->send_mp_fail)) {
++ opts->options |= OPTION_MPTCP;
++ opts->mptcp_options |= OPTION_MP_FAIL;
++ *size += MPTCP_SUB_LEN_FAIL;
++ return;
++ }
++
++ if (unlikely(tp->send_mp_fclose)) {
++ opts->options |= OPTION_MPTCP;
++ opts->mptcp_options |= OPTION_MP_FCLOSE;
++ opts->mp_capable.receiver_key = mpcb->mptcp_rem_key;
++ *size += MPTCP_SUB_LEN_FCLOSE_ALIGN;
++ return;
++ }
++
++ /* 1. If we are the sender of the infinite-mapping, we need the
++ * MPTCPHDR_INF-flag, because a retransmission of the
++ * infinite-announcement still needs the mptcp-option.
++ *
++ * We need infinite_cutoff_seq, because retransmissions from before
++ * the infinite-cutoff-moment still need the MPTCP-signalling to stay
++ * consistent.
++ *
++ * 2. If we are the receiver of the infinite-mapping, we always skip
++ * mptcp-options, because acknowledgments from before the
++ * infinite-mapping point have already been sent out.
++ *
++ * I know, the whole infinite-mapping stuff is ugly...
++ *
++ * TODO: Handle wrapped data-sequence numbers
++ * (even if it's very unlikely)
++ */
++ if (unlikely(mpcb->infinite_mapping_snd) &&
++ ((mpcb->send_infinite_mapping && tcb &&
++ mptcp_is_data_seq(skb) &&
++ !(tcb->mptcp_flags & MPTCPHDR_INF) &&
++ !before(tcb->seq, tp->mptcp->infinite_cutoff_seq)) ||
++ !mpcb->send_infinite_mapping))
++ return;
++
++ if (unlikely(tp->mptcp->include_mpc)) {
++ opts->options |= OPTION_MPTCP;
++ opts->mptcp_options |= OPTION_MP_CAPABLE |
++ OPTION_TYPE_ACK;
++ *size += MPTCP_SUB_LEN_CAPABLE_ACK_ALIGN;
++ opts->mp_capable.sender_key = mpcb->mptcp_loc_key;
++ opts->mp_capable.receiver_key = mpcb->mptcp_rem_key;
++ opts->dss_csum = mpcb->dss_csum;
++
++ if (skb)
++ tp->mptcp->include_mpc = 0;
++ }
++ if (unlikely(tp->mptcp->pre_established)) {
++ opts->options |= OPTION_MPTCP;
++ opts->mptcp_options |= OPTION_MP_JOIN | OPTION_TYPE_ACK;
++ *size += MPTCP_SUB_LEN_JOIN_ACK_ALIGN;
++ }
++
++ if (!tp->mptcp->include_mpc && !tp->mptcp->pre_established) {
++ opts->options |= OPTION_MPTCP;
++ opts->mptcp_options |= OPTION_DATA_ACK;
++ /* If !skb, we come from tcp_current_mss and thus we always
++ * assume that the DSS-option will be set for the data-packet.
++ */
++ if (skb && !mptcp_is_data_seq(skb)) {
++ *size += MPTCP_SUB_LEN_ACK_ALIGN;
++ } else {
++ /* Doesn't matter whether the csum is included or not. It will be
++ * either 10 or 12, and thus aligned = 12
++ */
++ *size += MPTCP_SUB_LEN_ACK_ALIGN +
++ MPTCP_SUB_LEN_SEQ_ALIGN;
++ }
++
++ *size += MPTCP_SUB_LEN_DSS_ALIGN;
++ }
++
++ if (unlikely(mpcb->addr_signal) && mpcb->pm_ops->addr_signal)
++ mpcb->pm_ops->addr_signal(sk, size, opts, skb);
++
++ if (unlikely(tp->mptcp->send_mp_prio) &&
++ MAX_TCP_OPTION_SPACE - *size >= MPTCP_SUB_LEN_PRIO_ALIGN) {
++ opts->options |= OPTION_MPTCP;
++ opts->mptcp_options |= OPTION_MP_PRIO;
++ if (skb)
++ tp->mptcp->send_mp_prio = 0;
++ *size += MPTCP_SUB_LEN_PRIO_ALIGN;
++ }
++
++ return;
++}
++
++u16 mptcp_select_window(struct sock *sk)
++{
++ u16 new_win = tcp_select_window(sk);
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct tcp_sock *meta_tp = mptcp_meta_tp(tp);
++
++ meta_tp->rcv_wnd = tp->rcv_wnd;
++ meta_tp->rcv_wup = meta_tp->rcv_nxt;
++
++ return new_win;
++}
++
++void mptcp_options_write(__be32 *ptr, struct tcp_sock *tp,
++ const struct tcp_out_options *opts,
++ struct sk_buff *skb)
++{
++ if (unlikely(OPTION_MP_CAPABLE & opts->mptcp_options)) {
++ struct mp_capable *mpc = (struct mp_capable *)ptr;
++
++ mpc->kind = TCPOPT_MPTCP;
++
++ if ((OPTION_TYPE_SYN & opts->mptcp_options) ||
++ (OPTION_TYPE_SYNACK & opts->mptcp_options)) {
++ mpc->sender_key = opts->mp_capable.sender_key;
++ mpc->len = MPTCP_SUB_LEN_CAPABLE_SYN;
++ ptr += MPTCP_SUB_LEN_CAPABLE_SYN_ALIGN >> 2;
++ } else if (OPTION_TYPE_ACK & opts->mptcp_options) {
++ mpc->sender_key = opts->mp_capable.sender_key;
++ mpc->receiver_key = opts->mp_capable.receiver_key;
++ mpc->len = MPTCP_SUB_LEN_CAPABLE_ACK;
++ ptr += MPTCP_SUB_LEN_CAPABLE_ACK_ALIGN >> 2;
++ }
++
++ mpc->sub = MPTCP_SUB_CAPABLE;
++ mpc->ver = 0;
++ mpc->a = opts->dss_csum;
++ mpc->b = 0;
++ mpc->rsv = 0;
++ mpc->h = 1;
++ }
++
++ if (unlikely(OPTION_MP_JOIN & opts->mptcp_options)) {
++ struct mp_join *mpj = (struct mp_join *)ptr;
++
++ mpj->kind = TCPOPT_MPTCP;
++ mpj->sub = MPTCP_SUB_JOIN;
++ mpj->rsv = 0;
++
++ if (OPTION_TYPE_SYN & opts->mptcp_options) {
++ mpj->len = MPTCP_SUB_LEN_JOIN_SYN;
++ mpj->u.syn.token = opts->mp_join_syns.token;
++ mpj->u.syn.nonce = opts->mp_join_syns.sender_nonce;
++ mpj->b = opts->mp_join_syns.low_prio;
++ mpj->addr_id = opts->addr_id;
++ ptr += MPTCP_SUB_LEN_JOIN_SYN_ALIGN >> 2;
++ } else if (OPTION_TYPE_SYNACK & opts->mptcp_options) {
++ mpj->len = MPTCP_SUB_LEN_JOIN_SYNACK;
++ mpj->u.synack.mac =
++ opts->mp_join_syns.sender_truncated_mac;
++ mpj->u.synack.nonce = opts->mp_join_syns.sender_nonce;
++ mpj->b = opts->mp_join_syns.low_prio;
++ mpj->addr_id = opts->addr_id;
++ ptr += MPTCP_SUB_LEN_JOIN_SYNACK_ALIGN >> 2;
++ } else if (OPTION_TYPE_ACK & opts->mptcp_options) {
++ mpj->len = MPTCP_SUB_LEN_JOIN_ACK;
++ mpj->addr_id = 0; /* addr_id is rsv (RFC 6824, p. 21) */
++ memcpy(mpj->u.ack.mac, &tp->mptcp->sender_mac[0], 20);
++ ptr += MPTCP_SUB_LEN_JOIN_ACK_ALIGN >> 2;
++ }
++ }
++ if (unlikely(OPTION_ADD_ADDR & opts->mptcp_options)) {
++ struct mp_add_addr *mpadd = (struct mp_add_addr *)ptr;
++
++ mpadd->kind = TCPOPT_MPTCP;
++ if (opts->add_addr_v4) {
++ mpadd->len = MPTCP_SUB_LEN_ADD_ADDR4;
++ mpadd->sub = MPTCP_SUB_ADD_ADDR;
++ mpadd->ipver = 4;
++ mpadd->addr_id = opts->add_addr4.addr_id;
++ mpadd->u.v4.addr = opts->add_addr4.addr;
++ ptr += MPTCP_SUB_LEN_ADD_ADDR4_ALIGN >> 2;
++ } else if (opts->add_addr_v6) {
++ mpadd->len = MPTCP_SUB_LEN_ADD_ADDR6;
++ mpadd->sub = MPTCP_SUB_ADD_ADDR;
++ mpadd->ipver = 6;
++ mpadd->addr_id = opts->add_addr6.addr_id;
++ memcpy(&mpadd->u.v6.addr, &opts->add_addr6.addr,
++ sizeof(mpadd->u.v6.addr));
++ ptr += MPTCP_SUB_LEN_ADD_ADDR6_ALIGN >> 2;
++ }
++ }
++ if (unlikely(OPTION_REMOVE_ADDR & opts->mptcp_options)) {
++ struct mp_remove_addr *mprem = (struct mp_remove_addr *)ptr;
++ u8 *addrs_id;
++ int id, len, len_align;
++
++ len = mptcp_sub_len_remove_addr(opts->remove_addrs);
++ len_align = mptcp_sub_len_remove_addr_align(opts->remove_addrs);
++
++ mprem->kind = TCPOPT_MPTCP;
++ mprem->len = len;
++ mprem->sub = MPTCP_SUB_REMOVE_ADDR;
++ mprem->rsv = 0;
++ addrs_id = &mprem->addrs_id;
++
++ mptcp_for_each_bit_set(opts->remove_addrs, id)
++ *(addrs_id++) = id;
++
++ /* Fill the rest with NOP's */
++ if (len_align > len) {
++ int i;
++ for (i = 0; i < len_align - len; i++)
++ *(addrs_id++) = TCPOPT_NOP;
++ }
++
++ ptr += len_align >> 2;
++ }
++ if (unlikely(OPTION_MP_FAIL & opts->mptcp_options)) {
++ struct mp_fail *mpfail = (struct mp_fail *)ptr;
++
++ mpfail->kind = TCPOPT_MPTCP;
++ mpfail->len = MPTCP_SUB_LEN_FAIL;
++ mpfail->sub = MPTCP_SUB_FAIL;
++ mpfail->rsv1 = 0;
++ mpfail->rsv2 = 0;
++ mpfail->data_seq = htonll(tp->mpcb->csum_cutoff_seq);
++
++ ptr += MPTCP_SUB_LEN_FAIL_ALIGN >> 2;
++ }
++ if (unlikely(OPTION_MP_FCLOSE & opts->mptcp_options)) {
++ struct mp_fclose *mpfclose = (struct mp_fclose *)ptr;
++
++ mpfclose->kind = TCPOPT_MPTCP;
++ mpfclose->len = MPTCP_SUB_LEN_FCLOSE;
++ mpfclose->sub = MPTCP_SUB_FCLOSE;
++ mpfclose->rsv1 = 0;
++ mpfclose->rsv2 = 0;
++ mpfclose->key = opts->mp_capable.receiver_key;
++
++ ptr += MPTCP_SUB_LEN_FCLOSE_ALIGN >> 2;
++ }
++
++ if (OPTION_DATA_ACK & opts->mptcp_options) {
++ if (!mptcp_is_data_seq(skb))
++ ptr += mptcp_write_dss_data_ack(tp, skb, ptr);
++ else
++ ptr += mptcp_write_dss_data_seq(tp, skb, ptr);
++ }
++ if (unlikely(OPTION_MP_PRIO & opts->mptcp_options)) {
++ struct mp_prio *mpprio = (struct mp_prio *)ptr;
++
++ mpprio->kind = TCPOPT_MPTCP;
++ mpprio->len = MPTCP_SUB_LEN_PRIO;
++ mpprio->sub = MPTCP_SUB_PRIO;
++ mpprio->rsv = 0;
++ mpprio->b = tp->mptcp->low_prio;
++ mpprio->addr_id = TCPOPT_NOP;
++
++ ptr += MPTCP_SUB_LEN_PRIO_ALIGN >> 2;
++ }
++}
++
++/* Sends the DATA_FIN */
++void mptcp_send_fin(struct sock *meta_sk)
++{
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct sk_buff *skb = tcp_write_queue_tail(meta_sk);
++ int mss_now;
++
++ if ((1 << meta_sk->sk_state) & (TCPF_CLOSE_WAIT | TCPF_LAST_ACK))
++ meta_tp->mpcb->passive_close = 1;
++
++ /* Optimization, tack on the FIN if we have a queue of
++ * unsent frames. But be careful about outgoing SACKS
++ * and IP options.
++ */
++ mss_now = mptcp_current_mss(meta_sk);
++
++ if (tcp_send_head(meta_sk) != NULL) {
++ TCP_SKB_CB(skb)->mptcp_flags |= MPTCPHDR_FIN;
++ TCP_SKB_CB(skb)->end_seq++;
++ meta_tp->write_seq++;
++ } else {
++ /* Socket is locked, keep trying until memory is available. */
++ for (;;) {
++ skb = alloc_skb_fclone(MAX_TCP_HEADER,
++ meta_sk->sk_allocation);
++ if (skb)
++ break;
++ yield();
++ }
++ /* Reserve space for headers and prepare control bits. */
++ skb_reserve(skb, MAX_TCP_HEADER);
++
++ tcp_init_nondata_skb(skb, meta_tp->write_seq, TCPHDR_ACK);
++ TCP_SKB_CB(skb)->end_seq++;
++ TCP_SKB_CB(skb)->mptcp_flags |= MPTCPHDR_FIN;
++ tcp_queue_skb(meta_sk, skb);
++ }
++ __tcp_push_pending_frames(meta_sk, mss_now, TCP_NAGLE_OFF);
++}
++
++void mptcp_send_active_reset(struct sock *meta_sk, gfp_t priority)
++{
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct mptcp_cb *mpcb = meta_tp->mpcb;
++ struct sock *sk = NULL, *sk_it = NULL, *tmpsk;
++
++ if (!mpcb->cnt_subflows)
++ return;
++
++ WARN_ON(meta_tp->send_mp_fclose);
++
++ /* First - select a socket */
++ sk = mptcp_select_ack_sock(meta_sk);
++
++ /* May happen if no subflow is in an appropriate state */
++ if (!sk)
++ return;
++
++ /* We are in infinite mode - just send a reset */
++ if (mpcb->infinite_mapping_snd || mpcb->infinite_mapping_rcv) {
++ sk->sk_err = ECONNRESET;
++ if (tcp_need_reset(sk->sk_state))
++ tcp_send_active_reset(sk, priority);
++ mptcp_sub_force_close(sk);
++ return;
++ }
++
++
++ tcp_sk(sk)->send_mp_fclose = 1;
++ /* Reset all other subflows */
++
++ /* tcp_done must be handled with bh disabled */
++ if (!in_serving_softirq())
++ local_bh_disable();
++
++ mptcp_for_each_sk_safe(mpcb, sk_it, tmpsk) {
++ if (tcp_sk(sk_it)->send_mp_fclose)
++ continue;
++
++ sk_it->sk_err = ECONNRESET;
++ if (tcp_need_reset(sk_it->sk_state))
++ tcp_send_active_reset(sk_it, GFP_ATOMIC);
++ mptcp_sub_force_close(sk_it);
++ }
++
++ if (!in_serving_softirq())
++ local_bh_enable();
++
++ tcp_send_ack(sk);
++ inet_csk_reset_keepalive_timer(sk, inet_csk(sk)->icsk_rto);
++
++ meta_tp->send_mp_fclose = 1;
++}
++
++static void mptcp_ack_retransmit_timer(struct sock *sk)
++{
++ struct sk_buff *skb;
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct inet_connection_sock *icsk = inet_csk(sk);
++
++ if (inet_csk(sk)->icsk_af_ops->rebuild_header(sk))
++ goto out; /* Routing failure or similar */
++
++ if (!tp->retrans_stamp)
++ tp->retrans_stamp = tcp_time_stamp ? : 1;
++
++ if (tcp_write_timeout(sk)) {
++ tp->mptcp->pre_established = 0;
++ sk_stop_timer(sk, &tp->mptcp->mptcp_ack_timer);
++ tp->ops->send_active_reset(sk, GFP_ATOMIC);
++ goto out;
++ }
++
++ skb = alloc_skb(MAX_TCP_HEADER, GFP_ATOMIC);
++ if (skb == NULL) {
++ sk_reset_timer(sk, &tp->mptcp->mptcp_ack_timer,
++ jiffies + icsk->icsk_rto);
++ return;
++ }
++
++ /* Reserve space for headers and prepare control bits */
++ skb_reserve(skb, MAX_TCP_HEADER);
++ tcp_init_nondata_skb(skb, tp->snd_una, TCPHDR_ACK);
++
++ TCP_SKB_CB(skb)->when = tcp_time_stamp;
++ if (tcp_transmit_skb(sk, skb, 0, GFP_ATOMIC) > 0) {
++ /* Retransmission failed because of local congestion,
++ * do not backoff.
++ */
++ if (!icsk->icsk_retransmits)
++ icsk->icsk_retransmits = 1;
++ sk_reset_timer(sk, &tp->mptcp->mptcp_ack_timer,
++ jiffies + icsk->icsk_rto);
++ return;
++ }
++
++
++ icsk->icsk_retransmits++;
++ icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX);
++ sk_reset_timer(sk, &tp->mptcp->mptcp_ack_timer,
++ jiffies + icsk->icsk_rto);
++ if (retransmits_timed_out(sk, sysctl_tcp_retries1 + 1, 0, 0))
++ __sk_dst_reset(sk);
++
++out:;
++}
++
++void mptcp_ack_handler(unsigned long data)
++{
++ struct sock *sk = (struct sock *)data;
++ struct sock *meta_sk = mptcp_meta_sk(sk);
++
++ bh_lock_sock(meta_sk);
++ if (sock_owned_by_user(meta_sk)) {
++ /* Try again later */
++ sk_reset_timer(sk, &tcp_sk(sk)->mptcp->mptcp_ack_timer,
++ jiffies + (HZ / 20));
++ goto out_unlock;
++ }
++
++ if (sk->sk_state == TCP_CLOSE)
++ goto out_unlock;
++ if (!tcp_sk(sk)->mptcp->pre_established)
++ goto out_unlock;
++
++ mptcp_ack_retransmit_timer(sk);
++
++ sk_mem_reclaim(sk);
++
++out_unlock:
++ bh_unlock_sock(meta_sk);
++ sock_put(sk);
++}
++
++/* Similar to tcp_retransmit_skb
++ *
++ * The diff is that we handle the retransmission-stats (retrans_stamp) at the
++ * meta-level.
++ */
++int mptcp_retransmit_skb(struct sock *meta_sk, struct sk_buff *skb)
++{
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct sock *subsk;
++ unsigned int limit, mss_now;
++ int err = -1;
++
++ /* Do not send more than we queued. 1/4 is reserved for possible
++ * copying overhead: fragmentation, tunneling, mangling etc.
++ *
++ * This is a meta-retransmission thus we check on the meta-socket.
++ */
++ if (atomic_read(&meta_sk->sk_wmem_alloc) >
++ min(meta_sk->sk_wmem_queued + (meta_sk->sk_wmem_queued >> 2), meta_sk->sk_sndbuf)) {
++ return -EAGAIN;
++ }
++
++ /* We need to make sure that the retransmitted segment can be sent on a
++ * subflow right now. If it is too big, it needs to be fragmented.
++ */
++ subsk = meta_tp->mpcb->sched_ops->get_subflow(meta_sk, skb, false);
++ if (!subsk) {
++ /* We want to increase icsk_retransmits, thus return 0, so that
++ * mptcp_retransmit_timer enters the desired branch.
++ */
++ err = 0;
++ goto failed;
++ }
++ mss_now = tcp_current_mss(subsk);
++
++ /* If the segment was cloned (e.g. a meta retransmission), the header
++ * must be expanded/copied so that there is no corruption of TSO
++ * information.
++ */
++ if (skb_unclone(skb, GFP_ATOMIC)) {
++ err = -ENOMEM;
++ goto failed;
++ }
++
++ /* Must have been set by mptcp_write_xmit before */
++ BUG_ON(!tcp_skb_pcount(skb));
++
++ limit = mss_now;
++ /* skb->len > mss_now is the equivalent of tso_segs > 1 in
++ * tcp_write_xmit. Otherwise split-point would return 0.
++ */
++ if (skb->len > mss_now && !tcp_urg_mode(meta_tp))
++ limit = tcp_mss_split_point(meta_sk, skb, mss_now,
++ UINT_MAX / mss_now,
++ TCP_NAGLE_OFF);
++
++ if (skb->len > limit &&
++ unlikely(mptcp_fragment(meta_sk, skb, limit,
++ GFP_ATOMIC, 0)))
++ goto failed;
++
++ if (!mptcp_skb_entail(subsk, skb, -1))
++ goto failed;
++ TCP_SKB_CB(skb)->when = tcp_time_stamp;
++
++ /* Update global TCP statistics. */
++ TCP_INC_STATS(sock_net(meta_sk), TCP_MIB_RETRANSSEGS);
++
++ /* Diff to tcp_retransmit_skb */
++
++ /* Save stamp of the first retransmit. */
++ if (!meta_tp->retrans_stamp)
++ meta_tp->retrans_stamp = TCP_SKB_CB(skb)->when;
++
++ __tcp_push_pending_frames(subsk, mss_now, TCP_NAGLE_PUSH);
++
++ return 0;
++
++failed:
++ NET_INC_STATS_BH(sock_net(meta_sk), LINUX_MIB_TCPRETRANSFAIL);
++ return err;
++}
++
++/* Similar to tcp_retransmit_timer
++ *
++ * The diff is that we have to handle retransmissions of the FAST_CLOSE-message
++ * and that we don't have an srtt estimation at the meta-level.
++ */
++void mptcp_retransmit_timer(struct sock *meta_sk)
++{
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct mptcp_cb *mpcb = meta_tp->mpcb;
++ struct inet_connection_sock *meta_icsk = inet_csk(meta_sk);
++ int err;
++
++ /* In fallback, retransmission is handled at the subflow-level */
++ if (!meta_tp->packets_out || mpcb->infinite_mapping_snd ||
++ mpcb->send_infinite_mapping)
++ return;
++
++ WARN_ON(tcp_write_queue_empty(meta_sk));
++
++ if (!meta_tp->snd_wnd && !sock_flag(meta_sk, SOCK_DEAD) &&
++ !((1 << meta_sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV))) {
++ /* Receiver dastardly shrinks window. Our retransmits
++ * become zero probes, but we should not timeout this
++ * connection. If the socket is an orphan, time it out,
++ * we cannot allow such beasts to hang infinitely.
++ */
++ struct inet_sock *meta_inet = inet_sk(meta_sk);
++ if (meta_sk->sk_family == AF_INET) {
++ LIMIT_NETDEBUG(KERN_DEBUG "MPTCP: Peer %pI4:%u/%u unexpectedly shrunk window %u:%u (repaired)\n",
++ &meta_inet->inet_daddr,
++ ntohs(meta_inet->inet_dport),
++ meta_inet->inet_num, meta_tp->snd_una,
++ meta_tp->snd_nxt);
++ }
++#if IS_ENABLED(CONFIG_IPV6)
++ else if (meta_sk->sk_family == AF_INET6) {
++ LIMIT_NETDEBUG(KERN_DEBUG "MPTCP: Peer %pI6:%u/%u unexpectedly shrunk window %u:%u (repaired)\n",
++ &meta_sk->sk_v6_daddr,
++ ntohs(meta_inet->inet_dport),
++ meta_inet->inet_num, meta_tp->snd_una,
++ meta_tp->snd_nxt);
++ }
++#endif
++ if (tcp_time_stamp - meta_tp->rcv_tstamp > TCP_RTO_MAX) {
++ tcp_write_err(meta_sk);
++ return;
++ }
++
++ mptcp_retransmit_skb(meta_sk, tcp_write_queue_head(meta_sk));
++ goto out_reset_timer;
++ }
++
++ if (tcp_write_timeout(meta_sk))
++ return;
++
++ if (meta_icsk->icsk_retransmits == 0)
++ NET_INC_STATS_BH(sock_net(meta_sk), LINUX_MIB_TCPTIMEOUTS);
++
++ meta_icsk->icsk_ca_state = TCP_CA_Loss;
++
++ err = mptcp_retransmit_skb(meta_sk, tcp_write_queue_head(meta_sk));
++ if (err > 0) {
++ /* Retransmission failed because of local congestion,
++ * do not backoff.
++ */
++ if (!meta_icsk->icsk_retransmits)
++ meta_icsk->icsk_retransmits = 1;
++ inet_csk_reset_xmit_timer(meta_sk, ICSK_TIME_RETRANS,
++ min(meta_icsk->icsk_rto, TCP_RESOURCE_PROBE_INTERVAL),
++ TCP_RTO_MAX);
++ return;
++ }
++
++ /* Increase the timeout each time we retransmit. Note that
++ * we do not increase the rtt estimate. rto is initialized
++ * from rtt, but increases here. Jacobson (SIGCOMM 88) suggests
++ * that doubling rto each time is the least we can get away with.
++ * In KA9Q, Karn uses this for the first few times, and then
++ * goes to quadratic. netBSD doubles, but only goes up to *64,
++ * and clamps at 1 to 64 sec afterwards. Note that 120 sec is
++ * defined in the protocol as the maximum possible RTT. I guess
++ * we'll have to use something other than TCP to talk to the
++ * University of Mars.
++ *
++ * PAWS allows us longer timeouts and large windows, so once
++ * implemented ftp to mars will work nicely. We will have to fix
++ * the 120 second clamps though!
++ */
++ meta_icsk->icsk_backoff++;
++ meta_icsk->icsk_retransmits++;
++
++out_reset_timer:
++ /* If stream is thin, use linear timeouts. Since 'icsk_backoff' is
++ * used to reset timer, set to 0. Recalculate 'icsk_rto' as this
++ * might be increased if the stream oscillates between thin and thick,
++ * thus the old value might already be too high compared to the value
++ * set by 'tcp_set_rto' in tcp_input.c which resets the rto without
++ * backoff. Limit to TCP_THIN_LINEAR_RETRIES before initiating
++ * exponential backoff behaviour, to avoid continuing to hammer
++ * linear-timeout retransmissions into a black hole
++ */
++ if (meta_sk->sk_state == TCP_ESTABLISHED &&
++ (meta_tp->thin_lto || sysctl_tcp_thin_linear_timeouts) &&
++ tcp_stream_is_thin(meta_tp) &&
++ meta_icsk->icsk_retransmits <= TCP_THIN_LINEAR_RETRIES) {
++ meta_icsk->icsk_backoff = 0;
++ /* We cannot do the same as in tcp_write_timer because the
++ * srtt is not set here.
++ */
++ mptcp_set_rto(meta_sk);
++ } else {
++ /* Use normal (exponential) backoff */
++ meta_icsk->icsk_rto = min(meta_icsk->icsk_rto << 1, TCP_RTO_MAX);
++ }
++ inet_csk_reset_xmit_timer(meta_sk, ICSK_TIME_RETRANS, meta_icsk->icsk_rto, TCP_RTO_MAX);
++
++ return;
++}
++
++/* Modify values to an mptcp-level for the initial window of new subflows */
++void mptcp_select_initial_window(int __space, __u32 mss, __u32 *rcv_wnd,
++ __u32 *window_clamp, int wscale_ok,
++ __u8 *rcv_wscale, __u32 init_rcv_wnd,
++ const struct sock *sk)
++{
++ struct mptcp_cb *mpcb = tcp_sk(sk)->mpcb;
++
++ *window_clamp = mpcb->orig_window_clamp;
++ __space = tcp_win_from_space(mpcb->orig_sk_rcvbuf);
++
++ tcp_select_initial_window(__space, mss, rcv_wnd, window_clamp,
++ wscale_ok, rcv_wscale, init_rcv_wnd, sk);
++}
++
++static inline u64 mptcp_calc_rate(const struct sock *meta_sk, unsigned int mss,
++ unsigned int (*mss_cb)(struct sock *sk))
++{
++ struct sock *sk;
++ u64 rate = 0;
++
++ mptcp_for_each_sk(tcp_sk(meta_sk)->mpcb, sk) {
++ struct tcp_sock *tp = tcp_sk(sk);
++ int this_mss;
++ u64 this_rate;
++
++ if (!mptcp_sk_can_send(sk))
++ continue;
++
++ /* Do not consider subflows without an RTT estimate yet,
++ * otherwise this_rate >>> rate.
++ */
++ if (unlikely(!tp->srtt_us))
++ continue;
++
++ this_mss = mss_cb(sk);
++
++ /* If this_mss is smaller than mss, it means that a segment will
++ * be split into two (or more) when pushed on this subflow. If
++ * you consider that mss = 1428 and this_mss = 1420 then two
++ * segments will be generated: a 1420-byte and 8-byte segment.
++ * The latter will introduce a large overhead as for a single
++ * data segment 2 slots will be used in the congestion window.
++ * Therefore reducing by ~2 the potential throughput of this
++ * subflow. Indeed, 1428 will be sent while 2840 could have been
++ * sent if mss == 1420 reducing the throughput by 2840 / 1428.
++ *
++ * The following algorithm takes this overhead into account
++ * when computing the potential throughput that MPTCP can
++ * achieve when generating mss-byte segments.
++ *
++ * The formula is the following:
++ * \sum_{\forall sub} ratio * \frac{mss * cwnd_sub}{rtt_sub}
++ * Where ratio is computed as follows:
++ * \frac{mss}{\ceil{mss / mss_sub} * mss_sub}
++ *
++ * ratio gives the reduction factor of the theoretical
++ * throughput a subflow can achieve if MPTCP uses a specific
++ * MSS value.
++ */
++ this_rate = div64_u64((u64)mss * mss * (USEC_PER_SEC << 3) *
++ max(tp->snd_cwnd, tp->packets_out),
++ (u64)tp->srtt_us *
++ DIV_ROUND_UP(mss, this_mss) * this_mss);
++ rate += this_rate;
++ }
++
++ return rate;
++}
++
++static unsigned int __mptcp_current_mss(const struct sock *meta_sk,
++ unsigned int (*mss_cb)(struct sock *sk))
++{
++ unsigned int mss = 0;
++ u64 rate = 0;
++ struct sock *sk;
++
++ mptcp_for_each_sk(tcp_sk(meta_sk)->mpcb, sk) {
++ int this_mss;
++ u64 this_rate;
++
++ if (!mptcp_sk_can_send(sk))
++ continue;
++
++ this_mss = mss_cb(sk);
++
++ /* Same mss values will produce the same throughput. */
++ if (this_mss == mss)
++ continue;
++
++ /* See whether using this mss value can theoretically improve
++ * the performance.
++ */
++ this_rate = mptcp_calc_rate(meta_sk, this_mss, mss_cb);
++ if (this_rate >= rate) {
++ mss = this_mss;
++ rate = this_rate;
++ }
++ }
++
++ return mss;
++}
++
++unsigned int mptcp_current_mss(struct sock *meta_sk)
++{
++ unsigned int mss = __mptcp_current_mss(meta_sk, tcp_current_mss);
++
++ /* If no subflow is available, we take a default-mss from the
++ * meta-socket.
++ */
++ return !mss ? tcp_current_mss(meta_sk) : mss;
++}
++
++static unsigned int mptcp_select_size_mss(struct sock *sk)
++{
++ return tcp_sk(sk)->mss_cache;
++}
++
++int mptcp_select_size(const struct sock *meta_sk, bool sg)
++{
++ unsigned int mss = __mptcp_current_mss(meta_sk, mptcp_select_size_mss);
++
++ if (sg) {
++ if (mptcp_sk_can_gso(meta_sk)) {
++ mss = SKB_WITH_OVERHEAD(2048 - MAX_TCP_HEADER);
++ } else {
++ int pgbreak = SKB_MAX_HEAD(MAX_TCP_HEADER);
++
++ if (mss >= pgbreak &&
++ mss <= pgbreak + (MAX_SKB_FRAGS - 1) * PAGE_SIZE)
++ mss = pgbreak;
++ }
++ }
++
++ return !mss ? tcp_sk(meta_sk)->mss_cache : mss;
++}
++
++int mptcp_check_snd_buf(const struct tcp_sock *tp)
++{
++ const struct sock *sk;
++ u32 rtt_max = tp->srtt_us;
++ u64 bw_est;
++
++ if (!tp->srtt_us)
++ return tp->reordering + 1;
++
++ mptcp_for_each_sk(tp->mpcb, sk) {
++ if (!mptcp_sk_can_send(sk))
++ continue;
++
++ if (rtt_max < tcp_sk(sk)->srtt_us)
++ rtt_max = tcp_sk(sk)->srtt_us;
++ }
++
++ bw_est = div64_u64(((u64)tp->snd_cwnd * rtt_max) << 16,
++ (u64)tp->srtt_us);
++
++ return max_t(unsigned int, (u32)(bw_est >> 16),
++ tp->reordering + 1);
++}
++
++unsigned int mptcp_xmit_size_goal(const struct sock *meta_sk, u32 mss_now,
++ int large_allowed)
++{
++ struct sock *sk;
++ u32 xmit_size_goal = 0;
++
++ if (large_allowed && mptcp_sk_can_gso(meta_sk)) {
++ mptcp_for_each_sk(tcp_sk(meta_sk)->mpcb, sk) {
++ int this_size_goal;
++
++ if (!mptcp_sk_can_send(sk))
++ continue;
++
++ this_size_goal = tcp_xmit_size_goal(sk, mss_now, 1);
++ if (this_size_goal > xmit_size_goal)
++ xmit_size_goal = this_size_goal;
++ }
++ }
++
++ return max(xmit_size_goal, mss_now);
++}
++
++/* Similar to tcp_trim_head - but we correctly copy the DSS-option */
++int mptcp_trim_head(struct sock *sk, struct sk_buff *skb, u32 len)
++{
++ if (skb_cloned(skb)) {
++ if (pskb_expand_head(skb, 0, 0, GFP_ATOMIC))
++ return -ENOMEM;
++ }
++
++ __pskb_trim_head(skb, len);
++
++ TCP_SKB_CB(skb)->seq += len;
++ skb->ip_summed = CHECKSUM_PARTIAL;
++
++ skb->truesize -= len;
++ sk->sk_wmem_queued -= len;
++ sk_mem_uncharge(sk, len);
++ sock_set_flag(sk, SOCK_QUEUE_SHRUNK);
++
++ /* Any change of skb->len requires recalculation of tso factor. */
++ if (tcp_skb_pcount(skb) > 1)
++ tcp_set_skb_tso_segs(sk, skb, tcp_skb_mss(skb));
++
++ return 0;
++}
+diff --git a/net/mptcp/mptcp_pm.c b/net/mptcp/mptcp_pm.c
+new file mode 100644
+index 000000000000..9542f950729f
+--- /dev/null
++++ b/net/mptcp/mptcp_pm.c
+@@ -0,0 +1,169 @@
++/*
++ * MPTCP implementation - MPTCP-subflow-management
++ *
++ * Initial Design & Implementation:
++ * Sébastien Barré <sebastien.barre@uclouvain.be>
++ *
++ * Current Maintainer & Author:
++ * Christoph Paasch <christoph.paasch@uclouvain.be>
++ *
++ * Additional authors:
++ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
++ * Gregory Detal <gregory.detal@uclouvain.be>
++ * Fabien Duchêne <fabien.duchene@uclouvain.be>
++ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
++ * Lavkesh Lahngir <lavkesh51@gmail.com>
++ * Andreas Ripke <ripke@neclab.eu>
++ * Vlad Dogaru <vlad.dogaru@intel.com>
++ * Octavian Purdila <octavian.purdila@intel.com>
++ * John Ronan <jronan@tssg.org>
++ * Catalin Nicutar <catalin.nicutar@gmail.com>
++ * Brandon Heller <brandonh@stanford.edu>
++ *
++ *
++ * This program is free software; you can redistribute it and/or
++ * modify it under the terms of the GNU General Public License
++ * as published by the Free Software Foundation; either version
++ * 2 of the License, or (at your option) any later version.
++ */
++
++
++#include <linux/module.h>
++#include <net/mptcp.h>
++
++static DEFINE_SPINLOCK(mptcp_pm_list_lock);
++static LIST_HEAD(mptcp_pm_list);
++
++static int mptcp_default_id(sa_family_t family, union inet_addr *addr,
++ struct net *net, bool *low_prio)
++{
++ return 0;
++}
++
++struct mptcp_pm_ops mptcp_pm_default = {
++ .get_local_id = mptcp_default_id, /* We do not care */
++ .name = "default",
++ .owner = THIS_MODULE,
++};
++
++static struct mptcp_pm_ops *mptcp_pm_find(const char *name)
++{
++ struct mptcp_pm_ops *e;
++
++ list_for_each_entry_rcu(e, &mptcp_pm_list, list) {
++ if (strcmp(e->name, name) == 0)
++ return e;
++ }
++
++ return NULL;
++}
++
++int mptcp_register_path_manager(struct mptcp_pm_ops *pm)
++{
++ int ret = 0;
++
++ if (!pm->get_local_id)
++ return -EINVAL;
++
++ spin_lock(&mptcp_pm_list_lock);
++ if (mptcp_pm_find(pm->name)) {
++ pr_notice("%s already registered\n", pm->name);
++ ret = -EEXIST;
++ } else {
++ list_add_tail_rcu(&pm->list, &mptcp_pm_list);
++ pr_info("%s registered\n", pm->name);
++ }
++ spin_unlock(&mptcp_pm_list_lock);
++
++ return ret;
++}
++EXPORT_SYMBOL_GPL(mptcp_register_path_manager);
++
++void mptcp_unregister_path_manager(struct mptcp_pm_ops *pm)
++{
++ spin_lock(&mptcp_pm_list_lock);
++ list_del_rcu(&pm->list);
++ spin_unlock(&mptcp_pm_list_lock);
++}
++EXPORT_SYMBOL_GPL(mptcp_unregister_path_manager);
++
++void mptcp_get_default_path_manager(char *name)
++{
++ struct mptcp_pm_ops *pm;
++
++ BUG_ON(list_empty(&mptcp_pm_list));
++
++ rcu_read_lock();
++ pm = list_entry(mptcp_pm_list.next, struct mptcp_pm_ops, list);
++ strncpy(name, pm->name, MPTCP_PM_NAME_MAX);
++ rcu_read_unlock();
++}
++
++int mptcp_set_default_path_manager(const char *name)
++{
++ struct mptcp_pm_ops *pm;
++ int ret = -ENOENT;
++
++ spin_lock(&mptcp_pm_list_lock);
++ pm = mptcp_pm_find(name);
++#ifdef CONFIG_MODULES
++ if (!pm && capable(CAP_NET_ADMIN)) {
++ spin_unlock(&mptcp_pm_list_lock);
++
++ request_module("mptcp_%s", name);
++ spin_lock(&mptcp_pm_list_lock);
++ pm = mptcp_pm_find(name);
++ }
++#endif
++
++ if (pm) {
++ list_move(&pm->list, &mptcp_pm_list);
++ ret = 0;
++ } else {
++ pr_info("%s is not available\n", name);
++ }
++ spin_unlock(&mptcp_pm_list_lock);
++
++ return ret;
++}
++
++void mptcp_init_path_manager(struct mptcp_cb *mpcb)
++{
++ struct mptcp_pm_ops *pm;
++
++ rcu_read_lock();
++ list_for_each_entry_rcu(pm, &mptcp_pm_list, list) {
++ if (try_module_get(pm->owner)) {
++ mpcb->pm_ops = pm;
++ break;
++ }
++ }
++ rcu_read_unlock();
++}
++
++/* Manage refcounts on socket close. */
++void mptcp_cleanup_path_manager(struct mptcp_cb *mpcb)
++{
++ module_put(mpcb->pm_ops->owner);
++}
++
++/* Fallback to the default path-manager. */
++void mptcp_fallback_default(struct mptcp_cb *mpcb)
++{
++ struct mptcp_pm_ops *pm;
++
++ mptcp_cleanup_path_manager(mpcb);
++ pm = mptcp_pm_find("default");
++
++ /* Cannot fail - it's the default module */
++ try_module_get(pm->owner);
++ mpcb->pm_ops = pm;
++}
++EXPORT_SYMBOL_GPL(mptcp_fallback_default);
++
++/* Set default value from kernel configuration at bootup */
++static int __init mptcp_path_manager_default(void)
++{
++ return mptcp_set_default_path_manager(CONFIG_DEFAULT_MPTCP_PM);
++}
++late_initcall(mptcp_path_manager_default);
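The path-manager registry above keeps all registered entries on one list and treats the list head as the default: `mptcp_set_default_path_manager()` just moves the chosen entry to the front, which is exactly what `mptcp_get_default_path_manager()` reads back. The following is a minimal userspace sketch of that convention, not the kernel code; a plain singly linked list and invented names (`ops_find`, `ops_set_default`) stand in for the kernel's RCU list and locking.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct ops { const char *name; struct ops *next; };

/* Linear lookup by name, as in mptcp_pm_find(). */
static struct ops *ops_find(struct ops *head, const char *name)
{
	for (; head; head = head->next)
		if (strcmp(head->name, name) == 0)
			return head;
	return NULL;
}

/* Move the named entry to the head of the list, making it the default.
 * Returns 0 on success, -1 if no entry with that name is registered. */
static int ops_set_default(struct ops **head, const char *name)
{
	struct ops **pp, *e;

	for (pp = head; *pp; pp = &(*pp)->next) {
		if (strcmp((*pp)->name, name) == 0) {
			e = *pp;
			*pp = e->next;	/* unlink from current position */
			e->next = *head;	/* push to the front */
			*head = e;
			return 0;
		}
	}
	return -1;
}
```

The same pattern is reused for the scheduler list later in this patch; only the lock, the RCU list primitives, and the `request_module()` fallback differ.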
+diff --git a/net/mptcp/mptcp_rr.c b/net/mptcp/mptcp_rr.c
+new file mode 100644
+index 000000000000..93278f684069
+--- /dev/null
++++ b/net/mptcp/mptcp_rr.c
+@@ -0,0 +1,301 @@
++/* MPTCP Scheduler module selector. Highly inspired by tcp_cong.c */
++
++#include <linux/module.h>
++#include <net/mptcp.h>
++
++static unsigned char num_segments __read_mostly = 1;
++module_param(num_segments, byte, 0644);
++MODULE_PARM_DESC(num_segments, "The number of consecutive segments that are part of a burst");
++
++static bool cwnd_limited __read_mostly = 1;
++module_param(cwnd_limited, bool, 0644);
++MODULE_PARM_DESC(cwnd_limited, "if set to 1, the scheduler tries to fill the congestion-window on all subflows");
++
++struct rrsched_priv {
++ unsigned char quota;
++};
++
++static struct rrsched_priv *rrsched_get_priv(const struct tcp_sock *tp)
++{
++ return (struct rrsched_priv *)&tp->mptcp->mptcp_sched[0];
++}
++
++/* Is the sub-socket sk available to send the skb? */
++static bool mptcp_rr_is_available(const struct sock *sk, const struct sk_buff *skb,
++ bool zero_wnd_test, bool cwnd_test)
++{
++ const struct tcp_sock *tp = tcp_sk(sk);
++ unsigned int space, in_flight;
++
++ /* Set of states for which we are allowed to send data */
++ if (!mptcp_sk_can_send(sk))
++ return false;
++
++ /* We do not send data on this subflow unless it is
++ * fully established, i.e. the 4th ack has been received.
++ */
++ if (tp->mptcp->pre_established)
++ return false;
++
++ if (tp->pf)
++ return false;
++
++ if (inet_csk(sk)->icsk_ca_state == TCP_CA_Loss) {
++ /* If SACK is disabled, and we got a loss, TCP does not exit
++ * the loss-state until something above high_seq has been acked.
++ * (see tcp_try_undo_recovery)
++ *
++ * high_seq is the snd_nxt at the moment of the RTO. As soon
++ * as we have an RTO, we won't push data on the subflow.
++ * Thus, snd_una can never go beyond high_seq.
++ */
++ if (!tcp_is_reno(tp))
++ return false;
++ else if (tp->snd_una != tp->high_seq)
++ return false;
++ }
++
++ if (!tp->mptcp->fully_established) {
++ /* Make sure that we send in-order data */
++ if (skb && tp->mptcp->second_packet &&
++ tp->mptcp->last_end_data_seq != TCP_SKB_CB(skb)->seq)
++ return false;
++ }
++
++ if (!cwnd_test)
++ goto zero_wnd_test;
++
++ in_flight = tcp_packets_in_flight(tp);
++ /* Not even a single spot in the cwnd */
++ if (in_flight >= tp->snd_cwnd)
++ return false;
++
++ /* Now, check if what is queued in the subflow's send-queue
++ * already fills the cwnd.
++ */
++ space = (tp->snd_cwnd - in_flight) * tp->mss_cache;
++
++ if (tp->write_seq - tp->snd_nxt > space)
++ return false;
++
++zero_wnd_test:
++ if (zero_wnd_test && !before(tp->write_seq, tcp_wnd_end(tp)))
++ return false;
++
++ return true;
++}
++
++/* Are we not allowed to reinject this skb on tp? */
++static int mptcp_rr_dont_reinject_skb(const struct tcp_sock *tp, const struct sk_buff *skb)
++{
++ /* If the skb has already been enqueued in this sk, try to find
++ * another one.
++ */
++ return skb &&
++ /* Has the skb already been enqueued into this subsocket? */
++ mptcp_pi_to_flag(tp->mptcp->path_index) & TCP_SKB_CB(skb)->path_mask;
++}
++
++/* We just look for any subflow that is available */
++static struct sock *rr_get_available_subflow(struct sock *meta_sk,
++ struct sk_buff *skb,
++ bool zero_wnd_test)
++{
++ const struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct sock *sk, *bestsk = NULL, *backupsk = NULL;
++
++ /* Answer data_fin on same subflow!!! */
++ if (meta_sk->sk_shutdown & RCV_SHUTDOWN &&
++ skb && mptcp_is_data_fin(skb)) {
++ mptcp_for_each_sk(mpcb, sk) {
++ if (tcp_sk(sk)->mptcp->path_index == mpcb->dfin_path_index &&
++ mptcp_rr_is_available(sk, skb, zero_wnd_test, true))
++ return sk;
++ }
++ }
++
++ /* First, find the best subflow */
++ mptcp_for_each_sk(mpcb, sk) {
++ struct tcp_sock *tp = tcp_sk(sk);
++
++ if (!mptcp_rr_is_available(sk, skb, zero_wnd_test, true))
++ continue;
++
++ if (mptcp_rr_dont_reinject_skb(tp, skb)) {
++ backupsk = sk;
++ continue;
++ }
++
++ bestsk = sk;
++ }
++
++ if (bestsk) {
++ sk = bestsk;
++ } else if (backupsk) {
++ /* It has been sent on all subflows once - let's give it a
++ * chance again by restarting its pathmask.
++ */
++ if (skb)
++ TCP_SKB_CB(skb)->path_mask = 0;
++ sk = backupsk;
++ }
++
++ return sk;
++}
++
++/* Returns the next segment to be sent from the mptcp meta-queue.
++ * (chooses the reinject queue if any segment is waiting in it, otherwise,
++ * chooses the normal write queue).
++ * Sets *@reinject to 1 if the returned segment comes from the
++ * reinject queue. Sets it to 0 if it is the regular send-head of the meta-sk,
++ * and sets it to -1 if it is a meta-level retransmission to optimize the
++ * receive-buffer.
++ */
++static struct sk_buff *__mptcp_rr_next_segment(const struct sock *meta_sk, int *reinject)
++{
++ const struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct sk_buff *skb = NULL;
++
++ *reinject = 0;
++
++ /* If we are in fallback-mode, just take from the meta-send-queue */
++ if (mpcb->infinite_mapping_snd || mpcb->send_infinite_mapping)
++ return tcp_send_head(meta_sk);
++
++ skb = skb_peek(&mpcb->reinject_queue);
++
++ if (skb)
++ *reinject = 1;
++ else
++ skb = tcp_send_head(meta_sk);
++ return skb;
++}
++
++static struct sk_buff *mptcp_rr_next_segment(struct sock *meta_sk,
++ int *reinject,
++ struct sock **subsk,
++ unsigned int *limit)
++{
++ const struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct sock *sk_it, *choose_sk = NULL;
++ struct sk_buff *skb = __mptcp_rr_next_segment(meta_sk, reinject);
++ unsigned char split = num_segments;
++ unsigned char iter = 0, full_subs = 0;
++
++ /* As we set it, we have to reset it as well. */
++ *limit = 0;
++
++ if (!skb)
++ return NULL;
++
++ if (*reinject) {
++ *subsk = rr_get_available_subflow(meta_sk, skb, false);
++ if (!*subsk)
++ return NULL;
++
++ return skb;
++ }
++
++retry:
++
++	/* First, we look for a subflow that is currently being used */
++ mptcp_for_each_sk(mpcb, sk_it) {
++ struct tcp_sock *tp_it = tcp_sk(sk_it);
++ struct rrsched_priv *rsp = rrsched_get_priv(tp_it);
++
++ if (!mptcp_rr_is_available(sk_it, skb, false, cwnd_limited))
++ continue;
++
++ iter++;
++
++ /* Is this subflow currently being used? */
++ if (rsp->quota > 0 && rsp->quota < num_segments) {
++ split = num_segments - rsp->quota;
++ choose_sk = sk_it;
++ goto found;
++ }
++
++ /* Or, it's totally unused */
++ if (!rsp->quota) {
++ split = num_segments;
++ choose_sk = sk_it;
++ }
++
++ /* Or, it must then be fully used */
++ if (rsp->quota == num_segments)
++ full_subs++;
++ }
++
++ /* All considered subflows have a full quota, and we considered at
++ * least one.
++ */
++ if (iter && iter == full_subs) {
++ /* So, we restart this round by setting quota to 0 and retry
++ * to find a subflow.
++ */
++ mptcp_for_each_sk(mpcb, sk_it) {
++ struct tcp_sock *tp_it = tcp_sk(sk_it);
++ struct rrsched_priv *rsp = rrsched_get_priv(tp_it);
++
++ if (!mptcp_rr_is_available(sk_it, skb, false, cwnd_limited))
++ continue;
++
++ rsp->quota = 0;
++ }
++
++ goto retry;
++ }
++
++found:
++ if (choose_sk) {
++ unsigned int mss_now;
++ struct tcp_sock *choose_tp = tcp_sk(choose_sk);
++ struct rrsched_priv *rsp = rrsched_get_priv(choose_tp);
++
++ if (!mptcp_rr_is_available(choose_sk, skb, false, true))
++ return NULL;
++
++ *subsk = choose_sk;
++ mss_now = tcp_current_mss(*subsk);
++ *limit = split * mss_now;
++
++ if (skb->len > mss_now)
++ rsp->quota += DIV_ROUND_UP(skb->len, mss_now);
++ else
++ rsp->quota++;
++
++ return skb;
++ }
++
++ return NULL;
++}
++
++static struct mptcp_sched_ops mptcp_sched_rr = {
++ .get_subflow = rr_get_available_subflow,
++ .next_segment = mptcp_rr_next_segment,
++ .name = "roundrobin",
++ .owner = THIS_MODULE,
++};
++
++static int __init rr_register(void)
++{
++ BUILD_BUG_ON(sizeof(struct rrsched_priv) > MPTCP_SCHED_SIZE);
++
++ if (mptcp_register_scheduler(&mptcp_sched_rr))
++ return -1;
++
++ return 0;
++}
++
++static void rr_unregister(void)
++{
++ mptcp_unregister_scheduler(&mptcp_sched_rr);
++}
++
++module_init(rr_register);
++module_exit(rr_unregister);
++
++MODULE_AUTHOR("Christoph Paasch");
++MODULE_LICENSE("GPL");
++MODULE_DESCRIPTION("ROUNDROBIN MPTCP");
++MODULE_VERSION("0.89");
+diff --git a/net/mptcp/mptcp_sched.c b/net/mptcp/mptcp_sched.c
+new file mode 100644
+index 000000000000..6c7ff4eceac1
+--- /dev/null
++++ b/net/mptcp/mptcp_sched.c
+@@ -0,0 +1,493 @@
++/* MPTCP Scheduler module selector. Highly inspired by tcp_cong.c */
++
++#include <linux/module.h>
++#include <net/mptcp.h>
++
++static DEFINE_SPINLOCK(mptcp_sched_list_lock);
++static LIST_HEAD(mptcp_sched_list);
++
++struct defsched_priv {
++ u32 last_rbuf_opti;
++};
++
++static struct defsched_priv *defsched_get_priv(const struct tcp_sock *tp)
++{
++ return (struct defsched_priv *)&tp->mptcp->mptcp_sched[0];
++}
++
++/* Is the sub-socket sk available to send the skb? */
++static bool mptcp_is_available(struct sock *sk, const struct sk_buff *skb,
++ bool zero_wnd_test)
++{
++ const struct tcp_sock *tp = tcp_sk(sk);
++ unsigned int mss_now, space, in_flight;
++
++ /* Set of states for which we are allowed to send data */
++ if (!mptcp_sk_can_send(sk))
++ return false;
++
++ /* We do not send data on this subflow unless it is
++ * fully established, i.e. the 4th ack has been received.
++ */
++ if (tp->mptcp->pre_established)
++ return false;
++
++ if (tp->pf)
++ return false;
++
++ if (inet_csk(sk)->icsk_ca_state == TCP_CA_Loss) {
++ /* If SACK is disabled, and we got a loss, TCP does not exit
++ * the loss-state until something above high_seq has been acked.
++ * (see tcp_try_undo_recovery)
++ *
++ * high_seq is the snd_nxt at the moment of the RTO. As soon
++ * as we have an RTO, we won't push data on the subflow.
++ * Thus, snd_una can never go beyond high_seq.
++ */
++ if (!tcp_is_reno(tp))
++ return false;
++ else if (tp->snd_una != tp->high_seq)
++ return false;
++ }
++
++ if (!tp->mptcp->fully_established) {
++ /* Make sure that we send in-order data */
++ if (skb && tp->mptcp->second_packet &&
++ tp->mptcp->last_end_data_seq != TCP_SKB_CB(skb)->seq)
++ return false;
++ }
++
++ /* If TSQ is already throttling us, do not send on this subflow. When
++ * TSQ gets cleared the subflow becomes eligible again.
++ */
++ if (test_bit(TSQ_THROTTLED, &tp->tsq_flags))
++ return false;
++
++ in_flight = tcp_packets_in_flight(tp);
++ /* Not even a single spot in the cwnd */
++ if (in_flight >= tp->snd_cwnd)
++ return false;
++
++ /* Now, check if what is queued in the subflow's send-queue
++ * already fills the cwnd.
++ */
++ space = (tp->snd_cwnd - in_flight) * tp->mss_cache;
++
++ if (tp->write_seq - tp->snd_nxt > space)
++ return false;
++
++ if (zero_wnd_test && !before(tp->write_seq, tcp_wnd_end(tp)))
++ return false;
++
++ mss_now = tcp_current_mss(sk);
++
++ /* Don't send on this subflow if we bypass the allowed send-window at
++ * the per-subflow level. Similar to tcp_snd_wnd_test, but manually
++ * calculated end_seq (because here at this point end_seq is still at
++ * the meta-level).
++ */
++ if (skb && !zero_wnd_test &&
++ after(tp->write_seq + min(skb->len, mss_now), tcp_wnd_end(tp)))
++ return false;
++
++ return true;
++}
++
++/* Are we not allowed to reinject this skb on tp? */
++static int mptcp_dont_reinject_skb(const struct tcp_sock *tp, const struct sk_buff *skb)
++{
++ /* If the skb has already been enqueued in this sk, try to find
++ * another one.
++ */
++ return skb &&
++ /* Has the skb already been enqueued into this subsocket? */
++ mptcp_pi_to_flag(tp->mptcp->path_index) & TCP_SKB_CB(skb)->path_mask;
++}
++
++/* This is the scheduler. This function decides on which flow to send
++ * a given MSS. If all subflows are found to be busy, NULL is returned.
++ * The flow is selected based on the shortest RTT.
++ * If all paths have full cong windows, we simply return NULL.
++ *
++ * Additionally, this function is aware of the backup-subflows.
++ */
++static struct sock *get_available_subflow(struct sock *meta_sk,
++ struct sk_buff *skb,
++ bool zero_wnd_test)
++{
++ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct sock *sk, *bestsk = NULL, *lowpriosk = NULL, *backupsk = NULL;
++ u32 min_time_to_peer = 0xffffffff, lowprio_min_time_to_peer = 0xffffffff;
++ int cnt_backups = 0;
++
++ /* if there is only one subflow, bypass the scheduling function */
++ if (mpcb->cnt_subflows == 1) {
++ bestsk = (struct sock *)mpcb->connection_list;
++ if (!mptcp_is_available(bestsk, skb, zero_wnd_test))
++ bestsk = NULL;
++ return bestsk;
++ }
++
++ /* Answer data_fin on same subflow!!! */
++ if (meta_sk->sk_shutdown & RCV_SHUTDOWN &&
++ skb && mptcp_is_data_fin(skb)) {
++ mptcp_for_each_sk(mpcb, sk) {
++ if (tcp_sk(sk)->mptcp->path_index == mpcb->dfin_path_index &&
++ mptcp_is_available(sk, skb, zero_wnd_test))
++ return sk;
++ }
++ }
++
++ /* First, find the best subflow */
++ mptcp_for_each_sk(mpcb, sk) {
++ struct tcp_sock *tp = tcp_sk(sk);
++
++ if (tp->mptcp->rcv_low_prio || tp->mptcp->low_prio)
++ cnt_backups++;
++
++ if ((tp->mptcp->rcv_low_prio || tp->mptcp->low_prio) &&
++ tp->srtt_us < lowprio_min_time_to_peer) {
++ if (!mptcp_is_available(sk, skb, zero_wnd_test))
++ continue;
++
++ if (mptcp_dont_reinject_skb(tp, skb)) {
++ backupsk = sk;
++ continue;
++ }
++
++ lowprio_min_time_to_peer = tp->srtt_us;
++ lowpriosk = sk;
++ } else if (!(tp->mptcp->rcv_low_prio || tp->mptcp->low_prio) &&
++ tp->srtt_us < min_time_to_peer) {
++ if (!mptcp_is_available(sk, skb, zero_wnd_test))
++ continue;
++
++ if (mptcp_dont_reinject_skb(tp, skb)) {
++ backupsk = sk;
++ continue;
++ }
++
++ min_time_to_peer = tp->srtt_us;
++ bestsk = sk;
++ }
++ }
++
++ if (mpcb->cnt_established == cnt_backups && lowpriosk) {
++ sk = lowpriosk;
++ } else if (bestsk) {
++ sk = bestsk;
++ } else if (backupsk) {
++ /* It has been sent on all subflows once - let's give it a
++ * chance again by restarting its pathmask.
++ */
++ if (skb)
++ TCP_SKB_CB(skb)->path_mask = 0;
++ sk = backupsk;
++ }
++
++ return sk;
++}
++
++static struct sk_buff *mptcp_rcv_buf_optimization(struct sock *sk, int penal)
++{
++ struct sock *meta_sk;
++ const struct tcp_sock *tp = tcp_sk(sk);
++ struct tcp_sock *tp_it;
++ struct sk_buff *skb_head;
++ struct defsched_priv *dsp = defsched_get_priv(tp);
++
++ if (tp->mpcb->cnt_subflows == 1)
++ return NULL;
++
++ meta_sk = mptcp_meta_sk(sk);
++ skb_head = tcp_write_queue_head(meta_sk);
++
++ if (!skb_head || skb_head == tcp_send_head(meta_sk))
++ return NULL;
++
++	/* If penalization is optional (coming from mptcp_next_segment()) and
++	 * we are not send-buffer-limited, we do not penalize. The retransmission
++ * is just an optimization to fix the idle-time due to the delay before
++ * we wake up the application.
++ */
++ if (!penal && sk_stream_memory_free(meta_sk))
++ goto retrans;
++
++ /* Only penalize again after an RTT has elapsed */
++ if (tcp_time_stamp - dsp->last_rbuf_opti < usecs_to_jiffies(tp->srtt_us >> 3))
++ goto retrans;
++
++ /* Half the cwnd of the slow flow */
++ mptcp_for_each_tp(tp->mpcb, tp_it) {
++ if (tp_it != tp &&
++ TCP_SKB_CB(skb_head)->path_mask & mptcp_pi_to_flag(tp_it->mptcp->path_index)) {
++ if (tp->srtt_us < tp_it->srtt_us && inet_csk((struct sock *)tp_it)->icsk_ca_state == TCP_CA_Open) {
++ tp_it->snd_cwnd = max(tp_it->snd_cwnd >> 1U, 1U);
++ if (tp_it->snd_ssthresh != TCP_INFINITE_SSTHRESH)
++ tp_it->snd_ssthresh = max(tp_it->snd_ssthresh >> 1U, 2U);
++
++ dsp->last_rbuf_opti = tcp_time_stamp;
++ }
++ break;
++ }
++ }
++
++retrans:
++
++ /* Segment not yet injected into this path? Take it!!! */
++ if (!(TCP_SKB_CB(skb_head)->path_mask & mptcp_pi_to_flag(tp->mptcp->path_index))) {
++ bool do_retrans = false;
++ mptcp_for_each_tp(tp->mpcb, tp_it) {
++ if (tp_it != tp &&
++ TCP_SKB_CB(skb_head)->path_mask & mptcp_pi_to_flag(tp_it->mptcp->path_index)) {
++ if (tp_it->snd_cwnd <= 4) {
++ do_retrans = true;
++ break;
++ }
++
++ if (4 * tp->srtt_us >= tp_it->srtt_us) {
++ do_retrans = false;
++ break;
++ } else {
++ do_retrans = true;
++ }
++ }
++ }
++
++ if (do_retrans && mptcp_is_available(sk, skb_head, false))
++ return skb_head;
++ }
++ return NULL;
++}
++
++/* Returns the next segment to be sent from the mptcp meta-queue.
++ * (chooses the reinject queue if any segment is waiting in it, otherwise,
++ * chooses the normal write queue).
++ * Sets *@reinject to 1 if the returned segment comes from the
++ * reinject queue. Sets it to 0 if it is the regular send-head of the meta-sk,
++ * and sets it to -1 if it is a meta-level retransmission to optimize the
++ * receive-buffer.
++ */
++static struct sk_buff *__mptcp_next_segment(struct sock *meta_sk, int *reinject)
++{
++ const struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct sk_buff *skb = NULL;
++
++ *reinject = 0;
++
++ /* If we are in fallback-mode, just take from the meta-send-queue */
++ if (mpcb->infinite_mapping_snd || mpcb->send_infinite_mapping)
++ return tcp_send_head(meta_sk);
++
++ skb = skb_peek(&mpcb->reinject_queue);
++
++ if (skb) {
++ *reinject = 1;
++ } else {
++ skb = tcp_send_head(meta_sk);
++
++ if (!skb && meta_sk->sk_socket &&
++ test_bit(SOCK_NOSPACE, &meta_sk->sk_socket->flags) &&
++ sk_stream_wspace(meta_sk) < sk_stream_min_wspace(meta_sk)) {
++ struct sock *subsk = get_available_subflow(meta_sk, NULL,
++ false);
++ if (!subsk)
++ return NULL;
++
++ skb = mptcp_rcv_buf_optimization(subsk, 0);
++ if (skb)
++ *reinject = -1;
++ }
++ }
++ return skb;
++}
++
++static struct sk_buff *mptcp_next_segment(struct sock *meta_sk,
++ int *reinject,
++ struct sock **subsk,
++ unsigned int *limit)
++{
++ struct sk_buff *skb = __mptcp_next_segment(meta_sk, reinject);
++ unsigned int mss_now;
++ struct tcp_sock *subtp;
++ u16 gso_max_segs;
++ u32 max_len, max_segs, window, needed;
++
++ /* As we set it, we have to reset it as well. */
++ *limit = 0;
++
++ if (!skb)
++ return NULL;
++
++ *subsk = get_available_subflow(meta_sk, skb, false);
++ if (!*subsk)
++ return NULL;
++
++ subtp = tcp_sk(*subsk);
++ mss_now = tcp_current_mss(*subsk);
++
++ if (!*reinject && unlikely(!tcp_snd_wnd_test(tcp_sk(meta_sk), skb, mss_now))) {
++ skb = mptcp_rcv_buf_optimization(*subsk, 1);
++ if (skb)
++ *reinject = -1;
++ else
++ return NULL;
++ }
++
++ /* No splitting required, as we will only send one single segment */
++ if (skb->len <= mss_now)
++ return skb;
++
++ /* The following is similar to tcp_mss_split_point, but
++	 * we do not care about nagle, because we will anyway
++ * use TCP_NAGLE_PUSH, which overrides this.
++ *
++ * So, we first limit according to the cwnd/gso-size and then according
++ * to the subflow's window.
++ */
++
++ gso_max_segs = (*subsk)->sk_gso_max_segs;
++ if (!gso_max_segs) /* No gso supported on the subflow's NIC */
++ gso_max_segs = 1;
++ max_segs = min_t(unsigned int, tcp_cwnd_test(subtp, skb), gso_max_segs);
++ if (!max_segs)
++ return NULL;
++
++ max_len = mss_now * max_segs;
++ window = tcp_wnd_end(subtp) - subtp->write_seq;
++
++ needed = min(skb->len, window);
++ if (max_len <= skb->len)
++ /* Take max_win, which is actually the cwnd/gso-size */
++ *limit = max_len;
++ else
++ /* Or, take the window */
++ *limit = needed;
++
++ return skb;
++}
++
++static void defsched_init(struct sock *sk)
++{
++ struct defsched_priv *dsp = defsched_get_priv(tcp_sk(sk));
++
++ dsp->last_rbuf_opti = tcp_time_stamp;
++}
++
++struct mptcp_sched_ops mptcp_sched_default = {
++ .get_subflow = get_available_subflow,
++ .next_segment = mptcp_next_segment,
++ .init = defsched_init,
++ .name = "default",
++ .owner = THIS_MODULE,
++};
++
++static struct mptcp_sched_ops *mptcp_sched_find(const char *name)
++{
++ struct mptcp_sched_ops *e;
++
++ list_for_each_entry_rcu(e, &mptcp_sched_list, list) {
++ if (strcmp(e->name, name) == 0)
++ return e;
++ }
++
++ return NULL;
++}
++
++int mptcp_register_scheduler(struct mptcp_sched_ops *sched)
++{
++ int ret = 0;
++
++ if (!sched->get_subflow || !sched->next_segment)
++ return -EINVAL;
++
++ spin_lock(&mptcp_sched_list_lock);
++ if (mptcp_sched_find(sched->name)) {
++ pr_notice("%s already registered\n", sched->name);
++ ret = -EEXIST;
++ } else {
++ list_add_tail_rcu(&sched->list, &mptcp_sched_list);
++ pr_info("%s registered\n", sched->name);
++ }
++ spin_unlock(&mptcp_sched_list_lock);
++
++ return ret;
++}
++EXPORT_SYMBOL_GPL(mptcp_register_scheduler);
++
++void mptcp_unregister_scheduler(struct mptcp_sched_ops *sched)
++{
++ spin_lock(&mptcp_sched_list_lock);
++ list_del_rcu(&sched->list);
++ spin_unlock(&mptcp_sched_list_lock);
++}
++EXPORT_SYMBOL_GPL(mptcp_unregister_scheduler);
++
++void mptcp_get_default_scheduler(char *name)
++{
++ struct mptcp_sched_ops *sched;
++
++ BUG_ON(list_empty(&mptcp_sched_list));
++
++ rcu_read_lock();
++ sched = list_entry(mptcp_sched_list.next, struct mptcp_sched_ops, list);
++ strncpy(name, sched->name, MPTCP_SCHED_NAME_MAX);
++ rcu_read_unlock();
++}
++
++int mptcp_set_default_scheduler(const char *name)
++{
++ struct mptcp_sched_ops *sched;
++ int ret = -ENOENT;
++
++ spin_lock(&mptcp_sched_list_lock);
++ sched = mptcp_sched_find(name);
++#ifdef CONFIG_MODULES
++ if (!sched && capable(CAP_NET_ADMIN)) {
++ spin_unlock(&mptcp_sched_list_lock);
++
++ request_module("mptcp_%s", name);
++ spin_lock(&mptcp_sched_list_lock);
++ sched = mptcp_sched_find(name);
++ }
++#endif
++
++ if (sched) {
++ list_move(&sched->list, &mptcp_sched_list);
++ ret = 0;
++ } else {
++ pr_info("%s is not available\n", name);
++ }
++ spin_unlock(&mptcp_sched_list_lock);
++
++ return ret;
++}
++
++void mptcp_init_scheduler(struct mptcp_cb *mpcb)
++{
++ struct mptcp_sched_ops *sched;
++
++ rcu_read_lock();
++ list_for_each_entry_rcu(sched, &mptcp_sched_list, list) {
++ if (try_module_get(sched->owner)) {
++ mpcb->sched_ops = sched;
++ break;
++ }
++ }
++ rcu_read_unlock();
++}
++
++/* Manage refcounts on socket close. */
++void mptcp_cleanup_scheduler(struct mptcp_cb *mpcb)
++{
++ module_put(mpcb->sched_ops->owner);
++}
++
++/* Set default value from kernel configuration at bootup */
++static int __init mptcp_scheduler_default(void)
++{
++ BUILD_BUG_ON(sizeof(struct defsched_priv) > MPTCP_SCHED_SIZE);
++
++ return mptcp_set_default_scheduler(CONFIG_DEFAULT_MPTCP_SCHED);
++}
++late_initcall(mptcp_scheduler_default);
+diff --git a/net/mptcp/mptcp_wvegas.c b/net/mptcp/mptcp_wvegas.c
+new file mode 100644
+index 000000000000..29ca1d868d17
+--- /dev/null
++++ b/net/mptcp/mptcp_wvegas.c
+@@ -0,0 +1,268 @@
++/*
++ * MPTCP implementation - WEIGHTED VEGAS
++ *
++ * Algorithm design:
++ * Yu Cao <cyAnalyst@126.com>
++ * Mingwei Xu <xmw@csnet1.cs.tsinghua.edu.cn>
++ * Xiaoming Fu <fu@cs.uni-goettinggen.de>
++ *
++ * Implementation:
++ * Yu Cao <cyAnalyst@126.com>
++ * Enhuan Dong <deh13@mails.tsinghua.edu.cn>
++ *
++ * Ported to the official MPTCP-kernel:
++ * Christoph Paasch <christoph.paasch@uclouvain.be>
++ *
++ * This program is free software; you can redistribute it and/or
++ * modify it under the terms of the GNU General Public License
++ * as published by the Free Software Foundation; either version
++ * 2 of the License, or (at your option) any later version.
++ */
++
++#include <linux/skbuff.h>
++#include <net/tcp.h>
++#include <net/mptcp.h>
++#include <linux/module.h>
++#include <linux/tcp.h>
++
++static int initial_alpha = 2;
++static int total_alpha = 10;
++static int gamma = 1;
++
++module_param(initial_alpha, int, 0644);
++MODULE_PARM_DESC(initial_alpha, "initial alpha for all subflows");
++module_param(total_alpha, int, 0644);
++MODULE_PARM_DESC(total_alpha, "total alpha for all subflows");
++module_param(gamma, int, 0644);
++MODULE_PARM_DESC(gamma, "limit on increase (scale by 2)");
++
++#define MPTCP_WVEGAS_SCALE 16
++
++/* wVegas variables */
++struct wvegas {
++ u32 beg_snd_nxt; /* right edge during last RTT */
++	u8	doing_wvegas_now; /* if true, do wvegas for this RTT */
++
++ u16 cnt_rtt; /* # of RTTs measured within last RTT */
++ u32 sampled_rtt; /* cumulative RTTs measured within last RTT (in usec) */
++ u32 base_rtt; /* the min of all wVegas RTT measurements seen (in usec) */
++
++ u64 instant_rate; /* cwnd / srtt_us, unit: pkts/us * 2^16 */
++ u64 weight; /* the ratio of subflow's rate to the total rate, * 2^16 */
++ int alpha; /* alpha for each subflows */
++
++	u32 queue_delay; /* queue delay */
++};
++
++
++static inline u64 mptcp_wvegas_scale(u32 val, int scale)
++{
++ return (u64) val << scale;
++}
++
++static void wvegas_enable(const struct sock *sk)
++{
++ const struct tcp_sock *tp = tcp_sk(sk);
++ struct wvegas *wvegas = inet_csk_ca(sk);
++
++ wvegas->doing_wvegas_now = 1;
++
++ wvegas->beg_snd_nxt = tp->snd_nxt;
++
++ wvegas->cnt_rtt = 0;
++ wvegas->sampled_rtt = 0;
++
++ wvegas->instant_rate = 0;
++ wvegas->alpha = initial_alpha;
++ wvegas->weight = mptcp_wvegas_scale(1, MPTCP_WVEGAS_SCALE);
++
++ wvegas->queue_delay = 0;
++}
++
++static inline void wvegas_disable(const struct sock *sk)
++{
++ struct wvegas *wvegas = inet_csk_ca(sk);
++
++ wvegas->doing_wvegas_now = 0;
++}
++
++static void mptcp_wvegas_init(struct sock *sk)
++{
++ struct wvegas *wvegas = inet_csk_ca(sk);
++
++ wvegas->base_rtt = 0x7fffffff;
++ wvegas_enable(sk);
++}
++
++static inline u64 mptcp_wvegas_rate(u32 cwnd, u32 rtt_us)
++{
++ return div_u64(mptcp_wvegas_scale(cwnd, MPTCP_WVEGAS_SCALE), rtt_us);
++}
++
++static void mptcp_wvegas_pkts_acked(struct sock *sk, u32 cnt, s32 rtt_us)
++{
++ struct wvegas *wvegas = inet_csk_ca(sk);
++ u32 vrtt;
++
++ if (rtt_us < 0)
++ return;
++
++ vrtt = rtt_us + 1;
++
++ if (vrtt < wvegas->base_rtt)
++ wvegas->base_rtt = vrtt;
++
++ wvegas->sampled_rtt += vrtt;
++ wvegas->cnt_rtt++;
++}
++
++static void mptcp_wvegas_state(struct sock *sk, u8 ca_state)
++{
++ if (ca_state == TCP_CA_Open)
++ wvegas_enable(sk);
++ else
++ wvegas_disable(sk);
++}
++
++static void mptcp_wvegas_cwnd_event(struct sock *sk, enum tcp_ca_event event)
++{
++ if (event == CA_EVENT_CWND_RESTART) {
++ mptcp_wvegas_init(sk);
++ } else if (event == CA_EVENT_LOSS) {
++ struct wvegas *wvegas = inet_csk_ca(sk);
++ wvegas->instant_rate = 0;
++ }
++}
++
++static inline u32 mptcp_wvegas_ssthresh(const struct tcp_sock *tp)
++{
++ return min(tp->snd_ssthresh, tp->snd_cwnd - 1);
++}
++
++static u64 mptcp_wvegas_weight(const struct mptcp_cb *mpcb, const struct sock *sk)
++{
++ u64 total_rate = 0;
++ struct sock *sub_sk;
++ const struct wvegas *wvegas = inet_csk_ca(sk);
++
++ if (!mpcb)
++ return wvegas->weight;
++
++
++ mptcp_for_each_sk(mpcb, sub_sk) {
++ struct wvegas *sub_wvegas = inet_csk_ca(sub_sk);
++
++ /* sampled_rtt is initialized by 0 */
++ if (mptcp_sk_can_send(sub_sk) && (sub_wvegas->sampled_rtt > 0))
++ total_rate += sub_wvegas->instant_rate;
++ }
++
++ if (total_rate && wvegas->instant_rate)
++ return div64_u64(mptcp_wvegas_scale(wvegas->instant_rate, MPTCP_WVEGAS_SCALE), total_rate);
++ else
++ return wvegas->weight;
++}
++
++static void mptcp_wvegas_cong_avoid(struct sock *sk, u32 ack, u32 acked)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct wvegas *wvegas = inet_csk_ca(sk);
++
++ if (!wvegas->doing_wvegas_now) {
++ tcp_reno_cong_avoid(sk, ack, acked);
++ return;
++ }
++
++ if (after(ack, wvegas->beg_snd_nxt)) {
++ wvegas->beg_snd_nxt = tp->snd_nxt;
++
++ if (wvegas->cnt_rtt <= 2) {
++ tcp_reno_cong_avoid(sk, ack, acked);
++ } else {
++ u32 rtt, diff, q_delay;
++ u64 target_cwnd;
++
++ rtt = wvegas->sampled_rtt / wvegas->cnt_rtt;
++ target_cwnd = div_u64(((u64)tp->snd_cwnd * wvegas->base_rtt), rtt);
++
++ diff = div_u64((u64)tp->snd_cwnd * (rtt - wvegas->base_rtt), rtt);
++
++ if (diff > gamma && tp->snd_cwnd <= tp->snd_ssthresh) {
++ tp->snd_cwnd = min(tp->snd_cwnd, (u32)target_cwnd+1);
++ tp->snd_ssthresh = mptcp_wvegas_ssthresh(tp);
++
++ } else if (tp->snd_cwnd <= tp->snd_ssthresh) {
++ tcp_slow_start(tp, acked);
++ } else {
++ if (diff >= wvegas->alpha) {
++ wvegas->instant_rate = mptcp_wvegas_rate(tp->snd_cwnd, rtt);
++ wvegas->weight = mptcp_wvegas_weight(tp->mpcb, sk);
++ wvegas->alpha = max(2U, (u32)((wvegas->weight * total_alpha) >> MPTCP_WVEGAS_SCALE));
++ }
++ if (diff > wvegas->alpha) {
++ tp->snd_cwnd--;
++ tp->snd_ssthresh = mptcp_wvegas_ssthresh(tp);
++ } else if (diff < wvegas->alpha) {
++ tp->snd_cwnd++;
++ }
++
++				/* Try to drain link queue if needed */
++ q_delay = rtt - wvegas->base_rtt;
++ if ((wvegas->queue_delay == 0) || (wvegas->queue_delay > q_delay))
++ wvegas->queue_delay = q_delay;
++
++ if (q_delay >= 2 * wvegas->queue_delay) {
++ u32 backoff_factor = div_u64(mptcp_wvegas_scale(wvegas->base_rtt, MPTCP_WVEGAS_SCALE), 2 * rtt);
++ tp->snd_cwnd = ((u64)tp->snd_cwnd * backoff_factor) >> MPTCP_WVEGAS_SCALE;
++ wvegas->queue_delay = 0;
++ }
++ }
++
++ if (tp->snd_cwnd < 2)
++ tp->snd_cwnd = 2;
++ else if (tp->snd_cwnd > tp->snd_cwnd_clamp)
++ tp->snd_cwnd = tp->snd_cwnd_clamp;
++
++ tp->snd_ssthresh = tcp_current_ssthresh(sk);
++ }
++
++ wvegas->cnt_rtt = 0;
++ wvegas->sampled_rtt = 0;
++ }
++ /* Use normal slow start */
++ else if (tp->snd_cwnd <= tp->snd_ssthresh)
++ tcp_slow_start(tp, acked);
++}
++
++
++static struct tcp_congestion_ops mptcp_wvegas __read_mostly = {
++ .init = mptcp_wvegas_init,
++ .ssthresh = tcp_reno_ssthresh,
++ .cong_avoid = mptcp_wvegas_cong_avoid,
++ .pkts_acked = mptcp_wvegas_pkts_acked,
++ .set_state = mptcp_wvegas_state,
++ .cwnd_event = mptcp_wvegas_cwnd_event,
++
++ .owner = THIS_MODULE,
++ .name = "wvegas",
++};
++
++static int __init mptcp_wvegas_register(void)
++{
++ BUILD_BUG_ON(sizeof(struct wvegas) > ICSK_CA_PRIV_SIZE);
++ tcp_register_congestion_control(&mptcp_wvegas);
++ return 0;
++}
++
++static void __exit mptcp_wvegas_unregister(void)
++{
++ tcp_unregister_congestion_control(&mptcp_wvegas);
++}
++
++module_init(mptcp_wvegas_register);
++module_exit(mptcp_wvegas_unregister);
++
++MODULE_AUTHOR("Yu Cao, Enhuan Dong");
++MODULE_LICENSE("GPL");
++MODULE_DESCRIPTION("MPTCP wVegas");
++MODULE_VERSION("0.1");
diff --git a/4567_distro-Gentoo-Kconfig.patch b/4567_distro-Gentoo-Kconfig.patch
index 71dbf09..652e2a7 100644
--- a/4567_distro-Gentoo-Kconfig.patch
+++ b/4567_distro-Gentoo-Kconfig.patch
@@ -1,15 +1,15 @@
---- a/Kconfig 2014-04-02 09:45:05.389224541 -0400
-+++ b/Kconfig 2014-04-02 09:45:39.269224273 -0400
+--- a/Kconfig 2014-04-02 09:45:05.389224541 -0400
++++ b/Kconfig 2014-04-02 09:45:39.269224273 -0400
@@ -8,4 +8,6 @@ config SRCARCH
- string
- option env="SRCARCH"
-
+ string
+ option env="SRCARCH"
+
+source "distro/Kconfig"
+
source "arch/$SRCARCH/Kconfig"
---- /dev/null 2014-09-22 14:19:24.316977284 -0400
-+++ distro/Kconfig 2014-09-22 19:30:35.670959281 -0400
-@@ -0,0 +1,109 @@
+--- 1969-12-31 19:00:00.000000000 -0500
++++ b/distro/Kconfig 2014-04-02 09:57:03.539218861 -0400
+@@ -0,0 +1,108 @@
+menu "Gentoo Linux"
+
+config GENTOO_LINUX
@@ -34,8 +34,6 @@
+ select DEVTMPFS
+ select TMPFS
+
-+ select FHANDLE
-+
+ select MMU
+ select SHMEM
+
@@ -91,6 +89,7 @@
+ select CGROUPS
+ select EPOLL
+ select FANOTIFY
++ select FHANDLE
+ select INOTIFY_USER
+ select NET
+ select NET_NS
* [gentoo-commits] proj/linux-patches:3.16 commit in: /
@ 2014-09-27 13:37 Mike Pagano
From: Mike Pagano @ 2014-09-27 13:37 UTC
To: gentoo-commits
commit: 1b28da13cd7150f66fae58043d3de661105a513a
Author: Mike Pagano <mpagano <AT> gentoo <DOT> org>
AuthorDate: Sat Sep 27 13:37:37 2014 +0000
Commit: Mike Pagano <mpagano <AT> gentoo <DOT> org>
CommitDate: Sat Sep 27 13:37:37 2014 +0000
URL: http://sources.gentoo.org/gitweb/?p=proj/linux-patches.git;a=commit;h=1b28da13
Move mptcp patch to experimental
---
0000_README | 9 +-
5010_multipath-tcp-v3.16-872d7f6c6f4e.patch | 19230 ++++++++++++++++++++++++++
2 files changed, 19235 insertions(+), 4 deletions(-)
diff --git a/0000_README b/0000_README
index d92e6b7..3cc9441 100644
--- a/0000_README
+++ b/0000_README
@@ -58,10 +58,6 @@ Patch: 2400_kcopy-patch-for-infiniband-driver.patch
From: Alexey Shvetsov <alexxy@gentoo.org>
Desc: Zero copy for infiniband psm userspace driver
-Patch: 2500_multipath-tcp-v3.16-872d7f6c6f4e.patch
-From: http://multipath-tcp.org/
-Desc: Patch for simultaneous use of several IP-addresses/interfaces in TCP for better resource utilization, better throughput and smoother reaction to failures.
-
Patch: 2700_ThinkPad-30-brightness-control-fix.patch
From: Seth Forshee <seth.forshee@canonical.com>
Desc: ACPI: Disable Windows 8 compatibility for some Lenovo ThinkPads
@@ -101,3 +97,8 @@ Desc: BFQ v7r5 patch 2 for 3.16: BFQ Scheduler
Patch: 5003_BFQ-3-block-add-Early-Queue-Merge-EQM-v7r5-for-3.16.0.patch
From: http://algo.ing.unimo.it/people/paolo/disk_sched/
Desc: BFQ v7r5 patch 3 for 3.16: Early Queue Merge (EQM)
+
+Patch: 5010_multipath-tcp-v3.16-872d7f6c6f4e.patch
+From: http://multipath-tcp.org/
+Desc: Patch for simultaneous use of several IP-addresses/interfaces in TCP for better resource utilization, better throughput and smoother reaction to failures.
+
diff --git a/5010_multipath-tcp-v3.16-872d7f6c6f4e.patch b/5010_multipath-tcp-v3.16-872d7f6c6f4e.patch
new file mode 100644
index 0000000..3000da3
--- /dev/null
+++ b/5010_multipath-tcp-v3.16-872d7f6c6f4e.patch
@@ -0,0 +1,19230 @@
+diff --git a/drivers/infiniband/hw/cxgb4/cm.c b/drivers/infiniband/hw/cxgb4/cm.c
+index 768a0fb67dd6..5a46d91a8df9 100644
+--- a/drivers/infiniband/hw/cxgb4/cm.c
++++ b/drivers/infiniband/hw/cxgb4/cm.c
+@@ -3432,7 +3432,7 @@ static void build_cpl_pass_accept_req(struct sk_buff *skb, int stid , u8 tos)
+ */
+ memset(&tmp_opt, 0, sizeof(tmp_opt));
+ tcp_clear_options(&tmp_opt);
+- tcp_parse_options(skb, &tmp_opt, 0, NULL);
++ tcp_parse_options(skb, &tmp_opt, NULL, 0, NULL);
+
+ req = (struct cpl_pass_accept_req *)__skb_push(skb, sizeof(*req));
+ memset(req, 0, sizeof(*req));
+diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h
+index 2faef339d8f2..d86c853ffaad 100644
+--- a/include/linux/ipv6.h
++++ b/include/linux/ipv6.h
+@@ -256,16 +256,6 @@ static inline struct ipv6_pinfo * inet6_sk(const struct sock *__sk)
+ return inet_sk(__sk)->pinet6;
+ }
+
+-static inline struct request_sock *inet6_reqsk_alloc(struct request_sock_ops *ops)
+-{
+- struct request_sock *req = reqsk_alloc(ops);
+-
+- if (req)
+- inet_rsk(req)->pktopts = NULL;
+-
+- return req;
+-}
+-
+ static inline struct raw6_sock *raw6_sk(const struct sock *sk)
+ {
+ return (struct raw6_sock *)sk;
+@@ -309,12 +299,6 @@ static inline struct ipv6_pinfo * inet6_sk(const struct sock *__sk)
+ return NULL;
+ }
+
+-static inline struct inet6_request_sock *
+- inet6_rsk(const struct request_sock *rsk)
+-{
+- return NULL;
+-}
+-
+ static inline struct raw6_sock *raw6_sk(const struct sock *sk)
+ {
+ return NULL;
+diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
+index ec89301ada41..99ea4b0e3693 100644
+--- a/include/linux/skbuff.h
++++ b/include/linux/skbuff.h
+@@ -2784,8 +2784,10 @@ static inline bool __skb_checksum_validate_needed(struct sk_buff *skb,
+ bool zero_okay,
+ __sum16 check)
+ {
+- if (skb_csum_unnecessary(skb) || (zero_okay && !check)) {
+- skb->csum_valid = 1;
++ if (skb_csum_unnecessary(skb)) {
++ return false;
++ } else if (zero_okay && !check) {
++ skb->ip_summed = CHECKSUM_UNNECESSARY;
+ return false;
+ }
+
+diff --git a/include/linux/tcp.h b/include/linux/tcp.h
+index a0513210798f..7bc2e078d6ca 100644
+--- a/include/linux/tcp.h
++++ b/include/linux/tcp.h
+@@ -53,7 +53,7 @@ static inline unsigned int tcp_optlen(const struct sk_buff *skb)
+ /* TCP Fast Open */
+ #define TCP_FASTOPEN_COOKIE_MIN 4 /* Min Fast Open Cookie size in bytes */
+ #define TCP_FASTOPEN_COOKIE_MAX 16 /* Max Fast Open Cookie size in bytes */
+-#define TCP_FASTOPEN_COOKIE_SIZE 8 /* the size employed by this impl. */
++#define TCP_FASTOPEN_COOKIE_SIZE 4 /* the size employed by this impl. */
+
+ /* TCP Fast Open Cookie as stored in memory */
+ struct tcp_fastopen_cookie {
+@@ -72,6 +72,51 @@ struct tcp_sack_block {
+ u32 end_seq;
+ };
+
++struct tcp_out_options {
++ u16 options; /* bit field of OPTION_* */
++ u8 ws; /* window scale, 0 to disable */
++ u8 num_sack_blocks;/* number of SACK blocks to include */
++ u8 hash_size; /* bytes in hash_location */
++ u16 mss; /* 0 to disable */
++ __u8 *hash_location; /* temporary pointer, overloaded */
++ __u32 tsval, tsecr; /* need to include OPTION_TS */
++ struct tcp_fastopen_cookie *fastopen_cookie; /* Fast open cookie */
++#ifdef CONFIG_MPTCP
++ u16 mptcp_options; /* bit field of MPTCP related OPTION_* */
++ u8 dss_csum:1,
++ add_addr_v4:1,
++ add_addr_v6:1; /* dss-checksum required? */
++
++ union {
++ struct {
++ __u64 sender_key; /* sender's key for mptcp */
++ __u64 receiver_key; /* receiver's key for mptcp */
++ } mp_capable;
++
++ struct {
++ __u64 sender_truncated_mac;
++ __u32 sender_nonce;
++ /* random number of the sender */
++ __u32 token; /* token for mptcp */
++ u8 low_prio:1;
++ } mp_join_syns;
++ };
++
++ struct {
++ struct in_addr addr;
++ u8 addr_id;
++ } add_addr4;
++
++ struct {
++ struct in6_addr addr;
++ u8 addr_id;
++ } add_addr6;
++
++ u16 remove_addrs; /* list of address id */
++ u8 addr_id; /* address id (mp_join or add_address) */
++#endif /* CONFIG_MPTCP */
++};
++
+ /*These are used to set the sack_ok field in struct tcp_options_received */
+ #define TCP_SACK_SEEN (1 << 0) /*1 = peer is SACK capable, */
+ #define TCP_FACK_ENABLED (1 << 1) /*1 = FACK is enabled locally*/
+@@ -95,6 +140,9 @@ struct tcp_options_received {
+ u16 mss_clamp; /* Maximal mss, negotiated at connection setup */
+ };
+
++struct mptcp_cb;
++struct mptcp_tcp_sock;
++
+ static inline void tcp_clear_options(struct tcp_options_received *rx_opt)
+ {
+ rx_opt->tstamp_ok = rx_opt->sack_ok = 0;
+@@ -111,10 +159,7 @@ struct tcp_request_sock_ops;
+
+ struct tcp_request_sock {
+ struct inet_request_sock req;
+-#ifdef CONFIG_TCP_MD5SIG
+- /* Only used by TCP MD5 Signature so far. */
+ const struct tcp_request_sock_ops *af_specific;
+-#endif
+ struct sock *listener; /* needed for TFO */
+ u32 rcv_isn;
+ u32 snt_isn;
+@@ -130,6 +175,8 @@ static inline struct tcp_request_sock *tcp_rsk(const struct request_sock *req)
+ return (struct tcp_request_sock *)req;
+ }
+
++struct tcp_md5sig_key;
++
+ struct tcp_sock {
+ /* inet_connection_sock has to be the first member of tcp_sock */
+ struct inet_connection_sock inet_conn;
+@@ -326,6 +373,37 @@ struct tcp_sock {
+ * socket. Used to retransmit SYNACKs etc.
+ */
+ struct request_sock *fastopen_rsk;
++
++ /* MPTCP/TCP-specific callbacks */
++ const struct tcp_sock_ops *ops;
++
++ struct mptcp_cb *mpcb;
++ struct sock *meta_sk;
++ /* We keep these flags even if CONFIG_MPTCP is not checked, because
++ * it allows checking MPTCP capability just by checking the mpc flag,
++ * rather than adding ifdefs everywhere.
++ */
++ u16 mpc:1, /* Other end is multipath capable */
++ inside_tk_table:1, /* Is the tcp_sock inside the token-table? */
++ send_mp_fclose:1,
++ request_mptcp:1, /* Did we send out an MP_CAPABLE?
++ * (this speeds up mptcp_doit() in tcp_recvmsg)
++ */
++ mptcp_enabled:1, /* Is MPTCP enabled from the application ? */
++ pf:1, /* Potentially Failed state: when this flag is set, we
++ * stop using the subflow
++ */
++ mp_killed:1, /* Killed with a tcp_done in mptcp? */
++ was_meta_sk:1, /* This was a meta sk (in case of reuse) */
++ is_master_sk,
++ close_it:1, /* Must close socket in mptcp_data_ready? */
++ closing:1;
++ struct mptcp_tcp_sock *mptcp;
++#ifdef CONFIG_MPTCP
++ struct hlist_nulls_node tk_table;
++ u32 mptcp_loc_token;
++ u64 mptcp_loc_key;
++#endif /* CONFIG_MPTCP */
+ };
+
+ enum tsq_flags {
+@@ -337,6 +415,8 @@ enum tsq_flags {
+ TCP_MTU_REDUCED_DEFERRED, /* tcp_v{4|6}_err() could not call
+ * tcp_v{4|6}_mtu_reduced()
+ */
++ MPTCP_PATH_MANAGER, /* MPTCP deferred creation of new subflows */
++ MPTCP_SUB_DEFERRED, /* A subflow got deferred - process them */
+ };
+
+ static inline struct tcp_sock *tcp_sk(const struct sock *sk)
+@@ -355,6 +435,7 @@ struct tcp_timewait_sock {
+ #ifdef CONFIG_TCP_MD5SIG
+ struct tcp_md5sig_key *tw_md5_key;
+ #endif
++ struct mptcp_tw *mptcp_tw;
+ };
+
+ static inline struct tcp_timewait_sock *tcp_twsk(const struct sock *sk)
+diff --git a/include/net/inet6_connection_sock.h b/include/net/inet6_connection_sock.h
+index 74af137304be..83f63033897a 100644
+--- a/include/net/inet6_connection_sock.h
++++ b/include/net/inet6_connection_sock.h
+@@ -27,6 +27,8 @@ int inet6_csk_bind_conflict(const struct sock *sk,
+
+ struct dst_entry *inet6_csk_route_req(struct sock *sk, struct flowi6 *fl6,
+ const struct request_sock *req);
++u32 inet6_synq_hash(const struct in6_addr *raddr, const __be16 rport,
++ const u32 rnd, const u32 synq_hsize);
+
+ struct request_sock *inet6_csk_search_req(const struct sock *sk,
+ struct request_sock ***prevp,
+diff --git a/include/net/inet_common.h b/include/net/inet_common.h
+index fe7994c48b75..780f229f46a8 100644
+--- a/include/net/inet_common.h
++++ b/include/net/inet_common.h
+@@ -1,6 +1,8 @@
+ #ifndef _INET_COMMON_H
+ #define _INET_COMMON_H
+
++#include <net/sock.h>
++
+ extern const struct proto_ops inet_stream_ops;
+ extern const struct proto_ops inet_dgram_ops;
+
+@@ -13,6 +15,8 @@ struct sock;
+ struct sockaddr;
+ struct socket;
+
++int inet_create(struct net *net, struct socket *sock, int protocol, int kern);
++int inet6_create(struct net *net, struct socket *sock, int protocol, int kern);
+ int inet_release(struct socket *sock);
+ int inet_stream_connect(struct socket *sock, struct sockaddr *uaddr,
+ int addr_len, int flags);
+diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
+index 7a4313887568..f62159e39839 100644
+--- a/include/net/inet_connection_sock.h
++++ b/include/net/inet_connection_sock.h
+@@ -30,6 +30,7 @@
+
+ struct inet_bind_bucket;
+ struct tcp_congestion_ops;
++struct tcp_options_received;
+
+ /*
+ * Pointers to address related TCP functions
+@@ -243,6 +244,9 @@ static inline void inet_csk_reset_xmit_timer(struct sock *sk, const int what,
+
+ struct sock *inet_csk_accept(struct sock *sk, int flags, int *err);
+
++u32 inet_synq_hash(const __be32 raddr, const __be16 rport, const u32 rnd,
++ const u32 synq_hsize);
++
+ struct request_sock *inet_csk_search_req(const struct sock *sk,
+ struct request_sock ***prevp,
+ const __be16 rport,
+diff --git a/include/net/inet_sock.h b/include/net/inet_sock.h
+index b1edf17bec01..6a32d8d6b85e 100644
+--- a/include/net/inet_sock.h
++++ b/include/net/inet_sock.h
+@@ -86,10 +86,14 @@ struct inet_request_sock {
+ wscale_ok : 1,
+ ecn_ok : 1,
+ acked : 1,
+- no_srccheck: 1;
++ no_srccheck: 1,
++ mptcp_rqsk : 1,
++ saw_mpc : 1;
+ kmemcheck_bitfield_end(flags);
+- struct ip_options_rcu *opt;
+- struct sk_buff *pktopts;
++ union {
++ struct ip_options_rcu *opt;
++ struct sk_buff *pktopts;
++ };
+ u32 ir_mark;
+ };
+
+diff --git a/include/net/mptcp.h b/include/net/mptcp.h
+new file mode 100644
+index 000000000000..712780fc39e4
+--- /dev/null
++++ b/include/net/mptcp.h
+@@ -0,0 +1,1439 @@
++/*
++ * MPTCP implementation
++ *
++ * Initial Design & Implementation:
++ * Sébastien Barré <sebastien.barre@uclouvain.be>
++ *
++ * Current Maintainer & Author:
++ * Christoph Paasch <christoph.paasch@uclouvain.be>
++ *
++ * Additional authors:
++ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
++ * Gregory Detal <gregory.detal@uclouvain.be>
++ * Fabien Duchêne <fabien.duchene@uclouvain.be>
++ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
++ * Lavkesh Lahngir <lavkesh51@gmail.com>
++ * Andreas Ripke <ripke@neclab.eu>
++ * Vlad Dogaru <vlad.dogaru@intel.com>
++ * Octavian Purdila <octavian.purdila@intel.com>
++ * John Ronan <jronan@tssg.org>
++ * Catalin Nicutar <catalin.nicutar@gmail.com>
++ * Brandon Heller <brandonh@stanford.edu>
++ *
++ *
++ * This program is free software; you can redistribute it and/or
++ * modify it under the terms of the GNU General Public License
++ * as published by the Free Software Foundation; either version
++ * 2 of the License, or (at your option) any later version.
++ */
++
++#ifndef _MPTCP_H
++#define _MPTCP_H
++
++#include <linux/inetdevice.h>
++#include <linux/ipv6.h>
++#include <linux/list.h>
++#include <linux/net.h>
++#include <linux/netpoll.h>
++#include <linux/skbuff.h>
++#include <linux/socket.h>
++#include <linux/tcp.h>
++#include <linux/kernel.h>
++
++#include <asm/byteorder.h>
++#include <asm/unaligned.h>
++#include <crypto/hash.h>
++#include <net/tcp.h>
++
++#if defined(__LITTLE_ENDIAN_BITFIELD)
++ #define ntohll(x) be64_to_cpu(x)
++ #define htonll(x) cpu_to_be64(x)
++#elif defined(__BIG_ENDIAN_BITFIELD)
++ #define ntohll(x) (x)
++ #define htonll(x) (x)
++#endif
++
++struct mptcp_loc4 {
++ u8 loc4_id;
++ u8 low_prio:1;
++ struct in_addr addr;
++};
++
++struct mptcp_rem4 {
++ u8 rem4_id;
++ __be16 port;
++ struct in_addr addr;
++};
++
++struct mptcp_loc6 {
++ u8 loc6_id;
++ u8 low_prio:1;
++ struct in6_addr addr;
++};
++
++struct mptcp_rem6 {
++ u8 rem6_id;
++ __be16 port;
++ struct in6_addr addr;
++};
++
++struct mptcp_request_sock {
++ struct tcp_request_sock req;
++ /* hlist-nulls entry to the hash-table. Depending on whether this is
++ * a new MPTCP connection or an additional subflow, the request-socket
++ * is either in the mptcp_reqsk_tk_htb or mptcp_reqsk_htb.
++ */
++ struct hlist_nulls_node hash_entry;
++
++ union {
++ struct {
++ /* Only on initial subflows */
++ u64 mptcp_loc_key;
++ u64 mptcp_rem_key;
++ u32 mptcp_loc_token;
++ };
++
++ struct {
++ /* Only on additional subflows */
++ struct mptcp_cb *mptcp_mpcb;
++ u32 mptcp_rem_nonce;
++ u32 mptcp_loc_nonce;
++ u64 mptcp_hash_tmac;
++ };
++ };
++
++ u8 loc_id;
++ u8 rem_id; /* Address-id in the MP_JOIN */
++ u8 dss_csum:1,
++ is_sub:1, /* Is this a new subflow? */
++ low_prio:1, /* Interface set to low-prio? */
++ rcv_low_prio:1;
++};
++
++struct mptcp_options_received {
++ u16 saw_mpc:1,
++ dss_csum:1,
++ drop_me:1,
++
++ is_mp_join:1,
++ join_ack:1,
++
++ saw_low_prio:2, /* 0x1 - low-prio set for this subflow
++ * 0x2 - low-prio set for another subflow
++ */
++ low_prio:1,
++
++ saw_add_addr:2, /* Saw at least one add_addr option:
++ * 0x1: IPv4 - 0x2: IPv6
++ */
++ more_add_addr:1, /* Saw one more add-addr. */
++
++ saw_rem_addr:1, /* Saw at least one rem_addr option */
++ more_rem_addr:1, /* Saw one more rem-addr. */
++
++ mp_fail:1,
++ mp_fclose:1;
++ u8 rem_id; /* Address-id in the MP_JOIN */
++ u8 prio_addr_id; /* Address-id in the MP_PRIO */
++
++ const unsigned char *add_addr_ptr; /* Pointer to add-address option */
++ const unsigned char *rem_addr_ptr; /* Pointer to rem-address option */
++
++ u32 data_ack;
++ u32 data_seq;
++ u16 data_len;
++
++ u32 mptcp_rem_token;/* Remote token */
++
++ /* Key inside the option (from mp_capable or fast_close) */
++ u64 mptcp_key;
++
++ u32 mptcp_recv_nonce;
++ u64 mptcp_recv_tmac;
++ u8 mptcp_recv_mac[20];
++};
++
++struct mptcp_tcp_sock {
++ struct tcp_sock *next; /* Next subflow socket */
++ struct hlist_node cb_list;
++ struct mptcp_options_received rx_opt;
++
++ /* Those three fields record the current mapping */
++ u64 map_data_seq;
++ u32 map_subseq;
++ u16 map_data_len;
++ u16 slave_sk:1,
++ fully_established:1,
++ establish_increased:1,
++ second_packet:1,
++ attached:1,
++ send_mp_fail:1,
++ include_mpc:1,
++ mapping_present:1,
++ map_data_fin:1,
++ low_prio:1, /* use this socket as backup */
++ rcv_low_prio:1, /* Peer sent low-prio option to us */
++ send_mp_prio:1, /* Trigger to send mp_prio on this socket */
++ pre_established:1; /* State between sending 3rd ACK and
++ * receiving the fourth ack of new subflows.
++ */
++
++ /* isn: needed to translate abs to relative subflow seqnums */
++ u32 snt_isn;
++ u32 rcv_isn;
++ u8 path_index;
++ u8 loc_id;
++ u8 rem_id;
++
++#define MPTCP_SCHED_SIZE 4
++ u8 mptcp_sched[MPTCP_SCHED_SIZE] __aligned(8);
++
++ struct sk_buff *shortcut_ofoqueue; /* Shortcut to the current modified
++ * skb in the ofo-queue.
++ */
++
++ int init_rcv_wnd;
++ u32 infinite_cutoff_seq;
++ struct delayed_work work;
++ u32 mptcp_loc_nonce;
++ struct tcp_sock *tp; /* Where is my daddy? */
++ u32 last_end_data_seq;
++
++ /* MP_JOIN subflow: timer for retransmitting the 3rd ack */
++ struct timer_list mptcp_ack_timer;
++
++ /* HMAC of the third ack */
++ char sender_mac[20];
++};
++
++struct mptcp_tw {
++ struct list_head list;
++ u64 loc_key;
++ u64 rcv_nxt;
++ struct mptcp_cb __rcu *mpcb;
++ u8 meta_tw:1,
++ in_list:1;
++};
++
++#define MPTCP_PM_NAME_MAX 16
++struct mptcp_pm_ops {
++ struct list_head list;
++
++ /* Signal the creation of a new MPTCP-session. */
++ void (*new_session)(const struct sock *meta_sk);
++ void (*release_sock)(struct sock *meta_sk);
++ void (*fully_established)(struct sock *meta_sk);
++ void (*new_remote_address)(struct sock *meta_sk);
++ int (*get_local_id)(sa_family_t family, union inet_addr *addr,
++ struct net *net, bool *low_prio);
++ void (*addr_signal)(struct sock *sk, unsigned *size,
++ struct tcp_out_options *opts, struct sk_buff *skb);
++ void (*add_raddr)(struct mptcp_cb *mpcb, const union inet_addr *addr,
++ sa_family_t family, __be16 port, u8 id);
++ void (*rem_raddr)(struct mptcp_cb *mpcb, u8 rem_id);
++ void (*init_subsocket_v4)(struct sock *sk, struct in_addr addr);
++ void (*init_subsocket_v6)(struct sock *sk, struct in6_addr addr);
++
++ char name[MPTCP_PM_NAME_MAX];
++ struct module *owner;
++};
++
++#define MPTCP_SCHED_NAME_MAX 16
++struct mptcp_sched_ops {
++ struct list_head list;
++
++ struct sock * (*get_subflow)(struct sock *meta_sk,
++ struct sk_buff *skb,
++ bool zero_wnd_test);
++ struct sk_buff * (*next_segment)(struct sock *meta_sk,
++ int *reinject,
++ struct sock **subsk,
++ unsigned int *limit);
++ void (*init)(struct sock *sk);
++
++ char name[MPTCP_SCHED_NAME_MAX];
++ struct module *owner;
++};
++
++struct mptcp_cb {
++ /* list of sockets in this multipath connection */
++ struct tcp_sock *connection_list;
++ /* list of sockets that need a call to release_cb */
++ struct hlist_head callback_list;
++
++ /* High-order bits of 64-bit sequence numbers */
++ u32 snd_high_order[2];
++ u32 rcv_high_order[2];
++
++ u16 send_infinite_mapping:1,
++ in_time_wait:1,
++ list_rcvd:1, /* XXX TO REMOVE */
++ addr_signal:1, /* Path-manager wants us to call addr_signal */
++ dss_csum:1,
++ server_side:1,
++ infinite_mapping_rcv:1,
++ infinite_mapping_snd:1,
++ dfin_combined:1, /* Was the DFIN combined with subflow-fin? */
++ passive_close:1,
++ snd_hiseq_index:1, /* Index in snd_high_order of snd_nxt */
++ rcv_hiseq_index:1; /* Index in rcv_high_order of rcv_nxt */
++
++ /* socket count in this connection */
++ u8 cnt_subflows;
++ u8 cnt_established;
++
++ struct mptcp_sched_ops *sched_ops;
++
++ struct sk_buff_head reinject_queue;
++ /* First cache-line boundary is here minus 8 bytes. But from the
++ * reinject-queue only the next and prev pointers are regularly
++ * accessed. Thus, the whole data-path is on a single cache-line.
++ */
++
++ u64 csum_cutoff_seq;
++
++ /***** Start of fields, used for connection closure */
++ spinlock_t tw_lock;
++ unsigned char mptw_state;
++ u8 dfin_path_index;
++
++ struct list_head tw_list;
++
++ /***** Start of fields, used for subflow establishment and closure */
++ atomic_t mpcb_refcnt;
++
++ /* Mutex needed, because otherwise mptcp_close will complain that the
++ * socket is owned by the user.
++ * E.g., mptcp_sub_close_wq is taking the meta-lock.
++ */
++ struct mutex mpcb_mutex;
++
++ /***** Start of fields, used for subflow establishment */
++ struct sock *meta_sk;
++
++ /* Master socket, also part of the connection_list, this
++ * socket is the one that the application sees.
++ */
++ struct sock *master_sk;
++
++ __u64 mptcp_loc_key;
++ __u64 mptcp_rem_key;
++ __u32 mptcp_loc_token;
++ __u32 mptcp_rem_token;
++
++#define MPTCP_PM_SIZE 608
++ u8 mptcp_pm[MPTCP_PM_SIZE] __aligned(8);
++ struct mptcp_pm_ops *pm_ops;
++
++ u32 path_index_bits;
++ /* Next pi to pick up in case a new path becomes available */
++ u8 next_path_index;
++
++ /* Original snd/rcvbuf of the initial subflow.
++ * Used for the new subflows on the server-side to allow correct
++ * autotuning
++ */
++ int orig_sk_rcvbuf;
++ int orig_sk_sndbuf;
++ u32 orig_window_clamp;
++
++ /* Timer for retransmitting SYN/ACK+MP_JOIN */
++ struct timer_list synack_timer;
++};
++
++#define MPTCP_SUB_CAPABLE 0
++#define MPTCP_SUB_LEN_CAPABLE_SYN 12
++#define MPTCP_SUB_LEN_CAPABLE_SYN_ALIGN 12
++#define MPTCP_SUB_LEN_CAPABLE_ACK 20
++#define MPTCP_SUB_LEN_CAPABLE_ACK_ALIGN 20
++
++#define MPTCP_SUB_JOIN 1
++#define MPTCP_SUB_LEN_JOIN_SYN 12
++#define MPTCP_SUB_LEN_JOIN_SYN_ALIGN 12
++#define MPTCP_SUB_LEN_JOIN_SYNACK 16
++#define MPTCP_SUB_LEN_JOIN_SYNACK_ALIGN 16
++#define MPTCP_SUB_LEN_JOIN_ACK 24
++#define MPTCP_SUB_LEN_JOIN_ACK_ALIGN 24
++
++#define MPTCP_SUB_DSS 2
++#define MPTCP_SUB_LEN_DSS 4
++#define MPTCP_SUB_LEN_DSS_ALIGN 4
++
++/* Lengths for seq and ack are the ones without the generic MPTCP-option header,
++ * as they are part of the DSS-option.
++ * To get the total length, just add the different options together.
++ */
++#define MPTCP_SUB_LEN_SEQ 10
++#define MPTCP_SUB_LEN_SEQ_CSUM 12
++#define MPTCP_SUB_LEN_SEQ_ALIGN 12
++
++#define MPTCP_SUB_LEN_SEQ_64 14
++#define MPTCP_SUB_LEN_SEQ_CSUM_64 16
++#define MPTCP_SUB_LEN_SEQ_64_ALIGN 16
++
++#define MPTCP_SUB_LEN_ACK 4
++#define MPTCP_SUB_LEN_ACK_ALIGN 4
++
++#define MPTCP_SUB_LEN_ACK_64 8
++#define MPTCP_SUB_LEN_ACK_64_ALIGN 8
++
++/* This is the "default" option-length we will send out most often.
++ * MPTCP DSS-header
++ * 32-bit data sequence number
++ * 32-bit data ack
++ *
++ * It is necessary to calculate the effective MSS we will be using when
++ * sending data.
++ */
++#define MPTCP_SUB_LEN_DSM_ALIGN (MPTCP_SUB_LEN_DSS_ALIGN + \
++ MPTCP_SUB_LEN_SEQ_ALIGN + \
++ MPTCP_SUB_LEN_ACK_ALIGN)
++
++#define MPTCP_SUB_ADD_ADDR 3
++#define MPTCP_SUB_LEN_ADD_ADDR4 8
++#define MPTCP_SUB_LEN_ADD_ADDR6 20
++#define MPTCP_SUB_LEN_ADD_ADDR4_ALIGN 8
++#define MPTCP_SUB_LEN_ADD_ADDR6_ALIGN 20
++
++#define MPTCP_SUB_REMOVE_ADDR 4
++#define MPTCP_SUB_LEN_REMOVE_ADDR 4
++
++#define MPTCP_SUB_PRIO 5
++#define MPTCP_SUB_LEN_PRIO 3
++#define MPTCP_SUB_LEN_PRIO_ADDR 4
++#define MPTCP_SUB_LEN_PRIO_ALIGN 4
++
++#define MPTCP_SUB_FAIL 6
++#define MPTCP_SUB_LEN_FAIL 12
++#define MPTCP_SUB_LEN_FAIL_ALIGN 12
++
++#define MPTCP_SUB_FCLOSE 7
++#define MPTCP_SUB_LEN_FCLOSE 12
++#define MPTCP_SUB_LEN_FCLOSE_ALIGN 12
++
++
++#define OPTION_MPTCP (1 << 5)
++
++#ifdef CONFIG_MPTCP
++
++/* Used for checking if the mptcp initialization has been successful */
++extern bool mptcp_init_failed;
++
++/* MPTCP options */
++#define OPTION_TYPE_SYN (1 << 0)
++#define OPTION_TYPE_SYNACK (1 << 1)
++#define OPTION_TYPE_ACK (1 << 2)
++#define OPTION_MP_CAPABLE (1 << 3)
++#define OPTION_DATA_ACK (1 << 4)
++#define OPTION_ADD_ADDR (1 << 5)
++#define OPTION_MP_JOIN (1 << 6)
++#define OPTION_MP_FAIL (1 << 7)
++#define OPTION_MP_FCLOSE (1 << 8)
++#define OPTION_REMOVE_ADDR (1 << 9)
++#define OPTION_MP_PRIO (1 << 10)
++
++/* MPTCP flags: both TX and RX */
++#define MPTCPHDR_SEQ 0x01 /* DSS.M option is present */
++#define MPTCPHDR_FIN 0x02 /* DSS.F option is present */
++#define MPTCPHDR_SEQ64_INDEX 0x04 /* index of seq in mpcb->snd_high_order */
++/* MPTCP flags: RX only */
++#define MPTCPHDR_ACK 0x08
++#define MPTCPHDR_SEQ64_SET 0x10 /* Did we receive a 64-bit seq number? */
++#define MPTCPHDR_SEQ64_OFO 0x20 /* Is it not in our circular array? */
++#define MPTCPHDR_DSS_CSUM 0x40
++#define MPTCPHDR_JOIN 0x80
++/* MPTCP flags: TX only */
++#define MPTCPHDR_INF 0x08
++
++struct mptcp_option {
++ __u8 kind;
++ __u8 len;
++#if defined(__LITTLE_ENDIAN_BITFIELD)
++ __u8 ver:4,
++ sub:4;
++#elif defined(__BIG_ENDIAN_BITFIELD)
++ __u8 sub:4,
++ ver:4;
++#else
++#error "Adjust your <asm/byteorder.h> defines"
++#endif
++};
++
++struct mp_capable {
++ __u8 kind;
++ __u8 len;
++#if defined(__LITTLE_ENDIAN_BITFIELD)
++ __u8 ver:4,
++ sub:4;
++ __u8 h:1,
++ rsv:5,
++ b:1,
++ a:1;
++#elif defined(__BIG_ENDIAN_BITFIELD)
++ __u8 sub:4,
++ ver:4;
++ __u8 a:1,
++ b:1,
++ rsv:5,
++ h:1;
++#else
++#error "Adjust your <asm/byteorder.h> defines"
++#endif
++ __u64 sender_key;
++ __u64 receiver_key;
++} __attribute__((__packed__));
++
++struct mp_join {
++ __u8 kind;
++ __u8 len;
++#if defined(__LITTLE_ENDIAN_BITFIELD)
++ __u8 b:1,
++ rsv:3,
++ sub:4;
++#elif defined(__BIG_ENDIAN_BITFIELD)
++ __u8 sub:4,
++ rsv:3,
++ b:1;
++#else
++#error "Adjust your <asm/byteorder.h> defines"
++#endif
++ __u8 addr_id;
++ union {
++ struct {
++ u32 token;
++ u32 nonce;
++ } syn;
++ struct {
++ __u64 mac;
++ u32 nonce;
++ } synack;
++ struct {
++ __u8 mac[20];
++ } ack;
++ } u;
++} __attribute__((__packed__));
++
++struct mp_dss {
++ __u8 kind;
++ __u8 len;
++#if defined(__LITTLE_ENDIAN_BITFIELD)
++ __u16 rsv1:4,
++ sub:4,
++ A:1,
++ a:1,
++ M:1,
++ m:1,
++ F:1,
++ rsv2:3;
++#elif defined(__BIG_ENDIAN_BITFIELD)
++ __u16 sub:4,
++ rsv1:4,
++ rsv2:3,
++ F:1,
++ m:1,
++ M:1,
++ a:1,
++ A:1;
++#else
++#error "Adjust your <asm/byteorder.h> defines"
++#endif
++};
++
++struct mp_add_addr {
++ __u8 kind;
++ __u8 len;
++#if defined(__LITTLE_ENDIAN_BITFIELD)
++ __u8 ipver:4,
++ sub:4;
++#elif defined(__BIG_ENDIAN_BITFIELD)
++ __u8 sub:4,
++ ipver:4;
++#else
++#error "Adjust your <asm/byteorder.h> defines"
++#endif
++ __u8 addr_id;
++ union {
++ struct {
++ struct in_addr addr;
++ __be16 port;
++ } v4;
++ struct {
++ struct in6_addr addr;
++ __be16 port;
++ } v6;
++ } u;
++} __attribute__((__packed__));
++
++struct mp_remove_addr {
++ __u8 kind;
++ __u8 len;
++#if defined(__LITTLE_ENDIAN_BITFIELD)
++ __u8 rsv:4,
++ sub:4;
++#elif defined(__BIG_ENDIAN_BITFIELD)
++ __u8 sub:4,
++ rsv:4;
++#else
++#error "Adjust your <asm/byteorder.h> defines"
++#endif
++ /* list of addr_id */
++ __u8 addrs_id;
++};
++
++struct mp_fail {
++ __u8 kind;
++ __u8 len;
++#if defined(__LITTLE_ENDIAN_BITFIELD)
++ __u16 rsv1:4,
++ sub:4,
++ rsv2:8;
++#elif defined(__BIG_ENDIAN_BITFIELD)
++ __u16 sub:4,
++ rsv1:4,
++ rsv2:8;
++#else
++#error "Adjust your <asm/byteorder.h> defines"
++#endif
++ __be64 data_seq;
++} __attribute__((__packed__));
++
++struct mp_fclose {
++ __u8 kind;
++ __u8 len;
++#if defined(__LITTLE_ENDIAN_BITFIELD)
++ __u16 rsv1:4,
++ sub:4,
++ rsv2:8;
++#elif defined(__BIG_ENDIAN_BITFIELD)
++ __u16 sub:4,
++ rsv1:4,
++ rsv2:8;
++#else
++#error "Adjust your <asm/byteorder.h> defines"
++#endif
++ __u64 key;
++} __attribute__((__packed__));
++
++struct mp_prio {
++ __u8 kind;
++ __u8 len;
++#if defined(__LITTLE_ENDIAN_BITFIELD)
++ __u8 b:1,
++ rsv:3,
++ sub:4;
++#elif defined(__BIG_ENDIAN_BITFIELD)
++ __u8 sub:4,
++ rsv:3,
++ b:1;
++#else
++#error "Adjust your <asm/byteorder.h> defines"
++#endif
++ __u8 addr_id;
++} __attribute__((__packed__));
++
++static inline int mptcp_sub_len_dss(const struct mp_dss *m, const int csum)
++{
++ return 4 + m->A * (4 + m->a * 4) + m->M * (10 + m->m * 4 + csum * 2);
++}
++
++#define MPTCP_APP 2
++
++extern int sysctl_mptcp_enabled;
++extern int sysctl_mptcp_checksum;
++extern int sysctl_mptcp_debug;
++extern int sysctl_mptcp_syn_retries;
++
++extern struct workqueue_struct *mptcp_wq;
++
++#define mptcp_debug(fmt, args...) \
++ do { \
++ if (unlikely(sysctl_mptcp_debug)) \
++ pr_err(__FILE__ ": " fmt, ##args); \
++ } while (0)
++
++/* Iterates over all subflows */
++#define mptcp_for_each_tp(mpcb, tp) \
++ for ((tp) = (mpcb)->connection_list; (tp); (tp) = (tp)->mptcp->next)
++
++#define mptcp_for_each_sk(mpcb, sk) \
++ for ((sk) = (struct sock *)(mpcb)->connection_list; \
++ sk; \
++ sk = (struct sock *)tcp_sk(sk)->mptcp->next)
++
++#define mptcp_for_each_sk_safe(__mpcb, __sk, __temp) \
++ for (__sk = (struct sock *)(__mpcb)->connection_list, \
++ __temp = __sk ? (struct sock *)tcp_sk(__sk)->mptcp->next : NULL; \
++ __sk; \
++ __sk = __temp, \
++ __temp = __sk ? (struct sock *)tcp_sk(__sk)->mptcp->next : NULL)
++
++/* Iterates over all bit set to 1 in a bitset */
++#define mptcp_for_each_bit_set(b, i) \
++ for (i = ffs(b) - 1; i >= 0; i = ffs(b >> (i + 1) << (i + 1)) - 1)
++
++#define mptcp_for_each_bit_unset(b, i) \
++ mptcp_for_each_bit_set(~b, i)
++
++extern struct lock_class_key meta_key;
++extern struct lock_class_key meta_slock_key;
++extern u32 mptcp_secret[MD5_MESSAGE_BYTES / 4];
++
++/* This is needed to ensure that two subsequent key/nonce-generation result in
++ * different keys/nonces if the IPs and ports are the same.
++ */
++extern u32 mptcp_seed;
++
++#define MPTCP_HASH_SIZE 1024
++
++extern struct hlist_nulls_head tk_hashtable[MPTCP_HASH_SIZE];
++
++/* This second hashtable is needed to retrieve request socks
++ * created as a result of a join request. While the SYN contains
++ * the token, the final ack does not, so we need a separate hashtable
++ * to retrieve the mpcb.
++ */
++extern struct hlist_nulls_head mptcp_reqsk_htb[MPTCP_HASH_SIZE];
++extern spinlock_t mptcp_reqsk_hlock; /* hashtable protection */
++
++/* Lock, protecting the two hash-tables that hold the token. Namely,
++ * mptcp_reqsk_tk_htb and tk_hashtable
++ */
++extern spinlock_t mptcp_tk_hashlock; /* hashtable protection */
++
++/* Request-sockets can be hashed in the tk_htb for collision-detection or in
++ * the regular htb for join-connections. We need to define different NULLS
++ * values so that we can correctly detect a request-socket that has been
++ * recycled. See also c25eb3bfb9729.
++ */
++#define MPTCP_REQSK_NULLS_BASE (1U << 29)
++
++
++void mptcp_data_ready(struct sock *sk);
++void mptcp_write_space(struct sock *sk);
++
++void mptcp_add_meta_ofo_queue(const struct sock *meta_sk, struct sk_buff *skb,
++ struct sock *sk);
++void mptcp_ofo_queue(struct sock *meta_sk);
++void mptcp_purge_ofo_queue(struct tcp_sock *meta_tp);
++void mptcp_cleanup_rbuf(struct sock *meta_sk, int copied);
++int mptcp_add_sock(struct sock *meta_sk, struct sock *sk, u8 loc_id, u8 rem_id,
++ gfp_t flags);
++void mptcp_del_sock(struct sock *sk);
++void mptcp_update_metasocket(struct sock *sock, const struct sock *meta_sk);
++void mptcp_reinject_data(struct sock *orig_sk, int clone_it);
++void mptcp_update_sndbuf(const struct tcp_sock *tp);
++void mptcp_send_fin(struct sock *meta_sk);
++void mptcp_send_active_reset(struct sock *meta_sk, gfp_t priority);
++bool mptcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
++ int push_one, gfp_t gfp);
++void tcp_parse_mptcp_options(const struct sk_buff *skb,
++ struct mptcp_options_received *mopt);
++void mptcp_parse_options(const uint8_t *ptr, int opsize,
++ struct mptcp_options_received *mopt,
++ const struct sk_buff *skb);
++void mptcp_syn_options(const struct sock *sk, struct tcp_out_options *opts,
++ unsigned *remaining);
++void mptcp_synack_options(struct request_sock *req,
++ struct tcp_out_options *opts,
++ unsigned *remaining);
++void mptcp_established_options(struct sock *sk, struct sk_buff *skb,
++ struct tcp_out_options *opts, unsigned *size);
++void mptcp_options_write(__be32 *ptr, struct tcp_sock *tp,
++ const struct tcp_out_options *opts,
++ struct sk_buff *skb);
++void mptcp_close(struct sock *meta_sk, long timeout);
++int mptcp_doit(struct sock *sk);
++int mptcp_create_master_sk(struct sock *meta_sk, __u64 remote_key, u32 window);
++int mptcp_check_req_fastopen(struct sock *child, struct request_sock *req);
++int mptcp_check_req_master(struct sock *sk, struct sock *child,
++ struct request_sock *req,
++ struct request_sock **prev);
++struct sock *mptcp_check_req_child(struct sock *sk, struct sock *child,
++ struct request_sock *req,
++ struct request_sock **prev,
++ const struct mptcp_options_received *mopt);
++u32 __mptcp_select_window(struct sock *sk);
++void mptcp_select_initial_window(int __space, __u32 mss, __u32 *rcv_wnd,
++ __u32 *window_clamp, int wscale_ok,
++ __u8 *rcv_wscale, __u32 init_rcv_wnd,
++ const struct sock *sk);
++unsigned int mptcp_current_mss(struct sock *meta_sk);
++int mptcp_select_size(const struct sock *meta_sk, bool sg);
++void mptcp_key_sha1(u64 key, u32 *token, u64 *idsn);
++void mptcp_hmac_sha1(u8 *key_1, u8 *key_2, u8 *rand_1, u8 *rand_2,
++ u32 *hash_out);
++void mptcp_clean_rtx_infinite(const struct sk_buff *skb, struct sock *sk);
++void mptcp_fin(struct sock *meta_sk);
++void mptcp_retransmit_timer(struct sock *meta_sk);
++int mptcp_write_wakeup(struct sock *meta_sk);
++void mptcp_sub_close_wq(struct work_struct *work);
++void mptcp_sub_close(struct sock *sk, unsigned long delay);
++struct sock *mptcp_select_ack_sock(const struct sock *meta_sk);
++void mptcp_fallback_meta_sk(struct sock *meta_sk);
++int mptcp_backlog_rcv(struct sock *meta_sk, struct sk_buff *skb);
++void mptcp_ack_handler(unsigned long);
++int mptcp_check_rtt(const struct tcp_sock *tp, int time);
++int mptcp_check_snd_buf(const struct tcp_sock *tp);
++int mptcp_handle_options(struct sock *sk, const struct tcphdr *th,
++ const struct sk_buff *skb);
++void __init mptcp_init(void);
++int mptcp_trim_head(struct sock *sk, struct sk_buff *skb, u32 len);
++void mptcp_destroy_sock(struct sock *sk);
++int mptcp_rcv_synsent_state_process(struct sock *sk, struct sock **skptr,
++ const struct sk_buff *skb,
++ const struct mptcp_options_received *mopt);
++unsigned int mptcp_xmit_size_goal(const struct sock *meta_sk, u32 mss_now,
++ int large_allowed);
++int mptcp_init_tw_sock(struct sock *sk, struct tcp_timewait_sock *tw);
++void mptcp_twsk_destructor(struct tcp_timewait_sock *tw);
++void mptcp_time_wait(struct sock *sk, int state, int timeo);
++void mptcp_disconnect(struct sock *sk);
++bool mptcp_should_expand_sndbuf(const struct sock *sk);
++int mptcp_retransmit_skb(struct sock *meta_sk, struct sk_buff *skb);
++void mptcp_tsq_flags(struct sock *sk);
++void mptcp_tsq_sub_deferred(struct sock *meta_sk);
++struct mp_join *mptcp_find_join(const struct sk_buff *skb);
++void mptcp_hash_remove_bh(struct tcp_sock *meta_tp);
++void mptcp_hash_remove(struct tcp_sock *meta_tp);
++struct sock *mptcp_hash_find(const struct net *net, const u32 token);
++int mptcp_lookup_join(struct sk_buff *skb, struct inet_timewait_sock *tw);
++int mptcp_do_join_short(struct sk_buff *skb,
++ const struct mptcp_options_received *mopt,
++ struct net *net);
++void mptcp_reqsk_destructor(struct request_sock *req);
++void mptcp_reqsk_new_mptcp(struct request_sock *req,
++ const struct mptcp_options_received *mopt,
++ const struct sk_buff *skb);
++int mptcp_check_req(struct sk_buff *skb, struct net *net);
++void mptcp_connect_init(struct sock *sk);
++void mptcp_sub_force_close(struct sock *sk);
++int mptcp_sub_len_remove_addr_align(u16 bitfield);
++void mptcp_remove_shortcuts(const struct mptcp_cb *mpcb,
++ const struct sk_buff *skb);
++void mptcp_init_buffer_space(struct sock *sk);
++void mptcp_join_reqsk_init(struct mptcp_cb *mpcb, const struct request_sock *req,
++ struct sk_buff *skb);
++void mptcp_reqsk_init(struct request_sock *req, const struct sk_buff *skb);
++int mptcp_conn_request(struct sock *sk, struct sk_buff *skb);
++void mptcp_init_congestion_control(struct sock *sk);
++
++/* MPTCP-path-manager registration/initialization functions */
++int mptcp_register_path_manager(struct mptcp_pm_ops *pm);
++void mptcp_unregister_path_manager(struct mptcp_pm_ops *pm);
++void mptcp_init_path_manager(struct mptcp_cb *mpcb);
++void mptcp_cleanup_path_manager(struct mptcp_cb *mpcb);
++void mptcp_fallback_default(struct mptcp_cb *mpcb);
++void mptcp_get_default_path_manager(char *name);
++int mptcp_set_default_path_manager(const char *name);
++extern struct mptcp_pm_ops mptcp_pm_default;
++
++/* MPTCP-scheduler registration/initialization functions */
++int mptcp_register_scheduler(struct mptcp_sched_ops *sched);
++void mptcp_unregister_scheduler(struct mptcp_sched_ops *sched);
++void mptcp_init_scheduler(struct mptcp_cb *mpcb);
++void mptcp_cleanup_scheduler(struct mptcp_cb *mpcb);
++void mptcp_get_default_scheduler(char *name);
++int mptcp_set_default_scheduler(const char *name);
++extern struct mptcp_sched_ops mptcp_sched_default;
++
++static inline void mptcp_reset_synack_timer(struct sock *meta_sk,
++ unsigned long len)
++{
++ sk_reset_timer(meta_sk, &tcp_sk(meta_sk)->mpcb->synack_timer,
++ jiffies + len);
++}
++
++static inline void mptcp_delete_synack_timer(struct sock *meta_sk)
++{
++ sk_stop_timer(meta_sk, &tcp_sk(meta_sk)->mpcb->synack_timer);
++}
++
++static inline bool is_mptcp_enabled(const struct sock *sk)
++{
++ if (!sysctl_mptcp_enabled || mptcp_init_failed)
++ return false;
++
++ if (sysctl_mptcp_enabled == MPTCP_APP && !tcp_sk(sk)->mptcp_enabled)
++ return false;
++
++ return true;
++}
++
++static inline int mptcp_pi_to_flag(int pi)
++{
++ return 1 << (pi - 1);
++}
++
++static inline
++struct mptcp_request_sock *mptcp_rsk(const struct request_sock *req)
++{
++ return (struct mptcp_request_sock *)req;
++}
++
++static inline
++struct request_sock *rev_mptcp_rsk(const struct mptcp_request_sock *req)
++{
++ return (struct request_sock *)req;
++}
++
++static inline bool mptcp_can_sendpage(struct sock *sk)
++{
++ struct sock *sk_it;
++
++ if (tcp_sk(sk)->mpcb->dss_csum)
++ return false;
++
++ mptcp_for_each_sk(tcp_sk(sk)->mpcb, sk_it) {
++ if (!(sk_it->sk_route_caps & NETIF_F_SG) ||
++ !(sk_it->sk_route_caps & NETIF_F_ALL_CSUM))
++ return false;
++ }
++
++ return true;
++}
++
++static inline void mptcp_push_pending_frames(struct sock *meta_sk)
++{
++ /* We check packets out and send-head here. TCP only checks the
++ * send-head. But, MPTCP also checks packets_out, as this is an
++ * indication that we might want to do opportunistic reinjection.
++ */
++ if (tcp_sk(meta_sk)->packets_out || tcp_send_head(meta_sk)) {
++ struct tcp_sock *tp = tcp_sk(meta_sk);
++
++ /* We don't care about the MSS, because it will be set in
++ * mptcp_write_xmit.
++ */
++ __tcp_push_pending_frames(meta_sk, 0, tp->nonagle);
++ }
++}
++
++static inline void mptcp_send_reset(struct sock *sk)
++{
++ tcp_sk(sk)->ops->send_active_reset(sk, GFP_ATOMIC);
++ mptcp_sub_force_close(sk);
++}
++
++static inline bool mptcp_is_data_seq(const struct sk_buff *skb)
++{
++ return TCP_SKB_CB(skb)->mptcp_flags & MPTCPHDR_SEQ;
++}
++
++static inline bool mptcp_is_data_fin(const struct sk_buff *skb)
++{
++ return TCP_SKB_CB(skb)->mptcp_flags & MPTCPHDR_FIN;
++}
++
++/* Is it a data-fin while in infinite mapping mode?
++ * In infinite mode, a subflow-fin is in fact a data-fin.
++ */
++static inline bool mptcp_is_data_fin2(const struct sk_buff *skb,
++ const struct tcp_sock *tp)
++{
++ return mptcp_is_data_fin(skb) ||
++ (tp->mpcb->infinite_mapping_rcv && tcp_hdr(skb)->fin);
++}
++
++static inline u8 mptcp_get_64_bit(u64 data_seq, struct mptcp_cb *mpcb)
++{
++ u64 data_seq_high = (u32)(data_seq >> 32);
++
++ if (mpcb->rcv_high_order[0] == data_seq_high)
++ return 0;
++ else if (mpcb->rcv_high_order[1] == data_seq_high)
++ return MPTCPHDR_SEQ64_INDEX;
++ else
++ return MPTCPHDR_SEQ64_OFO;
++}
++
++/* Sets the data_seq and returns a pointer to the in-skb field of the data_seq.
++ * If the packet has a 64-bit dseq, the pointer points to the last 32 bits.
++ */
++static inline __u32 *mptcp_skb_set_data_seq(const struct sk_buff *skb,
++ u32 *data_seq,
++ struct mptcp_cb *mpcb)
++{
++ __u32 *ptr = (__u32 *)(skb_transport_header(skb) + TCP_SKB_CB(skb)->dss_off);
++
++ if (TCP_SKB_CB(skb)->mptcp_flags & MPTCPHDR_SEQ64_SET) {
++ u64 data_seq64 = get_unaligned_be64(ptr);
++
++ if (mpcb)
++ TCP_SKB_CB(skb)->mptcp_flags |= mptcp_get_64_bit(data_seq64, mpcb);
++
++ *data_seq = (u32)data_seq64;
++ ptr++;
++ } else {
++ *data_seq = get_unaligned_be32(ptr);
++ }
++
++ return ptr;
++}
++
++static inline struct sock *mptcp_meta_sk(const struct sock *sk)
++{
++ return tcp_sk(sk)->meta_sk;
++}
++
++static inline struct tcp_sock *mptcp_meta_tp(const struct tcp_sock *tp)
++{
++ return tcp_sk(tp->meta_sk);
++}
++
++static inline int is_meta_tp(const struct tcp_sock *tp)
++{
++ return tp->mpcb && mptcp_meta_tp(tp) == tp;
++}
++
++static inline int is_meta_sk(const struct sock *sk)
++{
++ return sk->sk_type == SOCK_STREAM && sk->sk_protocol == IPPROTO_TCP &&
++ mptcp(tcp_sk(sk)) && mptcp_meta_sk(sk) == sk;
++}
++
++static inline int is_master_tp(const struct tcp_sock *tp)
++{
++ return !mptcp(tp) || (!tp->mptcp->slave_sk && !is_meta_tp(tp));
++}
++
++static inline void mptcp_hash_request_remove(struct request_sock *req)
++{
++ int in_softirq = 0;
++
++ if (hlist_nulls_unhashed(&mptcp_rsk(req)->hash_entry))
++ return;
++
++ if (in_softirq()) {
++ spin_lock(&mptcp_reqsk_hlock);
++ in_softirq = 1;
++ } else {
++ spin_lock_bh(&mptcp_reqsk_hlock);
++ }
++
++ hlist_nulls_del_init_rcu(&mptcp_rsk(req)->hash_entry);
++
++ if (in_softirq)
++ spin_unlock(&mptcp_reqsk_hlock);
++ else
++ spin_unlock_bh(&mptcp_reqsk_hlock);
++}
++
++static inline void mptcp_init_mp_opt(struct mptcp_options_received *mopt)
++{
++ mopt->saw_mpc = 0;
++ mopt->dss_csum = 0;
++ mopt->drop_me = 0;
++
++ mopt->is_mp_join = 0;
++ mopt->join_ack = 0;
++
++ mopt->saw_low_prio = 0;
++ mopt->low_prio = 0;
++
++ mopt->saw_add_addr = 0;
++ mopt->more_add_addr = 0;
++
++ mopt->saw_rem_addr = 0;
++ mopt->more_rem_addr = 0;
++
++ mopt->mp_fail = 0;
++ mopt->mp_fclose = 0;
++}
++
++static inline void mptcp_reset_mopt(struct tcp_sock *tp)
++{
++ struct mptcp_options_received *mopt = &tp->mptcp->rx_opt;
++
++ mopt->saw_low_prio = 0;
++ mopt->saw_add_addr = 0;
++ mopt->more_add_addr = 0;
++ mopt->saw_rem_addr = 0;
++ mopt->more_rem_addr = 0;
++ mopt->join_ack = 0;
++ mopt->mp_fail = 0;
++ mopt->mp_fclose = 0;
++}
++
++static inline __be32 mptcp_get_highorder_sndbits(const struct sk_buff *skb,
++ const struct mptcp_cb *mpcb)
++{
++ return htonl(mpcb->snd_high_order[(TCP_SKB_CB(skb)->mptcp_flags &
++ MPTCPHDR_SEQ64_INDEX) ? 1 : 0]);
++}
++
++static inline u64 mptcp_get_data_seq_64(const struct mptcp_cb *mpcb, int index,
++ u32 data_seq_32)
++{
++ return ((u64)mpcb->rcv_high_order[index] << 32) | data_seq_32;
++}
++
++static inline u64 mptcp_get_rcv_nxt_64(const struct tcp_sock *meta_tp)
++{
++ struct mptcp_cb *mpcb = meta_tp->mpcb;
++ return mptcp_get_data_seq_64(mpcb, mpcb->rcv_hiseq_index,
++ meta_tp->rcv_nxt);
++}
++
++static inline void mptcp_check_sndseq_wrap(struct tcp_sock *meta_tp, int inc)
++{
++ if (unlikely(meta_tp->snd_nxt > meta_tp->snd_nxt + inc)) {
++ struct mptcp_cb *mpcb = meta_tp->mpcb;
++ mpcb->snd_hiseq_index = mpcb->snd_hiseq_index ? 0 : 1;
++ mpcb->snd_high_order[mpcb->snd_hiseq_index] += 2;
++ }
++}
++
++static inline void mptcp_check_rcvseq_wrap(struct tcp_sock *meta_tp,
++ u32 old_rcv_nxt)
++{
++ if (unlikely(old_rcv_nxt > meta_tp->rcv_nxt)) {
++ struct mptcp_cb *mpcb = meta_tp->mpcb;
++ mpcb->rcv_high_order[mpcb->rcv_hiseq_index] += 2;
++ mpcb->rcv_hiseq_index = mpcb->rcv_hiseq_index ? 0 : 1;
++ }
++}
++
++static inline int mptcp_sk_can_send(const struct sock *sk)
++{
++ return tcp_passive_fastopen(sk) ||
++ ((1 << sk->sk_state) & (TCPF_ESTABLISHED | TCPF_CLOSE_WAIT) &&
++ !tcp_sk(sk)->mptcp->pre_established);
++}
++
++static inline int mptcp_sk_can_recv(const struct sock *sk)
++{
++ return (1 << sk->sk_state) & (TCPF_ESTABLISHED | TCPF_FIN_WAIT1 | TCPF_FIN_WAIT2);
++}
++
++static inline int mptcp_sk_can_send_ack(const struct sock *sk)
++{
++ return !((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV |
++ TCPF_CLOSE | TCPF_LISTEN)) &&
++ !tcp_sk(sk)->mptcp->pre_established;
++}
++
++/* Only support GSO if all subflows support it */
++static inline bool mptcp_sk_can_gso(const struct sock *meta_sk)
++{
++ struct sock *sk;
++
++ if (tcp_sk(meta_sk)->mpcb->dss_csum)
++ return false;
++
++ mptcp_for_each_sk(tcp_sk(meta_sk)->mpcb, sk) {
++ if (!mptcp_sk_can_send(sk))
++ continue;
++ if (!sk_can_gso(sk))
++ return false;
++ }
++ return true;
++}
++
++static inline bool mptcp_can_sg(const struct sock *meta_sk)
++{
++ struct sock *sk;
++
++ if (tcp_sk(meta_sk)->mpcb->dss_csum)
++ return false;
++
++ mptcp_for_each_sk(tcp_sk(meta_sk)->mpcb, sk) {
++ if (!mptcp_sk_can_send(sk))
++ continue;
++ if (!(sk->sk_route_caps & NETIF_F_SG))
++ return false;
++ }
++ return true;
++}
++
++static inline void mptcp_set_rto(struct sock *sk)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct sock *sk_it;
++ struct inet_connection_sock *micsk = inet_csk(mptcp_meta_sk(sk));
++ __u32 max_rto = 0;
++
++ /* We are in recovery-phase on the MPTCP-level. Do not update the
++ * RTO, because this would kill exponential backoff.
++ */
++ if (micsk->icsk_retransmits)
++ return;
++
++ mptcp_for_each_sk(tp->mpcb, sk_it) {
++ if (mptcp_sk_can_send(sk_it) &&
++ inet_csk(sk_it)->icsk_rto > max_rto)
++ max_rto = inet_csk(sk_it)->icsk_rto;
++ }
++ if (max_rto) {
++ micsk->icsk_rto = max_rto << 1;
++
++ /* A successful rto-measurement - reset backoff counter */
++ micsk->icsk_backoff = 0;
++ }
++}
++
++static inline int mptcp_sysctl_syn_retries(void)
++{
++ return sysctl_mptcp_syn_retries;
++}
++
++static inline void mptcp_sub_close_passive(struct sock *sk)
++{
++ struct sock *meta_sk = mptcp_meta_sk(sk);
++ struct tcp_sock *tp = tcp_sk(sk), *meta_tp = tcp_sk(meta_sk);
++
++ /* Only close if the app did a send-shutdown (passive close), and we
++ * received the data-ack of the data-fin.
++ */
++ if (tp->mpcb->passive_close && meta_tp->snd_una == meta_tp->write_seq)
++ mptcp_sub_close(sk, 0);
++}
++
++static inline bool mptcp_fallback_infinite(struct sock *sk, int flag)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++
++ /* If data has been acknowledged on the meta-level, fully_established
++ * will have been set before and thus we will not fall back to infinite
++ * mapping.
++ */
++ if (likely(tp->mptcp->fully_established))
++ return false;
++
++ if (!(flag & MPTCP_FLAG_DATA_ACKED))
++ return false;
++
++ /* Don't fall back twice ;) */
++ if (tp->mpcb->infinite_mapping_snd)
++ return false;
++
++ pr_err("%s %#x will fallback - pi %d, src %pI4 dst %pI4 from %pS\n",
++ __func__, tp->mpcb->mptcp_loc_token, tp->mptcp->path_index,
++ &inet_sk(sk)->inet_saddr, &inet_sk(sk)->inet_daddr,
++ __builtin_return_address(0));
++ if (!is_master_tp(tp))
++ return true;
++
++ tp->mpcb->infinite_mapping_snd = 1;
++ tp->mpcb->infinite_mapping_rcv = 1;
++ tp->mptcp->fully_established = 1;
++
++ return false;
++}
++
++/* Find the first index whose bit in the bit-field == 0 */
++static inline u8 mptcp_set_new_pathindex(struct mptcp_cb *mpcb)
++{
++ u8 base = mpcb->next_path_index;
++ int i;
++
++ /* Start at 1, because 0 is reserved for the meta-sk */
++ mptcp_for_each_bit_unset(mpcb->path_index_bits >> base, i) {
++ if (i + base < 1)
++ continue;
++ if (i + base >= sizeof(mpcb->path_index_bits) * 8)
++ break;
++ i += base;
++ mpcb->path_index_bits |= (1 << i);
++ mpcb->next_path_index = i + 1;
++ return i;
++ }
++ mptcp_for_each_bit_unset(mpcb->path_index_bits, i) {
++ if (i >= sizeof(mpcb->path_index_bits) * 8)
++ break;
++ if (i < 1)
++ continue;
++ mpcb->path_index_bits |= (1 << i);
++ mpcb->next_path_index = i + 1;
++ return i;
++ }
++
++ return 0;
++}
++
++static inline bool mptcp_v6_is_v4_mapped(const struct sock *sk)
++{
++ return sk->sk_family == AF_INET6 &&
++ ipv6_addr_type(&inet6_sk(sk)->saddr) == IPV6_ADDR_MAPPED;
++}
++
++/* TCP and MPTCP mpc flag-depending functions */
++u16 mptcp_select_window(struct sock *sk);
++void mptcp_init_buffer_space(struct sock *sk);
++void mptcp_tcp_set_rto(struct sock *sk);
++
++/* TCP and MPTCP flag-depending functions */
++bool mptcp_prune_ofo_queue(struct sock *sk);
++
++#else /* CONFIG_MPTCP */
++#define mptcp_debug(fmt, args...) \
++ do { \
++ } while (0)
++
++/* Without MPTCP, we just do one iteration
++ * over the only socket available. This assumes that
++ * the sk/tp arg is the socket in that case.
++ */
++#define mptcp_for_each_sk(mpcb, sk)
++#define mptcp_for_each_sk_safe(__mpcb, __sk, __temp)
++
++static inline bool mptcp_is_data_fin(const struct sk_buff *skb)
++{
++ return false;
++}
++static inline bool mptcp_is_data_seq(const struct sk_buff *skb)
++{
++ return false;
++}
++static inline struct sock *mptcp_meta_sk(const struct sock *sk)
++{
++ return NULL;
++}
++static inline struct tcp_sock *mptcp_meta_tp(const struct tcp_sock *tp)
++{
++ return NULL;
++}
++static inline int is_meta_sk(const struct sock *sk)
++{
++ return 0;
++}
++static inline int is_master_tp(const struct tcp_sock *tp)
++{
++ return 0;
++}
++static inline void mptcp_purge_ofo_queue(struct tcp_sock *meta_tp) {}
++static inline void mptcp_del_sock(const struct sock *sk) {}
++static inline void mptcp_update_metasocket(struct sock *sock, const struct sock *meta_sk) {}
++static inline void mptcp_reinject_data(struct sock *orig_sk, int clone_it) {}
++static inline void mptcp_update_sndbuf(const struct tcp_sock *tp) {}
++static inline void mptcp_clean_rtx_infinite(const struct sk_buff *skb,
++ const struct sock *sk) {}
++static inline void mptcp_sub_close(struct sock *sk, unsigned long delay) {}
++static inline void mptcp_set_rto(const struct sock *sk) {}
++static inline void mptcp_send_fin(const struct sock *meta_sk) {}
++static inline void mptcp_parse_options(const uint8_t *ptr, const int opsize,
++ const struct mptcp_options_received *mopt,
++ const struct sk_buff *skb) {}
++static inline void mptcp_syn_options(const struct sock *sk,
++ struct tcp_out_options *opts,
++ unsigned *remaining) {}
++static inline void mptcp_synack_options(struct request_sock *req,
++ struct tcp_out_options *opts,
++ unsigned *remaining) {}
++
++static inline void mptcp_established_options(struct sock *sk,
++ struct sk_buff *skb,
++ struct tcp_out_options *opts,
++ unsigned *size) {}
++static inline void mptcp_options_write(__be32 *ptr, struct tcp_sock *tp,
++ const struct tcp_out_options *opts,
++ struct sk_buff *skb) {}
++static inline void mptcp_close(struct sock *meta_sk, long timeout) {}
++static inline int mptcp_doit(struct sock *sk)
++{
++ return 0;
++}
++static inline int mptcp_check_req_fastopen(struct sock *child,
++ struct request_sock *req)
++{
++ return 1;
++}
++static inline int mptcp_check_req_master(const struct sock *sk,
++ const struct sock *child,
++ struct request_sock *req,
++ struct request_sock **prev)
++{
++ return 1;
++}
++static inline struct sock *mptcp_check_req_child(struct sock *sk,
++ struct sock *child,
++ struct request_sock *req,
++ struct request_sock **prev,
++ const struct mptcp_options_received *mopt)
++{
++ return NULL;
++}
++static inline unsigned int mptcp_current_mss(struct sock *meta_sk)
++{
++ return 0;
++}
++static inline int mptcp_select_size(const struct sock *meta_sk, bool sg)
++{
++ return 0;
++}
++static inline void mptcp_sub_close_passive(struct sock *sk) {}
++static inline bool mptcp_fallback_infinite(const struct sock *sk, int flag)
++{
++ return false;
++}
++static inline void mptcp_init_mp_opt(const struct mptcp_options_received *mopt) {}
++static inline int mptcp_check_rtt(const struct tcp_sock *tp, int time)
++{
++ return 0;
++}
++static inline int mptcp_check_snd_buf(const struct tcp_sock *tp)
++{
++ return 0;
++}
++static inline int mptcp_sysctl_syn_retries(void)
++{
++ return 0;
++}
++static inline void mptcp_send_reset(const struct sock *sk) {}
++static inline int mptcp_handle_options(struct sock *sk,
++ const struct tcphdr *th,
++ struct sk_buff *skb)
++{
++ return 0;
++}
++static inline void mptcp_reset_mopt(struct tcp_sock *tp) {}
++static inline void __init mptcp_init(void) {}
++static inline int mptcp_trim_head(struct sock *sk, struct sk_buff *skb, u32 len)
++{
++ return 0;
++}
++static inline bool mptcp_sk_can_gso(const struct sock *sk)
++{
++ return false;
++}
++static inline bool mptcp_can_sg(const struct sock *meta_sk)
++{
++ return false;
++}
++static inline unsigned int mptcp_xmit_size_goal(const struct sock *meta_sk,
++ u32 mss_now, int large_allowed)
++{
++ return 0;
++}
++static inline void mptcp_destroy_sock(struct sock *sk) {}
++static inline int mptcp_rcv_synsent_state_process(struct sock *sk,
++ struct sock **skptr,
++ struct sk_buff *skb,
++ const struct mptcp_options_received *mopt)
++{
++ return 0;
++}
++static inline bool mptcp_can_sendpage(struct sock *sk)
++{
++ return false;
++}
++static inline int mptcp_init_tw_sock(struct sock *sk,
++ struct tcp_timewait_sock *tw)
++{
++ return 0;
++}
++static inline void mptcp_twsk_destructor(struct tcp_timewait_sock *tw) {}
++static inline void mptcp_disconnect(struct sock *sk) {}
++static inline void mptcp_tsq_flags(struct sock *sk) {}
++static inline void mptcp_tsq_sub_deferred(struct sock *meta_sk) {}
++static inline void mptcp_hash_remove_bh(struct tcp_sock *meta_tp) {}
++static inline void mptcp_hash_remove(struct tcp_sock *meta_tp) {}
++static inline void mptcp_reqsk_new_mptcp(struct request_sock *req,
++ const struct tcp_options_received *rx_opt,
++ const struct mptcp_options_received *mopt,
++ const struct sk_buff *skb) {}
++static inline void mptcp_remove_shortcuts(const struct mptcp_cb *mpcb,
++ const struct sk_buff *skb) {}
++static inline void mptcp_delete_synack_timer(struct sock *meta_sk) {}
++#endif /* CONFIG_MPTCP */
++
++#endif /* _MPTCP_H */
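[Editorial note: the helpers `mptcp_get_data_seq_64()` and `mptcp_check_rcvseq_wrap()` in the hunk above extend the 32-bit on-the-wire data sequence number to 64 bits by keeping per-epoch high-order words in the control block. The following is a hypothetical standalone userspace model of that scheme, not part of the patch; the function and variable names outside the patch's own (`rcv_nxt`, `rcv_high_order`) are made up for illustration.]

```c
#include <assert.h>
#include <stdint.h>

/* Model of mptcp_get_data_seq_64(): the full 64-bit data sequence is the
 * stored high-order word for the current wrap epoch, concatenated with
 * the 32-bit sequence carried in the DSS option. */
static uint64_t get_data_seq_64(uint32_t high_order, uint32_t data_seq_32)
{
	return ((uint64_t)high_order << 32) | data_seq_32;
}

/* Model of the wrap test in mptcp_check_rcvseq_wrap(): if rcv_nxt moved
 * numerically backwards, the 32-bit space wrapped, and the patch then
 * advances the high-order word and flips rcv_hiseq_index. */
static int rcv_seq_wrapped(uint32_t old_rcv_nxt, uint32_t new_rcv_nxt)
{
	return old_rcv_nxt > new_rcv_nxt;
}
```

With high-order word 1, a 32-bit sequence of 5 maps to 0x1_00000005; a receive next that jumps from 0xffffff00 down to 0x10 signals a wrap.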
+diff --git a/include/net/mptcp_v4.h b/include/net/mptcp_v4.h
+new file mode 100644
+index 000000000000..93ad97c77c5a
+--- /dev/null
++++ b/include/net/mptcp_v4.h
+@@ -0,0 +1,67 @@
++/*
++ * MPTCP implementation
++ *
++ * Initial Design & Implementation:
++ * Sébastien Barré <sebastien.barre@uclouvain.be>
++ *
++ * Current Maintainer & Author:
++ * Christoph Paasch <christoph.paasch@uclouvain.be>
++ *
++ * Additional authors:
++ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
++ * Gregory Detal <gregory.detal@uclouvain.be>
++ * Fabien Duchêne <fabien.duchene@uclouvain.be>
++ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
++ * Lavkesh Lahngir <lavkesh51@gmail.com>
++ * Andreas Ripke <ripke@neclab.eu>
++ * Vlad Dogaru <vlad.dogaru@intel.com>
++ * Octavian Purdila <octavian.purdila@intel.com>
++ * John Ronan <jronan@tssg.org>
++ * Catalin Nicutar <catalin.nicutar@gmail.com>
++ * Brandon Heller <brandonh@stanford.edu>
++ *
++ *
++ * This program is free software; you can redistribute it and/or
++ * modify it under the terms of the GNU General Public License
++ * as published by the Free Software Foundation; either version
++ * 2 of the License, or (at your option) any later version.
++ */
++
++#ifndef MPTCP_V4_H_
++#define MPTCP_V4_H_
++
++
++#include <linux/in.h>
++#include <linux/skbuff.h>
++#include <net/mptcp.h>
++#include <net/request_sock.h>
++#include <net/sock.h>
++
++extern struct request_sock_ops mptcp_request_sock_ops;
++extern const struct inet_connection_sock_af_ops mptcp_v4_specific;
++extern struct tcp_request_sock_ops mptcp_request_sock_ipv4_ops;
++extern struct tcp_request_sock_ops mptcp_join_request_sock_ipv4_ops;
++
++#ifdef CONFIG_MPTCP
++
++int mptcp_v4_do_rcv(struct sock *meta_sk, struct sk_buff *skb);
++struct sock *mptcp_v4_search_req(const __be16 rport, const __be32 raddr,
++ const __be32 laddr, const struct net *net);
++int mptcp_init4_subsockets(struct sock *meta_sk, const struct mptcp_loc4 *loc,
++ struct mptcp_rem4 *rem);
++int mptcp_pm_v4_init(void);
++void mptcp_pm_v4_undo(void);
++u32 mptcp_v4_get_nonce(__be32 saddr, __be32 daddr, __be16 sport, __be16 dport);
++u64 mptcp_v4_get_key(__be32 saddr, __be32 daddr, __be16 sport, __be16 dport);
++
++#else
++
++static inline int mptcp_v4_do_rcv(const struct sock *meta_sk,
++ const struct sk_buff *skb)
++{
++ return 0;
++}
++
++#endif /* CONFIG_MPTCP */
++
++#endif /* MPTCP_V4_H_ */
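[Editorial note: `mptcp_set_new_pathindex()` in the mptcp.h hunk above allocates subflow path indices out of a bit-field, starting the search at `next_path_index`, wrapping around, and reserving index 0 for the meta-socket. Below is a hypothetical userspace sketch of that allocation, not part of the patch; `struct pi_state` and `alloc_path_index` are invented names for illustration.]

```c
#include <assert.h>
#include <stdint.h>

/* Minimal model of the path-index bit-field from the patch:
 * path_index_bits marks indices in use, next_path_index is a
 * rotating search hint.  Index 0 stays reserved; 0 is returned
 * only when every index is taken. */
struct pi_state {
	uint32_t path_index_bits;
	uint8_t next_path_index;
};

static uint8_t alloc_path_index(struct pi_state *s)
{
	/* First pass: search upwards from the hint. */
	for (unsigned i = s->next_path_index; i < 32; i++) {
		if (i >= 1 && !(s->path_index_bits & (1u << i))) {
			s->path_index_bits |= 1u << i;
			s->next_path_index = (uint8_t)(i + 1);
			return (uint8_t)i;
		}
	}
	/* Second pass: wrap around, still skipping the reserved index 0. */
	for (unsigned i = 1; i < 32; i++) {
		if (!(s->path_index_bits & (1u << i))) {
			s->path_index_bits |= 1u << i;
			s->next_path_index = (uint8_t)(i + 1);
			return (uint8_t)i;
		}
	}
	return 0;
}
```

Starting from an empty bit-field the allocator hands out 1, 2, 3, …; once bits 1–31 are all set it returns 0 to signal exhaustion.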
+diff --git a/include/net/mptcp_v6.h b/include/net/mptcp_v6.h
+new file mode 100644
+index 000000000000..49a4f30ccd4d
+--- /dev/null
++++ b/include/net/mptcp_v6.h
+@@ -0,0 +1,69 @@
++/*
++ * MPTCP implementation
++ *
++ * Initial Design & Implementation:
++ * Sébastien Barré <sebastien.barre@uclouvain.be>
++ *
++ * Current Maintainer & Author:
++ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
++ *
++ * Additional authors:
++ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
++ * Gregory Detal <gregory.detal@uclouvain.be>
++ * Fabien Duchêne <fabien.duchene@uclouvain.be>
++ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
++ * Lavkesh Lahngir <lavkesh51@gmail.com>
++ * Andreas Ripke <ripke@neclab.eu>
++ * Vlad Dogaru <vlad.dogaru@intel.com>
++ * Octavian Purdila <octavian.purdila@intel.com>
++ * John Ronan <jronan@tssg.org>
++ * Catalin Nicutar <catalin.nicutar@gmail.com>
++ * Brandon Heller <brandonh@stanford.edu>
++ *
++ *
++ * This program is free software; you can redistribute it and/or
++ * modify it under the terms of the GNU General Public License
++ * as published by the Free Software Foundation; either version
++ * 2 of the License, or (at your option) any later version.
++ */
++
++#ifndef _MPTCP_V6_H
++#define _MPTCP_V6_H
++
++#include <linux/in6.h>
++#include <net/if_inet6.h>
++
++#include <net/mptcp.h>
++
++
++#ifdef CONFIG_MPTCP
++extern const struct inet_connection_sock_af_ops mptcp_v6_mapped;
++extern const struct inet_connection_sock_af_ops mptcp_v6_specific;
++extern struct request_sock_ops mptcp6_request_sock_ops;
++extern struct tcp_request_sock_ops mptcp_request_sock_ipv6_ops;
++extern struct tcp_request_sock_ops mptcp_join_request_sock_ipv6_ops;
++
++int mptcp_v6_do_rcv(struct sock *meta_sk, struct sk_buff *skb);
++struct sock *mptcp_v6_search_req(const __be16 rport, const struct in6_addr *raddr,
++ const struct in6_addr *laddr, const struct net *net);
++int mptcp_init6_subsockets(struct sock *meta_sk, const struct mptcp_loc6 *loc,
++ struct mptcp_rem6 *rem);
++int mptcp_pm_v6_init(void);
++void mptcp_pm_v6_undo(void);
++__u32 mptcp_v6_get_nonce(const __be32 *saddr, const __be32 *daddr,
++ __be16 sport, __be16 dport);
++u64 mptcp_v6_get_key(const __be32 *saddr, const __be32 *daddr,
++ __be16 sport, __be16 dport);
++
++#else /* CONFIG_MPTCP */
++
++#define mptcp_v6_mapped ipv6_mapped
++
++static inline int mptcp_v6_do_rcv(struct sock *meta_sk, struct sk_buff *skb)
++{
++ return 0;
++}
++
++#endif /* CONFIG_MPTCP */
++
++#endif /* _MPTCP_V6_H */
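[Editorial note: the inline `is_mptcp_enabled()` helper in the mptcp.h hunk above gates MPTCP on a tri-state sysctl: off, on for every socket, or on only for sockets that opted in per-application. A hypothetical standalone model of that decision follows; it is not part of the patch, and the value 2 for `MPTCP_APP` is an assumption made here for illustration — the patch defines the constant elsewhere.]

```c
#include <assert.h>
#include <stdbool.h>

#define MPTCP_APP 2 /* assumed value of the per-application mode */

/* Mirrors the logic of is_mptcp_enabled(): disabled globally or after a
 * failed module init -> false; per-app mode without the socket's opt-in
 * flag -> false; otherwise MPTCP is used for this socket. */
static bool mptcp_enabled_for(int sysctl_mptcp_enabled, bool init_failed,
			      bool sk_opted_in)
{
	if (!sysctl_mptcp_enabled || init_failed)
		return false;
	if (sysctl_mptcp_enabled == MPTCP_APP && !sk_opted_in)
		return false;
	return true;
}
```

So sysctl value 1 enables MPTCP unconditionally, while value 2 requires the socket's own `mptcp_enabled` flag as well.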
+diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
+index 361d26077196..bae95a11c531 100644
+--- a/include/net/net_namespace.h
++++ b/include/net/net_namespace.h
+@@ -16,6 +16,7 @@
+ #include <net/netns/packet.h>
+ #include <net/netns/ipv4.h>
+ #include <net/netns/ipv6.h>
++#include <net/netns/mptcp.h>
+ #include <net/netns/ieee802154_6lowpan.h>
+ #include <net/netns/sctp.h>
+ #include <net/netns/dccp.h>
+@@ -92,6 +93,9 @@ struct net {
+ #if IS_ENABLED(CONFIG_IPV6)
+ struct netns_ipv6 ipv6;
+ #endif
++#if IS_ENABLED(CONFIG_MPTCP)
++ struct netns_mptcp mptcp;
++#endif
+ #if IS_ENABLED(CONFIG_IEEE802154_6LOWPAN)
+ struct netns_ieee802154_lowpan ieee802154_lowpan;
+ #endif
+diff --git a/include/net/netns/mptcp.h b/include/net/netns/mptcp.h
+new file mode 100644
+index 000000000000..bad418b04cc8
+--- /dev/null
++++ b/include/net/netns/mptcp.h
+@@ -0,0 +1,44 @@
++/*
++ * MPTCP implementation - MPTCP namespace
++ *
++ * Initial Design & Implementation:
++ * Sébastien Barré <sebastien.barre@uclouvain.be>
++ *
++ * Current Maintainer:
++ * Christoph Paasch <christoph.paasch@uclouvain.be>
++ *
++ * Additional authors:
++ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
++ * Gregory Detal <gregory.detal@uclouvain.be>
++ * Fabien Duchêne <fabien.duchene@uclouvain.be>
++ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
++ * Lavkesh Lahngir <lavkesh51@gmail.com>
++ * Andreas Ripke <ripke@neclab.eu>
++ * Vlad Dogaru <vlad.dogaru@intel.com>
++ * Octavian Purdila <octavian.purdila@intel.com>
++ * John Ronan <jronan@tssg.org>
++ * Catalin Nicutar <catalin.nicutar@gmail.com>
++ * Brandon Heller <brandonh@stanford.edu>
++ *
++ *
++ * This program is free software; you can redistribute it and/or
++ * modify it under the terms of the GNU General Public License
++ * as published by the Free Software Foundation; either version
++ * 2 of the License, or (at your option) any later version.
++ */
++
++#ifndef __NETNS_MPTCP_H__
++#define __NETNS_MPTCP_H__
++
++#include <linux/compiler.h>
++
++enum {
++ MPTCP_PM_FULLMESH = 0,
++ MPTCP_PM_MAX
++};
++
++struct netns_mptcp {
++ void *path_managers[MPTCP_PM_MAX];
++};
++
++#endif /* __NETNS_MPTCP_H__ */
+diff --git a/include/net/request_sock.h b/include/net/request_sock.h
+index 7f830ff67f08..e79e87a8e1a6 100644
+--- a/include/net/request_sock.h
++++ b/include/net/request_sock.h
+@@ -164,7 +164,7 @@ struct request_sock_queue {
+ };
+
+ int reqsk_queue_alloc(struct request_sock_queue *queue,
+- unsigned int nr_table_entries);
++ unsigned int nr_table_entries, gfp_t flags);
+
+ void __reqsk_queue_destroy(struct request_sock_queue *queue);
+ void reqsk_queue_destroy(struct request_sock_queue *queue);
+diff --git a/include/net/sock.h b/include/net/sock.h
+index 156350745700..0e23cae8861f 100644
+--- a/include/net/sock.h
++++ b/include/net/sock.h
+@@ -901,6 +901,16 @@ void sk_clear_memalloc(struct sock *sk);
+
+ int sk_wait_data(struct sock *sk, long *timeo);
+
++/* START - needed for MPTCP */
++struct sock *sk_prot_alloc(struct proto *prot, gfp_t priority, int family);
++void sock_lock_init(struct sock *sk);
++
++extern struct lock_class_key af_callback_keys[AF_MAX];
++extern char *const af_family_clock_key_strings[AF_MAX+1];
++
++#define SK_FLAGS_TIMESTAMP ((1UL << SOCK_TIMESTAMP) | (1UL << SOCK_TIMESTAMPING_RX_SOFTWARE))
++/* END - needed for MPTCP */
++
+ struct request_sock_ops;
+ struct timewait_sock_ops;
+ struct inet_hashinfo;
+diff --git a/include/net/tcp.h b/include/net/tcp.h
+index 7286db80e8b8..ff92e74cd684 100644
+--- a/include/net/tcp.h
++++ b/include/net/tcp.h
+@@ -177,6 +177,7 @@ void tcp_time_wait(struct sock *sk, int state, int timeo);
+ #define TCPOPT_SACK 5 /* SACK Block */
+ #define TCPOPT_TIMESTAMP 8 /* Better RTT estimations/PAWS */
+ #define TCPOPT_MD5SIG 19 /* MD5 Signature (RFC2385) */
++#define TCPOPT_MPTCP 30
+ #define TCPOPT_EXP 254 /* Experimental */
+ /* Magic number to be after the option value for sharing TCP
+ * experimental options. See draft-ietf-tcpm-experimental-options-00.txt
+@@ -229,6 +230,27 @@ void tcp_time_wait(struct sock *sk, int state, int timeo);
+ #define TFO_SERVER_WO_SOCKOPT1 0x400
+ #define TFO_SERVER_WO_SOCKOPT2 0x800
+
++/* Flags from tcp_input.c for tcp_ack */
++#define FLAG_DATA 0x01 /* Incoming frame contained data. */
++#define FLAG_WIN_UPDATE 0x02 /* Incoming ACK was a window update. */
++#define FLAG_DATA_ACKED 0x04 /* This ACK acknowledged new data. */
++#define FLAG_RETRANS_DATA_ACKED 0x08 /* "" "" some of which was retransmitted. */
++#define FLAG_SYN_ACKED 0x10 /* This ACK acknowledged SYN. */
++#define FLAG_DATA_SACKED 0x20 /* New SACK. */
++#define FLAG_ECE 0x40 /* ECE in this ACK */
++#define FLAG_SLOWPATH 0x100 /* Do not skip RFC checks for window update.*/
++#define FLAG_ORIG_SACK_ACKED 0x200 /* Never retransmitted data are (s)acked */
++#define FLAG_SND_UNA_ADVANCED 0x400 /* Snd_una was changed (!= FLAG_DATA_ACKED) */
++#define FLAG_DSACKING_ACK 0x800 /* SACK blocks contained D-SACK info */
++#define FLAG_SACK_RENEGING 0x2000 /* snd_una advanced to a sacked seq */
++#define FLAG_UPDATE_TS_RECENT 0x4000 /* tcp_replace_ts_recent() */
++#define MPTCP_FLAG_DATA_ACKED 0x8000
++
++#define FLAG_ACKED (FLAG_DATA_ACKED|FLAG_SYN_ACKED)
++#define FLAG_NOT_DUP (FLAG_DATA|FLAG_WIN_UPDATE|FLAG_ACKED)
++#define FLAG_CA_ALERT (FLAG_DATA_SACKED|FLAG_ECE)
++#define FLAG_FORWARD_PROGRESS (FLAG_ACKED|FLAG_DATA_SACKED)
++
+ extern struct inet_timewait_death_row tcp_death_row;
+
+ /* sysctl variables for tcp */
+@@ -344,6 +366,107 @@ extern struct proto tcp_prot;
+ #define TCP_ADD_STATS_USER(net, field, val) SNMP_ADD_STATS_USER((net)->mib.tcp_statistics, field, val)
+ #define TCP_ADD_STATS(net, field, val) SNMP_ADD_STATS((net)->mib.tcp_statistics, field, val)
+
++/**** START - Exports needed for MPTCP ****/
++extern const struct tcp_request_sock_ops tcp_request_sock_ipv4_ops;
++extern const struct tcp_request_sock_ops tcp_request_sock_ipv6_ops;
++
++struct mptcp_options_received;
++
++void tcp_enter_quickack_mode(struct sock *sk);
++int tcp_close_state(struct sock *sk);
++void tcp_minshall_update(struct tcp_sock *tp, unsigned int mss_now,
++ const struct sk_buff *skb);
++int tcp_xmit_probe_skb(struct sock *sk, int urgent);
++void tcp_event_new_data_sent(struct sock *sk, const struct sk_buff *skb);
++int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
++ gfp_t gfp_mask);
++unsigned int tcp_mss_split_point(const struct sock *sk,
++ const struct sk_buff *skb,
++ unsigned int mss_now,
++ unsigned int max_segs,
++ int nonagle);
++bool tcp_nagle_test(const struct tcp_sock *tp, const struct sk_buff *skb,
++ unsigned int cur_mss, int nonagle);
++bool tcp_snd_wnd_test(const struct tcp_sock *tp, const struct sk_buff *skb,
++ unsigned int cur_mss);
++unsigned int tcp_cwnd_test(const struct tcp_sock *tp, const struct sk_buff *skb);
++int tcp_init_tso_segs(const struct sock *sk, struct sk_buff *skb,
++ unsigned int mss_now);
++void __pskb_trim_head(struct sk_buff *skb, int len);
++void tcp_queue_skb(struct sock *sk, struct sk_buff *skb);
++void tcp_init_nondata_skb(struct sk_buff *skb, u32 seq, u8 flags);
++void tcp_reset(struct sock *sk);
++bool tcp_may_update_window(const struct tcp_sock *tp, const u32 ack,
++ const u32 ack_seq, const u32 nwin);
++bool tcp_urg_mode(const struct tcp_sock *tp);
++void tcp_ack_probe(struct sock *sk);
++void tcp_rearm_rto(struct sock *sk);
++int tcp_write_timeout(struct sock *sk);
++bool retransmits_timed_out(struct sock *sk, unsigned int boundary,
++ unsigned int timeout, bool syn_set);
++void tcp_write_err(struct sock *sk);
++void tcp_adjust_pcount(struct sock *sk, const struct sk_buff *skb, int decr);
++void tcp_set_skb_tso_segs(const struct sock *sk, struct sk_buff *skb,
++ unsigned int mss_now);
++
++int tcp_v4_rtx_synack(struct sock *sk, struct request_sock *req);
++void tcp_v4_reqsk_send_ack(struct sock *sk, struct sk_buff *skb,
++ struct request_sock *req);
++__u32 tcp_v4_init_sequence(const struct sk_buff *skb);
++int tcp_v4_send_synack(struct sock *sk, struct dst_entry *dst,
++ struct flowi *fl,
++ struct request_sock *req,
++ u16 queue_mapping,
++ struct tcp_fastopen_cookie *foc);
++void tcp_v4_send_reset(struct sock *sk, struct sk_buff *skb);
++struct ip_options_rcu *tcp_v4_save_options(struct sk_buff *skb);
++struct sock *tcp_v4_hnd_req(struct sock *sk, struct sk_buff *skb);
++void tcp_v4_reqsk_destructor(struct request_sock *req);
++
++int tcp_v6_rtx_synack(struct sock *sk, struct request_sock *req);
++void tcp_v6_reqsk_send_ack(struct sock *sk, struct sk_buff *skb,
++ struct request_sock *req);
++__u32 tcp_v6_init_sequence(const struct sk_buff *skb);
++int tcp_v6_send_synack(struct sock *sk, struct dst_entry *dst,
++ struct flowi *fl, struct request_sock *req,
++ u16 queue_mapping, struct tcp_fastopen_cookie *foc);
++void tcp_v6_send_reset(struct sock *sk, struct sk_buff *skb);
++int tcp_v6_do_rcv(struct sock *sk, struct sk_buff *skb);
++int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len);
++void tcp_v6_destroy_sock(struct sock *sk);
++void inet6_sk_rx_dst_set(struct sock *sk, const struct sk_buff *skb);
++void tcp_v6_hash(struct sock *sk);
++struct sock *tcp_v6_hnd_req(struct sock *sk,struct sk_buff *skb);
++struct sock *tcp_v6_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
++ struct request_sock *req,
++ struct dst_entry *dst);
++void tcp_v6_reqsk_destructor(struct request_sock *req);
++
++unsigned int tcp_xmit_size_goal(struct sock *sk, u32 mss_now,
++ int large_allowed);
++u32 tcp_tso_acked(struct sock *sk, struct sk_buff *skb);
++
++void skb_clone_fraglist(struct sk_buff *skb);
++void copy_skb_header(struct sk_buff *new, const struct sk_buff *old);
++
++void inet_twsk_free(struct inet_timewait_sock *tw);
++int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb);
++/* These states need RST on ABORT according to RFC793 */
++static inline bool tcp_need_reset(int state)
++{
++ return (1 << state) &
++ (TCPF_ESTABLISHED | TCPF_CLOSE_WAIT | TCPF_FIN_WAIT1 |
++ TCPF_FIN_WAIT2 | TCPF_SYN_RECV);
++}
++
++bool tcp_dma_try_early_copy(struct sock *sk, struct sk_buff *skb,
++ int hlen);
++int __must_check tcp_queue_rcv(struct sock *sk, struct sk_buff *skb, int hdrlen,
++ bool *fragstolen);
++bool tcp_try_coalesce(struct sock *sk, struct sk_buff *to,
++ struct sk_buff *from, bool *fragstolen);
++/**** END - Exports needed for MPTCP ****/
++
+ void tcp_tasklet_init(void);
+
+ void tcp_v4_err(struct sk_buff *skb, u32);
+@@ -440,6 +563,7 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
+ size_t len, int nonblock, int flags, int *addr_len);
+ void tcp_parse_options(const struct sk_buff *skb,
+ struct tcp_options_received *opt_rx,
++ struct mptcp_options_received *mopt_rx,
+ int estab, struct tcp_fastopen_cookie *foc);
+ const u8 *tcp_parse_md5sig_option(const struct tcphdr *th);
+
+@@ -493,14 +617,8 @@ static inline u32 tcp_cookie_time(void)
+
+ u32 __cookie_v4_init_sequence(const struct iphdr *iph, const struct tcphdr *th,
+ u16 *mssp);
+-__u32 cookie_v4_init_sequence(struct sock *sk, struct sk_buff *skb, __u16 *mss);
+-#else
+-static inline __u32 cookie_v4_init_sequence(struct sock *sk,
+- struct sk_buff *skb,
+- __u16 *mss)
+-{
+- return 0;
+-}
++__u32 cookie_v4_init_sequence(struct sock *sk, const struct sk_buff *skb,
++ __u16 *mss);
+ #endif
+
+ __u32 cookie_init_timestamp(struct request_sock *req);
+@@ -516,13 +634,6 @@ u32 __cookie_v6_init_sequence(const struct ipv6hdr *iph,
+ const struct tcphdr *th, u16 *mssp);
+ __u32 cookie_v6_init_sequence(struct sock *sk, const struct sk_buff *skb,
+ __u16 *mss);
+-#else
+-static inline __u32 cookie_v6_init_sequence(struct sock *sk,
+- struct sk_buff *skb,
+- __u16 *mss)
+-{
+- return 0;
+-}
+ #endif
+ /* tcp_output.c */
+
+@@ -551,10 +662,17 @@ void tcp_send_delayed_ack(struct sock *sk);
+ void tcp_send_loss_probe(struct sock *sk);
+ bool tcp_schedule_loss_probe(struct sock *sk);
+
++u16 tcp_select_window(struct sock *sk);
++bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
++ int push_one, gfp_t gfp);
++
+ /* tcp_input.c */
+ void tcp_resume_early_retransmit(struct sock *sk);
+ void tcp_rearm_rto(struct sock *sk);
+ void tcp_reset(struct sock *sk);
++void tcp_set_rto(struct sock *sk);
++bool tcp_should_expand_sndbuf(const struct sock *sk);
++bool tcp_prune_ofo_queue(struct sock *sk);
+
+ /* tcp_timer.c */
+ void tcp_init_xmit_timers(struct sock *);
+@@ -703,14 +821,27 @@ void tcp_send_window_probe(struct sock *sk);
+ */
+ struct tcp_skb_cb {
+ union {
+- struct inet_skb_parm h4;
++ union {
++ struct inet_skb_parm h4;
+ #if IS_ENABLED(CONFIG_IPV6)
+- struct inet6_skb_parm h6;
++ struct inet6_skb_parm h6;
+ #endif
+- } header; /* For incoming frames */
++ } header; /* For incoming frames */
++#ifdef CONFIG_MPTCP
++ union { /* For MPTCP outgoing frames */
++ __u32 path_mask; /* paths that tried to send this skb */
++ __u32 dss[6]; /* DSS options */
++ };
++#endif
++ };
+ __u32 seq; /* Starting sequence number */
+ __u32 end_seq; /* SEQ + FIN + SYN + datalen */
+ __u32 when; /* used to compute rtt's */
++#ifdef CONFIG_MPTCP
++ __u8 mptcp_flags; /* flags for the MPTCP layer */
++ __u8 dss_off; /* Number of 4-byte words until
++ * seq-number */
++#endif
+ __u8 tcp_flags; /* TCP header flags. (tcp[13]) */
+
+ __u8 sacked; /* State flags for SACK/FACK. */
+@@ -1075,7 +1206,8 @@ u32 tcp_default_init_rwnd(u32 mss);
+ /* Determine a window scaling and initial window to offer. */
+ void tcp_select_initial_window(int __space, __u32 mss, __u32 *rcv_wnd,
+ __u32 *window_clamp, int wscale_ok,
+- __u8 *rcv_wscale, __u32 init_rcv_wnd);
++ __u8 *rcv_wscale, __u32 init_rcv_wnd,
++ const struct sock *sk);
+
+ static inline int tcp_win_from_space(int space)
+ {
+@@ -1084,15 +1216,34 @@ static inline int tcp_win_from_space(int space)
+ space - (space>>sysctl_tcp_adv_win_scale);
+ }
+
++#ifdef CONFIG_MPTCP
++extern struct static_key mptcp_static_key;
++static inline bool mptcp(const struct tcp_sock *tp)
++{
++ return static_key_false(&mptcp_static_key) && tp->mpc;
++}
++#else
++static inline bool mptcp(const struct tcp_sock *tp)
++{
++ return 0;
++}
++#endif
++
+ /* Note: caller must be prepared to deal with negative returns */
+ static inline int tcp_space(const struct sock *sk)
+ {
++ if (mptcp(tcp_sk(sk)))
++ sk = tcp_sk(sk)->meta_sk;
++
+ return tcp_win_from_space(sk->sk_rcvbuf -
+ atomic_read(&sk->sk_rmem_alloc));
+ }
+
+ static inline int tcp_full_space(const struct sock *sk)
+ {
++ if (mptcp(tcp_sk(sk)))
++ sk = tcp_sk(sk)->meta_sk;
++
+ return tcp_win_from_space(sk->sk_rcvbuf);
+ }
+
+@@ -1115,6 +1266,8 @@ static inline void tcp_openreq_init(struct request_sock *req,
+ ireq->wscale_ok = rx_opt->wscale_ok;
+ ireq->acked = 0;
+ ireq->ecn_ok = 0;
++ ireq->mptcp_rqsk = 0;
++ ireq->saw_mpc = 0;
+ ireq->ir_rmt_port = tcp_hdr(skb)->source;
+ ireq->ir_num = ntohs(tcp_hdr(skb)->dest);
+ }
+@@ -1585,6 +1738,11 @@ int tcp4_proc_init(void);
+ void tcp4_proc_exit(void);
+ #endif
+
++int tcp_rtx_synack(struct sock *sk, struct request_sock *req);
++int tcp_conn_request(struct request_sock_ops *rsk_ops,
++ const struct tcp_request_sock_ops *af_ops,
++ struct sock *sk, struct sk_buff *skb);
++
+ /* TCP af-specific functions */
+ struct tcp_sock_af_ops {
+ #ifdef CONFIG_TCP_MD5SIG
+@@ -1601,7 +1759,32 @@ struct tcp_sock_af_ops {
+ #endif
+ };
+
++/* TCP/MPTCP-specific functions */
++struct tcp_sock_ops {
++ u32 (*__select_window)(struct sock *sk);
++ u16 (*select_window)(struct sock *sk);
++ void (*select_initial_window)(int __space, __u32 mss, __u32 *rcv_wnd,
++ __u32 *window_clamp, int wscale_ok,
++ __u8 *rcv_wscale, __u32 init_rcv_wnd,
++ const struct sock *sk);
++ void (*init_buffer_space)(struct sock *sk);
++ void (*set_rto)(struct sock *sk);
++ bool (*should_expand_sndbuf)(const struct sock *sk);
++ void (*send_fin)(struct sock *sk);
++ bool (*write_xmit)(struct sock *sk, unsigned int mss_now, int nonagle,
++ int push_one, gfp_t gfp);
++ void (*send_active_reset)(struct sock *sk, gfp_t priority);
++ int (*write_wakeup)(struct sock *sk);
++ bool (*prune_ofo_queue)(struct sock *sk);
++ void (*retransmit_timer)(struct sock *sk);
++ void (*time_wait)(struct sock *sk, int state, int timeo);
++ void (*cleanup_rbuf)(struct sock *sk, int copied);
++ void (*init_congestion_control)(struct sock *sk);
++};
++extern const struct tcp_sock_ops tcp_specific;
++
+ struct tcp_request_sock_ops {
++ u16 mss_clamp;
+ #ifdef CONFIG_TCP_MD5SIG
+ struct tcp_md5sig_key *(*md5_lookup) (struct sock *sk,
+ struct request_sock *req);
+@@ -1611,8 +1794,39 @@ struct tcp_request_sock_ops {
+ const struct request_sock *req,
+ const struct sk_buff *skb);
+ #endif
++ int (*init_req)(struct request_sock *req, struct sock *sk,
++ struct sk_buff *skb);
++#ifdef CONFIG_SYN_COOKIES
++ __u32 (*cookie_init_seq)(struct sock *sk, const struct sk_buff *skb,
++ __u16 *mss);
++#endif
++ struct dst_entry *(*route_req)(struct sock *sk, struct flowi *fl,
++ const struct request_sock *req,
++ bool *strict);
++ __u32 (*init_seq)(const struct sk_buff *skb);
++ int (*send_synack)(struct sock *sk, struct dst_entry *dst,
++ struct flowi *fl, struct request_sock *req,
++ u16 queue_mapping, struct tcp_fastopen_cookie *foc);
++ void (*queue_hash_add)(struct sock *sk, struct request_sock *req,
++ const unsigned long timeout);
+ };
+
++#ifdef CONFIG_SYN_COOKIES
++static inline __u32 cookie_init_sequence(const struct tcp_request_sock_ops *ops,
++ struct sock *sk, struct sk_buff *skb,
++ __u16 *mss)
++{
++ return ops->cookie_init_seq(sk, skb, mss);
++}
++#else
++static inline __u32 cookie_init_sequence(const struct tcp_request_sock_ops *ops,
++ struct sock *sk, struct sk_buff *skb,
++ __u16 *mss)
++{
++ return 0;
++}
++#endif
++
+ int tcpv4_offload_init(void);
+
+ void tcp_v4_init(void);
+diff --git a/include/uapi/linux/if.h b/include/uapi/linux/if.h
+index 9cf2394f0bcf..c2634b6ed854 100644
+--- a/include/uapi/linux/if.h
++++ b/include/uapi/linux/if.h
+@@ -109,6 +109,9 @@ enum net_device_flags {
+ #define IFF_DORMANT IFF_DORMANT
+ #define IFF_ECHO IFF_ECHO
+
++#define IFF_NOMULTIPATH 0x80000 /* Disable for MPTCP */
++#define IFF_MPBACKUP 0x100000 /* Use as backup path for MPTCP */
++
+ #define IFF_VOLATILE (IFF_LOOPBACK|IFF_POINTOPOINT|IFF_BROADCAST|IFF_ECHO|\
+ IFF_MASTER|IFF_SLAVE|IFF_RUNNING|IFF_LOWER_UP|IFF_DORMANT)
+
+diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
+index 3b9718328d8b..487475681d84 100644
+--- a/include/uapi/linux/tcp.h
++++ b/include/uapi/linux/tcp.h
+@@ -112,6 +112,7 @@ enum {
+ #define TCP_FASTOPEN 23 /* Enable FastOpen on listeners */
+ #define TCP_TIMESTAMP 24
+ #define TCP_NOTSENT_LOWAT 25 /* limit number of unsent bytes in write queue */
++#define MPTCP_ENABLED 26
+
+ struct tcp_repair_opt {
+ __u32 opt_code;
+diff --git a/net/Kconfig b/net/Kconfig
+index d92afe4204d9..96b58593ad5e 100644
+--- a/net/Kconfig
++++ b/net/Kconfig
+@@ -79,6 +79,7 @@ if INET
+ source "net/ipv4/Kconfig"
+ source "net/ipv6/Kconfig"
+ source "net/netlabel/Kconfig"
++source "net/mptcp/Kconfig"
+
+ endif # if INET
+
+diff --git a/net/Makefile b/net/Makefile
+index cbbbe6d657ca..244bac1435b1 100644
+--- a/net/Makefile
++++ b/net/Makefile
+@@ -20,6 +20,7 @@ obj-$(CONFIG_INET) += ipv4/
+ obj-$(CONFIG_XFRM) += xfrm/
+ obj-$(CONFIG_UNIX) += unix/
+ obj-$(CONFIG_NET) += ipv6/
++obj-$(CONFIG_MPTCP) += mptcp/
+ obj-$(CONFIG_PACKET) += packet/
+ obj-$(CONFIG_NET_KEY) += key/
+ obj-$(CONFIG_BRIDGE) += bridge/
+diff --git a/net/core/dev.c b/net/core/dev.c
+index 367a586d0c8a..215d2757fbf6 100644
+--- a/net/core/dev.c
++++ b/net/core/dev.c
+@@ -5420,7 +5420,7 @@ int __dev_change_flags(struct net_device *dev, unsigned int flags)
+
+ dev->flags = (flags & (IFF_DEBUG | IFF_NOTRAILERS | IFF_NOARP |
+ IFF_DYNAMIC | IFF_MULTICAST | IFF_PORTSEL |
+- IFF_AUTOMEDIA)) |
++ IFF_AUTOMEDIA | IFF_NOMULTIPATH | IFF_MPBACKUP)) |
+ (dev->flags & (IFF_UP | IFF_VOLATILE | IFF_PROMISC |
+ IFF_ALLMULTI));
+
+diff --git a/net/core/request_sock.c b/net/core/request_sock.c
+index 467f326126e0..909dfa13f499 100644
+--- a/net/core/request_sock.c
++++ b/net/core/request_sock.c
+@@ -38,7 +38,8 @@ int sysctl_max_syn_backlog = 256;
+ EXPORT_SYMBOL(sysctl_max_syn_backlog);
+
+ int reqsk_queue_alloc(struct request_sock_queue *queue,
+- unsigned int nr_table_entries)
++ unsigned int nr_table_entries,
++ gfp_t flags)
+ {
+ size_t lopt_size = sizeof(struct listen_sock);
+ struct listen_sock *lopt;
+@@ -48,9 +49,11 @@ int reqsk_queue_alloc(struct request_sock_queue *queue,
+ nr_table_entries = roundup_pow_of_two(nr_table_entries + 1);
+ lopt_size += nr_table_entries * sizeof(struct request_sock *);
+ if (lopt_size > PAGE_SIZE)
+- lopt = vzalloc(lopt_size);
++ lopt = __vmalloc(lopt_size,
++ flags | __GFP_HIGHMEM | __GFP_ZERO,
++ PAGE_KERNEL);
+ else
+- lopt = kzalloc(lopt_size, GFP_KERNEL);
++ lopt = kzalloc(lopt_size, flags);
+ if (lopt == NULL)
+ return -ENOMEM;
+
+diff --git a/net/core/skbuff.c b/net/core/skbuff.c
+index c1a33033cbe2..8abc5d60fbe3 100644
+--- a/net/core/skbuff.c
++++ b/net/core/skbuff.c
+@@ -472,7 +472,7 @@ static inline void skb_drop_fraglist(struct sk_buff *skb)
+ skb_drop_list(&skb_shinfo(skb)->frag_list);
+ }
+
+-static void skb_clone_fraglist(struct sk_buff *skb)
++void skb_clone_fraglist(struct sk_buff *skb)
+ {
+ struct sk_buff *list;
+
+@@ -897,7 +897,7 @@ static void skb_headers_offset_update(struct sk_buff *skb, int off)
+ skb->inner_mac_header += off;
+ }
+
+-static void copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
++void copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
+ {
+ __copy_skb_header(new, old);
+
+diff --git a/net/core/sock.c b/net/core/sock.c
+index 026e01f70274..359295523177 100644
+--- a/net/core/sock.c
++++ b/net/core/sock.c
+@@ -136,6 +136,11 @@
+
+ #include <trace/events/sock.h>
+
++#ifdef CONFIG_MPTCP
++#include <net/mptcp.h>
++#include <net/inet_common.h>
++#endif
++
+ #ifdef CONFIG_INET
+ #include <net/tcp.h>
+ #endif
+@@ -280,7 +285,7 @@ static const char *const af_family_slock_key_strings[AF_MAX+1] = {
+ "slock-AF_IEEE802154", "slock-AF_CAIF" , "slock-AF_ALG" ,
+ "slock-AF_NFC" , "slock-AF_VSOCK" ,"slock-AF_MAX"
+ };
+-static const char *const af_family_clock_key_strings[AF_MAX+1] = {
++char *const af_family_clock_key_strings[AF_MAX+1] = {
+ "clock-AF_UNSPEC", "clock-AF_UNIX" , "clock-AF_INET" ,
+ "clock-AF_AX25" , "clock-AF_IPX" , "clock-AF_APPLETALK",
+ "clock-AF_NETROM", "clock-AF_BRIDGE" , "clock-AF_ATMPVC" ,
+@@ -301,7 +306,7 @@ static const char *const af_family_clock_key_strings[AF_MAX+1] = {
+ * sk_callback_lock locking rules are per-address-family,
+ * so split the lock classes by using a per-AF key:
+ */
+-static struct lock_class_key af_callback_keys[AF_MAX];
++struct lock_class_key af_callback_keys[AF_MAX];
+
+ /* Take into consideration the size of the struct sk_buff overhead in the
+ * determination of these values, since that is non-constant across
+@@ -422,8 +427,6 @@ static void sock_warn_obsolete_bsdism(const char *name)
+ }
+ }
+
+-#define SK_FLAGS_TIMESTAMP ((1UL << SOCK_TIMESTAMP) | (1UL << SOCK_TIMESTAMPING_RX_SOFTWARE))
+-
+ static void sock_disable_timestamp(struct sock *sk, unsigned long flags)
+ {
+ if (sk->sk_flags & flags) {
+@@ -1253,8 +1256,25 @@ lenout:
+ *
+ * (We also register the sk_lock with the lock validator.)
+ */
+-static inline void sock_lock_init(struct sock *sk)
+-{
++void sock_lock_init(struct sock *sk)
++{
++#ifdef CONFIG_MPTCP
++ /* Reclassify the lock-class for subflows */
++ if (sk->sk_type == SOCK_STREAM && sk->sk_protocol == IPPROTO_TCP)
++ if (mptcp(tcp_sk(sk)) || tcp_sk(sk)->is_master_sk) {
++ sock_lock_init_class_and_name(sk, "slock-AF_INET-MPTCP",
++ &meta_slock_key,
++ "sk_lock-AF_INET-MPTCP",
++ &meta_key);
++
++ /* We don't yet have the mptcp-point.
++ * Thus we still need inet_sock_destruct
++ */
++ sk->sk_destruct = inet_sock_destruct;
++ return;
++ }
++#endif
++
+ sock_lock_init_class_and_name(sk,
+ af_family_slock_key_strings[sk->sk_family],
+ af_family_slock_keys + sk->sk_family,
+@@ -1301,7 +1321,7 @@ void sk_prot_clear_portaddr_nulls(struct sock *sk, int size)
+ }
+ EXPORT_SYMBOL(sk_prot_clear_portaddr_nulls);
+
+-static struct sock *sk_prot_alloc(struct proto *prot, gfp_t priority,
++struct sock *sk_prot_alloc(struct proto *prot, gfp_t priority,
+ int family)
+ {
+ struct sock *sk;
+diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
+index 4db3c2a1679c..04cb17d4b0ce 100644
+--- a/net/dccp/ipv6.c
++++ b/net/dccp/ipv6.c
+@@ -386,7 +386,7 @@ static int dccp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
+ if (sk_acceptq_is_full(sk) && inet_csk_reqsk_queue_young(sk) > 1)
+ goto drop;
+
+- req = inet6_reqsk_alloc(&dccp6_request_sock_ops);
++ req = inet_reqsk_alloc(&dccp6_request_sock_ops);
+ if (req == NULL)
+ goto drop;
+
+diff --git a/net/ipv4/Kconfig b/net/ipv4/Kconfig
+index 05c57f0fcabe..630434db0085 100644
+--- a/net/ipv4/Kconfig
++++ b/net/ipv4/Kconfig
+@@ -556,6 +556,30 @@ config TCP_CONG_ILLINOIS
+ For further details see:
+ http://www.ews.uiuc.edu/~shaoliu/tcpillinois/index.html
+
++config TCP_CONG_COUPLED
++ tristate "MPTCP COUPLED CONGESTION CONTROL"
++ depends on MPTCP
++ default n
++ ---help---
++ MultiPath TCP Coupled Congestion Control
++ To enable it, just put 'coupled' in tcp_congestion_control
++
++config TCP_CONG_OLIA
++ tristate "MPTCP Opportunistic Linked Increase"
++ depends on MPTCP
++ default n
++ ---help---
++ MultiPath TCP Opportunistic Linked Increase Congestion Control
++ To enable it, just put 'olia' in tcp_congestion_control
++
++config TCP_CONG_WVEGAS
++ tristate "MPTCP WVEGAS CONGESTION CONTROL"
++ depends on MPTCP
++ default n
++ ---help---
++ wVegas congestion control for MPTCP
++ To enable it, just put 'wvegas' in tcp_congestion_control
++
+ choice
+ prompt "Default TCP congestion control"
+ default DEFAULT_CUBIC
+@@ -584,6 +608,15 @@ choice
+ config DEFAULT_WESTWOOD
+ bool "Westwood" if TCP_CONG_WESTWOOD=y
+
++ config DEFAULT_COUPLED
++ bool "Coupled" if TCP_CONG_COUPLED=y
++
++ config DEFAULT_OLIA
++ bool "Olia" if TCP_CONG_OLIA=y
++
++ config DEFAULT_WVEGAS
++ bool "Wvegas" if TCP_CONG_WVEGAS=y
++
+ config DEFAULT_RENO
+ bool "Reno"
+
+@@ -605,6 +638,8 @@ config DEFAULT_TCP_CONG
+ default "vegas" if DEFAULT_VEGAS
+ default "westwood" if DEFAULT_WESTWOOD
+ default "veno" if DEFAULT_VENO
++ default "coupled" if DEFAULT_COUPLED
++ default "wvegas" if DEFAULT_WVEGAS
+ default "reno" if DEFAULT_RENO
+ default "cubic"
+
+diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
+index d156b3c5f363..4afd6d8d9028 100644
+--- a/net/ipv4/af_inet.c
++++ b/net/ipv4/af_inet.c
+@@ -104,6 +104,7 @@
+ #include <net/ip_fib.h>
+ #include <net/inet_connection_sock.h>
+ #include <net/tcp.h>
++#include <net/mptcp.h>
+ #include <net/udp.h>
+ #include <net/udplite.h>
+ #include <net/ping.h>
+@@ -246,8 +247,7 @@ EXPORT_SYMBOL(inet_listen);
+ * Create an inet socket.
+ */
+
+-static int inet_create(struct net *net, struct socket *sock, int protocol,
+- int kern)
++int inet_create(struct net *net, struct socket *sock, int protocol, int kern)
+ {
+ struct sock *sk;
+ struct inet_protosw *answer;
+@@ -676,6 +676,23 @@ int inet_accept(struct socket *sock, struct socket *newsock, int flags)
+ lock_sock(sk2);
+
+ sock_rps_record_flow(sk2);
++
++ if (sk2->sk_protocol == IPPROTO_TCP && mptcp(tcp_sk(sk2))) {
++ struct sock *sk_it = sk2;
++
++ mptcp_for_each_sk(tcp_sk(sk2)->mpcb, sk_it)
++ sock_rps_record_flow(sk_it);
++
++ if (tcp_sk(sk2)->mpcb->master_sk) {
++ sk_it = tcp_sk(sk2)->mpcb->master_sk;
++
++ write_lock_bh(&sk_it->sk_callback_lock);
++ sk_it->sk_wq = newsock->wq;
++ sk_it->sk_socket = newsock;
++ write_unlock_bh(&sk_it->sk_callback_lock);
++ }
++ }
++
+ WARN_ON(!((1 << sk2->sk_state) &
+ (TCPF_ESTABLISHED | TCPF_SYN_RECV |
+ TCPF_CLOSE_WAIT | TCPF_CLOSE)));
+@@ -1763,6 +1780,9 @@ static int __init inet_init(void)
+
+ ip_init();
+
++ /* We must initialize MPTCP before TCP. */
++ mptcp_init();
++
+ tcp_v4_init();
+
+ /* Setup TCP slab cache for open requests. */
+diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
+index 14d02ea905b6..7d734d8af19b 100644
+--- a/net/ipv4/inet_connection_sock.c
++++ b/net/ipv4/inet_connection_sock.c
+@@ -23,6 +23,7 @@
+ #include <net/route.h>
+ #include <net/tcp_states.h>
+ #include <net/xfrm.h>
++#include <net/mptcp.h>
+
+ #ifdef INET_CSK_DEBUG
+ const char inet_csk_timer_bug_msg[] = "inet_csk BUG: unknown timer value\n";
+@@ -465,8 +466,8 @@ no_route:
+ }
+ EXPORT_SYMBOL_GPL(inet_csk_route_child_sock);
+
+-static inline u32 inet_synq_hash(const __be32 raddr, const __be16 rport,
+- const u32 rnd, const u32 synq_hsize)
++u32 inet_synq_hash(const __be32 raddr, const __be16 rport, const u32 rnd,
++ const u32 synq_hsize)
+ {
+ return jhash_2words((__force u32)raddr, (__force u32)rport, rnd) & (synq_hsize - 1);
+ }
+@@ -647,7 +648,7 @@ void inet_csk_reqsk_queue_prune(struct sock *parent,
+
+ lopt->clock_hand = i;
+
+- if (lopt->qlen)
++ if (lopt->qlen && !is_meta_sk(parent))
+ inet_csk_reset_keepalive_timer(parent, interval);
+ }
+ EXPORT_SYMBOL_GPL(inet_csk_reqsk_queue_prune);
+@@ -664,7 +665,9 @@ struct sock *inet_csk_clone_lock(const struct sock *sk,
+ const struct request_sock *req,
+ const gfp_t priority)
+ {
+- struct sock *newsk = sk_clone_lock(sk, priority);
++ struct sock *newsk;
++
++ newsk = sk_clone_lock(sk, priority);
+
+ if (newsk != NULL) {
+ struct inet_connection_sock *newicsk = inet_csk(newsk);
+@@ -743,7 +746,8 @@ int inet_csk_listen_start(struct sock *sk, const int nr_table_entries)
+ {
+ struct inet_sock *inet = inet_sk(sk);
+ struct inet_connection_sock *icsk = inet_csk(sk);
+- int rc = reqsk_queue_alloc(&icsk->icsk_accept_queue, nr_table_entries);
++ int rc = reqsk_queue_alloc(&icsk->icsk_accept_queue, nr_table_entries,
++ GFP_KERNEL);
+
+ if (rc != 0)
+ return rc;
+@@ -801,9 +805,14 @@ void inet_csk_listen_stop(struct sock *sk)
+
+ while ((req = acc_req) != NULL) {
+ struct sock *child = req->sk;
++ bool mutex_taken = false;
+
+ acc_req = req->dl_next;
+
++ if (is_meta_sk(child)) {
++ mutex_lock(&tcp_sk(child)->mpcb->mpcb_mutex);
++ mutex_taken = true;
++ }
+ local_bh_disable();
+ bh_lock_sock(child);
+ WARN_ON(sock_owned_by_user(child));
+@@ -832,6 +841,8 @@ void inet_csk_listen_stop(struct sock *sk)
+
+ bh_unlock_sock(child);
+ local_bh_enable();
++ if (mutex_taken)
++ mutex_unlock(&tcp_sk(child)->mpcb->mpcb_mutex);
+ sock_put(child);
+
+ sk_acceptq_removed(sk);
+diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
+index c86624b36a62..0ff3fe004d62 100644
+--- a/net/ipv4/syncookies.c
++++ b/net/ipv4/syncookies.c
+@@ -170,7 +170,8 @@ u32 __cookie_v4_init_sequence(const struct iphdr *iph, const struct tcphdr *th,
+ }
+ EXPORT_SYMBOL_GPL(__cookie_v4_init_sequence);
+
+-__u32 cookie_v4_init_sequence(struct sock *sk, struct sk_buff *skb, __u16 *mssp)
++__u32 cookie_v4_init_sequence(struct sock *sk, const struct sk_buff *skb,
++ __u16 *mssp)
+ {
+ const struct iphdr *iph = ip_hdr(skb);
+ const struct tcphdr *th = tcp_hdr(skb);
+@@ -284,7 +285,7 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb,
+
+ /* check for timestamp cookie support */
+ memset(&tcp_opt, 0, sizeof(tcp_opt));
+- tcp_parse_options(skb, &tcp_opt, 0, NULL);
++ tcp_parse_options(skb, &tcp_opt, NULL, 0, NULL);
+
+ if (!cookie_check_timestamp(&tcp_opt, sock_net(sk), &ecn_ok))
+ goto out;
+@@ -355,10 +356,10 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb,
+ /* Try to redo what tcp_v4_send_synack did. */
+ req->window_clamp = tp->window_clamp ? :dst_metric(&rt->dst, RTAX_WINDOW);
+
+- tcp_select_initial_window(tcp_full_space(sk), req->mss,
+- &req->rcv_wnd, &req->window_clamp,
+- ireq->wscale_ok, &rcv_wscale,
+- dst_metric(&rt->dst, RTAX_INITRWND));
++ tp->ops->select_initial_window(tcp_full_space(sk), req->mss,
++ &req->rcv_wnd, &req->window_clamp,
++ ireq->wscale_ok, &rcv_wscale,
++ dst_metric(&rt->dst, RTAX_INITRWND), sk);
+
+ ireq->rcv_wscale = rcv_wscale;
+
+diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
+index 9d2118e5fbc7..2cb89f886d45 100644
+--- a/net/ipv4/tcp.c
++++ b/net/ipv4/tcp.c
+@@ -271,6 +271,7 @@
+
+ #include <net/icmp.h>
+ #include <net/inet_common.h>
++#include <net/mptcp.h>
+ #include <net/tcp.h>
+ #include <net/xfrm.h>
+ #include <net/ip.h>
+@@ -371,6 +372,24 @@ static int retrans_to_secs(u8 retrans, int timeout, int rto_max)
+ return period;
+ }
+
++const struct tcp_sock_ops tcp_specific = {
++ .__select_window = __tcp_select_window,
++ .select_window = tcp_select_window,
++ .select_initial_window = tcp_select_initial_window,
++ .init_buffer_space = tcp_init_buffer_space,
++ .set_rto = tcp_set_rto,
++ .should_expand_sndbuf = tcp_should_expand_sndbuf,
++ .init_congestion_control = tcp_init_congestion_control,
++ .send_fin = tcp_send_fin,
++ .write_xmit = tcp_write_xmit,
++ .send_active_reset = tcp_send_active_reset,
++ .write_wakeup = tcp_write_wakeup,
++ .prune_ofo_queue = tcp_prune_ofo_queue,
++ .retransmit_timer = tcp_retransmit_timer,
++ .time_wait = tcp_time_wait,
++ .cleanup_rbuf = tcp_cleanup_rbuf,
++};
++
+ /* Address-family independent initialization for a tcp_sock.
+ *
+ * NOTE: A lot of things set to zero explicitly by call to
+@@ -419,6 +438,8 @@ void tcp_init_sock(struct sock *sk)
+ sk->sk_sndbuf = sysctl_tcp_wmem[1];
+ sk->sk_rcvbuf = sysctl_tcp_rmem[1];
+
++ tp->ops = &tcp_specific;
++
+ local_bh_disable();
+ sock_update_memcg(sk);
+ sk_sockets_allocated_inc(sk);
+@@ -726,6 +747,14 @@ ssize_t tcp_splice_read(struct socket *sock, loff_t *ppos,
+ int ret;
+
+ sock_rps_record_flow(sk);
++
++#ifdef CONFIG_MPTCP
++ if (mptcp(tcp_sk(sk))) {
++ struct sock *sk_it;
++ mptcp_for_each_sk(tcp_sk(sk)->mpcb, sk_it)
++ sock_rps_record_flow(sk_it);
++ }
++#endif
+ /*
+ * We can't seek on a socket input
+ */
+@@ -821,8 +850,7 @@ struct sk_buff *sk_stream_alloc_skb(struct sock *sk, int size, gfp_t gfp)
+ return NULL;
+ }
+
+-static unsigned int tcp_xmit_size_goal(struct sock *sk, u32 mss_now,
+- int large_allowed)
++unsigned int tcp_xmit_size_goal(struct sock *sk, u32 mss_now, int large_allowed)
+ {
+ struct tcp_sock *tp = tcp_sk(sk);
+ u32 xmit_size_goal, old_size_goal;
+@@ -872,8 +900,13 @@ static int tcp_send_mss(struct sock *sk, int *size_goal, int flags)
+ {
+ int mss_now;
+
+- mss_now = tcp_current_mss(sk);
+- *size_goal = tcp_xmit_size_goal(sk, mss_now, !(flags & MSG_OOB));
++ if (mptcp(tcp_sk(sk))) {
++ mss_now = mptcp_current_mss(sk);
++ *size_goal = mptcp_xmit_size_goal(sk, mss_now, !(flags & MSG_OOB));
++ } else {
++ mss_now = tcp_current_mss(sk);
++ *size_goal = tcp_xmit_size_goal(sk, mss_now, !(flags & MSG_OOB));
++ }
+
+ return mss_now;
+ }
+@@ -892,11 +925,32 @@ static ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
+ * is fully established.
+ */
+ if (((1 << sk->sk_state) & ~(TCPF_ESTABLISHED | TCPF_CLOSE_WAIT)) &&
+- !tcp_passive_fastopen(sk)) {
++ !tcp_passive_fastopen(mptcp(tp) && tp->mpcb->master_sk ?
++ tp->mpcb->master_sk : sk)) {
+ if ((err = sk_stream_wait_connect(sk, &timeo)) != 0)
+ goto out_err;
+ }
+
++ if (mptcp(tp)) {
++ struct sock *sk_it = sk;
++
++ /* We must check this with socket-lock hold because we iterate
++ * over the subflows.
++ */
++ if (!mptcp_can_sendpage(sk)) {
++ ssize_t ret;
++
++ release_sock(sk);
++ ret = sock_no_sendpage(sk->sk_socket, page, offset,
++ size, flags);
++ lock_sock(sk);
++ return ret;
++ }
++
++ mptcp_for_each_sk(tp->mpcb, sk_it)
++ sock_rps_record_flow(sk_it);
++ }
++
+ clear_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags);
+
+ mss_now = tcp_send_mss(sk, &size_goal, flags);
+@@ -1001,8 +1055,9 @@ int tcp_sendpage(struct sock *sk, struct page *page, int offset,
+ {
+ ssize_t res;
+
+- if (!(sk->sk_route_caps & NETIF_F_SG) ||
+- !(sk->sk_route_caps & NETIF_F_ALL_CSUM))
++ /* If MPTCP is enabled, we check it later after establishment */
++ if (!mptcp(tcp_sk(sk)) && (!(sk->sk_route_caps & NETIF_F_SG) ||
++ !(sk->sk_route_caps & NETIF_F_ALL_CSUM)))
+ return sock_no_sendpage(sk->sk_socket, page, offset, size,
+ flags);
+
+@@ -1018,6 +1073,9 @@ static inline int select_size(const struct sock *sk, bool sg)
+ const struct tcp_sock *tp = tcp_sk(sk);
+ int tmp = tp->mss_cache;
+
++ if (mptcp(tp))
++ return mptcp_select_size(sk, sg);
++
+ if (sg) {
+ if (sk_can_gso(sk)) {
+ /* Small frames wont use a full page:
+@@ -1100,11 +1158,18 @@ int tcp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
+ * is fully established.
+ */
+ if (((1 << sk->sk_state) & ~(TCPF_ESTABLISHED | TCPF_CLOSE_WAIT)) &&
+- !tcp_passive_fastopen(sk)) {
++ !tcp_passive_fastopen(mptcp(tp) && tp->mpcb->master_sk ?
++ tp->mpcb->master_sk : sk)) {
+ if ((err = sk_stream_wait_connect(sk, &timeo)) != 0)
+ goto do_error;
+ }
+
++ if (mptcp(tp)) {
++ struct sock *sk_it = sk;
++ mptcp_for_each_sk(tp->mpcb, sk_it)
++ sock_rps_record_flow(sk_it);
++ }
++
+ if (unlikely(tp->repair)) {
+ if (tp->repair_queue == TCP_RECV_QUEUE) {
+ copied = tcp_send_rcvq(sk, msg, size);
+@@ -1132,7 +1197,10 @@ int tcp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
+ if (sk->sk_err || (sk->sk_shutdown & SEND_SHUTDOWN))
+ goto out_err;
+
+- sg = !!(sk->sk_route_caps & NETIF_F_SG);
++ if (mptcp(tp))
++ sg = mptcp_can_sg(sk);
++ else
++ sg = !!(sk->sk_route_caps & NETIF_F_SG);
+
+ while (--iovlen >= 0) {
+ size_t seglen = iov->iov_len;
+@@ -1183,8 +1251,15 @@ new_segment:
+
+ /*
+ * Check whether we can use HW checksum.
++ *
++ * If dss-csum is enabled, we do not do hw-csum.
++ * In case of non-mptcp we check the
++ * device-capabilities.
++ * In case of mptcp, hw-csum's will be handled
++ * later in mptcp_write_xmit.
+ */
+- if (sk->sk_route_caps & NETIF_F_ALL_CSUM)
++ if (((mptcp(tp) && !tp->mpcb->dss_csum) || !mptcp(tp)) &&
++ (mptcp(tp) || sk->sk_route_caps & NETIF_F_ALL_CSUM))
+ skb->ip_summed = CHECKSUM_PARTIAL;
+
+ skb_entail(sk, skb);
+@@ -1422,7 +1497,7 @@ void tcp_cleanup_rbuf(struct sock *sk, int copied)
+
+ /* Optimize, __tcp_select_window() is not cheap. */
+ if (2*rcv_window_now <= tp->window_clamp) {
+- __u32 new_window = __tcp_select_window(sk);
++ __u32 new_window = tp->ops->__select_window(sk);
+
+ /* Send ACK now, if this read freed lots of space
+ * in our buffer. Certainly, new_window is new window.
+@@ -1587,7 +1662,7 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
+ /* Clean up data we have read: This will do ACK frames. */
+ if (copied > 0) {
+ tcp_recv_skb(sk, seq, &offset);
+- tcp_cleanup_rbuf(sk, copied);
++ tp->ops->cleanup_rbuf(sk, copied);
+ }
+ return copied;
+ }
+@@ -1623,6 +1698,14 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
+
+ lock_sock(sk);
+
++#ifdef CONFIG_MPTCP
++ if (mptcp(tp)) {
++ struct sock *sk_it;
++ mptcp_for_each_sk(tp->mpcb, sk_it)
++ sock_rps_record_flow(sk_it);
++ }
++#endif
++
+ err = -ENOTCONN;
+ if (sk->sk_state == TCP_LISTEN)
+ goto out;
+@@ -1761,7 +1844,7 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
+ }
+ }
+
+- tcp_cleanup_rbuf(sk, copied);
++ tp->ops->cleanup_rbuf(sk, copied);
+
+ if (!sysctl_tcp_low_latency && tp->ucopy.task == user_recv) {
+ /* Install new reader */
+@@ -1813,7 +1896,7 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
+ if (tp->rcv_wnd == 0 &&
+ !skb_queue_empty(&sk->sk_async_wait_queue)) {
+ tcp_service_net_dma(sk, true);
+- tcp_cleanup_rbuf(sk, copied);
++ tp->ops->cleanup_rbuf(sk, copied);
+ } else
+ dma_async_issue_pending(tp->ucopy.dma_chan);
+ }
+@@ -1993,7 +2076,7 @@ skip_copy:
+ */
+
+ /* Clean up data we have read: This will do ACK frames. */
+- tcp_cleanup_rbuf(sk, copied);
++ tp->ops->cleanup_rbuf(sk, copied);
+
+ release_sock(sk);
+ return copied;
+@@ -2070,7 +2153,7 @@ static const unsigned char new_state[16] = {
+ /* TCP_CLOSING */ TCP_CLOSING,
+ };
+
+-static int tcp_close_state(struct sock *sk)
++int tcp_close_state(struct sock *sk)
+ {
+ int next = (int)new_state[sk->sk_state];
+ int ns = next & TCP_STATE_MASK;
+@@ -2100,7 +2183,7 @@ void tcp_shutdown(struct sock *sk, int how)
+ TCPF_SYN_RECV | TCPF_CLOSE_WAIT)) {
+ /* Clear out any half completed packets. FIN if needed. */
+ if (tcp_close_state(sk))
+- tcp_send_fin(sk);
++ tcp_sk(sk)->ops->send_fin(sk);
+ }
+ }
+ EXPORT_SYMBOL(tcp_shutdown);
+@@ -2125,6 +2208,11 @@ void tcp_close(struct sock *sk, long timeout)
+ int data_was_unread = 0;
+ int state;
+
++ if (is_meta_sk(sk)) {
++ mptcp_close(sk, timeout);
++ return;
++ }
++
+ lock_sock(sk);
+ sk->sk_shutdown = SHUTDOWN_MASK;
+
+@@ -2167,7 +2255,7 @@ void tcp_close(struct sock *sk, long timeout)
+ /* Unread data was tossed, zap the connection. */
+ NET_INC_STATS_USER(sock_net(sk), LINUX_MIB_TCPABORTONCLOSE);
+ tcp_set_state(sk, TCP_CLOSE);
+- tcp_send_active_reset(sk, sk->sk_allocation);
++ tcp_sk(sk)->ops->send_active_reset(sk, sk->sk_allocation);
+ } else if (sock_flag(sk, SOCK_LINGER) && !sk->sk_lingertime) {
+ /* Check zero linger _after_ checking for unread data. */
+ sk->sk_prot->disconnect(sk, 0);
+@@ -2247,7 +2335,7 @@ adjudge_to_death:
+ struct tcp_sock *tp = tcp_sk(sk);
+ if (tp->linger2 < 0) {
+ tcp_set_state(sk, TCP_CLOSE);
+- tcp_send_active_reset(sk, GFP_ATOMIC);
++ tp->ops->send_active_reset(sk, GFP_ATOMIC);
+ NET_INC_STATS_BH(sock_net(sk),
+ LINUX_MIB_TCPABORTONLINGER);
+ } else {
+@@ -2257,7 +2345,8 @@ adjudge_to_death:
+ inet_csk_reset_keepalive_timer(sk,
+ tmo - TCP_TIMEWAIT_LEN);
+ } else {
+- tcp_time_wait(sk, TCP_FIN_WAIT2, tmo);
++ tcp_sk(sk)->ops->time_wait(sk, TCP_FIN_WAIT2,
++ tmo);
+ goto out;
+ }
+ }
+@@ -2266,7 +2355,7 @@ adjudge_to_death:
+ sk_mem_reclaim(sk);
+ if (tcp_check_oom(sk, 0)) {
+ tcp_set_state(sk, TCP_CLOSE);
+- tcp_send_active_reset(sk, GFP_ATOMIC);
++ tcp_sk(sk)->ops->send_active_reset(sk, GFP_ATOMIC);
+ NET_INC_STATS_BH(sock_net(sk),
+ LINUX_MIB_TCPABORTONMEMORY);
+ }
+@@ -2291,15 +2380,6 @@ out:
+ }
+ EXPORT_SYMBOL(tcp_close);
+
+-/* These states need RST on ABORT according to RFC793 */
+-
+-static inline bool tcp_need_reset(int state)
+-{
+- return (1 << state) &
+- (TCPF_ESTABLISHED | TCPF_CLOSE_WAIT | TCPF_FIN_WAIT1 |
+- TCPF_FIN_WAIT2 | TCPF_SYN_RECV);
+-}
+-
+ int tcp_disconnect(struct sock *sk, int flags)
+ {
+ struct inet_sock *inet = inet_sk(sk);
+@@ -2322,7 +2402,7 @@ int tcp_disconnect(struct sock *sk, int flags)
+ /* The last check adjusts for discrepancy of Linux wrt. RFC
+ * states
+ */
+- tcp_send_active_reset(sk, gfp_any());
++ tp->ops->send_active_reset(sk, gfp_any());
+ sk->sk_err = ECONNRESET;
+ } else if (old_state == TCP_SYN_SENT)
+ sk->sk_err = ECONNRESET;
+@@ -2340,6 +2420,13 @@ int tcp_disconnect(struct sock *sk, int flags)
+ if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK))
+ inet_reset_saddr(sk);
+
++ if (is_meta_sk(sk)) {
++ mptcp_disconnect(sk);
++ } else {
++ if (tp->inside_tk_table)
++ mptcp_hash_remove_bh(tp);
++ }
++
+ sk->sk_shutdown = 0;
+ sock_reset_flag(sk, SOCK_DONE);
+ tp->srtt_us = 0;
+@@ -2632,6 +2719,12 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
+ break;
+
+ case TCP_DEFER_ACCEPT:
++ /* An established MPTCP-connection (mptcp(tp) only returns true
++ * if the socket is established) should not use DEFER on new
++ * subflows.
++ */
++ if (mptcp(tp))
++ break;
+ /* Translate value in seconds to number of retransmits */
+ icsk->icsk_accept_queue.rskq_defer_accept =
+ secs_to_retrans(val, TCP_TIMEOUT_INIT / HZ,
+@@ -2659,7 +2752,7 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
+ (TCPF_ESTABLISHED | TCPF_CLOSE_WAIT) &&
+ inet_csk_ack_scheduled(sk)) {
+ icsk->icsk_ack.pending |= ICSK_ACK_PUSHED;
+- tcp_cleanup_rbuf(sk, 1);
++ tp->ops->cleanup_rbuf(sk, 1);
+ if (!(val & 1))
+ icsk->icsk_ack.pingpong = 1;
+ }
+@@ -2699,6 +2792,18 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
+ tp->notsent_lowat = val;
+ sk->sk_write_space(sk);
+ break;
++#ifdef CONFIG_MPTCP
++ case MPTCP_ENABLED:
++ if (sk->sk_state == TCP_CLOSE || sk->sk_state == TCP_LISTEN) {
++ if (val)
++ tp->mptcp_enabled = 1;
++ else
++ tp->mptcp_enabled = 0;
++ } else {
++ err = -EPERM;
++ }
++ break;
++#endif
+ default:
+ err = -ENOPROTOOPT;
+ break;
+@@ -2931,6 +3036,11 @@ static int do_tcp_getsockopt(struct sock *sk, int level,
+ case TCP_NOTSENT_LOWAT:
+ val = tp->notsent_lowat;
+ break;
++#ifdef CONFIG_MPTCP
++ case MPTCP_ENABLED:
++ val = tp->mptcp_enabled;
++ break;
++#endif
+ default:
+ return -ENOPROTOOPT;
+ }
+@@ -3120,8 +3230,11 @@ void tcp_done(struct sock *sk)
+ if (sk->sk_state == TCP_SYN_SENT || sk->sk_state == TCP_SYN_RECV)
+ TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_ATTEMPTFAILS);
+
++ WARN_ON(sk->sk_state == TCP_CLOSE);
+ tcp_set_state(sk, TCP_CLOSE);
++
+ tcp_clear_xmit_timers(sk);
++
+ if (req != NULL)
+ reqsk_fastopen_remove(sk, req, false);
+
+diff --git a/net/ipv4/tcp_fastopen.c b/net/ipv4/tcp_fastopen.c
+index 9771563ab564..5c230d96c4c1 100644
+--- a/net/ipv4/tcp_fastopen.c
++++ b/net/ipv4/tcp_fastopen.c
+@@ -7,6 +7,7 @@
+ #include <linux/rculist.h>
+ #include <net/inetpeer.h>
+ #include <net/tcp.h>
++#include <net/mptcp.h>
+
+ int sysctl_tcp_fastopen __read_mostly = TFO_CLIENT_ENABLE;
+
+@@ -133,7 +134,7 @@ static bool tcp_fastopen_create_child(struct sock *sk,
+ {
+ struct tcp_sock *tp;
+ struct request_sock_queue *queue = &inet_csk(sk)->icsk_accept_queue;
+- struct sock *child;
++ struct sock *child, *meta_sk;
+
+ req->num_retrans = 0;
+ req->num_timeout = 0;
+@@ -176,13 +177,6 @@ static bool tcp_fastopen_create_child(struct sock *sk,
+ /* Add the child socket directly into the accept queue */
+ inet_csk_reqsk_queue_add(sk, req, child);
+
+- /* Now finish processing the fastopen child socket. */
+- inet_csk(child)->icsk_af_ops->rebuild_header(child);
+- tcp_init_congestion_control(child);
+- tcp_mtup_init(child);
+- tcp_init_metrics(child);
+- tcp_init_buffer_space(child);
+-
+ /* Queue the data carried in the SYN packet. We need to first
+ * bump skb's refcnt because the caller will attempt to free it.
+ *
+@@ -199,8 +193,24 @@ static bool tcp_fastopen_create_child(struct sock *sk,
+ tp->syn_data_acked = 1;
+ }
+ tcp_rsk(req)->rcv_nxt = tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq;
++
++ meta_sk = child;
++ if (!mptcp_check_req_fastopen(meta_sk, req)) {
++ child = tcp_sk(meta_sk)->mpcb->master_sk;
++ tp = tcp_sk(child);
++ }
++
++ /* Now finish processing the fastopen child socket. */
++ inet_csk(child)->icsk_af_ops->rebuild_header(child);
++ tp->ops->init_congestion_control(child);
++ tcp_mtup_init(child);
++ tcp_init_metrics(child);
++ tp->ops->init_buffer_space(child);
++
+ sk->sk_data_ready(sk);
+- bh_unlock_sock(child);
++ if (mptcp(tcp_sk(child)))
++ bh_unlock_sock(child);
++ bh_unlock_sock(meta_sk);
+ sock_put(child);
+ WARN_ON(req->sk == NULL);
+ return true;
+diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
+index 40639c288dc2..3273bb69f387 100644
+--- a/net/ipv4/tcp_input.c
++++ b/net/ipv4/tcp_input.c
+@@ -74,6 +74,9 @@
+ #include <linux/ipsec.h>
+ #include <asm/unaligned.h>
+ #include <net/netdma.h>
++#include <net/mptcp.h>
++#include <net/mptcp_v4.h>
++#include <net/mptcp_v6.h>
+
+ int sysctl_tcp_timestamps __read_mostly = 1;
+ int sysctl_tcp_window_scaling __read_mostly = 1;
+@@ -99,25 +102,6 @@ int sysctl_tcp_thin_dupack __read_mostly;
+ int sysctl_tcp_moderate_rcvbuf __read_mostly = 1;
+ int sysctl_tcp_early_retrans __read_mostly = 3;
+
+-#define FLAG_DATA 0x01 /* Incoming frame contained data. */
+-#define FLAG_WIN_UPDATE 0x02 /* Incoming ACK was a window update. */
+-#define FLAG_DATA_ACKED 0x04 /* This ACK acknowledged new data. */
+-#define FLAG_RETRANS_DATA_ACKED 0x08 /* "" "" some of which was retransmitted. */
+-#define FLAG_SYN_ACKED 0x10 /* This ACK acknowledged SYN. */
+-#define FLAG_DATA_SACKED 0x20 /* New SACK. */
+-#define FLAG_ECE 0x40 /* ECE in this ACK */
+-#define FLAG_SLOWPATH 0x100 /* Do not skip RFC checks for window update.*/
+-#define FLAG_ORIG_SACK_ACKED 0x200 /* Never retransmitted data are (s)acked */
+-#define FLAG_SND_UNA_ADVANCED 0x400 /* Snd_una was changed (!= FLAG_DATA_ACKED) */
+-#define FLAG_DSACKING_ACK 0x800 /* SACK blocks contained D-SACK info */
+-#define FLAG_SACK_RENEGING 0x2000 /* snd_una advanced to a sacked seq */
+-#define FLAG_UPDATE_TS_RECENT 0x4000 /* tcp_replace_ts_recent() */
+-
+-#define FLAG_ACKED (FLAG_DATA_ACKED|FLAG_SYN_ACKED)
+-#define FLAG_NOT_DUP (FLAG_DATA|FLAG_WIN_UPDATE|FLAG_ACKED)
+-#define FLAG_CA_ALERT (FLAG_DATA_SACKED|FLAG_ECE)
+-#define FLAG_FORWARD_PROGRESS (FLAG_ACKED|FLAG_DATA_SACKED)
+-
+ #define TCP_REMNANT (TCP_FLAG_FIN|TCP_FLAG_URG|TCP_FLAG_SYN|TCP_FLAG_PSH)
+ #define TCP_HP_BITS (~(TCP_RESERVED_BITS|TCP_FLAG_PSH))
+
+@@ -181,7 +165,7 @@ static void tcp_incr_quickack(struct sock *sk)
+ icsk->icsk_ack.quick = min(quickacks, TCP_MAX_QUICKACKS);
+ }
+
+-static void tcp_enter_quickack_mode(struct sock *sk)
++void tcp_enter_quickack_mode(struct sock *sk)
+ {
+ struct inet_connection_sock *icsk = inet_csk(sk);
+ tcp_incr_quickack(sk);
+@@ -283,8 +267,12 @@ static void tcp_sndbuf_expand(struct sock *sk)
+ per_mss = roundup_pow_of_two(per_mss) +
+ SKB_DATA_ALIGN(sizeof(struct sk_buff));
+
+- nr_segs = max_t(u32, TCP_INIT_CWND, tp->snd_cwnd);
+- nr_segs = max_t(u32, nr_segs, tp->reordering + 1);
++ if (mptcp(tp)) {
++ nr_segs = mptcp_check_snd_buf(tp);
++ } else {
++ nr_segs = max_t(u32, TCP_INIT_CWND, tp->snd_cwnd);
++ nr_segs = max_t(u32, nr_segs, tp->reordering + 1);
++ }
+
+ /* Fast Recovery (RFC 5681 3.2) :
+ * Cubic needs 1.7 factor, rounded to 2 to include
+@@ -292,8 +280,16 @@ static void tcp_sndbuf_expand(struct sock *sk)
+ */
+ sndmem = 2 * nr_segs * per_mss;
+
+- if (sk->sk_sndbuf < sndmem)
++ /* MPTCP: after this sndmem is the new contribution of the
++ * current subflow to the aggregated sndbuf */
++ if (sk->sk_sndbuf < sndmem) {
++ int old_sndbuf = sk->sk_sndbuf;
+ sk->sk_sndbuf = min(sndmem, sysctl_tcp_wmem[2]);
++ /* MPTCP: ok, the subflow sndbuf has grown, reflect
++ * this in the aggregate buffer. */
++ if (mptcp(tp) && old_sndbuf != sk->sk_sndbuf)
++ mptcp_update_sndbuf(tp);
++ }
+ }
+
+ /* 2. Tuning advertised window (window_clamp, rcv_ssthresh)
+@@ -342,10 +338,12 @@ static int __tcp_grow_window(const struct sock *sk, const struct sk_buff *skb)
+ static void tcp_grow_window(struct sock *sk, const struct sk_buff *skb)
+ {
+ struct tcp_sock *tp = tcp_sk(sk);
++ struct sock *meta_sk = mptcp(tp) ? mptcp_meta_sk(sk) : sk;
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
+
+ /* Check #1 */
+- if (tp->rcv_ssthresh < tp->window_clamp &&
+- (int)tp->rcv_ssthresh < tcp_space(sk) &&
++ if (meta_tp->rcv_ssthresh < meta_tp->window_clamp &&
++ (int)meta_tp->rcv_ssthresh < tcp_space(sk) &&
+ !sk_under_memory_pressure(sk)) {
+ int incr;
+
+@@ -353,14 +351,14 @@ static void tcp_grow_window(struct sock *sk, const struct sk_buff *skb)
+ * will fit to rcvbuf in future.
+ */
+ if (tcp_win_from_space(skb->truesize) <= skb->len)
+- incr = 2 * tp->advmss;
++ incr = 2 * meta_tp->advmss;
+ else
+- incr = __tcp_grow_window(sk, skb);
++ incr = __tcp_grow_window(meta_sk, skb);
+
+ if (incr) {
+ incr = max_t(int, incr, 2 * skb->len);
+- tp->rcv_ssthresh = min(tp->rcv_ssthresh + incr,
+- tp->window_clamp);
++ meta_tp->rcv_ssthresh = min(meta_tp->rcv_ssthresh + incr,
++ meta_tp->window_clamp);
+ inet_csk(sk)->icsk_ack.quick |= 1;
+ }
+ }
+@@ -543,7 +541,10 @@ void tcp_rcv_space_adjust(struct sock *sk)
+ int copied;
+
+ time = tcp_time_stamp - tp->rcvq_space.time;
+- if (time < (tp->rcv_rtt_est.rtt >> 3) || tp->rcv_rtt_est.rtt == 0)
++ if (mptcp(tp)) {
++ if (mptcp_check_rtt(tp, time))
++ return;
++ } else if (time < (tp->rcv_rtt_est.rtt >> 3) || tp->rcv_rtt_est.rtt == 0)
+ return;
+
+ /* Number of bytes copied to user in last RTT */
+@@ -761,7 +762,7 @@ static void tcp_update_pacing_rate(struct sock *sk)
+ /* Calculate rto without backoff. This is the second half of Van Jacobson's
+ * routine referred to above.
+ */
+-static void tcp_set_rto(struct sock *sk)
++void tcp_set_rto(struct sock *sk)
+ {
+ const struct tcp_sock *tp = tcp_sk(sk);
+ /* Old crap is replaced with new one. 8)
+@@ -1376,7 +1377,11 @@ static struct sk_buff *tcp_shift_skb_data(struct sock *sk, struct sk_buff *skb,
+ int len;
+ int in_sack;
+
+- if (!sk_can_gso(sk))
++ /* For MPTCP we cannot shift skb-data and remove one skb from the
+ * send-queue, because this will make us lose the DSS-option (which
++ * is stored in TCP_SKB_CB(skb)->dss) of the skb we are removing.
++ */
++ if (!sk_can_gso(sk) || mptcp(tp))
+ goto fallback;
+
+ /* Normally R but no L won't result in plain S */
+@@ -2915,7 +2920,7 @@ static inline bool tcp_ack_update_rtt(struct sock *sk, const int flag,
+ return false;
+
+ tcp_rtt_estimator(sk, seq_rtt_us);
+- tcp_set_rto(sk);
++ tp->ops->set_rto(sk);
+
+ /* RFC6298: only reset backoff on valid RTT measurement. */
+ inet_csk(sk)->icsk_backoff = 0;
+@@ -3000,7 +3005,7 @@ void tcp_resume_early_retransmit(struct sock *sk)
+ }
+
+ /* If we get here, the whole TSO packet has not been acked. */
+-static u32 tcp_tso_acked(struct sock *sk, struct sk_buff *skb)
++u32 tcp_tso_acked(struct sock *sk, struct sk_buff *skb)
+ {
+ struct tcp_sock *tp = tcp_sk(sk);
+ u32 packets_acked;
+@@ -3095,6 +3100,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, int prior_fackets,
+ */
+ if (!(scb->tcp_flags & TCPHDR_SYN)) {
+ flag |= FLAG_DATA_ACKED;
++ if (mptcp(tp) && mptcp_is_data_seq(skb))
++ flag |= MPTCP_FLAG_DATA_ACKED;
+ } else {
+ flag |= FLAG_SYN_ACKED;
+ tp->retrans_stamp = 0;
+@@ -3189,7 +3196,7 @@ static int tcp_clean_rtx_queue(struct sock *sk, int prior_fackets,
+ return flag;
+ }
+
+-static void tcp_ack_probe(struct sock *sk)
++void tcp_ack_probe(struct sock *sk)
+ {
+ const struct tcp_sock *tp = tcp_sk(sk);
+ struct inet_connection_sock *icsk = inet_csk(sk);
+@@ -3236,9 +3243,8 @@ static inline bool tcp_may_raise_cwnd(const struct sock *sk, const int flag)
+ /* Check that window update is acceptable.
+ * The function assumes that snd_una<=ack<=snd_next.
+ */
+-static inline bool tcp_may_update_window(const struct tcp_sock *tp,
+- const u32 ack, const u32 ack_seq,
+- const u32 nwin)
++bool tcp_may_update_window(const struct tcp_sock *tp, const u32 ack,
++ const u32 ack_seq, const u32 nwin)
+ {
+ return after(ack, tp->snd_una) ||
+ after(ack_seq, tp->snd_wl1) ||
+@@ -3357,7 +3363,7 @@ static void tcp_process_tlp_ack(struct sock *sk, u32 ack, int flag)
+ }
+
+ /* This routine deals with incoming acks, but not outgoing ones. */
+-static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
++static int tcp_ack(struct sock *sk, struct sk_buff *skb, int flag)
+ {
+ struct inet_connection_sock *icsk = inet_csk(sk);
+ struct tcp_sock *tp = tcp_sk(sk);
+@@ -3449,6 +3455,16 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
+ sack_rtt_us);
+ acked -= tp->packets_out;
+
++ if (mptcp(tp)) {
++ if (mptcp_fallback_infinite(sk, flag)) {
++ pr_err("%s resetting flow\n", __func__);
++ mptcp_send_reset(sk);
++ goto invalid_ack;
++ }
++
++ mptcp_clean_rtx_infinite(skb, sk);
++ }
++
+ /* Advance cwnd if state allows */
+ if (tcp_may_raise_cwnd(sk, flag))
+ tcp_cong_avoid(sk, ack, acked);
+@@ -3512,8 +3528,9 @@ old_ack:
+ * the fast version below fails.
+ */
+ void tcp_parse_options(const struct sk_buff *skb,
+- struct tcp_options_received *opt_rx, int estab,
+- struct tcp_fastopen_cookie *foc)
++ struct tcp_options_received *opt_rx,
++ struct mptcp_options_received *mopt,
++ int estab, struct tcp_fastopen_cookie *foc)
+ {
+ const unsigned char *ptr;
+ const struct tcphdr *th = tcp_hdr(skb);
+@@ -3596,6 +3613,9 @@ void tcp_parse_options(const struct sk_buff *skb,
+ */
+ break;
+ #endif
++ case TCPOPT_MPTCP:
++ mptcp_parse_options(ptr - 2, opsize, mopt, skb);
++ break;
+ case TCPOPT_EXP:
+ /* Fast Open option shares code 254 using a
+ * 16 bits magic number. It's valid only in
+@@ -3657,8 +3677,8 @@ static bool tcp_fast_parse_options(const struct sk_buff *skb,
+ if (tcp_parse_aligned_timestamp(tp, th))
+ return true;
+ }
+-
+- tcp_parse_options(skb, &tp->rx_opt, 1, NULL);
++ tcp_parse_options(skb, &tp->rx_opt, mptcp(tp) ? &tp->mptcp->rx_opt : NULL,
++ 1, NULL);
+ if (tp->rx_opt.saw_tstamp && tp->rx_opt.rcv_tsecr)
+ tp->rx_opt.rcv_tsecr -= tp->tsoffset;
+
+@@ -3831,6 +3851,8 @@ static void tcp_fin(struct sock *sk)
+ dst = __sk_dst_get(sk);
+ if (!dst || !dst_metric(dst, RTAX_QUICKACK))
+ inet_csk(sk)->icsk_ack.pingpong = 1;
++ if (mptcp(tp))
++ mptcp_sub_close_passive(sk);
+ break;
+
+ case TCP_CLOSE_WAIT:
+@@ -3852,9 +3874,16 @@ static void tcp_fin(struct sock *sk)
+ tcp_set_state(sk, TCP_CLOSING);
+ break;
+ case TCP_FIN_WAIT2:
++ if (mptcp(tp)) {
++ /* The socket will get closed by mptcp_data_ready.
++ * We first have to process all data-sequences.
++ */
++ tp->close_it = 1;
++ break;
++ }
+ /* Received a FIN -- send ACK and enter TIME_WAIT. */
+ tcp_send_ack(sk);
+- tcp_time_wait(sk, TCP_TIME_WAIT, 0);
++ tp->ops->time_wait(sk, TCP_TIME_WAIT, 0);
+ break;
+ default:
+ /* Only TCP_LISTEN and TCP_CLOSE are left, in these
+@@ -3876,6 +3905,10 @@ static void tcp_fin(struct sock *sk)
+ if (!sock_flag(sk, SOCK_DEAD)) {
+ sk->sk_state_change(sk);
+
++ /* Don't wake up MPTCP-subflows */
++ if (mptcp(tp))
++ return;
++
+ /* Do not send POLL_HUP for half duplex close. */
+ if (sk->sk_shutdown == SHUTDOWN_MASK ||
+ sk->sk_state == TCP_CLOSE)
+@@ -4073,7 +4106,11 @@ static void tcp_ofo_queue(struct sock *sk)
+ tcp_dsack_extend(sk, TCP_SKB_CB(skb)->seq, dsack);
+ }
+
+- if (!after(TCP_SKB_CB(skb)->end_seq, tp->rcv_nxt)) {
++ /* In case of MPTCP, the segment may be empty if it's a
++ * non-data DATA_FIN. (see beginning of tcp_data_queue)
++ */
++ if (!after(TCP_SKB_CB(skb)->end_seq, tp->rcv_nxt) &&
++ !(mptcp(tp) && TCP_SKB_CB(skb)->end_seq == TCP_SKB_CB(skb)->seq)) {
+ SOCK_DEBUG(sk, "ofo packet was already received\n");
+ __skb_unlink(skb, &tp->out_of_order_queue);
+ __kfree_skb(skb);
+@@ -4091,12 +4128,14 @@ static void tcp_ofo_queue(struct sock *sk)
+ }
+ }
+
+-static bool tcp_prune_ofo_queue(struct sock *sk);
+ static int tcp_prune_queue(struct sock *sk);
+
+ static int tcp_try_rmem_schedule(struct sock *sk, struct sk_buff *skb,
+ unsigned int size)
+ {
++ if (mptcp(tcp_sk(sk)))
++ sk = mptcp_meta_sk(sk);
++
+ if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf ||
+ !sk_rmem_schedule(sk, skb, size)) {
+
+@@ -4104,7 +4143,7 @@ static int tcp_try_rmem_schedule(struct sock *sk, struct sk_buff *skb,
+ return -1;
+
+ if (!sk_rmem_schedule(sk, skb, size)) {
+- if (!tcp_prune_ofo_queue(sk))
++ if (!tcp_sk(sk)->ops->prune_ofo_queue(sk))
+ return -1;
+
+ if (!sk_rmem_schedule(sk, skb, size))
+@@ -4127,15 +4166,16 @@ static int tcp_try_rmem_schedule(struct sock *sk, struct sk_buff *skb,
+ * Better try to coalesce them right now to avoid future collapses.
+ * Returns true if caller should free @from instead of queueing it
+ */
+-static bool tcp_try_coalesce(struct sock *sk,
+- struct sk_buff *to,
+- struct sk_buff *from,
+- bool *fragstolen)
++bool tcp_try_coalesce(struct sock *sk, struct sk_buff *to, struct sk_buff *from,
++ bool *fragstolen)
+ {
+ int delta;
+
+ *fragstolen = false;
+
++ if (mptcp(tcp_sk(sk)) && !is_meta_sk(sk))
++ return false;
++
+ if (tcp_hdr(from)->fin)
+ return false;
+
+@@ -4225,7 +4265,9 @@ static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
+
+ /* Do skb overlap to previous one? */
+ if (skb1 && before(seq, TCP_SKB_CB(skb1)->end_seq)) {
+- if (!after(end_seq, TCP_SKB_CB(skb1)->end_seq)) {
++ /* MPTCP allows non-data data-fin to be in the ofo-queue */
++ if (!after(end_seq, TCP_SKB_CB(skb1)->end_seq) &&
++ !(mptcp(tp) && end_seq == seq)) {
+ /* All the bits are present. Drop. */
+ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPOFOMERGE);
+ __kfree_skb(skb);
+@@ -4263,6 +4305,9 @@ static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
+ end_seq);
+ break;
+ }
++ /* MPTCP allows non-data data-fin to be in the ofo-queue */
++ if (mptcp(tp) && TCP_SKB_CB(skb1)->seq == TCP_SKB_CB(skb1)->end_seq)
++ continue;
+ __skb_unlink(skb1, &tp->out_of_order_queue);
+ tcp_dsack_extend(sk, TCP_SKB_CB(skb1)->seq,
+ TCP_SKB_CB(skb1)->end_seq);
+@@ -4280,8 +4325,8 @@ end:
+ }
+ }
+
+-static int __must_check tcp_queue_rcv(struct sock *sk, struct sk_buff *skb, int hdrlen,
+- bool *fragstolen)
++int __must_check tcp_queue_rcv(struct sock *sk, struct sk_buff *skb, int hdrlen,
++ bool *fragstolen)
+ {
+ int eaten;
+ struct sk_buff *tail = skb_peek_tail(&sk->sk_receive_queue);
+@@ -4343,7 +4388,10 @@ static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)
+ int eaten = -1;
+ bool fragstolen = false;
+
+- if (TCP_SKB_CB(skb)->seq == TCP_SKB_CB(skb)->end_seq)
++ /* If no data is present, but a data_fin is in the options, we still
++ * have to call mptcp_queue_skb later on. */
++ if (TCP_SKB_CB(skb)->seq == TCP_SKB_CB(skb)->end_seq &&
++ !(mptcp(tp) && mptcp_is_data_fin(skb)))
+ goto drop;
+
+ skb_dst_drop(skb);
+@@ -4389,7 +4437,7 @@ queue_and_out:
+ eaten = tcp_queue_rcv(sk, skb, 0, &fragstolen);
+ }
+ tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq;
+- if (skb->len)
++ if (skb->len || mptcp_is_data_fin(skb))
+ tcp_event_data_recv(sk, skb);
+ if (th->fin)
+ tcp_fin(sk);
+@@ -4411,7 +4459,11 @@ queue_and_out:
+
+ if (eaten > 0)
+ kfree_skb_partial(skb, fragstolen);
+- if (!sock_flag(sk, SOCK_DEAD))
++ if (!sock_flag(sk, SOCK_DEAD) || mptcp(tp))
++ /* MPTCP: we always have to call data_ready, because
++ * we may be about to receive a data-fin, which still
++ * must get queued.
++ */
+ sk->sk_data_ready(sk);
+ return;
+ }
+@@ -4463,6 +4515,8 @@ static struct sk_buff *tcp_collapse_one(struct sock *sk, struct sk_buff *skb,
+ next = skb_queue_next(list, skb);
+
+ __skb_unlink(skb, list);
++ if (mptcp(tcp_sk(sk)))
++ mptcp_remove_shortcuts(tcp_sk(sk)->mpcb, skb);
+ __kfree_skb(skb);
+ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPRCVCOLLAPSED);
+
+@@ -4630,7 +4684,7 @@ static void tcp_collapse_ofo_queue(struct sock *sk)
+ * Purge the out-of-order queue.
+ * Return true if queue was pruned.
+ */
+-static bool tcp_prune_ofo_queue(struct sock *sk)
++bool tcp_prune_ofo_queue(struct sock *sk)
+ {
+ struct tcp_sock *tp = tcp_sk(sk);
+ bool res = false;
+@@ -4686,7 +4740,7 @@ static int tcp_prune_queue(struct sock *sk)
+ /* Collapsing did not help, destructive actions follow.
+ * This must not ever occur. */
+
+- tcp_prune_ofo_queue(sk);
++ tp->ops->prune_ofo_queue(sk);
+
+ if (atomic_read(&sk->sk_rmem_alloc) <= sk->sk_rcvbuf)
+ return 0;
+@@ -4702,7 +4756,29 @@ static int tcp_prune_queue(struct sock *sk)
+ return -1;
+ }
+
+-static bool tcp_should_expand_sndbuf(const struct sock *sk)
++/* RFC2861, slow part. Adjust cwnd, after it was not full during one rto.
++ * As additional protections, we do not touch cwnd in retransmission phases,
++ * and if application hit its sndbuf limit recently.
++ */
++void tcp_cwnd_application_limited(struct sock *sk)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++
++ if (inet_csk(sk)->icsk_ca_state == TCP_CA_Open &&
++ sk->sk_socket && !test_bit(SOCK_NOSPACE, &sk->sk_socket->flags)) {
++ /* Limited by application or receiver window. */
++ u32 init_win = tcp_init_cwnd(tp, __sk_dst_get(sk));
++ u32 win_used = max(tp->snd_cwnd_used, init_win);
++ if (win_used < tp->snd_cwnd) {
++ tp->snd_ssthresh = tcp_current_ssthresh(sk);
++ tp->snd_cwnd = (tp->snd_cwnd + win_used) >> 1;
++ }
++ tp->snd_cwnd_used = 0;
++ }
++ tp->snd_cwnd_stamp = tcp_time_stamp;
++}
++
++bool tcp_should_expand_sndbuf(const struct sock *sk)
+ {
+ const struct tcp_sock *tp = tcp_sk(sk);
+
+@@ -4737,7 +4813,7 @@ static void tcp_new_space(struct sock *sk)
+ {
+ struct tcp_sock *tp = tcp_sk(sk);
+
+- if (tcp_should_expand_sndbuf(sk)) {
++ if (tp->ops->should_expand_sndbuf(sk)) {
+ tcp_sndbuf_expand(sk);
+ tp->snd_cwnd_stamp = tcp_time_stamp;
+ }
+@@ -4749,8 +4825,9 @@ static void tcp_check_space(struct sock *sk)
+ {
+ if (sock_flag(sk, SOCK_QUEUE_SHRUNK)) {
+ sock_reset_flag(sk, SOCK_QUEUE_SHRUNK);
+- if (sk->sk_socket &&
+- test_bit(SOCK_NOSPACE, &sk->sk_socket->flags))
++ if (mptcp(tcp_sk(sk)) ||
++ (sk->sk_socket &&
++ test_bit(SOCK_NOSPACE, &sk->sk_socket->flags)))
+ tcp_new_space(sk);
+ }
+ }
+@@ -4773,7 +4850,7 @@ static void __tcp_ack_snd_check(struct sock *sk, int ofo_possible)
+ /* ... and right edge of window advances far enough.
+ * (tcp_recvmsg() will send ACK otherwise). Or...
+ */
+- __tcp_select_window(sk) >= tp->rcv_wnd) ||
++ tp->ops->__select_window(sk) >= tp->rcv_wnd) ||
+ /* We ACK each frame or... */
+ tcp_in_quickack_mode(sk) ||
+ /* We have out of order data. */
+@@ -4875,6 +4952,10 @@ static void tcp_urg(struct sock *sk, struct sk_buff *skb, const struct tcphdr *t
+ {
+ struct tcp_sock *tp = tcp_sk(sk);
+
++ /* MPTCP urgent data is not yet supported */
++ if (mptcp(tp))
++ return;
++
+ /* Check if we get a new urgent pointer - normally not. */
+ if (th->urg)
+ tcp_check_urg(sk, th);
+@@ -4942,8 +5023,7 @@ static inline bool tcp_checksum_complete_user(struct sock *sk,
+ }
+
+ #ifdef CONFIG_NET_DMA
+-static bool tcp_dma_try_early_copy(struct sock *sk, struct sk_buff *skb,
+- int hlen)
++bool tcp_dma_try_early_copy(struct sock *sk, struct sk_buff *skb, int hlen)
+ {
+ struct tcp_sock *tp = tcp_sk(sk);
+ int chunk = skb->len - hlen;
+@@ -5052,9 +5132,15 @@ syn_challenge:
+ goto discard;
+ }
+
++ /* If valid: post process the received MPTCP options. */
++ if (mptcp(tp) && mptcp_handle_options(sk, th, skb))
++ goto discard;
++
+ return true;
+
+ discard:
++ if (mptcp(tp))
++ mptcp_reset_mopt(tp);
+ __kfree_skb(skb);
+ return false;
+ }
+@@ -5106,6 +5192,10 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
+
+ tp->rx_opt.saw_tstamp = 0;
+
++ /* MPTCP: force slowpath. */
++ if (mptcp(tp))
++ goto slow_path;
++
+ /* pred_flags is 0xS?10 << 16 + snd_wnd
+ * if header_prediction is to be made
+ * 'S' will always be tp->tcp_header_len >> 2
+@@ -5205,7 +5295,7 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
+ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPHPHITSTOUSER);
+ }
+ if (copied_early)
+- tcp_cleanup_rbuf(sk, skb->len);
++ tp->ops->cleanup_rbuf(sk, skb->len);
+ }
+ if (!eaten) {
+ if (tcp_checksum_complete_user(sk, skb))
+@@ -5313,14 +5403,14 @@ void tcp_finish_connect(struct sock *sk, struct sk_buff *skb)
+
+ tcp_init_metrics(sk);
+
+- tcp_init_congestion_control(sk);
++ tp->ops->init_congestion_control(sk);
+
+ /* Prevent spurious tcp_cwnd_restart() on first data
+ * packet.
+ */
+ tp->lsndtime = tcp_time_stamp;
+
+- tcp_init_buffer_space(sk);
++ tp->ops->init_buffer_space(sk);
+
+ if (sock_flag(sk, SOCK_KEEPOPEN))
+ inet_csk_reset_keepalive_timer(sk, keepalive_time_when(tp));
+@@ -5350,7 +5440,7 @@ static bool tcp_rcv_fastopen_synack(struct sock *sk, struct sk_buff *synack,
+ /* Get original SYNACK MSS value if user MSS sets mss_clamp */
+ tcp_clear_options(&opt);
+ opt.user_mss = opt.mss_clamp = 0;
+- tcp_parse_options(synack, &opt, 0, NULL);
++ tcp_parse_options(synack, &opt, NULL, 0, NULL);
+ mss = opt.mss_clamp;
+ }
+
+@@ -5365,7 +5455,11 @@ static bool tcp_rcv_fastopen_synack(struct sock *sk, struct sk_buff *synack,
+
+ tcp_fastopen_cache_set(sk, mss, cookie, syn_drop);
+
+- if (data) { /* Retransmit unacked data in SYN */
++ /* In mptcp case, we do not rely on "retransmit", but instead on
++ * "transmit", because if fastopen data is not acked, the retransmission
++ * becomes the first MPTCP data (see mptcp_rcv_synsent_fastopen).
++ */
++ if (data && !mptcp(tp)) { /* Retransmit unacked data in SYN */
+ tcp_for_write_queue_from(data, sk) {
+ if (data == tcp_send_head(sk) ||
+ __tcp_retransmit_skb(sk, data))
+@@ -5388,8 +5482,11 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
+ struct tcp_sock *tp = tcp_sk(sk);
+ struct tcp_fastopen_cookie foc = { .len = -1 };
+ int saved_clamp = tp->rx_opt.mss_clamp;
++ struct mptcp_options_received mopt;
++ mptcp_init_mp_opt(&mopt);
+
+- tcp_parse_options(skb, &tp->rx_opt, 0, &foc);
++ tcp_parse_options(skb, &tp->rx_opt,
++ mptcp(tp) ? &tp->mptcp->rx_opt : &mopt, 0, &foc);
+ if (tp->rx_opt.saw_tstamp && tp->rx_opt.rcv_tsecr)
+ tp->rx_opt.rcv_tsecr -= tp->tsoffset;
+
+@@ -5448,6 +5545,30 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
+ tcp_init_wl(tp, TCP_SKB_CB(skb)->seq);
+ tcp_ack(sk, skb, FLAG_SLOWPATH);
+
++ if (tp->request_mptcp || mptcp(tp)) {
++ int ret;
++ ret = mptcp_rcv_synsent_state_process(sk, &sk,
++ skb, &mopt);
++
++ /* May have changed if we support MPTCP */
++ tp = tcp_sk(sk);
++ icsk = inet_csk(sk);
++
++ if (ret == 1)
++ goto reset_and_undo;
++ if (ret == 2)
++ goto discard;
++ }
++
++ if (mptcp(tp) && !is_master_tp(tp)) {
++ /* Timer for repeating the ACK until an answer
++ * arrives. Used only when establishing an additional
++ * subflow inside of an MPTCP connection.
++ */
++ sk_reset_timer(sk, &tp->mptcp->mptcp_ack_timer,
++ jiffies + icsk->icsk_rto);
++ }
++
+ /* Ok.. it's good. Set up sequence numbers and
+ * move to established.
+ */
+@@ -5474,6 +5595,11 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
+ tp->tcp_header_len = sizeof(struct tcphdr);
+ }
+
++ if (mptcp(tp)) {
++ tp->tcp_header_len += MPTCP_SUB_LEN_DSM_ALIGN;
++ tp->advmss -= MPTCP_SUB_LEN_DSM_ALIGN;
++ }
++
+ if (tcp_is_sack(tp) && sysctl_tcp_fack)
+ tcp_enable_fack(tp);
+
+@@ -5494,9 +5620,12 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
+ tcp_rcv_fastopen_synack(sk, skb, &foc))
+ return -1;
+
+- if (sk->sk_write_pending ||
++ /* With MPTCP we cannot send data on the third ack due to the
++ * lack of option-space to combine with an MP_CAPABLE.
++ */
++ if (!mptcp(tp) && (sk->sk_write_pending ||
+ icsk->icsk_accept_queue.rskq_defer_accept ||
+- icsk->icsk_ack.pingpong) {
++ icsk->icsk_ack.pingpong)) {
+ /* Save one ACK. Data will be ready after
+ * several ticks, if write_pending is set.
+ *
+@@ -5536,6 +5665,7 @@ discard:
+ tcp_paws_reject(&tp->rx_opt, 0))
+ goto discard_and_undo;
+
++ /* TODO - check this here for MPTCP */
+ if (th->syn) {
+ /* We see SYN without ACK. It is attempt of
+ * simultaneous connect with crossed SYNs.
+@@ -5552,6 +5682,11 @@ discard:
+ tp->tcp_header_len = sizeof(struct tcphdr);
+ }
+
++ if (mptcp(tp)) {
++ tp->tcp_header_len += MPTCP_SUB_LEN_DSM_ALIGN;
++ tp->advmss -= MPTCP_SUB_LEN_DSM_ALIGN;
++ }
++
+ tp->rcv_nxt = TCP_SKB_CB(skb)->seq + 1;
+ tp->rcv_wup = TCP_SKB_CB(skb)->seq + 1;
+
+@@ -5610,6 +5745,7 @@ reset_and_undo:
+
+ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
+ const struct tcphdr *th, unsigned int len)
++ __releases(&sk->sk_lock.slock)
+ {
+ struct tcp_sock *tp = tcp_sk(sk);
+ struct inet_connection_sock *icsk = inet_csk(sk);
+@@ -5661,6 +5797,16 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
+
+ case TCP_SYN_SENT:
+ queued = tcp_rcv_synsent_state_process(sk, skb, th, len);
++ if (is_meta_sk(sk)) {
++ sk = tcp_sk(sk)->mpcb->master_sk;
++ tp = tcp_sk(sk);
++
++ /* Need to call it here, because it will announce new
++ * addresses, which can only be done after the third ack
++ * of the 3-way handshake.
++ */
++ mptcp_update_metasocket(sk, tp->meta_sk);
++ }
+ if (queued >= 0)
+ return queued;
+
+@@ -5668,6 +5814,8 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
+ tcp_urg(sk, skb, th);
+ __kfree_skb(skb);
+ tcp_data_snd_check(sk);
++ if (mptcp(tp) && is_master_tp(tp))
++ bh_unlock_sock(sk);
+ return 0;
+ }
+
+@@ -5706,11 +5854,11 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
+ synack_stamp = tp->lsndtime;
+ /* Make sure socket is routed, for correct metrics. */
+ icsk->icsk_af_ops->rebuild_header(sk);
+- tcp_init_congestion_control(sk);
++ tp->ops->init_congestion_control(sk);
+
+ tcp_mtup_init(sk);
+ tp->copied_seq = tp->rcv_nxt;
+- tcp_init_buffer_space(sk);
++ tp->ops->init_buffer_space(sk);
+ }
+ smp_mb();
+ tcp_set_state(sk, TCP_ESTABLISHED);
+@@ -5730,6 +5878,8 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
+
+ if (tp->rx_opt.tstamp_ok)
+ tp->advmss -= TCPOLEN_TSTAMP_ALIGNED;
++ if (mptcp(tp))
++ tp->advmss -= MPTCP_SUB_LEN_DSM_ALIGN;
+
+ if (req) {
+ /* Re-arm the timer because data may have been sent out.
+@@ -5751,6 +5901,12 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
+
+ tcp_initialize_rcv_mss(sk);
+ tcp_fast_path_on(tp);
++ /* Send an ACK when establishing a new
++ * MPTCP subflow, i.e. using an MP_JOIN
++ * subtype.
++ */
++ if (mptcp(tp) && !is_master_tp(tp))
++ tcp_send_ack(sk);
+ break;
+
+ case TCP_FIN_WAIT1: {
+@@ -5802,7 +5958,8 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
+ tmo = tcp_fin_time(sk);
+ if (tmo > TCP_TIMEWAIT_LEN) {
+ inet_csk_reset_keepalive_timer(sk, tmo - TCP_TIMEWAIT_LEN);
+- } else if (th->fin || sock_owned_by_user(sk)) {
++ } else if (th->fin || mptcp_is_data_fin(skb) ||
++ sock_owned_by_user(sk)) {
+ /* Bad case. We could lose such FIN otherwise.
+ * It is not a big problem, but it looks confusing
+ * and not so rare event. We still can lose it now,
+@@ -5811,7 +5968,7 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
+ */
+ inet_csk_reset_keepalive_timer(sk, tmo);
+ } else {
+- tcp_time_wait(sk, TCP_FIN_WAIT2, tmo);
++ tp->ops->time_wait(sk, TCP_FIN_WAIT2, tmo);
+ goto discard;
+ }
+ break;
+@@ -5819,7 +5976,7 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
+
+ case TCP_CLOSING:
+ if (tp->snd_una == tp->write_seq) {
+- tcp_time_wait(sk, TCP_TIME_WAIT, 0);
++ tp->ops->time_wait(sk, TCP_TIME_WAIT, 0);
+ goto discard;
+ }
+ break;
+@@ -5831,6 +5988,9 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
+ goto discard;
+ }
+ break;
++ case TCP_CLOSE:
++ if (tp->mp_killed)
++ goto discard;
+ }
+
+ /* step 6: check the URG bit */
+@@ -5851,7 +6011,11 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
+ */
+ if (sk->sk_shutdown & RCV_SHUTDOWN) {
+ if (TCP_SKB_CB(skb)->end_seq != TCP_SKB_CB(skb)->seq &&
+- after(TCP_SKB_CB(skb)->end_seq - th->fin, tp->rcv_nxt)) {
++ after(TCP_SKB_CB(skb)->end_seq - th->fin, tp->rcv_nxt) &&
++ !mptcp(tp)) {
++ /* In case of mptcp, the reset is handled by
++ * mptcp_rcv_state_process
++ */
+ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPABORTONDATA);
+ tcp_reset(sk);
+ return 1;
+@@ -5877,3 +6041,154 @@ discard:
+ return 0;
+ }
+ EXPORT_SYMBOL(tcp_rcv_state_process);
++
++static inline void pr_drop_req(struct request_sock *req, __u16 port, int family)
++{
++ struct inet_request_sock *ireq = inet_rsk(req);
++
++ if (family == AF_INET)
++ LIMIT_NETDEBUG(KERN_DEBUG pr_fmt("drop open request from %pI4/%u\n"),
++ &ireq->ir_rmt_addr, port);
++#if IS_ENABLED(CONFIG_IPV6)
++ else if (family == AF_INET6)
++ LIMIT_NETDEBUG(KERN_DEBUG pr_fmt("drop open request from %pI6/%u\n"),
++ &ireq->ir_v6_rmt_addr, port);
++#endif
++}
++
++int tcp_conn_request(struct request_sock_ops *rsk_ops,
++ const struct tcp_request_sock_ops *af_ops,
++ struct sock *sk, struct sk_buff *skb)
++{
++ struct tcp_options_received tmp_opt;
++ struct request_sock *req;
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct dst_entry *dst = NULL;
++ __u32 isn = TCP_SKB_CB(skb)->when;
++ bool want_cookie = false, fastopen;
++ struct flowi fl;
++ struct tcp_fastopen_cookie foc = { .len = -1 };
++ int err;
++
++
++ /* TW buckets are converted to open requests without
++ * limitations, they conserve resources and peer is
++ * evidently real one.
++ */
++ if ((sysctl_tcp_syncookies == 2 ||
++ inet_csk_reqsk_queue_is_full(sk)) && !isn) {
++ want_cookie = tcp_syn_flood_action(sk, skb, rsk_ops->slab_name);
++ if (!want_cookie)
++ goto drop;
++ }
++
++
++ /* Accept backlog is full. If we have already queued enough
++ * of warm entries in syn queue, drop request. It is better than
++ * clogging syn queue with openreqs with exponentially increasing
++ * timeout.
++ */
++ if (sk_acceptq_is_full(sk) && inet_csk_reqsk_queue_young(sk) > 1) {
++ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENOVERFLOWS);
++ goto drop;
++ }
++
++ req = inet_reqsk_alloc(rsk_ops);
++ if (!req)
++ goto drop;
++
++ tcp_rsk(req)->af_specific = af_ops;
++
++ tcp_clear_options(&tmp_opt);
++ tmp_opt.mss_clamp = af_ops->mss_clamp;
++ tmp_opt.user_mss = tp->rx_opt.user_mss;
++ tcp_parse_options(skb, &tmp_opt, NULL, 0, want_cookie ? NULL : &foc);
++
++ if (want_cookie && !tmp_opt.saw_tstamp)
++ tcp_clear_options(&tmp_opt);
++
++ tmp_opt.tstamp_ok = tmp_opt.saw_tstamp;
++ tcp_openreq_init(req, &tmp_opt, skb);
++
++ if (af_ops->init_req(req, sk, skb))
++ goto drop_and_free;
++
++ if (security_inet_conn_request(sk, skb, req))
++ goto drop_and_free;
++
++ if (!want_cookie || tmp_opt.tstamp_ok)
++ TCP_ECN_create_request(req, skb, sock_net(sk));
++
++ if (want_cookie) {
++ isn = cookie_init_sequence(af_ops, sk, skb, &req->mss);
++ req->cookie_ts = tmp_opt.tstamp_ok;
++ } else if (!isn) {
++ /* VJ's idea. We save last timestamp seen
++ * from the destination in peer table, when entering
++ * state TIME-WAIT, and check against it before
++ * accepting new connection request.
++ *
++ * If "isn" is not zero, this request hit alive
++ * timewait bucket, so that all the necessary checks
++ * are made in the function processing timewait state.
++ */
++ if (tmp_opt.saw_tstamp && tcp_death_row.sysctl_tw_recycle) {
++ bool strict;
++
++ dst = af_ops->route_req(sk, &fl, req, &strict);
++ if (dst && strict &&
++ !tcp_peer_is_proven(req, dst, true)) {
++ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_PAWSPASSIVEREJECTED);
++ goto drop_and_release;
++ }
++ }
++ /* Kill the following clause, if you dislike this way. */
++ else if (!sysctl_tcp_syncookies &&
++ (sysctl_max_syn_backlog - inet_csk_reqsk_queue_len(sk) <
++ (sysctl_max_syn_backlog >> 2)) &&
++ !tcp_peer_is_proven(req, dst, false)) {
++ /* Without syncookies last quarter of
++ * backlog is filled with destinations,
++ * proven to be alive.
++ * It means that we continue to communicate
++ * to destinations, already remembered
++ * to the moment of synflood.
++ */
++ pr_drop_req(req, ntohs(tcp_hdr(skb)->source),
++ rsk_ops->family);
++ goto drop_and_release;
++ }
++
++ isn = af_ops->init_seq(skb);
++ }
++ if (!dst) {
++ dst = af_ops->route_req(sk, &fl, req, NULL);
++ if (!dst)
++ goto drop_and_free;
++ }
++
++ tcp_rsk(req)->snt_isn = isn;
++ tcp_openreq_init_rwin(req, sk, dst);
++ fastopen = !want_cookie &&
++ tcp_try_fastopen(sk, skb, req, &foc, dst);
++ err = af_ops->send_synack(sk, dst, &fl, req,
++ skb_get_queue_mapping(skb), &foc);
++ if (!fastopen) {
++ if (err || want_cookie)
++ goto drop_and_free;
++
++ tcp_rsk(req)->listener = NULL;
++ af_ops->queue_hash_add(sk, req, TCP_TIMEOUT_INIT);
++ }
++
++ return 0;
++
++drop_and_release:
++ dst_release(dst);
++drop_and_free:
++ reqsk_free(req);
++drop:
++ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENDROPS);
++ return 0;
++}
++EXPORT_SYMBOL(tcp_conn_request);
+diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
+index 77cccda1ad0c..c77017f600f1 100644
+--- a/net/ipv4/tcp_ipv4.c
++++ b/net/ipv4/tcp_ipv4.c
+@@ -67,6 +67,8 @@
+ #include <net/icmp.h>
+ #include <net/inet_hashtables.h>
+ #include <net/tcp.h>
++#include <net/mptcp.h>
++#include <net/mptcp_v4.h>
+ #include <net/transp_v6.h>
+ #include <net/ipv6.h>
+ #include <net/inet_common.h>
+@@ -99,7 +101,7 @@ static int tcp_v4_md5_hash_hdr(char *md5_hash, const struct tcp_md5sig_key *key,
+ struct inet_hashinfo tcp_hashinfo;
+ EXPORT_SYMBOL(tcp_hashinfo);
+
+-static inline __u32 tcp_v4_init_sequence(const struct sk_buff *skb)
++__u32 tcp_v4_init_sequence(const struct sk_buff *skb)
+ {
+ return secure_tcp_sequence_number(ip_hdr(skb)->daddr,
+ ip_hdr(skb)->saddr,
+@@ -334,7 +336,7 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
+ struct inet_sock *inet;
+ const int type = icmp_hdr(icmp_skb)->type;
+ const int code = icmp_hdr(icmp_skb)->code;
+- struct sock *sk;
++ struct sock *sk, *meta_sk;
+ struct sk_buff *skb;
+ struct request_sock *fastopen;
+ __u32 seq, snd_una;
+@@ -358,13 +360,19 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
+ return;
+ }
+
+- bh_lock_sock(sk);
++ tp = tcp_sk(sk);
++ if (mptcp(tp))
++ meta_sk = mptcp_meta_sk(sk);
++ else
++ meta_sk = sk;
++
++ bh_lock_sock(meta_sk);
+ /* If too many ICMPs get dropped on busy
+ * servers this needs to be solved differently.
+ * We do take care of PMTU discovery (RFC1191) special case :
+ * we can receive locally generated ICMP messages while socket is held.
+ */
+- if (sock_owned_by_user(sk)) {
++ if (sock_owned_by_user(meta_sk)) {
+ if (!(type == ICMP_DEST_UNREACH && code == ICMP_FRAG_NEEDED))
+ NET_INC_STATS_BH(net, LINUX_MIB_LOCKDROPPEDICMPS);
+ }
+@@ -377,7 +385,6 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
+ }
+
+ icsk = inet_csk(sk);
+- tp = tcp_sk(sk);
+ seq = ntohl(th->seq);
+ /* XXX (TFO) - tp->snd_una should be ISN (tcp_create_openreq_child() */
+ fastopen = tp->fastopen_rsk;
+@@ -411,11 +418,13 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
+ goto out;
+
+ tp->mtu_info = info;
+- if (!sock_owned_by_user(sk)) {
++ if (!sock_owned_by_user(meta_sk)) {
+ tcp_v4_mtu_reduced(sk);
+ } else {
+ if (!test_and_set_bit(TCP_MTU_REDUCED_DEFERRED, &tp->tsq_flags))
+ sock_hold(sk);
++ if (mptcp(tp))
++ mptcp_tsq_flags(sk);
+ }
+ goto out;
+ }
+@@ -429,7 +438,7 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
+ !icsk->icsk_backoff || fastopen)
+ break;
+
+- if (sock_owned_by_user(sk))
++ if (sock_owned_by_user(meta_sk))
+ break;
+
+ icsk->icsk_backoff--;
+@@ -463,7 +472,7 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
+ switch (sk->sk_state) {
+ struct request_sock *req, **prev;
+ case TCP_LISTEN:
+- if (sock_owned_by_user(sk))
++ if (sock_owned_by_user(meta_sk))
+ goto out;
+
+ req = inet_csk_search_req(sk, &prev, th->dest,
+@@ -499,7 +508,7 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
+ if (fastopen && fastopen->sk == NULL)
+ break;
+
+- if (!sock_owned_by_user(sk)) {
++ if (!sock_owned_by_user(meta_sk)) {
+ sk->sk_err = err;
+
+ sk->sk_error_report(sk);
+@@ -528,7 +537,7 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
+ */
+
+ inet = inet_sk(sk);
+- if (!sock_owned_by_user(sk) && inet->recverr) {
++ if (!sock_owned_by_user(meta_sk) && inet->recverr) {
+ sk->sk_err = err;
+ sk->sk_error_report(sk);
+ } else { /* Only an error on timeout */
+@@ -536,7 +545,7 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
+ }
+
+ out:
+- bh_unlock_sock(sk);
++ bh_unlock_sock(meta_sk);
+ sock_put(sk);
+ }
+
+@@ -578,7 +587,7 @@ EXPORT_SYMBOL(tcp_v4_send_check);
+ * Exception: precedence violation. We do not implement it in any case.
+ */
+
+-static void tcp_v4_send_reset(struct sock *sk, struct sk_buff *skb)
++void tcp_v4_send_reset(struct sock *sk, struct sk_buff *skb)
+ {
+ const struct tcphdr *th = tcp_hdr(skb);
+ struct {
+@@ -702,10 +711,10 @@ release_sk1:
+ outside socket context is ugly, certainly. What can I do?
+ */
+
+-static void tcp_v4_send_ack(struct sk_buff *skb, u32 seq, u32 ack,
++static void tcp_v4_send_ack(struct sk_buff *skb, u32 seq, u32 ack, u32 data_ack,
+ u32 win, u32 tsval, u32 tsecr, int oif,
+ struct tcp_md5sig_key *key,
+- int reply_flags, u8 tos)
++ int reply_flags, u8 tos, int mptcp)
+ {
+ const struct tcphdr *th = tcp_hdr(skb);
+ struct {
+@@ -714,6 +723,10 @@ static void tcp_v4_send_ack(struct sk_buff *skb, u32 seq, u32 ack,
+ #ifdef CONFIG_TCP_MD5SIG
+ + (TCPOLEN_MD5SIG_ALIGNED >> 2)
+ #endif
++#ifdef CONFIG_MPTCP
++ + ((MPTCP_SUB_LEN_DSS >> 2) +
++ (MPTCP_SUB_LEN_ACK >> 2))
++#endif
+ ];
+ } rep;
+ struct ip_reply_arg arg;
+@@ -758,6 +771,21 @@ static void tcp_v4_send_ack(struct sk_buff *skb, u32 seq, u32 ack,
+ ip_hdr(skb)->daddr, &rep.th);
+ }
+ #endif
++#ifdef CONFIG_MPTCP
++ if (mptcp) {
++ int offset = (tsecr) ? 3 : 0;
++ /* Construction of 32-bit data_ack */
++ rep.opt[offset++] = htonl((TCPOPT_MPTCP << 24) |
++ ((MPTCP_SUB_LEN_DSS + MPTCP_SUB_LEN_ACK) << 16) |
++ (0x20 << 8) |
++ (0x01));
++ rep.opt[offset] = htonl(data_ack);
++
++ arg.iov[0].iov_len += MPTCP_SUB_LEN_DSS + MPTCP_SUB_LEN_ACK;
++ rep.th.doff = arg.iov[0].iov_len / 4;
++ }
++#endif /* CONFIG_MPTCP */
++
+ arg.flags = reply_flags;
+ arg.csum = csum_tcpudp_nofold(ip_hdr(skb)->daddr,
+ ip_hdr(skb)->saddr, /* XXX */
+@@ -776,36 +804,44 @@ static void tcp_v4_timewait_ack(struct sock *sk, struct sk_buff *skb)
+ {
+ struct inet_timewait_sock *tw = inet_twsk(sk);
+ struct tcp_timewait_sock *tcptw = tcp_twsk(sk);
++ u32 data_ack = 0;
++ int mptcp = 0;
++
++ if (tcptw->mptcp_tw && tcptw->mptcp_tw->meta_tw) {
++ data_ack = (u32)tcptw->mptcp_tw->rcv_nxt;
++ mptcp = 1;
++ }
+
+ tcp_v4_send_ack(skb, tcptw->tw_snd_nxt, tcptw->tw_rcv_nxt,
++ data_ack,
+ tcptw->tw_rcv_wnd >> tw->tw_rcv_wscale,
+ tcp_time_stamp + tcptw->tw_ts_offset,
+ tcptw->tw_ts_recent,
+ tw->tw_bound_dev_if,
+ tcp_twsk_md5_key(tcptw),
+ tw->tw_transparent ? IP_REPLY_ARG_NOSRCCHECK : 0,
+- tw->tw_tos
++ tw->tw_tos, mptcp
+ );
+
+ inet_twsk_put(tw);
+ }
+
+-static void tcp_v4_reqsk_send_ack(struct sock *sk, struct sk_buff *skb,
+- struct request_sock *req)
++void tcp_v4_reqsk_send_ack(struct sock *sk, struct sk_buff *skb,
++ struct request_sock *req)
+ {
+ /* sk->sk_state == TCP_LISTEN -> for regular TCP_SYN_RECV
+ * sk->sk_state == TCP_SYN_RECV -> for Fast Open.
+ */
+ tcp_v4_send_ack(skb, (sk->sk_state == TCP_LISTEN) ?
+ tcp_rsk(req)->snt_isn + 1 : tcp_sk(sk)->snd_nxt,
+- tcp_rsk(req)->rcv_nxt, req->rcv_wnd,
++ tcp_rsk(req)->rcv_nxt, 0, req->rcv_wnd,
+ tcp_time_stamp,
+ req->ts_recent,
+ 0,
+ tcp_md5_do_lookup(sk, (union tcp_md5_addr *)&ip_hdr(skb)->daddr,
+ AF_INET),
+ inet_rsk(req)->no_srccheck ? IP_REPLY_ARG_NOSRCCHECK : 0,
+- ip_hdr(skb)->tos);
++ ip_hdr(skb)->tos, 0);
+ }
+
+ /*
+@@ -813,10 +849,11 @@ static void tcp_v4_reqsk_send_ack(struct sock *sk, struct sk_buff *skb,
+ * This still operates on a request_sock only, not on a big
+ * socket.
+ */
+-static int tcp_v4_send_synack(struct sock *sk, struct dst_entry *dst,
+- struct request_sock *req,
+- u16 queue_mapping,
+- struct tcp_fastopen_cookie *foc)
++int tcp_v4_send_synack(struct sock *sk, struct dst_entry *dst,
++ struct flowi *fl,
++ struct request_sock *req,
++ u16 queue_mapping,
++ struct tcp_fastopen_cookie *foc)
+ {
+ const struct inet_request_sock *ireq = inet_rsk(req);
+ struct flowi4 fl4;
+@@ -844,21 +881,10 @@ static int tcp_v4_send_synack(struct sock *sk, struct dst_entry *dst,
+ return err;
+ }
+
+-static int tcp_v4_rtx_synack(struct sock *sk, struct request_sock *req)
+-{
+- int res = tcp_v4_send_synack(sk, NULL, req, 0, NULL);
+-
+- if (!res) {
+- TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_RETRANSSEGS);
+- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPSYNRETRANS);
+- }
+- return res;
+-}
+-
+ /*
+ * IPv4 request_sock destructor.
+ */
+-static void tcp_v4_reqsk_destructor(struct request_sock *req)
++void tcp_v4_reqsk_destructor(struct request_sock *req)
+ {
+ kfree(inet_rsk(req)->opt);
+ }
+@@ -896,7 +922,7 @@ EXPORT_SYMBOL(tcp_syn_flood_action);
+ /*
+ * Save and compile IPv4 options into the request_sock if needed.
+ */
+-static struct ip_options_rcu *tcp_v4_save_options(struct sk_buff *skb)
++struct ip_options_rcu *tcp_v4_save_options(struct sk_buff *skb)
+ {
+ const struct ip_options *opt = &(IPCB(skb)->opt);
+ struct ip_options_rcu *dopt = NULL;
+@@ -1237,161 +1263,71 @@ static bool tcp_v4_inbound_md5_hash(struct sock *sk, const struct sk_buff *skb)
+
+ #endif
+
++static int tcp_v4_init_req(struct request_sock *req, struct sock *sk,
++ struct sk_buff *skb)
++{
++ struct inet_request_sock *ireq = inet_rsk(req);
++
++ ireq->ir_loc_addr = ip_hdr(skb)->daddr;
++ ireq->ir_rmt_addr = ip_hdr(skb)->saddr;
++ ireq->no_srccheck = inet_sk(sk)->transparent;
++ ireq->opt = tcp_v4_save_options(skb);
++ ireq->ir_mark = inet_request_mark(sk, skb);
++
++ return 0;
++}
++
++static struct dst_entry *tcp_v4_route_req(struct sock *sk, struct flowi *fl,
++ const struct request_sock *req,
++ bool *strict)
++{
++ struct dst_entry *dst = inet_csk_route_req(sk, &fl->u.ip4, req);
++
++ if (strict) {
++ if (fl->u.ip4.daddr == inet_rsk(req)->ir_rmt_addr)
++ *strict = true;
++ else
++ *strict = false;
++ }
++
++ return dst;
++}
++
+ struct request_sock_ops tcp_request_sock_ops __read_mostly = {
+ .family = PF_INET,
+ .obj_size = sizeof(struct tcp_request_sock),
+- .rtx_syn_ack = tcp_v4_rtx_synack,
++ .rtx_syn_ack = tcp_rtx_synack,
+ .send_ack = tcp_v4_reqsk_send_ack,
+ .destructor = tcp_v4_reqsk_destructor,
+ .send_reset = tcp_v4_send_reset,
+ .syn_ack_timeout = tcp_syn_ack_timeout,
+ };
+
++const struct tcp_request_sock_ops tcp_request_sock_ipv4_ops = {
++ .mss_clamp = TCP_MSS_DEFAULT,
+ #ifdef CONFIG_TCP_MD5SIG
+-static const struct tcp_request_sock_ops tcp_request_sock_ipv4_ops = {
+ .md5_lookup = tcp_v4_reqsk_md5_lookup,
+ .calc_md5_hash = tcp_v4_md5_hash_skb,
+-};
+ #endif
++ .init_req = tcp_v4_init_req,
++#ifdef CONFIG_SYN_COOKIES
++ .cookie_init_seq = cookie_v4_init_sequence,
++#endif
++ .route_req = tcp_v4_route_req,
++ .init_seq = tcp_v4_init_sequence,
++ .send_synack = tcp_v4_send_synack,
++ .queue_hash_add = inet_csk_reqsk_queue_hash_add,
++};
+
+ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
+ {
+- struct tcp_options_received tmp_opt;
+- struct request_sock *req;
+- struct inet_request_sock *ireq;
+- struct tcp_sock *tp = tcp_sk(sk);
+- struct dst_entry *dst = NULL;
+- __be32 saddr = ip_hdr(skb)->saddr;
+- __be32 daddr = ip_hdr(skb)->daddr;
+- __u32 isn = TCP_SKB_CB(skb)->when;
+- bool want_cookie = false, fastopen;
+- struct flowi4 fl4;
+- struct tcp_fastopen_cookie foc = { .len = -1 };
+- int err;
+-
+ /* Never answer to SYNs send to broadcast or multicast */
+ if (skb_rtable(skb)->rt_flags & (RTCF_BROADCAST | RTCF_MULTICAST))
+ goto drop;
+
+- /* TW buckets are converted to open requests without
+- * limitations, they conserve resources and peer is
+- * evidently real one.
+- */
+- if ((sysctl_tcp_syncookies == 2 ||
+- inet_csk_reqsk_queue_is_full(sk)) && !isn) {
+- want_cookie = tcp_syn_flood_action(sk, skb, "TCP");
+- if (!want_cookie)
+- goto drop;
+- }
+-
+- /* Accept backlog is full. If we have already queued enough
+- * of warm entries in syn queue, drop request. It is better than
+- * clogging syn queue with openreqs with exponentially increasing
+- * timeout.
+- */
+- if (sk_acceptq_is_full(sk) && inet_csk_reqsk_queue_young(sk) > 1) {
+- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENOVERFLOWS);
+- goto drop;
+- }
+-
+- req = inet_reqsk_alloc(&tcp_request_sock_ops);
+- if (!req)
+- goto drop;
+-
+-#ifdef CONFIG_TCP_MD5SIG
+- tcp_rsk(req)->af_specific = &tcp_request_sock_ipv4_ops;
+-#endif
+-
+- tcp_clear_options(&tmp_opt);
+- tmp_opt.mss_clamp = TCP_MSS_DEFAULT;
+- tmp_opt.user_mss = tp->rx_opt.user_mss;
+- tcp_parse_options(skb, &tmp_opt, 0, want_cookie ? NULL : &foc);
+-
+- if (want_cookie && !tmp_opt.saw_tstamp)
+- tcp_clear_options(&tmp_opt);
+-
+- tmp_opt.tstamp_ok = tmp_opt.saw_tstamp;
+- tcp_openreq_init(req, &tmp_opt, skb);
++ return tcp_conn_request(&tcp_request_sock_ops,
++ &tcp_request_sock_ipv4_ops, sk, skb);
+
+- ireq = inet_rsk(req);
+- ireq->ir_loc_addr = daddr;
+- ireq->ir_rmt_addr = saddr;
+- ireq->no_srccheck = inet_sk(sk)->transparent;
+- ireq->opt = tcp_v4_save_options(skb);
+- ireq->ir_mark = inet_request_mark(sk, skb);
+-
+- if (security_inet_conn_request(sk, skb, req))
+- goto drop_and_free;
+-
+- if (!want_cookie || tmp_opt.tstamp_ok)
+- TCP_ECN_create_request(req, skb, sock_net(sk));
+-
+- if (want_cookie) {
+- isn = cookie_v4_init_sequence(sk, skb, &req->mss);
+- req->cookie_ts = tmp_opt.tstamp_ok;
+- } else if (!isn) {
+- /* VJ's idea. We save last timestamp seen
+- * from the destination in peer table, when entering
+- * state TIME-WAIT, and check against it before
+- * accepting new connection request.
+- *
+- * If "isn" is not zero, this request hit alive
+- * timewait bucket, so that all the necessary checks
+- * are made in the function processing timewait state.
+- */
+- if (tmp_opt.saw_tstamp &&
+- tcp_death_row.sysctl_tw_recycle &&
+- (dst = inet_csk_route_req(sk, &fl4, req)) != NULL &&
+- fl4.daddr == saddr) {
+- if (!tcp_peer_is_proven(req, dst, true)) {
+- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_PAWSPASSIVEREJECTED);
+- goto drop_and_release;
+- }
+- }
+- /* Kill the following clause, if you dislike this way. */
+- else if (!sysctl_tcp_syncookies &&
+- (sysctl_max_syn_backlog - inet_csk_reqsk_queue_len(sk) <
+- (sysctl_max_syn_backlog >> 2)) &&
+- !tcp_peer_is_proven(req, dst, false)) {
+- /* Without syncookies last quarter of
+- * backlog is filled with destinations,
+- * proven to be alive.
+- * It means that we continue to communicate
+- * to destinations, already remembered
+- * to the moment of synflood.
+- */
+- LIMIT_NETDEBUG(KERN_DEBUG pr_fmt("drop open request from %pI4/%u\n"),
+- &saddr, ntohs(tcp_hdr(skb)->source));
+- goto drop_and_release;
+- }
+-
+- isn = tcp_v4_init_sequence(skb);
+- }
+- if (!dst && (dst = inet_csk_route_req(sk, &fl4, req)) == NULL)
+- goto drop_and_free;
+-
+- tcp_rsk(req)->snt_isn = isn;
+- tcp_rsk(req)->snt_synack = tcp_time_stamp;
+- tcp_openreq_init_rwin(req, sk, dst);
+- fastopen = !want_cookie &&
+- tcp_try_fastopen(sk, skb, req, &foc, dst);
+- err = tcp_v4_send_synack(sk, dst, req,
+- skb_get_queue_mapping(skb), &foc);
+- if (!fastopen) {
+- if (err || want_cookie)
+- goto drop_and_free;
+-
+- tcp_rsk(req)->snt_synack = tcp_time_stamp;
+- tcp_rsk(req)->listener = NULL;
+- inet_csk_reqsk_queue_hash_add(sk, req, TCP_TIMEOUT_INIT);
+- }
+-
+- return 0;
+-
+-drop_and_release:
+- dst_release(dst);
+-drop_and_free:
+- reqsk_free(req);
+ drop:
+ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENDROPS);
+ return 0;
+@@ -1497,7 +1433,7 @@ put_and_exit:
+ }
+ EXPORT_SYMBOL(tcp_v4_syn_recv_sock);
+
+-static struct sock *tcp_v4_hnd_req(struct sock *sk, struct sk_buff *skb)
++struct sock *tcp_v4_hnd_req(struct sock *sk, struct sk_buff *skb)
+ {
+ struct tcphdr *th = tcp_hdr(skb);
+ const struct iphdr *iph = ip_hdr(skb);
+@@ -1514,8 +1450,15 @@ static struct sock *tcp_v4_hnd_req(struct sock *sk, struct sk_buff *skb)
+
+ if (nsk) {
+ if (nsk->sk_state != TCP_TIME_WAIT) {
++ /* Don't lock again the meta-sk. It has been locked
++ * before mptcp_v4_do_rcv.
++ */
++ if (mptcp(tcp_sk(nsk)) && !is_meta_sk(sk))
++ bh_lock_sock(mptcp_meta_sk(nsk));
+ bh_lock_sock(nsk);
++
+ return nsk;
++
+ }
+ inet_twsk_put(inet_twsk(nsk));
+ return NULL;
+@@ -1550,6 +1493,9 @@ int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
+ goto discard;
+ #endif
+
++ if (is_meta_sk(sk))
++ return mptcp_v4_do_rcv(sk, skb);
++
+ if (sk->sk_state == TCP_ESTABLISHED) { /* Fast path */
+ struct dst_entry *dst = sk->sk_rx_dst;
+
+@@ -1681,7 +1627,7 @@ bool tcp_prequeue(struct sock *sk, struct sk_buff *skb)
+ } else if (skb_queue_len(&tp->ucopy.prequeue) == 1) {
+ wake_up_interruptible_sync_poll(sk_sleep(sk),
+ POLLIN | POLLRDNORM | POLLRDBAND);
+- if (!inet_csk_ack_scheduled(sk))
++ if (!inet_csk_ack_scheduled(sk) && !mptcp(tp))
+ inet_csk_reset_xmit_timer(sk, ICSK_TIME_DACK,
+ (3 * tcp_rto_min(sk)) / 4,
+ TCP_RTO_MAX);
+@@ -1698,7 +1644,7 @@ int tcp_v4_rcv(struct sk_buff *skb)
+ {
+ const struct iphdr *iph;
+ const struct tcphdr *th;
+- struct sock *sk;
++ struct sock *sk, *meta_sk = NULL;
+ int ret;
+ struct net *net = dev_net(skb->dev);
+
+@@ -1732,18 +1678,42 @@ int tcp_v4_rcv(struct sk_buff *skb)
+ TCP_SKB_CB(skb)->end_seq = (TCP_SKB_CB(skb)->seq + th->syn + th->fin +
+ skb->len - th->doff * 4);
+ TCP_SKB_CB(skb)->ack_seq = ntohl(th->ack_seq);
++#ifdef CONFIG_MPTCP
++ TCP_SKB_CB(skb)->mptcp_flags = 0;
++ TCP_SKB_CB(skb)->dss_off = 0;
++#endif
+ TCP_SKB_CB(skb)->when = 0;
+ TCP_SKB_CB(skb)->ip_dsfield = ipv4_get_dsfield(iph);
+ TCP_SKB_CB(skb)->sacked = 0;
+
+ sk = __inet_lookup_skb(&tcp_hashinfo, skb, th->source, th->dest);
+- if (!sk)
+- goto no_tcp_socket;
+
+ process:
+- if (sk->sk_state == TCP_TIME_WAIT)
++ if (sk && sk->sk_state == TCP_TIME_WAIT)
+ goto do_time_wait;
+
++#ifdef CONFIG_MPTCP
++ if (!sk && th->syn && !th->ack) {
++ int ret = mptcp_lookup_join(skb, NULL);
++
++ if (ret < 0) {
++ tcp_v4_send_reset(NULL, skb);
++ goto discard_it;
++ } else if (ret > 0) {
++ return 0;
++ }
++ }
++
++ /* Is there a pending request sock for this segment ? */
++ if ((!sk || sk->sk_state == TCP_LISTEN) && mptcp_check_req(skb, net)) {
++ if (sk)
++ sock_put(sk);
++ return 0;
++ }
++#endif
++ if (!sk)
++ goto no_tcp_socket;
++
+ if (unlikely(iph->ttl < inet_sk(sk)->min_ttl)) {
+ NET_INC_STATS_BH(net, LINUX_MIB_TCPMINTTLDROP);
+ goto discard_and_relse;
+@@ -1759,11 +1729,21 @@ process:
+ sk_mark_napi_id(sk, skb);
+ skb->dev = NULL;
+
+- bh_lock_sock_nested(sk);
++ if (mptcp(tcp_sk(sk))) {
++ meta_sk = mptcp_meta_sk(sk);
++
++ bh_lock_sock_nested(meta_sk);
++ if (sock_owned_by_user(meta_sk))
++ skb->sk = sk;
++ } else {
++ meta_sk = sk;
++ bh_lock_sock_nested(sk);
++ }
++
+ ret = 0;
+- if (!sock_owned_by_user(sk)) {
++ if (!sock_owned_by_user(meta_sk)) {
+ #ifdef CONFIG_NET_DMA
+- struct tcp_sock *tp = tcp_sk(sk);
++ struct tcp_sock *tp = tcp_sk(meta_sk);
+ if (!tp->ucopy.dma_chan && tp->ucopy.pinned_list)
+ tp->ucopy.dma_chan = net_dma_find_channel();
+ if (tp->ucopy.dma_chan)
+@@ -1771,16 +1751,16 @@ process:
+ else
+ #endif
+ {
+- if (!tcp_prequeue(sk, skb))
++ if (!tcp_prequeue(meta_sk, skb))
+ ret = tcp_v4_do_rcv(sk, skb);
+ }
+- } else if (unlikely(sk_add_backlog(sk, skb,
+- sk->sk_rcvbuf + sk->sk_sndbuf))) {
+- bh_unlock_sock(sk);
++ } else if (unlikely(sk_add_backlog(meta_sk, skb,
++ meta_sk->sk_rcvbuf + meta_sk->sk_sndbuf))) {
++ bh_unlock_sock(meta_sk);
+ NET_INC_STATS_BH(net, LINUX_MIB_TCPBACKLOGDROP);
+ goto discard_and_relse;
+ }
+- bh_unlock_sock(sk);
++ bh_unlock_sock(meta_sk);
+
+ sock_put(sk);
+
+@@ -1835,6 +1815,18 @@ do_time_wait:
+ sk = sk2;
+ goto process;
+ }
++#ifdef CONFIG_MPTCP
++ if (th->syn && !th->ack) {
++ int ret = mptcp_lookup_join(skb, inet_twsk(sk));
++
++ if (ret < 0) {
++ tcp_v4_send_reset(NULL, skb);
++ goto discard_it;
++ } else if (ret > 0) {
++ return 0;
++ }
++ }
++#endif
+ /* Fall through to ACK */
+ }
+ case TCP_TW_ACK:
+@@ -1900,7 +1892,12 @@ static int tcp_v4_init_sock(struct sock *sk)
+
+ tcp_init_sock(sk);
+
+- icsk->icsk_af_ops = &ipv4_specific;
++#ifdef CONFIG_MPTCP
++ if (is_mptcp_enabled(sk))
++ icsk->icsk_af_ops = &mptcp_v4_specific;
++ else
++#endif
++ icsk->icsk_af_ops = &ipv4_specific;
+
+ #ifdef CONFIG_TCP_MD5SIG
+ tcp_sk(sk)->af_specific = &tcp_sock_ipv4_specific;
+@@ -1917,6 +1914,11 @@ void tcp_v4_destroy_sock(struct sock *sk)
+
+ tcp_cleanup_congestion_control(sk);
+
++ if (mptcp(tp))
++ mptcp_destroy_sock(sk);
++ if (tp->inside_tk_table)
++ mptcp_hash_remove(tp);
++
+ /* Cleanup up the write buffer. */
+ tcp_write_queue_purge(sk);
+
+@@ -2481,6 +2483,19 @@ void tcp4_proc_exit(void)
+ }
+ #endif /* CONFIG_PROC_FS */
+
++#ifdef CONFIG_MPTCP
++static void tcp_v4_clear_sk(struct sock *sk, int size)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++
++ /* we do not want to clear tk_table field, because of RCU lookups */
++ sk_prot_clear_nulls(sk, offsetof(struct tcp_sock, tk_table));
++
++ size -= offsetof(struct tcp_sock, tk_table) + sizeof(tp->tk_table);
++ memset((char *)&tp->tk_table + sizeof(tp->tk_table), 0, size);
++}
++#endif
++
+ struct proto tcp_prot = {
+ .name = "TCP",
+ .owner = THIS_MODULE,
+@@ -2528,6 +2543,9 @@ struct proto tcp_prot = {
+ .destroy_cgroup = tcp_destroy_cgroup,
+ .proto_cgroup = tcp_proto_cgroup,
+ #endif
++#ifdef CONFIG_MPTCP
++ .clear_sk = tcp_v4_clear_sk,
++#endif
+ };
+ EXPORT_SYMBOL(tcp_prot);
+
+diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
+index e68e0d4af6c9..ae6946857dff 100644
+--- a/net/ipv4/tcp_minisocks.c
++++ b/net/ipv4/tcp_minisocks.c
+@@ -18,11 +18,13 @@
+ * Jorge Cwik, <jorge@laser.satlink.net>
+ */
+
++#include <linux/kconfig.h>
+ #include <linux/mm.h>
+ #include <linux/module.h>
+ #include <linux/slab.h>
+ #include <linux/sysctl.h>
+ #include <linux/workqueue.h>
++#include <net/mptcp.h>
+ #include <net/tcp.h>
+ #include <net/inet_common.h>
+ #include <net/xfrm.h>
+@@ -95,10 +97,13 @@ tcp_timewait_state_process(struct inet_timewait_sock *tw, struct sk_buff *skb,
+ struct tcp_options_received tmp_opt;
+ struct tcp_timewait_sock *tcptw = tcp_twsk((struct sock *)tw);
+ bool paws_reject = false;
++ struct mptcp_options_received mopt;
+
+ tmp_opt.saw_tstamp = 0;
+ if (th->doff > (sizeof(*th) >> 2) && tcptw->tw_ts_recent_stamp) {
+- tcp_parse_options(skb, &tmp_opt, 0, NULL);
++ mptcp_init_mp_opt(&mopt);
++
++ tcp_parse_options(skb, &tmp_opt, &mopt, 0, NULL);
+
+ if (tmp_opt.saw_tstamp) {
+ tmp_opt.rcv_tsecr -= tcptw->tw_ts_offset;
+@@ -106,6 +111,11 @@ tcp_timewait_state_process(struct inet_timewait_sock *tw, struct sk_buff *skb,
+ tmp_opt.ts_recent_stamp = tcptw->tw_ts_recent_stamp;
+ paws_reject = tcp_paws_reject(&tmp_opt, th->rst);
+ }
++
++ if (unlikely(mopt.mp_fclose) && tcptw->mptcp_tw) {
++ if (mopt.mptcp_key == tcptw->mptcp_tw->loc_key)
++ goto kill_with_rst;
++ }
+ }
+
+ if (tw->tw_substate == TCP_FIN_WAIT2) {
+@@ -128,6 +138,16 @@ tcp_timewait_state_process(struct inet_timewait_sock *tw, struct sk_buff *skb,
+ if (!th->ack ||
+ !after(TCP_SKB_CB(skb)->end_seq, tcptw->tw_rcv_nxt) ||
+ TCP_SKB_CB(skb)->end_seq == TCP_SKB_CB(skb)->seq) {
++ /* If mptcp_is_data_fin() returns true, we are sure that
++ * mopt has been initialized - otherwise it would not
++ * be a DATA_FIN.
++ */
++ if (tcptw->mptcp_tw && tcptw->mptcp_tw->meta_tw &&
++ mptcp_is_data_fin(skb) &&
++ TCP_SKB_CB(skb)->seq == tcptw->tw_rcv_nxt &&
++ mopt.data_seq + 1 == (u32)tcptw->mptcp_tw->rcv_nxt)
++ return TCP_TW_ACK;
++
+ inet_twsk_put(tw);
+ return TCP_TW_SUCCESS;
+ }
+@@ -290,6 +310,15 @@ void tcp_time_wait(struct sock *sk, int state, int timeo)
+ tcptw->tw_ts_recent_stamp = tp->rx_opt.ts_recent_stamp;
+ tcptw->tw_ts_offset = tp->tsoffset;
+
++ if (mptcp(tp)) {
++ if (mptcp_init_tw_sock(sk, tcptw)) {
++ inet_twsk_free(tw);
++ goto exit;
++ }
++ } else {
++ tcptw->mptcp_tw = NULL;
++ }
++
+ #if IS_ENABLED(CONFIG_IPV6)
+ if (tw->tw_family == PF_INET6) {
+ struct ipv6_pinfo *np = inet6_sk(sk);
+@@ -347,15 +376,18 @@ void tcp_time_wait(struct sock *sk, int state, int timeo)
+ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPTIMEWAITOVERFLOW);
+ }
+
++exit:
+ tcp_update_metrics(sk);
+ tcp_done(sk);
+ }
+
+ void tcp_twsk_destructor(struct sock *sk)
+ {
+-#ifdef CONFIG_TCP_MD5SIG
+ struct tcp_timewait_sock *twsk = tcp_twsk(sk);
+
++ if (twsk->mptcp_tw)
++ mptcp_twsk_destructor(twsk);
++#ifdef CONFIG_TCP_MD5SIG
+ if (twsk->tw_md5_key)
+ kfree_rcu(twsk->tw_md5_key, rcu);
+ #endif
+@@ -382,13 +414,14 @@ void tcp_openreq_init_rwin(struct request_sock *req,
+ req->window_clamp = tcp_full_space(sk);
+
+ /* tcp_full_space because it is guaranteed to be the first packet */
+- tcp_select_initial_window(tcp_full_space(sk),
+- mss - (ireq->tstamp_ok ? TCPOLEN_TSTAMP_ALIGNED : 0),
++ tp->ops->select_initial_window(tcp_full_space(sk),
++ mss - (ireq->tstamp_ok ? TCPOLEN_TSTAMP_ALIGNED : 0) -
++ (ireq->saw_mpc ? MPTCP_SUB_LEN_DSM_ALIGN : 0),
+ &req->rcv_wnd,
+ &req->window_clamp,
+ ireq->wscale_ok,
+ &rcv_wscale,
+- dst_metric(dst, RTAX_INITRWND));
++ dst_metric(dst, RTAX_INITRWND), sk);
+ ireq->rcv_wscale = rcv_wscale;
+ }
+ EXPORT_SYMBOL(tcp_openreq_init_rwin);
+@@ -499,6 +532,8 @@ struct sock *tcp_create_openreq_child(struct sock *sk, struct request_sock *req,
+ newtp->rx_opt.ts_recent_stamp = 0;
+ newtp->tcp_header_len = sizeof(struct tcphdr);
+ }
++ if (ireq->saw_mpc)
++ newtp->tcp_header_len += MPTCP_SUB_LEN_DSM_ALIGN;
+ newtp->tsoffset = 0;
+ #ifdef CONFIG_TCP_MD5SIG
+ newtp->md5sig_info = NULL; /*XXX*/
+@@ -535,16 +570,20 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
+ bool fastopen)
+ {
+ struct tcp_options_received tmp_opt;
++ struct mptcp_options_received mopt;
+ struct sock *child;
+ const struct tcphdr *th = tcp_hdr(skb);
+ __be32 flg = tcp_flag_word(th) & (TCP_FLAG_RST|TCP_FLAG_SYN|TCP_FLAG_ACK);
+ bool paws_reject = false;
+
+- BUG_ON(fastopen == (sk->sk_state == TCP_LISTEN));
++ BUG_ON(!mptcp(tcp_sk(sk)) && fastopen == (sk->sk_state == TCP_LISTEN));
+
+ tmp_opt.saw_tstamp = 0;
++
++ mptcp_init_mp_opt(&mopt);
++
+ if (th->doff > (sizeof(struct tcphdr)>>2)) {
+- tcp_parse_options(skb, &tmp_opt, 0, NULL);
++ tcp_parse_options(skb, &tmp_opt, &mopt, 0, NULL);
+
+ if (tmp_opt.saw_tstamp) {
+ tmp_opt.ts_recent = req->ts_recent;
+@@ -583,7 +622,14 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
+ *
+ * Reset timer after retransmitting SYNACK, similar to
+ * the idea of fast retransmit in recovery.
++ *
++ * Fall back to TCP if MP_CAPABLE is not set.
+ */
++
++ if (inet_rsk(req)->saw_mpc && !mopt.saw_mpc)
++ inet_rsk(req)->saw_mpc = false;
++
++
+ if (!inet_rtx_syn_ack(sk, req))
+ req->expires = min(TCP_TIMEOUT_INIT << req->num_timeout,
+ TCP_RTO_MAX) + jiffies;
+@@ -718,9 +764,21 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
+ * socket is created, wait for troubles.
+ */
+ child = inet_csk(sk)->icsk_af_ops->syn_recv_sock(sk, skb, req, NULL);
++
+ if (child == NULL)
+ goto listen_overflow;
+
++ if (!is_meta_sk(sk)) {
++ int ret = mptcp_check_req_master(sk, child, req, prev);
++ if (ret < 0)
++ goto listen_overflow;
++
++ /* MPTCP-supported */
++ if (!ret)
++ return tcp_sk(child)->mpcb->master_sk;
++ } else {
++ return mptcp_check_req_child(sk, child, req, prev, &mopt);
++ }
+ inet_csk_reqsk_queue_unlink(sk, req, prev);
+ inet_csk_reqsk_queue_removed(sk, req);
+
+@@ -746,7 +804,17 @@ embryonic_reset:
+ tcp_reset(sk);
+ }
+ if (!fastopen) {
+- inet_csk_reqsk_queue_drop(sk, req, prev);
++ if (is_meta_sk(sk)) {
++ /* We want to avoid stopping the keepalive-timer and so
++ * avoid ending up in inet_csk_reqsk_queue_removed ...
++ */
++ inet_csk_reqsk_queue_unlink(sk, req, prev);
++ if (reqsk_queue_removed(&inet_csk(sk)->icsk_accept_queue, req) == 0)
++ mptcp_delete_synack_timer(sk);
++ reqsk_free(req);
++ } else {
++ inet_csk_reqsk_queue_drop(sk, req, prev);
++ }
+ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_EMBRYONICRSTS);
+ }
+ return NULL;
+@@ -770,8 +838,9 @@ int tcp_child_process(struct sock *parent, struct sock *child,
+ {
+ int ret = 0;
+ int state = child->sk_state;
++ struct sock *meta_sk = mptcp(tcp_sk(child)) ? mptcp_meta_sk(child) : child;
+
+- if (!sock_owned_by_user(child)) {
++ if (!sock_owned_by_user(meta_sk)) {
+ ret = tcp_rcv_state_process(child, skb, tcp_hdr(skb),
+ skb->len);
+ /* Wakeup parent, send SIGIO */
+@@ -782,10 +851,14 @@ int tcp_child_process(struct sock *parent, struct sock *child,
+ * in main socket hash table and lock on listening
+ * socket does not protect us more.
+ */
+- __sk_add_backlog(child, skb);
++ if (mptcp(tcp_sk(child)))
++ skb->sk = child;
++ __sk_add_backlog(meta_sk, skb);
+ }
+
+- bh_unlock_sock(child);
++ if (mptcp(tcp_sk(child)))
++ bh_unlock_sock(child);
++ bh_unlock_sock(meta_sk);
+ sock_put(child);
+ return ret;
+ }
+diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
+index 179b51e6bda3..efd31b6c5784 100644
+--- a/net/ipv4/tcp_output.c
++++ b/net/ipv4/tcp_output.c
+@@ -36,6 +36,12 @@
+
+ #define pr_fmt(fmt) "TCP: " fmt
+
++#include <net/mptcp.h>
++#include <net/mptcp_v4.h>
++#if IS_ENABLED(CONFIG_IPV6)
++#include <net/mptcp_v6.h>
++#endif
++#include <net/ipv6.h>
+ #include <net/tcp.h>
+
+ #include <linux/compiler.h>
+@@ -68,11 +74,8 @@ int sysctl_tcp_slow_start_after_idle __read_mostly = 1;
+ unsigned int sysctl_tcp_notsent_lowat __read_mostly = UINT_MAX;
+ EXPORT_SYMBOL(sysctl_tcp_notsent_lowat);
+
+-static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
+- int push_one, gfp_t gfp);
+-
+ /* Account for new data that has been sent to the network. */
+-static void tcp_event_new_data_sent(struct sock *sk, const struct sk_buff *skb)
++void tcp_event_new_data_sent(struct sock *sk, const struct sk_buff *skb)
+ {
+ struct inet_connection_sock *icsk = inet_csk(sk);
+ struct tcp_sock *tp = tcp_sk(sk);
+@@ -214,7 +217,7 @@ u32 tcp_default_init_rwnd(u32 mss)
+ void tcp_select_initial_window(int __space, __u32 mss,
+ __u32 *rcv_wnd, __u32 *window_clamp,
+ int wscale_ok, __u8 *rcv_wscale,
+- __u32 init_rcv_wnd)
++ __u32 init_rcv_wnd, const struct sock *sk)
+ {
+ unsigned int space = (__space < 0 ? 0 : __space);
+
+@@ -269,12 +272,16 @@ EXPORT_SYMBOL(tcp_select_initial_window);
+ * value can be stuffed directly into th->window for an outgoing
+ * frame.
+ */
+-static u16 tcp_select_window(struct sock *sk)
++u16 tcp_select_window(struct sock *sk)
+ {
+ struct tcp_sock *tp = tcp_sk(sk);
+ u32 old_win = tp->rcv_wnd;
+- u32 cur_win = tcp_receive_window(tp);
+- u32 new_win = __tcp_select_window(sk);
++ /* The window must never shrink at the meta-level. At the subflow level we
++ * have to allow this. Otherwise we may announce a window too large
++ * for the current meta-level sk_rcvbuf.
++ */
++ u32 cur_win = tcp_receive_window(mptcp(tp) ? tcp_sk(mptcp_meta_sk(sk)) : tp);
++ u32 new_win = tp->ops->__select_window(sk);
+
+ /* Never shrink the offered window */
+ if (new_win < cur_win) {
+@@ -290,6 +297,7 @@ static u16 tcp_select_window(struct sock *sk)
+ LINUX_MIB_TCPWANTZEROWINDOWADV);
+ new_win = ALIGN(cur_win, 1 << tp->rx_opt.rcv_wscale);
+ }
++
+ tp->rcv_wnd = new_win;
+ tp->rcv_wup = tp->rcv_nxt;
+
+@@ -374,7 +382,7 @@ static inline void TCP_ECN_send(struct sock *sk, struct sk_buff *skb,
+ /* Constructs common control bits of non-data skb. If SYN/FIN is present,
+ * auto increment end seqno.
+ */
+-static void tcp_init_nondata_skb(struct sk_buff *skb, u32 seq, u8 flags)
++void tcp_init_nondata_skb(struct sk_buff *skb, u32 seq, u8 flags)
+ {
+ struct skb_shared_info *shinfo = skb_shinfo(skb);
+
+@@ -394,7 +402,7 @@ static void tcp_init_nondata_skb(struct sk_buff *skb, u32 seq, u8 flags)
+ TCP_SKB_CB(skb)->end_seq = seq;
+ }
+
+-static inline bool tcp_urg_mode(const struct tcp_sock *tp)
++bool tcp_urg_mode(const struct tcp_sock *tp)
+ {
+ return tp->snd_una != tp->snd_up;
+ }
+@@ -404,17 +412,7 @@ static inline bool tcp_urg_mode(const struct tcp_sock *tp)
+ #define OPTION_MD5 (1 << 2)
+ #define OPTION_WSCALE (1 << 3)
+ #define OPTION_FAST_OPEN_COOKIE (1 << 8)
+-
+-struct tcp_out_options {
+- u16 options; /* bit field of OPTION_* */
+- u16 mss; /* 0 to disable */
+- u8 ws; /* window scale, 0 to disable */
+- u8 num_sack_blocks; /* number of SACK blocks to include */
+- u8 hash_size; /* bytes in hash_location */
+- __u8 *hash_location; /* temporary pointer, overloaded */
+- __u32 tsval, tsecr; /* need to include OPTION_TS */
+- struct tcp_fastopen_cookie *fastopen_cookie; /* Fast open cookie */
+-};
++/* Before adding here - take a look at OPTION_MPTCP in include/net/mptcp.h */
+
+ /* Write previously computed TCP options to the packet.
+ *
+@@ -430,7 +428,7 @@ struct tcp_out_options {
+ * (but it may well be that other scenarios fail similarly).
+ */
+ static void tcp_options_write(__be32 *ptr, struct tcp_sock *tp,
+- struct tcp_out_options *opts)
++ struct tcp_out_options *opts, struct sk_buff *skb)
+ {
+ u16 options = opts->options; /* mungable copy */
+
+@@ -513,6 +511,9 @@ static void tcp_options_write(__be32 *ptr, struct tcp_sock *tp,
+ }
+ ptr += (foc->len + 3) >> 2;
+ }
++
++ if (unlikely(OPTION_MPTCP & opts->options))
++ mptcp_options_write(ptr, tp, opts, skb);
+ }
+
+ /* Compute TCP options for SYN packets. This is not the final
+@@ -564,6 +565,8 @@ static unsigned int tcp_syn_options(struct sock *sk, struct sk_buff *skb,
+ if (unlikely(!(OPTION_TS & opts->options)))
+ remaining -= TCPOLEN_SACKPERM_ALIGNED;
+ }
++ if (tp->request_mptcp || mptcp(tp))
++ mptcp_syn_options(sk, opts, &remaining);
+
+ if (fastopen && fastopen->cookie.len >= 0) {
+ u32 need = TCPOLEN_EXP_FASTOPEN_BASE + fastopen->cookie.len;
+@@ -637,6 +640,9 @@ static unsigned int tcp_synack_options(struct sock *sk,
+ }
+ }
+
++ if (ireq->saw_mpc)
++ mptcp_synack_options(req, opts, &remaining);
++
+ return MAX_TCP_OPTION_SPACE - remaining;
+ }
+
+@@ -670,16 +676,22 @@ static unsigned int tcp_established_options(struct sock *sk, struct sk_buff *skb
+ opts->tsecr = tp->rx_opt.ts_recent;
+ size += TCPOLEN_TSTAMP_ALIGNED;
+ }
++ if (mptcp(tp))
++ mptcp_established_options(sk, skb, opts, &size);
+
+ eff_sacks = tp->rx_opt.num_sacks + tp->rx_opt.dsack;
+ if (unlikely(eff_sacks)) {
+- const unsigned int remaining = MAX_TCP_OPTION_SPACE - size;
+- opts->num_sack_blocks =
+- min_t(unsigned int, eff_sacks,
+- (remaining - TCPOLEN_SACK_BASE_ALIGNED) /
+- TCPOLEN_SACK_PERBLOCK);
+- size += TCPOLEN_SACK_BASE_ALIGNED +
+- opts->num_sack_blocks * TCPOLEN_SACK_PERBLOCK;
++ const unsigned remaining = MAX_TCP_OPTION_SPACE - size;
++ if (remaining < TCPOLEN_SACK_BASE_ALIGNED)
++ opts->num_sack_blocks = 0;
++ else
++ opts->num_sack_blocks =
++ min_t(unsigned int, eff_sacks,
++ (remaining - TCPOLEN_SACK_BASE_ALIGNED) /
++ TCPOLEN_SACK_PERBLOCK);
++ if (opts->num_sack_blocks)
++ size += TCPOLEN_SACK_BASE_ALIGNED +
++ opts->num_sack_blocks * TCPOLEN_SACK_PERBLOCK;
+ }
+
+ return size;
+@@ -711,8 +723,8 @@ static void tcp_tsq_handler(struct sock *sk)
+ if ((1 << sk->sk_state) &
+ (TCPF_ESTABLISHED | TCPF_FIN_WAIT1 | TCPF_CLOSING |
+ TCPF_CLOSE_WAIT | TCPF_LAST_ACK))
+- tcp_write_xmit(sk, tcp_current_mss(sk), tcp_sk(sk)->nonagle,
+- 0, GFP_ATOMIC);
++ tcp_sk(sk)->ops->write_xmit(sk, tcp_current_mss(sk),
++ tcp_sk(sk)->nonagle, 0, GFP_ATOMIC);
+ }
+ /*
+ * One tasklet per cpu tries to send more skbs.
+@@ -727,7 +739,7 @@ static void tcp_tasklet_func(unsigned long data)
+ unsigned long flags;
+ struct list_head *q, *n;
+ struct tcp_sock *tp;
+- struct sock *sk;
++ struct sock *sk, *meta_sk;
+
+ local_irq_save(flags);
+ list_splice_init(&tsq->head, &list);
+@@ -738,15 +750,25 @@ static void tcp_tasklet_func(unsigned long data)
+ list_del(&tp->tsq_node);
+
+ sk = (struct sock *)tp;
+- bh_lock_sock(sk);
++ meta_sk = mptcp(tp) ? mptcp_meta_sk(sk) : sk;
++ bh_lock_sock(meta_sk);
+
+- if (!sock_owned_by_user(sk)) {
++ if (!sock_owned_by_user(meta_sk)) {
+ tcp_tsq_handler(sk);
++ if (mptcp(tp))
++ tcp_tsq_handler(meta_sk);
+ } else {
++ if (mptcp(tp) && sk->sk_state == TCP_CLOSE)
++ goto exit;
++
+ /* defer the work to tcp_release_cb() */
+ set_bit(TCP_TSQ_DEFERRED, &tp->tsq_flags);
++
++ if (mptcp(tp))
++ mptcp_tsq_flags(sk);
+ }
+- bh_unlock_sock(sk);
++exit:
++ bh_unlock_sock(meta_sk);
+
+ clear_bit(TSQ_QUEUED, &tp->tsq_flags);
+ sk_free(sk);
+@@ -756,7 +778,10 @@ static void tcp_tasklet_func(unsigned long data)
+ #define TCP_DEFERRED_ALL ((1UL << TCP_TSQ_DEFERRED) | \
+ (1UL << TCP_WRITE_TIMER_DEFERRED) | \
+ (1UL << TCP_DELACK_TIMER_DEFERRED) | \
+- (1UL << TCP_MTU_REDUCED_DEFERRED))
++ (1UL << TCP_MTU_REDUCED_DEFERRED) | \
++ (1UL << MPTCP_PATH_MANAGER) | \
++ (1UL << MPTCP_SUB_DEFERRED))
++
+ /**
+ * tcp_release_cb - tcp release_sock() callback
+ * @sk: socket
+@@ -803,6 +828,13 @@ void tcp_release_cb(struct sock *sk)
+ sk->sk_prot->mtu_reduced(sk);
+ __sock_put(sk);
+ }
++ if (flags & (1UL << MPTCP_PATH_MANAGER)) {
++ if (tcp_sk(sk)->mpcb->pm_ops->release_sock)
++ tcp_sk(sk)->mpcb->pm_ops->release_sock(sk);
++ __sock_put(sk);
++ }
++ if (flags & (1UL << MPTCP_SUB_DEFERRED))
++ mptcp_tsq_sub_deferred(sk);
+ }
+ EXPORT_SYMBOL(tcp_release_cb);
+
+@@ -862,8 +894,8 @@ void tcp_wfree(struct sk_buff *skb)
+ * We are working here with either a clone of the original
+ * SKB, or a fresh unique copy made by the retransmit engine.
+ */
+-static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
+- gfp_t gfp_mask)
++int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
++ gfp_t gfp_mask)
+ {
+ const struct inet_connection_sock *icsk = inet_csk(sk);
+ struct inet_sock *inet;
+@@ -933,7 +965,7 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
+ */
+ th->window = htons(min(tp->rcv_wnd, 65535U));
+ } else {
+- th->window = htons(tcp_select_window(sk));
++ th->window = htons(tp->ops->select_window(sk));
+ }
+ th->check = 0;
+ th->urg_ptr = 0;
+@@ -949,7 +981,7 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
+ }
+ }
+
+- tcp_options_write((__be32 *)(th + 1), tp, &opts);
++ tcp_options_write((__be32 *)(th + 1), tp, &opts, skb);
+ if (likely((tcb->tcp_flags & TCPHDR_SYN) == 0))
+ TCP_ECN_send(sk, skb, tcp_header_size);
+
+@@ -988,7 +1020,7 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
+ * NOTE: probe0 timer is not checked, do not forget tcp_push_pending_frames,
+ * otherwise socket can stall.
+ */
+-static void tcp_queue_skb(struct sock *sk, struct sk_buff *skb)
++void tcp_queue_skb(struct sock *sk, struct sk_buff *skb)
+ {
+ struct tcp_sock *tp = tcp_sk(sk);
+
+@@ -1001,15 +1033,16 @@ static void tcp_queue_skb(struct sock *sk, struct sk_buff *skb)
+ }
+
+ /* Initialize TSO segments for a packet. */
+-static void tcp_set_skb_tso_segs(const struct sock *sk, struct sk_buff *skb,
+- unsigned int mss_now)
++void tcp_set_skb_tso_segs(const struct sock *sk, struct sk_buff *skb,
++ unsigned int mss_now)
+ {
+ struct skb_shared_info *shinfo = skb_shinfo(skb);
+
+ /* Make sure we own this skb before messing gso_size/gso_segs */
+ WARN_ON_ONCE(skb_cloned(skb));
+
+- if (skb->len <= mss_now || skb->ip_summed == CHECKSUM_NONE) {
++ if (skb->len <= mss_now || (is_meta_sk(sk) && !mptcp_sk_can_gso(sk)) ||
++ (!is_meta_sk(sk) && !sk_can_gso(sk)) || skb->ip_summed == CHECKSUM_NONE) {
+ /* Avoid the costly divide in the normal
+ * non-TSO case.
+ */
+@@ -1041,7 +1074,7 @@ static void tcp_adjust_fackets_out(struct sock *sk, const struct sk_buff *skb,
+ /* Pcount in the middle of the write queue got changed, we need to do various
+ * tweaks to fix counters
+ */
+-static void tcp_adjust_pcount(struct sock *sk, const struct sk_buff *skb, int decr)
++void tcp_adjust_pcount(struct sock *sk, const struct sk_buff *skb, int decr)
+ {
+ struct tcp_sock *tp = tcp_sk(sk);
+
+@@ -1164,7 +1197,7 @@ int tcp_fragment(struct sock *sk, struct sk_buff *skb, u32 len,
+ * eventually). The difference is that pulled data not copied, but
+ * immediately discarded.
+ */
+-static void __pskb_trim_head(struct sk_buff *skb, int len)
++void __pskb_trim_head(struct sk_buff *skb, int len)
+ {
+ struct skb_shared_info *shinfo;
+ int i, k, eat;
+@@ -1205,6 +1238,9 @@ static void __pskb_trim_head(struct sk_buff *skb, int len)
+ /* Remove acked data from a packet in the transmit queue. */
+ int tcp_trim_head(struct sock *sk, struct sk_buff *skb, u32 len)
+ {
++ if (mptcp(tcp_sk(sk)) && !is_meta_sk(sk) && mptcp_is_data_seq(skb))
++ return mptcp_trim_head(sk, skb, len);
++
+ if (skb_unclone(skb, GFP_ATOMIC))
+ return -ENOMEM;
+
+@@ -1222,6 +1258,15 @@ int tcp_trim_head(struct sock *sk, struct sk_buff *skb, u32 len)
+ if (tcp_skb_pcount(skb) > 1)
+ tcp_set_skb_tso_segs(sk, skb, tcp_skb_mss(skb));
+
++#ifdef CONFIG_MPTCP
++ /* Some data got acked - we assume that the seq-number reached the dest.
++ * Anyway, our MPTCP-option has been trimmed above - we lost it here.
++ * Only remove the SEQ if the call does not come from a meta retransmit.
++ */
++ if (mptcp(tcp_sk(sk)) && !is_meta_sk(sk))
++ TCP_SKB_CB(skb)->mptcp_flags &= ~MPTCPHDR_SEQ;
++#endif
++
+ return 0;
+ }
+
+@@ -1379,6 +1424,7 @@ unsigned int tcp_current_mss(struct sock *sk)
+
+ return mss_now;
+ }
++EXPORT_SYMBOL(tcp_current_mss);
+
+ /* RFC2861, slow part. Adjust cwnd, after it was not full during one rto.
+ * As additional protections, we do not touch cwnd in retransmission phases,
+@@ -1446,8 +1492,8 @@ static bool tcp_minshall_check(const struct tcp_sock *tp)
+ * But we can avoid doing the divide again given we already have
+ * skb_pcount = skb->len / mss_now
+ */
+-static void tcp_minshall_update(struct tcp_sock *tp, unsigned int mss_now,
+- const struct sk_buff *skb)
++void tcp_minshall_update(struct tcp_sock *tp, unsigned int mss_now,
++ const struct sk_buff *skb)
+ {
+ if (skb->len < tcp_skb_pcount(skb) * mss_now)
+ tp->snd_sml = TCP_SKB_CB(skb)->end_seq;
+@@ -1468,11 +1514,11 @@ static bool tcp_nagle_check(bool partial, const struct tcp_sock *tp,
+ (!nonagle && tp->packets_out && tcp_minshall_check(tp)));
+ }
+ /* Returns the portion of skb which can be sent right away */
+-static unsigned int tcp_mss_split_point(const struct sock *sk,
+- const struct sk_buff *skb,
+- unsigned int mss_now,
+- unsigned int max_segs,
+- int nonagle)
++unsigned int tcp_mss_split_point(const struct sock *sk,
++ const struct sk_buff *skb,
++ unsigned int mss_now,
++ unsigned int max_segs,
++ int nonagle)
+ {
+ const struct tcp_sock *tp = tcp_sk(sk);
+ u32 partial, needed, window, max_len;
+@@ -1502,13 +1548,14 @@ static unsigned int tcp_mss_split_point(const struct sock *sk,
+ /* Can at least one segment of SKB be sent right now, according to the
+ * congestion window rules? If so, return how many segments are allowed.
+ */
+-static inline unsigned int tcp_cwnd_test(const struct tcp_sock *tp,
+- const struct sk_buff *skb)
++unsigned int tcp_cwnd_test(const struct tcp_sock *tp,
++ const struct sk_buff *skb)
+ {
+ u32 in_flight, cwnd;
+
+ /* Don't be strict about the congestion window for the final FIN. */
+- if ((TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN) &&
++ if (skb &&
++ (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN) &&
+ tcp_skb_pcount(skb) == 1)
+ return 1;
+
+@@ -1524,8 +1571,8 @@ static inline unsigned int tcp_cwnd_test(const struct tcp_sock *tp,
+ * This must be invoked the first time we consider transmitting
+ * SKB onto the wire.
+ */
+-static int tcp_init_tso_segs(const struct sock *sk, struct sk_buff *skb,
+- unsigned int mss_now)
++int tcp_init_tso_segs(const struct sock *sk, struct sk_buff *skb,
++ unsigned int mss_now)
+ {
+ int tso_segs = tcp_skb_pcount(skb);
+
+@@ -1540,8 +1587,8 @@ static int tcp_init_tso_segs(const struct sock *sk, struct sk_buff *skb,
+ /* Return true if the Nagle test allows this packet to be
+ * sent now.
+ */
+-static inline bool tcp_nagle_test(const struct tcp_sock *tp, const struct sk_buff *skb,
+- unsigned int cur_mss, int nonagle)
++bool tcp_nagle_test(const struct tcp_sock *tp, const struct sk_buff *skb,
++ unsigned int cur_mss, int nonagle)
+ {
+ /* Nagle rule does not apply to frames, which sit in the middle of the
+ * write_queue (they have no chances to get new data).
+@@ -1553,7 +1600,8 @@ static inline bool tcp_nagle_test(const struct tcp_sock *tp, const struct sk_buf
+ return true;
+
+ /* Don't use the nagle rule for urgent data (or for the final FIN). */
+- if (tcp_urg_mode(tp) || (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN))
++ if (tcp_urg_mode(tp) || (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN) ||
++ mptcp_is_data_fin(skb))
+ return true;
+
+ if (!tcp_nagle_check(skb->len < cur_mss, tp, nonagle))
+@@ -1563,9 +1611,8 @@ static inline bool tcp_nagle_test(const struct tcp_sock *tp, const struct sk_buf
+ }
+
+ /* Does at least the first segment of SKB fit into the send window? */
+-static bool tcp_snd_wnd_test(const struct tcp_sock *tp,
+- const struct sk_buff *skb,
+- unsigned int cur_mss)
++bool tcp_snd_wnd_test(const struct tcp_sock *tp, const struct sk_buff *skb,
++ unsigned int cur_mss)
+ {
+ u32 end_seq = TCP_SKB_CB(skb)->end_seq;
+
+@@ -1676,7 +1723,7 @@ static bool tcp_tso_should_defer(struct sock *sk, struct sk_buff *skb,
+ u32 send_win, cong_win, limit, in_flight;
+ int win_divisor;
+
+- if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN)
++ if ((TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN) || mptcp_is_data_fin(skb))
+ goto send_now;
+
+ if (icsk->icsk_ca_state != TCP_CA_Open)
+@@ -1888,7 +1935,7 @@ static int tcp_mtu_probe(struct sock *sk)
+ * Returns true, if no segments are in flight and we have queued segments,
+ * but cannot send anything now because of SWS or another problem.
+ */
+-static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
++bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
+ int push_one, gfp_t gfp)
+ {
+ struct tcp_sock *tp = tcp_sk(sk);
+@@ -1900,7 +1947,11 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
+
+ sent_pkts = 0;
+
+- if (!push_one) {
++ /* pmtu not yet supported with MPTCP. Should be possible by early
++ * exiting the loop inside tcp_mtu_probe, making sure that only one
++ * single DSS-mapping gets probed.
++ */
++ if (!push_one && !mptcp(tp)) {
+ /* Do MTU probing. */
+ result = tcp_mtu_probe(sk);
+ if (!result) {
+@@ -2099,7 +2150,8 @@ void tcp_send_loss_probe(struct sock *sk)
+ int err = -1;
+
+ if (tcp_send_head(sk) != NULL) {
+- err = tcp_write_xmit(sk, mss, TCP_NAGLE_OFF, 2, GFP_ATOMIC);
++ err = tp->ops->write_xmit(sk, mss, TCP_NAGLE_OFF, 2,
++ GFP_ATOMIC);
+ goto rearm_timer;
+ }
+
+@@ -2159,8 +2211,8 @@ void __tcp_push_pending_frames(struct sock *sk, unsigned int cur_mss,
+ if (unlikely(sk->sk_state == TCP_CLOSE))
+ return;
+
+- if (tcp_write_xmit(sk, cur_mss, nonagle, 0,
+- sk_gfp_atomic(sk, GFP_ATOMIC)))
++ if (tcp_sk(sk)->ops->write_xmit(sk, cur_mss, nonagle, 0,
++ sk_gfp_atomic(sk, GFP_ATOMIC)))
+ tcp_check_probe_timer(sk);
+ }
+
+@@ -2173,7 +2225,8 @@ void tcp_push_one(struct sock *sk, unsigned int mss_now)
+
+ BUG_ON(!skb || skb->len < mss_now);
+
+- tcp_write_xmit(sk, mss_now, TCP_NAGLE_PUSH, 1, sk->sk_allocation);
++ tcp_sk(sk)->ops->write_xmit(sk, mss_now, TCP_NAGLE_PUSH, 1,
++ sk->sk_allocation);
+ }
+
+ /* This function returns the amount that we can raise the
+@@ -2386,6 +2439,10 @@ static void tcp_retrans_try_collapse(struct sock *sk, struct sk_buff *to,
+ if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_SYN)
+ return;
+
++ /* Currently not supported for MPTCP - but it should be possible */
++ if (mptcp(tp))
++ return;
++
+ tcp_for_write_queue_from_safe(skb, tmp, sk) {
+ if (!tcp_can_collapse(sk, skb))
+ break;
+@@ -2843,7 +2900,7 @@ struct sk_buff *tcp_make_synack(struct sock *sk, struct dst_entry *dst,
+
+ /* RFC1323: The window in SYN & SYN/ACK segments is never scaled. */
+ th->window = htons(min(req->rcv_wnd, 65535U));
+- tcp_options_write((__be32 *)(th + 1), tp, &opts);
++ tcp_options_write((__be32 *)(th + 1), tp, &opts, skb);
+ th->doff = (tcp_header_size >> 2);
+ TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_OUTSEGS);
+
+@@ -2897,13 +2954,13 @@ static void tcp_connect_init(struct sock *sk)
+ (tp->window_clamp > tcp_full_space(sk) || tp->window_clamp == 0))
+ tp->window_clamp = tcp_full_space(sk);
+
+- tcp_select_initial_window(tcp_full_space(sk),
+- tp->advmss - (tp->rx_opt.ts_recent_stamp ? tp->tcp_header_len - sizeof(struct tcphdr) : 0),
+- &tp->rcv_wnd,
+- &tp->window_clamp,
+- sysctl_tcp_window_scaling,
+- &rcv_wscale,
+- dst_metric(dst, RTAX_INITRWND));
++ tp->ops->select_initial_window(tcp_full_space(sk),
++ tp->advmss - (tp->rx_opt.ts_recent_stamp ? tp->tcp_header_len - sizeof(struct tcphdr) : 0),
++ &tp->rcv_wnd,
++ &tp->window_clamp,
++ sysctl_tcp_window_scaling,
++ &rcv_wscale,
++ dst_metric(dst, RTAX_INITRWND), sk);
+
+ tp->rx_opt.rcv_wscale = rcv_wscale;
+ tp->rcv_ssthresh = tp->rcv_wnd;
+@@ -2927,6 +2984,36 @@ static void tcp_connect_init(struct sock *sk)
+ inet_csk(sk)->icsk_rto = TCP_TIMEOUT_INIT;
+ inet_csk(sk)->icsk_retransmits = 0;
+ tcp_clear_retrans(tp);
++
++#ifdef CONFIG_MPTCP
++ if (sysctl_mptcp_enabled && mptcp_doit(sk)) {
++ if (is_master_tp(tp)) {
++ tp->request_mptcp = 1;
++ mptcp_connect_init(sk);
++ } else if (tp->mptcp) {
++ struct inet_sock *inet = inet_sk(sk);
++
++ tp->mptcp->snt_isn = tp->write_seq;
++ tp->mptcp->init_rcv_wnd = tp->rcv_wnd;
++
++ /* Set nonce for new subflows */
++ if (sk->sk_family == AF_INET)
++ tp->mptcp->mptcp_loc_nonce = mptcp_v4_get_nonce(
++ inet->inet_saddr,
++ inet->inet_daddr,
++ inet->inet_sport,
++ inet->inet_dport);
++#if IS_ENABLED(CONFIG_IPV6)
++ else
++ tp->mptcp->mptcp_loc_nonce = mptcp_v6_get_nonce(
++ inet6_sk(sk)->saddr.s6_addr32,
++ sk->sk_v6_daddr.s6_addr32,
++ inet->inet_sport,
++ inet->inet_dport);
++#endif
++ }
++ }
++#endif
+ }
+
+ static void tcp_connect_queue_skb(struct sock *sk, struct sk_buff *skb)
+@@ -3176,6 +3263,7 @@ void tcp_send_ack(struct sock *sk)
+ TCP_SKB_CB(buff)->when = tcp_time_stamp;
+ tcp_transmit_skb(sk, buff, 0, sk_gfp_atomic(sk, GFP_ATOMIC));
+ }
++EXPORT_SYMBOL(tcp_send_ack);
+
+ /* This routine sends a packet with an out of date sequence
+ * number. It assumes the other end will try to ack it.
+@@ -3188,7 +3276,7 @@ void tcp_send_ack(struct sock *sk)
+ * one is with SEG.SEQ=SND.UNA to deliver urgent pointer, another is
+ * out-of-date with SND.UNA-1 to probe window.
+ */
+-static int tcp_xmit_probe_skb(struct sock *sk, int urgent)
++int tcp_xmit_probe_skb(struct sock *sk, int urgent)
+ {
+ struct tcp_sock *tp = tcp_sk(sk);
+ struct sk_buff *skb;
+@@ -3270,7 +3358,7 @@ void tcp_send_probe0(struct sock *sk)
+ struct tcp_sock *tp = tcp_sk(sk);
+ int err;
+
+- err = tcp_write_wakeup(sk);
++ err = tp->ops->write_wakeup(sk);
+
+ if (tp->packets_out || !tcp_send_head(sk)) {
+ /* Cancel probe timer, if it is not required. */
+@@ -3301,3 +3389,18 @@ void tcp_send_probe0(struct sock *sk)
+ TCP_RTO_MAX);
+ }
+ }
++
++int tcp_rtx_synack(struct sock *sk, struct request_sock *req)
++{
++ const struct tcp_request_sock_ops *af_ops = tcp_rsk(req)->af_specific;
++ struct flowi fl;
++ int res;
++
++ res = af_ops->send_synack(sk, NULL, &fl, req, 0, NULL);
++ if (!res) {
++ TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_RETRANSSEGS);
++ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPSYNRETRANS);
++ }
++ return res;
++}
++EXPORT_SYMBOL(tcp_rtx_synack);
+diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
+index 286227abed10..966b873cbf3e 100644
+--- a/net/ipv4/tcp_timer.c
++++ b/net/ipv4/tcp_timer.c
+@@ -20,6 +20,7 @@
+
+ #include <linux/module.h>
+ #include <linux/gfp.h>
++#include <net/mptcp.h>
+ #include <net/tcp.h>
+
+ int sysctl_tcp_syn_retries __read_mostly = TCP_SYN_RETRIES;
+@@ -32,7 +33,7 @@ int sysctl_tcp_retries2 __read_mostly = TCP_RETR2;
+ int sysctl_tcp_orphan_retries __read_mostly;
+ int sysctl_tcp_thin_linear_timeouts __read_mostly;
+
+-static void tcp_write_err(struct sock *sk)
++void tcp_write_err(struct sock *sk)
+ {
+ sk->sk_err = sk->sk_err_soft ? : ETIMEDOUT;
+ sk->sk_error_report(sk);
+@@ -74,7 +75,7 @@ static int tcp_out_of_resources(struct sock *sk, int do_reset)
+ (!tp->snd_wnd && !tp->packets_out))
+ do_reset = 1;
+ if (do_reset)
+- tcp_send_active_reset(sk, GFP_ATOMIC);
++ tp->ops->send_active_reset(sk, GFP_ATOMIC);
+ tcp_done(sk);
+ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPABORTONMEMORY);
+ return 1;
+@@ -124,10 +125,8 @@ static void tcp_mtu_probing(struct inet_connection_sock *icsk, struct sock *sk)
+ * retransmissions with an initial RTO of TCP_RTO_MIN or TCP_TIMEOUT_INIT if
+ * syn_set flag is set.
+ */
+-static bool retransmits_timed_out(struct sock *sk,
+- unsigned int boundary,
+- unsigned int timeout,
+- bool syn_set)
++bool retransmits_timed_out(struct sock *sk, unsigned int boundary,
++ unsigned int timeout, bool syn_set)
+ {
+ unsigned int linear_backoff_thresh, start_ts;
+ unsigned int rto_base = syn_set ? TCP_TIMEOUT_INIT : TCP_RTO_MIN;
+@@ -153,7 +152,7 @@ static bool retransmits_timed_out(struct sock *sk,
+ }
+
+ /* A write timeout has occurred. Process the after effects. */
+-static int tcp_write_timeout(struct sock *sk)
++int tcp_write_timeout(struct sock *sk)
+ {
+ struct inet_connection_sock *icsk = inet_csk(sk);
+ struct tcp_sock *tp = tcp_sk(sk);
+@@ -171,6 +170,10 @@ static int tcp_write_timeout(struct sock *sk)
+ }
+ retry_until = icsk->icsk_syn_retries ? : sysctl_tcp_syn_retries;
+ syn_set = true;
++ /* Stop retransmitting MP_CAPABLE options in SYN if timed out. */
++ if (tcp_sk(sk)->request_mptcp &&
++ icsk->icsk_retransmits >= mptcp_sysctl_syn_retries())
++ tcp_sk(sk)->request_mptcp = 0;
+ } else {
+ if (retransmits_timed_out(sk, sysctl_tcp_retries1, 0, 0)) {
+ /* Black hole detection */
+@@ -251,18 +254,22 @@ out:
+ static void tcp_delack_timer(unsigned long data)
+ {
+ struct sock *sk = (struct sock *)data;
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct sock *meta_sk = mptcp(tp) ? mptcp_meta_sk(sk) : sk;
+
+- bh_lock_sock(sk);
+- if (!sock_owned_by_user(sk)) {
++ bh_lock_sock(meta_sk);
++ if (!sock_owned_by_user(meta_sk)) {
+ tcp_delack_timer_handler(sk);
+ } else {
+ inet_csk(sk)->icsk_ack.blocked = 1;
+- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_DELAYEDACKLOCKED);
++ NET_INC_STATS_BH(sock_net(meta_sk), LINUX_MIB_DELAYEDACKLOCKED);
+ /* deleguate our work to tcp_release_cb() */
+ if (!test_and_set_bit(TCP_DELACK_TIMER_DEFERRED, &tcp_sk(sk)->tsq_flags))
+ sock_hold(sk);
++ if (mptcp(tp))
++ mptcp_tsq_flags(sk);
+ }
+- bh_unlock_sock(sk);
++ bh_unlock_sock(meta_sk);
+ sock_put(sk);
+ }
+
+@@ -479,6 +486,10 @@ out_reset_timer:
+ __sk_dst_reset(sk);
+
+ out:;
++ if (mptcp(tp)) {
++ mptcp_reinject_data(sk, 1);
++ mptcp_set_rto(sk);
++ }
+ }
+
+ void tcp_write_timer_handler(struct sock *sk)
+@@ -505,7 +516,7 @@ void tcp_write_timer_handler(struct sock *sk)
+ break;
+ case ICSK_TIME_RETRANS:
+ icsk->icsk_pending = 0;
+- tcp_retransmit_timer(sk);
++ tcp_sk(sk)->ops->retransmit_timer(sk);
+ break;
+ case ICSK_TIME_PROBE0:
+ icsk->icsk_pending = 0;
+@@ -520,16 +531,19 @@ out:
+ static void tcp_write_timer(unsigned long data)
+ {
+ struct sock *sk = (struct sock *)data;
++ struct sock *meta_sk = mptcp(tcp_sk(sk)) ? mptcp_meta_sk(sk) : sk;
+
+- bh_lock_sock(sk);
+- if (!sock_owned_by_user(sk)) {
++ bh_lock_sock(meta_sk);
++ if (!sock_owned_by_user(meta_sk)) {
+ tcp_write_timer_handler(sk);
+ } else {
+ /* deleguate our work to tcp_release_cb() */
+ if (!test_and_set_bit(TCP_WRITE_TIMER_DEFERRED, &tcp_sk(sk)->tsq_flags))
+ sock_hold(sk);
++ if (mptcp(tcp_sk(sk)))
++ mptcp_tsq_flags(sk);
+ }
+- bh_unlock_sock(sk);
++ bh_unlock_sock(meta_sk);
+ sock_put(sk);
+ }
+
+@@ -566,11 +580,12 @@ static void tcp_keepalive_timer (unsigned long data)
+ struct sock *sk = (struct sock *) data;
+ struct inet_connection_sock *icsk = inet_csk(sk);
+ struct tcp_sock *tp = tcp_sk(sk);
++ struct sock *meta_sk = mptcp(tp) ? mptcp_meta_sk(sk) : sk;
+ u32 elapsed;
+
+ /* Only process if socket is not in use. */
+- bh_lock_sock(sk);
+- if (sock_owned_by_user(sk)) {
++ bh_lock_sock(meta_sk);
++ if (sock_owned_by_user(meta_sk)) {
+ /* Try again later. */
+ inet_csk_reset_keepalive_timer (sk, HZ/20);
+ goto out;
+@@ -581,16 +596,38 @@ static void tcp_keepalive_timer (unsigned long data)
+ goto out;
+ }
+
++ if (tp->send_mp_fclose) {
++ /* MUST do this before tcp_write_timeout, because retrans_stamp
++ * may have been set to 0 in another part while we are
++ * retransmitting MP_FASTCLOSE. Then, we would crash, because
++ * retransmits_timed_out accesses the meta-write-queue.
++ *
++ * We make sure that the timestamp is != 0.
++ */
++ if (!tp->retrans_stamp)
++ tp->retrans_stamp = tcp_time_stamp ? : 1;
++
++ if (tcp_write_timeout(sk))
++ goto out;
++
++ tcp_send_ack(sk);
++ icsk->icsk_retransmits++;
++
++ icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX);
++ elapsed = icsk->icsk_rto;
++ goto resched;
++ }
++
+ if (sk->sk_state == TCP_FIN_WAIT2 && sock_flag(sk, SOCK_DEAD)) {
+ if (tp->linger2 >= 0) {
+ const int tmo = tcp_fin_time(sk) - TCP_TIMEWAIT_LEN;
+
+ if (tmo > 0) {
+- tcp_time_wait(sk, TCP_FIN_WAIT2, tmo);
++ tp->ops->time_wait(sk, TCP_FIN_WAIT2, tmo);
+ goto out;
+ }
+ }
+- tcp_send_active_reset(sk, GFP_ATOMIC);
++ tp->ops->send_active_reset(sk, GFP_ATOMIC);
+ goto death;
+ }
+
+@@ -614,11 +651,11 @@ static void tcp_keepalive_timer (unsigned long data)
+ icsk->icsk_probes_out > 0) ||
+ (icsk->icsk_user_timeout == 0 &&
+ icsk->icsk_probes_out >= keepalive_probes(tp))) {
+- tcp_send_active_reset(sk, GFP_ATOMIC);
++ tp->ops->send_active_reset(sk, GFP_ATOMIC);
+ tcp_write_err(sk);
+ goto out;
+ }
+- if (tcp_write_wakeup(sk) <= 0) {
++ if (tp->ops->write_wakeup(sk) <= 0) {
+ icsk->icsk_probes_out++;
+ elapsed = keepalive_intvl_when(tp);
+ } else {
+@@ -642,7 +679,7 @@ death:
+ tcp_done(sk);
+
+ out:
+- bh_unlock_sock(sk);
++ bh_unlock_sock(meta_sk);
+ sock_put(sk);
+ }
+
+diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
+index 5667b3003af9..7139c2973fd2 100644
+--- a/net/ipv6/addrconf.c
++++ b/net/ipv6/addrconf.c
+@@ -760,6 +760,7 @@ void inet6_ifa_finish_destroy(struct inet6_ifaddr *ifp)
+
+ kfree_rcu(ifp, rcu);
+ }
++EXPORT_SYMBOL(inet6_ifa_finish_destroy);
+
+ static void
+ ipv6_link_dev_addr(struct inet6_dev *idev, struct inet6_ifaddr *ifp)
+diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
+index 7cb4392690dd..7057afbca4df 100644
+--- a/net/ipv6/af_inet6.c
++++ b/net/ipv6/af_inet6.c
+@@ -97,8 +97,7 @@ static __inline__ struct ipv6_pinfo *inet6_sk_generic(struct sock *sk)
+ return (struct ipv6_pinfo *)(((u8 *)sk) + offset);
+ }
+
+-static int inet6_create(struct net *net, struct socket *sock, int protocol,
+- int kern)
++int inet6_create(struct net *net, struct socket *sock, int protocol, int kern)
+ {
+ struct inet_sock *inet;
+ struct ipv6_pinfo *np;
+diff --git a/net/ipv6/inet6_connection_sock.c b/net/ipv6/inet6_connection_sock.c
+index a245e5ddffbd..99c892b8992d 100644
+--- a/net/ipv6/inet6_connection_sock.c
++++ b/net/ipv6/inet6_connection_sock.c
+@@ -96,8 +96,8 @@ struct dst_entry *inet6_csk_route_req(struct sock *sk,
+ /*
+ * request_sock (formerly open request) hash tables.
+ */
+-static u32 inet6_synq_hash(const struct in6_addr *raddr, const __be16 rport,
+- const u32 rnd, const u32 synq_hsize)
++u32 inet6_synq_hash(const struct in6_addr *raddr, const __be16 rport,
++ const u32 rnd, const u32 synq_hsize)
+ {
+ u32 c;
+
+diff --git a/net/ipv6/ipv6_sockglue.c b/net/ipv6/ipv6_sockglue.c
+index edb58aff4ae7..ea4d9fda0927 100644
+--- a/net/ipv6/ipv6_sockglue.c
++++ b/net/ipv6/ipv6_sockglue.c
+@@ -48,6 +48,8 @@
+ #include <net/addrconf.h>
+ #include <net/inet_common.h>
+ #include <net/tcp.h>
++#include <net/mptcp.h>
++#include <net/mptcp_v4.h>
+ #include <net/udp.h>
+ #include <net/udplite.h>
+ #include <net/xfrm.h>
+@@ -196,7 +198,12 @@ static int do_ipv6_setsockopt(struct sock *sk, int level, int optname,
+ sock_prot_inuse_add(net, &tcp_prot, 1);
+ local_bh_enable();
+ sk->sk_prot = &tcp_prot;
+- icsk->icsk_af_ops = &ipv4_specific;
++#ifdef CONFIG_MPTCP
++ if (is_mptcp_enabled(sk))
++ icsk->icsk_af_ops = &mptcp_v4_specific;
++ else
++#endif
++ icsk->icsk_af_ops = &ipv4_specific;
+ sk->sk_socket->ops = &inet_stream_ops;
+ sk->sk_family = PF_INET;
+ tcp_sync_mss(sk, icsk->icsk_pmtu_cookie);
+diff --git a/net/ipv6/syncookies.c b/net/ipv6/syncookies.c
+index a822b880689b..b2b38869d795 100644
+--- a/net/ipv6/syncookies.c
++++ b/net/ipv6/syncookies.c
+@@ -181,13 +181,13 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
+
+ /* check for timestamp cookie support */
+ memset(&tcp_opt, 0, sizeof(tcp_opt));
+- tcp_parse_options(skb, &tcp_opt, 0, NULL);
++ tcp_parse_options(skb, &tcp_opt, NULL, 0, NULL);
+
+ if (!cookie_check_timestamp(&tcp_opt, sock_net(sk), &ecn_ok))
+ goto out;
+
+ ret = NULL;
+- req = inet6_reqsk_alloc(&tcp6_request_sock_ops);
++ req = inet_reqsk_alloc(&tcp6_request_sock_ops);
+ if (!req)
+ goto out;
+
+@@ -255,10 +255,10 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
+ }
+
+ req->window_clamp = tp->window_clamp ? :dst_metric(dst, RTAX_WINDOW);
+- tcp_select_initial_window(tcp_full_space(sk), req->mss,
+- &req->rcv_wnd, &req->window_clamp,
+- ireq->wscale_ok, &rcv_wscale,
+- dst_metric(dst, RTAX_INITRWND));
++ tp->ops->select_initial_window(tcp_full_space(sk), req->mss,
++ &req->rcv_wnd, &req->window_clamp,
++ ireq->wscale_ok, &rcv_wscale,
++ dst_metric(dst, RTAX_INITRWND), sk);
+
+ ireq->rcv_wscale = rcv_wscale;
+
+diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
+index 229239ad96b1..fda94d71666e 100644
+--- a/net/ipv6/tcp_ipv6.c
++++ b/net/ipv6/tcp_ipv6.c
+@@ -63,6 +63,8 @@
+ #include <net/inet_common.h>
+ #include <net/secure_seq.h>
+ #include <net/tcp_memcontrol.h>
++#include <net/mptcp.h>
++#include <net/mptcp_v6.h>
+ #include <net/busy_poll.h>
+
+ #include <linux/proc_fs.h>
+@@ -71,12 +73,6 @@
+ #include <linux/crypto.h>
+ #include <linux/scatterlist.h>
+
+-static void tcp_v6_send_reset(struct sock *sk, struct sk_buff *skb);
+-static void tcp_v6_reqsk_send_ack(struct sock *sk, struct sk_buff *skb,
+- struct request_sock *req);
+-
+-static int tcp_v6_do_rcv(struct sock *sk, struct sk_buff *skb);
+-
+ static const struct inet_connection_sock_af_ops ipv6_mapped;
+ static const struct inet_connection_sock_af_ops ipv6_specific;
+ #ifdef CONFIG_TCP_MD5SIG
+@@ -90,7 +86,7 @@ static struct tcp_md5sig_key *tcp_v6_md5_do_lookup(struct sock *sk,
+ }
+ #endif
+
+-static void inet6_sk_rx_dst_set(struct sock *sk, const struct sk_buff *skb)
++void inet6_sk_rx_dst_set(struct sock *sk, const struct sk_buff *skb)
+ {
+ struct dst_entry *dst = skb_dst(skb);
+ const struct rt6_info *rt = (const struct rt6_info *)dst;
+@@ -102,10 +98,11 @@ static void inet6_sk_rx_dst_set(struct sock *sk, const struct sk_buff *skb)
+ inet6_sk(sk)->rx_dst_cookie = rt->rt6i_node->fn_sernum;
+ }
+
+-static void tcp_v6_hash(struct sock *sk)
++void tcp_v6_hash(struct sock *sk)
+ {
+ if (sk->sk_state != TCP_CLOSE) {
+- if (inet_csk(sk)->icsk_af_ops == &ipv6_mapped) {
++ if (inet_csk(sk)->icsk_af_ops == &ipv6_mapped ||
++ inet_csk(sk)->icsk_af_ops == &mptcp_v6_mapped) {
+ tcp_prot.hash(sk);
+ return;
+ }
+@@ -115,7 +112,7 @@ static void tcp_v6_hash(struct sock *sk)
+ }
+ }
+
+-static __u32 tcp_v6_init_sequence(const struct sk_buff *skb)
++__u32 tcp_v6_init_sequence(const struct sk_buff *skb)
+ {
+ return secure_tcpv6_sequence_number(ipv6_hdr(skb)->daddr.s6_addr32,
+ ipv6_hdr(skb)->saddr.s6_addr32,
+@@ -123,7 +120,7 @@ static __u32 tcp_v6_init_sequence(const struct sk_buff *skb)
+ tcp_hdr(skb)->source);
+ }
+
+-static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
++int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
+ int addr_len)
+ {
+ struct sockaddr_in6 *usin = (struct sockaddr_in6 *) uaddr;
+@@ -215,7 +212,12 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
+ sin.sin_port = usin->sin6_port;
+ sin.sin_addr.s_addr = usin->sin6_addr.s6_addr32[3];
+
+- icsk->icsk_af_ops = &ipv6_mapped;
++#ifdef CONFIG_MPTCP
++ if (is_mptcp_enabled(sk))
++ icsk->icsk_af_ops = &mptcp_v6_mapped;
++ else
++#endif
++ icsk->icsk_af_ops = &ipv6_mapped;
+ sk->sk_backlog_rcv = tcp_v4_do_rcv;
+ #ifdef CONFIG_TCP_MD5SIG
+ tp->af_specific = &tcp_sock_ipv6_mapped_specific;
+@@ -225,7 +227,12 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
+
+ if (err) {
+ icsk->icsk_ext_hdr_len = exthdrlen;
+- icsk->icsk_af_ops = &ipv6_specific;
++#ifdef CONFIG_MPTCP
++ if (is_mptcp_enabled(sk))
++ icsk->icsk_af_ops = &mptcp_v6_specific;
++ else
++#endif
++ icsk->icsk_af_ops = &ipv6_specific;
+ sk->sk_backlog_rcv = tcp_v6_do_rcv;
+ #ifdef CONFIG_TCP_MD5SIG
+ tp->af_specific = &tcp_sock_ipv6_specific;
+@@ -337,7 +344,7 @@ static void tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
+ const struct ipv6hdr *hdr = (const struct ipv6hdr *)skb->data;
+ const struct tcphdr *th = (struct tcphdr *)(skb->data+offset);
+ struct ipv6_pinfo *np;
+- struct sock *sk;
++ struct sock *sk, *meta_sk;
+ int err;
+ struct tcp_sock *tp;
+ struct request_sock *fastopen;
+@@ -358,8 +365,14 @@ static void tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
+ return;
+ }
+
+- bh_lock_sock(sk);
+- if (sock_owned_by_user(sk) && type != ICMPV6_PKT_TOOBIG)
++ tp = tcp_sk(sk);
++ if (mptcp(tp))
++ meta_sk = mptcp_meta_sk(sk);
++ else
++ meta_sk = sk;
++
++ bh_lock_sock(meta_sk);
++ if (sock_owned_by_user(meta_sk) && type != ICMPV6_PKT_TOOBIG)
+ NET_INC_STATS_BH(net, LINUX_MIB_LOCKDROPPEDICMPS);
+
+ if (sk->sk_state == TCP_CLOSE)
+@@ -370,7 +383,6 @@ static void tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
+ goto out;
+ }
+
+- tp = tcp_sk(sk);
+ seq = ntohl(th->seq);
+ /* XXX (TFO) - tp->snd_una should be ISN (tcp_create_openreq_child() */
+ fastopen = tp->fastopen_rsk;
+@@ -403,11 +415,15 @@ static void tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
+ goto out;
+
+ tp->mtu_info = ntohl(info);
+- if (!sock_owned_by_user(sk))
++ if (!sock_owned_by_user(meta_sk))
+ tcp_v6_mtu_reduced(sk);
+- else if (!test_and_set_bit(TCP_MTU_REDUCED_DEFERRED,
++ else {
++ if (!test_and_set_bit(TCP_MTU_REDUCED_DEFERRED,
+ &tp->tsq_flags))
+- sock_hold(sk);
++ sock_hold(sk);
++ if (mptcp(tp))
++ mptcp_tsq_flags(sk);
++ }
+ goto out;
+ }
+
+@@ -417,7 +433,7 @@ static void tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
+ switch (sk->sk_state) {
+ struct request_sock *req, **prev;
+ case TCP_LISTEN:
+- if (sock_owned_by_user(sk))
++ if (sock_owned_by_user(meta_sk))
+ goto out;
+
+ req = inet6_csk_search_req(sk, &prev, th->dest, &hdr->daddr,
+@@ -447,7 +463,7 @@ static void tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
+ if (fastopen && fastopen->sk == NULL)
+ break;
+
+- if (!sock_owned_by_user(sk)) {
++ if (!sock_owned_by_user(meta_sk)) {
+ sk->sk_err = err;
+ sk->sk_error_report(sk); /* Wake people up to see the error (see connect in sock.c) */
+
+@@ -457,26 +473,27 @@ static void tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
+ goto out;
+ }
+
+- if (!sock_owned_by_user(sk) && np->recverr) {
++ if (!sock_owned_by_user(meta_sk) && np->recverr) {
+ sk->sk_err = err;
+ sk->sk_error_report(sk);
+ } else
+ sk->sk_err_soft = err;
+
+ out:
+- bh_unlock_sock(sk);
++ bh_unlock_sock(meta_sk);
+ sock_put(sk);
+ }
+
+
+-static int tcp_v6_send_synack(struct sock *sk, struct dst_entry *dst,
+- struct flowi6 *fl6,
+- struct request_sock *req,
+- u16 queue_mapping,
+- struct tcp_fastopen_cookie *foc)
++int tcp_v6_send_synack(struct sock *sk, struct dst_entry *dst,
++ struct flowi *fl,
++ struct request_sock *req,
++ u16 queue_mapping,
++ struct tcp_fastopen_cookie *foc)
+ {
+ struct inet_request_sock *ireq = inet_rsk(req);
+ struct ipv6_pinfo *np = inet6_sk(sk);
++ struct flowi6 *fl6 = &fl->u.ip6;
+ struct sk_buff *skb;
+ int err = -ENOMEM;
+
+@@ -497,18 +514,21 @@ static int tcp_v6_send_synack(struct sock *sk, struct dst_entry *dst,
+ skb_set_queue_mapping(skb, queue_mapping);
+ err = ip6_xmit(sk, skb, fl6, np->opt, np->tclass);
+ err = net_xmit_eval(err);
++ if (!tcp_rsk(req)->snt_synack && !err)
++ tcp_rsk(req)->snt_synack = tcp_time_stamp;
+ }
+
+ done:
+ return err;
+ }
+
+-static int tcp_v6_rtx_synack(struct sock *sk, struct request_sock *req)
++int tcp_v6_rtx_synack(struct sock *sk, struct request_sock *req)
+ {
+- struct flowi6 fl6;
++ const struct tcp_request_sock_ops *af_ops = tcp_rsk(req)->af_specific;
++ struct flowi fl;
+ int res;
+
+- res = tcp_v6_send_synack(sk, NULL, &fl6, req, 0, NULL);
++ res = af_ops->send_synack(sk, NULL, &fl, req, 0, NULL);
+ if (!res) {
+ TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_RETRANSSEGS);
+ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPSYNRETRANS);
+@@ -516,7 +536,7 @@ static int tcp_v6_rtx_synack(struct sock *sk, struct request_sock *req)
+ return res;
+ }
+
+-static void tcp_v6_reqsk_destructor(struct request_sock *req)
++void tcp_v6_reqsk_destructor(struct request_sock *req)
+ {
+ kfree_skb(inet_rsk(req)->pktopts);
+ }
+@@ -718,27 +738,74 @@ static int tcp_v6_inbound_md5_hash(struct sock *sk, const struct sk_buff *skb)
+ }
+ #endif
+
++static int tcp_v6_init_req(struct request_sock *req, struct sock *sk,
++ struct sk_buff *skb)
++{
++ struct inet_request_sock *ireq = inet_rsk(req);
++ struct ipv6_pinfo *np = inet6_sk(sk);
++
++ ireq->ir_v6_rmt_addr = ipv6_hdr(skb)->saddr;
++ ireq->ir_v6_loc_addr = ipv6_hdr(skb)->daddr;
++
++ ireq->ir_iif = sk->sk_bound_dev_if;
++ ireq->ir_mark = inet_request_mark(sk, skb);
++
++ /* So that link locals have meaning */
++ if (!sk->sk_bound_dev_if &&
++ ipv6_addr_type(&ireq->ir_v6_rmt_addr) & IPV6_ADDR_LINKLOCAL)
++ ireq->ir_iif = inet6_iif(skb);
++
++ if (!TCP_SKB_CB(skb)->when &&
++ (ipv6_opt_accepted(sk, skb) || np->rxopt.bits.rxinfo ||
++ np->rxopt.bits.rxoinfo || np->rxopt.bits.rxhlim ||
++ np->rxopt.bits.rxohlim || np->repflow)) {
++ atomic_inc(&skb->users);
++ ireq->pktopts = skb;
++ }
++
++ return 0;
++}
++
++static struct dst_entry *tcp_v6_route_req(struct sock *sk, struct flowi *fl,
++ const struct request_sock *req,
++ bool *strict)
++{
++ if (strict)
++ *strict = true;
++ return inet6_csk_route_req(sk, &fl->u.ip6, req);
++}
++
+ struct request_sock_ops tcp6_request_sock_ops __read_mostly = {
+ .family = AF_INET6,
+ .obj_size = sizeof(struct tcp6_request_sock),
+- .rtx_syn_ack = tcp_v6_rtx_synack,
++ .rtx_syn_ack = tcp_rtx_synack,
+ .send_ack = tcp_v6_reqsk_send_ack,
+ .destructor = tcp_v6_reqsk_destructor,
+ .send_reset = tcp_v6_send_reset,
+ .syn_ack_timeout = tcp_syn_ack_timeout,
+ };
+
++const struct tcp_request_sock_ops tcp_request_sock_ipv6_ops = {
++ .mss_clamp = IPV6_MIN_MTU - sizeof(struct tcphdr) -
++ sizeof(struct ipv6hdr),
+ #ifdef CONFIG_TCP_MD5SIG
+-static const struct tcp_request_sock_ops tcp_request_sock_ipv6_ops = {
+ .md5_lookup = tcp_v6_reqsk_md5_lookup,
+ .calc_md5_hash = tcp_v6_md5_hash_skb,
+-};
+ #endif
++ .init_req = tcp_v6_init_req,
++#ifdef CONFIG_SYN_COOKIES
++ .cookie_init_seq = cookie_v6_init_sequence,
++#endif
++ .route_req = tcp_v6_route_req,
++ .init_seq = tcp_v6_init_sequence,
++ .send_synack = tcp_v6_send_synack,
++ .queue_hash_add = inet6_csk_reqsk_queue_hash_add,
++};
+
+-static void tcp_v6_send_response(struct sk_buff *skb, u32 seq, u32 ack, u32 win,
+- u32 tsval, u32 tsecr, int oif,
+- struct tcp_md5sig_key *key, int rst, u8 tclass,
+- u32 label)
++static void tcp_v6_send_response(struct sk_buff *skb, u32 seq, u32 ack,
++ u32 data_ack, u32 win, u32 tsval, u32 tsecr,
++ int oif, struct tcp_md5sig_key *key, int rst,
++ u8 tclass, u32 label, int mptcp)
+ {
+ const struct tcphdr *th = tcp_hdr(skb);
+ struct tcphdr *t1;
+@@ -756,7 +823,10 @@ static void tcp_v6_send_response(struct sk_buff *skb, u32 seq, u32 ack, u32 win,
+ if (key)
+ tot_len += TCPOLEN_MD5SIG_ALIGNED;
+ #endif
+-
++#ifdef CONFIG_MPTCP
++ if (mptcp)
++ tot_len += MPTCP_SUB_LEN_DSS + MPTCP_SUB_LEN_ACK;
++#endif
+ buff = alloc_skb(MAX_HEADER + sizeof(struct ipv6hdr) + tot_len,
+ GFP_ATOMIC);
+ if (buff == NULL)
+@@ -794,6 +864,17 @@ static void tcp_v6_send_response(struct sk_buff *skb, u32 seq, u32 ack, u32 win,
+ tcp_v6_md5_hash_hdr((__u8 *)topt, key,
+ &ipv6_hdr(skb)->saddr,
+ &ipv6_hdr(skb)->daddr, t1);
++ topt += 4;
++ }
++#endif
++#ifdef CONFIG_MPTCP
++ if (mptcp) {
++ /* Construction of 32-bit data_ack */
++ *topt++ = htonl((TCPOPT_MPTCP << 24) |
++ ((MPTCP_SUB_LEN_DSS + MPTCP_SUB_LEN_ACK) << 16) |
++ (0x20 << 8) |
++ (0x01));
++ *topt++ = htonl(data_ack);
+ }
+ #endif
+
+@@ -834,7 +915,7 @@ static void tcp_v6_send_response(struct sk_buff *skb, u32 seq, u32 ack, u32 win,
+ kfree_skb(buff);
+ }
+
+-static void tcp_v6_send_reset(struct sock *sk, struct sk_buff *skb)
++void tcp_v6_send_reset(struct sock *sk, struct sk_buff *skb)
+ {
+ const struct tcphdr *th = tcp_hdr(skb);
+ u32 seq = 0, ack_seq = 0;
+@@ -891,7 +972,7 @@ static void tcp_v6_send_reset(struct sock *sk, struct sk_buff *skb)
+ (th->doff << 2);
+
+ oif = sk ? sk->sk_bound_dev_if : 0;
+- tcp_v6_send_response(skb, seq, ack_seq, 0, 0, 0, oif, key, 1, 0, 0);
++ tcp_v6_send_response(skb, seq, ack_seq, 0, 0, 0, 0, oif, key, 1, 0, 0, 0);
+
+ #ifdef CONFIG_TCP_MD5SIG
+ release_sk1:
+@@ -902,45 +983,52 @@ release_sk1:
+ #endif
+ }
+
+-static void tcp_v6_send_ack(struct sk_buff *skb, u32 seq, u32 ack,
++static void tcp_v6_send_ack(struct sk_buff *skb, u32 seq, u32 ack, u32 data_ack,
+ u32 win, u32 tsval, u32 tsecr, int oif,
+ struct tcp_md5sig_key *key, u8 tclass,
+- u32 label)
++ u32 label, int mptcp)
+ {
+- tcp_v6_send_response(skb, seq, ack, win, tsval, tsecr, oif, key, 0, tclass,
+- label);
++ tcp_v6_send_response(skb, seq, ack, data_ack, win, tsval, tsecr, oif,
++ key, 0, tclass, label, mptcp);
+ }
+
+ static void tcp_v6_timewait_ack(struct sock *sk, struct sk_buff *skb)
+ {
+ struct inet_timewait_sock *tw = inet_twsk(sk);
+ struct tcp_timewait_sock *tcptw = tcp_twsk(sk);
++ u32 data_ack = 0;
++ int mptcp = 0;
+
++ if (tcptw->mptcp_tw && tcptw->mptcp_tw->meta_tw) {
++ data_ack = (u32)tcptw->mptcp_tw->rcv_nxt;
++ mptcp = 1;
++ }
+ tcp_v6_send_ack(skb, tcptw->tw_snd_nxt, tcptw->tw_rcv_nxt,
++ data_ack,
+ tcptw->tw_rcv_wnd >> tw->tw_rcv_wscale,
+ tcp_time_stamp + tcptw->tw_ts_offset,
+ tcptw->tw_ts_recent, tw->tw_bound_dev_if, tcp_twsk_md5_key(tcptw),
+- tw->tw_tclass, (tw->tw_flowlabel << 12));
++ tw->tw_tclass, (tw->tw_flowlabel << 12), mptcp);
+
+ inet_twsk_put(tw);
+ }
+
+-static void tcp_v6_reqsk_send_ack(struct sock *sk, struct sk_buff *skb,
+- struct request_sock *req)
++void tcp_v6_reqsk_send_ack(struct sock *sk, struct sk_buff *skb,
++ struct request_sock *req)
+ {
+ /* sk->sk_state == TCP_LISTEN -> for regular TCP_SYN_RECV
+ * sk->sk_state == TCP_SYN_RECV -> for Fast Open.
+ */
+ tcp_v6_send_ack(skb, (sk->sk_state == TCP_LISTEN) ?
+ tcp_rsk(req)->snt_isn + 1 : tcp_sk(sk)->snd_nxt,
+- tcp_rsk(req)->rcv_nxt,
++ tcp_rsk(req)->rcv_nxt, 0,
+ req->rcv_wnd, tcp_time_stamp, req->ts_recent, sk->sk_bound_dev_if,
+ tcp_v6_md5_do_lookup(sk, &ipv6_hdr(skb)->daddr),
+- 0, 0);
++ 0, 0, 0);
+ }
+
+
+-static struct sock *tcp_v6_hnd_req(struct sock *sk, struct sk_buff *skb)
++struct sock *tcp_v6_hnd_req(struct sock *sk, struct sk_buff *skb)
+ {
+ struct request_sock *req, **prev;
+ const struct tcphdr *th = tcp_hdr(skb);
+@@ -959,7 +1047,13 @@ static struct sock *tcp_v6_hnd_req(struct sock *sk, struct sk_buff *skb)
+
+ if (nsk) {
+ if (nsk->sk_state != TCP_TIME_WAIT) {
++ /* Don't lock again the meta-sk. It has been locked
++ * before mptcp_v6_do_rcv.
++ */
++ if (mptcp(tcp_sk(nsk)) && !is_meta_sk(sk))
++ bh_lock_sock(mptcp_meta_sk(nsk));
+ bh_lock_sock(nsk);
++
+ return nsk;
+ }
+ inet_twsk_put(inet_twsk(nsk));
+@@ -973,161 +1067,25 @@ static struct sock *tcp_v6_hnd_req(struct sock *sk, struct sk_buff *skb)
+ return sk;
+ }
+
+-/* FIXME: this is substantially similar to the ipv4 code.
+- * Can some kind of merge be done? -- erics
+- */
+-static int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
++int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
+ {
+- struct tcp_options_received tmp_opt;
+- struct request_sock *req;
+- struct inet_request_sock *ireq;
+- struct ipv6_pinfo *np = inet6_sk(sk);
+- struct tcp_sock *tp = tcp_sk(sk);
+- __u32 isn = TCP_SKB_CB(skb)->when;
+- struct dst_entry *dst = NULL;
+- struct tcp_fastopen_cookie foc = { .len = -1 };
+- bool want_cookie = false, fastopen;
+- struct flowi6 fl6;
+- int err;
+-
+ if (skb->protocol == htons(ETH_P_IP))
+ return tcp_v4_conn_request(sk, skb);
+
+ if (!ipv6_unicast_destination(skb))
+ goto drop;
+
+- if ((sysctl_tcp_syncookies == 2 ||
+- inet_csk_reqsk_queue_is_full(sk)) && !isn) {
+- want_cookie = tcp_syn_flood_action(sk, skb, "TCPv6");
+- if (!want_cookie)
+- goto drop;
+- }
+-
+- if (sk_acceptq_is_full(sk) && inet_csk_reqsk_queue_young(sk) > 1) {
+- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENOVERFLOWS);
+- goto drop;
+- }
+-
+- req = inet6_reqsk_alloc(&tcp6_request_sock_ops);
+- if (req == NULL)
+- goto drop;
+-
+-#ifdef CONFIG_TCP_MD5SIG
+- tcp_rsk(req)->af_specific = &tcp_request_sock_ipv6_ops;
+-#endif
+-
+- tcp_clear_options(&tmp_opt);
+- tmp_opt.mss_clamp = IPV6_MIN_MTU - sizeof(struct tcphdr) - sizeof(struct ipv6hdr);
+- tmp_opt.user_mss = tp->rx_opt.user_mss;
+- tcp_parse_options(skb, &tmp_opt, 0, want_cookie ? NULL : &foc);
+-
+- if (want_cookie && !tmp_opt.saw_tstamp)
+- tcp_clear_options(&tmp_opt);
++ return tcp_conn_request(&tcp6_request_sock_ops,
++ &tcp_request_sock_ipv6_ops, sk, skb);
+
+- tmp_opt.tstamp_ok = tmp_opt.saw_tstamp;
+- tcp_openreq_init(req, &tmp_opt, skb);
+-
+- ireq = inet_rsk(req);
+- ireq->ir_v6_rmt_addr = ipv6_hdr(skb)->saddr;
+- ireq->ir_v6_loc_addr = ipv6_hdr(skb)->daddr;
+- if (!want_cookie || tmp_opt.tstamp_ok)
+- TCP_ECN_create_request(req, skb, sock_net(sk));
+-
+- ireq->ir_iif = sk->sk_bound_dev_if;
+- ireq->ir_mark = inet_request_mark(sk, skb);
+-
+- /* So that link locals have meaning */
+- if (!sk->sk_bound_dev_if &&
+- ipv6_addr_type(&ireq->ir_v6_rmt_addr) & IPV6_ADDR_LINKLOCAL)
+- ireq->ir_iif = inet6_iif(skb);
+-
+- if (!isn) {
+- if (ipv6_opt_accepted(sk, skb) ||
+- np->rxopt.bits.rxinfo || np->rxopt.bits.rxoinfo ||
+- np->rxopt.bits.rxhlim || np->rxopt.bits.rxohlim ||
+- np->repflow) {
+- atomic_inc(&skb->users);
+- ireq->pktopts = skb;
+- }
+-
+- if (want_cookie) {
+- isn = cookie_v6_init_sequence(sk, skb, &req->mss);
+- req->cookie_ts = tmp_opt.tstamp_ok;
+- goto have_isn;
+- }
+-
+- /* VJ's idea. We save last timestamp seen
+- * from the destination in peer table, when entering
+- * state TIME-WAIT, and check against it before
+- * accepting new connection request.
+- *
+- * If "isn" is not zero, this request hit alive
+- * timewait bucket, so that all the necessary checks
+- * are made in the function processing timewait state.
+- */
+- if (tmp_opt.saw_tstamp &&
+- tcp_death_row.sysctl_tw_recycle &&
+- (dst = inet6_csk_route_req(sk, &fl6, req)) != NULL) {
+- if (!tcp_peer_is_proven(req, dst, true)) {
+- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_PAWSPASSIVEREJECTED);
+- goto drop_and_release;
+- }
+- }
+- /* Kill the following clause, if you dislike this way. */
+- else if (!sysctl_tcp_syncookies &&
+- (sysctl_max_syn_backlog - inet_csk_reqsk_queue_len(sk) <
+- (sysctl_max_syn_backlog >> 2)) &&
+- !tcp_peer_is_proven(req, dst, false)) {
+- /* Without syncookies last quarter of
+- * backlog is filled with destinations,
+- * proven to be alive.
+- * It means that we continue to communicate
+- * to destinations, already remembered
+- * to the moment of synflood.
+- */
+- LIMIT_NETDEBUG(KERN_DEBUG "TCP: drop open request from %pI6/%u\n",
+- &ireq->ir_v6_rmt_addr, ntohs(tcp_hdr(skb)->source));
+- goto drop_and_release;
+- }
+-
+- isn = tcp_v6_init_sequence(skb);
+- }
+-have_isn:
+-
+- if (security_inet_conn_request(sk, skb, req))
+- goto drop_and_release;
+-
+- if (!dst && (dst = inet6_csk_route_req(sk, &fl6, req)) == NULL)
+- goto drop_and_free;
+-
+- tcp_rsk(req)->snt_isn = isn;
+- tcp_rsk(req)->snt_synack = tcp_time_stamp;
+- tcp_openreq_init_rwin(req, sk, dst);
+- fastopen = !want_cookie &&
+- tcp_try_fastopen(sk, skb, req, &foc, dst);
+- err = tcp_v6_send_synack(sk, dst, &fl6, req,
+- skb_get_queue_mapping(skb), &foc);
+- if (!fastopen) {
+- if (err || want_cookie)
+- goto drop_and_free;
+-
+- tcp_rsk(req)->listener = NULL;
+- inet6_csk_reqsk_queue_hash_add(sk, req, TCP_TIMEOUT_INIT);
+- }
+- return 0;
+-
+-drop_and_release:
+- dst_release(dst);
+-drop_and_free:
+- reqsk_free(req);
+ drop:
+ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENDROPS);
+ return 0; /* don't send reset */
+ }
+
+-static struct sock *tcp_v6_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
+- struct request_sock *req,
+- struct dst_entry *dst)
++struct sock *tcp_v6_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
++ struct request_sock *req,
++ struct dst_entry *dst)
+ {
+ struct inet_request_sock *ireq;
+ struct ipv6_pinfo *newnp, *np = inet6_sk(sk);
+@@ -1165,7 +1123,12 @@ static struct sock *tcp_v6_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
+
+ newsk->sk_v6_rcv_saddr = newnp->saddr;
+
+- inet_csk(newsk)->icsk_af_ops = &ipv6_mapped;
++#ifdef CONFIG_MPTCP
++ if (is_mptcp_enabled(newsk))
++ inet_csk(newsk)->icsk_af_ops = &mptcp_v6_mapped;
++ else
++#endif
++ inet_csk(newsk)->icsk_af_ops = &ipv6_mapped;
+ newsk->sk_backlog_rcv = tcp_v4_do_rcv;
+ #ifdef CONFIG_TCP_MD5SIG
+ newtp->af_specific = &tcp_sock_ipv6_mapped_specific;
+@@ -1329,7 +1292,7 @@ out:
+ * This is because we cannot sleep with the original spinlock
+ * held.
+ */
+-static int tcp_v6_do_rcv(struct sock *sk, struct sk_buff *skb)
++int tcp_v6_do_rcv(struct sock *sk, struct sk_buff *skb)
+ {
+ struct ipv6_pinfo *np = inet6_sk(sk);
+ struct tcp_sock *tp;
+@@ -1351,6 +1314,9 @@ static int tcp_v6_do_rcv(struct sock *sk, struct sk_buff *skb)
+ goto discard;
+ #endif
+
++ if (is_meta_sk(sk))
++ return mptcp_v6_do_rcv(sk, skb);
++
+ if (sk_filter(sk, skb))
+ goto discard;
+
+@@ -1472,7 +1438,7 @@ static int tcp_v6_rcv(struct sk_buff *skb)
+ {
+ const struct tcphdr *th;
+ const struct ipv6hdr *hdr;
+- struct sock *sk;
++ struct sock *sk, *meta_sk = NULL;
+ int ret;
+ struct net *net = dev_net(skb->dev);
+
+@@ -1503,18 +1469,43 @@ static int tcp_v6_rcv(struct sk_buff *skb)
+ TCP_SKB_CB(skb)->end_seq = (TCP_SKB_CB(skb)->seq + th->syn + th->fin +
+ skb->len - th->doff*4);
+ TCP_SKB_CB(skb)->ack_seq = ntohl(th->ack_seq);
++#ifdef CONFIG_MPTCP
++ TCP_SKB_CB(skb)->mptcp_flags = 0;
++ TCP_SKB_CB(skb)->dss_off = 0;
++#endif
+ TCP_SKB_CB(skb)->when = 0;
+ TCP_SKB_CB(skb)->ip_dsfield = ipv6_get_dsfield(hdr);
+ TCP_SKB_CB(skb)->sacked = 0;
+
+ sk = __inet6_lookup_skb(&tcp_hashinfo, skb, th->source, th->dest);
+- if (!sk)
+- goto no_tcp_socket;
+
+ process:
+- if (sk->sk_state == TCP_TIME_WAIT)
++ if (sk && sk->sk_state == TCP_TIME_WAIT)
+ goto do_time_wait;
+
++#ifdef CONFIG_MPTCP
++ if (!sk && th->syn && !th->ack) {
++ int ret = mptcp_lookup_join(skb, NULL);
++
++ if (ret < 0) {
++ tcp_v6_send_reset(NULL, skb);
++ goto discard_it;
++ } else if (ret > 0) {
++ return 0;
++ }
++ }
++
++ /* Is there a pending request sock for this segment ? */
++ if ((!sk || sk->sk_state == TCP_LISTEN) && mptcp_check_req(skb, net)) {
++ if (sk)
++ sock_put(sk);
++ return 0;
++ }
++#endif
++
++ if (!sk)
++ goto no_tcp_socket;
++
+ if (hdr->hop_limit < inet6_sk(sk)->min_hopcount) {
+ NET_INC_STATS_BH(net, LINUX_MIB_TCPMINTTLDROP);
+ goto discard_and_relse;
+@@ -1529,11 +1520,21 @@ process:
+ sk_mark_napi_id(sk, skb);
+ skb->dev = NULL;
+
+- bh_lock_sock_nested(sk);
++ if (mptcp(tcp_sk(sk))) {
++ meta_sk = mptcp_meta_sk(sk);
++
++ bh_lock_sock_nested(meta_sk);
++ if (sock_owned_by_user(meta_sk))
++ skb->sk = sk;
++ } else {
++ meta_sk = sk;
++ bh_lock_sock_nested(sk);
++ }
++
+ ret = 0;
+- if (!sock_owned_by_user(sk)) {
++ if (!sock_owned_by_user(meta_sk)) {
+ #ifdef CONFIG_NET_DMA
+- struct tcp_sock *tp = tcp_sk(sk);
++ struct tcp_sock *tp = tcp_sk(meta_sk);
+ if (!tp->ucopy.dma_chan && tp->ucopy.pinned_list)
+ tp->ucopy.dma_chan = net_dma_find_channel();
+ if (tp->ucopy.dma_chan)
+@@ -1541,16 +1542,17 @@ process:
+ else
+ #endif
+ {
+- if (!tcp_prequeue(sk, skb))
++ if (!tcp_prequeue(meta_sk, skb))
+ ret = tcp_v6_do_rcv(sk, skb);
+ }
+- } else if (unlikely(sk_add_backlog(sk, skb,
+- sk->sk_rcvbuf + sk->sk_sndbuf))) {
+- bh_unlock_sock(sk);
++ } else if (unlikely(sk_add_backlog(meta_sk, skb,
++ meta_sk->sk_rcvbuf + meta_sk->sk_sndbuf))) {
++ bh_unlock_sock(meta_sk);
+ NET_INC_STATS_BH(net, LINUX_MIB_TCPBACKLOGDROP);
+ goto discard_and_relse;
+ }
+- bh_unlock_sock(sk);
++
++ bh_unlock_sock(meta_sk);
+
+ sock_put(sk);
+ return ret ? -1 : 0;
+@@ -1607,6 +1609,18 @@ do_time_wait:
+ sk = sk2;
+ goto process;
+ }
++#ifdef CONFIG_MPTCP
++ if (th->syn && !th->ack) {
++ int ret = mptcp_lookup_join(skb, inet_twsk(sk));
++
++ if (ret < 0) {
++ tcp_v6_send_reset(NULL, skb);
++ goto discard_it;
++ } else if (ret > 0) {
++ return 0;
++ }
++ }
++#endif
+ /* Fall through to ACK */
+ }
+ case TCP_TW_ACK:
+@@ -1657,7 +1671,7 @@ static void tcp_v6_early_demux(struct sk_buff *skb)
+ }
+ }
+
+-static struct timewait_sock_ops tcp6_timewait_sock_ops = {
++struct timewait_sock_ops tcp6_timewait_sock_ops = {
+ .twsk_obj_size = sizeof(struct tcp6_timewait_sock),
+ .twsk_unique = tcp_twsk_unique,
+ .twsk_destructor = tcp_twsk_destructor,
+@@ -1730,7 +1744,12 @@ static int tcp_v6_init_sock(struct sock *sk)
+
+ tcp_init_sock(sk);
+
+- icsk->icsk_af_ops = &ipv6_specific;
++#ifdef CONFIG_MPTCP
++ if (is_mptcp_enabled(sk))
++ icsk->icsk_af_ops = &mptcp_v6_specific;
++ else
++#endif
++ icsk->icsk_af_ops = &ipv6_specific;
+
+ #ifdef CONFIG_TCP_MD5SIG
+ tcp_sk(sk)->af_specific = &tcp_sock_ipv6_specific;
+@@ -1739,7 +1758,7 @@ static int tcp_v6_init_sock(struct sock *sk)
+ return 0;
+ }
+
+-static void tcp_v6_destroy_sock(struct sock *sk)
++void tcp_v6_destroy_sock(struct sock *sk)
+ {
+ tcp_v4_destroy_sock(sk);
+ inet6_destroy_sock(sk);
+@@ -1924,12 +1943,28 @@ void tcp6_proc_exit(struct net *net)
+ static void tcp_v6_clear_sk(struct sock *sk, int size)
+ {
+ struct inet_sock *inet = inet_sk(sk);
++#ifdef CONFIG_MPTCP
++ struct tcp_sock *tp = tcp_sk(sk);
++ /* size_tk_table goes from the end of tk_table to the end of sk */
++ int size_tk_table = size - offsetof(struct tcp_sock, tk_table) -
++ sizeof(tp->tk_table);
++#endif
+
+ /* we do not want to clear pinet6 field, because of RCU lookups */
+ sk_prot_clear_nulls(sk, offsetof(struct inet_sock, pinet6));
+
+ size -= offsetof(struct inet_sock, pinet6) + sizeof(inet->pinet6);
++
++#ifdef CONFIG_MPTCP
++ /* We zero out only from pinet6 to tk_table */
++ size -= size_tk_table + sizeof(tp->tk_table);
++#endif
+ memset(&inet->pinet6 + 1, 0, size);
++
++#ifdef CONFIG_MPTCP
++ memset((char *)&tp->tk_table + sizeof(tp->tk_table), 0, size_tk_table);
++#endif
++
+ }
+
+ struct proto tcpv6_prot = {
+diff --git a/net/mptcp/Kconfig b/net/mptcp/Kconfig
+new file mode 100644
+index 000000000000..cdfc03adabf8
+--- /dev/null
++++ b/net/mptcp/Kconfig
+@@ -0,0 +1,115 @@
++#
++# MPTCP configuration
++#
++config MPTCP
++ bool "MPTCP protocol"
++ depends on (IPV6=y || IPV6=n)
++ ---help---
++ This replaces the normal TCP stack with a Multipath TCP stack,
++ able to use several paths at once.
++
++menuconfig MPTCP_PM_ADVANCED
++ bool "MPTCP: advanced path-manager control"
++ depends on MPTCP=y
++ ---help---
++ Support for selection of different path-managers. You should choose 'Y' here,
++ because otherwise you will not actively create new MPTCP-subflows.
++
++if MPTCP_PM_ADVANCED
++
++config MPTCP_FULLMESH
++ tristate "MPTCP Full-Mesh Path-Manager"
++ depends on MPTCP=y
++ ---help---
++ This path-management module will create a full-mesh among all IP-addresses.
++
++config MPTCP_NDIFFPORTS
++ tristate "MPTCP ndiff-ports"
++ depends on MPTCP=y
++ ---help---
++ This path-management module will create multiple subflows between the same
++ pair of IP-addresses, modifying the source-port. You can set the number
++ of subflows via the mptcp_ndiffports-sysctl.
++
++config MPTCP_BINDER
++ tristate "MPTCP Binder"
++ depends on (MPTCP=y)
++ ---help---
++ This path-management module works like ndiffports, and adds the sysctl
++ option to set the gateway (and/or path to) per each additional subflow
++ via Loose Source Routing (IPv4 only).
++
++choice
++ prompt "Default MPTCP Path-Manager"
++ default DEFAULT
++ help
++ Select the Path-Manager of your choice
++
++ config DEFAULT_FULLMESH
++ bool "Full mesh" if MPTCP_FULLMESH=y
++
++ config DEFAULT_NDIFFPORTS
++ bool "ndiff-ports" if MPTCP_NDIFFPORTS=y
++
++ config DEFAULT_BINDER
++ bool "binder" if MPTCP_BINDER=y
++
++ config DEFAULT_DUMMY
++ bool "Default"
++
++endchoice
++
++endif
++
++config DEFAULT_MPTCP_PM
++ string
++ default "default" if DEFAULT_DUMMY
++ default "fullmesh" if DEFAULT_FULLMESH
++ default "ndiffports" if DEFAULT_NDIFFPORTS
++ default "binder" if DEFAULT_BINDER
++ default "default"
++
++menuconfig MPTCP_SCHED_ADVANCED
++ bool "MPTCP: advanced scheduler control"
++ depends on MPTCP=y
++ ---help---
++ Support for selection of different schedulers. You should choose 'Y' here,
++ if you want to choose a different scheduler than the default one.
++
++if MPTCP_SCHED_ADVANCED
++
++config MPTCP_ROUNDROBIN
++ tristate "MPTCP Round-Robin"
++ depends on (MPTCP=y)
++ ---help---
++	  This is a very simple round-robin scheduler. It probably has bad
++	  performance, but it might be interesting for researchers.
++
++choice
++ prompt "Default MPTCP Scheduler"
++ default DEFAULT
++ help
++ Select the Scheduler of your choice
++
++ config DEFAULT_SCHEDULER
++ bool "Default"
++ ---help---
++ This is the default scheduler, sending first on the subflow
++ with the lowest RTT.
++
++ config DEFAULT_ROUNDROBIN
++ bool "Round-Robin" if MPTCP_ROUNDROBIN=y
++ ---help---
++	  This is the round-robin scheduler, sending in a round-robin
++	  fashion.
++
++endchoice
++endif
++
++config DEFAULT_MPTCP_SCHED
++ string
++ depends on (MPTCP=y)
++ default "default" if DEFAULT_SCHEDULER
++ default "roundrobin" if DEFAULT_ROUNDROBIN
++ default "default"
++
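As an aside on the `DEFAULT_MPTCP_PM` entry above: Kconfig string defaults resolve top-down, so the first `default` line whose condition holds determines the value. A minimal C sketch of that resolution order (the function and parameter names are illustrative, not part of the patch):

```c
#include <stdbool.h>

/* Illustrative only: mirrors the top-down resolution of the
 * DEFAULT_MPTCP_PM Kconfig entry, where the first matching
 * "default" line determines the resulting string.
 */
static const char *default_mptcp_pm(bool dummy, bool fullmesh,
				    bool ndiffports, bool binder)
{
	if (dummy)
		return "default";
	if (fullmesh)
		return "fullmesh";
	if (ndiffports)
		return "ndiffports";
	if (binder)
		return "binder";
	return "default";	/* final fallback */
}
```

Note that `DEFAULT_DUMMY` is checked first, so selecting it wins even if other path-managers are also built in.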
+diff --git a/net/mptcp/Makefile b/net/mptcp/Makefile
+new file mode 100644
+index 000000000000..35561a7012e3
+--- /dev/null
++++ b/net/mptcp/Makefile
+@@ -0,0 +1,20 @@
++#
++## Makefile for MultiPath TCP support code.
++#
++#
++
++obj-$(CONFIG_MPTCP) += mptcp.o
++
++mptcp-y := mptcp_ctrl.o mptcp_ipv4.o mptcp_ofo_queue.o mptcp_pm.o \
++ mptcp_output.o mptcp_input.o mptcp_sched.o
++
++obj-$(CONFIG_TCP_CONG_COUPLED) += mptcp_coupled.o
++obj-$(CONFIG_TCP_CONG_OLIA) += mptcp_olia.o
++obj-$(CONFIG_TCP_CONG_WVEGAS) += mptcp_wvegas.o
++obj-$(CONFIG_MPTCP_FULLMESH) += mptcp_fullmesh.o
++obj-$(CONFIG_MPTCP_NDIFFPORTS) += mptcp_ndiffports.o
++obj-$(CONFIG_MPTCP_BINDER) += mptcp_binder.o
++obj-$(CONFIG_MPTCP_ROUNDROBIN) += mptcp_rr.o
++
++mptcp-$(subst m,y,$(CONFIG_IPV6)) += mptcp_ipv6.o
++
+diff --git a/net/mptcp/mptcp_binder.c b/net/mptcp/mptcp_binder.c
+new file mode 100644
+index 000000000000..95d8da560715
+--- /dev/null
++++ b/net/mptcp/mptcp_binder.c
+@@ -0,0 +1,487 @@
++#include <linux/module.h>
++
++#include <net/mptcp.h>
++#include <net/mptcp_v4.h>
++
++#include <linux/route.h>
++#include <linux/inet.h>
++#include <linux/mroute.h>
++#include <linux/spinlock_types.h>
++#include <net/inet_ecn.h>
++#include <net/route.h>
++#include <net/xfrm.h>
++#include <net/compat.h>
++#include <linux/slab.h>
++
++#define MPTCP_GW_MAX_LISTS 10
++#define MPTCP_GW_LIST_MAX_LEN 6
++#define MPTCP_GW_SYSCTL_MAX_LEN (15 * MPTCP_GW_LIST_MAX_LEN * \
++ MPTCP_GW_MAX_LISTS)
++
++struct mptcp_gw_list {
++ struct in_addr list[MPTCP_GW_MAX_LISTS][MPTCP_GW_LIST_MAX_LEN];
++ u8 len[MPTCP_GW_MAX_LISTS];
++};
++
++struct binder_priv {
++ /* Worker struct for subflow establishment */
++ struct work_struct subflow_work;
++
++ struct mptcp_cb *mpcb;
++
++ /* Prevent multiple sub-sockets concurrently iterating over sockets */
++ spinlock_t *flow_lock;
++};
++
++static struct mptcp_gw_list *mptcp_gws;
++static rwlock_t mptcp_gws_lock;
++
++static int mptcp_binder_ndiffports __read_mostly = 1;
++
++static char sysctl_mptcp_binder_gateways[MPTCP_GW_SYSCTL_MAX_LEN] __read_mostly;
++
++static int mptcp_get_avail_list_ipv4(struct sock *sk)
++{
++ int i, j, list_taken, opt_ret, opt_len;
++ unsigned char *opt_ptr, *opt_end_ptr, opt[MAX_IPOPTLEN];
++
++ for (i = 0; i < MPTCP_GW_MAX_LISTS; ++i) {
++ if (mptcp_gws->len[i] == 0)
++ goto error;
++
++ mptcp_debug("mptcp_get_avail_list_ipv4: List %i\n", i);
++ list_taken = 0;
++
++ /* Loop through all sub-sockets in this connection */
++ mptcp_for_each_sk(tcp_sk(sk)->mpcb, sk) {
++ mptcp_debug("mptcp_get_avail_list_ipv4: Next sock\n");
++
++ /* Reset length and options buffer, then retrieve
++ * from socket
++ */
++ opt_len = MAX_IPOPTLEN;
++ memset(opt, 0, MAX_IPOPTLEN);
++ opt_ret = ip_getsockopt(sk, IPPROTO_IP,
++ IP_OPTIONS, opt, &opt_len);
++ if (opt_ret < 0) {
++ mptcp_debug(KERN_ERR "%s: MPTCP subsocket getsockopt() IP_OPTIONS failed, error %d\n",
++ __func__, opt_ret);
++ goto error;
++ }
++
++ /* If socket has no options, it has no stake in this list */
++ if (opt_len <= 0)
++ continue;
++
++ /* Iterate options buffer */
++ for (opt_ptr = &opt[0]; opt_ptr < &opt[opt_len]; opt_ptr++) {
++ if (*opt_ptr == IPOPT_LSRR) {
++ mptcp_debug("mptcp_get_avail_list_ipv4: LSRR options found\n");
++ goto sock_lsrr;
++ }
++ }
++ continue;
++
++sock_lsrr:
++ /* Pointer to the 2nd to last address */
++ opt_end_ptr = opt_ptr+(*(opt_ptr+1))-4;
++
++ /* Addresses start 3 bytes after type offset */
++ opt_ptr += 3;
++ j = 0;
++
++ /* Different length lists cannot be the same */
++ if ((opt_end_ptr-opt_ptr)/4 != mptcp_gws->len[i])
++ continue;
++
++ /* Iterate if we are still inside options list
++ * and sysctl list
++ */
++ while (opt_ptr < opt_end_ptr && j < mptcp_gws->len[i]) {
++ /* If there is a different address, this list must
++ * not be set on this socket
++ */
++ if (memcmp(&mptcp_gws->list[i][j], opt_ptr, 4))
++ break;
++
++ /* Jump 4 bytes to next address */
++ opt_ptr += 4;
++ j++;
++ }
++
++ /* Reached the end without a differing address, lists
++ * are therefore identical.
++ */
++ if (j == mptcp_gws->len[i]) {
++ mptcp_debug("mptcp_get_avail_list_ipv4: List already used\n");
++ list_taken = 1;
++ break;
++ }
++ }
++
++ /* Free list found if not taken by a socket */
++ if (!list_taken) {
++ mptcp_debug("mptcp_get_avail_list_ipv4: List free\n");
++ break;
++ }
++ }
++
++ if (i >= MPTCP_GW_MAX_LISTS)
++ goto error;
++
++ return i;
++error:
++ return -1;
++}
++
++/* The list of addresses is parsed each time a new connection is opened,
++ * to make sure it's up to date. In case of error, all the lists are
++ * marked as unavailable and the subflow's fingerprint is set to 0.
++ */
++static void mptcp_v4_add_lsrr(struct sock *sk, struct in_addr addr)
++{
++ int i, j, ret;
++ unsigned char opt[MAX_IPOPTLEN] = {0};
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct binder_priv *fmp = (struct binder_priv *)&tp->mpcb->mptcp_pm[0];
++
++ /* Read lock: multiple sockets can read LSRR addresses at the same
++ * time, but writes are done in mutual exclusion.
++ * Spin lock: must search for free list for one socket at a time, or
++ * multiple sockets could take the same list.
++ */
++ read_lock(&mptcp_gws_lock);
++ spin_lock(fmp->flow_lock);
++
++ i = mptcp_get_avail_list_ipv4(sk);
++
++ /* Execution enters here only if a free path is found.
++ */
++ if (i >= 0) {
++ opt[0] = IPOPT_NOP;
++ opt[1] = IPOPT_LSRR;
++ opt[2] = sizeof(mptcp_gws->list[i][0].s_addr) *
++ (mptcp_gws->len[i] + 1) + 3;
++ opt[3] = IPOPT_MINOFF;
++ for (j = 0; j < mptcp_gws->len[i]; ++j)
++ memcpy(opt + 4 +
++ (j * sizeof(mptcp_gws->list[i][0].s_addr)),
++ &mptcp_gws->list[i][j].s_addr,
++ sizeof(mptcp_gws->list[i][0].s_addr));
++ /* Final destination must be part of IP_OPTIONS parameter. */
++ memcpy(opt + 4 + (j * sizeof(addr.s_addr)), &addr.s_addr,
++ sizeof(addr.s_addr));
++
++ /* setsockopt must be inside the lock, otherwise another
++ * subflow could fail to see that we have taken a list.
++ */
++ ret = ip_setsockopt(sk, IPPROTO_IP, IP_OPTIONS, opt,
++ 4 + sizeof(mptcp_gws->list[i][0].s_addr)
++ * (mptcp_gws->len[i] + 1));
++
++ if (ret < 0) {
++ mptcp_debug(KERN_ERR "%s: MPTCP subsock setsockopt() IP_OPTIONS failed, error %d\n",
++ __func__, ret);
++ }
++ }
++
++ spin_unlock(fmp->flow_lock);
++ read_unlock(&mptcp_gws_lock);
++
++ return;
++}
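The option buffer assembled above has a fixed layout: a NOP pad byte, the LSRR type, a length byte covering the gateway addresses plus the final destination, and the minimum pointer offset. A standalone userspace sketch of the same layout (constants re-defined locally so it compiles outside the kernel; the helper name is invented):

```c
#include <stdint.h>
#include <string.h>

#define IPOPT_NOP	1
#define IPOPT_LSRR	131
#define IPOPT_MINOFF	4

/* Illustrative sketch: build an LSRR IP option holding n_gw gateway
 * addresses followed by the final destination, matching the layout
 * used by mptcp_v4_add_lsrr(). Returns the total option length.
 */
static int build_lsrr(uint8_t *opt, const uint32_t *gws, int n_gw,
		      uint32_t final_dst)
{
	int j;

	opt[0] = IPOPT_NOP;
	opt[1] = IPOPT_LSRR;
	opt[2] = 4 * (n_gw + 1) + 3;	/* type + len + ptr + addresses */
	opt[3] = IPOPT_MINOFF;
	for (j = 0; j < n_gw; j++)
		memcpy(opt + 4 + 4 * j, &gws[j], 4);
	/* The final destination must be part of the LSRR list too. */
	memcpy(opt + 4 + 4 * j, &final_dst, 4);
	return 4 + 4 * (n_gw + 1);
}
```

With two gateways the option is 16 bytes long and the length byte reads 15, exactly what the `opt[2]` computation in the patch produces.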
++
++/* Parses the gateways string for a list of paths to different
++ * gateways, and stores them for use with the Loose Source Routing (LSRR)
++ * socket option. Each list must have "," separated addresses, and the lists
++ * themselves must be separated by "-". Returns -ENOMEM on allocation failure
++ * and -1 in case one or more of the addresses is not a valid IPv4 address.
++ */
++static int mptcp_parse_gateway_ipv4(char *gateways)
++{
++ int i, j, k, ret;
++ char *tmp_string = NULL;
++ struct in_addr tmp_addr;
++
++ tmp_string = kzalloc(16, GFP_KERNEL);
++ if (tmp_string == NULL)
++ return -ENOMEM;
++
++ write_lock(&mptcp_gws_lock);
++
++ memset(mptcp_gws, 0, sizeof(struct mptcp_gw_list));
++
++	/* A temporary string is used since in4_pton needs a null-terminated string
++ * but we do not want to modify the sysctl for obvious reasons.
++ * i will iterate over the SYSCTL string, j will iterate over the
++ * temporary string where each IP is copied into, k will iterate over
++ * the IPs in each list.
++ */
++ for (i = j = k = 0;
++ i < MPTCP_GW_SYSCTL_MAX_LEN && k < MPTCP_GW_MAX_LISTS;
++ ++i) {
++ if (gateways[i] == '-' || gateways[i] == ',' || gateways[i] == '\0') {
++ /* If the temp IP is empty and the current list is
++ * empty, we are done.
++ */
++ if (j == 0 && mptcp_gws->len[k] == 0)
++ break;
++
++ /* Terminate the temp IP string, then if it is
++ * non-empty parse the IP and copy it.
++ */
++ tmp_string[j] = '\0';
++ if (j > 0) {
++ mptcp_debug("mptcp_parse_gateway_list tmp: %s i: %d\n", tmp_string, i);
++
++ ret = in4_pton(tmp_string, strlen(tmp_string),
++ (u8 *)&tmp_addr.s_addr, '\0',
++ NULL);
++
++ if (ret) {
++ mptcp_debug("mptcp_parse_gateway_list ret: %d s_addr: %pI4\n",
++ ret,
++ &tmp_addr.s_addr);
++ memcpy(&mptcp_gws->list[k][mptcp_gws->len[k]].s_addr,
++ &tmp_addr.s_addr,
++ sizeof(tmp_addr.s_addr));
++ mptcp_gws->len[k]++;
++ j = 0;
++ tmp_string[j] = '\0';
++					/* Since we can't impose a limit on
++ * what the user can input, make sure
++ * there are not too many IPs in the
++ * SYSCTL string.
++ */
++ if (mptcp_gws->len[k] > MPTCP_GW_LIST_MAX_LEN) {
++ mptcp_debug("mptcp_parse_gateway_list too many members in list %i: max %i\n",
++ k,
++ MPTCP_GW_LIST_MAX_LEN);
++ goto error;
++ }
++ } else {
++ goto error;
++ }
++ }
++
++ if (gateways[i] == '-' || gateways[i] == '\0')
++ ++k;
++ } else {
++ tmp_string[j] = gateways[i];
++ ++j;
++ }
++ }
++
++ /* Number of flows is number of gateway lists plus master flow */
++ mptcp_binder_ndiffports = k+1;
++
++ write_unlock(&mptcp_gws_lock);
++ kfree(tmp_string);
++
++ return 0;
++
++error:
++ memset(mptcp_gws, 0, sizeof(struct mptcp_gw_list));
++ memset(gateways, 0, sizeof(char) * MPTCP_GW_SYSCTL_MAX_LEN);
++ write_unlock(&mptcp_gws_lock);
++ kfree(tmp_string);
++ return -1;
++}
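In other words, the sysctl accepts strings such as `10.0.0.1,10.0.0.2-192.168.1.1`: "," separates the hops inside one list, "-" separates the lists themselves. A simplified userspace sketch of the same grammar, using inet_pton instead of the kernel's in4_pton (names and bounds here are illustrative):

```c
#include <arpa/inet.h>
#include <string.h>

#define MAX_LISTS	10
#define MAX_LEN		6

struct gw_list {
	struct in_addr list[MAX_LISTS][MAX_LEN];
	int len[MAX_LISTS];
};

/* Illustrative parser for the mptcp_binder_gateways format:
 * "," separates addresses inside one list, "-" separates lists.
 * Returns the number of lists, or -1 on an invalid address.
 */
static int parse_gateways(const char *s, struct gw_list *gws)
{
	char tmp[16];
	int i, j = 0, k = 0;

	memset(gws, 0, sizeof(*gws));
	for (i = 0; ; i++) {
		char c = s[i];

		if (c == ',' || c == '-' || c == '\0') {
			/* Empty token and empty current list: end of input. */
			if (j == 0 && gws->len[k] == 0)
				break;
			tmp[j] = '\0';
			if (j > 0) {
				if (gws->len[k] >= MAX_LEN ||
				    inet_pton(AF_INET, tmp,
					      &gws->list[k][gws->len[k]]) != 1)
					return -1;
				gws->len[k]++;
				j = 0;
			}
			if (c == '-' || c == '\0') {
				k++;
				if (k >= MAX_LISTS)
					break;
			}
			if (c == '\0')
				break;
		} else if (j < (int)sizeof(tmp) - 1) {
			tmp[j++] = c;
		} else {
			return -1;	/* token too long for an IPv4 address */
		}
	}
	return k;
}
```

The example string above yields two lists: the first with two hops, the second with one, and the number of subflows becomes the list count plus the master flow, matching `mptcp_binder_ndiffports = k+1` in the patch.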
++
++/**
++ * Create all new subflows by calling mptcp_initX_subsockets
++ *
++ * This function uses a goto to next_subflow to allow releasing the lock
++ * between the creation of new subflows, giving other processes a chance to
++ * do some work on the socket and potentially finish the communication.
++ **/
++static void create_subflow_worker(struct work_struct *work)
++{
++ const struct binder_priv *pm_priv = container_of(work,
++ struct binder_priv,
++ subflow_work);
++ struct mptcp_cb *mpcb = pm_priv->mpcb;
++ struct sock *meta_sk = mpcb->meta_sk;
++ int iter = 0;
++
++next_subflow:
++ if (iter) {
++ release_sock(meta_sk);
++ mutex_unlock(&mpcb->mpcb_mutex);
++
++ cond_resched();
++ }
++ mutex_lock(&mpcb->mpcb_mutex);
++ lock_sock_nested(meta_sk, SINGLE_DEPTH_NESTING);
++
++ iter++;
++
++ if (sock_flag(meta_sk, SOCK_DEAD))
++ goto exit;
++
++ if (mpcb->master_sk &&
++ !tcp_sk(mpcb->master_sk)->mptcp->fully_established)
++ goto exit;
++
++ if (mptcp_binder_ndiffports > iter &&
++ mptcp_binder_ndiffports > mpcb->cnt_subflows) {
++ struct mptcp_loc4 loc;
++ struct mptcp_rem4 rem;
++
++ loc.addr.s_addr = inet_sk(meta_sk)->inet_saddr;
++ loc.loc4_id = 0;
++ loc.low_prio = 0;
++
++ rem.addr.s_addr = inet_sk(meta_sk)->inet_daddr;
++ rem.port = inet_sk(meta_sk)->inet_dport;
++ rem.rem4_id = 0; /* Default 0 */
++
++ mptcp_init4_subsockets(meta_sk, &loc, &rem);
++
++ goto next_subflow;
++ }
++
++exit:
++ release_sock(meta_sk);
++ mutex_unlock(&mpcb->mpcb_mutex);
++ sock_put(meta_sk);
++}
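The worker above follows a common kernel pattern: take the lock, do one unit of work, drop the lock and reschedule so other contexts can touch the socket, then re-acquire the lock and re-check the state before continuing. A hedged pthread analogue of that loop (all names invented, no relation to the actual socket locking API):

```c
#include <pthread.h>
#include <sched.h>

/* Illustrative analogue of create_subflow_worker(): add subflows up
 * to a target count, dropping the lock between iterations so other
 * threads can make progress on the shared connection state.
 */
struct conn {
	pthread_mutex_t lock;
	int subflows;
	int target;
	int dead;	/* analogue of SOCK_DEAD */
};

static void subflow_worker(struct conn *c)
{
	for (;;) {
		pthread_mutex_lock(&c->lock);
		/* Re-check state after every re-acquisition: it may
		 * have changed while the lock was dropped.
		 */
		if (c->dead || c->subflows >= c->target) {
			pthread_mutex_unlock(&c->lock);
			return;
		}
		c->subflows++;		/* "open one more subflow" */
		pthread_mutex_unlock(&c->lock);
		sched_yield();		/* analogue of cond_resched() */
	}
}
```

The re-check on every iteration is the important part: just as the kernel worker tests `SOCK_DEAD` and `fully_established` after re-locking, the sketch must not trust any state read before the lock was released.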
++
++static void binder_new_session(const struct sock *meta_sk)
++{
++ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct binder_priv *fmp = (struct binder_priv *)&mpcb->mptcp_pm[0];
++ static DEFINE_SPINLOCK(flow_lock);
++
++#if IS_ENABLED(CONFIG_IPV6)
++ if (meta_sk->sk_family == AF_INET6 &&
++ !mptcp_v6_is_v4_mapped(meta_sk)) {
++ mptcp_fallback_default(mpcb);
++ return;
++ }
++#endif
++
++ /* Initialize workqueue-struct */
++ INIT_WORK(&fmp->subflow_work, create_subflow_worker);
++ fmp->mpcb = mpcb;
++
++ fmp->flow_lock = &flow_lock;
++}
++
++static void binder_create_subflows(struct sock *meta_sk)
++{
++ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct binder_priv *pm_priv = (struct binder_priv *)&mpcb->mptcp_pm[0];
++
++ if (mpcb->infinite_mapping_snd || mpcb->infinite_mapping_rcv ||
++ mpcb->send_infinite_mapping ||
++ mpcb->server_side || sock_flag(meta_sk, SOCK_DEAD))
++ return;
++
++ if (!work_pending(&pm_priv->subflow_work)) {
++ sock_hold(meta_sk);
++ queue_work(mptcp_wq, &pm_priv->subflow_work);
++ }
++}
++
++static int binder_get_local_id(sa_family_t family, union inet_addr *addr,
++ struct net *net, bool *low_prio)
++{
++ return 0;
++}
++
++/* Callback function, executed when the sysctl net.mptcp.mptcp_binder_gateways
++ * is updated. Inspired by proc_tcp_congestion_control().
++ */
++static int proc_mptcp_gateways(ctl_table *ctl, int write,
++ void __user *buffer, size_t *lenp,
++ loff_t *ppos)
++{
++ int ret;
++ ctl_table tbl = {
++ .maxlen = MPTCP_GW_SYSCTL_MAX_LEN,
++ };
++
++ if (write) {
++ tbl.data = kzalloc(MPTCP_GW_SYSCTL_MAX_LEN, GFP_KERNEL);
++ if (tbl.data == NULL)
++			return -ENOMEM;
++ ret = proc_dostring(&tbl, write, buffer, lenp, ppos);
++ if (ret == 0) {
++ ret = mptcp_parse_gateway_ipv4(tbl.data);
++ memcpy(ctl->data, tbl.data, MPTCP_GW_SYSCTL_MAX_LEN);
++ }
++ kfree(tbl.data);
++ } else {
++ ret = proc_dostring(ctl, write, buffer, lenp, ppos);
++ }
++
++ return ret;
++}
++
++static struct mptcp_pm_ops binder __read_mostly = {
++ .new_session = binder_new_session,
++ .fully_established = binder_create_subflows,
++ .get_local_id = binder_get_local_id,
++ .init_subsocket_v4 = mptcp_v4_add_lsrr,
++ .name = "binder",
++ .owner = THIS_MODULE,
++};
++
++static struct ctl_table binder_table[] = {
++ {
++ .procname = "mptcp_binder_gateways",
++ .data = &sysctl_mptcp_binder_gateways,
++ .maxlen = sizeof(char) * MPTCP_GW_SYSCTL_MAX_LEN,
++ .mode = 0644,
++ .proc_handler = &proc_mptcp_gateways
++ },
++ { }
++};
++
++struct ctl_table_header *mptcp_sysctl_binder;
++
++/* General initialization of MPTCP_PM */
++static int __init binder_register(void)
++{
++ mptcp_gws = kzalloc(sizeof(*mptcp_gws), GFP_KERNEL);
++ if (!mptcp_gws)
++ return -ENOMEM;
++
++ rwlock_init(&mptcp_gws_lock);
++
++ BUILD_BUG_ON(sizeof(struct binder_priv) > MPTCP_PM_SIZE);
++
++ mptcp_sysctl_binder = register_net_sysctl(&init_net, "net/mptcp",
++ binder_table);
++ if (!mptcp_sysctl_binder)
++ goto sysctl_fail;
++
++ if (mptcp_register_path_manager(&binder))
++ goto pm_failed;
++
++ return 0;
++
++pm_failed:
++ unregister_net_sysctl_table(mptcp_sysctl_binder);
++sysctl_fail:
++ kfree(mptcp_gws);
++
++ return -1;
++}
++
++static void binder_unregister(void)
++{
++ mptcp_unregister_path_manager(&binder);
++ unregister_net_sysctl_table(mptcp_sysctl_binder);
++ kfree(mptcp_gws);
++}
++
++module_init(binder_register);
++module_exit(binder_unregister);
++
++MODULE_AUTHOR("Luca Boccassi, Duncan Eastoe, Christoph Paasch (ndiffports)");
++MODULE_LICENSE("GPL");
++MODULE_DESCRIPTION("BINDER MPTCP");
++MODULE_VERSION("0.1");
+diff --git a/net/mptcp/mptcp_coupled.c b/net/mptcp/mptcp_coupled.c
+new file mode 100644
+index 000000000000..5d761164eb85
+--- /dev/null
++++ b/net/mptcp/mptcp_coupled.c
+@@ -0,0 +1,270 @@
++/*
++ * MPTCP implementation - Linked Increase congestion control Algorithm (LIA)
++ *
++ * Initial Design & Implementation:
++ * Sébastien Barré <sebastien.barre@uclouvain.be>
++ *
++ * Current Maintainer & Author:
++ * Christoph Paasch <christoph.paasch@uclouvain.be>
++ *
++ * Additional authors:
++ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
++ * Gregory Detal <gregory.detal@uclouvain.be>
++ * Fabien Duchêne <fabien.duchene@uclouvain.be>
++ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
++ * Lavkesh Lahngir <lavkesh51@gmail.com>
++ * Andreas Ripke <ripke@neclab.eu>
++ * Vlad Dogaru <vlad.dogaru@intel.com>
++ * Octavian Purdila <octavian.purdila@intel.com>
++ * John Ronan <jronan@tssg.org>
++ * Catalin Nicutar <catalin.nicutar@gmail.com>
++ * Brandon Heller <brandonh@stanford.edu>
++ *
++ *
++ * This program is free software; you can redistribute it and/or
++ * modify it under the terms of the GNU General Public License
++ * as published by the Free Software Foundation; either version
++ * 2 of the License, or (at your option) any later version.
++ */
++#include <net/tcp.h>
++#include <net/mptcp.h>
++
++#include <linux/module.h>
++
++/* Scaling is done in the numerator with alpha_scale_num and in the denominator
++ * with alpha_scale_den.
++ *
++ * To downscale, we just need to use alpha_scale.
++ *
++ * We have: alpha_scale = alpha_scale_num / (alpha_scale_den ^ 2)
++ */
++static int alpha_scale_den = 10;
++static int alpha_scale_num = 32;
++static int alpha_scale = 12;
++
++struct mptcp_ccc {
++ u64 alpha;
++ bool forced_update;
++};
++
++static inline int mptcp_ccc_sk_can_send(const struct sock *sk)
++{
++ return mptcp_sk_can_send(sk) && tcp_sk(sk)->srtt_us;
++}
++
++static inline u64 mptcp_get_alpha(const struct sock *meta_sk)
++{
++ return ((struct mptcp_ccc *)inet_csk_ca(meta_sk))->alpha;
++}
++
++static inline void mptcp_set_alpha(const struct sock *meta_sk, u64 alpha)
++{
++ ((struct mptcp_ccc *)inet_csk_ca(meta_sk))->alpha = alpha;
++}
++
++static inline u64 mptcp_ccc_scale(u32 val, int scale)
++{
++ return (u64) val << scale;
++}
++
++static inline bool mptcp_get_forced(const struct sock *meta_sk)
++{
++ return ((struct mptcp_ccc *)inet_csk_ca(meta_sk))->forced_update;
++}
++
++static inline void mptcp_set_forced(const struct sock *meta_sk, bool force)
++{
++ ((struct mptcp_ccc *)inet_csk_ca(meta_sk))->forced_update = force;
++}
++
++static void mptcp_ccc_recalc_alpha(const struct sock *sk)
++{
++ const struct mptcp_cb *mpcb = tcp_sk(sk)->mpcb;
++ const struct sock *sub_sk;
++ int best_cwnd = 0, best_rtt = 0, can_send = 0;
++ u64 max_numerator = 0, sum_denominator = 0, alpha = 1;
++
++ if (!mpcb)
++ return;
++
++ /* Only one subflow left - fall back to normal reno-behavior
++ * (set alpha to 1)
++ */
++ if (mpcb->cnt_established <= 1)
++ goto exit;
++
++ /* Do regular alpha-calculation for multiple subflows */
++
++ /* Find the max numerator of the alpha-calculation */
++ mptcp_for_each_sk(mpcb, sub_sk) {
++ struct tcp_sock *sub_tp = tcp_sk(sub_sk);
++ u64 tmp;
++
++ if (!mptcp_ccc_sk_can_send(sub_sk))
++ continue;
++
++ can_send++;
++
++		/* We need to look for the path that provides the max value.
++ * Integer-overflow is not possible here, because
++ * tmp will be in u64.
++ */
++ tmp = div64_u64(mptcp_ccc_scale(sub_tp->snd_cwnd,
++ alpha_scale_num), (u64)sub_tp->srtt_us * sub_tp->srtt_us);
++
++ if (tmp >= max_numerator) {
++ max_numerator = tmp;
++ best_cwnd = sub_tp->snd_cwnd;
++ best_rtt = sub_tp->srtt_us;
++ }
++ }
++
++ /* No subflow is able to send - we don't care anymore */
++ if (unlikely(!can_send))
++ goto exit;
++
++ /* Calculate the denominator */
++ mptcp_for_each_sk(mpcb, sub_sk) {
++ struct tcp_sock *sub_tp = tcp_sk(sub_sk);
++
++ if (!mptcp_ccc_sk_can_send(sub_sk))
++ continue;
++
++ sum_denominator += div_u64(
++ mptcp_ccc_scale(sub_tp->snd_cwnd,
++ alpha_scale_den) * best_rtt,
++ sub_tp->srtt_us);
++ }
++ sum_denominator *= sum_denominator;
++ if (unlikely(!sum_denominator)) {
++ pr_err("%s: sum_denominator == 0, cnt_established:%d\n",
++ __func__, mpcb->cnt_established);
++ mptcp_for_each_sk(mpcb, sub_sk) {
++ struct tcp_sock *sub_tp = tcp_sk(sub_sk);
++			pr_err("%s: pi:%d, state:%d, rtt:%u, cwnd: %u\n",
++ __func__, sub_tp->mptcp->path_index,
++ sub_sk->sk_state, sub_tp->srtt_us,
++ sub_tp->snd_cwnd);
++ }
++ }
++
++ alpha = div64_u64(mptcp_ccc_scale(best_cwnd, alpha_scale_num), sum_denominator);
++
++ if (unlikely(!alpha))
++ alpha = 1;
++
++exit:
++ mptcp_set_alpha(mptcp_meta_sk(sk), alpha);
++}
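With alpha_scale_num = 32 and alpha_scale_den = 10, the recalculation above is plain 64-bit fixed-point arithmetic. A self-contained sketch for a static set of subflows (the cwnd/RTT inputs are assumed example values, not kernel state):

```c
#include <stdint.h>

#define ALPHA_SCALE_NUM	32
#define ALPHA_SCALE_DEN	10

/* Illustrative: recompute the LIA alpha for n subflows given their
 * congestion windows and smoothed RTTs (us), mirroring
 * mptcp_ccc_recalc_alpha(): alpha = (best_cwnd << num) / denom^2,
 * where denom sums (cwnd << den) * best_rtt / rtt over subflows.
 */
static uint64_t lia_alpha(const uint32_t *cwnd, const uint32_t *rtt, int n)
{
	uint64_t max_num = 0, denom = 0, tmp;
	uint32_t best_cwnd = 0, best_rtt = 0;
	int i;

	if (n <= 1)
		return 1;	/* single subflow: plain Reno behaviour */

	/* Find the subflow maximising cwnd / rtt^2 (scaled). */
	for (i = 0; i < n; i++) {
		tmp = ((uint64_t)cwnd[i] << ALPHA_SCALE_NUM) /
		      ((uint64_t)rtt[i] * rtt[i]);
		if (tmp >= max_num) {
			max_num = tmp;
			best_cwnd = cwnd[i];
			best_rtt = rtt[i];
		}
	}
	/* Sum the denominator, then square it. */
	for (i = 0; i < n; i++)
		denom += ((uint64_t)cwnd[i] << ALPHA_SCALE_DEN) *
			 best_rtt / rtt[i];
	denom *= denom;

	tmp = ((uint64_t)best_cwnd << ALPHA_SCALE_NUM) / denom;
	return tmp ? tmp : 1;
}
```

For two identical subflows (cwnd 10, RTT 100 us each) this yields alpha = 102, i.e. roughly 102/4096 of a Reno increase once downscaled by alpha_scale = 12, which is how the coupled algorithm throttles each subflow relative to plain TCP.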
++
++static void mptcp_ccc_init(struct sock *sk)
++{
++ if (mptcp(tcp_sk(sk))) {
++ mptcp_set_forced(mptcp_meta_sk(sk), 0);
++ mptcp_set_alpha(mptcp_meta_sk(sk), 1);
++ }
++	/* If we are not doing MPTCP, behave like Reno: just return */
++}
++
++static void mptcp_ccc_cwnd_event(struct sock *sk, enum tcp_ca_event event)
++{
++ if (event == CA_EVENT_LOSS)
++ mptcp_ccc_recalc_alpha(sk);
++}
++
++static void mptcp_ccc_set_state(struct sock *sk, u8 ca_state)
++{
++ if (!mptcp(tcp_sk(sk)))
++ return;
++
++ mptcp_set_forced(mptcp_meta_sk(sk), 1);
++}
++
++static void mptcp_ccc_cong_avoid(struct sock *sk, u32 ack, u32 acked)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ const struct mptcp_cb *mpcb = tp->mpcb;
++ int snd_cwnd;
++
++ if (!mptcp(tp)) {
++ tcp_reno_cong_avoid(sk, ack, acked);
++ return;
++ }
++
++ if (!tcp_is_cwnd_limited(sk))
++ return;
++
++ if (tp->snd_cwnd <= tp->snd_ssthresh) {
++ /* In "safe" area, increase. */
++ tcp_slow_start(tp, acked);
++ mptcp_ccc_recalc_alpha(sk);
++ return;
++ }
++
++ if (mptcp_get_forced(mptcp_meta_sk(sk))) {
++ mptcp_ccc_recalc_alpha(sk);
++ mptcp_set_forced(mptcp_meta_sk(sk), 0);
++ }
++
++ if (mpcb->cnt_established > 1) {
++ u64 alpha = mptcp_get_alpha(mptcp_meta_sk(sk));
++
++		/* This may happen if, at initialization, the mpcb
++ * was not yet attached to the sock, and thus
++ * initializing alpha failed.
++ */
++ if (unlikely(!alpha))
++ alpha = 1;
++
++		snd_cwnd = (int)div_u64((u64)mptcp_ccc_scale(1, alpha_scale),
++					alpha);
++
++ /* snd_cwnd_cnt >= max (scale * tot_cwnd / alpha, cwnd)
++ * Thus, we select here the max value.
++ */
++ if (snd_cwnd < tp->snd_cwnd)
++ snd_cwnd = tp->snd_cwnd;
++ } else {
++ snd_cwnd = tp->snd_cwnd;
++ }
++
++ if (tp->snd_cwnd_cnt >= snd_cwnd) {
++ if (tp->snd_cwnd < tp->snd_cwnd_clamp) {
++ tp->snd_cwnd++;
++ mptcp_ccc_recalc_alpha(sk);
++ }
++
++ tp->snd_cwnd_cnt = 0;
++ } else {
++ tp->snd_cwnd_cnt++;
++ }
++}
++
++static struct tcp_congestion_ops mptcp_ccc = {
++ .init = mptcp_ccc_init,
++ .ssthresh = tcp_reno_ssthresh,
++ .cong_avoid = mptcp_ccc_cong_avoid,
++ .cwnd_event = mptcp_ccc_cwnd_event,
++ .set_state = mptcp_ccc_set_state,
++ .owner = THIS_MODULE,
++ .name = "lia",
++};
++
++static int __init mptcp_ccc_register(void)
++{
++ BUILD_BUG_ON(sizeof(struct mptcp_ccc) > ICSK_CA_PRIV_SIZE);
++ return tcp_register_congestion_control(&mptcp_ccc);
++}
++
++static void __exit mptcp_ccc_unregister(void)
++{
++ tcp_unregister_congestion_control(&mptcp_ccc);
++}
++
++module_init(mptcp_ccc_register);
++module_exit(mptcp_ccc_unregister);
++
++MODULE_AUTHOR("Christoph Paasch, Sébastien Barré");
++MODULE_LICENSE("GPL");
++MODULE_DESCRIPTION("MPTCP LINKED INCREASE CONGESTION CONTROL ALGORITHM");
++MODULE_VERSION("0.1");
+diff --git a/net/mptcp/mptcp_ctrl.c b/net/mptcp/mptcp_ctrl.c
+new file mode 100644
+index 000000000000..28dfa0479f5e
+--- /dev/null
++++ b/net/mptcp/mptcp_ctrl.c
+@@ -0,0 +1,2401 @@
++/*
++ * MPTCP implementation - MPTCP-control
++ *
++ * Initial Design & Implementation:
++ * Sébastien Barré <sebastien.barre@uclouvain.be>
++ *
++ * Current Maintainer & Author:
++ * Christoph Paasch <christoph.paasch@uclouvain.be>
++ *
++ * Additional authors:
++ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
++ * Gregory Detal <gregory.detal@uclouvain.be>
++ * Fabien Duchêne <fabien.duchene@uclouvain.be>
++ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
++ * Lavkesh Lahngir <lavkesh51@gmail.com>
++ * Andreas Ripke <ripke@neclab.eu>
++ * Vlad Dogaru <vlad.dogaru@intel.com>
++ * Octavian Purdila <octavian.purdila@intel.com>
++ * John Ronan <jronan@tssg.org>
++ * Catalin Nicutar <catalin.nicutar@gmail.com>
++ * Brandon Heller <brandonh@stanford.edu>
++ *
++ *
++ * This program is free software; you can redistribute it and/or
++ * modify it under the terms of the GNU General Public License
++ * as published by the Free Software Foundation; either version
++ * 2 of the License, or (at your option) any later version.
++ */
++
++#include <net/inet_common.h>
++#include <net/inet6_hashtables.h>
++#include <net/ipv6.h>
++#include <net/ip6_checksum.h>
++#include <net/mptcp.h>
++#include <net/mptcp_v4.h>
++#if IS_ENABLED(CONFIG_IPV6)
++#include <net/ip6_route.h>
++#include <net/mptcp_v6.h>
++#endif
++#include <net/sock.h>
++#include <net/tcp.h>
++#include <net/tcp_states.h>
++#include <net/transp_v6.h>
++#include <net/xfrm.h>
++
++#include <linux/cryptohash.h>
++#include <linux/kconfig.h>
++#include <linux/module.h>
++#include <linux/netpoll.h>
++#include <linux/list.h>
++#include <linux/jhash.h>
++#include <linux/tcp.h>
++#include <linux/net.h>
++#include <linux/in.h>
++#include <linux/random.h>
++#include <linux/inetdevice.h>
++#include <linux/workqueue.h>
++#include <linux/atomic.h>
++#include <linux/sysctl.h>
++
++static struct kmem_cache *mptcp_sock_cache __read_mostly;
++static struct kmem_cache *mptcp_cb_cache __read_mostly;
++static struct kmem_cache *mptcp_tw_cache __read_mostly;
++
++int sysctl_mptcp_enabled __read_mostly = 1;
++int sysctl_mptcp_checksum __read_mostly = 1;
++int sysctl_mptcp_debug __read_mostly;
++EXPORT_SYMBOL(sysctl_mptcp_debug);
++int sysctl_mptcp_syn_retries __read_mostly = 3;
++
++bool mptcp_init_failed __read_mostly;
++
++struct static_key mptcp_static_key = STATIC_KEY_INIT_FALSE;
++EXPORT_SYMBOL(mptcp_static_key);
++
++static int proc_mptcp_path_manager(ctl_table *ctl, int write,
++ void __user *buffer, size_t *lenp,
++ loff_t *ppos)
++{
++ char val[MPTCP_PM_NAME_MAX];
++ ctl_table tbl = {
++ .data = val,
++ .maxlen = MPTCP_PM_NAME_MAX,
++ };
++ int ret;
++
++ mptcp_get_default_path_manager(val);
++
++ ret = proc_dostring(&tbl, write, buffer, lenp, ppos);
++ if (write && ret == 0)
++ ret = mptcp_set_default_path_manager(val);
++ return ret;
++}
++
++static int proc_mptcp_scheduler(ctl_table *ctl, int write,
++ void __user *buffer, size_t *lenp,
++ loff_t *ppos)
++{
++ char val[MPTCP_SCHED_NAME_MAX];
++ ctl_table tbl = {
++ .data = val,
++ .maxlen = MPTCP_SCHED_NAME_MAX,
++ };
++ int ret;
++
++ mptcp_get_default_scheduler(val);
++
++ ret = proc_dostring(&tbl, write, buffer, lenp, ppos);
++ if (write && ret == 0)
++ ret = mptcp_set_default_scheduler(val);
++ return ret;
++}
++
++static struct ctl_table mptcp_table[] = {
++ {
++ .procname = "mptcp_enabled",
++ .data = &sysctl_mptcp_enabled,
++ .maxlen = sizeof(int),
++ .mode = 0644,
++ .proc_handler = &proc_dointvec
++ },
++ {
++ .procname = "mptcp_checksum",
++ .data = &sysctl_mptcp_checksum,
++ .maxlen = sizeof(int),
++ .mode = 0644,
++ .proc_handler = &proc_dointvec
++ },
++ {
++ .procname = "mptcp_debug",
++ .data = &sysctl_mptcp_debug,
++ .maxlen = sizeof(int),
++ .mode = 0644,
++ .proc_handler = &proc_dointvec
++ },
++ {
++ .procname = "mptcp_syn_retries",
++ .data = &sysctl_mptcp_syn_retries,
++ .maxlen = sizeof(int),
++ .mode = 0644,
++ .proc_handler = &proc_dointvec
++ },
++ {
++ .procname = "mptcp_path_manager",
++ .mode = 0644,
++ .maxlen = MPTCP_PM_NAME_MAX,
++ .proc_handler = proc_mptcp_path_manager,
++ },
++ {
++ .procname = "mptcp_scheduler",
++ .mode = 0644,
++ .maxlen = MPTCP_SCHED_NAME_MAX,
++ .proc_handler = proc_mptcp_scheduler,
++ },
++ { }
++};
++
++static inline u32 mptcp_hash_tk(u32 token)
++{
++ return token % MPTCP_HASH_SIZE;
++}
++
++struct hlist_nulls_head tk_hashtable[MPTCP_HASH_SIZE];
++EXPORT_SYMBOL(tk_hashtable);
++
++/* This second hashtable is needed to retrieve request socks
++ * created as a result of a join request. While the SYN contains
++ * the token, the final ack does not, so we need a separate hashtable
++ * to retrieve the mpcb.
++ */
++struct hlist_nulls_head mptcp_reqsk_htb[MPTCP_HASH_SIZE];
++spinlock_t mptcp_reqsk_hlock; /* hashtable protection */
++
++/* The following hash table is used to avoid collision of token */
++static struct hlist_nulls_head mptcp_reqsk_tk_htb[MPTCP_HASH_SIZE];
++spinlock_t mptcp_tk_hashlock; /* hashtable protection */
++
++static bool mptcp_reqsk_find_tk(const u32 token)
++{
++ const u32 hash = mptcp_hash_tk(token);
++ const struct mptcp_request_sock *mtreqsk;
++ const struct hlist_nulls_node *node;
++
++begin:
++ hlist_nulls_for_each_entry_rcu(mtreqsk, node,
++ &mptcp_reqsk_tk_htb[hash], hash_entry) {
++ if (token == mtreqsk->mptcp_loc_token)
++ return true;
++ }
++ /* A request-socket is destroyed by RCU. So, it might have been recycled
++ * and put into another hash-table list. So, after the lookup we may
++ * end up in a different list. So, we may need to restart.
++ *
++ * See also the comment in __inet_lookup_established.
++ */
++ if (get_nulls_value(node) != hash)
++ goto begin;
++ return false;
++}
++
++static void mptcp_reqsk_insert_tk(struct request_sock *reqsk, const u32 token)
++{
++ u32 hash = mptcp_hash_tk(token);
++
++ hlist_nulls_add_head_rcu(&mptcp_rsk(reqsk)->hash_entry,
++ &mptcp_reqsk_tk_htb[hash]);
++}
++
++static void mptcp_reqsk_remove_tk(const struct request_sock *reqsk)
++{
++ rcu_read_lock();
++ spin_lock(&mptcp_tk_hashlock);
++ hlist_nulls_del_init_rcu(&mptcp_rsk(reqsk)->hash_entry);
++ spin_unlock(&mptcp_tk_hashlock);
++ rcu_read_unlock();
++}
++
++void mptcp_reqsk_destructor(struct request_sock *req)
++{
++ if (!mptcp_rsk(req)->is_sub) {
++ if (in_softirq()) {
++ mptcp_reqsk_remove_tk(req);
++ } else {
++ rcu_read_lock_bh();
++ spin_lock(&mptcp_tk_hashlock);
++ hlist_nulls_del_init_rcu(&mptcp_rsk(req)->hash_entry);
++ spin_unlock(&mptcp_tk_hashlock);
++ rcu_read_unlock_bh();
++ }
++ } else {
++ mptcp_hash_request_remove(req);
++ }
++}
++
++static void __mptcp_hash_insert(struct tcp_sock *meta_tp, const u32 token)
++{
++ u32 hash = mptcp_hash_tk(token);
++ hlist_nulls_add_head_rcu(&meta_tp->tk_table, &tk_hashtable[hash]);
++ meta_tp->inside_tk_table = 1;
++}
++
++static bool mptcp_find_token(u32 token)
++{
++ const u32 hash = mptcp_hash_tk(token);
++ const struct tcp_sock *meta_tp;
++ const struct hlist_nulls_node *node;
++
++begin:
++ hlist_nulls_for_each_entry_rcu(meta_tp, node, &tk_hashtable[hash], tk_table) {
++ if (token == meta_tp->mptcp_loc_token)
++ return true;
++ }
++ /* A TCP-socket is destroyed by RCU. So, it might have been recycled
++ * and put into another hash-table list. So, after the lookup we may
++ * end up in a different list. So, we may need to restart.
++ *
++ * See also the comment in __inet_lookup_established.
++ */
++ if (get_nulls_value(node) != hash)
++ goto begin;
++ return false;
++}
++
++static void mptcp_set_key_reqsk(struct request_sock *req,
++ const struct sk_buff *skb)
++{
++ const struct inet_request_sock *ireq = inet_rsk(req);
++ struct mptcp_request_sock *mtreq = mptcp_rsk(req);
++
++ if (skb->protocol == htons(ETH_P_IP)) {
++ mtreq->mptcp_loc_key = mptcp_v4_get_key(ip_hdr(skb)->saddr,
++ ip_hdr(skb)->daddr,
++ htons(ireq->ir_num),
++ ireq->ir_rmt_port);
++#if IS_ENABLED(CONFIG_IPV6)
++ } else {
++ mtreq->mptcp_loc_key = mptcp_v6_get_key(ipv6_hdr(skb)->saddr.s6_addr32,
++ ipv6_hdr(skb)->daddr.s6_addr32,
++ htons(ireq->ir_num),
++ ireq->ir_rmt_port);
++#endif
++ }
++
++ mptcp_key_sha1(mtreq->mptcp_loc_key, &mtreq->mptcp_loc_token, NULL);
++}
++
++/* New MPTCP-connection request, prepare a new token for the meta-socket that
++ * will be created in mptcp_check_req_master(), and store the received token.
++ */
++void mptcp_reqsk_new_mptcp(struct request_sock *req,
++ const struct mptcp_options_received *mopt,
++ const struct sk_buff *skb)
++{
++ struct mptcp_request_sock *mtreq = mptcp_rsk(req);
++
++ inet_rsk(req)->saw_mpc = 1;
++
++ rcu_read_lock();
++ spin_lock(&mptcp_tk_hashlock);
++ do {
++ mptcp_set_key_reqsk(req, skb);
++ } while (mptcp_reqsk_find_tk(mtreq->mptcp_loc_token) ||
++ mptcp_find_token(mtreq->mptcp_loc_token));
++
++ mptcp_reqsk_insert_tk(req, mtreq->mptcp_loc_token);
++ spin_unlock(&mptcp_tk_hashlock);
++ rcu_read_unlock();
++ mtreq->mptcp_rem_key = mopt->mptcp_key;
++}
++
++static void mptcp_set_key_sk(const struct sock *sk)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ const struct inet_sock *isk = inet_sk(sk);
++
++ if (sk->sk_family == AF_INET)
++ tp->mptcp_loc_key = mptcp_v4_get_key(isk->inet_saddr,
++ isk->inet_daddr,
++ isk->inet_sport,
++ isk->inet_dport);
++#if IS_ENABLED(CONFIG_IPV6)
++ else
++ tp->mptcp_loc_key = mptcp_v6_get_key(inet6_sk(sk)->saddr.s6_addr32,
++ sk->sk_v6_daddr.s6_addr32,
++ isk->inet_sport,
++ isk->inet_dport);
++#endif
++
++ mptcp_key_sha1(tp->mptcp_loc_key,
++ &tp->mptcp_loc_token, NULL);
++}
++
++void mptcp_connect_init(struct sock *sk)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++
++ rcu_read_lock_bh();
++ spin_lock(&mptcp_tk_hashlock);
++ do {
++ mptcp_set_key_sk(sk);
++ } while (mptcp_reqsk_find_tk(tp->mptcp_loc_token) ||
++ mptcp_find_token(tp->mptcp_loc_token));
++
++ __mptcp_hash_insert(tp, tp->mptcp_loc_token);
++ spin_unlock(&mptcp_tk_hashlock);
++ rcu_read_unlock_bh();
++}
++
++/**
++ * This function increments the refcount of the mpcb struct.
++ * It is the responsibility of the caller to decrement when releasing
++ * the structure.
++ */
++struct sock *mptcp_hash_find(const struct net *net, const u32 token)
++{
++ const u32 hash = mptcp_hash_tk(token);
++ const struct tcp_sock *meta_tp;
++ struct sock *meta_sk = NULL;
++ const struct hlist_nulls_node *node;
++
++ rcu_read_lock();
++begin:
++ hlist_nulls_for_each_entry_rcu(meta_tp, node, &tk_hashtable[hash],
++ tk_table) {
++ meta_sk = (struct sock *)meta_tp;
++ if (token == meta_tp->mptcp_loc_token &&
++ net_eq(net, sock_net(meta_sk))) {
++ if (unlikely(!atomic_inc_not_zero(&meta_sk->sk_refcnt)))
++ goto out;
++ if (unlikely(token != meta_tp->mptcp_loc_token ||
++ !net_eq(net, sock_net(meta_sk)))) {
++ sock_gen_put(meta_sk);
++ goto begin;
++ }
++ goto found;
++ }
++ }
++ /* A TCP-socket is destroyed by RCU, so it might have been recycled
++ * and put onto another hash-table list. After the lookup we may thus
++ * end up in a different list and may need to restart.
++ *
++ * See also the comment in __inet_lookup_established.
++ */
++ if (get_nulls_value(node) != hash)
++ goto begin;
++out:
++ meta_sk = NULL;
++found:
++ rcu_read_unlock();
++ return meta_sk;
++}
++
++void mptcp_hash_remove_bh(struct tcp_sock *meta_tp)
++{
++ /* remove from the token hashtable */
++ rcu_read_lock_bh();
++ spin_lock(&mptcp_tk_hashlock);
++ hlist_nulls_del_init_rcu(&meta_tp->tk_table);
++ meta_tp->inside_tk_table = 0;
++ spin_unlock(&mptcp_tk_hashlock);
++ rcu_read_unlock_bh();
++}
++
++void mptcp_hash_remove(struct tcp_sock *meta_tp)
++{
++ rcu_read_lock();
++ spin_lock(&mptcp_tk_hashlock);
++ hlist_nulls_del_init_rcu(&meta_tp->tk_table);
++ meta_tp->inside_tk_table = 0;
++ spin_unlock(&mptcp_tk_hashlock);
++ rcu_read_unlock();
++}
++
++struct sock *mptcp_select_ack_sock(const struct sock *meta_sk)
++{
++ const struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct sock *sk, *rttsk = NULL, *lastsk = NULL;
++ u32 min_time = 0, last_active = 0;
++
++ mptcp_for_each_sk(meta_tp->mpcb, sk) {
++ struct tcp_sock *tp = tcp_sk(sk);
++ u32 elapsed;
++
++ if (!mptcp_sk_can_send_ack(sk) || tp->pf)
++ continue;
++
++ elapsed = keepalive_time_elapsed(tp);
++
++ /* We take the one with the lowest RTT within a reasonable
++ * (meta-RTO)-timeframe
++ */
++ if (elapsed < inet_csk(meta_sk)->icsk_rto) {
++ if (!min_time || tp->srtt_us < min_time) {
++ min_time = tp->srtt_us;
++ rttsk = sk;
++ }
++ continue;
++ }
++
++ /* Otherwise, we just take the most recently active one */
++ if (!rttsk && (!last_active || elapsed < last_active)) {
++ last_active = elapsed;
++ lastsk = sk;
++ }
++ }
++
++ if (rttsk)
++ return rttsk;
++
++ return lastsk;
++}
++EXPORT_SYMBOL(mptcp_select_ack_sock);
++
++static void mptcp_sock_def_error_report(struct sock *sk)
++{
++ const struct mptcp_cb *mpcb = tcp_sk(sk)->mpcb;
++
++ if (!sock_flag(sk, SOCK_DEAD))
++ mptcp_sub_close(sk, 0);
++
++ if (mpcb->infinite_mapping_rcv || mpcb->infinite_mapping_snd ||
++ mpcb->send_infinite_mapping) {
++ struct sock *meta_sk = mptcp_meta_sk(sk);
++
++ meta_sk->sk_err = sk->sk_err;
++ meta_sk->sk_err_soft = sk->sk_err_soft;
++
++ if (!sock_flag(meta_sk, SOCK_DEAD))
++ meta_sk->sk_error_report(meta_sk);
++
++ tcp_done(meta_sk);
++ }
++
++ sk->sk_err = 0;
++ return;
++}
++
++static void mptcp_mpcb_put(struct mptcp_cb *mpcb)
++{
++ if (atomic_dec_and_test(&mpcb->mpcb_refcnt)) {
++ mptcp_cleanup_path_manager(mpcb);
++ mptcp_cleanup_scheduler(mpcb);
++ kmem_cache_free(mptcp_cb_cache, mpcb);
++ }
++}
++
++static void mptcp_sock_destruct(struct sock *sk)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++
++ inet_sock_destruct(sk);
++
++ if (!is_meta_sk(sk) && !tp->was_meta_sk) {
++ BUG_ON(!hlist_unhashed(&tp->mptcp->cb_list));
++
++ kmem_cache_free(mptcp_sock_cache, tp->mptcp);
++ tp->mptcp = NULL;
++
++ /* Taken when mpcb pointer was set */
++ sock_put(mptcp_meta_sk(sk));
++ mptcp_mpcb_put(tp->mpcb);
++ } else {
++ struct mptcp_cb *mpcb = tp->mpcb;
++ struct mptcp_tw *mptw;
++
++ /* The mpcb is disappearing - we can make the final
++ * update to the rcv_nxt of the time-wait-sock and remove
++ * its reference to the mpcb.
++ */
++ spin_lock_bh(&mpcb->tw_lock);
++ list_for_each_entry_rcu(mptw, &mpcb->tw_list, list) {
++ list_del_rcu(&mptw->list);
++ mptw->in_list = 0;
++ mptcp_mpcb_put(mpcb);
++ rcu_assign_pointer(mptw->mpcb, NULL);
++ }
++ spin_unlock_bh(&mpcb->tw_lock);
++
++ mptcp_mpcb_put(mpcb);
++
++ mptcp_debug("%s destroying meta-sk\n", __func__);
++ }
++
++ WARN_ON(!static_key_false(&mptcp_static_key));
++ /* Must be the last call, because is_meta_sk() above still needs the
++ * static key
++ */
++ static_key_slow_dec(&mptcp_static_key);
++}
++
++void mptcp_destroy_sock(struct sock *sk)
++{
++ if (is_meta_sk(sk)) {
++ struct sock *sk_it, *tmpsk;
++
++ __skb_queue_purge(&tcp_sk(sk)->mpcb->reinject_queue);
++ mptcp_purge_ofo_queue(tcp_sk(sk));
++
++ /* We have to close all remaining subflows. Normally, they
++ * should all be about to get closed. But, if the kernel is
++ * forcing a closure (e.g., tcp_write_err), the subflows might
++ * not have been closed properly (as we are waiting for the
++ * DATA_ACK of the DATA_FIN).
++ */
++ mptcp_for_each_sk_safe(tcp_sk(sk)->mpcb, sk_it, tmpsk) {
++ /* tcp_close has already been called - we are waiting for
++ * graceful closure, or we are retransmitting fast-close on
++ * the subflow. The reset (or timeout) will kill the
++ * subflow.
++ */
++ if (tcp_sk(sk_it)->closing ||
++ tcp_sk(sk_it)->send_mp_fclose)
++ continue;
++
++ /* Let the delayed work run first, to prevent the time-wait state */
++ if (delayed_work_pending(&tcp_sk(sk_it)->mptcp->work))
++ continue;
++
++ mptcp_sub_close(sk_it, 0);
++ }
++
++ mptcp_delete_synack_timer(sk);
++ } else {
++ mptcp_del_sock(sk);
++ }
++}
++
++static void mptcp_set_state(struct sock *sk)
++{
++ struct sock *meta_sk = mptcp_meta_sk(sk);
++
++ /* Meta is not yet established - wake up the application */
++ if ((1 << meta_sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV) &&
++ sk->sk_state == TCP_ESTABLISHED) {
++ tcp_set_state(meta_sk, TCP_ESTABLISHED);
++
++ if (!sock_flag(meta_sk, SOCK_DEAD)) {
++ meta_sk->sk_state_change(meta_sk);
++ sk_wake_async(meta_sk, SOCK_WAKE_IO, POLL_OUT);
++ }
++ }
++
++ if (sk->sk_state == TCP_ESTABLISHED) {
++ tcp_sk(sk)->mptcp->establish_increased = 1;
++ tcp_sk(sk)->mpcb->cnt_established++;
++ }
++}
++
++void mptcp_init_congestion_control(struct sock *sk)
++{
++ struct inet_connection_sock *icsk = inet_csk(sk);
++ struct inet_connection_sock *meta_icsk = inet_csk(mptcp_meta_sk(sk));
++ const struct tcp_congestion_ops *ca = meta_icsk->icsk_ca_ops;
++
++ /* The application didn't set a congestion control to use -
++ * fall back to the default one.
++ */
++ if (ca == &tcp_init_congestion_ops)
++ goto use_default;
++
++ /* Use the same congestion control as set by the user. If the
++ * module is not available, fall back to the default one.
++ */
++ if (!try_module_get(ca->owner)) {
++ pr_warn("%s: fallback to the system default CC\n", __func__);
++ goto use_default;
++ }
++
++ icsk->icsk_ca_ops = ca;
++ if (icsk->icsk_ca_ops->init)
++ icsk->icsk_ca_ops->init(sk);
++
++ return;
++
++use_default:
++ icsk->icsk_ca_ops = &tcp_init_congestion_ops;
++ tcp_init_congestion_control(sk);
++}
++
++u32 mptcp_secret[MD5_MESSAGE_BYTES / 4] ____cacheline_aligned;
++u32 mptcp_seed = 0;
++
++void mptcp_key_sha1(u64 key, u32 *token, u64 *idsn)
++{
++ u32 workspace[SHA_WORKSPACE_WORDS];
++ u32 mptcp_hashed_key[SHA_DIGEST_WORDS];
++ u8 input[64];
++ int i;
++
++ memset(workspace, 0, sizeof(workspace));
++
++ /* Initialize input with appropriate padding */
++ memset(&input[9], 0, sizeof(input) - 10); /* -10, because the last byte
++ * is explicitly set too
++ */
++ memcpy(input, &key, sizeof(key)); /* Copy key to the msg beginning */
++ input[8] = 0x80; /* Padding: First bit after message = 1 */
++ input[63] = 0x40; /* Padding: Length of the message = 64 bits */
++
++ sha_init(mptcp_hashed_key);
++ sha_transform(mptcp_hashed_key, input, workspace);
++
++ for (i = 0; i < 5; i++)
++ mptcp_hashed_key[i] = cpu_to_be32(mptcp_hashed_key[i]);
++
++ if (token)
++ *token = mptcp_hashed_key[0];
++ if (idsn)
++ *idsn = *((u64 *)&mptcp_hashed_key[3]);
++}
++
++void mptcp_hmac_sha1(u8 *key_1, u8 *key_2, u8 *rand_1, u8 *rand_2,
++ u32 *hash_out)
++{
++ u32 workspace[SHA_WORKSPACE_WORDS];
++ u8 input[128]; /* 2 512-bit blocks */
++ int i;
++
++ memset(workspace, 0, sizeof(workspace));
++
++ /* Generate key xored with ipad */
++ memset(input, 0x36, 64);
++ for (i = 0; i < 8; i++)
++ input[i] ^= key_1[i];
++ for (i = 0; i < 8; i++)
++ input[i + 8] ^= key_2[i];
++
++ memcpy(&input[64], rand_1, 4);
++ memcpy(&input[68], rand_2, 4);
++ input[72] = 0x80; /* Padding: First bit after message = 1 */
++ memset(&input[73], 0, 53);
++
++ /* Padding: Length of the message = 512 + 64 bits */
++ input[126] = 0x02;
++ input[127] = 0x40;
++
++ sha_init(hash_out);
++ sha_transform(hash_out, input, workspace);
++ memset(workspace, 0, sizeof(workspace));
++
++ sha_transform(hash_out, &input[64], workspace);
++ memset(workspace, 0, sizeof(workspace));
++
++ for (i = 0; i < 5; i++)
++ hash_out[i] = cpu_to_be32(hash_out[i]);
++
++ /* Prepare second part of hmac */
++ memset(input, 0x5C, 64);
++ for (i = 0; i < 8; i++)
++ input[i] ^= key_1[i];
++ for (i = 0; i < 8; i++)
++ input[i + 8] ^= key_2[i];
++
++ memcpy(&input[64], hash_out, 20);
++ input[84] = 0x80;
++ memset(&input[85], 0, 41);
++
++ /* Padding: Length of the message = 512 + 160 bits */
++ input[126] = 0x02;
++ input[127] = 0xA0;
++
++ sha_init(hash_out);
++ sha_transform(hash_out, input, workspace);
++ memset(workspace, 0, sizeof(workspace));
++
++ sha_transform(hash_out, &input[64], workspace);
++
++ for (i = 0; i < 5; i++)
++ hash_out[i] = cpu_to_be32(hash_out[i]);
++}
++
++static void mptcp_mpcb_inherit_sockopts(struct sock *meta_sk, struct sock *master_sk)
++{
++ /* Socket-options handled by sk_clone_lock while creating the meta-sk.
++ * ======
++ * SO_SNDBUF, SO_SNDBUFFORCE, SO_RCVBUF, SO_RCVBUFFORCE, SO_RCVLOWAT,
++ * SO_RCVTIMEO, SO_SNDTIMEO, SO_ATTACH_FILTER, SO_DETACH_FILTER,
++ * TCP_NODELAY, TCP_CORK
++ *
++ * Socket-options handled in this function
++ * ======
++ * TCP_DEFER_ACCEPT
++ * SO_KEEPALIVE
++ *
++ * Socket-options on the todo-list
++ * ======
++ * SO_BINDTODEVICE - should probably prevent creation of new subsocks
++ * across other devices. - what about the api-draft?
++ * SO_DEBUG
++ * SO_REUSEADDR - probably we don't care about this
++ * SO_DONTROUTE, SO_BROADCAST
++ * SO_OOBINLINE
++ * SO_LINGER
++ * SO_TIMESTAMP* - I don't think this is of concern for a SOCK_STREAM
++ * SO_PASSSEC - I don't think this is of concern for a SOCK_STREAM
++ * SO_RXQ_OVFL
++ * TCP_COOKIE_TRANSACTIONS
++ * TCP_MAXSEG
++ * TCP_THIN_* - Handled by sk_clone_lock, but we need to support this
++ * in mptcp_retransmit_timer. And we need to check what
++ * happens with the subsockets.
++ * TCP_LINGER2
++ * TCP_WINDOW_CLAMP
++ * TCP_USER_TIMEOUT
++ * TCP_MD5SIG
++ *
++ * Socket-options of no concern for the meta-socket (but for the subsocket)
++ * ======
++ * SO_PRIORITY
++ * SO_MARK
++ * TCP_CONGESTION
++ * TCP_SYNCNT
++ * TCP_QUICKACK
++ */
++
++ /* DEFER_ACCEPT should not be set on the meta, as we want to accept new subflows directly */
++ inet_csk(meta_sk)->icsk_accept_queue.rskq_defer_accept = 0;
++
++ /* Keepalives are handled entirely at the MPTCP-layer */
++ if (sock_flag(meta_sk, SOCK_KEEPOPEN)) {
++ inet_csk_reset_keepalive_timer(meta_sk,
++ keepalive_time_when(tcp_sk(meta_sk)));
++ sock_reset_flag(master_sk, SOCK_KEEPOPEN);
++ inet_csk_delete_keepalive_timer(master_sk);
++ }
++
++ /* Do not propagate subflow-errors up to the MPTCP-layer */
++ inet_sk(master_sk)->recverr = 0;
++}
++
++static void mptcp_sub_inherit_sockopts(const struct sock *meta_sk, struct sock *sub_sk)
++{
++ /* IP_TOS also goes to the subflow. */
++ if (inet_sk(sub_sk)->tos != inet_sk(meta_sk)->tos) {
++ inet_sk(sub_sk)->tos = inet_sk(meta_sk)->tos;
++ sub_sk->sk_priority = meta_sk->sk_priority;
++ sk_dst_reset(sub_sk);
++ }
++
++ /* Inherit SO_REUSEADDR */
++ sub_sk->sk_reuse = meta_sk->sk_reuse;
++
++ /* Inherit snd/rcv-buffer locks */
++ sub_sk->sk_userlocks = meta_sk->sk_userlocks & ~SOCK_BINDPORT_LOCK;
++
++ /* Nagle/Cork is forced off on the subflows. It is handled at the meta-layer */
++ tcp_sk(sub_sk)->nonagle = TCP_NAGLE_OFF|TCP_NAGLE_PUSH;
++
++ /* Keepalives are handled entirely at the MPTCP-layer */
++ if (sock_flag(sub_sk, SOCK_KEEPOPEN)) {
++ sock_reset_flag(sub_sk, SOCK_KEEPOPEN);
++ inet_csk_delete_keepalive_timer(sub_sk);
++ }
++
++ /* Do not propagate subflow-errors up to the MPTCP-layer */
++ inet_sk(sub_sk)->recverr = 0;
++}
++
++int mptcp_backlog_rcv(struct sock *meta_sk, struct sk_buff *skb)
++{
++ /* skb->sk may be NULL if we receive a packet immediately after the
++ * SYN/ACK + MP_CAPABLE.
++ */
++ struct sock *sk = skb->sk ? skb->sk : meta_sk;
++ int ret = 0;
++
++ skb->sk = NULL;
++
++ if (unlikely(!atomic_inc_not_zero(&sk->sk_refcnt))) {
++ kfree_skb(skb);
++ return 0;
++ }
++
++ if (sk->sk_family == AF_INET)
++ ret = tcp_v4_do_rcv(sk, skb);
++#if IS_ENABLED(CONFIG_IPV6)
++ else
++ ret = tcp_v6_do_rcv(sk, skb);
++#endif
++
++ sock_put(sk);
++ return ret;
++}
++
++struct lock_class_key meta_key;
++struct lock_class_key meta_slock_key;
++
++static void mptcp_synack_timer_handler(unsigned long data)
++{
++ struct sock *meta_sk = (struct sock *) data;
++ struct listen_sock *lopt = inet_csk(meta_sk)->icsk_accept_queue.listen_opt;
++
++ /* Only process if socket is not in use. */
++ bh_lock_sock(meta_sk);
++
++ if (sock_owned_by_user(meta_sk)) {
++ /* Try again later. */
++ mptcp_reset_synack_timer(meta_sk, HZ/20);
++ goto out;
++ }
++
++ /* May happen if the queue got destroyed in mptcp_close */
++ if (!lopt)
++ goto out;
++
++ inet_csk_reqsk_queue_prune(meta_sk, TCP_SYNQ_INTERVAL,
++ TCP_TIMEOUT_INIT, TCP_RTO_MAX);
++
++ if (lopt->qlen)
++ mptcp_reset_synack_timer(meta_sk, TCP_SYNQ_INTERVAL);
++
++out:
++ bh_unlock_sock(meta_sk);
++ sock_put(meta_sk);
++}
++
++static const struct tcp_sock_ops mptcp_meta_specific = {
++ .__select_window = __mptcp_select_window,
++ .select_window = mptcp_select_window,
++ .select_initial_window = mptcp_select_initial_window,
++ .init_buffer_space = mptcp_init_buffer_space,
++ .set_rto = mptcp_tcp_set_rto,
++ .should_expand_sndbuf = mptcp_should_expand_sndbuf,
++ .init_congestion_control = mptcp_init_congestion_control,
++ .send_fin = mptcp_send_fin,
++ .write_xmit = mptcp_write_xmit,
++ .send_active_reset = mptcp_send_active_reset,
++ .write_wakeup = mptcp_write_wakeup,
++ .prune_ofo_queue = mptcp_prune_ofo_queue,
++ .retransmit_timer = mptcp_retransmit_timer,
++ .time_wait = mptcp_time_wait,
++ .cleanup_rbuf = mptcp_cleanup_rbuf,
++};
++
++static const struct tcp_sock_ops mptcp_sub_specific = {
++ .__select_window = __mptcp_select_window,
++ .select_window = mptcp_select_window,
++ .select_initial_window = mptcp_select_initial_window,
++ .init_buffer_space = mptcp_init_buffer_space,
++ .set_rto = mptcp_tcp_set_rto,
++ .should_expand_sndbuf = mptcp_should_expand_sndbuf,
++ .init_congestion_control = mptcp_init_congestion_control,
++ .send_fin = tcp_send_fin,
++ .write_xmit = tcp_write_xmit,
++ .send_active_reset = tcp_send_active_reset,
++ .write_wakeup = tcp_write_wakeup,
++ .prune_ofo_queue = tcp_prune_ofo_queue,
++ .retransmit_timer = tcp_retransmit_timer,
++ .time_wait = tcp_time_wait,
++ .cleanup_rbuf = tcp_cleanup_rbuf,
++};
++
++static int mptcp_alloc_mpcb(struct sock *meta_sk, __u64 remote_key, u32 window)
++{
++ struct mptcp_cb *mpcb;
++ struct sock *master_sk;
++ struct inet_connection_sock *master_icsk, *meta_icsk = inet_csk(meta_sk);
++ struct tcp_sock *master_tp, *meta_tp = tcp_sk(meta_sk);
++ u64 idsn;
++
++ dst_release(meta_sk->sk_rx_dst);
++ meta_sk->sk_rx_dst = NULL;
++ /* This flag tells sock_lock_init to
++ * reclassify the lock-class of the master socket.
++ */
++ meta_tp->is_master_sk = 1;
++ master_sk = sk_clone_lock(meta_sk, GFP_ATOMIC | __GFP_ZERO);
++ meta_tp->is_master_sk = 0;
++ if (!master_sk)
++ return -ENOBUFS;
++
++ master_tp = tcp_sk(master_sk);
++ master_icsk = inet_csk(master_sk);
++
++ mpcb = kmem_cache_zalloc(mptcp_cb_cache, GFP_ATOMIC);
++ if (!mpcb) {
++ /* sk_free (and __sk_free) requires wmem_alloc to be 1.
++ * All the rest is set to 0 thanks to __GFP_ZERO above.
++ */
++ atomic_set(&master_sk->sk_wmem_alloc, 1);
++ sk_free(master_sk);
++ return -ENOBUFS;
++ }
++
++#if IS_ENABLED(CONFIG_IPV6)
++ if (meta_icsk->icsk_af_ops == &mptcp_v6_mapped) {
++ struct ipv6_pinfo *newnp, *np = inet6_sk(meta_sk);
++
++ inet_sk(master_sk)->pinet6 = &((struct tcp6_sock *)master_sk)->inet6;
++
++ newnp = inet6_sk(master_sk);
++ memcpy(newnp, np, sizeof(struct ipv6_pinfo));
++
++ newnp->ipv6_mc_list = NULL;
++ newnp->ipv6_ac_list = NULL;
++ newnp->ipv6_fl_list = NULL;
++ newnp->opt = NULL;
++ newnp->pktoptions = NULL;
++ (void)xchg(&newnp->rxpmtu, NULL);
++ } else if (meta_sk->sk_family == AF_INET6) {
++ struct ipv6_pinfo *newnp, *np = inet6_sk(meta_sk);
++
++ inet_sk(master_sk)->pinet6 = &((struct tcp6_sock *)master_sk)->inet6;
++
++ newnp = inet6_sk(master_sk);
++ memcpy(newnp, np, sizeof(struct ipv6_pinfo));
++
++ newnp->hop_limit = -1;
++ newnp->mcast_hops = IPV6_DEFAULT_MCASTHOPS;
++ newnp->mc_loop = 1;
++ newnp->pmtudisc = IPV6_PMTUDISC_WANT;
++ newnp->ipv6only = sock_net(master_sk)->ipv6.sysctl.bindv6only;
++ }
++#endif
++
++ meta_tp->mptcp = NULL;
++
++ /* Store the keys and generate the peer's token */
++ mpcb->mptcp_loc_key = meta_tp->mptcp_loc_key;
++ mpcb->mptcp_loc_token = meta_tp->mptcp_loc_token;
++
++ /* Generate Initial data-sequence-numbers */
++ mptcp_key_sha1(mpcb->mptcp_loc_key, NULL, &idsn);
++ idsn = ntohll(idsn) + 1;
++ mpcb->snd_high_order[0] = idsn >> 32;
++ mpcb->snd_high_order[1] = mpcb->snd_high_order[0] - 1;
++
++ meta_tp->write_seq = (u32)idsn;
++ meta_tp->snd_sml = meta_tp->write_seq;
++ meta_tp->snd_una = meta_tp->write_seq;
++ meta_tp->snd_nxt = meta_tp->write_seq;
++ meta_tp->pushed_seq = meta_tp->write_seq;
++ meta_tp->snd_up = meta_tp->write_seq;
++
++ mpcb->mptcp_rem_key = remote_key;
++ mptcp_key_sha1(mpcb->mptcp_rem_key, &mpcb->mptcp_rem_token, &idsn);
++ idsn = ntohll(idsn) + 1;
++ mpcb->rcv_high_order[0] = idsn >> 32;
++ mpcb->rcv_high_order[1] = mpcb->rcv_high_order[0] + 1;
++ meta_tp->copied_seq = (u32) idsn;
++ meta_tp->rcv_nxt = (u32) idsn;
++ meta_tp->rcv_wup = (u32) idsn;
++
++ meta_tp->snd_wl1 = meta_tp->rcv_nxt - 1;
++ meta_tp->snd_wnd = window;
++ meta_tp->retrans_stamp = 0; /* Set in tcp_connect() */
++
++ meta_tp->packets_out = 0;
++ meta_icsk->icsk_probes_out = 0;
++
++ /* Set mptcp-pointers */
++ master_tp->mpcb = mpcb;
++ master_tp->meta_sk = meta_sk;
++ meta_tp->mpcb = mpcb;
++ meta_tp->meta_sk = meta_sk;
++ mpcb->meta_sk = meta_sk;
++ mpcb->master_sk = master_sk;
++
++ meta_tp->was_meta_sk = 0;
++
++ /* Initialize the queues */
++ skb_queue_head_init(&mpcb->reinject_queue);
++ skb_queue_head_init(&master_tp->out_of_order_queue);
++ tcp_prequeue_init(master_tp);
++ INIT_LIST_HEAD(&master_tp->tsq_node);
++
++ master_tp->tsq_flags = 0;
++
++ mutex_init(&mpcb->mpcb_mutex);
++
++ /* Init the accept_queue structure; we support a queue of 32 pending
++ * connections. It does not need to be huge, since we only store
++ * pending subflow creations here.
++ */
++ if (reqsk_queue_alloc(&meta_icsk->icsk_accept_queue, 32, GFP_ATOMIC)) {
++ inet_put_port(master_sk);
++ kmem_cache_free(mptcp_cb_cache, mpcb);
++ sk_free(master_sk);
++ return -ENOMEM;
++ }
++
++ /* Redefine function-pointers as the meta-sk is now fully ready */
++ static_key_slow_inc(&mptcp_static_key);
++ meta_tp->mpc = 1;
++ meta_tp->ops = &mptcp_meta_specific;
++
++ meta_sk->sk_backlog_rcv = mptcp_backlog_rcv;
++ meta_sk->sk_destruct = mptcp_sock_destruct;
++
++ /* Meta-level retransmit timer */
++ meta_icsk->icsk_rto *= 2; /* Double the initial RTO */
++
++ tcp_init_xmit_timers(master_sk);
++ /* Has been set for sending out the SYN */
++ inet_csk_clear_xmit_timer(meta_sk, ICSK_TIME_RETRANS);
++
++ if (!meta_tp->inside_tk_table) {
++ /* Adding the meta_tp in the token hashtable - coming from server-side */
++ rcu_read_lock();
++ spin_lock(&mptcp_tk_hashlock);
++
++ __mptcp_hash_insert(meta_tp, mpcb->mptcp_loc_token);
++
++ spin_unlock(&mptcp_tk_hashlock);
++ rcu_read_unlock();
++ }
++ master_tp->inside_tk_table = 0;
++
++ /* Init time-wait stuff */
++ INIT_LIST_HEAD(&mpcb->tw_list);
++ spin_lock_init(&mpcb->tw_lock);
++
++ INIT_HLIST_HEAD(&mpcb->callback_list);
++
++ mptcp_mpcb_inherit_sockopts(meta_sk, master_sk);
++
++ mpcb->orig_sk_rcvbuf = meta_sk->sk_rcvbuf;
++ mpcb->orig_sk_sndbuf = meta_sk->sk_sndbuf;
++ mpcb->orig_window_clamp = meta_tp->window_clamp;
++
++ /* The meta is directly linked - set refcnt to 1 */
++ atomic_set(&mpcb->mpcb_refcnt, 1);
++
++ mptcp_init_path_manager(mpcb);
++ mptcp_init_scheduler(mpcb);
++
++ setup_timer(&mpcb->synack_timer, mptcp_synack_timer_handler,
++ (unsigned long)meta_sk);
++
++ mptcp_debug("%s: created mpcb with token %#x\n",
++ __func__, mpcb->mptcp_loc_token);
++
++ return 0;
++}
++
++void mptcp_fallback_meta_sk(struct sock *meta_sk)
++{
++ kfree(inet_csk(meta_sk)->icsk_accept_queue.listen_opt);
++ kmem_cache_free(mptcp_cb_cache, tcp_sk(meta_sk)->mpcb);
++}
++
++int mptcp_add_sock(struct sock *meta_sk, struct sock *sk, u8 loc_id, u8 rem_id,
++ gfp_t flags)
++{
++ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct tcp_sock *tp = tcp_sk(sk);
++
++ tp->mptcp = kmem_cache_zalloc(mptcp_sock_cache, flags);
++ if (!tp->mptcp)
++ return -ENOMEM;
++
++ tp->mptcp->path_index = mptcp_set_new_pathindex(mpcb);
++ /* No more space for more subflows? */
++ if (!tp->mptcp->path_index) {
++ kmem_cache_free(mptcp_sock_cache, tp->mptcp);
++ return -EPERM;
++ }
++
++ INIT_HLIST_NODE(&tp->mptcp->cb_list);
++
++ tp->mptcp->tp = tp;
++ tp->mpcb = mpcb;
++ tp->meta_sk = meta_sk;
++
++ static_key_slow_inc(&mptcp_static_key);
++ tp->mpc = 1;
++ tp->ops = &mptcp_sub_specific;
++
++ tp->mptcp->loc_id = loc_id;
++ tp->mptcp->rem_id = rem_id;
++ if (mpcb->sched_ops->init)
++ mpcb->sched_ops->init(sk);
++
++ /* The corresponding sock_put is in mptcp_sock_destruct(). It cannot be
++ * included in mptcp_del_sock(), because the mpcb must remain alive
++ * until the last subsocket is completely destroyed.
++ */
++ sock_hold(meta_sk);
++ atomic_inc(&mpcb->mpcb_refcnt);
++
++ tp->mptcp->next = mpcb->connection_list;
++ mpcb->connection_list = tp;
++ tp->mptcp->attached = 1;
++
++ mpcb->cnt_subflows++;
++ atomic_add(atomic_read(&((struct sock *)tp)->sk_rmem_alloc),
++ &meta_sk->sk_rmem_alloc);
++
++ mptcp_sub_inherit_sockopts(meta_sk, sk);
++ INIT_DELAYED_WORK(&tp->mptcp->work, mptcp_sub_close_wq);
++
++ /* As we successfully allocated the mptcp_tcp_sock, we have to
++ * change the function-pointers here (for sk_destruct to work correctly)
++ */
++ sk->sk_error_report = mptcp_sock_def_error_report;
++ sk->sk_data_ready = mptcp_data_ready;
++ sk->sk_write_space = mptcp_write_space;
++ sk->sk_state_change = mptcp_set_state;
++ sk->sk_destruct = mptcp_sock_destruct;
++
++ if (sk->sk_family == AF_INET)
++ mptcp_debug("%s: token %#x pi %d, src_addr:%pI4:%d dst_addr:%pI4:%d, cnt_subflows now %d\n",
++ __func__ , mpcb->mptcp_loc_token,
++ tp->mptcp->path_index,
++ &((struct inet_sock *)tp)->inet_saddr,
++ ntohs(((struct inet_sock *)tp)->inet_sport),
++ &((struct inet_sock *)tp)->inet_daddr,
++ ntohs(((struct inet_sock *)tp)->inet_dport),
++ mpcb->cnt_subflows);
++#if IS_ENABLED(CONFIG_IPV6)
++ else
++ mptcp_debug("%s: token %#x pi %d, src_addr:%pI6:%d dst_addr:%pI6:%d, cnt_subflows now %d\n",
++ __func__ , mpcb->mptcp_loc_token,
++ tp->mptcp->path_index, &inet6_sk(sk)->saddr,
++ ntohs(((struct inet_sock *)tp)->inet_sport),
++ &sk->sk_v6_daddr,
++ ntohs(((struct inet_sock *)tp)->inet_dport),
++ mpcb->cnt_subflows);
++#endif
++
++ return 0;
++}
++
++void mptcp_del_sock(struct sock *sk)
++{
++ struct tcp_sock *tp = tcp_sk(sk), *tp_prev;
++ struct mptcp_cb *mpcb;
++
++ if (!tp->mptcp || !tp->mptcp->attached)
++ return;
++
++ mpcb = tp->mpcb;
++ tp_prev = mpcb->connection_list;
++
++ mptcp_debug("%s: Removing subsock tok %#x pi:%d state %d is_meta? %d\n",
++ __func__, mpcb->mptcp_loc_token, tp->mptcp->path_index,
++ sk->sk_state, is_meta_sk(sk));
++
++ if (tp_prev == tp) {
++ mpcb->connection_list = tp->mptcp->next;
++ } else {
++ for (; tp_prev && tp_prev->mptcp->next; tp_prev = tp_prev->mptcp->next) {
++ if (tp_prev->mptcp->next == tp) {
++ tp_prev->mptcp->next = tp->mptcp->next;
++ break;
++ }
++ }
++ }
++ mpcb->cnt_subflows--;
++ if (tp->mptcp->establish_increased)
++ mpcb->cnt_established--;
++
++ tp->mptcp->next = NULL;
++ tp->mptcp->attached = 0;
++ mpcb->path_index_bits &= ~(1 << tp->mptcp->path_index);
++
++ if (!skb_queue_empty(&sk->sk_write_queue))
++ mptcp_reinject_data(sk, 0);
++
++ if (is_master_tp(tp))
++ mpcb->master_sk = NULL;
++ else if (tp->mptcp->pre_established)
++ sk_stop_timer(sk, &tp->mptcp->mptcp_ack_timer);
++
++ rcu_assign_pointer(inet_sk(sk)->inet_opt, NULL);
++}
++
++/* Updates the metasocket ULID/port data, based on the given sock.
++ * The argument sock must be the sock accessible to the application.
++ * In this function, we update the meta socket info, based on the changes
++ * in the application socket (bind, address allocation, ...)
++ */
++void mptcp_update_metasocket(struct sock *sk, const struct sock *meta_sk)
++{
++ if (tcp_sk(sk)->mpcb->pm_ops->new_session)
++ tcp_sk(sk)->mpcb->pm_ops->new_session(meta_sk);
++
++ tcp_sk(sk)->mptcp->send_mp_prio = tcp_sk(sk)->mptcp->low_prio;
++}
++
++/* Clean up the receive buffer for full frames taken by the user,
++ * then send an ACK if necessary. COPIED is the number of bytes
++ * tcp_recvmsg has given to the user so far, it speeds up the
++ * calculation of whether or not we must ACK for the sake of
++ * a window update.
++ */
++void mptcp_cleanup_rbuf(struct sock *meta_sk, int copied)
++{
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct sock *sk;
++ __u32 rcv_window_now = 0;
++
++ if (copied > 0 && !(meta_sk->sk_shutdown & RCV_SHUTDOWN)) {
++ rcv_window_now = tcp_receive_window(meta_tp);
++
++ if (2 * rcv_window_now > meta_tp->window_clamp)
++ rcv_window_now = 0;
++ }
++
++ mptcp_for_each_sk(meta_tp->mpcb, sk) {
++ struct tcp_sock *tp = tcp_sk(sk);
++ const struct inet_connection_sock *icsk = inet_csk(sk);
++
++ if (!mptcp_sk_can_send_ack(sk))
++ continue;
++
++ if (!inet_csk_ack_scheduled(sk))
++ goto second_part;
++ /* Delayed ACKs frequently hit locked sockets during bulk
++ * receive.
++ */
++ if (icsk->icsk_ack.blocked ||
++ /* Once-per-two-segments ACK was not sent by tcp_input.c */
++ tp->rcv_nxt - tp->rcv_wup > icsk->icsk_ack.rcv_mss ||
++ /* If this read emptied read buffer, we send ACK, if
++ * connection is not bidirectional, user drained
++ * receive buffer and there was a small segment
++ * in queue.
++ */
++ (copied > 0 &&
++ ((icsk->icsk_ack.pending & ICSK_ACK_PUSHED2) ||
++ ((icsk->icsk_ack.pending & ICSK_ACK_PUSHED) &&
++ !icsk->icsk_ack.pingpong)) &&
++ !atomic_read(&meta_sk->sk_rmem_alloc))) {
++ tcp_send_ack(sk);
++ continue;
++ }
++
++second_part:
++ /* This is the second part of tcp_cleanup_rbuf */
++ if (rcv_window_now) {
++ __u32 new_window = tp->ops->__select_window(sk);
++
++ /* Send an ACK now, if this read freed lots of space
++ * in our buffer. new_window is the new window; we can
++ * advertise it now, if it is not less than the
++ * current one.
++ * "Lots" means "at least twice" here.
++ */
++ if (new_window && new_window >= 2 * rcv_window_now)
++ tcp_send_ack(sk);
++ }
++ }
++}
++
++static int mptcp_sub_send_fin(struct sock *sk)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct sk_buff *skb = tcp_write_queue_tail(sk);
++ int mss_now;
++
++ /* Optimization, tack on the FIN if we have a queue of
++ * unsent frames. But be careful about outgoing SACKS
++ * and IP options.
++ */
++ mss_now = tcp_current_mss(sk);
++
++ if (tcp_send_head(sk) != NULL) {
++ TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_FIN;
++ TCP_SKB_CB(skb)->end_seq++;
++ tp->write_seq++;
++ } else {
++ skb = alloc_skb_fclone(MAX_TCP_HEADER, GFP_ATOMIC);
++ if (!skb)
++ return 1;
++
++ /* Reserve space for headers and prepare control bits. */
++ skb_reserve(skb, MAX_TCP_HEADER);
++ /* FIN eats a sequence byte, write_seq advanced by tcp_queue_skb(). */
++ tcp_init_nondata_skb(skb, tp->write_seq,
++ TCPHDR_ACK | TCPHDR_FIN);
++ tcp_queue_skb(sk, skb);
++ }
++ __tcp_push_pending_frames(sk, mss_now, TCP_NAGLE_OFF);
++
++ return 0;
++}
++
++void mptcp_sub_close_wq(struct work_struct *work)
++{
++ struct tcp_sock *tp = container_of(work, struct mptcp_tcp_sock, work.work)->tp;
++ struct sock *sk = (struct sock *)tp;
++ struct sock *meta_sk = mptcp_meta_sk(sk);
++
++ mutex_lock(&tp->mpcb->mpcb_mutex);
++ lock_sock_nested(meta_sk, SINGLE_DEPTH_NESTING);
++
++ if (sock_flag(sk, SOCK_DEAD))
++ goto exit;
++
++ /* We come from tcp_disconnect. We are sure that meta_sk is set */
++ if (!mptcp(tp)) {
++ tp->closing = 1;
++ sock_rps_reset_flow(sk);
++ tcp_close(sk, 0);
++ goto exit;
++ }
++
++ if (meta_sk->sk_shutdown == SHUTDOWN_MASK || sk->sk_state == TCP_CLOSE) {
++ tp->closing = 1;
++ sock_rps_reset_flow(sk);
++ tcp_close(sk, 0);
++ } else if (tcp_close_state(sk)) {
++ sk->sk_shutdown |= SEND_SHUTDOWN;
++ tcp_send_fin(sk);
++ }
++
++exit:
++ release_sock(meta_sk);
++ mutex_unlock(&tp->mpcb->mpcb_mutex);
++ sock_put(sk);
++}
++
++void mptcp_sub_close(struct sock *sk, unsigned long delay)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct delayed_work *work = &tcp_sk(sk)->mptcp->work;
++
++ /* We are already closing - e.g., called from sock_def_error_report upon
++ * tcp_disconnect in tcp_close.
++ */
++ if (tp->closing)
++ return;
++
++ /* Work already scheduled? */
++ if (work_pending(&work->work)) {
++ /* Work present - which will run first? */
++ if (jiffies + delay > work->timer.expires)
++ return;
++
++ /* Try canceling - if it fails, work will be executed soon */
++ if (!cancel_delayed_work(work))
++ return;
++ sock_put(sk);
++ }
++
++ if (!delay) {
++ unsigned char old_state = sk->sk_state;
++
++ /* If we are in user-context we can directly do the closing
++ * procedure. No need to schedule a work-queue.
++ */
++ if (!in_softirq()) {
++ if (sock_flag(sk, SOCK_DEAD))
++ return;
++
++ if (!mptcp(tp)) {
++ tp->closing = 1;
++ sock_rps_reset_flow(sk);
++ tcp_close(sk, 0);
++ return;
++ }
++
++ if (mptcp_meta_sk(sk)->sk_shutdown == SHUTDOWN_MASK ||
++ sk->sk_state == TCP_CLOSE) {
++ tp->closing = 1;
++ sock_rps_reset_flow(sk);
++ tcp_close(sk, 0);
++ } else if (tcp_close_state(sk)) {
++ sk->sk_shutdown |= SEND_SHUTDOWN;
++ tcp_send_fin(sk);
++ }
++
++ return;
++ }
++
++ /* We send the FIN directly, because it may take a long time
++ * until the work-queue gets scheduled...
++ *
++ * If mptcp_sub_send_fin returns 1, it failed and thus we reset
++ * to the old state so that tcp_close will finally send the FIN
++ * in user-context.
++ */
++ if (!sk->sk_err && old_state != TCP_CLOSE &&
++ tcp_close_state(sk) && mptcp_sub_send_fin(sk)) {
++ if (old_state == TCP_ESTABLISHED)
++ TCP_INC_STATS(sock_net(sk), TCP_MIB_CURRESTAB);
++ sk->sk_state = old_state;
++ }
++ }
++
++ sock_hold(sk);
++ queue_delayed_work(mptcp_wq, work, delay);
++}
++
++void mptcp_sub_force_close(struct sock *sk)
++{
++ /* The below tcp_done may have freed the socket, if it is already dead.
++ * Thus, we are not allowed to access it afterwards. That's why
++ * we have to store the dead-state in this local variable.
++ */
++ int sock_is_dead = sock_flag(sk, SOCK_DEAD);
++
++ tcp_sk(sk)->mp_killed = 1;
++
++ if (sk->sk_state != TCP_CLOSE)
++ tcp_done(sk);
++
++ if (!sock_is_dead)
++ mptcp_sub_close(sk, 0);
++}
++EXPORT_SYMBOL(mptcp_sub_force_close);
++
++/* Update the mpcb send buffer, based on the contributions
++ * of each subflow
++ */
++void mptcp_update_sndbuf(const struct tcp_sock *tp)
++{
++ struct sock *meta_sk = tp->meta_sk, *sk;
++ int new_sndbuf = 0, old_sndbuf = meta_sk->sk_sndbuf;
++
++ mptcp_for_each_sk(tp->mpcb, sk) {
++ if (!mptcp_sk_can_send(sk))
++ continue;
++
++ new_sndbuf += sk->sk_sndbuf;
++
++ if (new_sndbuf > sysctl_tcp_wmem[2] || new_sndbuf < 0) {
++ new_sndbuf = sysctl_tcp_wmem[2];
++ break;
++ }
++ }
++ meta_sk->sk_sndbuf = max(min(new_sndbuf, sysctl_tcp_wmem[2]), meta_sk->sk_sndbuf);
++
++ /* The subflow's call to sk_write_space in tcp_new_space ends up in
++ * mptcp_write_space.
++ * It has nothing to do with waking up the application.
++ * So, we do it here.
++ */
++ if (old_sndbuf != meta_sk->sk_sndbuf)
++ meta_sk->sk_write_space(meta_sk);
++}
++
++void mptcp_close(struct sock *meta_sk, long timeout)
++{
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct sock *sk_it, *tmpsk;
++ struct mptcp_cb *mpcb = meta_tp->mpcb;
++ struct sk_buff *skb;
++ int data_was_unread = 0;
++ int state;
++
++ mptcp_debug("%s: Close of meta_sk with tok %#x\n",
++ __func__, mpcb->mptcp_loc_token);
++
++ mutex_lock(&mpcb->mpcb_mutex);
++ lock_sock(meta_sk);
++
++ if (meta_tp->inside_tk_table) {
++ /* Detach the mpcb from the token hashtable */
++ mptcp_hash_remove_bh(meta_tp);
++ reqsk_queue_destroy(&inet_csk(meta_sk)->icsk_accept_queue);
++ }
++
++ meta_sk->sk_shutdown = SHUTDOWN_MASK;
++ /* We need to flush the recv. buffs. We do this only on the
++ * descriptor close, not protocol-sourced closes, because the
++ * reader process may not have drained the data yet!
++ */
++ while ((skb = __skb_dequeue(&meta_sk->sk_receive_queue)) != NULL) {
++ u32 len = TCP_SKB_CB(skb)->end_seq - TCP_SKB_CB(skb)->seq -
++ tcp_hdr(skb)->fin;
++ data_was_unread += len;
++ __kfree_skb(skb);
++ }
++
++ sk_mem_reclaim(meta_sk);
++
++ /* If socket has been already reset (e.g. in tcp_reset()) - kill it. */
++ if (meta_sk->sk_state == TCP_CLOSE) {
++ mptcp_for_each_sk_safe(mpcb, sk_it, tmpsk) {
++ if (tcp_sk(sk_it)->send_mp_fclose)
++ continue;
++ mptcp_sub_close(sk_it, 0);
++ }
++ goto adjudge_to_death;
++ }
++
++ if (data_was_unread) {
++ /* Unread data was tossed, zap the connection. */
++ NET_INC_STATS_USER(sock_net(meta_sk), LINUX_MIB_TCPABORTONCLOSE);
++ tcp_set_state(meta_sk, TCP_CLOSE);
++ tcp_sk(meta_sk)->ops->send_active_reset(meta_sk,
++ meta_sk->sk_allocation);
++ } else if (sock_flag(meta_sk, SOCK_LINGER) && !meta_sk->sk_lingertime) {
++ /* Check zero linger _after_ checking for unread data. */
++ meta_sk->sk_prot->disconnect(meta_sk, 0);
++ NET_INC_STATS_USER(sock_net(meta_sk), LINUX_MIB_TCPABORTONDATA);
++ } else if (tcp_close_state(meta_sk)) {
++ mptcp_send_fin(meta_sk);
++ } else if (meta_tp->snd_una == meta_tp->write_seq) {
++ /* The DATA_FIN has been sent and acknowledged
++ * (e.g., by sk_shutdown). Close all the other subflows
++ */
++ mptcp_for_each_sk_safe(mpcb, sk_it, tmpsk) {
++ unsigned long delay = 0;
++ /* If we are the passive closer, don't trigger
++ * subflow-fin until the subflow has been finned
++ * by the peer - thus we add a delay.
++ */
++ if (mpcb->passive_close &&
++ sk_it->sk_state == TCP_ESTABLISHED)
++ delay = inet_csk(sk_it)->icsk_rto << 3;
++
++ mptcp_sub_close(sk_it, delay);
++ }
++ }
++
++ sk_stream_wait_close(meta_sk, timeout);
++
++adjudge_to_death:
++ state = meta_sk->sk_state;
++ sock_hold(meta_sk);
++ sock_orphan(meta_sk);
++
++ /* socket will be freed after mptcp_close - we have to prevent
++ * access from the subflows.
++ */
++ mptcp_for_each_sk(mpcb, sk_it) {
++ /* Similar to sock_orphan, but we don't set it DEAD, because
++ * the callbacks are still set and must be called.
++ */
++ write_lock_bh(&sk_it->sk_callback_lock);
++ sk_set_socket(sk_it, NULL);
++ sk_it->sk_wq = NULL;
++ write_unlock_bh(&sk_it->sk_callback_lock);
++ }
++
++ /* It is the last release_sock in its life. It will remove backlog. */
++ release_sock(meta_sk);
++
++ /* Now socket is owned by kernel and we acquire BH lock
++ * to finish close. No need to check for user refs.
++ */
++ local_bh_disable();
++ bh_lock_sock(meta_sk);
++ WARN_ON(sock_owned_by_user(meta_sk));
++
++ percpu_counter_inc(meta_sk->sk_prot->orphan_count);
++
++ /* Have we already been destroyed by a softirq or backlog? */
++ if (state != TCP_CLOSE && meta_sk->sk_state == TCP_CLOSE)
++ goto out;
++
++ /* This is a (useful) BSD violation of the RFC. There is a
++ * problem with TCP as specified in that the other end could
++ * keep a socket open forever with no application left this end.
++ * We use a 3 minute timeout (about the same as BSD) then kill
++ * our end. If they send after that then tough - BUT: long enough
++ * that we won't make the old 4*rto = almost no time - whoops
++ * reset mistake.
++ *
++ * Nope, it was not mistake. It is really desired behaviour
++ * f.e. on http servers, when such sockets are useless, but
++ * consume significant resources. Let's do it with special
++ * linger2 option. --ANK
++ */
++
++ if (meta_sk->sk_state == TCP_FIN_WAIT2) {
++ if (meta_tp->linger2 < 0) {
++ tcp_set_state(meta_sk, TCP_CLOSE);
++ meta_tp->ops->send_active_reset(meta_sk, GFP_ATOMIC);
++ NET_INC_STATS_BH(sock_net(meta_sk),
++ LINUX_MIB_TCPABORTONLINGER);
++ } else {
++ const int tmo = tcp_fin_time(meta_sk);
++
++ if (tmo > TCP_TIMEWAIT_LEN) {
++ inet_csk_reset_keepalive_timer(meta_sk,
++ tmo - TCP_TIMEWAIT_LEN);
++ } else {
++ meta_tp->ops->time_wait(meta_sk, TCP_FIN_WAIT2,
++ tmo);
++ goto out;
++ }
++ }
++ }
++ if (meta_sk->sk_state != TCP_CLOSE) {
++ sk_mem_reclaim(meta_sk);
++ if (tcp_too_many_orphans(meta_sk, 0)) {
++ if (net_ratelimit())
++ pr_info("MPTCP: too many orphaned sockets\n");
++ tcp_set_state(meta_sk, TCP_CLOSE);
++ meta_tp->ops->send_active_reset(meta_sk, GFP_ATOMIC);
++ NET_INC_STATS_BH(sock_net(meta_sk),
++ LINUX_MIB_TCPABORTONMEMORY);
++ }
++ }
++
++
++ if (meta_sk->sk_state == TCP_CLOSE)
++ inet_csk_destroy_sock(meta_sk);
++ /* Otherwise, socket is reprieved until protocol close. */
++
++out:
++ bh_unlock_sock(meta_sk);
++ local_bh_enable();
++ mutex_unlock(&mpcb->mpcb_mutex);
++ sock_put(meta_sk); /* Taken by sock_hold */
++}
++
++void mptcp_disconnect(struct sock *sk)
++{
++ struct sock *subsk, *tmpsk;
++ struct tcp_sock *tp = tcp_sk(sk);
++
++ mptcp_delete_synack_timer(sk);
++
++ __skb_queue_purge(&tp->mpcb->reinject_queue);
++
++ if (tp->inside_tk_table) {
++ mptcp_hash_remove_bh(tp);
++ reqsk_queue_destroy(&inet_csk(tp->meta_sk)->icsk_accept_queue);
++ }
++
++ local_bh_disable();
++ mptcp_for_each_sk_safe(tp->mpcb, subsk, tmpsk) {
++ /* The socket will get removed from the subsocket-list
++ * and made non-mptcp by setting mpc to 0.
++ *
++ * This is necessary, because tcp_disconnect assumes
++ * that the connection is completely dead afterwards.
++ * Thus we need to do a mptcp_del_sock. Due to this call
++ * we have to make it non-mptcp.
++ *
++ * We have to lock the socket, because we set mpc to 0.
++ * An incoming packet would take the subsocket's lock
++ * and go on into the receive-path.
++ * This would be a race.
++ */
++
++ bh_lock_sock(subsk);
++ mptcp_del_sock(subsk);
++ tcp_sk(subsk)->mpc = 0;
++ tcp_sk(subsk)->ops = &tcp_specific;
++ mptcp_sub_force_close(subsk);
++ bh_unlock_sock(subsk);
++ }
++ local_bh_enable();
++
++ tp->was_meta_sk = 1;
++ tp->mpc = 0;
++ tp->ops = &tcp_specific;
++}
++
++
++/* Returns 1 if we should enable MPTCP for that socket. */
++int mptcp_doit(struct sock *sk)
++{
++ /* Do not allow MPTCP enabling if the MPTCP initialization failed */
++ if (mptcp_init_failed)
++ return 0;
++
++ if (sysctl_mptcp_enabled == MPTCP_APP && !tcp_sk(sk)->mptcp_enabled)
++ return 0;
++
++ /* Socket may already be established (e.g., called from tcp_recvmsg) */
++ if (mptcp(tcp_sk(sk)) || tcp_sk(sk)->request_mptcp)
++ return 1;
++
++ /* Don't do mptcp over loopback */
++ if (sk->sk_family == AF_INET &&
++ (ipv4_is_loopback(inet_sk(sk)->inet_daddr) ||
++ ipv4_is_loopback(inet_sk(sk)->inet_saddr)))
++ return 0;
++#if IS_ENABLED(CONFIG_IPV6)
++ if (sk->sk_family == AF_INET6 &&
++ (ipv6_addr_loopback(&sk->sk_v6_daddr) ||
++ ipv6_addr_loopback(&inet6_sk(sk)->saddr)))
++ return 0;
++#endif
++ if (mptcp_v6_is_v4_mapped(sk) &&
++ ipv4_is_loopback(inet_sk(sk)->inet_saddr))
++ return 0;
++
++#ifdef CONFIG_TCP_MD5SIG
++ /* If TCP_MD5SIG is enabled, do not do MPTCP - there is no Option-Space */
++ if (tcp_sk(sk)->af_specific->md5_lookup(sk, sk))
++ return 0;
++#endif
++
++ return 1;
++}
++
++int mptcp_create_master_sk(struct sock *meta_sk, __u64 remote_key, u32 window)
++{
++ struct tcp_sock *master_tp;
++ struct sock *master_sk;
++
++ if (mptcp_alloc_mpcb(meta_sk, remote_key, window))
++ goto err_alloc_mpcb;
++
++ master_sk = tcp_sk(meta_sk)->mpcb->master_sk;
++ master_tp = tcp_sk(master_sk);
++
++ if (mptcp_add_sock(meta_sk, master_sk, 0, 0, GFP_ATOMIC))
++ goto err_add_sock;
++
++ if (__inet_inherit_port(meta_sk, master_sk) < 0)
++ goto err_add_sock;
++
++ meta_sk->sk_prot->unhash(meta_sk);
++
++ if (master_sk->sk_family == AF_INET || mptcp_v6_is_v4_mapped(master_sk))
++ __inet_hash_nolisten(master_sk, NULL);
++#if IS_ENABLED(CONFIG_IPV6)
++ else
++ __inet6_hash(master_sk, NULL);
++#endif
++
++ master_tp->mptcp->init_rcv_wnd = master_tp->rcv_wnd;
++
++ return 0;
++
++err_add_sock:
++ mptcp_fallback_meta_sk(meta_sk);
++
++ inet_csk_prepare_forced_close(master_sk);
++ tcp_done(master_sk);
++ inet_csk_prepare_forced_close(meta_sk);
++ tcp_done(meta_sk);
++
++err_alloc_mpcb:
++ return -ENOBUFS;
++}
++
++static int __mptcp_check_req_master(struct sock *child,
++ struct request_sock *req)
++{
++ struct tcp_sock *child_tp = tcp_sk(child);
++ struct sock *meta_sk = child;
++ struct mptcp_cb *mpcb;
++ struct mptcp_request_sock *mtreq;
++
++ /* Never contained an MP_CAPABLE */
++ if (!inet_rsk(req)->mptcp_rqsk)
++ return 1;
++
++ if (!inet_rsk(req)->saw_mpc) {
++ /* Fall back to regular TCP, because we saw one SYN without
++ * MP_CAPABLE. In tcp_check_req we continue the regular path.
++ * But, the socket has been added to the reqsk_tk_htb, so we
++ * must still remove it.
++ */
++ mptcp_reqsk_remove_tk(req);
++ return 1;
++ }
++
++ /* Just set these values to pass them to mptcp_alloc_mpcb */
++ mtreq = mptcp_rsk(req);
++ child_tp->mptcp_loc_key = mtreq->mptcp_loc_key;
++ child_tp->mptcp_loc_token = mtreq->mptcp_loc_token;
++
++ if (mptcp_create_master_sk(meta_sk, mtreq->mptcp_rem_key,
++ child_tp->snd_wnd))
++ return -ENOBUFS;
++
++ child = tcp_sk(child)->mpcb->master_sk;
++ child_tp = tcp_sk(child);
++ mpcb = child_tp->mpcb;
++
++ child_tp->mptcp->snt_isn = tcp_rsk(req)->snt_isn;
++ child_tp->mptcp->rcv_isn = tcp_rsk(req)->rcv_isn;
++
++ mpcb->dss_csum = mtreq->dss_csum;
++ mpcb->server_side = 1;
++
++ /* Will be moved to ESTABLISHED by tcp_rcv_state_process() */
++ mptcp_update_metasocket(child, meta_sk);
++
++ /* Needs to be done here additionally, because when accepting a
++ * new connection we pass by __reqsk_free and not reqsk_free.
++ */
++ mptcp_reqsk_remove_tk(req);
++
++ /* Hold when creating the meta-sk in tcp_vX_syn_recv_sock. */
++ sock_put(meta_sk);
++
++ return 0;
++}
++
++int mptcp_check_req_fastopen(struct sock *child, struct request_sock *req)
++{
++ struct sock *meta_sk = child, *master_sk;
++ struct sk_buff *skb;
++ u32 new_mapping;
++ int ret;
++
++ ret = __mptcp_check_req_master(child, req);
++ if (ret)
++ return ret;
++
++ master_sk = tcp_sk(meta_sk)->mpcb->master_sk;
++
++ /* We need to rewind copied_seq as it is set to IDSN + 1 and as we have
++ * pre-MPTCP data in the receive queue.
++ */
++ tcp_sk(meta_sk)->copied_seq -= tcp_sk(master_sk)->rcv_nxt -
++ tcp_rsk(req)->rcv_isn - 1;
++
++ /* Map subflow sequence number to data sequence numbers. We need to map
++ * these data to [IDSN - len - 1, IDSN).
++ */
++ new_mapping = tcp_sk(meta_sk)->copied_seq - tcp_rsk(req)->rcv_isn - 1;
++
++ /* There should be only one skb: the SYN + data. */
++ skb_queue_walk(&meta_sk->sk_receive_queue, skb) {
++ TCP_SKB_CB(skb)->seq += new_mapping;
++ TCP_SKB_CB(skb)->end_seq += new_mapping;
++ }
++
++ /* With fastopen we change the semantics of the relative subflow
++ * sequence numbers to deal with middleboxes that could add/remove
++ * multiple bytes in the SYN. We chose to start counting at rcv_nxt - 1
++ * instead of the regular TCP ISN.
++ */
++ tcp_sk(master_sk)->mptcp->rcv_isn = tcp_sk(master_sk)->rcv_nxt - 1;
++
++ /* We need to update copied_seq of the master_sk to account for the
++ * already moved data to the meta receive queue.
++ */
++ tcp_sk(master_sk)->copied_seq = tcp_sk(master_sk)->rcv_nxt;
++
++ /* Handled by the master_sk */
++ tcp_sk(meta_sk)->fastopen_rsk = NULL;
++
++ return 0;
++}
++
++int mptcp_check_req_master(struct sock *sk, struct sock *child,
++ struct request_sock *req,
++ struct request_sock **prev)
++{
++ struct sock *meta_sk = child;
++ int ret;
++
++ ret = __mptcp_check_req_master(child, req);
++ if (ret)
++ return ret;
++
++ inet_csk_reqsk_queue_unlink(sk, req, prev);
++ inet_csk_reqsk_queue_removed(sk, req);
++ inet_csk_reqsk_queue_add(sk, req, meta_sk);
++
++ return 0;
++}
++
++struct sock *mptcp_check_req_child(struct sock *meta_sk, struct sock *child,
++ struct request_sock *req,
++ struct request_sock **prev,
++ const struct mptcp_options_received *mopt)
++{
++ struct tcp_sock *child_tp = tcp_sk(child);
++ struct mptcp_request_sock *mtreq = mptcp_rsk(req);
++ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ u8 hash_mac_check[20];
++
++ child_tp->inside_tk_table = 0;
++
++ if (!mopt->join_ack)
++ goto teardown;
++
++ mptcp_hmac_sha1((u8 *)&mpcb->mptcp_rem_key,
++ (u8 *)&mpcb->mptcp_loc_key,
++ (u8 *)&mtreq->mptcp_rem_nonce,
++ (u8 *)&mtreq->mptcp_loc_nonce,
++ (u32 *)hash_mac_check);
++
++ if (memcmp(hash_mac_check, (char *)&mopt->mptcp_recv_mac, 20))
++ goto teardown;
++
++ /* Point it to the same struct socket and wq as the meta_sk */
++ sk_set_socket(child, meta_sk->sk_socket);
++ child->sk_wq = meta_sk->sk_wq;
++
++ if (mptcp_add_sock(meta_sk, child, mtreq->loc_id, mtreq->rem_id, GFP_ATOMIC)) {
++ /* Has been inherited, but now child_tp->mptcp is NULL */
++ child_tp->mpc = 0;
++ child_tp->ops = &tcp_specific;
++
++ /* TODO when we support acking the third ack for new subflows,
++ * we should silently discard this third ack, by returning NULL.
++ *
++ * Maybe, at the retransmission we will have enough memory to
++ * fully add the socket to the meta-sk.
++ */
++ goto teardown;
++ }
++
++ /* The child is a clone of the meta socket, we must now reset
++ * some of the fields
++ */
++ child_tp->mptcp->rcv_low_prio = mtreq->rcv_low_prio;
++
++ /* We should allow proper increase of the snd/rcv-buffers. Thus, we
++ * use the original values instead of the bloated up ones from the
++ * clone.
++ */
++ child->sk_sndbuf = mpcb->orig_sk_sndbuf;
++ child->sk_rcvbuf = mpcb->orig_sk_rcvbuf;
++
++ child_tp->mptcp->slave_sk = 1;
++ child_tp->mptcp->snt_isn = tcp_rsk(req)->snt_isn;
++ child_tp->mptcp->rcv_isn = tcp_rsk(req)->rcv_isn;
++ child_tp->mptcp->init_rcv_wnd = req->rcv_wnd;
++
++ child_tp->tsq_flags = 0;
++
++ /* Subflows do not use the accept queue, as they
++ * are attached immediately to the mpcb.
++ */
++ inet_csk_reqsk_queue_unlink(meta_sk, req, prev);
++ reqsk_queue_removed(&inet_csk(meta_sk)->icsk_accept_queue, req);
++ reqsk_free(req);
++ return child;
++
++teardown:
++ /* Drop this request - sock creation failed. */
++ inet_csk_reqsk_queue_unlink(meta_sk, req, prev);
++ reqsk_queue_removed(&inet_csk(meta_sk)->icsk_accept_queue, req);
++ reqsk_free(req);
++ inet_csk_prepare_forced_close(child);
++ tcp_done(child);
++ return meta_sk;
++}
++
++int mptcp_init_tw_sock(struct sock *sk, struct tcp_timewait_sock *tw)
++{
++ struct mptcp_tw *mptw;
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct mptcp_cb *mpcb = tp->mpcb;
++
++ /* A subsocket in tw can only receive data. So, if we are in
++ * infinite-receive, then we should not reply with a data-ack or act
++ * upon general MPTCP-signaling. We prevent this by simply not creating
++ * the mptcp_tw_sock.
++ */
++ if (mpcb->infinite_mapping_rcv) {
++ tw->mptcp_tw = NULL;
++ return 0;
++ }
++
++ /* Alloc MPTCP-tw-sock */
++ mptw = kmem_cache_alloc(mptcp_tw_cache, GFP_ATOMIC);
++ if (!mptw)
++ return -ENOBUFS;
++
++ atomic_inc(&mpcb->mpcb_refcnt);
++
++ tw->mptcp_tw = mptw;
++ mptw->loc_key = mpcb->mptcp_loc_key;
++ mptw->meta_tw = mpcb->in_time_wait;
++ if (mptw->meta_tw) {
++ mptw->rcv_nxt = mptcp_get_rcv_nxt_64(mptcp_meta_tp(tp));
++ if (mpcb->mptw_state != TCP_TIME_WAIT)
++ mptw->rcv_nxt++;
++ }
++ rcu_assign_pointer(mptw->mpcb, mpcb);
++
++ spin_lock(&mpcb->tw_lock);
++ list_add_rcu(&mptw->list, &tp->mpcb->tw_list);
++ mptw->in_list = 1;
++ spin_unlock(&mpcb->tw_lock);
++
++ return 0;
++}
++
++void mptcp_twsk_destructor(struct tcp_timewait_sock *tw)
++{
++ struct mptcp_cb *mpcb;
++
++ rcu_read_lock();
++ mpcb = rcu_dereference(tw->mptcp_tw->mpcb);
++
++ /* If we are still holding a ref to the mpcb, we have to remove ourselves
++ * from the list and drop the ref properly.
++ */
++ if (mpcb && atomic_inc_not_zero(&mpcb->mpcb_refcnt)) {
++ spin_lock(&mpcb->tw_lock);
++ if (tw->mptcp_tw->in_list) {
++ list_del_rcu(&tw->mptcp_tw->list);
++ tw->mptcp_tw->in_list = 0;
++ }
++ spin_unlock(&mpcb->tw_lock);
++
++ /* Twice, because we increased it above */
++ mptcp_mpcb_put(mpcb);
++ mptcp_mpcb_put(mpcb);
++ }
++
++ rcu_read_unlock();
++
++ kmem_cache_free(mptcp_tw_cache, tw->mptcp_tw);
++}
++
++/* Updates the rcv_nxt of the time-wait-socks and allows them to ack a
++ * data-fin.
++ */
++void mptcp_time_wait(struct sock *sk, int state, int timeo)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct mptcp_tw *mptw;
++
++ /* Used for sockets that go into tw after the meta
++ * (see mptcp_init_tw_sock())
++ */
++ tp->mpcb->in_time_wait = 1;
++ tp->mpcb->mptw_state = state;
++
++ /* Update the time-wait-sock's information */
++ rcu_read_lock_bh();
++ list_for_each_entry_rcu(mptw, &tp->mpcb->tw_list, list) {
++ mptw->meta_tw = 1;
++ mptw->rcv_nxt = mptcp_get_rcv_nxt_64(tp);
++
++ /* We want to ack a DATA_FIN, but are still in FIN_WAIT_2 -
++ * pretend the DATA_FIN has already reached us, so that the
++ * checks in tcp_timewait_state_process succeed when the
++ * DATA_FIN comes in.
++ */
++ if (state != TCP_TIME_WAIT)
++ mptw->rcv_nxt++;
++ }
++ rcu_read_unlock_bh();
++
++ tcp_done(sk);
++}
++
++void mptcp_tsq_flags(struct sock *sk)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct sock *meta_sk = mptcp_meta_sk(sk);
++
++ /* It will be handled as a regular deferred-call */
++ if (is_meta_sk(sk))
++ return;
++
++ if (hlist_unhashed(&tp->mptcp->cb_list)) {
++ hlist_add_head(&tp->mptcp->cb_list, &tp->mpcb->callback_list);
++ /* We need to hold it here, as the sock_hold is not assured
++ * by the release_sock as it is done in regular TCP.
++ *
++ * The subsocket may get inet_csk_destroy'd while it is inside
++ * the callback_list.
++ */
++ sock_hold(sk);
++ }
++
++ if (!test_and_set_bit(MPTCP_SUB_DEFERRED, &tcp_sk(meta_sk)->tsq_flags))
++ sock_hold(meta_sk);
++}
++
++void mptcp_tsq_sub_deferred(struct sock *meta_sk)
++{
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct mptcp_tcp_sock *mptcp;
++ struct hlist_node *tmp;
++
++ BUG_ON(!is_meta_sk(meta_sk) && !meta_tp->was_meta_sk);
++
++ __sock_put(meta_sk);
++ hlist_for_each_entry_safe(mptcp, tmp, &meta_tp->mpcb->callback_list, cb_list) {
++ struct tcp_sock *tp = mptcp->tp;
++ struct sock *sk = (struct sock *)tp;
++
++ hlist_del_init(&mptcp->cb_list);
++ sk->sk_prot->release_cb(sk);
++ /* Final sock_put (cf. mptcp_tsq_flags) */
++ sock_put(sk);
++ }
++}
++
++void mptcp_join_reqsk_init(struct mptcp_cb *mpcb, const struct request_sock *req,
++ struct sk_buff *skb)
++{
++ struct mptcp_request_sock *mtreq = mptcp_rsk(req);
++ struct mptcp_options_received mopt;
++ u8 mptcp_hash_mac[20];
++
++ mptcp_init_mp_opt(&mopt);
++ tcp_parse_mptcp_options(skb, &mopt);
++
++ mtreq = mptcp_rsk(req);
++ mtreq->mptcp_mpcb = mpcb;
++ mtreq->is_sub = 1;
++ inet_rsk(req)->mptcp_rqsk = 1;
++
++ mtreq->mptcp_rem_nonce = mopt.mptcp_recv_nonce;
++
++ mptcp_hmac_sha1((u8 *)&mpcb->mptcp_loc_key,
++ (u8 *)&mpcb->mptcp_rem_key,
++ (u8 *)&mtreq->mptcp_loc_nonce,
++ (u8 *)&mtreq->mptcp_rem_nonce, (u32 *)mptcp_hash_mac);
++ mtreq->mptcp_hash_tmac = *(u64 *)mptcp_hash_mac;
++
++ mtreq->rem_id = mopt.rem_id;
++ mtreq->rcv_low_prio = mopt.low_prio;
++ inet_rsk(req)->saw_mpc = 1;
++}
++
++void mptcp_reqsk_init(struct request_sock *req, const struct sk_buff *skb)
++{
++ struct mptcp_options_received mopt;
++ struct mptcp_request_sock *mreq = mptcp_rsk(req);
++
++ mptcp_init_mp_opt(&mopt);
++ tcp_parse_mptcp_options(skb, &mopt);
++
++ mreq->is_sub = 0;
++ inet_rsk(req)->mptcp_rqsk = 1;
++ mreq->dss_csum = mopt.dss_csum;
++ mreq->hash_entry.pprev = NULL;
++
++ mptcp_reqsk_new_mptcp(req, &mopt, skb);
++}
++
++int mptcp_conn_request(struct sock *sk, struct sk_buff *skb)
++{
++ struct mptcp_options_received mopt;
++ const struct tcp_sock *tp = tcp_sk(sk);
++ __u32 isn = TCP_SKB_CB(skb)->when;
++ bool want_cookie = false;
++
++ if ((sysctl_tcp_syncookies == 2 ||
++ inet_csk_reqsk_queue_is_full(sk)) && !isn) {
++ want_cookie = tcp_syn_flood_action(sk, skb,
++ mptcp_request_sock_ops.slab_name);
++ if (!want_cookie)
++ goto drop;
++ }
++
++ mptcp_init_mp_opt(&mopt);
++ tcp_parse_mptcp_options(skb, &mopt);
++
++ if (mopt.is_mp_join)
++ return mptcp_do_join_short(skb, &mopt, sock_net(sk));
++ if (mopt.drop_me)
++ goto drop;
++
++ if (sysctl_mptcp_enabled == MPTCP_APP && !tp->mptcp_enabled)
++ mopt.saw_mpc = 0;
++
++ if (skb->protocol == htons(ETH_P_IP)) {
++ if (mopt.saw_mpc && !want_cookie) {
++ if (skb_rtable(skb)->rt_flags &
++ (RTCF_BROADCAST | RTCF_MULTICAST))
++ goto drop;
++
++ return tcp_conn_request(&mptcp_request_sock_ops,
++ &mptcp_request_sock_ipv4_ops,
++ sk, skb);
++ }
++
++ return tcp_v4_conn_request(sk, skb);
++#if IS_ENABLED(CONFIG_IPV6)
++ } else {
++ if (mopt.saw_mpc && !want_cookie) {
++ if (!ipv6_unicast_destination(skb))
++ goto drop;
++
++ return tcp_conn_request(&mptcp6_request_sock_ops,
++ &mptcp_request_sock_ipv6_ops,
++ sk, skb);
++ }
++
++ return tcp_v6_conn_request(sk, skb);
++#endif
++ }
++drop:
++ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENDROPS);
++ return 0;
++}
++
++struct workqueue_struct *mptcp_wq;
++EXPORT_SYMBOL(mptcp_wq);
++
++/* Output /proc/net/mptcp */
++static int mptcp_pm_seq_show(struct seq_file *seq, void *v)
++{
++ struct tcp_sock *meta_tp;
++ const struct net *net = seq->private;
++ int i, n = 0;
++
++ seq_printf(seq, " sl loc_tok rem_tok v6 local_address remote_address st ns tx_queue rx_queue inode");
++ seq_putc(seq, '\n');
++
++ for (i = 0; i < MPTCP_HASH_SIZE; i++) {
++ struct hlist_nulls_node *node;
++ rcu_read_lock_bh();
++ hlist_nulls_for_each_entry_rcu(meta_tp, node,
++ &tk_hashtable[i], tk_table) {
++ struct mptcp_cb *mpcb = meta_tp->mpcb;
++ struct sock *meta_sk = (struct sock *)meta_tp;
++ struct inet_sock *isk = inet_sk(meta_sk);
++
++ if (!mptcp(meta_tp) || !net_eq(net, sock_net(meta_sk)))
++ continue;
++
++ if (capable(CAP_NET_ADMIN)) {
++ seq_printf(seq, "%4d: %04X %04X ", n++,
++ mpcb->mptcp_loc_token,
++ mpcb->mptcp_rem_token);
++ } else {
++ seq_printf(seq, "%4d: %04X %04X ", n++, -1, -1);
++ }
++ if (meta_sk->sk_family == AF_INET ||
++ mptcp_v6_is_v4_mapped(meta_sk)) {
++ seq_printf(seq, " 0 %08X:%04X %08X:%04X ",
++ isk->inet_rcv_saddr,
++ ntohs(isk->inet_sport),
++ isk->inet_daddr,
++ ntohs(isk->inet_dport));
++#if IS_ENABLED(CONFIG_IPV6)
++ } else if (meta_sk->sk_family == AF_INET6) {
++ struct in6_addr *src = &meta_sk->sk_v6_rcv_saddr;
++ struct in6_addr *dst = &meta_sk->sk_v6_daddr;
++ seq_printf(seq, " 1 %08X%08X%08X%08X:%04X %08X%08X%08X%08X:%04X",
++ src->s6_addr32[0], src->s6_addr32[1],
++ src->s6_addr32[2], src->s6_addr32[3],
++ ntohs(isk->inet_sport),
++ dst->s6_addr32[0], dst->s6_addr32[1],
++ dst->s6_addr32[2], dst->s6_addr32[3],
++ ntohs(isk->inet_dport));
++#endif
++ }
++ seq_printf(seq, " %02X %02X %08X:%08X %lu",
++ meta_sk->sk_state, mpcb->cnt_subflows,
++ meta_tp->write_seq - meta_tp->snd_una,
++ max_t(int, meta_tp->rcv_nxt -
++ meta_tp->copied_seq, 0),
++ sock_i_ino(meta_sk));
++ seq_putc(seq, '\n');
++ }
++
++ rcu_read_unlock_bh();
++ }
++
++ return 0;
++}
++
++static int mptcp_pm_seq_open(struct inode *inode, struct file *file)
++{
++ return single_open_net(inode, file, mptcp_pm_seq_show);
++}
++
++static const struct file_operations mptcp_pm_seq_fops = {
++ .owner = THIS_MODULE,
++ .open = mptcp_pm_seq_open,
++ .read = seq_read,
++ .llseek = seq_lseek,
++ .release = single_release_net,
++};
++
++static int mptcp_pm_init_net(struct net *net)
++{
++ if (!proc_create("mptcp", S_IRUGO, net->proc_net, &mptcp_pm_seq_fops))
++ return -ENOMEM;
++
++ return 0;
++}
++
++static void mptcp_pm_exit_net(struct net *net)
++{
++ remove_proc_entry("mptcp", net->proc_net);
++}
++
++static struct pernet_operations mptcp_pm_proc_ops = {
++ .init = mptcp_pm_init_net,
++ .exit = mptcp_pm_exit_net,
++};
++
++/* General initialization of mptcp */
++void __init mptcp_init(void)
++{
++ int i;
++ struct ctl_table_header *mptcp_sysctl;
++
++ mptcp_sock_cache = kmem_cache_create("mptcp_sock",
++ sizeof(struct mptcp_tcp_sock),
++ 0, SLAB_HWCACHE_ALIGN,
++ NULL);
++ if (!mptcp_sock_cache)
++ goto mptcp_sock_cache_failed;
++
++ mptcp_cb_cache = kmem_cache_create("mptcp_cb", sizeof(struct mptcp_cb),
++ 0, SLAB_DESTROY_BY_RCU|SLAB_HWCACHE_ALIGN,
++ NULL);
++ if (!mptcp_cb_cache)
++ goto mptcp_cb_cache_failed;
++
++ mptcp_tw_cache = kmem_cache_create("mptcp_tw", sizeof(struct mptcp_tw),
++ 0, SLAB_DESTROY_BY_RCU|SLAB_HWCACHE_ALIGN,
++ NULL);
++ if (!mptcp_tw_cache)
++ goto mptcp_tw_cache_failed;
++
++ get_random_bytes(mptcp_secret, sizeof(mptcp_secret));
++
++ mptcp_wq = alloc_workqueue("mptcp_wq", WQ_UNBOUND | WQ_MEM_RECLAIM, 8);
++ if (!mptcp_wq)
++ goto alloc_workqueue_failed;
++
++ for (i = 0; i < MPTCP_HASH_SIZE; i++) {
++ INIT_HLIST_NULLS_HEAD(&tk_hashtable[i], i);
++ INIT_HLIST_NULLS_HEAD(&mptcp_reqsk_htb[i],
++ i + MPTCP_REQSK_NULLS_BASE);
++ INIT_HLIST_NULLS_HEAD(&mptcp_reqsk_tk_htb[i], i);
++ }
++
++ spin_lock_init(&mptcp_reqsk_hlock);
++ spin_lock_init(&mptcp_tk_hashlock);
++
++ if (register_pernet_subsys(&mptcp_pm_proc_ops))
++ goto pernet_failed;
++
++#if IS_ENABLED(CONFIG_IPV6)
++ if (mptcp_pm_v6_init())
++ goto mptcp_pm_v6_failed;
++#endif
++ if (mptcp_pm_v4_init())
++ goto mptcp_pm_v4_failed;
++
++ mptcp_sysctl = register_net_sysctl(&init_net, "net/mptcp", mptcp_table);
++ if (!mptcp_sysctl)
++ goto register_sysctl_failed;
++
++ if (mptcp_register_path_manager(&mptcp_pm_default))
++ goto register_pm_failed;
++
++ if (mptcp_register_scheduler(&mptcp_sched_default))
++ goto register_sched_failed;
++
++ pr_info("MPTCP: Stable release v0.89.0-rc");
++
++ mptcp_init_failed = false;
++
++ return;
++
++register_sched_failed:
++ mptcp_unregister_path_manager(&mptcp_pm_default);
++register_pm_failed:
++ unregister_net_sysctl_table(mptcp_sysctl);
++register_sysctl_failed:
++ mptcp_pm_v4_undo();
++mptcp_pm_v4_failed:
++#if IS_ENABLED(CONFIG_IPV6)
++ mptcp_pm_v6_undo();
++mptcp_pm_v6_failed:
++#endif
++ unregister_pernet_subsys(&mptcp_pm_proc_ops);
++pernet_failed:
++ destroy_workqueue(mptcp_wq);
++alloc_workqueue_failed:
++ kmem_cache_destroy(mptcp_tw_cache);
++mptcp_tw_cache_failed:
++ kmem_cache_destroy(mptcp_cb_cache);
++mptcp_cb_cache_failed:
++ kmem_cache_destroy(mptcp_sock_cache);
++mptcp_sock_cache_failed:
++ mptcp_init_failed = true;
++}
+diff --git a/net/mptcp/mptcp_fullmesh.c b/net/mptcp/mptcp_fullmesh.c
+new file mode 100644
+index 000000000000..3a54413ce25b
+--- /dev/null
++++ b/net/mptcp/mptcp_fullmesh.c
+@@ -0,0 +1,1722 @@
++#include <linux/module.h>
++
++#include <net/mptcp.h>
++#include <net/mptcp_v4.h>
++
++#if IS_ENABLED(CONFIG_IPV6)
++#include <net/mptcp_v6.h>
++#include <net/addrconf.h>
++#endif
++
++enum {
++ MPTCP_EVENT_ADD = 1,
++ MPTCP_EVENT_DEL,
++ MPTCP_EVENT_MOD,
++};
++
++#define MPTCP_SUBFLOW_RETRY_DELAY 1000
++
++/* Max number of local or remote addresses we can store.
++ * When changing, see the bitfield below in fullmesh_rem4/6.
++ */
++#define MPTCP_MAX_ADDR 8
++
++struct fullmesh_rem4 {
++ u8 rem4_id;
++ u8 bitfield;
++ u8 retry_bitfield;
++ __be16 port;
++ struct in_addr addr;
++};
++
++struct fullmesh_rem6 {
++ u8 rem6_id;
++ u8 bitfield;
++ u8 retry_bitfield;
++ __be16 port;
++ struct in6_addr addr;
++};
++
++struct mptcp_loc_addr {
++ struct mptcp_loc4 locaddr4[MPTCP_MAX_ADDR];
++ u8 loc4_bits;
++ u8 next_v4_index;
++
++ struct mptcp_loc6 locaddr6[MPTCP_MAX_ADDR];
++ u8 loc6_bits;
++ u8 next_v6_index;
++};
++
++struct mptcp_addr_event {
++ struct list_head list;
++ unsigned short family;
++ u8 code:7,
++ low_prio:1;
++ union inet_addr addr;
++};
++
++struct fullmesh_priv {
++ /* Worker struct for subflow establishment */
++ struct work_struct subflow_work;
++ /* Delayed worker, when the routing-tables are not yet ready. */
++ struct delayed_work subflow_retry_work;
++
++ /* Remote addresses */
++ struct fullmesh_rem4 remaddr4[MPTCP_MAX_ADDR];
++ struct fullmesh_rem6 remaddr6[MPTCP_MAX_ADDR];
++
++ struct mptcp_cb *mpcb;
++
++ u16 remove_addrs; /* Addresses to remove */
++ u8 announced_addrs_v4; /* IPv4 Addresses we did announce */
++ u8 announced_addrs_v6; /* IPv6 Addresses we did announce */
++
++ u8 add_addr; /* Are we sending an add_addr? */
++
++ u8 rem4_bits;
++ u8 rem6_bits;
++};
++
++struct mptcp_fm_ns {
++ struct mptcp_loc_addr __rcu *local;
++ spinlock_t local_lock; /* Protecting the above pointer */
++ struct list_head events;
++ struct delayed_work address_worker;
++
++ struct net *net;
++};
++
++static struct mptcp_pm_ops full_mesh __read_mostly;
++
++static void full_mesh_create_subflows(struct sock *meta_sk);
++
++static struct mptcp_fm_ns *fm_get_ns(const struct net *net)
++{
++ return (struct mptcp_fm_ns *)net->mptcp.path_managers[MPTCP_PM_FULLMESH];
++}
++
++static struct fullmesh_priv *fullmesh_get_priv(const struct mptcp_cb *mpcb)
++{
++ return (struct fullmesh_priv *)&mpcb->mptcp_pm[0];
++}
++
++/* Find the first free index in the bitfield */
++static int __mptcp_find_free_index(u8 bitfield, u8 base)
++{
++ int i;
++
++ /* There are no free bits anyway... */
++ if (bitfield == 0xff)
++ goto exit;
++
++ i = ffs(~(bitfield >> base)) - 1;
++ if (i < 0)
++ goto exit;
++
++ /* No free bits when starting at base, try from 0 on */
++ if (i + base >= sizeof(bitfield) * 8)
++ return __mptcp_find_free_index(bitfield, 0);
++
++ return i + base;
++exit:
++ return -1;
++}
++
++static int mptcp_find_free_index(u8 bitfield)
++{
++ return __mptcp_find_free_index(bitfield, 0);
++}
++
++static void mptcp_addv4_raddr(struct mptcp_cb *mpcb,
++ const struct in_addr *addr,
++ __be16 port, u8 id)
++{
++ int i;
++ struct fullmesh_rem4 *rem4;
++ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
++
++ mptcp_for_each_bit_set(fmp->rem4_bits, i) {
++ rem4 = &fmp->remaddr4[i];
++
++ /* Address is already in the list --- continue */
++ if (rem4->rem4_id == id &&
++ rem4->addr.s_addr == addr->s_addr && rem4->port == port)
++ return;
++
++ /* This may be the case when the peer is behind a NAT. It is
++ * trying to JOIN, thus sending the JOIN with a certain ID.
++ * However, the src_addr of the IP-packet has been changed. We
++ * update the addr in the list, because this is the address as
++ * OUR BOX sees it.
++ */
++ if (rem4->rem4_id == id && rem4->addr.s_addr != addr->s_addr) {
++ /* update the address */
++ mptcp_debug("%s: updating old addr:%pI4 to addr %pI4 with id:%d\n",
++ __func__, &rem4->addr.s_addr,
++ &addr->s_addr, id);
++ rem4->addr.s_addr = addr->s_addr;
++ rem4->port = port;
++ mpcb->list_rcvd = 1;
++ return;
++ }
++ }
++
++ i = mptcp_find_free_index(fmp->rem4_bits);
++ /* Do we have already the maximum number of local/remote addresses? */
++ if (i < 0) {
++ mptcp_debug("%s: At max num of remote addresses: %d --- not adding address: %pI4\n",
++ __func__, MPTCP_MAX_ADDR, &addr->s_addr);
++ return;
++ }
++
++ rem4 = &fmp->remaddr4[i];
++
++ /* Address is not known yet, store it */
++ rem4->addr.s_addr = addr->s_addr;
++ rem4->port = port;
++ rem4->bitfield = 0;
++ rem4->retry_bitfield = 0;
++ rem4->rem4_id = id;
++ mpcb->list_rcvd = 1;
++ fmp->rem4_bits |= (1 << i);
++
++ return;
++}
++
++static void mptcp_addv6_raddr(struct mptcp_cb *mpcb,
++ const struct in6_addr *addr,
++ __be16 port, u8 id)
++{
++ int i;
++ struct fullmesh_rem6 *rem6;
++ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
++
++ mptcp_for_each_bit_set(fmp->rem6_bits, i) {
++ rem6 = &fmp->remaddr6[i];
++
++ /* Address is already in the list --- nothing to do */
++ if (rem6->rem6_id == id &&
++ ipv6_addr_equal(&rem6->addr, addr) && rem6->port == port)
++ return;
++
++ /* This can happen when the peer is behind a NAT: it sends the JOIN
++ * with a certain ID, but the src_addr of the IP packet has been
++ * rewritten on the way. Update the address in the list, because this
++ * is the address as OUR BOX sees it.
++ */
++ if (rem6->rem6_id == id) {
++ /* update the address */
++ mptcp_debug("%s: updating old addr: %pI6 to addr %pI6 with id:%d\n",
++ __func__, &rem6->addr, addr, id);
++ rem6->addr = *addr;
++ rem6->port = port;
++ mpcb->list_rcvd = 1;
++ return;
++ }
++ }
++
++ i = mptcp_find_free_index(fmp->rem6_bits);
++ /* Do we already have the maximum number of local/remote addresses? */
++ if (i < 0) {
++ mptcp_debug("%s: At max num of remote addresses: %d --- not adding address: %pI6\n",
++ __func__, MPTCP_MAX_ADDR, addr);
++ return;
++ }
++
++ rem6 = &fmp->remaddr6[i];
++
++ /* Address is not known yet, store it */
++ rem6->addr = *addr;
++ rem6->port = port;
++ rem6->bitfield = 0;
++ rem6->retry_bitfield = 0;
++ rem6->rem6_id = id;
++ mpcb->list_rcvd = 1;
++ fmp->rem6_bits |= (1 << i);
++
++ return;
++}
++
++static void mptcp_v4_rem_raddress(struct mptcp_cb *mpcb, u8 id)
++{
++ int i;
++ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
++
++ mptcp_for_each_bit_set(fmp->rem4_bits, i) {
++ if (fmp->remaddr4[i].rem4_id == id) {
++ /* remove address from bitfield */
++ fmp->rem4_bits &= ~(1 << i);
++
++ break;
++ }
++ }
++}
++
++static void mptcp_v6_rem_raddress(const struct mptcp_cb *mpcb, u8 id)
++{
++ int i;
++ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
++
++ mptcp_for_each_bit_set(fmp->rem6_bits, i) {
++ if (fmp->remaddr6[i].rem6_id == id) {
++ /* remove address from bitfield */
++ fmp->rem6_bits &= ~(1 << i);
++
++ break;
++ }
++ }
++}
++
++/* Sets the bitfield of the remote-address field */
++static void mptcp_v4_set_init_addr_bit(const struct mptcp_cb *mpcb,
++ const struct in_addr *addr, u8 index)
++{
++ int i;
++ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
++
++ mptcp_for_each_bit_set(fmp->rem4_bits, i) {
++ if (fmp->remaddr4[i].addr.s_addr == addr->s_addr) {
++ fmp->remaddr4[i].bitfield |= (1 << index);
++ return;
++ }
++ }
++}
++
++/* Sets the bitfield of the remote-address field */
++static void mptcp_v6_set_init_addr_bit(struct mptcp_cb *mpcb,
++ const struct in6_addr *addr, u8 index)
++{
++ int i;
++ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
++
++ mptcp_for_each_bit_set(fmp->rem6_bits, i) {
++ if (ipv6_addr_equal(&fmp->remaddr6[i].addr, addr)) {
++ fmp->remaddr6[i].bitfield |= (1 << index);
++ return;
++ }
++ }
++}
++
++static void mptcp_set_init_addr_bit(struct mptcp_cb *mpcb,
++ const union inet_addr *addr,
++ sa_family_t family, u8 id)
++{
++ if (family == AF_INET)
++ mptcp_v4_set_init_addr_bit(mpcb, &addr->in, id);
++ else
++ mptcp_v6_set_init_addr_bit(mpcb, &addr->in6, id);
++}
++
++static void retry_subflow_worker(struct work_struct *work)
++{
++ struct delayed_work *delayed_work = container_of(work,
++ struct delayed_work,
++ work);
++ struct fullmesh_priv *fmp = container_of(delayed_work,
++ struct fullmesh_priv,
++ subflow_retry_work);
++ struct mptcp_cb *mpcb = fmp->mpcb;
++ struct sock *meta_sk = mpcb->meta_sk;
++ struct mptcp_loc_addr *mptcp_local;
++ struct mptcp_fm_ns *fm_ns = fm_get_ns(sock_net(meta_sk));
++ int iter = 0, i;
++
++ /* We need a local (stable) copy of the address list. It is not a big
++ * deal if the address list is not 100% up-to-date.
++ */
++ rcu_read_lock_bh();
++ mptcp_local = rcu_dereference_bh(fm_ns->local);
++ mptcp_local = kmemdup(mptcp_local, sizeof(*mptcp_local), GFP_ATOMIC);
++ rcu_read_unlock_bh();
++
++ if (!mptcp_local)
++ return;
++
++next_subflow:
++ if (iter) {
++ release_sock(meta_sk);
++ mutex_unlock(&mpcb->mpcb_mutex);
++
++ cond_resched();
++ }
++ mutex_lock(&mpcb->mpcb_mutex);
++ lock_sock_nested(meta_sk, SINGLE_DEPTH_NESTING);
++
++ iter++;
++
++ if (sock_flag(meta_sk, SOCK_DEAD))
++ goto exit;
++
++ mptcp_for_each_bit_set(fmp->rem4_bits, i) {
++ struct fullmesh_rem4 *rem = &fmp->remaddr4[i];
++ /* Do we need to retry establishing a subflow? */
++ if (rem->retry_bitfield) {
++ int i = mptcp_find_free_index(~rem->retry_bitfield);
++ struct mptcp_rem4 rem4;
++
++ rem->bitfield |= (1 << i);
++ rem->retry_bitfield &= ~(1 << i);
++
++ rem4.addr = rem->addr;
++ rem4.port = rem->port;
++ rem4.rem4_id = rem->rem4_id;
++
++ mptcp_init4_subsockets(meta_sk, &mptcp_local->locaddr4[i], &rem4);
++ goto next_subflow;
++ }
++ }
++
++#if IS_ENABLED(CONFIG_IPV6)
++ mptcp_for_each_bit_set(fmp->rem6_bits, i) {
++ struct fullmesh_rem6 *rem = &fmp->remaddr6[i];
++
++ /* Do we need to retry establishing a subflow? */
++ if (rem->retry_bitfield) {
++ int i = mptcp_find_free_index(~rem->retry_bitfield);
++ struct mptcp_rem6 rem6;
++
++ rem->bitfield |= (1 << i);
++ rem->retry_bitfield &= ~(1 << i);
++
++ rem6.addr = rem->addr;
++ rem6.port = rem->port;
++ rem6.rem6_id = rem->rem6_id;
++
++ mptcp_init6_subsockets(meta_sk, &mptcp_local->locaddr6[i], &rem6);
++ goto next_subflow;
++ }
++ }
++#endif
++
++exit:
++ kfree(mptcp_local);
++ release_sock(meta_sk);
++ mutex_unlock(&mpcb->mpcb_mutex);
++ sock_put(meta_sk);
++}
++
++/**
++ * Create all new subflows by calling mptcp_initX_subsockets()
++ *
++ * This function uses a goto to the next_subflow label so the lock can be
++ * released between new subflows, giving other processes a chance to do
++ * some work on the socket and potentially finish the communication.
++ **/
++static void create_subflow_worker(struct work_struct *work)
++{
++ struct fullmesh_priv *fmp = container_of(work, struct fullmesh_priv,
++ subflow_work);
++ struct mptcp_cb *mpcb = fmp->mpcb;
++ struct sock *meta_sk = mpcb->meta_sk;
++ struct mptcp_loc_addr *mptcp_local;
++ const struct mptcp_fm_ns *fm_ns = fm_get_ns(sock_net(meta_sk));
++ int iter = 0, retry = 0;
++ int i;
++
++ /* We need a local (stable) copy of the address list. It is not a big
++ * deal if the address list is not 100% up-to-date.
++ */
++ rcu_read_lock_bh();
++ mptcp_local = rcu_dereference_bh(fm_ns->local);
++ mptcp_local = kmemdup(mptcp_local, sizeof(*mptcp_local), GFP_ATOMIC);
++ rcu_read_unlock_bh();
++
++ if (!mptcp_local)
++ return;
++
++next_subflow:
++ if (iter) {
++ release_sock(meta_sk);
++ mutex_unlock(&mpcb->mpcb_mutex);
++
++ cond_resched();
++ }
++ mutex_lock(&mpcb->mpcb_mutex);
++ lock_sock_nested(meta_sk, SINGLE_DEPTH_NESTING);
++
++ iter++;
++
++ if (sock_flag(meta_sk, SOCK_DEAD))
++ goto exit;
++
++ if (mpcb->master_sk &&
++ !tcp_sk(mpcb->master_sk)->mptcp->fully_established)
++ goto exit;
++
++ mptcp_for_each_bit_set(fmp->rem4_bits, i) {
++ struct fullmesh_rem4 *rem;
++ u8 remaining_bits;
++
++ rem = &fmp->remaddr4[i];
++ remaining_bits = ~(rem->bitfield) & mptcp_local->loc4_bits;
++
++ /* Are there still combinations to handle? */
++ if (remaining_bits) {
++ int i = mptcp_find_free_index(~remaining_bits);
++ struct mptcp_rem4 rem4;
++
++ rem->bitfield |= (1 << i);
++
++ rem4.addr = rem->addr;
++ rem4.port = rem->port;
++ rem4.rem4_id = rem->rem4_id;
++
++ /* If a route is not yet available then retry once */
++ if (mptcp_init4_subsockets(meta_sk, &mptcp_local->locaddr4[i],
++ &rem4) == -ENETUNREACH)
++ retry = rem->retry_bitfield |= (1 << i);
++ goto next_subflow;
++ }
++ }
++
++#if IS_ENABLED(CONFIG_IPV6)
++ mptcp_for_each_bit_set(fmp->rem6_bits, i) {
++ struct fullmesh_rem6 *rem;
++ u8 remaining_bits;
++
++ rem = &fmp->remaddr6[i];
++ remaining_bits = ~(rem->bitfield) & mptcp_local->loc6_bits;
++
++ /* Are there still combinations to handle? */
++ if (remaining_bits) {
++ int i = mptcp_find_free_index(~remaining_bits);
++ struct mptcp_rem6 rem6;
++
++ rem->bitfield |= (1 << i);
++
++ rem6.addr = rem->addr;
++ rem6.port = rem->port;
++ rem6.rem6_id = rem->rem6_id;
++
++ /* If a route is not yet available then retry once */
++ if (mptcp_init6_subsockets(meta_sk, &mptcp_local->locaddr6[i],
++ &rem6) == -ENETUNREACH)
++ retry = rem->retry_bitfield |= (1 << i);
++ goto next_subflow;
++ }
++ }
++#endif
++
++ if (retry && !delayed_work_pending(&fmp->subflow_retry_work)) {
++ sock_hold(meta_sk);
++ queue_delayed_work(mptcp_wq, &fmp->subflow_retry_work,
++ msecs_to_jiffies(MPTCP_SUBFLOW_RETRY_DELAY));
++ }
++
++exit:
++ kfree(mptcp_local);
++ release_sock(meta_sk);
++ mutex_unlock(&mpcb->mpcb_mutex);
++ sock_put(meta_sk);
++}
++
++static void announce_remove_addr(u8 addr_id, struct sock *meta_sk)
++{
++ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
++ struct sock *sk = mptcp_select_ack_sock(meta_sk);
++
++ fmp->remove_addrs |= (1 << addr_id);
++ mpcb->addr_signal = 1;
++
++ if (sk)
++ tcp_send_ack(sk);
++}
++
++static void update_addr_bitfields(struct sock *meta_sk,
++ const struct mptcp_loc_addr *mptcp_local)
++{
++ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
++ int i;
++
++ /* The bits in announced_addrs_* always match loc*_bits, so a simple
++ * & operation unsets the correct bits as they go from announced to
++ * non-announced.
++ */
++ fmp->announced_addrs_v4 &= mptcp_local->loc4_bits;
++
++ mptcp_for_each_bit_set(fmp->rem4_bits, i) {
++ fmp->remaddr4[i].bitfield &= mptcp_local->loc4_bits;
++ fmp->remaddr4[i].retry_bitfield &= mptcp_local->loc4_bits;
++ }
++
++ fmp->announced_addrs_v6 &= mptcp_local->loc6_bits;
++
++ mptcp_for_each_bit_set(fmp->rem6_bits, i) {
++ fmp->remaddr6[i].bitfield &= mptcp_local->loc6_bits;
++ fmp->remaddr6[i].retry_bitfield &= mptcp_local->loc6_bits;
++ }
++}
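The `&=` masking used above relies on the announced bitfields being a subset of the local-address bitfields: ANDing with the current `loc*_bits` clears exactly the bits of addresses that have gone away. A minimal illustration (the function name and values are invented for the example):

```c
#include <stdint.h>

/* Illustration of the masking in update_addr_bitfields(): local
 * addresses are tracked as bits, and ANDing an announced-bitfield with
 * the current local-address bitfield drops exactly the addresses that
 * have vanished while leaving the others untouched. */
uint8_t prune_announced(uint8_t announced, uint8_t loc_bits)
{
	return announced & loc_bits;
}
```

For example, if addresses 0, 1 and 3 were announced (`0x0b`) and address 1 has since been removed locally (`loc_bits == 0x09`), only bit 1 is cleared.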
++
++static int mptcp_find_address(const struct mptcp_loc_addr *mptcp_local,
++ sa_family_t family, const union inet_addr *addr)
++{
++ int i;
++ u8 loc_bits;
++ bool found = false;
++
++ if (family == AF_INET)
++ loc_bits = mptcp_local->loc4_bits;
++ else
++ loc_bits = mptcp_local->loc6_bits;
++
++ mptcp_for_each_bit_set(loc_bits, i) {
++ if (family == AF_INET &&
++ mptcp_local->locaddr4[i].addr.s_addr == addr->in.s_addr) {
++ found = true;
++ break;
++ }
++ if (family == AF_INET6 &&
++ ipv6_addr_equal(&mptcp_local->locaddr6[i].addr,
++ &addr->in6)) {
++ found = true;
++ break;
++ }
++ }
++
++ if (!found)
++ return -1;
++
++ return i;
++}
++
++static void mptcp_address_worker(struct work_struct *work)
++{
++ const struct delayed_work *delayed_work = container_of(work,
++ struct delayed_work,
++ work);
++ struct mptcp_fm_ns *fm_ns = container_of(delayed_work,
++ struct mptcp_fm_ns,
++ address_worker);
++ struct net *net = fm_ns->net;
++ struct mptcp_addr_event *event = NULL;
++ struct mptcp_loc_addr *mptcp_local, *old;
++ int i, id = -1; /* id is used in the socket-code on a delete-event */
++ bool success; /* Used to indicate if we succeeded handling the event */
++
++next_event:
++ success = false;
++ kfree(event);
++
++ /* First, let's dequeue an event from our event-list */
++ rcu_read_lock_bh();
++ spin_lock(&fm_ns->local_lock);
++
++ event = list_first_entry_or_null(&fm_ns->events,
++ struct mptcp_addr_event, list);
++ if (!event) {
++ spin_unlock(&fm_ns->local_lock);
++ rcu_read_unlock_bh();
++ return;
++ }
++
++ list_del(&event->list);
++
++ mptcp_local = rcu_dereference_bh(fm_ns->local);
++
++ if (event->code == MPTCP_EVENT_DEL) {
++ id = mptcp_find_address(mptcp_local, event->family, &event->addr);
++
++ /* Not in the list - so we don't care */
++ if (id < 0) {
++ mptcp_debug("%s could not find id\n", __func__);
++ goto duno;
++ }
++
++ old = mptcp_local;
++ mptcp_local = kmemdup(mptcp_local, sizeof(*mptcp_local),
++ GFP_ATOMIC);
++ if (!mptcp_local)
++ goto duno;
++
++ if (event->family == AF_INET)
++ mptcp_local->loc4_bits &= ~(1 << id);
++ else
++ mptcp_local->loc6_bits &= ~(1 << id);
++
++ rcu_assign_pointer(fm_ns->local, mptcp_local);
++ kfree(old);
++ } else {
++ int i = mptcp_find_address(mptcp_local, event->family, &event->addr);
++ int j = i;
++
++ if (j < 0) {
++ /* Not in the list, so we have to find an empty slot */
++ if (event->family == AF_INET)
++ i = __mptcp_find_free_index(mptcp_local->loc4_bits,
++ mptcp_local->next_v4_index);
++ if (event->family == AF_INET6)
++ i = __mptcp_find_free_index(mptcp_local->loc6_bits,
++ mptcp_local->next_v6_index);
++
++ if (i < 0) {
++ mptcp_debug("%s no more space\n", __func__);
++ goto duno;
++ }
++
++ /* It might have been a MOD-event. */
++ event->code = MPTCP_EVENT_ADD;
++ } else {
++ /* Let's check if anything changes */
++ if (event->family == AF_INET &&
++ event->low_prio == mptcp_local->locaddr4[i].low_prio)
++ goto duno;
++
++ if (event->family == AF_INET6 &&
++ event->low_prio == mptcp_local->locaddr6[i].low_prio)
++ goto duno;
++ }
++
++ old = mptcp_local;
++ mptcp_local = kmemdup(mptcp_local, sizeof(*mptcp_local),
++ GFP_ATOMIC);
++ if (!mptcp_local)
++ goto duno;
++
++ if (event->family == AF_INET) {
++ mptcp_local->locaddr4[i].addr.s_addr = event->addr.in.s_addr;
++ mptcp_local->locaddr4[i].loc4_id = i + 1;
++ mptcp_local->locaddr4[i].low_prio = event->low_prio;
++ } else {
++ mptcp_local->locaddr6[i].addr = event->addr.in6;
++ mptcp_local->locaddr6[i].loc6_id = i + MPTCP_MAX_ADDR;
++ mptcp_local->locaddr6[i].low_prio = event->low_prio;
++ }
++
++ if (j < 0) {
++ if (event->family == AF_INET) {
++ mptcp_local->loc4_bits |= (1 << i);
++ mptcp_local->next_v4_index = i + 1;
++ } else {
++ mptcp_local->loc6_bits |= (1 << i);
++ mptcp_local->next_v6_index = i + 1;
++ }
++ }
++
++ rcu_assign_pointer(fm_ns->local, mptcp_local);
++ kfree(old);
++ }
++ success = true;
++
++duno:
++ spin_unlock(&fm_ns->local_lock);
++ rcu_read_unlock_bh();
++
++ if (!success)
++ goto next_event;
++
++ /* Now we iterate over the MPTCP-sockets and apply the event. */
++ for (i = 0; i < MPTCP_HASH_SIZE; i++) {
++ const struct hlist_nulls_node *node;
++ struct tcp_sock *meta_tp;
++
++ rcu_read_lock_bh();
++ hlist_nulls_for_each_entry_rcu(meta_tp, node, &tk_hashtable[i],
++ tk_table) {
++ struct mptcp_cb *mpcb = meta_tp->mpcb;
++ struct sock *meta_sk = (struct sock *)meta_tp, *sk;
++ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
++ bool meta_v4 = meta_sk->sk_family == AF_INET;
++
++ if (sock_net(meta_sk) != net)
++ continue;
++
++ if (meta_v4) {
++ /* skip IPv6 events if meta is IPv4 */
++ if (event->family == AF_INET6)
++ continue;
++ }
++ /* skip IPv4 events if IPV6_V6ONLY is set */
++ else if (event->family == AF_INET &&
++ inet6_sk(meta_sk)->ipv6only)
++ continue;
++
++ if (unlikely(!atomic_inc_not_zero(&meta_sk->sk_refcnt)))
++ continue;
++
++ bh_lock_sock(meta_sk);
++
++ if (!mptcp(meta_tp) || !is_meta_sk(meta_sk) ||
++ mpcb->infinite_mapping_snd ||
++ mpcb->infinite_mapping_rcv ||
++ mpcb->send_infinite_mapping)
++ goto next;
++
++ /* The path manager may have changed in the meantime */
++ if (mpcb->pm_ops != &full_mesh)
++ goto next;
++
++ if (sock_owned_by_user(meta_sk)) {
++ if (!test_and_set_bit(MPTCP_PATH_MANAGER,
++ &meta_tp->tsq_flags))
++ sock_hold(meta_sk);
++
++ goto next;
++ }
++
++ if (event->code == MPTCP_EVENT_ADD) {
++ fmp->add_addr++;
++ mpcb->addr_signal = 1;
++
++ sk = mptcp_select_ack_sock(meta_sk);
++ if (sk)
++ tcp_send_ack(sk);
++
++ full_mesh_create_subflows(meta_sk);
++ }
++
++ if (event->code == MPTCP_EVENT_DEL) {
++ struct sock *sk, *tmpsk;
++ struct mptcp_loc_addr *mptcp_local;
++ bool found = false;
++
++ mptcp_local = rcu_dereference_bh(fm_ns->local);
++
++ /* In any case, we need to update our bitfields */
++ if (id >= 0)
++ update_addr_bitfields(meta_sk, mptcp_local);
++
++ /* Look for the socket and remove it */
++ mptcp_for_each_sk_safe(mpcb, sk, tmpsk) {
++ if ((event->family == AF_INET6 &&
++ (sk->sk_family == AF_INET ||
++ mptcp_v6_is_v4_mapped(sk))) ||
++ (event->family == AF_INET &&
++ (sk->sk_family == AF_INET6 &&
++ !mptcp_v6_is_v4_mapped(sk))))
++ continue;
++
++ if (event->family == AF_INET &&
++ (sk->sk_family == AF_INET ||
++ mptcp_v6_is_v4_mapped(sk)) &&
++ inet_sk(sk)->inet_saddr != event->addr.in.s_addr)
++ continue;
++
++ if (event->family == AF_INET6 &&
++ sk->sk_family == AF_INET6 &&
++ !ipv6_addr_equal(&inet6_sk(sk)->saddr, &event->addr.in6))
++ continue;
++
++ /* Reinject, so that pf = 1 and so we
++ * won't select this one as the
++ * ack-sock.
++ */
++ mptcp_reinject_data(sk, 0);
++
++ /* We announce the removal of this id */
++ announce_remove_addr(tcp_sk(sk)->mptcp->loc_id, meta_sk);
++
++ mptcp_sub_force_close(sk);
++ found = true;
++ }
++
++ if (found)
++ goto next;
++
++ /* The id may have been given by the event,
++ * matching on a local address. And it may not
++ * have matched on one of the above sockets,
++ * because the client never created a subflow.
++ * So, we have to finally remove it here.
++ */
++ if (id > 0)
++ announce_remove_addr(id, meta_sk);
++ }
++
++ if (event->code == MPTCP_EVENT_MOD) {
++ struct sock *sk;
++
++ mptcp_for_each_sk(mpcb, sk) {
++ struct tcp_sock *tp = tcp_sk(sk);
++ if (event->family == AF_INET &&
++ (sk->sk_family == AF_INET ||
++ mptcp_v6_is_v4_mapped(sk)) &&
++ inet_sk(sk)->inet_saddr == event->addr.in.s_addr) {
++ if (event->low_prio != tp->mptcp->low_prio) {
++ tp->mptcp->send_mp_prio = 1;
++ tp->mptcp->low_prio = event->low_prio;
++
++ tcp_send_ack(sk);
++ }
++ }
++
++ if (event->family == AF_INET6 &&
++ sk->sk_family == AF_INET6 &&
++ !ipv6_addr_equal(&inet6_sk(sk)->saddr, &event->addr.in6)) {
++ if (event->low_prio != tp->mptcp->low_prio) {
++ tp->mptcp->send_mp_prio = 1;
++ tp->mptcp->low_prio = event->low_prio;
++
++ tcp_send_ack(sk);
++ }
++ }
++ }
++ }
++next:
++ bh_unlock_sock(meta_sk);
++ sock_put(meta_sk);
++ }
++ rcu_read_unlock_bh();
++ }
++ goto next_event;
++}
++
++static struct mptcp_addr_event *lookup_similar_event(const struct net *net,
++ const struct mptcp_addr_event *event)
++{
++ struct mptcp_addr_event *eventq;
++ struct mptcp_fm_ns *fm_ns = fm_get_ns(net);
++
++ list_for_each_entry(eventq, &fm_ns->events, list) {
++ if (eventq->family != event->family)
++ continue;
++ if (event->family == AF_INET) {
++ if (eventq->addr.in.s_addr == event->addr.in.s_addr)
++ return eventq;
++ } else {
++ if (ipv6_addr_equal(&eventq->addr.in6, &event->addr.in6))
++ return eventq;
++ }
++ }
++ return NULL;
++}
++
++/* We already hold the net-namespace MPTCP-lock */
++static void add_pm_event(struct net *net, const struct mptcp_addr_event *event)
++{
++ struct mptcp_addr_event *eventq = lookup_similar_event(net, event);
++ struct mptcp_fm_ns *fm_ns = fm_get_ns(net);
++
++ if (eventq) {
++ switch (event->code) {
++ case MPTCP_EVENT_DEL:
++ mptcp_debug("%s del old_code %u\n", __func__, eventq->code);
++ list_del(&eventq->list);
++ kfree(eventq);
++ break;
++ case MPTCP_EVENT_ADD:
++ mptcp_debug("%s add old_code %u\n", __func__, eventq->code);
++ eventq->low_prio = event->low_prio;
++ eventq->code = MPTCP_EVENT_ADD;
++ return;
++ case MPTCP_EVENT_MOD:
++ mptcp_debug("%s mod old_code %u\n", __func__, eventq->code);
++ eventq->low_prio = event->low_prio;
++ eventq->code = MPTCP_EVENT_MOD;
++ return;
++ }
++ }
++
++ /* OK, we have to add the new address to the wait queue */
++ eventq = kmemdup(event, sizeof(struct mptcp_addr_event), GFP_ATOMIC);
++ if (!eventq)
++ return;
++
++ list_add_tail(&eventq->list, &fm_ns->events);
++
++ /* Schedule the address worker if it is not already pending */
++ if (!delayed_work_pending(&fm_ns->address_worker))
++ queue_delayed_work(mptcp_wq, &fm_ns->address_worker,
++ msecs_to_jiffies(500));
++}
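`add_pm_event()` coalesces events: at most one pending event exists per address, and a newer event for the same address overwrites or cancels the queued one instead of piling up. A toy userspace model of that behavior (the fixed-size array and all names are inventions for this sketch; the ordering subtlety of the real list, where a DEL is re-queued at the tail, is ignored):

```c
#include <stdint.h>

/* Toy model of the coalescing in add_pm_event(): one slot per address,
 * and posting a new event for an already-queued address simply replaces
 * the stored event code. */
enum ev_code { EV_NONE, EV_ADD, EV_DEL, EV_MOD };

struct ev {
	uint32_t addr;
	enum ev_code code;
};

#define QLEN 8
static struct ev evq[QLEN];	/* zero-initialized: all slots EV_NONE */

/* Return the queued event code for an address, or EV_NONE */
enum ev_code queued_code(uint32_t addr)
{
	for (int i = 0; i < QLEN; i++)
		if (evq[i].code != EV_NONE && evq[i].addr == addr)
			return evq[i].code;
	return EV_NONE;
}

void post_event(uint32_t addr, enum ev_code code)
{
	/* Coalesce with an existing event for the same address */
	for (int i = 0; i < QLEN; i++) {
		if (evq[i].code != EV_NONE && evq[i].addr == addr) {
			evq[i].code = code;
			return;
		}
	}
	/* Otherwise take the first free slot (drop the event if full) */
	for (int i = 0; i < QLEN; i++) {
		if (evq[i].code == EV_NONE) {
			evq[i].addr = addr;
			evq[i].code = code;
			return;
		}
	}
}
```

Posting ADD then DEL for the same address leaves a single queued DEL, which is the property the kernel code needs so the address worker never replays stale state.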
++
++static void addr4_event_handler(const struct in_ifaddr *ifa, unsigned long event,
++ struct net *net)
++{
++ const struct net_device *netdev = ifa->ifa_dev->dev;
++ struct mptcp_fm_ns *fm_ns = fm_get_ns(net);
++ struct mptcp_addr_event mpevent;
++
++ if (ifa->ifa_scope > RT_SCOPE_LINK ||
++ ipv4_is_loopback(ifa->ifa_local))
++ return;
++
++ spin_lock_bh(&fm_ns->local_lock);
++
++ mpevent.family = AF_INET;
++ mpevent.addr.in.s_addr = ifa->ifa_local;
++ mpevent.low_prio = (netdev->flags & IFF_MPBACKUP) ? 1 : 0;
++
++ if (event == NETDEV_DOWN || !netif_running(netdev) ||
++ (netdev->flags & IFF_NOMULTIPATH) || !(netdev->flags & IFF_UP))
++ mpevent.code = MPTCP_EVENT_DEL;
++ else if (event == NETDEV_UP)
++ mpevent.code = MPTCP_EVENT_ADD;
++ else if (event == NETDEV_CHANGE)
++ mpevent.code = MPTCP_EVENT_MOD;
++
++ mptcp_debug("%s created event for %pI4, code %u prio %u\n", __func__,
++ &ifa->ifa_local, mpevent.code, mpevent.low_prio);
++ add_pm_event(net, &mpevent);
++
++ spin_unlock_bh(&fm_ns->local_lock);
++ return;
++}
++
++/* React on IPv4-addr add/rem-events */
++static int mptcp_pm_inetaddr_event(struct notifier_block *this,
++ unsigned long event, void *ptr)
++{
++ const struct in_ifaddr *ifa = (struct in_ifaddr *)ptr;
++ struct net *net = dev_net(ifa->ifa_dev->dev);
++
++ if (!(event == NETDEV_UP || event == NETDEV_DOWN ||
++ event == NETDEV_CHANGE))
++ return NOTIFY_DONE;
++
++ addr4_event_handler(ifa, event, net);
++
++ return NOTIFY_DONE;
++}
++
++static struct notifier_block mptcp_pm_inetaddr_notifier = {
++ .notifier_call = mptcp_pm_inetaddr_event,
++};
++
++#if IS_ENABLED(CONFIG_IPV6)
++
++/* IPV6-related address/interface watchers */
++struct mptcp_dad_data {
++ struct timer_list timer;
++ struct inet6_ifaddr *ifa;
++};
++
++static void dad_callback(unsigned long arg);
++static int inet6_addr_event(struct notifier_block *this,
++ unsigned long event, void *ptr);
++
++static int ipv6_is_in_dad_state(const struct inet6_ifaddr *ifa)
++{
++ return (ifa->flags & IFA_F_TENTATIVE) &&
++ ifa->state == INET6_IFADDR_STATE_DAD;
++}
++
++static void dad_init_timer(struct mptcp_dad_data *data,
++ struct inet6_ifaddr *ifa)
++{
++ data->ifa = ifa;
++ data->timer.data = (unsigned long)data;
++ data->timer.function = dad_callback;
++ if (ifa->idev->cnf.rtr_solicit_delay)
++ data->timer.expires = jiffies + ifa->idev->cnf.rtr_solicit_delay;
++ else
++ data->timer.expires = jiffies + (HZ/10);
++}
++
++static void dad_callback(unsigned long arg)
++{
++ struct mptcp_dad_data *data = (struct mptcp_dad_data *)arg;
++
++ if (ipv6_is_in_dad_state(data->ifa)) {
++ dad_init_timer(data, data->ifa);
++ add_timer(&data->timer);
++ } else {
++ inet6_addr_event(NULL, NETDEV_UP, data->ifa);
++ in6_ifa_put(data->ifa);
++ kfree(data);
++ }
++}
++
++static inline void dad_setup_timer(struct inet6_ifaddr *ifa)
++{
++ struct mptcp_dad_data *data;
++
++ data = kmalloc(sizeof(*data), GFP_ATOMIC);
++
++ if (!data)
++ return;
++
++ init_timer(&data->timer);
++ dad_init_timer(data, ifa);
++ add_timer(&data->timer);
++ in6_ifa_hold(ifa);
++}
++
++static void addr6_event_handler(const struct inet6_ifaddr *ifa, unsigned long event,
++ struct net *net)
++{
++ const struct net_device *netdev = ifa->idev->dev;
++ int addr_type = ipv6_addr_type(&ifa->addr);
++ struct mptcp_fm_ns *fm_ns = fm_get_ns(net);
++ struct mptcp_addr_event mpevent;
++
++ if (ifa->scope > RT_SCOPE_LINK ||
++ addr_type == IPV6_ADDR_ANY ||
++ (addr_type & IPV6_ADDR_LOOPBACK) ||
++ (addr_type & IPV6_ADDR_LINKLOCAL))
++ return;
++
++ spin_lock_bh(&fm_ns->local_lock);
++
++ mpevent.family = AF_INET6;
++ mpevent.addr.in6 = ifa->addr;
++ mpevent.low_prio = (netdev->flags & IFF_MPBACKUP) ? 1 : 0;
++
++ if (event == NETDEV_DOWN || !netif_running(netdev) ||
++ (netdev->flags & IFF_NOMULTIPATH) || !(netdev->flags & IFF_UP))
++ mpevent.code = MPTCP_EVENT_DEL;
++ else if (event == NETDEV_UP)
++ mpevent.code = MPTCP_EVENT_ADD;
++ else if (event == NETDEV_CHANGE)
++ mpevent.code = MPTCP_EVENT_MOD;
++
++ mptcp_debug("%s created event for %pI6, code %u prio %u\n", __func__,
++ &ifa->addr, mpevent.code, mpevent.low_prio);
++ add_pm_event(net, &mpevent);
++
++ spin_unlock_bh(&fm_ns->local_lock);
++ return;
++}
++
++/* React on IPv6-addr add/rem-events */
++static int inet6_addr_event(struct notifier_block *this, unsigned long event,
++ void *ptr)
++{
++ struct inet6_ifaddr *ifa6 = (struct inet6_ifaddr *)ptr;
++ struct net *net = dev_net(ifa6->idev->dev);
++
++ if (!(event == NETDEV_UP || event == NETDEV_DOWN ||
++ event == NETDEV_CHANGE))
++ return NOTIFY_DONE;
++
++ if (ipv6_is_in_dad_state(ifa6))
++ dad_setup_timer(ifa6);
++ else
++ addr6_event_handler(ifa6, event, net);
++
++ return NOTIFY_DONE;
++}
++
++static struct notifier_block inet6_addr_notifier = {
++ .notifier_call = inet6_addr_event,
++};
++
++#endif
++
++/* React on ifup/down-events */
++static int netdev_event(struct notifier_block *this, unsigned long event,
++ void *ptr)
++{
++ const struct net_device *dev = netdev_notifier_info_to_dev(ptr);
++ struct in_device *in_dev;
++#if IS_ENABLED(CONFIG_IPV6)
++ struct inet6_dev *in6_dev;
++#endif
++
++ if (!(event == NETDEV_UP || event == NETDEV_DOWN ||
++ event == NETDEV_CHANGE))
++ return NOTIFY_DONE;
++
++ rcu_read_lock();
++ in_dev = __in_dev_get_rtnl(dev);
++
++ if (in_dev) {
++ for_ifa(in_dev) {
++ mptcp_pm_inetaddr_event(NULL, event, ifa);
++ } endfor_ifa(in_dev);
++ }
++
++#if IS_ENABLED(CONFIG_IPV6)
++ in6_dev = __in6_dev_get(dev);
++
++ if (in6_dev) {
++ struct inet6_ifaddr *ifa6;
++ list_for_each_entry(ifa6, &in6_dev->addr_list, if_list)
++ inet6_addr_event(NULL, event, ifa6);
++ }
++#endif
++
++ rcu_read_unlock();
++ return NOTIFY_DONE;
++}
++
++static struct notifier_block mptcp_pm_netdev_notifier = {
++ .notifier_call = netdev_event,
++};
++
++static void full_mesh_add_raddr(struct mptcp_cb *mpcb,
++ const union inet_addr *addr,
++ sa_family_t family, __be16 port, u8 id)
++{
++ if (family == AF_INET)
++ mptcp_addv4_raddr(mpcb, &addr->in, port, id);
++ else
++ mptcp_addv6_raddr(mpcb, &addr->in6, port, id);
++}
++
++static void full_mesh_new_session(const struct sock *meta_sk)
++{
++ struct mptcp_loc_addr *mptcp_local;
++ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
++ const struct mptcp_fm_ns *fm_ns = fm_get_ns(sock_net(meta_sk));
++ int i, index;
++ union inet_addr saddr, daddr;
++ sa_family_t family;
++ bool meta_v4 = meta_sk->sk_family == AF_INET;
++
++ /* Init local variables necessary for the rest */
++ if (meta_sk->sk_family == AF_INET || mptcp_v6_is_v4_mapped(meta_sk)) {
++ saddr.ip = inet_sk(meta_sk)->inet_saddr;
++ daddr.ip = inet_sk(meta_sk)->inet_daddr;
++ family = AF_INET;
++#if IS_ENABLED(CONFIG_IPV6)
++ } else {
++ saddr.in6 = inet6_sk(meta_sk)->saddr;
++ daddr.in6 = meta_sk->sk_v6_daddr;
++ family = AF_INET6;
++#endif
++ }
++
++ rcu_read_lock();
++ mptcp_local = rcu_dereference(fm_ns->local);
++
++ index = mptcp_find_address(mptcp_local, family, &saddr);
++ if (index < 0)
++ goto fallback;
++
++ full_mesh_add_raddr(mpcb, &daddr, family, 0, 0);
++ mptcp_set_init_addr_bit(mpcb, &daddr, family, index);
++
++ /* Initialize workqueue-struct */
++ INIT_WORK(&fmp->subflow_work, create_subflow_worker);
++ INIT_DELAYED_WORK(&fmp->subflow_retry_work, retry_subflow_worker);
++ fmp->mpcb = mpcb;
++
++ if (!meta_v4 && inet6_sk(meta_sk)->ipv6only)
++ goto skip_ipv4;
++
++ /* Look for the address among the local addresses */
++ mptcp_for_each_bit_set(mptcp_local->loc4_bits, i) {
++ __be32 ifa_address = mptcp_local->locaddr4[i].addr.s_addr;
++
++ /* We do not need to announce the initial subflow's address again */
++ if (family == AF_INET && saddr.ip == ifa_address)
++ continue;
++
++ fmp->add_addr++;
++ mpcb->addr_signal = 1;
++ }
++
++skip_ipv4:
++#if IS_ENABLED(CONFIG_IPV6)
++ /* skip IPv6 addresses if meta-socket is IPv4 */
++ if (meta_v4)
++ goto skip_ipv6;
++
++ mptcp_for_each_bit_set(mptcp_local->loc6_bits, i) {
++ const struct in6_addr *ifa6 = &mptcp_local->locaddr6[i].addr;
++
++ /* We do not need to announce the initial subflow's address again */
++ if (family == AF_INET6 && ipv6_addr_equal(&saddr.in6, ifa6))
++ continue;
++
++ fmp->add_addr++;
++ mpcb->addr_signal = 1;
++ }
++
++skip_ipv6:
++#endif
++
++ rcu_read_unlock();
++
++ if (family == AF_INET)
++ fmp->announced_addrs_v4 |= (1 << index);
++ else
++ fmp->announced_addrs_v6 |= (1 << index);
++
++ for (i = fmp->add_addr; i && fmp->add_addr; i--)
++ tcp_send_ack(mpcb->master_sk);
++
++ return;
++
++fallback:
++ rcu_read_unlock();
++ mptcp_fallback_default(mpcb);
++ return;
++}
++
++static void full_mesh_create_subflows(struct sock *meta_sk)
++{
++ const struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
++
++ if (mpcb->infinite_mapping_snd || mpcb->infinite_mapping_rcv ||
++ mpcb->send_infinite_mapping ||
++ mpcb->server_side || sock_flag(meta_sk, SOCK_DEAD))
++ return;
++
++ if (mpcb->master_sk &&
++ !tcp_sk(mpcb->master_sk)->mptcp->fully_established)
++ return;
++
++ if (!work_pending(&fmp->subflow_work)) {
++ sock_hold(meta_sk);
++ queue_work(mptcp_wq, &fmp->subflow_work);
++ }
++}
++
++/* Called upon release_sock, if the socket was owned by the user during
++ * a path-management event.
++ */
++static void full_mesh_release_sock(struct sock *meta_sk)
++{
++ struct mptcp_loc_addr *mptcp_local;
++ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
++ const struct mptcp_fm_ns *fm_ns = fm_get_ns(sock_net(meta_sk));
++ struct sock *sk, *tmpsk;
++ bool meta_v4 = meta_sk->sk_family == AF_INET;
++ int i;
++
++ rcu_read_lock();
++ mptcp_local = rcu_dereference(fm_ns->local);
++
++ if (!meta_v4 && inet6_sk(meta_sk)->ipv6only)
++ goto skip_ipv4;
++
++ /* First, detect modifications or additions */
++ mptcp_for_each_bit_set(mptcp_local->loc4_bits, i) {
++ struct in_addr ifa = mptcp_local->locaddr4[i].addr;
++ bool found = false;
++
++ mptcp_for_each_sk(mpcb, sk) {
++ struct tcp_sock *tp = tcp_sk(sk);
++
++ if (sk->sk_family == AF_INET6 &&
++ !mptcp_v6_is_v4_mapped(sk))
++ continue;
++
++ if (inet_sk(sk)->inet_saddr != ifa.s_addr)
++ continue;
++
++ found = true;
++
++ if (mptcp_local->locaddr4[i].low_prio != tp->mptcp->low_prio) {
++ tp->mptcp->send_mp_prio = 1;
++ tp->mptcp->low_prio = mptcp_local->locaddr4[i].low_prio;
++
++ tcp_send_ack(sk);
++ }
++ }
++
++ if (!found) {
++ fmp->add_addr++;
++ mpcb->addr_signal = 1;
++
++ sk = mptcp_select_ack_sock(meta_sk);
++ if (sk)
++ tcp_send_ack(sk);
++ full_mesh_create_subflows(meta_sk);
++ }
++ }
++
++skip_ipv4:
++#if IS_ENABLED(CONFIG_IPV6)
++ /* skip IPv6 addresses if meta-socket is IPv4 */
++ if (meta_v4)
++ goto removal;
++
++ mptcp_for_each_bit_set(mptcp_local->loc6_bits, i) {
++ struct in6_addr ifa = mptcp_local->locaddr6[i].addr;
++ bool found = false;
++
++ mptcp_for_each_sk(mpcb, sk) {
++ struct tcp_sock *tp = tcp_sk(sk);
++
++ if (sk->sk_family == AF_INET ||
++ mptcp_v6_is_v4_mapped(sk))
++ continue;
++
++ if (!ipv6_addr_equal(&inet6_sk(sk)->saddr, &ifa))
++ continue;
++
++ found = true;
++
++ if (mptcp_local->locaddr6[i].low_prio != tp->mptcp->low_prio) {
++ tp->mptcp->send_mp_prio = 1;
++ tp->mptcp->low_prio = mptcp_local->locaddr6[i].low_prio;
++
++ tcp_send_ack(sk);
++ }
++ }
++
++ if (!found) {
++ fmp->add_addr++;
++ mpcb->addr_signal = 1;
++
++ sk = mptcp_select_ack_sock(meta_sk);
++ if (sk)
++ tcp_send_ack(sk);
++ full_mesh_create_subflows(meta_sk);
++ }
++ }
++
++removal:
++#endif
++
++ /* Now, detect address-removals */
++ mptcp_for_each_sk_safe(mpcb, sk, tmpsk) {
++ bool shall_remove = true;
++
++ if (sk->sk_family == AF_INET || mptcp_v6_is_v4_mapped(sk)) {
++ mptcp_for_each_bit_set(mptcp_local->loc4_bits, i) {
++ if (inet_sk(sk)->inet_saddr == mptcp_local->locaddr4[i].addr.s_addr) {
++ shall_remove = false;
++ break;
++ }
++ }
++ } else {
++ mptcp_for_each_bit_set(mptcp_local->loc6_bits, i) {
++ if (ipv6_addr_equal(&inet6_sk(sk)->saddr, &mptcp_local->locaddr6[i].addr)) {
++ shall_remove = false;
++ break;
++ }
++ }
++ }
++
++ if (shall_remove) {
++ /* Reinject, so that pf = 1 and so we
++ * won't select this one as the
++ * ack-sock.
++ */
++ mptcp_reinject_data(sk, 0);
++
++ announce_remove_addr(tcp_sk(sk)->mptcp->loc_id,
++ meta_sk);
++
++ mptcp_sub_force_close(sk);
++ }
++ }
++
++ /* Just call it optimistically. It actually cannot do any harm */
++ update_addr_bitfields(meta_sk, mptcp_local);
++
++ rcu_read_unlock();
++}
++
++static int full_mesh_get_local_id(sa_family_t family, union inet_addr *addr,
++ struct net *net, bool *low_prio)
++{
++ struct mptcp_loc_addr *mptcp_local;
++ const struct mptcp_fm_ns *fm_ns = fm_get_ns(net);
++ int index, id = -1;
++
++ /* Handle the backup-flows */
++ rcu_read_lock();
++ mptcp_local = rcu_dereference(fm_ns->local);
++
++ index = mptcp_find_address(mptcp_local, family, addr);
++
++ if (index != -1) {
++ if (family == AF_INET) {
++ id = mptcp_local->locaddr4[index].loc4_id;
++ *low_prio = mptcp_local->locaddr4[index].low_prio;
++ } else {
++ id = mptcp_local->locaddr6[index].loc6_id;
++ *low_prio = mptcp_local->locaddr6[index].low_prio;
++ }
++ }
++
++
++ rcu_read_unlock();
++
++ return id;
++}
++
++static void full_mesh_addr_signal(struct sock *sk, unsigned *size,
++ struct tcp_out_options *opts,
++ struct sk_buff *skb)
++{
++ const struct tcp_sock *tp = tcp_sk(sk);
++ struct mptcp_cb *mpcb = tp->mpcb;
++ struct sock *meta_sk = mpcb->meta_sk;
++ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
++ struct mptcp_loc_addr *mptcp_local;
++ struct mptcp_fm_ns *fm_ns = fm_get_ns(sock_net(sk));
++ int remove_addr_len;
++ u8 unannouncedv4 = 0, unannouncedv6 = 0;
++ bool meta_v4 = meta_sk->sk_family == AF_INET;
++
++ mpcb->addr_signal = 0;
++
++ if (likely(!fmp->add_addr))
++ goto remove_addr;
++
++ rcu_read_lock();
++ mptcp_local = rcu_dereference(fm_ns->local);
++
++ if (!meta_v4 && inet6_sk(meta_sk)->ipv6only)
++ goto skip_ipv4;
++
++ /* IPv4 */
++ unannouncedv4 = (~fmp->announced_addrs_v4) & mptcp_local->loc4_bits;
++ if (unannouncedv4 &&
++ MAX_TCP_OPTION_SPACE - *size >= MPTCP_SUB_LEN_ADD_ADDR4_ALIGN) {
++ int ind = mptcp_find_free_index(~unannouncedv4);
++
++ opts->options |= OPTION_MPTCP;
++ opts->mptcp_options |= OPTION_ADD_ADDR;
++ opts->add_addr4.addr_id = mptcp_local->locaddr4[ind].loc4_id;
++ opts->add_addr4.addr = mptcp_local->locaddr4[ind].addr;
++ opts->add_addr_v4 = 1;
++
++ if (skb) {
++ fmp->announced_addrs_v4 |= (1 << ind);
++ fmp->add_addr--;
++ }
++ *size += MPTCP_SUB_LEN_ADD_ADDR4_ALIGN;
++ }
++
++ if (meta_v4)
++ goto skip_ipv6;
++
++skip_ipv4:
++ /* IPv6 */
++ unannouncedv6 = (~fmp->announced_addrs_v6) & mptcp_local->loc6_bits;
++ if (unannouncedv6 &&
++ MAX_TCP_OPTION_SPACE - *size >= MPTCP_SUB_LEN_ADD_ADDR6_ALIGN) {
++ int ind = mptcp_find_free_index(~unannouncedv6);
++
++ opts->options |= OPTION_MPTCP;
++ opts->mptcp_options |= OPTION_ADD_ADDR;
++ opts->add_addr6.addr_id = mptcp_local->locaddr6[ind].loc6_id;
++ opts->add_addr6.addr = mptcp_local->locaddr6[ind].addr;
++ opts->add_addr_v6 = 1;
++
++ if (skb) {
++ fmp->announced_addrs_v6 |= (1 << ind);
++ fmp->add_addr--;
++ }
++ *size += MPTCP_SUB_LEN_ADD_ADDR6_ALIGN;
++ }
++
++skip_ipv6:
++ rcu_read_unlock();
++
++ if (!unannouncedv4 && !unannouncedv6 && skb)
++ fmp->add_addr--;
++
++remove_addr:
++ if (likely(!fmp->remove_addrs))
++ goto exit;
++
++ remove_addr_len = mptcp_sub_len_remove_addr_align(fmp->remove_addrs);
++ if (MAX_TCP_OPTION_SPACE - *size < remove_addr_len)
++ goto exit;
++
++ opts->options |= OPTION_MPTCP;
++ opts->mptcp_options |= OPTION_REMOVE_ADDR;
++ opts->remove_addrs = fmp->remove_addrs;
++ *size += remove_addr_len;
++ if (skb)
++ fmp->remove_addrs = 0;
++
++exit:
++ mpcb->addr_signal = !!(fmp->add_addr || fmp->remove_addrs);
++}
++
++static void full_mesh_rem_raddr(struct mptcp_cb *mpcb, u8 rem_id)
++{
++ mptcp_v4_rem_raddress(mpcb, rem_id);
++ mptcp_v6_rem_raddress(mpcb, rem_id);
++}
++
++/* Output /proc/net/mptcp_fullmesh */
++static int mptcp_fm_seq_show(struct seq_file *seq, void *v)
++{
++ const struct net *net = seq->private;
++ struct mptcp_loc_addr *mptcp_local;
++ const struct mptcp_fm_ns *fm_ns = fm_get_ns(net);
++ int i;
++
++ seq_printf(seq, "Index, Address-ID, Backup, IP-address\n");
++
++ rcu_read_lock_bh();
++ mptcp_local = rcu_dereference(fm_ns->local);
++
++ seq_printf(seq, "IPv4, next v4-index: %u\n", mptcp_local->next_v4_index);
++
++ mptcp_for_each_bit_set(mptcp_local->loc4_bits, i) {
++ struct mptcp_loc4 *loc4 = &mptcp_local->locaddr4[i];
++
++ seq_printf(seq, "%u, %u, %u, %pI4\n", i, loc4->loc4_id,
++ loc4->low_prio, &loc4->addr);
++ }
++
++ seq_printf(seq, "IPv6, next v6-index: %u\n", mptcp_local->next_v6_index);
++
++ mptcp_for_each_bit_set(mptcp_local->loc6_bits, i) {
++ struct mptcp_loc6 *loc6 = &mptcp_local->locaddr6[i];
++
++ seq_printf(seq, "%u, %u, %u, %pI6\n", i, loc6->loc6_id,
++ loc6->low_prio, &loc6->addr);
++ }
++ rcu_read_unlock_bh();
++
++ return 0;
++}
++
++static int mptcp_fm_seq_open(struct inode *inode, struct file *file)
++{
++ return single_open_net(inode, file, mptcp_fm_seq_show);
++}
++
++static const struct file_operations mptcp_fm_seq_fops = {
++ .owner = THIS_MODULE,
++ .open = mptcp_fm_seq_open,
++ .read = seq_read,
++ .llseek = seq_lseek,
++ .release = single_release_net,
++};
++
++static int mptcp_fm_init_net(struct net *net)
++{
++ struct mptcp_loc_addr *mptcp_local;
++ struct mptcp_fm_ns *fm_ns;
++ int err = 0;
++
++ fm_ns = kzalloc(sizeof(*fm_ns), GFP_KERNEL);
++ if (!fm_ns)
++ return -ENOBUFS;
++
++ mptcp_local = kzalloc(sizeof(*mptcp_local), GFP_KERNEL);
++ if (!mptcp_local) {
++ err = -ENOBUFS;
++ goto err_mptcp_local;
++ }
++
++ if (!proc_create("mptcp_fullmesh", S_IRUGO, net->proc_net,
++ &mptcp_fm_seq_fops)) {
++ err = -ENOMEM;
++ goto err_seq_fops;
++ }
++
++ mptcp_local->next_v4_index = 1;
++
++ rcu_assign_pointer(fm_ns->local, mptcp_local);
++ INIT_DELAYED_WORK(&fm_ns->address_worker, mptcp_address_worker);
++ INIT_LIST_HEAD(&fm_ns->events);
++ spin_lock_init(&fm_ns->local_lock);
++ fm_ns->net = net;
++ net->mptcp.path_managers[MPTCP_PM_FULLMESH] = fm_ns;
++
++ return 0;
++err_seq_fops:
++ kfree(mptcp_local);
++err_mptcp_local:
++ kfree(fm_ns);
++ return err;
++}
++
++static void mptcp_fm_exit_net(struct net *net)
++{
++ struct mptcp_addr_event *eventq, *tmp;
++ struct mptcp_fm_ns *fm_ns;
++ struct mptcp_loc_addr *mptcp_local;
++
++ fm_ns = fm_get_ns(net);
++ cancel_delayed_work_sync(&fm_ns->address_worker);
++
++ rcu_read_lock_bh();
++
++ mptcp_local = rcu_dereference_bh(fm_ns->local);
++ kfree(mptcp_local);
++
++ spin_lock(&fm_ns->local_lock);
++ list_for_each_entry_safe(eventq, tmp, &fm_ns->events, list) {
++ list_del(&eventq->list);
++ kfree(eventq);
++ }
++ spin_unlock(&fm_ns->local_lock);
++
++ rcu_read_unlock_bh();
++
++ remove_proc_entry("mptcp_fullmesh", net->proc_net);
++
++ kfree(fm_ns);
++}
++
++static struct pernet_operations full_mesh_net_ops = {
++ .init = mptcp_fm_init_net,
++ .exit = mptcp_fm_exit_net,
++};
++
++static struct mptcp_pm_ops full_mesh __read_mostly = {
++ .new_session = full_mesh_new_session,
++ .release_sock = full_mesh_release_sock,
++ .fully_established = full_mesh_create_subflows,
++ .new_remote_address = full_mesh_create_subflows,
++ .get_local_id = full_mesh_get_local_id,
++ .addr_signal = full_mesh_addr_signal,
++ .add_raddr = full_mesh_add_raddr,
++ .rem_raddr = full_mesh_rem_raddr,
++ .name = "fullmesh",
++ .owner = THIS_MODULE,
++};
++
++/* General initialization of MPTCP_PM */
++static int __init full_mesh_register(void)
++{
++ int ret;
++
++ BUILD_BUG_ON(sizeof(struct fullmesh_priv) > MPTCP_PM_SIZE);
++
++ ret = register_pernet_subsys(&full_mesh_net_ops);
++ if (ret)
++ goto out;
++
++ ret = register_inetaddr_notifier(&mptcp_pm_inetaddr_notifier);
++ if (ret)
++ goto err_reg_inetaddr;
++ ret = register_netdevice_notifier(&mptcp_pm_netdev_notifier);
++ if (ret)
++ goto err_reg_netdev;
++
++#if IS_ENABLED(CONFIG_IPV6)
++ ret = register_inet6addr_notifier(&inet6_addr_notifier);
++ if (ret)
++ goto err_reg_inet6addr;
++#endif
++
++ ret = mptcp_register_path_manager(&full_mesh);
++ if (ret)
++ goto err_reg_pm;
++
++out:
++ return ret;
++
++
++err_reg_pm:
++#if IS_ENABLED(CONFIG_IPV6)
++ unregister_inet6addr_notifier(&inet6_addr_notifier);
++err_reg_inet6addr:
++#endif
++ unregister_netdevice_notifier(&mptcp_pm_netdev_notifier);
++err_reg_netdev:
++ unregister_inetaddr_notifier(&mptcp_pm_inetaddr_notifier);
++err_reg_inetaddr:
++ unregister_pernet_subsys(&full_mesh_net_ops);
++ goto out;
++}
++
++static void full_mesh_unregister(void)
++{
++#if IS_ENABLED(CONFIG_IPV6)
++ unregister_inet6addr_notifier(&inet6_addr_notifier);
++#endif
++ unregister_netdevice_notifier(&mptcp_pm_netdev_notifier);
++ unregister_inetaddr_notifier(&mptcp_pm_inetaddr_notifier);
++ unregister_pernet_subsys(&full_mesh_net_ops);
++ mptcp_unregister_path_manager(&full_mesh);
++}
++
++module_init(full_mesh_register);
++module_exit(full_mesh_unregister);
++
++MODULE_AUTHOR("Christoph Paasch");
++MODULE_LICENSE("GPL");
++MODULE_DESCRIPTION("Full-Mesh MPTCP");
++MODULE_VERSION("0.88");
+diff --git a/net/mptcp/mptcp_input.c b/net/mptcp/mptcp_input.c
+new file mode 100644
+index 000000000000..43704ccb639e
+--- /dev/null
++++ b/net/mptcp/mptcp_input.c
+@@ -0,0 +1,2405 @@
++/*
++ * MPTCP implementation - Sending side
++ *
++ * Initial Design & Implementation:
++ * Sébastien Barré <sebastien.barre@uclouvain.be>
++ *
++ * Current Maintainer & Author:
++ * Christoph Paasch <christoph.paasch@uclouvain.be>
++ *
++ * Additional authors:
++ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
++ * Gregory Detal <gregory.detal@uclouvain.be>
++ * Fabien Duchêne <fabien.duchene@uclouvain.be>
++ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
++ * Lavkesh Lahngir <lavkesh51@gmail.com>
++ * Andreas Ripke <ripke@neclab.eu>
++ * Vlad Dogaru <vlad.dogaru@intel.com>
++ * Octavian Purdila <octavian.purdila@intel.com>
++ * John Ronan <jronan@tssg.org>
++ * Catalin Nicutar <catalin.nicutar@gmail.com>
++ * Brandon Heller <brandonh@stanford.edu>
++ *
++ *
++ * This program is free software; you can redistribute it and/or
++ * modify it under the terms of the GNU General Public License
++ * as published by the Free Software Foundation; either version
++ * 2 of the License, or (at your option) any later version.
++ */
++
++#include <asm/unaligned.h>
++
++#include <net/mptcp.h>
++#include <net/mptcp_v4.h>
++#include <net/mptcp_v6.h>
++
++#include <linux/kconfig.h>
++
++/* is seq1 < seq2 ? */
++static inline bool before64(const u64 seq1, const u64 seq2)
++{
++ return (s64)(seq1 - seq2) < 0;
++}
++
++/* is seq1 > seq2 ? */
++#define after64(seq1, seq2) before64(seq2, seq1)
++
++static inline void mptcp_become_fully_estab(struct sock *sk)
++{
++ tcp_sk(sk)->mptcp->fully_established = 1;
++
++ if (is_master_tp(tcp_sk(sk)) &&
++ tcp_sk(sk)->mpcb->pm_ops->fully_established)
++ tcp_sk(sk)->mpcb->pm_ops->fully_established(mptcp_meta_sk(sk));
++}
++
++/* Similar to tcp_tso_acked without any memory accounting */
++static inline int mptcp_tso_acked_reinject(const struct sock *meta_sk,
++ struct sk_buff *skb)
++{
++ const struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ u32 packets_acked, len;
++
++ BUG_ON(!after(TCP_SKB_CB(skb)->end_seq, meta_tp->snd_una));
++
++ packets_acked = tcp_skb_pcount(skb);
++
++ if (skb_unclone(skb, GFP_ATOMIC))
++ return 0;
++
++ len = meta_tp->snd_una - TCP_SKB_CB(skb)->seq;
++ __pskb_trim_head(skb, len);
++
++ TCP_SKB_CB(skb)->seq += len;
++ skb->ip_summed = CHECKSUM_PARTIAL;
++ skb->truesize -= len;
++
++ /* Any change of skb->len requires recalculation of tso factor. */
++ if (tcp_skb_pcount(skb) > 1)
++ tcp_set_skb_tso_segs(meta_sk, skb, tcp_skb_mss(skb));
++ packets_acked -= tcp_skb_pcount(skb);
++
++ if (packets_acked) {
++ BUG_ON(tcp_skb_pcount(skb) == 0);
++ BUG_ON(!before(TCP_SKB_CB(skb)->seq, TCP_SKB_CB(skb)->end_seq));
++ }
++
++ return packets_acked;
++}
++
++/**
++ * Cleans the meta-socket retransmission queue and the reinject-queue.
++ * @meta_sk must be the meta-socket.
++ */
++static void mptcp_clean_rtx_queue(struct sock *meta_sk, u32 prior_snd_una)
++{
++ struct sk_buff *skb, *tmp;
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct mptcp_cb *mpcb = meta_tp->mpcb;
++ bool acked = false;
++ u32 acked_pcount;
++
++ while ((skb = tcp_write_queue_head(meta_sk)) &&
++ skb != tcp_send_head(meta_sk)) {
++ bool fully_acked = true;
++
++ if (before(meta_tp->snd_una, TCP_SKB_CB(skb)->end_seq)) {
++ if (tcp_skb_pcount(skb) == 1 ||
++ !after(meta_tp->snd_una, TCP_SKB_CB(skb)->seq))
++ break;
++
++ acked_pcount = tcp_tso_acked(meta_sk, skb);
++ if (!acked_pcount)
++ break;
++
++ fully_acked = false;
++ } else {
++ acked_pcount = tcp_skb_pcount(skb);
++ }
++
++ acked = true;
++ meta_tp->packets_out -= acked_pcount;
++ meta_tp->retrans_stamp = 0;
++
++ if (!fully_acked)
++ break;
++
++ tcp_unlink_write_queue(skb, meta_sk);
++
++ if (mptcp_is_data_fin(skb)) {
++ struct sock *sk_it;
++
++ /* DATA_FIN has been acknowledged - now we can close
++ * the subflows
++ */
++ mptcp_for_each_sk(mpcb, sk_it) {
++ unsigned long delay = 0;
++
++ /* If we are the passive closer, don't trigger
++ * subflow-fin until the subflow has been finned
++ * by the peer - thus we add a delay.
++ */
++ if (mpcb->passive_close &&
++ sk_it->sk_state == TCP_ESTABLISHED)
++ delay = inet_csk(sk_it)->icsk_rto << 3;
++
++ mptcp_sub_close(sk_it, delay);
++ }
++ }
++ sk_wmem_free_skb(meta_sk, skb);
++ }
++ /* Remove acknowledged data from the reinject queue */
++ skb_queue_walk_safe(&mpcb->reinject_queue, skb, tmp) {
++ if (before(meta_tp->snd_una, TCP_SKB_CB(skb)->end_seq)) {
++ if (tcp_skb_pcount(skb) == 1 ||
++ !after(meta_tp->snd_una, TCP_SKB_CB(skb)->seq))
++ break;
++
++ mptcp_tso_acked_reinject(meta_sk, skb);
++ break;
++ }
++
++ __skb_unlink(skb, &mpcb->reinject_queue);
++ __kfree_skb(skb);
++ }
++
++ if (likely(between(meta_tp->snd_up, prior_snd_una, meta_tp->snd_una)))
++ meta_tp->snd_up = meta_tp->snd_una;
++
++ if (acked) {
++ tcp_rearm_rto(meta_sk);
++ /* Normally this is done in tcp_try_undo_loss - but MPTCP
++ * does not call this function.
++ */
++ inet_csk(meta_sk)->icsk_retransmits = 0;
++ }
++}
++
++/* Inspired by tcp_rcv_state_process */
++static int mptcp_rcv_state_process(struct sock *meta_sk, struct sock *sk,
++ const struct sk_buff *skb, u32 data_seq,
++ u16 data_len)
++{
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk), *tp = tcp_sk(sk);
++ const struct tcphdr *th = tcp_hdr(skb);
++
++ /* State-machine handling if FIN has been enqueued and it has
++ * been acked (snd_una == write_seq) - it's important that this
++ * happens after sk_wmem_free_skb, because otherwise
++ * sk_forward_alloc is wrong upon inet_csk_destroy_sock()
++ */
++ switch (meta_sk->sk_state) {
++ case TCP_FIN_WAIT1: {
++ struct dst_entry *dst;
++ int tmo;
++
++ if (meta_tp->snd_una != meta_tp->write_seq)
++ break;
++
++ tcp_set_state(meta_sk, TCP_FIN_WAIT2);
++ meta_sk->sk_shutdown |= SEND_SHUTDOWN;
++
++ dst = __sk_dst_get(sk);
++ if (dst)
++ dst_confirm(dst);
++
++ if (!sock_flag(meta_sk, SOCK_DEAD)) {
++ /* Wake up lingering close() */
++ meta_sk->sk_state_change(meta_sk);
++ break;
++ }
++
++ if (meta_tp->linger2 < 0 ||
++ (data_len &&
++ after(data_seq + data_len - (mptcp_is_data_fin2(skb, tp) ? 1 : 0),
++ meta_tp->rcv_nxt))) {
++ mptcp_send_active_reset(meta_sk, GFP_ATOMIC);
++ tcp_done(meta_sk);
++ NET_INC_STATS_BH(sock_net(meta_sk), LINUX_MIB_TCPABORTONDATA);
++ return 1;
++ }
++
++ tmo = tcp_fin_time(meta_sk);
++ if (tmo > TCP_TIMEWAIT_LEN) {
++ inet_csk_reset_keepalive_timer(meta_sk, tmo - TCP_TIMEWAIT_LEN);
++ } else if (mptcp_is_data_fin2(skb, tp) || sock_owned_by_user(meta_sk)) {
++ /* Bad case. We could lose such FIN otherwise.
++ * It is not a big problem, but it looks confusing
++ * and not so rare event. We still can lose it now,
++ * if it spins in bh_lock_sock(), but it is really
++ * marginal case.
++ */
++ inet_csk_reset_keepalive_timer(meta_sk, tmo);
++ } else {
++ meta_tp->ops->time_wait(meta_sk, TCP_FIN_WAIT2, tmo);
++ }
++ break;
++ }
++ case TCP_CLOSING:
++ case TCP_LAST_ACK:
++ if (meta_tp->snd_una == meta_tp->write_seq) {
++ tcp_done(meta_sk);
++ return 1;
++ }
++ break;
++ }
++
++ /* step 7: process the segment text */
++ switch (meta_sk->sk_state) {
++ case TCP_FIN_WAIT1:
++ case TCP_FIN_WAIT2:
++ /* RFC 793 says to queue data in these states,
++ * RFC 1122 says we MUST send a reset.
++ * BSD 4.4 also does reset.
++ */
++ if (meta_sk->sk_shutdown & RCV_SHUTDOWN) {
++ if (TCP_SKB_CB(skb)->end_seq != TCP_SKB_CB(skb)->seq &&
++ after(TCP_SKB_CB(skb)->end_seq - th->fin, tp->rcv_nxt) &&
++ !mptcp_is_data_fin2(skb, tp)) {
++ NET_INC_STATS_BH(sock_net(meta_sk), LINUX_MIB_TCPABORTONDATA);
++ mptcp_send_active_reset(meta_sk, GFP_ATOMIC);
++ tcp_reset(meta_sk);
++ return 1;
++ }
++ }
++ break;
++ }
++
++ return 0;
++}
++
++/**
++ * @return:
++ * i) 1: Everything's fine.
++ * ii) -1: A reset has been sent on the subflow - csum-failure
++ * iii) 0: csum-failure but no reset sent, because it's the last subflow.
++ * The last packet should not be destroyed by the caller, because that
++ * has already been done here.
++ */
++static int mptcp_verif_dss_csum(struct sock *sk)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct sk_buff *tmp, *tmp1, *last = NULL;
++ __wsum csum_tcp = 0; /* cumulative checksum of pld + mptcp-header */
++ int ans = 1, overflowed = 0, offset = 0, dss_csum_added = 0;
++ int iter = 0;
++
++ skb_queue_walk_safe(&sk->sk_receive_queue, tmp, tmp1) {
++ unsigned int csum_len;
++
++ if (before(tp->mptcp->map_subseq + tp->mptcp->map_data_len, TCP_SKB_CB(tmp)->end_seq))
++ /* Mapping ends in the middle of the packet -
++ * csum only these bytes
++ */
++ csum_len = tp->mptcp->map_subseq + tp->mptcp->map_data_len - TCP_SKB_CB(tmp)->seq;
++ else
++ csum_len = tmp->len;
++
++ offset = 0;
++ if (overflowed) {
++ char first_word[4];
++ first_word[0] = 0;
++ first_word[1] = 0;
++ first_word[2] = 0;
++ first_word[3] = *(tmp->data);
++ csum_tcp = csum_partial(first_word, 4, csum_tcp);
++ offset = 1;
++ csum_len--;
++ overflowed = 0;
++ }
++
++ csum_tcp = skb_checksum(tmp, offset, csum_len, csum_tcp);
++
++ /* Was the length odd? Then we have to merge the next byte
++ * correctly (see above)
++ */
++ if (csum_len != (csum_len & (~1)))
++ overflowed = 1;
++
++ if (mptcp_is_data_seq(tmp) && !dss_csum_added) {
++ __be32 data_seq = htonl((u32)(tp->mptcp->map_data_seq >> 32));
++
++ /* If a 64-bit dss is present, we increase the offset
++ * by 4 bytes, as the high-order 64-bits will be added
++ * in the final csum_partial-call.
++ */
++ u32 offset = skb_transport_offset(tmp) +
++ TCP_SKB_CB(tmp)->dss_off;
++ if (TCP_SKB_CB(tmp)->mptcp_flags & MPTCPHDR_SEQ64_SET)
++ offset += 4;
++
++ csum_tcp = skb_checksum(tmp, offset,
++ MPTCP_SUB_LEN_SEQ_CSUM,
++ csum_tcp);
++
++ csum_tcp = csum_partial(&data_seq,
++ sizeof(data_seq), csum_tcp);
++
++ dss_csum_added = 1; /* Just do it once */
++ }
++ last = tmp;
++ iter++;
++
++ if (!skb_queue_is_last(&sk->sk_receive_queue, tmp) &&
++ !before(TCP_SKB_CB(tmp1)->seq,
++ tp->mptcp->map_subseq + tp->mptcp->map_data_len))
++ break;
++ }
++
++ /* Now, checksum must be 0 */
++ if (unlikely(csum_fold(csum_tcp))) {
++ pr_err("%s csum is wrong: %#x data_seq %u dss_csum_added %d overflowed %d iterations %d\n",
++ __func__, csum_fold(csum_tcp), TCP_SKB_CB(last)->seq,
++ dss_csum_added, overflowed, iter);
++
++ tp->mptcp->send_mp_fail = 1;
++
++ /* map_data_seq is the data-seq number of the
++ * mapping we are currently checking
++ */
++ tp->mpcb->csum_cutoff_seq = tp->mptcp->map_data_seq;
++
++ if (tp->mpcb->cnt_subflows > 1) {
++ mptcp_send_reset(sk);
++ ans = -1;
++ } else {
++ tp->mpcb->send_infinite_mapping = 1;
++
++ /* Need to purge the rcv-queue as it's no longer valid */
++ while ((tmp = __skb_dequeue(&sk->sk_receive_queue)) != NULL) {
++ tp->copied_seq = TCP_SKB_CB(tmp)->end_seq;
++ kfree_skb(tmp);
++ }
++
++ ans = 0;
++ }
++ }
++
++ return ans;
++}
++
++static inline void mptcp_prepare_skb(struct sk_buff *skb,
++ const struct sock *sk)
++{
++ const struct tcp_sock *tp = tcp_sk(sk);
++ struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
++ u32 inc = 0;
++
++ /* If skb is the end of this mapping (end is always at mapping-boundary
++ * thanks to the splitting/trimming), then we need to increase
++ * data-end-seq by 1 if this here is a data-fin.
++ *
++ * We need to do -1 because end_seq includes the subflow-FIN.
++ */
++ if (tp->mptcp->map_data_fin &&
++ (tcb->end_seq - (tcp_hdr(skb)->fin ? 1 : 0)) ==
++ (tp->mptcp->map_subseq + tp->mptcp->map_data_len)) {
++ inc = 1;
++
++ /* We manually set the fin-flag if it is a data-fin. For easy
++ * processing in tcp_recvmsg.
++ */
++ tcp_hdr(skb)->fin = 1;
++ } else {
++ /* We may have a subflow-fin with data but without data-fin */
++ tcp_hdr(skb)->fin = 0;
++ }
++
++ /* Adapt data-seq's to the packet itself. We kinda transform the
++ * dss-mapping to a per-packet granularity. This is necessary to
++ * correctly handle overlapping mappings coming from different
++ * subflows. Otherwise it would be a complete mess.
++ */
++ tcb->seq = ((u32)tp->mptcp->map_data_seq) + tcb->seq - tp->mptcp->map_subseq;
++ tcb->end_seq = tcb->seq + skb->len + inc;
++}
++
++/**
++ * @return: 1 if the segment has been eaten and can be suppressed,
++ * otherwise 0.
++ */
++static inline int mptcp_direct_copy(const struct sk_buff *skb,
++ struct sock *meta_sk)
++{
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ int chunk = min_t(unsigned int, skb->len, meta_tp->ucopy.len);
++ int eaten = 0;
++
++ __set_current_state(TASK_RUNNING);
++
++ local_bh_enable();
++ if (!skb_copy_datagram_iovec(skb, 0, meta_tp->ucopy.iov, chunk)) {
++ meta_tp->ucopy.len -= chunk;
++ meta_tp->copied_seq += chunk;
++ eaten = (chunk == skb->len);
++ tcp_rcv_space_adjust(meta_sk);
++ }
++ local_bh_disable();
++ return eaten;
++}
++
++static inline void mptcp_reset_mapping(struct tcp_sock *tp)
++{
++ tp->mptcp->map_data_len = 0;
++ tp->mptcp->map_data_seq = 0;
++ tp->mptcp->map_subseq = 0;
++ tp->mptcp->map_data_fin = 0;
++ tp->mptcp->mapping_present = 0;
++}
++
++/* The DSS-mapping received on the sk only covers the second half of the skb
++ * (cut at seq). We trim the head from the skb.
++ * Data will be freed upon kfree().
++ *
++ * Inspired by tcp_trim_head().
++ */
++static void mptcp_skb_trim_head(struct sk_buff *skb, struct sock *sk, u32 seq)
++{
++ int len = seq - TCP_SKB_CB(skb)->seq;
++ u32 new_seq = TCP_SKB_CB(skb)->seq + len;
++
++ if (len < skb_headlen(skb))
++ __skb_pull(skb, len);
++ else
++ __pskb_trim_head(skb, len - skb_headlen(skb));
++
++ TCP_SKB_CB(skb)->seq = new_seq;
++
++ skb->truesize -= len;
++ atomic_sub(len, &sk->sk_rmem_alloc);
++ sk_mem_uncharge(sk, len);
++}
++
++/* The DSS-mapping received on the sk only covers the first half of the skb
++ * (cut at seq). We create a second skb (@return), and queue it in the rcv-queue
++ * as further packets may resolve the mapping of the second half of data.
++ *
++ * Inspired by tcp_fragment().
++ */
++static int mptcp_skb_split_tail(struct sk_buff *skb, struct sock *sk, u32 seq)
++{
++ struct sk_buff *buff;
++ int nsize;
++ int nlen, len;
++
++ len = seq - TCP_SKB_CB(skb)->seq;
++ nsize = skb_headlen(skb) - len + tcp_sk(sk)->tcp_header_len;
++ if (nsize < 0)
++ nsize = 0;
++
++ /* Get a new skb... force flag on. */
++ buff = alloc_skb(nsize, GFP_ATOMIC);
++ if (buff == NULL)
++ return -ENOMEM;
++
++ skb_reserve(buff, tcp_sk(sk)->tcp_header_len);
++ skb_reset_transport_header(buff);
++
++ tcp_hdr(buff)->fin = tcp_hdr(skb)->fin;
++ tcp_hdr(skb)->fin = 0;
++
++ /* We absolutely need to call skb_set_owner_r before refreshing the
++ * truesize of buff, otherwise the moved data will be accounted twice.
++ */
++ skb_set_owner_r(buff, sk);
++ nlen = skb->len - len - nsize;
++ buff->truesize += nlen;
++ skb->truesize -= nlen;
++
++ /* Correct the sequence numbers. */
++ TCP_SKB_CB(buff)->seq = TCP_SKB_CB(skb)->seq + len;
++ TCP_SKB_CB(buff)->end_seq = TCP_SKB_CB(skb)->end_seq;
++ TCP_SKB_CB(skb)->end_seq = TCP_SKB_CB(buff)->seq;
++
++ skb_split(skb, buff, len);
++
++ __skb_queue_after(&sk->sk_receive_queue, skb, buff);
++
++ return 0;
++}
++
++/* @return: 0 everything is fine. Just continue processing
++ * 1 subflow is broken - stop everything
++ * -1 this packet was broken - continue with the next one.
++ */
++static int mptcp_prevalidate_skb(struct sock *sk, struct sk_buff *skb)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++
++ /* If we are in infinite mode, the subflow-fin is in fact a data-fin. */
++ if (!skb->len && tcp_hdr(skb)->fin && !mptcp_is_data_fin(skb) &&
++ !tp->mpcb->infinite_mapping_rcv) {
++ /* Remove a pure subflow-fin from the queue and increase
++ * copied_seq.
++ */
++ tp->copied_seq = TCP_SKB_CB(skb)->end_seq;
++ __skb_unlink(skb, &sk->sk_receive_queue);
++ __kfree_skb(skb);
++ return -1;
++ }
++
++ /* If we are not yet fully established and do not know the mapping for
++ * this segment, this path has to fallback to infinite or be torn down.
++ */
++ if (!tp->mptcp->fully_established && !mptcp_is_data_seq(skb) &&
++ !tp->mptcp->mapping_present && !tp->mpcb->infinite_mapping_rcv) {
++ pr_err("%s %#x will fallback - pi %d from %pS, seq %u\n",
++ __func__, tp->mpcb->mptcp_loc_token,
++ tp->mptcp->path_index, __builtin_return_address(0),
++ TCP_SKB_CB(skb)->seq);
++
++ if (!is_master_tp(tp)) {
++ mptcp_send_reset(sk);
++ return 1;
++ }
++
++ tp->mpcb->infinite_mapping_snd = 1;
++ tp->mpcb->infinite_mapping_rcv = 1;
++ /* We do a seamless fallback and should not send an infinite mapping. */
++ tp->mpcb->send_infinite_mapping = 0;
++ tp->mptcp->fully_established = 1;
++ }
++
++ /* Receiver-side becomes fully established when a whole rcv-window has
++ * been received without the need to fallback due to the previous
++ * condition.
++ */
++ if (!tp->mptcp->fully_established) {
++ tp->mptcp->init_rcv_wnd -= skb->len;
++ if (tp->mptcp->init_rcv_wnd < 0)
++ mptcp_become_fully_estab(sk);
++ }
++
++ return 0;
++}
++
++/* @return: 0 everything is fine. Just continue processing
++ * 1 subflow is broken - stop everything
++ * -1 this packet was broken - continue with the next one.
++ */
++static int mptcp_detect_mapping(struct sock *sk, struct sk_buff *skb)
++{
++ struct tcp_sock *tp = tcp_sk(sk), *meta_tp = mptcp_meta_tp(tp);
++ struct mptcp_cb *mpcb = tp->mpcb;
++ struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
++ u32 *ptr;
++ u32 data_seq, sub_seq, data_len, tcp_end_seq;
++
++ /* If we are in infinite-mapping-mode, the subflow is guaranteed to be
++ * in-order at the data-level. Thus data-seq-numbers can be inferred
++ * from what is expected at the data-level.
++ */
++ if (mpcb->infinite_mapping_rcv) {
++ tp->mptcp->map_data_seq = mptcp_get_rcv_nxt_64(meta_tp);
++ tp->mptcp->map_subseq = tcb->seq;
++ tp->mptcp->map_data_len = skb->len;
++ tp->mptcp->map_data_fin = tcp_hdr(skb)->fin;
++ tp->mptcp->mapping_present = 1;
++ return 0;
++ }
++
++ /* No mapping here? Exit - it is either already set or still on its way */
++ if (!mptcp_is_data_seq(skb)) {
++ /* Too many packets without a mapping - this subflow is broken */
++ if (!tp->mptcp->mapping_present &&
++ tp->rcv_nxt - tp->copied_seq > 65536) {
++ mptcp_send_reset(sk);
++ return 1;
++ }
++
++ return 0;
++ }
++
++ ptr = mptcp_skb_set_data_seq(skb, &data_seq, mpcb);
++ ptr++;
++ sub_seq = get_unaligned_be32(ptr) + tp->mptcp->rcv_isn;
++ ptr++;
++ data_len = get_unaligned_be16(ptr);
++
++ /* If it's an empty skb with DATA_FIN, sub_seq must get fixed.
++ * The draft sets it to 0, but we really would like to have the
++ * real value, to have an easy handling afterwards here in this
++ * function.
++ */
++ if (mptcp_is_data_fin(skb) && skb->len == 0)
++ sub_seq = TCP_SKB_CB(skb)->seq;
++
++ /* If there is already a mapping - we check if it maps with the current
++ * one. If not - we reset.
++ */
++ if (tp->mptcp->mapping_present &&
++ (data_seq != (u32)tp->mptcp->map_data_seq ||
++ sub_seq != tp->mptcp->map_subseq ||
++ data_len != tp->mptcp->map_data_len + tp->mptcp->map_data_fin ||
++ mptcp_is_data_fin(skb) != tp->mptcp->map_data_fin)) {
++ /* Mapping in packet is different from what we want */
++ pr_err("%s Mappings do not match!\n", __func__);
++ pr_err("%s dseq %u mdseq %u, sseq %u msseq %u dlen %u mdlen %u dfin %d mdfin %d\n",
++ __func__, data_seq, (u32)tp->mptcp->map_data_seq,
++ sub_seq, tp->mptcp->map_subseq, data_len,
++ tp->mptcp->map_data_len, mptcp_is_data_fin(skb),
++ tp->mptcp->map_data_fin);
++ mptcp_send_reset(sk);
++ return 1;
++ }
++
++ /* If the previous check was good, the current mapping is valid and we exit. */
++ if (tp->mptcp->mapping_present)
++ return 0;
++
++ /* Mapping not yet set on this subflow - we set it here! */
++
++ if (!data_len) {
++ mpcb->infinite_mapping_rcv = 1;
++ tp->mptcp->fully_established = 1;
++ /* We need to repeat mp_fail's until the sender fell
++ * back to infinite-mapping - here we stop repeating it.
++ */
++ tp->mptcp->send_mp_fail = 0;
++
++ /* We have to fixup data_len - it must be the same as skb->len */
++ data_len = skb->len + (mptcp_is_data_fin(skb) ? 1 : 0);
++ sub_seq = tcb->seq;
++
++ /* TODO kill all other subflows than this one */
++ /* data_seq and so on are set correctly */
++
++ /* At this point, the meta-ofo-queue has to be emptied,
++ * as the following data is guaranteed to be in-order at
++ * the data and subflow-level
++ */
++ mptcp_purge_ofo_queue(meta_tp);
++ }
++
++ /* We are sending mp-fail's and thus are in fallback mode.
++ * Ignore packets which do not announce the fallback and still
++ * want to provide a mapping.
++ */
++ if (tp->mptcp->send_mp_fail) {
++ tp->copied_seq = TCP_SKB_CB(skb)->end_seq;
++ __skb_unlink(skb, &sk->sk_receive_queue);
++ __kfree_skb(skb);
++ return -1;
++ }
++
++ /* FIN increased the mapping-length by 1 */
++ if (mptcp_is_data_fin(skb))
++ data_len--;
++
++ /* The subflow-sequences of the packet must be
++ * (at least partially) part of the DSS-mapping's
++ * subflow-sequence-space.
++ *
++ * Basically, the mapping is not valid if any of the
++ * following conditions is true:
++ *
++ * 1. It's not a data_fin and
++ * MPTCP-sub_seq >= TCP-end_seq
++ *
++ * 2. It's a data_fin and TCP-end_seq > TCP-seq and
++ * MPTCP-sub_seq >= TCP-end_seq
++ *
++ * The previous two can be merged into:
++ * TCP-end_seq > TCP-seq and MPTCP-sub_seq >= TCP-end_seq
++ * Because if it's not a data-fin, TCP-end_seq > TCP-seq
++ *
++ * 3. It's a data_fin and skb->len == 0 and
++ * MPTCP-sub_seq > TCP-end_seq
++ *
++ * 4. It's not a data_fin and TCP-end_seq > TCP-seq and
++ * MPTCP-sub_seq + MPTCP-data_len <= TCP-seq
++ *
++ * 5. MPTCP-sub_seq is prior to what we already copied (copied_seq)
++ */
++
++ /* subflow-fin is not part of the mapping - ignore it here! */
++ tcp_end_seq = tcb->end_seq - tcp_hdr(skb)->fin;
++ if ((!before(sub_seq, tcb->end_seq) && after(tcp_end_seq, tcb->seq)) ||
++ (mptcp_is_data_fin(skb) && skb->len == 0 && after(sub_seq, tcb->end_seq)) ||
++ (!after(sub_seq + data_len, tcb->seq) && after(tcp_end_seq, tcb->seq)) ||
++ before(sub_seq, tp->copied_seq)) {
++ /* The packet's subflow-sequences differ from what is in the
++ * packet's dss-mapping. The peer is misbehaving - reset
++ */
++ pr_err("%s Packet's mapping does not map to the DSS sub_seq %u "
++ "end_seq %u, tcp_end_seq %u seq %u dfin %u len %u data_len %u "
++ "copied_seq %u\n", __func__, sub_seq, tcb->end_seq, tcp_end_seq, tcb->seq, mptcp_is_data_fin(skb),
++ skb->len, data_len, tp->copied_seq);
++ mptcp_send_reset(sk);
++ return 1;
++ }
++
++ /* Did the DSS have 64-bit seqnums? */
++ if (!(tcb->mptcp_flags & MPTCPHDR_SEQ64_SET)) {
++ /* Wrapped around? */
++ if (unlikely(after(data_seq, meta_tp->rcv_nxt) && data_seq < meta_tp->rcv_nxt)) {
++ tp->mptcp->map_data_seq = mptcp_get_data_seq_64(mpcb, !mpcb->rcv_hiseq_index, data_seq);
++ } else {
++ /* Else, access the default high-order bits */
++ tp->mptcp->map_data_seq = mptcp_get_data_seq_64(mpcb, mpcb->rcv_hiseq_index, data_seq);
++ }
++ } else {
++ tp->mptcp->map_data_seq = mptcp_get_data_seq_64(mpcb, (tcb->mptcp_flags & MPTCPHDR_SEQ64_INDEX) ? 1 : 0, data_seq);
++
++ if (unlikely(tcb->mptcp_flags & MPTCPHDR_SEQ64_OFO)) {
++ /* We make sure that the data_seq is invalid.
++ * It will be dropped later.
++ */
++ tp->mptcp->map_data_seq += 0xFFFFFFFF;
++ tp->mptcp->map_data_seq += 0xFFFFFFFF;
++ }
++ }
++
++ tp->mptcp->map_data_len = data_len;
++ tp->mptcp->map_subseq = sub_seq;
++ tp->mptcp->map_data_fin = mptcp_is_data_fin(skb) ? 1 : 0;
++ tp->mptcp->mapping_present = 1;
++
++ return 0;
++}
++
++/* Similar to tcp_sequence(...) */
++static inline bool mptcp_sequence(const struct tcp_sock *meta_tp,
++ u64 data_seq, u64 end_data_seq)
++{
++ const struct mptcp_cb *mpcb = meta_tp->mpcb;
++ u64 rcv_wup64;
++
++ /* Wrap-around? */
++ if (meta_tp->rcv_wup > meta_tp->rcv_nxt) {
++ rcv_wup64 = ((u64)(mpcb->rcv_high_order[mpcb->rcv_hiseq_index] - 1) << 32) |
++ meta_tp->rcv_wup;
++ } else {
++ rcv_wup64 = mptcp_get_data_seq_64(mpcb, mpcb->rcv_hiseq_index,
++ meta_tp->rcv_wup);
++ }
++
++ return !before64(end_data_seq, rcv_wup64) &&
++ !after64(data_seq, mptcp_get_rcv_nxt_64(meta_tp) + tcp_receive_window(meta_tp));
++}
++
++/* @return: 0 everything is fine. Just continue processing
++ * -1 this packet was broken - continue with the next one.
++ */
++static int mptcp_validate_mapping(struct sock *sk, struct sk_buff *skb)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct sk_buff *tmp, *tmp1;
++ u32 tcp_end_seq;
++
++ if (!tp->mptcp->mapping_present)
++ return 0;
++
++ /* Either the new skb gave us the mapping and the first segment
++ * in the sub-rcv-queue has to be trimmed ...
++ */
++ tmp = skb_peek(&sk->sk_receive_queue);
++ if (before(TCP_SKB_CB(tmp)->seq, tp->mptcp->map_subseq) &&
++ after(TCP_SKB_CB(tmp)->end_seq, tp->mptcp->map_subseq))
++ mptcp_skb_trim_head(tmp, sk, tp->mptcp->map_subseq);
++
++ /* ... or the new skb (tail) has to be split at the end. */
++ tcp_end_seq = TCP_SKB_CB(skb)->end_seq - (tcp_hdr(skb)->fin ? 1 : 0);
++ if (after(tcp_end_seq, tp->mptcp->map_subseq + tp->mptcp->map_data_len)) {
++ u32 seq = tp->mptcp->map_subseq + tp->mptcp->map_data_len;
++ if (mptcp_skb_split_tail(skb, sk, seq)) { /* Allocation failed */
++ /* TODO : maybe handle this here better.
++ * We now just force meta-retransmission.
++ */
++ tp->copied_seq = TCP_SKB_CB(skb)->end_seq;
++ __skb_unlink(skb, &sk->sk_receive_queue);
++ __kfree_skb(skb);
++ return -1;
++ }
++ }
++
++ /* Now, remove old sk_buff's from the receive-queue.
++ * This may happen if the mapping has been lost for these segments and
++ * the next mapping has already been received.
++ */
++ if (before(TCP_SKB_CB(skb_peek(&sk->sk_receive_queue))->seq, tp->mptcp->map_subseq)) {
++ skb_queue_walk_safe(&sk->sk_receive_queue, tmp1, tmp) {
++ if (!before(TCP_SKB_CB(tmp1)->seq, tp->mptcp->map_subseq))
++ break;
++
++ tp->copied_seq = TCP_SKB_CB(tmp1)->end_seq;
++ __skb_unlink(tmp1, &sk->sk_receive_queue);
++
++ /* We cannot free a needed skb here, because its
++ * mapping is known to be valid from previous checks
++ */
++ __kfree_skb(tmp1);
++ }
++ }
++
++ return 0;
++}
++
++/* @return: 0 everything is fine. Just continue processing
++ * 1 subflow is broken stop everything
++ * -1 this mapping has been put in the meta-receive-queue
++ * -2 this mapping has been eaten by the application
++ */
++static int mptcp_queue_skb(struct sock *sk)
++{
++ struct tcp_sock *tp = tcp_sk(sk), *meta_tp = mptcp_meta_tp(tp);
++ struct sock *meta_sk = mptcp_meta_sk(sk);
++ struct mptcp_cb *mpcb = tp->mpcb;
++ struct sk_buff *tmp, *tmp1;
++ u64 rcv_nxt64 = mptcp_get_rcv_nxt_64(meta_tp);
++ bool data_queued = false;
++
++ /* Have we not yet received the full mapping? */
++ if (!tp->mptcp->mapping_present ||
++ before(tp->rcv_nxt, tp->mptcp->map_subseq + tp->mptcp->map_data_len))
++ return 0;
++
++ /* Is this an overlapping mapping? rcv_nxt >= end_data_seq
++ * OR
++ * This mapping is out of window
++ */
++ if (!before64(rcv_nxt64, tp->mptcp->map_data_seq + tp->mptcp->map_data_len + tp->mptcp->map_data_fin) ||
++ !mptcp_sequence(meta_tp, tp->mptcp->map_data_seq,
++ tp->mptcp->map_data_seq + tp->mptcp->map_data_len + tp->mptcp->map_data_fin)) {
++ skb_queue_walk_safe(&sk->sk_receive_queue, tmp1, tmp) {
++ __skb_unlink(tmp1, &sk->sk_receive_queue);
++ tp->copied_seq = TCP_SKB_CB(tmp1)->end_seq;
++ __kfree_skb(tmp1);
++
++ if (!skb_queue_empty(&sk->sk_receive_queue) &&
++ !before(TCP_SKB_CB(tmp)->seq,
++ tp->mptcp->map_subseq + tp->mptcp->map_data_len))
++ break;
++ }
++
++ mptcp_reset_mapping(tp);
++
++ return -1;
++ }
++
++ /* Record it, because we want to send our data_fin on the same path */
++ if (tp->mptcp->map_data_fin) {
++ mpcb->dfin_path_index = tp->mptcp->path_index;
++ mpcb->dfin_combined = !!(sk->sk_shutdown & RCV_SHUTDOWN);
++ }
++
++ /* Verify the checksum */
++ if (mpcb->dss_csum && !mpcb->infinite_mapping_rcv) {
++ int ret = mptcp_verif_dss_csum(sk);
++
++ if (ret <= 0) {
++ mptcp_reset_mapping(tp);
++ return 1;
++ }
++ }
++
++ if (before64(rcv_nxt64, tp->mptcp->map_data_seq)) {
++ /* Segments have to go to the meta-ofo-queue */
++ skb_queue_walk_safe(&sk->sk_receive_queue, tmp1, tmp) {
++ tp->copied_seq = TCP_SKB_CB(tmp1)->end_seq;
++ mptcp_prepare_skb(tmp1, sk);
++ __skb_unlink(tmp1, &sk->sk_receive_queue);
++ /* MUST be done here, because fragstolen may be true later.
++ * Then, kfree_skb_partial will not account the memory.
++ */
++ skb_orphan(tmp1);
++
++ if (!mpcb->in_time_wait) /* In time-wait, do not receive data */
++ mptcp_add_meta_ofo_queue(meta_sk, tmp1, sk);
++ else
++ __kfree_skb(tmp1);
++
++ if (!skb_queue_empty(&sk->sk_receive_queue) &&
++ !before(TCP_SKB_CB(tmp)->seq,
++ tp->mptcp->map_subseq + tp->mptcp->map_data_len))
++ break;
++ }
++ tcp_enter_quickack_mode(sk);
++ } else {
++ /* Ready for the meta-rcv-queue */
++ skb_queue_walk_safe(&sk->sk_receive_queue, tmp1, tmp) {
++ int eaten = 0;
++ bool copied_early = false;
++ bool fragstolen = false;
++ u32 old_rcv_nxt = meta_tp->rcv_nxt;
++
++ tp->copied_seq = TCP_SKB_CB(tmp1)->end_seq;
++ mptcp_prepare_skb(tmp1, sk);
++ __skb_unlink(tmp1, &sk->sk_receive_queue);
++ /* MUST be done here, because fragstolen may be true.
++ * Then, kfree_skb_partial will not account the memory.
++ */
++ skb_orphan(tmp1);
++
++ /* This segment has already been received */
++ if (!after(TCP_SKB_CB(tmp1)->end_seq, meta_tp->rcv_nxt)) {
++ __kfree_skb(tmp1);
++ goto next;
++ }
++
++#ifdef CONFIG_NET_DMA
++ if (TCP_SKB_CB(tmp1)->seq == meta_tp->rcv_nxt &&
++ meta_tp->ucopy.task == current &&
++ meta_tp->copied_seq == meta_tp->rcv_nxt &&
++ tmp1->len <= meta_tp->ucopy.len &&
++ sock_owned_by_user(meta_sk) &&
++ tcp_dma_try_early_copy(meta_sk, tmp1, 0)) {
++ copied_early = true;
++ eaten = 1;
++ }
++#endif
++
++ /* Is direct copy possible ? */
++ if (TCP_SKB_CB(tmp1)->seq == meta_tp->rcv_nxt &&
++ meta_tp->ucopy.task == current &&
++ meta_tp->copied_seq == meta_tp->rcv_nxt &&
++ meta_tp->ucopy.len && sock_owned_by_user(meta_sk) &&
++ !copied_early)
++ eaten = mptcp_direct_copy(tmp1, meta_sk);
++
++ if (mpcb->in_time_wait) /* In time-wait, do not receive data */
++ eaten = 1;
++
++ if (!eaten)
++ eaten = tcp_queue_rcv(meta_sk, tmp1, 0, &fragstolen);
++
++ meta_tp->rcv_nxt = TCP_SKB_CB(tmp1)->end_seq;
++ mptcp_check_rcvseq_wrap(meta_tp, old_rcv_nxt);
++
++#ifdef CONFIG_NET_DMA
++ if (copied_early)
++ meta_tp->cleanup_rbuf(meta_sk, tmp1->len);
++#endif
++
++ if (tcp_hdr(tmp1)->fin && !mpcb->in_time_wait)
++ mptcp_fin(meta_sk);
++
++ /* Check if this fills a gap in the ofo queue */
++ if (!skb_queue_empty(&meta_tp->out_of_order_queue))
++ mptcp_ofo_queue(meta_sk);
++
++#ifdef CONFIG_NET_DMA
++ if (copied_early)
++ __skb_queue_tail(&meta_sk->sk_async_wait_queue,
++ tmp1);
++ else
++#endif
++ if (eaten)
++ kfree_skb_partial(tmp1, fragstolen);
++
++ data_queued = true;
++next:
++ if (!skb_queue_empty(&sk->sk_receive_queue) &&
++ !before(TCP_SKB_CB(tmp)->seq,
++ tp->mptcp->map_subseq + tp->mptcp->map_data_len))
++ break;
++ }
++ }
++
++ inet_csk(meta_sk)->icsk_ack.lrcvtime = tcp_time_stamp;
++ mptcp_reset_mapping(tp);
++
++ return data_queued ? -1 : -2;
++}
++
++void mptcp_data_ready(struct sock *sk)
++{
++ struct sock *meta_sk = mptcp_meta_sk(sk);
++ struct sk_buff *skb, *tmp;
++ int queued = 0;
++
++ /* restart before the check, because mptcp_fin might have changed the
++ * state.
++ */
++restart:
++ /* If the meta cannot receive data, there is no point in pushing data.
++ * If we are in time-wait, we may still be waiting for the final FIN.
++ * So, we should proceed with the processing.
++ */
++ if (!mptcp_sk_can_recv(meta_sk) && !tcp_sk(sk)->mpcb->in_time_wait) {
++ skb_queue_purge(&sk->sk_receive_queue);
++ tcp_sk(sk)->copied_seq = tcp_sk(sk)->rcv_nxt;
++ goto exit;
++ }
++
++ /* Iterate over all segments, detect their mapping (if we don't have
++ * one yet), validate them and push everything one level higher.
++ */
++ skb_queue_walk_safe(&sk->sk_receive_queue, skb, tmp) {
++ int ret;
++ /* Pre-validation - e.g., early fallback */
++ ret = mptcp_prevalidate_skb(sk, skb);
++ if (ret < 0)
++ goto restart;
++ else if (ret > 0)
++ break;
++
++ /* Set the current mapping */
++ ret = mptcp_detect_mapping(sk, skb);
++ if (ret < 0)
++ goto restart;
++ else if (ret > 0)
++ break;
++
++ /* Validation */
++ if (mptcp_validate_mapping(sk, skb) < 0)
++ goto restart;
++
++ /* Push a level higher */
++ ret = mptcp_queue_skb(sk);
++ if (ret < 0) {
++ if (ret == -1)
++ queued = ret;
++ goto restart;
++ } else if (ret == 0) {
++ continue;
++ } else { /* ret == 1 */
++ break;
++ }
++ }
++
++exit:
++ if (tcp_sk(sk)->close_it) {
++ tcp_send_ack(sk);
++ tcp_sk(sk)->ops->time_wait(sk, TCP_TIME_WAIT, 0);
++ }
++
++ if (queued == -1 && !sock_flag(meta_sk, SOCK_DEAD))
++ meta_sk->sk_data_ready(meta_sk);
++}
++
++
++int mptcp_check_req(struct sk_buff *skb, struct net *net)
++{
++ const struct tcphdr *th = tcp_hdr(skb);
++ struct sock *meta_sk = NULL;
++
++ /* MPTCP structures not initialized */
++ if (mptcp_init_failed)
++ return 0;
++
++ if (skb->protocol == htons(ETH_P_IP))
++ meta_sk = mptcp_v4_search_req(th->source, ip_hdr(skb)->saddr,
++ ip_hdr(skb)->daddr, net);
++#if IS_ENABLED(CONFIG_IPV6)
++ else /* IPv6 */
++ meta_sk = mptcp_v6_search_req(th->source, &ipv6_hdr(skb)->saddr,
++ &ipv6_hdr(skb)->daddr, net);
++#endif /* CONFIG_IPV6 */
++
++ if (!meta_sk)
++ return 0;
++
++ TCP_SKB_CB(skb)->mptcp_flags |= MPTCPHDR_JOIN;
++
++ bh_lock_sock_nested(meta_sk);
++ if (sock_owned_by_user(meta_sk)) {
++ skb->sk = meta_sk;
++ if (unlikely(sk_add_backlog(meta_sk, skb,
++ meta_sk->sk_rcvbuf + meta_sk->sk_sndbuf))) {
++ bh_unlock_sock(meta_sk);
++ NET_INC_STATS_BH(net, LINUX_MIB_TCPBACKLOGDROP);
++ sock_put(meta_sk); /* Taken by mptcp_search_req */
++ kfree_skb(skb);
++ return 1;
++ }
++ } else if (skb->protocol == htons(ETH_P_IP)) {
++ tcp_v4_do_rcv(meta_sk, skb);
++#if IS_ENABLED(CONFIG_IPV6)
++ } else { /* IPv6 */
++ tcp_v6_do_rcv(meta_sk, skb);
++#endif /* CONFIG_IPV6 */
++ }
++ bh_unlock_sock(meta_sk);
++ sock_put(meta_sk); /* Taken by mptcp_vX_search_req */
++ return 1;
++}
++
++struct mp_join *mptcp_find_join(const struct sk_buff *skb)
++{
++ const struct tcphdr *th = tcp_hdr(skb);
++ unsigned char *ptr;
++ int length = (th->doff * 4) - sizeof(struct tcphdr);
++
++ /* Jump through the options to check whether JOIN is there */
++ ptr = (unsigned char *)(th + 1);
++ while (length > 0) {
++ int opcode = *ptr++;
++ int opsize;
++
++ switch (opcode) {
++ case TCPOPT_EOL:
++ return NULL;
++ case TCPOPT_NOP: /* Ref: RFC 793 section 3.1 */
++ length--;
++ continue;
++ default:
++ opsize = *ptr++;
++ if (opsize < 2) /* "silly options" */
++ return NULL;
++ if (opsize > length)
++ return NULL; /* don't parse partial options */
++ if (opcode == TCPOPT_MPTCP &&
++ ((struct mptcp_option *)(ptr - 2))->sub == MPTCP_SUB_JOIN) {
++ return (struct mp_join *)(ptr - 2);
++ }
++ ptr += opsize - 2;
++ length -= opsize;
++ }
++ }
++ return NULL;
++}
++
++int mptcp_lookup_join(struct sk_buff *skb, struct inet_timewait_sock *tw)
++{
++ const struct mptcp_cb *mpcb;
++ struct sock *meta_sk;
++ u32 token;
++ bool meta_v4;
++ struct mp_join *join_opt = mptcp_find_join(skb);
++ if (!join_opt)
++ return 0;
++
++ /* MPTCP structures were not initialized, so return error */
++ if (mptcp_init_failed)
++ return -1;
++
++ token = join_opt->u.syn.token;
++ meta_sk = mptcp_hash_find(dev_net(skb_dst(skb)->dev), token);
++ if (!meta_sk) {
++ mptcp_debug("%s:mpcb not found:%x\n", __func__, token);
++ return -1;
++ }
++
++ meta_v4 = meta_sk->sk_family == AF_INET;
++ if (meta_v4) {
++ if (skb->protocol == htons(ETH_P_IPV6)) {
++ mptcp_debug("SYN+MP_JOIN with IPV6 address on pure IPV4 meta\n");
++ sock_put(meta_sk); /* Taken by mptcp_hash_find */
++ return -1;
++ }
++ } else if (skb->protocol == htons(ETH_P_IP) &&
++ inet6_sk(meta_sk)->ipv6only) {
++ mptcp_debug("SYN+MP_JOIN with IPV4 address on IPV6_V6ONLY meta\n");
++ sock_put(meta_sk); /* Taken by mptcp_hash_find */
++ return -1;
++ }
++
++ mpcb = tcp_sk(meta_sk)->mpcb;
++ if (mpcb->infinite_mapping_rcv || mpcb->send_infinite_mapping) {
++ /* We are in fallback-mode on the reception-side -
++ * no new subflows!
++ */
++ sock_put(meta_sk); /* Taken by mptcp_hash_find */
++ return -1;
++ }
++
++ /* Coming from time-wait-sock processing in tcp_v4_rcv.
++ * We have to deschedule it before continuing, because otherwise
++ * mptcp_v4_do_rcv will hit again on it inside tcp_v4_hnd_req.
++ */
++ if (tw) {
++ inet_twsk_deschedule(tw, &tcp_death_row);
++ inet_twsk_put(tw);
++ }
++
++ TCP_SKB_CB(skb)->mptcp_flags |= MPTCPHDR_JOIN;
++ /* OK, this is a new syn/join, let's create a new open request and
++ * send syn+ack
++ */
++ bh_lock_sock_nested(meta_sk);
++ if (sock_owned_by_user(meta_sk)) {
++ skb->sk = meta_sk;
++ if (unlikely(sk_add_backlog(meta_sk, skb,
++ meta_sk->sk_rcvbuf + meta_sk->sk_sndbuf))) {
++ bh_unlock_sock(meta_sk);
++ NET_INC_STATS_BH(sock_net(meta_sk),
++ LINUX_MIB_TCPBACKLOGDROP);
++ sock_put(meta_sk); /* Taken by mptcp_hash_find */
++ kfree_skb(skb);
++ return 1;
++ }
++ } else if (skb->protocol == htons(ETH_P_IP)) {
++ tcp_v4_do_rcv(meta_sk, skb);
++#if IS_ENABLED(CONFIG_IPV6)
++ } else {
++ tcp_v6_do_rcv(meta_sk, skb);
++#endif /* CONFIG_IPV6 */
++ }
++ bh_unlock_sock(meta_sk);
++ sock_put(meta_sk); /* Taken by mptcp_hash_find */
++ return 1;
++}
++
++int mptcp_do_join_short(struct sk_buff *skb,
++ const struct mptcp_options_received *mopt,
++ struct net *net)
++{
++ struct sock *meta_sk;
++ u32 token;
++ bool meta_v4;
++
++ token = mopt->mptcp_rem_token;
++ meta_sk = mptcp_hash_find(net, token);
++ if (!meta_sk) {
++ mptcp_debug("%s:mpcb not found:%x\n", __func__, token);
++ return -1;
++ }
++
++ meta_v4 = meta_sk->sk_family == AF_INET;
++ if (meta_v4) {
++ if (skb->protocol == htons(ETH_P_IPV6)) {
++ mptcp_debug("SYN+MP_JOIN with IPV6 address on pure IPV4 meta\n");
++ sock_put(meta_sk); /* Taken by mptcp_hash_find */
++ return -1;
++ }
++ } else if (skb->protocol == htons(ETH_P_IP) &&
++ inet6_sk(meta_sk)->ipv6only) {
++ mptcp_debug("SYN+MP_JOIN with IPV4 address on IPV6_V6ONLY meta\n");
++ sock_put(meta_sk); /* Taken by mptcp_hash_find */
++ return -1;
++ }
++
++ TCP_SKB_CB(skb)->mptcp_flags |= MPTCPHDR_JOIN;
++
++ /* OK, this is a new syn/join, let's create a new open request and
++ * send syn+ack
++ */
++ bh_lock_sock(meta_sk);
++
++ /* This check is also done in mptcp_vX_do_rcv. But, there we cannot
++ * call tcp_vX_send_reset, because we hold already two socket-locks.
++ * (the listener and the meta from above)
++ *
++ * And the send-reset will try to take yet another one (ip_send_reply).
++ * Thus, we propagate the reset up to tcp_rcv_state_process.
++ */
++ if (tcp_sk(meta_sk)->mpcb->infinite_mapping_rcv ||
++ tcp_sk(meta_sk)->mpcb->send_infinite_mapping ||
++ meta_sk->sk_state == TCP_CLOSE || !tcp_sk(meta_sk)->inside_tk_table) {
++ bh_unlock_sock(meta_sk);
++ sock_put(meta_sk); /* Taken by mptcp_hash_find */
++ return -1;
++ }
++
++ if (sock_owned_by_user(meta_sk)) {
++ skb->sk = meta_sk;
++ if (unlikely(sk_add_backlog(meta_sk, skb,
++ meta_sk->sk_rcvbuf + meta_sk->sk_sndbuf)))
++ NET_INC_STATS_BH(net, LINUX_MIB_TCPBACKLOGDROP);
++ else
++ /* Must make sure that upper layers won't free the
++ * skb if it is added to the backlog-queue.
++ */
++ skb_get(skb);
++ } else {
++ /* mptcp_v4_do_rcv tries to free the skb - we prevent this, as
++ * the skb will finally be freed by tcp_v4_do_rcv (where we are
++ * coming from)
++ */
++ skb_get(skb);
++ if (skb->protocol == htons(ETH_P_IP)) {
++ tcp_v4_do_rcv(meta_sk, skb);
++#if IS_ENABLED(CONFIG_IPV6)
++ } else { /* IPv6 */
++ tcp_v6_do_rcv(meta_sk, skb);
++#endif /* CONFIG_IPV6 */
++ }
++ }
++
++ bh_unlock_sock(meta_sk);
++ sock_put(meta_sk); /* Taken by mptcp_hash_find */
++ return 0;
++}
++
++/**
++ * Equivalent of tcp_fin() for MPTCP.
++ * May only be called once the FIN is a valid part of the data
++ * sequence number space - not earlier, while there are still holes.
++ */
++void mptcp_fin(struct sock *meta_sk)
++{
++ struct sock *sk = NULL, *sk_it;
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct mptcp_cb *mpcb = meta_tp->mpcb;
++
++ mptcp_for_each_sk(mpcb, sk_it) {
++ if (tcp_sk(sk_it)->mptcp->path_index == mpcb->dfin_path_index) {
++ sk = sk_it;
++ break;
++ }
++ }
++
++ if (!sk || sk->sk_state == TCP_CLOSE)
++ sk = mptcp_select_ack_sock(meta_sk);
++
++ inet_csk_schedule_ack(sk);
++
++ meta_sk->sk_shutdown |= RCV_SHUTDOWN;
++ sock_set_flag(meta_sk, SOCK_DONE);
++
++ switch (meta_sk->sk_state) {
++ case TCP_SYN_RECV:
++ case TCP_ESTABLISHED:
++ /* Move to CLOSE_WAIT */
++ tcp_set_state(meta_sk, TCP_CLOSE_WAIT);
++ inet_csk(sk)->icsk_ack.pingpong = 1;
++ break;
++
++ case TCP_CLOSE_WAIT:
++ case TCP_CLOSING:
++ /* Received a retransmission of the FIN, do
++ * nothing.
++ */
++ break;
++ case TCP_LAST_ACK:
++ /* RFC793: Remain in the LAST-ACK state. */
++ break;
++
++ case TCP_FIN_WAIT1:
++ /* This case occurs when a simultaneous close
++ * happens, we must ack the received FIN and
++ * enter the CLOSING state.
++ */
++ tcp_send_ack(sk);
++ tcp_set_state(meta_sk, TCP_CLOSING);
++ break;
++ case TCP_FIN_WAIT2:
++ /* Received a FIN -- send ACK and enter TIME_WAIT. */
++ tcp_send_ack(sk);
++ meta_tp->ops->time_wait(meta_sk, TCP_TIME_WAIT, 0);
++ break;
++ default:
++ /* Only TCP_LISTEN and TCP_CLOSE are left, in these
++ * cases we should never reach this piece of code.
++ */
++ pr_err("%s: Impossible, meta_sk->sk_state=%d\n", __func__,
++ meta_sk->sk_state);
++ break;
++ }
++
++ /* It _is_ possible that we have something out-of-order _after_ the
++ * FIN. Probably, we should reset in this case. For now, drop them.
++ */
++ mptcp_purge_ofo_queue(meta_tp);
++ sk_mem_reclaim(meta_sk);
++
++ if (!sock_flag(meta_sk, SOCK_DEAD)) {
++ meta_sk->sk_state_change(meta_sk);
++
++ /* Do not send POLL_HUP for half duplex close. */
++ if (meta_sk->sk_shutdown == SHUTDOWN_MASK ||
++ meta_sk->sk_state == TCP_CLOSE)
++ sk_wake_async(meta_sk, SOCK_WAKE_WAITD, POLL_HUP);
++ else
++ sk_wake_async(meta_sk, SOCK_WAKE_WAITD, POLL_IN);
++ }
++
++ return;
++}
++
++static void mptcp_xmit_retransmit_queue(struct sock *meta_sk)
++{
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct sk_buff *skb;
++
++ if (!meta_tp->packets_out)
++ return;
++
++ tcp_for_write_queue(skb, meta_sk) {
++ if (skb == tcp_send_head(meta_sk))
++ break;
++
++ if (mptcp_retransmit_skb(meta_sk, skb))
++ return;
++
++ if (skb == tcp_write_queue_head(meta_sk))
++ inet_csk_reset_xmit_timer(meta_sk, ICSK_TIME_RETRANS,
++ inet_csk(meta_sk)->icsk_rto,
++ TCP_RTO_MAX);
++ }
++}
++
++/* Handle the DATA_ACK */
++static void mptcp_data_ack(struct sock *sk, const struct sk_buff *skb)
++{
++ struct sock *meta_sk = mptcp_meta_sk(sk);
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk), *tp = tcp_sk(sk);
++ struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
++ u32 prior_snd_una = meta_tp->snd_una;
++ int prior_packets;
++ u32 nwin, data_ack, data_seq;
++ u16 data_len = 0;
++
++ /* A valid packet came in - subflow is operational again */
++ tp->pf = 0;
++
++ /* Even if there is no data-ack, we stop retransmitting.
++ * Except if this is a SYN/ACK. Then it is just a retransmission
++ */
++ if (tp->mptcp->pre_established && !tcp_hdr(skb)->syn) {
++ tp->mptcp->pre_established = 0;
++ sk_stop_timer(sk, &tp->mptcp->mptcp_ack_timer);
++ }
++
++ /* If we are in infinite mapping mode, rx_opt.data_ack has been
++ * set by mptcp_clean_rtx_infinite.
++ */
++ if (!(tcb->mptcp_flags & MPTCPHDR_ACK) && !tp->mpcb->infinite_mapping_snd)
++ goto exit;
++
++ data_ack = tp->mptcp->rx_opt.data_ack;
++
++ if (unlikely(!tp->mptcp->fully_established) &&
++ tp->mptcp->snt_isn + 1 != TCP_SKB_CB(skb)->ack_seq)
++ /* As soon as a subflow-data-ack (not acking syn, thus snt_isn + 1)
++ * includes a data-ack, we are fully established
++ */
++ mptcp_become_fully_estab(sk);
++
++ /* Get the data_seq */
++ if (mptcp_is_data_seq(skb)) {
++ data_seq = tp->mptcp->rx_opt.data_seq;
++ data_len = tp->mptcp->rx_opt.data_len;
++ } else {
++ data_seq = meta_tp->snd_wl1;
++ }
++
++ /* If the ack is older than previous acks
++ * then we can probably ignore it.
++ */
++ if (before(data_ack, prior_snd_una))
++ goto exit;
++
++ /* If the ack includes data we haven't sent yet, discard
++ * this segment (RFC793 Section 3.9).
++ */
++ if (after(data_ack, meta_tp->snd_nxt))
++ goto exit;
++
++ /*** Now, update the window - inspired by tcp_ack_update_window ***/
++ nwin = ntohs(tcp_hdr(skb)->window);
++
++ if (likely(!tcp_hdr(skb)->syn))
++ nwin <<= tp->rx_opt.snd_wscale;
++
++ if (tcp_may_update_window(meta_tp, data_ack, data_seq, nwin)) {
++ tcp_update_wl(meta_tp, data_seq);
++
++ /* Draft v09, Section 3.3.5:
++ * [...] It should only update its local receive window values
++ * when the largest sequence number allowed (i.e. DATA_ACK +
++ * receive window) increases. [...]
++ */
++ if (meta_tp->snd_wnd != nwin &&
++ !before(data_ack + nwin, tcp_wnd_end(meta_tp))) {
++ meta_tp->snd_wnd = nwin;
++
++ if (nwin > meta_tp->max_window)
++ meta_tp->max_window = nwin;
++ }
++ }
++ /*** Done, update the window ***/
++
++ /* We passed data and got it acked, remove any soft error
++ * log. Something worked...
++ */
++ sk->sk_err_soft = 0;
++ inet_csk(meta_sk)->icsk_probes_out = 0;
++ meta_tp->rcv_tstamp = tcp_time_stamp;
++ prior_packets = meta_tp->packets_out;
++ if (!prior_packets)
++ goto no_queue;
++
++ meta_tp->snd_una = data_ack;
++
++ mptcp_clean_rtx_queue(meta_sk, prior_snd_una);
++
++ /* We are in loss-state, and something got acked, retransmit the whole
++ * queue now!
++ */
++ if (inet_csk(meta_sk)->icsk_ca_state == TCP_CA_Loss &&
++ after(data_ack, prior_snd_una)) {
++ mptcp_xmit_retransmit_queue(meta_sk);
++ inet_csk(meta_sk)->icsk_ca_state = TCP_CA_Open;
++ }
++
++ /* Simplified version of tcp_new_space, because the snd-buffer
++ * is handled by all the subflows.
++ */
++ if (sock_flag(meta_sk, SOCK_QUEUE_SHRUNK)) {
++ sock_reset_flag(meta_sk, SOCK_QUEUE_SHRUNK);
++ if (meta_sk->sk_socket &&
++ test_bit(SOCK_NOSPACE, &meta_sk->sk_socket->flags))
++ meta_sk->sk_write_space(meta_sk);
++ }
++
++ if (meta_sk->sk_state != TCP_ESTABLISHED &&
++ mptcp_rcv_state_process(meta_sk, sk, skb, data_seq, data_len))
++ return;
++
++exit:
++ mptcp_push_pending_frames(meta_sk);
++
++ return;
++
++no_queue:
++ if (tcp_send_head(meta_sk))
++ tcp_ack_probe(meta_sk);
++
++ mptcp_push_pending_frames(meta_sk);
++
++ return;
++}
++
++void mptcp_clean_rtx_infinite(const struct sk_buff *skb, struct sock *sk)
++{
++ struct tcp_sock *tp = tcp_sk(sk), *meta_tp = tcp_sk(mptcp_meta_sk(sk));
++
++ if (!tp->mpcb->infinite_mapping_snd)
++ return;
++
++ /* The difference between the two write_seq counters represents the
++ * offset between the data-sequence and the subflow-sequence. As the
++ * mapping is infinite, this offset is constant.
++ *
++ * Thus, from this difference we can infer the meta snd_una.
++ */
++ tp->mptcp->rx_opt.data_ack = meta_tp->snd_nxt - tp->snd_nxt +
++ tp->snd_una;
++
++ mptcp_data_ack(sk, skb);
++}
++
++/**** static functions used by mptcp_parse_options */
++
++static void mptcp_send_reset_rem_id(const struct mptcp_cb *mpcb, u8 rem_id)
++{
++ struct sock *sk_it, *tmpsk;
++
++ mptcp_for_each_sk_safe(mpcb, sk_it, tmpsk) {
++ if (tcp_sk(sk_it)->mptcp->rem_id == rem_id) {
++ mptcp_reinject_data(sk_it, 0);
++ sk_it->sk_err = ECONNRESET;
++ if (tcp_need_reset(sk_it->sk_state))
++ tcp_sk(sk_it)->ops->send_active_reset(sk_it,
++ GFP_ATOMIC);
++ mptcp_sub_force_close(sk_it);
++ }
++ }
++}
++
++void mptcp_parse_options(const uint8_t *ptr, int opsize,
++ struct mptcp_options_received *mopt,
++ const struct sk_buff *skb)
++{
++ const struct mptcp_option *mp_opt = (struct mptcp_option *)ptr;
++
++ /* If the socket is mp-capable we would have a mopt. */
++ if (!mopt)
++ return;
++
++ switch (mp_opt->sub) {
++ case MPTCP_SUB_CAPABLE:
++ {
++ const struct mp_capable *mpcapable = (struct mp_capable *)ptr;
++
++ if (opsize != MPTCP_SUB_LEN_CAPABLE_SYN &&
++ opsize != MPTCP_SUB_LEN_CAPABLE_ACK) {
++ mptcp_debug("%s: mp_capable: bad option size %d\n",
++ __func__, opsize);
++ break;
++ }
++
++ if (!sysctl_mptcp_enabled)
++ break;
++
++ /* We only support MPTCP version 0 */
++ if (mpcapable->ver != 0)
++ break;
++
++ /* MPTCP-RFC 6824:
++ * "If receiving a message with the 'B' flag set to 1, and this
++ * is not understood, then this SYN MUST be silently ignored;"
++ */
++ if (mpcapable->b) {
++ mopt->drop_me = 1;
++ break;
++ }
++
++ /* MPTCP-RFC 6824:
++ * "An implementation that only supports this method MUST set
++ * bit "H" to 1, and bits "C" through "G" to 0."
++ */
++ if (!mpcapable->h)
++ break;
++
++ mopt->saw_mpc = 1;
++ mopt->dss_csum = sysctl_mptcp_checksum || mpcapable->a;
++
++ if (opsize >= MPTCP_SUB_LEN_CAPABLE_SYN)
++ mopt->mptcp_key = mpcapable->sender_key;
++
++ break;
++ }
++ case MPTCP_SUB_JOIN:
++ {
++ const struct mp_join *mpjoin = (struct mp_join *)ptr;
++
++ if (opsize != MPTCP_SUB_LEN_JOIN_SYN &&
++ opsize != MPTCP_SUB_LEN_JOIN_SYNACK &&
++ opsize != MPTCP_SUB_LEN_JOIN_ACK) {
++ mptcp_debug("%s: mp_join: bad option size %d\n",
++ __func__, opsize);
++ break;
++ }
++
++ /* saw_mpc must be set, because in tcp_check_req we assume that
++ * it is set to support falling back to regular TCP if a rexmitted
++ * SYN has no MP_CAPABLE or MP_JOIN
++ */
++ switch (opsize) {
++ case MPTCP_SUB_LEN_JOIN_SYN:
++ mopt->is_mp_join = 1;
++ mopt->saw_mpc = 1;
++ mopt->low_prio = mpjoin->b;
++ mopt->rem_id = mpjoin->addr_id;
++ mopt->mptcp_rem_token = mpjoin->u.syn.token;
++ mopt->mptcp_recv_nonce = mpjoin->u.syn.nonce;
++ break;
++ case MPTCP_SUB_LEN_JOIN_SYNACK:
++ mopt->saw_mpc = 1;
++ mopt->low_prio = mpjoin->b;
++ mopt->rem_id = mpjoin->addr_id;
++ mopt->mptcp_recv_tmac = mpjoin->u.synack.mac;
++ mopt->mptcp_recv_nonce = mpjoin->u.synack.nonce;
++ break;
++ case MPTCP_SUB_LEN_JOIN_ACK:
++ mopt->saw_mpc = 1;
++ mopt->join_ack = 1;
++ memcpy(mopt->mptcp_recv_mac, mpjoin->u.ack.mac, 20);
++ break;
++ }
++ break;
++ }
++ case MPTCP_SUB_DSS:
++ {
++ const struct mp_dss *mdss = (struct mp_dss *)ptr;
++ struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
++
++ /* We check opsize for the csum and non-csum case. We do this,
++ * because the draft says that the csum SHOULD be ignored if
++ * it has not been negotiated in the MP_CAPABLE but still is
++ * present in the data.
++ *
++ * It will get ignored later in mptcp_queue_skb.
++ */
++ if (opsize != mptcp_sub_len_dss(mdss, 0) &&
++ opsize != mptcp_sub_len_dss(mdss, 1)) {
++ mptcp_debug("%s: mp_dss: bad option size %d\n",
++ __func__, opsize);
++ break;
++ }
++
++ ptr += 4;
++
++ if (mdss->A) {
++ tcb->mptcp_flags |= MPTCPHDR_ACK;
++
++ if (mdss->a) {
++ mopt->data_ack = (u32) get_unaligned_be64(ptr);
++ ptr += MPTCP_SUB_LEN_ACK_64;
++ } else {
++ mopt->data_ack = get_unaligned_be32(ptr);
++ ptr += MPTCP_SUB_LEN_ACK;
++ }
++ }
++
++ tcb->dss_off = (ptr - skb_transport_header(skb));
++
++ if (mdss->M) {
++ if (mdss->m) {
++ u64 data_seq64 = get_unaligned_be64(ptr);
++
++ tcb->mptcp_flags |= MPTCPHDR_SEQ64_SET;
++ mopt->data_seq = (u32) data_seq64;
++
++ ptr += 12; /* 64-bit dseq + subseq */
++ } else {
++ mopt->data_seq = get_unaligned_be32(ptr);
++ ptr += 8; /* 32-bit dseq + subseq */
++ }
++ mopt->data_len = get_unaligned_be16(ptr);
++
++ tcb->mptcp_flags |= MPTCPHDR_SEQ;
++
++ /* Is a check-sum present? */
++ if (opsize == mptcp_sub_len_dss(mdss, 1))
++ tcb->mptcp_flags |= MPTCPHDR_DSS_CSUM;
++
++ /* DATA_FIN only possible with DSS-mapping */
++ if (mdss->F)
++ tcb->mptcp_flags |= MPTCPHDR_FIN;
++ }
++
++ break;
++ }
++ case MPTCP_SUB_ADD_ADDR:
++ {
++#if IS_ENABLED(CONFIG_IPV6)
++ const struct mp_add_addr *mpadd = (struct mp_add_addr *)ptr;
++
++ if ((mpadd->ipver == 4 && opsize != MPTCP_SUB_LEN_ADD_ADDR4 &&
++ opsize != MPTCP_SUB_LEN_ADD_ADDR4 + 2) ||
++ (mpadd->ipver == 6 && opsize != MPTCP_SUB_LEN_ADD_ADDR6 &&
++ opsize != MPTCP_SUB_LEN_ADD_ADDR6 + 2)) {
++#else
++ if (opsize != MPTCP_SUB_LEN_ADD_ADDR4 &&
++ opsize != MPTCP_SUB_LEN_ADD_ADDR4 + 2) {
++#endif /* CONFIG_IPV6 */
++ mptcp_debug("%s: mp_add_addr: bad option size %d\n",
++ __func__, opsize);
++ break;
++ }
++
++ /* We have to manually parse the options if we got two of them. */
++ if (mopt->saw_add_addr) {
++ mopt->more_add_addr = 1;
++ break;
++ }
++ mopt->saw_add_addr = 1;
++ mopt->add_addr_ptr = ptr;
++ break;
++ }
++ case MPTCP_SUB_REMOVE_ADDR:
++ if ((opsize - MPTCP_SUB_LEN_REMOVE_ADDR) < 0) {
++ mptcp_debug("%s: mp_remove_addr: bad option size %d\n",
++ __func__, opsize);
++ break;
++ }
++
++ if (mopt->saw_rem_addr) {
++ mopt->more_rem_addr = 1;
++ break;
++ }
++ mopt->saw_rem_addr = 1;
++ mopt->rem_addr_ptr = ptr;
++ break;
++ case MPTCP_SUB_PRIO:
++ {
++ const struct mp_prio *mpprio = (struct mp_prio *)ptr;
++
++ if (opsize != MPTCP_SUB_LEN_PRIO &&
++ opsize != MPTCP_SUB_LEN_PRIO_ADDR) {
++ mptcp_debug("%s: mp_prio: bad option size %d\n",
++ __func__, opsize);
++ break;
++ }
++
++ mopt->saw_low_prio = 1;
++ mopt->low_prio = mpprio->b;
++
++ if (opsize == MPTCP_SUB_LEN_PRIO_ADDR) {
++ mopt->saw_low_prio = 2;
++ mopt->prio_addr_id = mpprio->addr_id;
++ }
++ break;
++ }
++ case MPTCP_SUB_FAIL:
++ if (opsize != MPTCP_SUB_LEN_FAIL) {
++ mptcp_debug("%s: mp_fail: bad option size %d\n",
++ __func__, opsize);
++ break;
++ }
++ mopt->mp_fail = 1;
++ break;
++ case MPTCP_SUB_FCLOSE:
++ if (opsize != MPTCP_SUB_LEN_FCLOSE) {
++ mptcp_debug("%s: mp_fclose: bad option size %d\n",
++ __func__, opsize);
++ break;
++ }
++
++ mopt->mp_fclose = 1;
++ mopt->mptcp_key = ((struct mp_fclose *)ptr)->key;
++
++ break;
++ default:
++ mptcp_debug("%s: Received unknown subtype: %d\n",
++ __func__, mp_opt->sub);
++ break;
++ }
++}
++
++/** Parse only MPTCP options */
++void tcp_parse_mptcp_options(const struct sk_buff *skb,
++ struct mptcp_options_received *mopt)
++{
++ const struct tcphdr *th = tcp_hdr(skb);
++ int length = (th->doff * 4) - sizeof(struct tcphdr);
++ const unsigned char *ptr = (const unsigned char *)(th + 1);
++
++ while (length > 0) {
++ int opcode = *ptr++;
++ int opsize;
++
++ switch (opcode) {
++ case TCPOPT_EOL:
++ return;
++ case TCPOPT_NOP: /* Ref: RFC 793 section 3.1 */
++ length--;
++ continue;
++ default:
++ opsize = *ptr++;
++ if (opsize < 2) /* "silly options" */
++ return;
++ if (opsize > length)
++ return; /* don't parse partial options */
++ if (opcode == TCPOPT_MPTCP)
++ mptcp_parse_options(ptr - 2, opsize, mopt, skb);
++ }
++ ptr += opsize - 2;
++ length -= opsize;
++ }
++}
++
++int mptcp_check_rtt(const struct tcp_sock *tp, int time)
++{
++ struct mptcp_cb *mpcb = tp->mpcb;
++ struct sock *sk;
++ u32 rtt_max = 0;
++
++ /* In MPTCP, we take the max delay across all flows,
++ * in order to take into account meta-reordering buffers.
++ */
++ mptcp_for_each_sk(mpcb, sk) {
++ if (!mptcp_sk_can_recv(sk))
++ continue;
++
++ if (rtt_max < tcp_sk(sk)->rcv_rtt_est.rtt)
++ rtt_max = tcp_sk(sk)->rcv_rtt_est.rtt;
++ }
++ if (time < (rtt_max >> 3) || !rtt_max)
++ return 1;
++
++ return 0;
++}
++
++static void mptcp_handle_add_addr(const unsigned char *ptr, struct sock *sk)
++{
++ struct mp_add_addr *mpadd = (struct mp_add_addr *)ptr;
++ struct mptcp_cb *mpcb = tcp_sk(sk)->mpcb;
++ __be16 port = 0;
++ union inet_addr addr;
++ sa_family_t family;
++
++ if (mpadd->ipver == 4) {
++ if (mpadd->len == MPTCP_SUB_LEN_ADD_ADDR4 + 2)
++ port = mpadd->u.v4.port;
++ family = AF_INET;
++ addr.in = mpadd->u.v4.addr;
++#if IS_ENABLED(CONFIG_IPV6)
++ } else if (mpadd->ipver == 6) {
++ if (mpadd->len == MPTCP_SUB_LEN_ADD_ADDR6 + 2)
++ port = mpadd->u.v6.port;
++ family = AF_INET6;
++ addr.in6 = mpadd->u.v6.addr;
++#endif /* CONFIG_IPV6 */
++ } else {
++ return;
++ }
++
++ if (mpcb->pm_ops->add_raddr)
++ mpcb->pm_ops->add_raddr(mpcb, &addr, family, port, mpadd->addr_id);
++}
++
++static void mptcp_handle_rem_addr(const unsigned char *ptr, struct sock *sk)
++{
++ struct mp_remove_addr *mprem = (struct mp_remove_addr *)ptr;
++ int i;
++ u8 rem_id;
++ struct mptcp_cb *mpcb = tcp_sk(sk)->mpcb;
++
++ for (i = 0; i <= mprem->len - MPTCP_SUB_LEN_REMOVE_ADDR; i++) {
++ rem_id = (&mprem->addrs_id)[i];
++
++ if (mpcb->pm_ops->rem_raddr)
++ mpcb->pm_ops->rem_raddr(mpcb, rem_id);
++ mptcp_send_reset_rem_id(mpcb, rem_id);
++ }
++}
++
++static void mptcp_parse_addropt(const struct sk_buff *skb, struct sock *sk)
++{
++ struct tcphdr *th = tcp_hdr(skb);
++ unsigned char *ptr;
++ int length = (th->doff * 4) - sizeof(struct tcphdr);
++
++ /* Jump through the options to check whether ADD_ADDR is there */
++ ptr = (unsigned char *)(th + 1);
++ while (length > 0) {
++ int opcode = *ptr++;
++ int opsize;
++
++ switch (opcode) {
++ case TCPOPT_EOL:
++ return;
++ case TCPOPT_NOP:
++ length--;
++ continue;
++ default:
++ opsize = *ptr++;
++ if (opsize < 2)
++ return;
++ if (opsize > length)
++ return; /* don't parse partial options */
++ if (opcode == TCPOPT_MPTCP &&
++ ((struct mptcp_option *)ptr)->sub == MPTCP_SUB_ADD_ADDR) {
++#if IS_ENABLED(CONFIG_IPV6)
++ struct mp_add_addr *mpadd = (struct mp_add_addr *)ptr;
++ if ((mpadd->ipver == 4 && opsize != MPTCP_SUB_LEN_ADD_ADDR4 &&
++ opsize != MPTCP_SUB_LEN_ADD_ADDR4 + 2) ||
++ (mpadd->ipver == 6 && opsize != MPTCP_SUB_LEN_ADD_ADDR6 &&
++ opsize != MPTCP_SUB_LEN_ADD_ADDR6 + 2))
++#else
++ if (opsize != MPTCP_SUB_LEN_ADD_ADDR4 &&
++ opsize != MPTCP_SUB_LEN_ADD_ADDR4 + 2)
++#endif /* CONFIG_IPV6 */
++ goto cont;
++
++ mptcp_handle_add_addr(ptr, sk);
++ }
++ if (opcode == TCPOPT_MPTCP &&
++ ((struct mptcp_option *)ptr)->sub == MPTCP_SUB_REMOVE_ADDR) {
++ if ((opsize - MPTCP_SUB_LEN_REMOVE_ADDR) < 0)
++ goto cont;
++
++ mptcp_handle_rem_addr(ptr, sk);
++ }
++cont:
++ ptr += opsize - 2;
++ length -= opsize;
++ }
++ }
++ return;
++}
++
++static inline int mptcp_mp_fail_rcvd(struct sock *sk, const struct tcphdr *th)
++{
++ struct mptcp_tcp_sock *mptcp = tcp_sk(sk)->mptcp;
++ struct sock *meta_sk = mptcp_meta_sk(sk);
++ struct mptcp_cb *mpcb = tcp_sk(sk)->mpcb;
++
++ if (unlikely(mptcp->rx_opt.mp_fail)) {
++ mptcp->rx_opt.mp_fail = 0;
++
++ if (!th->rst && !mpcb->infinite_mapping_snd) {
++ struct sock *sk_it;
++
++ mpcb->send_infinite_mapping = 1;
++ /* We resend everything that has not been acknowledged */
++ meta_sk->sk_send_head = tcp_write_queue_head(meta_sk);
++
++ /* We artificially restart the whole send-queue. Thus,
++ * it is as if no packets are in flight
++ */
++ tcp_sk(meta_sk)->packets_out = 0;
++
++ /* If the snd_nxt already wrapped around, we have to
++ * undo the wrapping, as we are restarting from snd_una
++ * on.
++ */
++ if (tcp_sk(meta_sk)->snd_nxt < tcp_sk(meta_sk)->snd_una) {
++ mpcb->snd_high_order[mpcb->snd_hiseq_index] -= 2;
++ mpcb->snd_hiseq_index = mpcb->snd_hiseq_index ? 0 : 1;
++ }
++ tcp_sk(meta_sk)->snd_nxt = tcp_sk(meta_sk)->snd_una;
++
++ /* Trigger a sending on the meta. */
++ mptcp_push_pending_frames(meta_sk);
++
++ mptcp_for_each_sk(mpcb, sk_it) {
++ if (sk != sk_it)
++ mptcp_sub_force_close(sk_it);
++ }
++ }
++
++ return 0;
++ }
++
++ if (unlikely(mptcp->rx_opt.mp_fclose)) {
++ struct sock *sk_it, *tmpsk;
++
++ mptcp->rx_opt.mp_fclose = 0;
++ if (mptcp->rx_opt.mptcp_key != mpcb->mptcp_loc_key)
++ return 0;
++
++ if (tcp_need_reset(sk->sk_state))
++ tcp_sk(sk)->ops->send_active_reset(sk, GFP_ATOMIC);
++
++ mptcp_for_each_sk_safe(mpcb, sk_it, tmpsk)
++ mptcp_sub_force_close(sk_it);
++
++ tcp_reset(meta_sk);
++
++ return 1;
++ }
++
++ return 0;
++}
++
++static inline void mptcp_path_array_check(struct sock *meta_sk)
++{
++ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++
++ if (unlikely(mpcb->list_rcvd)) {
++ mpcb->list_rcvd = 0;
++ if (mpcb->pm_ops->new_remote_address)
++ mpcb->pm_ops->new_remote_address(meta_sk);
++ }
++}
++
++int mptcp_handle_options(struct sock *sk, const struct tcphdr *th,
++ const struct sk_buff *skb)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct mptcp_options_received *mopt = &tp->mptcp->rx_opt;
++
++ if (tp->mpcb->infinite_mapping_rcv || tp->mpcb->infinite_mapping_snd)
++ return 0;
++
++ if (mptcp_mp_fail_rcvd(sk, th))
++ return 1;
++
++ /* RFC 6824, Section 3.3:
++ * If a checksum is not present when its use has been negotiated, the
++ * receiver MUST close the subflow with a RST as it is considered broken.
++ */
++ if (mptcp_is_data_seq(skb) && tp->mpcb->dss_csum &&
++ !(TCP_SKB_CB(skb)->mptcp_flags & MPTCPHDR_DSS_CSUM)) {
++ if (tcp_need_reset(sk->sk_state))
++ tp->ops->send_active_reset(sk, GFP_ATOMIC);
++
++ mptcp_sub_force_close(sk);
++ return 1;
++ }
++
++ /* We have to acknowledge retransmissions of the third
++ * ack.
++ */
++ if (mopt->join_ack) {
++ tcp_send_delayed_ack(sk);
++ mopt->join_ack = 0;
++ }
++
++ if (mopt->saw_add_addr || mopt->saw_rem_addr) {
++ if (mopt->more_add_addr || mopt->more_rem_addr) {
++ mptcp_parse_addropt(skb, sk);
++ } else {
++ if (mopt->saw_add_addr)
++ mptcp_handle_add_addr(mopt->add_addr_ptr, sk);
++ if (mopt->saw_rem_addr)
++ mptcp_handle_rem_addr(mopt->rem_addr_ptr, sk);
++ }
++
++ mopt->more_add_addr = 0;
++ mopt->saw_add_addr = 0;
++ mopt->more_rem_addr = 0;
++ mopt->saw_rem_addr = 0;
++ }
++ if (mopt->saw_low_prio) {
++ if (mopt->saw_low_prio == 1) {
++ tp->mptcp->rcv_low_prio = mopt->low_prio;
++ } else {
++ struct sock *sk_it;
++ mptcp_for_each_sk(tp->mpcb, sk_it) {
++ struct mptcp_tcp_sock *mptcp = tcp_sk(sk_it)->mptcp;
++ if (mptcp->rem_id == mopt->prio_addr_id)
++ mptcp->rcv_low_prio = mopt->low_prio;
++ }
++ }
++ mopt->saw_low_prio = 0;
++ }
++
++ mptcp_data_ack(sk, skb);
++
++ mptcp_path_array_check(mptcp_meta_sk(sk));
++ /* Socket may have been mp_killed by a REMOVE_ADDR */
++ if (tp->mp_killed)
++ return 1;
++
++ return 0;
++}
++
++/* In case of fastopen, some data can already be in the write queue.
++ * We need to update the sequence number of the segments as they
++ * were initially TCP sequence numbers.
++ */
++static void mptcp_rcv_synsent_fastopen(struct sock *meta_sk)
++{
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct tcp_sock *master_tp = tcp_sk(meta_tp->mpcb->master_sk);
++ struct sk_buff *skb;
++ u32 new_mapping = meta_tp->write_seq - master_tp->snd_una;
++
++ /* There should only be one skb in write queue: the data not
++ * acknowledged in the SYN+ACK. In this case, we need to map
++ * this data to data sequence numbers.
++ */
++ skb_queue_walk(&meta_sk->sk_write_queue, skb) {
++ /* If the server only acknowledges partially the data sent in
++ * the SYN, we need to trim the acknowledged part because
++ * we don't want to retransmit this already received data.
++ * When we reach this point, tcp_ack() has already cleaned up
++ * fully acked segments. However, tcp trims partially acked
++ * segments only when retransmitting. Since MPTCP comes into
++ * play only now, we will fake an initial transmit, and
++ * retransmit_skb() will not be called. The following fragment
++ * comes from __tcp_retransmit_skb().
++ */
++ if (before(TCP_SKB_CB(skb)->seq, master_tp->snd_una)) {
++ BUG_ON(before(TCP_SKB_CB(skb)->end_seq,
++ master_tp->snd_una));
++ /* tcp_trim_head can only return ENOMEM if skb is
++ * cloned. It is not the case here (see
++ * tcp_send_syn_data).
++ */
++ BUG_ON(tcp_trim_head(meta_sk, skb, master_tp->snd_una -
++ TCP_SKB_CB(skb)->seq));
++ }
++
++ TCP_SKB_CB(skb)->seq += new_mapping;
++ TCP_SKB_CB(skb)->end_seq += new_mapping;
++ }
++
++ /* We can advance write_seq by the number of bytes unacknowledged
++ * and that were mapped in the previous loop.
++ */
++ meta_tp->write_seq += master_tp->write_seq - master_tp->snd_una;
++
++ /* The packets from the master_sk will be attached to it later.
++ * Until that time, its write queue is empty, and
++ * write_seq must align with snd_una.
++ */
++ master_tp->snd_nxt = master_tp->write_seq = master_tp->snd_una;
++ master_tp->packets_out = 0;
++
++ /* Although this data has already been sent over the subsk,
++ * it has never been sent over the meta_sk, so we rewind
++ * the send_head so that tcp considers it an initial send
++ * (instead of a retransmit).
++ */
++ meta_sk->sk_send_head = tcp_write_queue_head(meta_sk);
++}
++
++/* The skptr is needed, because if we become MPTCP-capable, we have to switch
++ * from meta-socket to master-socket.
++ *
++ * @return: 1 - we want to reset this connection
++ * 2 - we want to discard the received syn/ack
++ * 0 - everything is fine - continue
++ */
++int mptcp_rcv_synsent_state_process(struct sock *sk, struct sock **skptr,
++ const struct sk_buff *skb,
++ const struct mptcp_options_received *mopt)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++
++ if (mptcp(tp)) {
++ u8 hash_mac_check[20];
++ struct mptcp_cb *mpcb = tp->mpcb;
++
++ mptcp_hmac_sha1((u8 *)&mpcb->mptcp_rem_key,
++ (u8 *)&mpcb->mptcp_loc_key,
++ (u8 *)&tp->mptcp->rx_opt.mptcp_recv_nonce,
++ (u8 *)&tp->mptcp->mptcp_loc_nonce,
++ (u32 *)hash_mac_check);
++ if (memcmp(hash_mac_check,
++ (char *)&tp->mptcp->rx_opt.mptcp_recv_tmac, 8)) {
++ mptcp_sub_force_close(sk);
++ return 1;
++ }
++
++ /* Set this flag in order to postpone data sending
++ * until the 4th ack arrives.
++ */
++ tp->mptcp->pre_established = 1;
++ tp->mptcp->rcv_low_prio = tp->mptcp->rx_opt.low_prio;
++
++ mptcp_hmac_sha1((u8 *)&mpcb->mptcp_loc_key,
++ (u8 *)&mpcb->mptcp_rem_key,
++ (u8 *)&tp->mptcp->mptcp_loc_nonce,
++ (u8 *)&tp->mptcp->rx_opt.mptcp_recv_nonce,
++ (u32 *)&tp->mptcp->sender_mac[0]);
++
++ } else if (mopt->saw_mpc) {
++ struct sock *meta_sk = sk;
++
++ if (mptcp_create_master_sk(sk, mopt->mptcp_key,
++ ntohs(tcp_hdr(skb)->window)))
++ return 2;
++
++ sk = tcp_sk(sk)->mpcb->master_sk;
++ *skptr = sk;
++ tp = tcp_sk(sk);
++
++ /* If fastopen was used data might be in the send queue. We
++ * need to update their sequence number to MPTCP-level seqno.
++ * Note that it can happen in rare cases that fastopen_req is
++ * NULL and syn_data is 0 but fastopen indeed occurred and
++ * data has been queued in the write queue (but not sent).
++ * Example of such rare cases: connect is non-blocking and
++ * TFO is configured to work without cookies.
++ */
++ if (!skb_queue_empty(&meta_sk->sk_write_queue))
++ mptcp_rcv_synsent_fastopen(meta_sk);
++
++ /* -1, because the SYN consumed 1 byte. In case of TFO, we
++ * start the subflow-sequence number as if the data of the SYN
++ * is not part of any mapping.
++ */
++ tp->mptcp->snt_isn = tp->snd_una - 1;
++ tp->mpcb->dss_csum = mopt->dss_csum;
++ tp->mptcp->include_mpc = 1;
++
++ /* Ensure that fastopen is handled at the meta-level. */
++ tp->fastopen_req = NULL;
++
++ sk_set_socket(sk, mptcp_meta_sk(sk)->sk_socket);
++ sk->sk_wq = mptcp_meta_sk(sk)->sk_wq;
++
++ /* hold in sk_clone_lock due to initialization to 2 */
++ sock_put(sk);
++ } else {
++ tp->request_mptcp = 0;
++
++ if (tp->inside_tk_table)
++ mptcp_hash_remove(tp);
++ }
++
++ if (mptcp(tp))
++ tp->mptcp->rcv_isn = TCP_SKB_CB(skb)->seq;
++
++ return 0;
++}
++
++bool mptcp_should_expand_sndbuf(const struct sock *sk)
++{
++ const struct sock *sk_it;
++ const struct sock *meta_sk = mptcp_meta_sk(sk);
++ const struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ int cnt_backups = 0;
++ int backup_available = 0;
++
++ /* We circumvent this check in tcp_check_space, because we want to
++ * always call sk_write_space. So, we reproduce the check here.
++ */
++ if (!meta_sk->sk_socket ||
++ !test_bit(SOCK_NOSPACE, &meta_sk->sk_socket->flags))
++ return false;
++
++ /* If the user specified a specific send buffer setting, do
++ * not modify it.
++ */
++ if (meta_sk->sk_userlocks & SOCK_SNDBUF_LOCK)
++ return false;
++
++ /* If we are under global TCP memory pressure, do not expand. */
++ if (sk_under_memory_pressure(meta_sk))
++ return false;
++
++ /* If we are under soft global TCP memory pressure, do not expand. */
++ if (sk_memory_allocated(meta_sk) >= sk_prot_mem_limits(meta_sk, 0))
++ return false;
++
++ /* For MPTCP we look for a subsocket that could send data.
++ * If we found one, then we update the send-buffer.
++ */
++ mptcp_for_each_sk(meta_tp->mpcb, sk_it) {
++ struct tcp_sock *tp_it = tcp_sk(sk_it);
++
++ if (!mptcp_sk_can_send(sk_it))
++ continue;
++
++ /* Backup-flows have to be counted - if there is no other
++ * subflow we take the backup-flow into account.
++ */
++ if (tp_it->mptcp->rcv_low_prio || tp_it->mptcp->low_prio)
++ cnt_backups++;
++
++ if (tp_it->packets_out < tp_it->snd_cwnd) {
++ if (tp_it->mptcp->rcv_low_prio || tp_it->mptcp->low_prio) {
++ backup_available = 1;
++ continue;
++ }
++ return true;
++ }
++ }
++
++ /* Backup-flow is available for sending - update send-buffer */
++ if (meta_tp->mpcb->cnt_established == cnt_backups && backup_available)
++ return true;
++ return false;
++}
++
++void mptcp_init_buffer_space(struct sock *sk)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct sock *meta_sk = mptcp_meta_sk(sk);
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ int space;
++
++ tcp_init_buffer_space(sk);
++
++ if (is_master_tp(tp)) {
++ meta_tp->rcvq_space.space = meta_tp->rcv_wnd;
++ meta_tp->rcvq_space.time = tcp_time_stamp;
++ meta_tp->rcvq_space.seq = meta_tp->copied_seq;
++
++ /* If there is only one subflow, we just use regular TCP
++ * autotuning. User-locks are handled already by
++ * tcp_init_buffer_space
++ */
++ meta_tp->window_clamp = tp->window_clamp;
++ meta_tp->rcv_ssthresh = tp->rcv_ssthresh;
++ meta_sk->sk_rcvbuf = sk->sk_rcvbuf;
++ meta_sk->sk_sndbuf = sk->sk_sndbuf;
++
++ return;
++ }
++
++ if (meta_sk->sk_userlocks & SOCK_RCVBUF_LOCK)
++ goto snd_buf;
++
++ /* Adding a new subflow to the rcv-buffer space. We make a simple
++ * addition, to give some space to allow traffic on the new subflow.
++ * Autotuning will increase it further later on.
++ */
++ space = min(meta_sk->sk_rcvbuf + sk->sk_rcvbuf, sysctl_tcp_rmem[2]);
++ if (space > meta_sk->sk_rcvbuf) {
++ meta_tp->window_clamp += tp->window_clamp;
++ meta_tp->rcv_ssthresh += tp->rcv_ssthresh;
++ meta_sk->sk_rcvbuf = space;
++ }
++
++snd_buf:
++ if (meta_sk->sk_userlocks & SOCK_SNDBUF_LOCK)
++ return;
++
++ /* Adding a new subflow to the send-buffer space. We make a simple
++ * addition, to give some space to allow traffic on the new subflow.
++ * Autotuning will increase it further later on.
++ */
++ space = min(meta_sk->sk_sndbuf + sk->sk_sndbuf, sysctl_tcp_wmem[2]);
++ if (space > meta_sk->sk_sndbuf) {
++ meta_sk->sk_sndbuf = space;
++ meta_sk->sk_write_space(meta_sk);
++ }
++}
++
++void mptcp_tcp_set_rto(struct sock *sk)
++{
++ tcp_set_rto(sk);
++ mptcp_set_rto(sk);
++}
+diff --git a/net/mptcp/mptcp_ipv4.c b/net/mptcp/mptcp_ipv4.c
+new file mode 100644
+index 000000000000..1183d1305d35
+--- /dev/null
++++ b/net/mptcp/mptcp_ipv4.c
+@@ -0,0 +1,483 @@
++/*
++ * MPTCP implementation - IPv4-specific functions
++ *
++ * Initial Design & Implementation:
++ * Sébastien Barré <sebastien.barre@uclouvain.be>
++ *
++ * Current Maintainer:
++ * Christoph Paasch <christoph.paasch@uclouvain.be>
++ *
++ * Additional authors:
++ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
++ * Gregory Detal <gregory.detal@uclouvain.be>
++ * Fabien Duchêne <fabien.duchene@uclouvain.be>
++ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
++ * Lavkesh Lahngir <lavkesh51@gmail.com>
++ * Andreas Ripke <ripke@neclab.eu>
++ * Vlad Dogaru <vlad.dogaru@intel.com>
++ * Octavian Purdila <octavian.purdila@intel.com>
++ * John Ronan <jronan@tssg.org>
++ * Catalin Nicutar <catalin.nicutar@gmail.com>
++ * Brandon Heller <brandonh@stanford.edu>
++ *
++ *
++ * This program is free software; you can redistribute it and/or
++ * modify it under the terms of the GNU General Public License
++ * as published by the Free Software Foundation; either version
++ * 2 of the License, or (at your option) any later version.
++ */
++
++#include <linux/export.h>
++#include <linux/ip.h>
++#include <linux/list.h>
++#include <linux/skbuff.h>
++#include <linux/spinlock.h>
++#include <linux/tcp.h>
++
++#include <net/inet_common.h>
++#include <net/inet_connection_sock.h>
++#include <net/mptcp.h>
++#include <net/mptcp_v4.h>
++#include <net/request_sock.h>
++#include <net/tcp.h>
++
++u32 mptcp_v4_get_nonce(__be32 saddr, __be32 daddr, __be16 sport, __be16 dport)
++{
++ u32 hash[MD5_DIGEST_WORDS];
++
++ hash[0] = (__force u32)saddr;
++ hash[1] = (__force u32)daddr;
++ hash[2] = ((__force u16)sport << 16) + (__force u16)dport;
++ hash[3] = mptcp_seed++;
++
++ md5_transform(hash, mptcp_secret);
++
++ return hash[0];
++}
++
++u64 mptcp_v4_get_key(__be32 saddr, __be32 daddr, __be16 sport, __be16 dport)
++{
++ u32 hash[MD5_DIGEST_WORDS];
++
++ hash[0] = (__force u32)saddr;
++ hash[1] = (__force u32)daddr;
++ hash[2] = ((__force u16)sport << 16) + (__force u16)dport;
++ hash[3] = mptcp_seed++;
++
++ md5_transform(hash, mptcp_secret);
++
++ return *((u64 *)hash);
++}
++
++static void mptcp_v4_reqsk_destructor(struct request_sock *req)
++{
++ mptcp_reqsk_destructor(req);
++
++ tcp_v4_reqsk_destructor(req);
++}
++
++static int mptcp_v4_init_req(struct request_sock *req, struct sock *sk,
++ struct sk_buff *skb)
++{
++ tcp_request_sock_ipv4_ops.init_req(req, sk, skb);
++ mptcp_reqsk_init(req, skb);
++
++ return 0;
++}
++
++static int mptcp_v4_join_init_req(struct request_sock *req, struct sock *sk,
++ struct sk_buff *skb)
++{
++ struct mptcp_request_sock *mtreq = mptcp_rsk(req);
++ struct mptcp_cb *mpcb = tcp_sk(sk)->mpcb;
++ union inet_addr addr;
++ int loc_id;
++ bool low_prio = false;
++
++ /* We need to do this as early as possible. Because, if we fail later
++ * (e.g., get_local_id), then reqsk_free tries to remove the
++ * request-socket from the htb in mptcp_hash_request_remove as pprev
++ * may be different from NULL.
++ */
++ mtreq->hash_entry.pprev = NULL;
++
++ tcp_request_sock_ipv4_ops.init_req(req, sk, skb);
++
++ mtreq->mptcp_loc_nonce = mptcp_v4_get_nonce(ip_hdr(skb)->saddr,
++ ip_hdr(skb)->daddr,
++ tcp_hdr(skb)->source,
++ tcp_hdr(skb)->dest);
++ addr.ip = inet_rsk(req)->ir_loc_addr;
++ loc_id = mpcb->pm_ops->get_local_id(AF_INET, &addr, sock_net(sk), &low_prio);
++ if (loc_id == -1)
++ return -1;
++ mtreq->loc_id = loc_id;
++ mtreq->low_prio = low_prio;
++
++ mptcp_join_reqsk_init(mpcb, req, skb);
++
++ return 0;
++}
++
++/* Similar to tcp_request_sock_ops */
++struct request_sock_ops mptcp_request_sock_ops __read_mostly = {
++ .family = PF_INET,
++ .obj_size = sizeof(struct mptcp_request_sock),
++ .rtx_syn_ack = tcp_rtx_synack,
++ .send_ack = tcp_v4_reqsk_send_ack,
++ .destructor = mptcp_v4_reqsk_destructor,
++ .send_reset = tcp_v4_send_reset,
++ .syn_ack_timeout = tcp_syn_ack_timeout,
++};
++
++static void mptcp_v4_reqsk_queue_hash_add(struct sock *meta_sk,
++ struct request_sock *req,
++ const unsigned long timeout)
++{
++ const u32 h1 = inet_synq_hash(inet_rsk(req)->ir_rmt_addr,
++ inet_rsk(req)->ir_rmt_port,
++ 0, MPTCP_HASH_SIZE);
++ /* We cannot call inet_csk_reqsk_queue_hash_add(), because we do not
++ * want to reset the keepalive-timer (responsible for retransmitting
++ * SYN/ACKs). We do not retransmit SYN/ACKs+MP_JOINs, because we cannot
++ * overload the keepalive timer. Also, it's not a big deal, because the
++ * third ACK of the MP_JOIN-handshake is sent in a reliable manner. So,
++ * if the third ACK gets lost, the client will handle the retransmission
++ * anyways. If our SYN/ACK gets lost, the client will retransmit the
++ * SYN.
++ */
++ struct inet_connection_sock *meta_icsk = inet_csk(meta_sk);
++ struct listen_sock *lopt = meta_icsk->icsk_accept_queue.listen_opt;
++ const u32 h2 = inet_synq_hash(inet_rsk(req)->ir_rmt_addr,
++ inet_rsk(req)->ir_rmt_port,
++ lopt->hash_rnd, lopt->nr_table_entries);
++
++ reqsk_queue_hash_req(&meta_icsk->icsk_accept_queue, h2, req, timeout);
++ if (reqsk_queue_added(&meta_icsk->icsk_accept_queue) == 0)
++ mptcp_reset_synack_timer(meta_sk, timeout);
++
++ rcu_read_lock();
++ spin_lock(&mptcp_reqsk_hlock);
++ hlist_nulls_add_head_rcu(&mptcp_rsk(req)->hash_entry, &mptcp_reqsk_htb[h1]);
++ spin_unlock(&mptcp_reqsk_hlock);
++ rcu_read_unlock();
++}
++
++/* Similar to tcp_v4_conn_request */
++static int mptcp_v4_join_request(struct sock *meta_sk, struct sk_buff *skb)
++{
++ return tcp_conn_request(&mptcp_request_sock_ops,
++ &mptcp_join_request_sock_ipv4_ops,
++ meta_sk, skb);
++}
++
++/* We only process join requests here. (either the SYN or the final ACK) */
++int mptcp_v4_do_rcv(struct sock *meta_sk, struct sk_buff *skb)
++{
++ const struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct sock *child, *rsk = NULL;
++ int ret;
++
++ if (!(TCP_SKB_CB(skb)->mptcp_flags & MPTCPHDR_JOIN)) {
++ struct tcphdr *th = tcp_hdr(skb);
++ const struct iphdr *iph = ip_hdr(skb);
++ struct sock *sk;
++
++ sk = inet_lookup_established(sock_net(meta_sk), &tcp_hashinfo,
++ iph->saddr, th->source, iph->daddr,
++ th->dest, inet_iif(skb));
++
++ if (!sk) {
++ kfree_skb(skb);
++ return 0;
++ }
++ if (is_meta_sk(sk)) {
++ WARN("%s Did not find a sub-sk - found the meta instead!\n", __func__);
++ kfree_skb(skb);
++ sock_put(sk);
++ return 0;
++ }
++
++ if (sk->sk_state == TCP_TIME_WAIT) {
++ inet_twsk_put(inet_twsk(sk));
++ kfree_skb(skb);
++ return 0;
++ }
++
++ ret = tcp_v4_do_rcv(sk, skb);
++ sock_put(sk);
++
++ return ret;
++ }
++ TCP_SKB_CB(skb)->mptcp_flags = 0;
++
++ /* Has been removed from the tk-table. Thus, no new subflows.
++ *
++ * Check for close-state is necessary, because we may have been closed
++ * without passing by mptcp_close().
++ *
++ * When falling back, no new subflows are allowed either.
++ */
++ if (meta_sk->sk_state == TCP_CLOSE || !tcp_sk(meta_sk)->inside_tk_table ||
++ mpcb->infinite_mapping_rcv || mpcb->send_infinite_mapping)
++ goto reset_and_discard;
++
++ child = tcp_v4_hnd_req(meta_sk, skb);
++
++ if (!child)
++ goto discard;
++
++ if (child != meta_sk) {
++ sock_rps_save_rxhash(child, skb);
++ /* We don't call tcp_child_process here, because we hold
++ * already the meta-sk-lock and are sure that it is not owned
++ * by the user.
++ */
++ ret = tcp_rcv_state_process(child, skb, tcp_hdr(skb), skb->len);
++ bh_unlock_sock(child);
++ sock_put(child);
++ if (ret) {
++ rsk = child;
++ goto reset_and_discard;
++ }
++ } else {
++ if (tcp_hdr(skb)->syn) {
++ mptcp_v4_join_request(meta_sk, skb);
++ goto discard;
++ }
++ goto reset_and_discard;
++ }
++ return 0;
++
++reset_and_discard:
++ if (reqsk_queue_len(&inet_csk(meta_sk)->icsk_accept_queue)) {
++ const struct tcphdr *th = tcp_hdr(skb);
++ const struct iphdr *iph = ip_hdr(skb);
++ struct request_sock **prev, *req;
++ /* If we end up here, it means we should not have matched on the
++ * request-socket. But, because the request-sock queue is only
++ * destroyed in mptcp_close, the socket may actually already be
++ * in close-state (e.g., through shutdown()) while still having
++ * pending request sockets.
++ */
++ req = inet_csk_search_req(meta_sk, &prev, th->source,
++ iph->saddr, iph->daddr);
++ if (req) {
++ inet_csk_reqsk_queue_unlink(meta_sk, req, prev);
++ reqsk_queue_removed(&inet_csk(meta_sk)->icsk_accept_queue,
++ req);
++ reqsk_free(req);
++ }
++ }
++
++ tcp_v4_send_reset(rsk, skb);
++discard:
++ kfree_skb(skb);
++ return 0;
++}
++
++/* After this, the ref count of the meta_sk associated with the request_sock
++ * is incremented. Thus it is the responsibility of the caller
++ * to call sock_put() when the reference is not needed anymore.
++ */
++struct sock *mptcp_v4_search_req(const __be16 rport, const __be32 raddr,
++ const __be32 laddr, const struct net *net)
++{
++ const struct mptcp_request_sock *mtreq;
++ struct sock *meta_sk = NULL;
++ const struct hlist_nulls_node *node;
++ const u32 hash = inet_synq_hash(raddr, rport, 0, MPTCP_HASH_SIZE);
++
++ rcu_read_lock();
++begin:
++ hlist_nulls_for_each_entry_rcu(mtreq, node, &mptcp_reqsk_htb[hash],
++ hash_entry) {
++ struct inet_request_sock *ireq = inet_rsk(rev_mptcp_rsk(mtreq));
++ meta_sk = mtreq->mptcp_mpcb->meta_sk;
++
++ if (ireq->ir_rmt_port == rport &&
++ ireq->ir_rmt_addr == raddr &&
++ ireq->ir_loc_addr == laddr &&
++ rev_mptcp_rsk(mtreq)->rsk_ops->family == AF_INET &&
++ net_eq(net, sock_net(meta_sk)))
++ goto found;
++ meta_sk = NULL;
++ }
++ /* A request-socket is destroyed by RCU. So, it might have been recycled
++ * and put into another hash-table list. So, after the lookup we may
++ * end up in a different list. So, we may need to restart.
++ *
++ * See also the comment in __inet_lookup_established.
++ */
++ if (get_nulls_value(node) != hash + MPTCP_REQSK_NULLS_BASE)
++ goto begin;
++
++found:
++ if (meta_sk && unlikely(!atomic_inc_not_zero(&meta_sk->sk_refcnt)))
++ meta_sk = NULL;
++ rcu_read_unlock();
++
++ return meta_sk;
++}
++
++/* Create a new IPv4 subflow.
++ *
++ * We are in user-context and the meta-sock-lock is held.
++ */
++int mptcp_init4_subsockets(struct sock *meta_sk, const struct mptcp_loc4 *loc,
++ struct mptcp_rem4 *rem)
++{
++ struct tcp_sock *tp;
++ struct sock *sk;
++ struct sockaddr_in loc_in, rem_in;
++ struct socket sock;
++ int ret;
++
++ /** First, create and prepare the new socket */
++
++ sock.type = meta_sk->sk_socket->type;
++ sock.state = SS_UNCONNECTED;
++ sock.wq = meta_sk->sk_socket->wq;
++ sock.file = meta_sk->sk_socket->file;
++ sock.ops = NULL;
++
++ ret = inet_create(sock_net(meta_sk), &sock, IPPROTO_TCP, 1);
++ if (unlikely(ret < 0)) {
++ mptcp_debug("%s inet_create failed ret: %d\n", __func__, ret);
++ return ret;
++ }
++
++ sk = sock.sk;
++ tp = tcp_sk(sk);
++
++ /* All subsockets need the MPTCP-lock-class */
++ lockdep_set_class_and_name(&(sk)->sk_lock.slock, &meta_slock_key, "slock-AF_INET-MPTCP");
++ lockdep_init_map(&(sk)->sk_lock.dep_map, "sk_lock-AF_INET-MPTCP", &meta_key, 0);
++
++ if (mptcp_add_sock(meta_sk, sk, loc->loc4_id, rem->rem4_id, GFP_KERNEL))
++ goto error;
++
++ tp->mptcp->slave_sk = 1;
++ tp->mptcp->low_prio = loc->low_prio;
++
++ /* Initializing the timer for an MPTCP subflow */
++ setup_timer(&tp->mptcp->mptcp_ack_timer, mptcp_ack_handler, (unsigned long)sk);
++
++ /** Then, connect the socket to the peer */
++ loc_in.sin_family = AF_INET;
++ rem_in.sin_family = AF_INET;
++ loc_in.sin_port = 0;
++ if (rem->port)
++ rem_in.sin_port = rem->port;
++ else
++ rem_in.sin_port = inet_sk(meta_sk)->inet_dport;
++ loc_in.sin_addr = loc->addr;
++ rem_in.sin_addr = rem->addr;
++
++ ret = sock.ops->bind(&sock, (struct sockaddr *)&loc_in, sizeof(struct sockaddr_in));
++ if (ret < 0) {
++ mptcp_debug("%s: MPTCP subsocket bind() failed, error %d\n",
++ __func__, ret);
++ goto error;
++ }
++
++ mptcp_debug("%s: token %#x pi %d src_addr:%pI4:%d dst_addr:%pI4:%d\n",
++ __func__, tcp_sk(meta_sk)->mpcb->mptcp_loc_token,
++ tp->mptcp->path_index, &loc_in.sin_addr,
++ ntohs(loc_in.sin_port), &rem_in.sin_addr,
++ ntohs(rem_in.sin_port));
++
++ if (tcp_sk(meta_sk)->mpcb->pm_ops->init_subsocket_v4)
++ tcp_sk(meta_sk)->mpcb->pm_ops->init_subsocket_v4(sk, rem->addr);
++
++ ret = sock.ops->connect(&sock, (struct sockaddr *)&rem_in,
++ sizeof(struct sockaddr_in), O_NONBLOCK);
++ if (ret < 0 && ret != -EINPROGRESS) {
++ mptcp_debug("%s: MPTCP subsocket connect() failed, error %d\n",
++ __func__, ret);
++ goto error;
++ }
++
++ sk_set_socket(sk, meta_sk->sk_socket);
++ sk->sk_wq = meta_sk->sk_wq;
++
++ return 0;
++
++error:
++ /* May happen if mptcp_add_sock fails first */
++ if (!mptcp(tp)) {
++ tcp_close(sk, 0);
++ } else {
++ local_bh_disable();
++ mptcp_sub_force_close(sk);
++ local_bh_enable();
++ }
++ return ret;
++}
++EXPORT_SYMBOL(mptcp_init4_subsockets);
++
++const struct inet_connection_sock_af_ops mptcp_v4_specific = {
++ .queue_xmit = ip_queue_xmit,
++ .send_check = tcp_v4_send_check,
++ .rebuild_header = inet_sk_rebuild_header,
++ .sk_rx_dst_set = inet_sk_rx_dst_set,
++ .conn_request = mptcp_conn_request,
++ .syn_recv_sock = tcp_v4_syn_recv_sock,
++ .net_header_len = sizeof(struct iphdr),
++ .setsockopt = ip_setsockopt,
++ .getsockopt = ip_getsockopt,
++ .addr2sockaddr = inet_csk_addr2sockaddr,
++ .sockaddr_len = sizeof(struct sockaddr_in),
++ .bind_conflict = inet_csk_bind_conflict,
++#ifdef CONFIG_COMPAT
++ .compat_setsockopt = compat_ip_setsockopt,
++ .compat_getsockopt = compat_ip_getsockopt,
++#endif
++};
++
++struct tcp_request_sock_ops mptcp_request_sock_ipv4_ops;
++struct tcp_request_sock_ops mptcp_join_request_sock_ipv4_ops;
++
++/* General initialization of IPv4 for MPTCP */
++int mptcp_pm_v4_init(void)
++{
++ int ret = 0;
++ struct request_sock_ops *ops = &mptcp_request_sock_ops;
++
++ mptcp_request_sock_ipv4_ops = tcp_request_sock_ipv4_ops;
++ mptcp_request_sock_ipv4_ops.init_req = mptcp_v4_init_req;
++
++ mptcp_join_request_sock_ipv4_ops = tcp_request_sock_ipv4_ops;
++ mptcp_join_request_sock_ipv4_ops.init_req = mptcp_v4_join_init_req;
++ mptcp_join_request_sock_ipv4_ops.queue_hash_add = mptcp_v4_reqsk_queue_hash_add;
++
++ ops->slab_name = kasprintf(GFP_KERNEL, "request_sock_%s", "MPTCP");
++ if (ops->slab_name == NULL) {
++ ret = -ENOMEM;
++ goto out;
++ }
++
++ ops->slab = kmem_cache_create(ops->slab_name, ops->obj_size, 0,
++ SLAB_DESTROY_BY_RCU|SLAB_HWCACHE_ALIGN,
++ NULL);
++
++ if (ops->slab == NULL) {
++ ret = -ENOMEM;
++ goto err_reqsk_create;
++ }
++
++out:
++ return ret;
++
++err_reqsk_create:
++ kfree(ops->slab_name);
++ ops->slab_name = NULL;
++ goto out;
++}
++
++void mptcp_pm_v4_undo(void)
++{
++ kmem_cache_destroy(mptcp_request_sock_ops.slab);
++ kfree(mptcp_request_sock_ops.slab_name);
++}
+diff --git a/net/mptcp/mptcp_ipv6.c b/net/mptcp/mptcp_ipv6.c
+new file mode 100644
+index 000000000000..1036973aa855
+--- /dev/null
++++ b/net/mptcp/mptcp_ipv6.c
+@@ -0,0 +1,518 @@
++/*
++ * MPTCP implementation - IPv6-specific functions
++ *
++ * Initial Design & Implementation:
++ * Sébastien Barré <sebastien.barre@uclouvain.be>
++ *
++ * Current Maintainer:
++ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
++ *
++ * Additional authors:
++ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
++ * Gregory Detal <gregory.detal@uclouvain.be>
++ * Fabien Duchêne <fabien.duchene@uclouvain.be>
++ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
++ * Lavkesh Lahngir <lavkesh51@gmail.com>
++ * Andreas Ripke <ripke@neclab.eu>
++ * Vlad Dogaru <vlad.dogaru@intel.com>
++ * Octavian Purdila <octavian.purdila@intel.com>
++ * John Ronan <jronan@tssg.org>
++ * Catalin Nicutar <catalin.nicutar@gmail.com>
++ * Brandon Heller <brandonh@stanford.edu>
++ *
++ *
++ * This program is free software; you can redistribute it and/or
++ * modify it under the terms of the GNU General Public License
++ * as published by the Free Software Foundation; either version
++ * 2 of the License, or (at your option) any later version.
++ */
++
++#include <linux/export.h>
++#include <linux/in6.h>
++#include <linux/kernel.h>
++
++#include <net/addrconf.h>
++#include <net/flow.h>
++#include <net/inet6_connection_sock.h>
++#include <net/inet6_hashtables.h>
++#include <net/inet_common.h>
++#include <net/ipv6.h>
++#include <net/ip6_checksum.h>
++#include <net/ip6_route.h>
++#include <net/mptcp.h>
++#include <net/mptcp_v6.h>
++#include <net/tcp.h>
++#include <net/transp_v6.h>
++
++__u32 mptcp_v6_get_nonce(const __be32 *saddr, const __be32 *daddr,
++ __be16 sport, __be16 dport)
++{
++ u32 secret[MD5_MESSAGE_BYTES / 4];
++ u32 hash[MD5_DIGEST_WORDS];
++ u32 i;
++
++ memcpy(hash, saddr, 16);
++ for (i = 0; i < 4; i++)
++ secret[i] = mptcp_secret[i] + (__force u32)daddr[i];
++ secret[4] = mptcp_secret[4] +
++ (((__force u16)sport << 16) + (__force u16)dport);
++ secret[5] = mptcp_seed++;
++ for (i = 6; i < MD5_MESSAGE_BYTES / 4; i++)
++ secret[i] = mptcp_secret[i];
++
++ md5_transform(hash, secret);
++
++ return hash[0];
++}
++
++u64 mptcp_v6_get_key(const __be32 *saddr, const __be32 *daddr,
++ __be16 sport, __be16 dport)
++{
++ u32 secret[MD5_MESSAGE_BYTES / 4];
++ u32 hash[MD5_DIGEST_WORDS];
++ u32 i;
++
++ memcpy(hash, saddr, 16);
++ for (i = 0; i < 4; i++)
++ secret[i] = mptcp_secret[i] + (__force u32)daddr[i];
++ secret[4] = mptcp_secret[4] +
++ (((__force u16)sport << 16) + (__force u16)dport);
++ secret[5] = mptcp_seed++;
++ for (i = 6; i < MD5_MESSAGE_BYTES / 4; i++)
++ secret[i] = mptcp_secret[i];
++
++ md5_transform(hash, secret);
++
++ return *((u64 *)hash);
++}
++
++static void mptcp_v6_reqsk_destructor(struct request_sock *req)
++{
++ mptcp_reqsk_destructor(req);
++
++ tcp_v6_reqsk_destructor(req);
++}
++
++static int mptcp_v6_init_req(struct request_sock *req, struct sock *sk,
++ struct sk_buff *skb)
++{
++ tcp_request_sock_ipv6_ops.init_req(req, sk, skb);
++ mptcp_reqsk_init(req, skb);
++
++ return 0;
++}
++
++static int mptcp_v6_join_init_req(struct request_sock *req, struct sock *sk,
++ struct sk_buff *skb)
++{
++ struct mptcp_request_sock *mtreq = mptcp_rsk(req);
++ struct mptcp_cb *mpcb = tcp_sk(sk)->mpcb;
++ union inet_addr addr;
++ int loc_id;
++ bool low_prio = false;
++
++ /* We need to do this as early as possible, because if we fail later
++ * (e.g., in get_local_id), reqsk_free tries to remove the
++ * request-socket from the htb in mptcp_hash_request_remove, as pprev
++ * may be different from NULL.
++ */
++ mtreq->hash_entry.pprev = NULL;
++
++ tcp_request_sock_ipv6_ops.init_req(req, sk, skb);
++
++ mtreq->mptcp_loc_nonce = mptcp_v6_get_nonce(ipv6_hdr(skb)->saddr.s6_addr32,
++ ipv6_hdr(skb)->daddr.s6_addr32,
++ tcp_hdr(skb)->source,
++ tcp_hdr(skb)->dest);
++ addr.in6 = inet_rsk(req)->ir_v6_loc_addr;
++ loc_id = mpcb->pm_ops->get_local_id(AF_INET6, &addr, sock_net(sk), &low_prio);
++ if (loc_id == -1)
++ return -1;
++ mtreq->loc_id = loc_id;
++ mtreq->low_prio = low_prio;
++
++ mptcp_join_reqsk_init(mpcb, req, skb);
++
++ return 0;
++}
++
++/* Similar to tcp6_request_sock_ops */
++struct request_sock_ops mptcp6_request_sock_ops __read_mostly = {
++ .family = AF_INET6,
++ .obj_size = sizeof(struct mptcp_request_sock),
++ .rtx_syn_ack = tcp_v6_rtx_synack,
++ .send_ack = tcp_v6_reqsk_send_ack,
++ .destructor = mptcp_v6_reqsk_destructor,
++ .send_reset = tcp_v6_send_reset,
++ .syn_ack_timeout = tcp_syn_ack_timeout,
++};
++
++static void mptcp_v6_reqsk_queue_hash_add(struct sock *meta_sk,
++ struct request_sock *req,
++ const unsigned long timeout)
++{
++ const u32 h1 = inet6_synq_hash(&inet_rsk(req)->ir_v6_rmt_addr,
++ inet_rsk(req)->ir_rmt_port,
++ 0, MPTCP_HASH_SIZE);
++ /* We cannot call inet6_csk_reqsk_queue_hash_add(), because we do not
++ * want to reset the keepalive-timer (responsible for retransmitting
++ * SYN/ACKs). We do not retransmit SYN/ACKs+MP_JOINs, because we cannot
++ * overload the keepalive timer. Also, it's not a big deal, because the
++ * third ACK of the MP_JOIN-handshake is sent in a reliable manner. So,
++ * if the third ACK gets lost, the client will handle the retransmission
++ * anyway. If our SYN/ACK gets lost, the client will retransmit the
++ * SYN.
++ */
++ struct inet_connection_sock *meta_icsk = inet_csk(meta_sk);
++ struct listen_sock *lopt = meta_icsk->icsk_accept_queue.listen_opt;
++ const u32 h2 = inet6_synq_hash(&inet_rsk(req)->ir_v6_rmt_addr,
++ inet_rsk(req)->ir_rmt_port,
++ lopt->hash_rnd, lopt->nr_table_entries);
++
++ reqsk_queue_hash_req(&meta_icsk->icsk_accept_queue, h2, req, timeout);
++ if (reqsk_queue_added(&meta_icsk->icsk_accept_queue) == 0)
++ mptcp_reset_synack_timer(meta_sk, timeout);
++
++ rcu_read_lock();
++ spin_lock(&mptcp_reqsk_hlock);
++ hlist_nulls_add_head_rcu(&mptcp_rsk(req)->hash_entry, &mptcp_reqsk_htb[h1]);
++ spin_unlock(&mptcp_reqsk_hlock);
++ rcu_read_unlock();
++}
++
++static int mptcp_v6_join_request(struct sock *meta_sk, struct sk_buff *skb)
++{
++ return tcp_conn_request(&mptcp6_request_sock_ops,
++ &mptcp_join_request_sock_ipv6_ops,
++ meta_sk, skb);
++}
++
++int mptcp_v6_do_rcv(struct sock *meta_sk, struct sk_buff *skb)
++{
++ const struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct sock *child, *rsk = NULL;
++ int ret;
++
++ if (!(TCP_SKB_CB(skb)->mptcp_flags & MPTCPHDR_JOIN)) {
++ struct tcphdr *th = tcp_hdr(skb);
++ const struct ipv6hdr *ip6h = ipv6_hdr(skb);
++ struct sock *sk;
++
++ sk = __inet6_lookup_established(sock_net(meta_sk),
++ &tcp_hashinfo,
++ &ip6h->saddr, th->source,
++ &ip6h->daddr, ntohs(th->dest),
++ inet6_iif(skb));
++
++ if (!sk) {
++ kfree_skb(skb);
++ return 0;
++ }
++ if (is_meta_sk(sk)) {
++ WARN("%s Did not find a sub-sk!\n", __func__);
++ kfree_skb(skb);
++ sock_put(sk);
++ return 0;
++ }
++
++ if (sk->sk_state == TCP_TIME_WAIT) {
++ inet_twsk_put(inet_twsk(sk));
++ kfree_skb(skb);
++ return 0;
++ }
++
++ ret = tcp_v6_do_rcv(sk, skb);
++ sock_put(sk);
++
++ return ret;
++ }
++ TCP_SKB_CB(skb)->mptcp_flags = 0;
++
++ /* Has been removed from the tk-table. Thus, no new subflows.
++ *
++ * Check for close-state is necessary, because we may have been closed
++ * without passing by mptcp_close().
++ *
++ * When falling back, no new subflows are allowed either.
++ */
++ if (meta_sk->sk_state == TCP_CLOSE || !tcp_sk(meta_sk)->inside_tk_table ||
++ mpcb->infinite_mapping_rcv || mpcb->send_infinite_mapping)
++ goto reset_and_discard;
++
++ child = tcp_v6_hnd_req(meta_sk, skb);
++
++ if (!child)
++ goto discard;
++
++ if (child != meta_sk) {
++ sock_rps_save_rxhash(child, skb);
++ /* We don't call tcp_child_process here, because we already
++ * hold the meta-sk lock and are sure that it is not owned
++ * by the user.
++ */
++ ret = tcp_rcv_state_process(child, skb, tcp_hdr(skb), skb->len);
++ bh_unlock_sock(child);
++ sock_put(child);
++ if (ret) {
++ rsk = child;
++ goto reset_and_discard;
++ }
++ } else {
++ if (tcp_hdr(skb)->syn) {
++ mptcp_v6_join_request(meta_sk, skb);
++ goto discard;
++ }
++ goto reset_and_discard;
++ }
++ return 0;
++
++reset_and_discard:
++ if (reqsk_queue_len(&inet_csk(meta_sk)->icsk_accept_queue)) {
++ const struct tcphdr *th = tcp_hdr(skb);
++ struct request_sock **prev, *req;
++ /* If we end up here, it means we should not have matched on the
++ * request-socket. But, because the request-sock queue is only
++ * destroyed in mptcp_close, the socket may actually already be
++ * in close-state (e.g., through shutdown()) while still having
++ * pending request sockets.
++ */
++ req = inet6_csk_search_req(meta_sk, &prev, th->source,
++ &ipv6_hdr(skb)->saddr,
++ &ipv6_hdr(skb)->daddr, inet6_iif(skb));
++ if (req) {
++ inet_csk_reqsk_queue_unlink(meta_sk, req, prev);
++ reqsk_queue_removed(&inet_csk(meta_sk)->icsk_accept_queue,
++ req);
++ reqsk_free(req);
++ }
++ }
++
++ tcp_v6_send_reset(rsk, skb);
++discard:
++ kfree_skb(skb);
++ return 0;
++}
++
++/* After this, the ref count of the meta_sk associated with the request_sock
++ * is incremented. Thus it is the responsibility of the caller
++ * to call sock_put() when the reference is not needed anymore.
++ */
++struct sock *mptcp_v6_search_req(const __be16 rport, const struct in6_addr *raddr,
++ const struct in6_addr *laddr, const struct net *net)
++{
++ const struct mptcp_request_sock *mtreq;
++ struct sock *meta_sk = NULL;
++ const struct hlist_nulls_node *node;
++ const u32 hash = inet6_synq_hash(raddr, rport, 0, MPTCP_HASH_SIZE);
++
++ rcu_read_lock();
++begin:
++ hlist_nulls_for_each_entry_rcu(mtreq, node, &mptcp_reqsk_htb[hash],
++ hash_entry) {
++ struct inet_request_sock *treq = inet_rsk(rev_mptcp_rsk(mtreq));
++ meta_sk = mtreq->mptcp_mpcb->meta_sk;
++
++ if (inet_rsk(rev_mptcp_rsk(mtreq))->ir_rmt_port == rport &&
++ rev_mptcp_rsk(mtreq)->rsk_ops->family == AF_INET6 &&
++ ipv6_addr_equal(&treq->ir_v6_rmt_addr, raddr) &&
++ ipv6_addr_equal(&treq->ir_v6_loc_addr, laddr) &&
++ net_eq(net, sock_net(meta_sk)))
++ goto found;
++ meta_sk = NULL;
++ }
++ /* A request-socket is destroyed by RCU, so it might have been recycled
++ * and put into another hash-table list. After the lookup we may thus
++ * end up in a different list, in which case we need to restart.
++ *
++ * See also the comment in __inet_lookup_established.
++ */
++ if (get_nulls_value(node) != hash + MPTCP_REQSK_NULLS_BASE)
++ goto begin;
++
++found:
++ if (meta_sk && unlikely(!atomic_inc_not_zero(&meta_sk->sk_refcnt)))
++ meta_sk = NULL;
++ rcu_read_unlock();
++
++ return meta_sk;
++}
++
++/* Create a new IPv6 subflow.
++ *
++ * We are in user-context and the meta-sock lock is held.
++ */
++int mptcp_init6_subsockets(struct sock *meta_sk, const struct mptcp_loc6 *loc,
++ struct mptcp_rem6 *rem)
++{
++ struct tcp_sock *tp;
++ struct sock *sk;
++ struct sockaddr_in6 loc_in, rem_in;
++ struct socket sock;
++ int ret;
++
++ /** First, create and prepare the new socket */
++
++ sock.type = meta_sk->sk_socket->type;
++ sock.state = SS_UNCONNECTED;
++ sock.wq = meta_sk->sk_socket->wq;
++ sock.file = meta_sk->sk_socket->file;
++ sock.ops = NULL;
++
++ ret = inet6_create(sock_net(meta_sk), &sock, IPPROTO_TCP, 1);
++ if (unlikely(ret < 0)) {
++ mptcp_debug("%s inet6_create failed ret: %d\n", __func__, ret);
++ return ret;
++ }
++
++ sk = sock.sk;
++ tp = tcp_sk(sk);
++
++ /* All subsockets need the MPTCP-lock-class */
++ lockdep_set_class_and_name(&(sk)->sk_lock.slock, &meta_slock_key, "slock-AF_INET-MPTCP");
++ lockdep_init_map(&(sk)->sk_lock.dep_map, "sk_lock-AF_INET-MPTCP", &meta_key, 0);
++
++ if (mptcp_add_sock(meta_sk, sk, loc->loc6_id, rem->rem6_id, GFP_KERNEL))
++ goto error;
++
++ tp->mptcp->slave_sk = 1;
++ tp->mptcp->low_prio = loc->low_prio;
++
++ /* Initializing the timer for an MPTCP subflow */
++ setup_timer(&tp->mptcp->mptcp_ack_timer, mptcp_ack_handler, (unsigned long)sk);
++
++ /** Then, connect the socket to the peer */
++ loc_in.sin6_family = AF_INET6;
++ rem_in.sin6_family = AF_INET6;
++ loc_in.sin6_port = 0;
++ if (rem->port)
++ rem_in.sin6_port = rem->port;
++ else
++ rem_in.sin6_port = inet_sk(meta_sk)->inet_dport;
++ loc_in.sin6_addr = loc->addr;
++ rem_in.sin6_addr = rem->addr;
++
++ ret = sock.ops->bind(&sock, (struct sockaddr *)&loc_in, sizeof(struct sockaddr_in6));
++ if (ret < 0) {
++ mptcp_debug("%s: MPTCP subsocket bind() failed, error %d\n",
++ __func__, ret);
++ goto error;
++ }
++
++ mptcp_debug("%s: token %#x pi %d src_addr:%pI6:%d dst_addr:%pI6:%d\n",
++ __func__, tcp_sk(meta_sk)->mpcb->mptcp_loc_token,
++ tp->mptcp->path_index, &loc_in.sin6_addr,
++ ntohs(loc_in.sin6_port), &rem_in.sin6_addr,
++ ntohs(rem_in.sin6_port));
++
++ if (tcp_sk(meta_sk)->mpcb->pm_ops->init_subsocket_v6)
++ tcp_sk(meta_sk)->mpcb->pm_ops->init_subsocket_v6(sk, rem->addr);
++
++ ret = sock.ops->connect(&sock, (struct sockaddr *)&rem_in,
++ sizeof(struct sockaddr_in6), O_NONBLOCK);
++ if (ret < 0 && ret != -EINPROGRESS) {
++ mptcp_debug("%s: MPTCP subsocket connect() failed, error %d\n",
++ __func__, ret);
++ goto error;
++ }
++
++ sk_set_socket(sk, meta_sk->sk_socket);
++ sk->sk_wq = meta_sk->sk_wq;
++
++ return 0;
++
++error:
++ /* May happen if mptcp_add_sock fails first */
++ if (!mptcp(tp)) {
++ tcp_close(sk, 0);
++ } else {
++ local_bh_disable();
++ mptcp_sub_force_close(sk);
++ local_bh_enable();
++ }
++ return ret;
++}
++EXPORT_SYMBOL(mptcp_init6_subsockets);
++
++const struct inet_connection_sock_af_ops mptcp_v6_specific = {
++ .queue_xmit = inet6_csk_xmit,
++ .send_check = tcp_v6_send_check,
++ .rebuild_header = inet6_sk_rebuild_header,
++ .sk_rx_dst_set = inet6_sk_rx_dst_set,
++ .conn_request = mptcp_conn_request,
++ .syn_recv_sock = tcp_v6_syn_recv_sock,
++ .net_header_len = sizeof(struct ipv6hdr),
++ .net_frag_header_len = sizeof(struct frag_hdr),
++ .setsockopt = ipv6_setsockopt,
++ .getsockopt = ipv6_getsockopt,
++ .addr2sockaddr = inet6_csk_addr2sockaddr,
++ .sockaddr_len = sizeof(struct sockaddr_in6),
++ .bind_conflict = inet6_csk_bind_conflict,
++#ifdef CONFIG_COMPAT
++ .compat_setsockopt = compat_ipv6_setsockopt,
++ .compat_getsockopt = compat_ipv6_getsockopt,
++#endif
++};
++
++const struct inet_connection_sock_af_ops mptcp_v6_mapped = {
++ .queue_xmit = ip_queue_xmit,
++ .send_check = tcp_v4_send_check,
++ .rebuild_header = inet_sk_rebuild_header,
++ .sk_rx_dst_set = inet_sk_rx_dst_set,
++ .conn_request = mptcp_conn_request,
++ .syn_recv_sock = tcp_v6_syn_recv_sock,
++ .net_header_len = sizeof(struct iphdr),
++ .setsockopt = ipv6_setsockopt,
++ .getsockopt = ipv6_getsockopt,
++ .addr2sockaddr = inet6_csk_addr2sockaddr,
++ .sockaddr_len = sizeof(struct sockaddr_in6),
++ .bind_conflict = inet6_csk_bind_conflict,
++#ifdef CONFIG_COMPAT
++ .compat_setsockopt = compat_ipv6_setsockopt,
++ .compat_getsockopt = compat_ipv6_getsockopt,
++#endif
++};
++
++struct tcp_request_sock_ops mptcp_request_sock_ipv6_ops;
++struct tcp_request_sock_ops mptcp_join_request_sock_ipv6_ops;
++
++int mptcp_pm_v6_init(void)
++{
++ int ret = 0;
++ struct request_sock_ops *ops = &mptcp6_request_sock_ops;
++
++ mptcp_request_sock_ipv6_ops = tcp_request_sock_ipv6_ops;
++ mptcp_request_sock_ipv6_ops.init_req = mptcp_v6_init_req;
++
++ mptcp_join_request_sock_ipv6_ops = tcp_request_sock_ipv6_ops;
++ mptcp_join_request_sock_ipv6_ops.init_req = mptcp_v6_join_init_req;
++ mptcp_join_request_sock_ipv6_ops.queue_hash_add = mptcp_v6_reqsk_queue_hash_add;
++
++ ops->slab_name = kasprintf(GFP_KERNEL, "request_sock_%s", "MPTCP6");
++ if (ops->slab_name == NULL) {
++ ret = -ENOMEM;
++ goto out;
++ }
++
++ ops->slab = kmem_cache_create(ops->slab_name, ops->obj_size, 0,
++ SLAB_DESTROY_BY_RCU|SLAB_HWCACHE_ALIGN,
++ NULL);
++
++ if (ops->slab == NULL) {
++ ret = -ENOMEM;
++ goto err_reqsk_create;
++ }
++
++out:
++ return ret;
++
++err_reqsk_create:
++ kfree(ops->slab_name);
++ ops->slab_name = NULL;
++ goto out;
++}
++
++void mptcp_pm_v6_undo(void)
++{
++ kmem_cache_destroy(mptcp6_request_sock_ops.slab);
++ kfree(mptcp6_request_sock_ops.slab_name);
++}
+diff --git a/net/mptcp/mptcp_ndiffports.c b/net/mptcp/mptcp_ndiffports.c
+new file mode 100644
+index 000000000000..6f5087983175
+--- /dev/null
++++ b/net/mptcp/mptcp_ndiffports.c
+@@ -0,0 +1,161 @@
++#include <linux/module.h>
++
++#include <net/mptcp.h>
++#include <net/mptcp_v4.h>
++
++#if IS_ENABLED(CONFIG_IPV6)
++#include <net/mptcp_v6.h>
++#endif
++
++struct ndiffports_priv {
++ /* Worker struct for subflow establishment */
++ struct work_struct subflow_work;
++
++ struct mptcp_cb *mpcb;
++};
++
++static int num_subflows __read_mostly = 2;
++module_param(num_subflows, int, 0644);
++MODULE_PARM_DESC(num_subflows, "choose the number of subflows per MPTCP connection");
++
++/**
++ * Create all new subflows by calling mptcp_initX_subsockets.
++ *
++ * This function uses a 'goto next_subflow' to allow releasing the lock
++ * between new subflows, giving other processes a chance to do some work
++ * on the socket and potentially finish the communication.
++ **/
++static void create_subflow_worker(struct work_struct *work)
++{
++ const struct ndiffports_priv *pm_priv = container_of(work,
++ struct ndiffports_priv,
++ subflow_work);
++ struct mptcp_cb *mpcb = pm_priv->mpcb;
++ struct sock *meta_sk = mpcb->meta_sk;
++ int iter = 0;
++
++next_subflow:
++ if (iter) {
++ release_sock(meta_sk);
++ mutex_unlock(&mpcb->mpcb_mutex);
++
++ cond_resched();
++ }
++ mutex_lock(&mpcb->mpcb_mutex);
++ lock_sock_nested(meta_sk, SINGLE_DEPTH_NESTING);
++
++ iter++;
++
++ if (sock_flag(meta_sk, SOCK_DEAD))
++ goto exit;
++
++ if (mpcb->master_sk &&
++ !tcp_sk(mpcb->master_sk)->mptcp->fully_established)
++ goto exit;
++
++ if (num_subflows > iter && num_subflows > mpcb->cnt_subflows) {
++ if (meta_sk->sk_family == AF_INET ||
++ mptcp_v6_is_v4_mapped(meta_sk)) {
++ struct mptcp_loc4 loc;
++ struct mptcp_rem4 rem;
++
++ loc.addr.s_addr = inet_sk(meta_sk)->inet_saddr;
++ loc.loc4_id = 0;
++ loc.low_prio = 0;
++
++ rem.addr.s_addr = inet_sk(meta_sk)->inet_daddr;
++ rem.port = inet_sk(meta_sk)->inet_dport;
++ rem.rem4_id = 0; /* Default 0 */
++
++ mptcp_init4_subsockets(meta_sk, &loc, &rem);
++ } else {
++#if IS_ENABLED(CONFIG_IPV6)
++ struct mptcp_loc6 loc;
++ struct mptcp_rem6 rem;
++
++ loc.addr = inet6_sk(meta_sk)->saddr;
++ loc.loc6_id = 0;
++ loc.low_prio = 0;
++
++ rem.addr = meta_sk->sk_v6_daddr;
++ rem.port = inet_sk(meta_sk)->inet_dport;
++ rem.rem6_id = 0; /* Default 0 */
++
++ mptcp_init6_subsockets(meta_sk, &loc, &rem);
++#endif
++ }
++ goto next_subflow;
++ }
++
++exit:
++ release_sock(meta_sk);
++ mutex_unlock(&mpcb->mpcb_mutex);
++ sock_put(meta_sk);
++}
++
++static void ndiffports_new_session(const struct sock *meta_sk)
++{
++ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct ndiffports_priv *fmp = (struct ndiffports_priv *)&mpcb->mptcp_pm[0];
++
++ /* Initialize workqueue-struct */
++ INIT_WORK(&fmp->subflow_work, create_subflow_worker);
++ fmp->mpcb = mpcb;
++}
++
++static void ndiffports_create_subflows(struct sock *meta_sk)
++{
++ const struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct ndiffports_priv *pm_priv = (struct ndiffports_priv *)&mpcb->mptcp_pm[0];
++
++ if (mpcb->infinite_mapping_snd || mpcb->infinite_mapping_rcv ||
++ mpcb->send_infinite_mapping ||
++ mpcb->server_side || sock_flag(meta_sk, SOCK_DEAD))
++ return;
++
++ if (!work_pending(&pm_priv->subflow_work)) {
++ sock_hold(meta_sk);
++ queue_work(mptcp_wq, &pm_priv->subflow_work);
++ }
++}
++
++static int ndiffports_get_local_id(sa_family_t family, union inet_addr *addr,
++ struct net *net, bool *low_prio)
++{
++ return 0;
++}
++
++static struct mptcp_pm_ops ndiffports __read_mostly = {
++ .new_session = ndiffports_new_session,
++ .fully_established = ndiffports_create_subflows,
++ .get_local_id = ndiffports_get_local_id,
++ .name = "ndiffports",
++ .owner = THIS_MODULE,
++};
++
++/* General initialization of MPTCP_PM */
++static int __init ndiffports_register(void)
++{
++ BUILD_BUG_ON(sizeof(struct ndiffports_priv) > MPTCP_PM_SIZE);
++
++ if (mptcp_register_path_manager(&ndiffports))
++ goto exit;
++
++ return 0;
++
++exit:
++ return -1;
++}
++
++static void ndiffports_unregister(void)
++{
++ mptcp_unregister_path_manager(&ndiffports);
++}
++
++module_init(ndiffports_register);
++module_exit(ndiffports_unregister);
++
++MODULE_AUTHOR("Christoph Paasch");
++MODULE_LICENSE("GPL");
++MODULE_DESCRIPTION("NDIFF-PORTS MPTCP");
++MODULE_VERSION("0.88");
+diff --git a/net/mptcp/mptcp_ofo_queue.c b/net/mptcp/mptcp_ofo_queue.c
+new file mode 100644
+index 000000000000..ec4e98622637
+--- /dev/null
++++ b/net/mptcp/mptcp_ofo_queue.c
+@@ -0,0 +1,295 @@
++/*
++ * MPTCP implementation - Fast algorithm for MPTCP meta-reordering
++ *
++ * Initial Design & Implementation:
++ * Sébastien Barré <sebastien.barre@uclouvain.be>
++ *
++ * Current Maintainer & Author:
++ * Christoph Paasch <christoph.paasch@uclouvain.be>
++ *
++ * Additional authors:
++ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
++ * Gregory Detal <gregory.detal@uclouvain.be>
++ * Fabien Duchêne <fabien.duchene@uclouvain.be>
++ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
++ * Lavkesh Lahngir <lavkesh51@gmail.com>
++ * Andreas Ripke <ripke@neclab.eu>
++ * Vlad Dogaru <vlad.dogaru@intel.com>
++ * Octavian Purdila <octavian.purdila@intel.com>
++ * John Ronan <jronan@tssg.org>
++ * Catalin Nicutar <catalin.nicutar@gmail.com>
++ * Brandon Heller <brandonh@stanford.edu>
++ *
++ * This program is free software; you can redistribute it and/or
++ * modify it under the terms of the GNU General Public License
++ * as published by the Free Software Foundation; either version
++ * 2 of the License, or (at your option) any later version.
++ */
++
++#include <linux/skbuff.h>
++#include <linux/slab.h>
++#include <net/tcp.h>
++#include <net/mptcp.h>
++
++void mptcp_remove_shortcuts(const struct mptcp_cb *mpcb,
++ const struct sk_buff *skb)
++{
++ struct tcp_sock *tp;
++
++ mptcp_for_each_tp(mpcb, tp) {
++ if (tp->mptcp->shortcut_ofoqueue == skb) {
++ tp->mptcp->shortcut_ofoqueue = NULL;
++ return;
++ }
++ }
++}
++
++/* Does 'skb' fit after 'here' in the queue 'head'?
++ * If yes, we queue it and return 1.
++ */
++static int mptcp_ofo_queue_after(struct sk_buff_head *head,
++ struct sk_buff *skb, struct sk_buff *here,
++ const struct tcp_sock *tp)
++{
++ struct sock *meta_sk = tp->meta_sk;
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ u32 seq = TCP_SKB_CB(skb)->seq;
++ u32 end_seq = TCP_SKB_CB(skb)->end_seq;
++
++ /* We want to queue skb after here, thus seq >= end_seq */
++ if (before(seq, TCP_SKB_CB(here)->end_seq))
++ return 0;
++
++ if (seq == TCP_SKB_CB(here)->end_seq) {
++ bool fragstolen = false;
++
++ if (!tcp_try_coalesce(meta_sk, here, skb, &fragstolen)) {
++ __skb_queue_after(&meta_tp->out_of_order_queue, here, skb);
++ return 1;
++ } else {
++ kfree_skb_partial(skb, fragstolen);
++ return -1;
++ }
++ }
++
++ /* If here is the last one, we can always queue it */
++ if (skb_queue_is_last(head, here)) {
++ __skb_queue_after(head, here, skb);
++ return 1;
++ } else {
++ struct sk_buff *skb1 = skb_queue_next(head, here);
++ /* It's not the last one, but does it fit between 'here' and
++ * the one after 'here'? That is, is end_seq <= after_here->seq?
++ */
++ if (!after(end_seq, TCP_SKB_CB(skb1)->seq)) {
++ __skb_queue_after(head, here, skb);
++ return 1;
++ }
++ }
++
++ return 0;
++}
++
++static void try_shortcut(struct sk_buff *shortcut, struct sk_buff *skb,
++ struct sk_buff_head *head, struct tcp_sock *tp)
++{
++ struct sock *meta_sk = tp->meta_sk;
++ struct tcp_sock *tp_it, *meta_tp = tcp_sk(meta_sk);
++ struct mptcp_cb *mpcb = meta_tp->mpcb;
++ struct sk_buff *skb1, *best_shortcut = NULL;
++ u32 seq = TCP_SKB_CB(skb)->seq;
++ u32 end_seq = TCP_SKB_CB(skb)->end_seq;
++ u32 distance = 0xffffffff;
++
++ /* First, check the tp's shortcut */
++ if (!shortcut) {
++ if (skb_queue_empty(head)) {
++ __skb_queue_head(head, skb);
++ goto end;
++ }
++ } else {
++ int ret = mptcp_ofo_queue_after(head, skb, shortcut, tp);
++ /* Is the tp's shortcut a hit? If yes, we insert. */
++
++ if (ret) {
++ skb = (ret > 0) ? skb : NULL;
++ goto end;
++ }
++ }
++
++ /* Check the shortcuts of the other subsockets. */
++ mptcp_for_each_tp(mpcb, tp_it) {
++ shortcut = tp_it->mptcp->shortcut_ofoqueue;
++ /* Can we queue it here? If yes, do so! */
++ if (shortcut) {
++ int ret = mptcp_ofo_queue_after(head, skb, shortcut, tp);
++
++ if (ret) {
++ skb = (ret > 0) ? skb : NULL;
++ goto end;
++ }
++ }
++
++ /* Could not queue it, check if we are close.
++ * We are looking for a shortcut, close enough to seq to
++ * set skb1 prematurely and thus improve the subsequent lookup,
++ * which tries to find a skb1 so that skb1->seq <= seq.
++ *
++ * So, here we only take shortcuts whose shortcut->seq > seq,
++ * and minimize the distance between shortcut->seq and seq and
++ * set best_shortcut to this one with the minimal distance.
++ *
++ * That way, the subsequent while-loop is shortest.
++ */
++ if (shortcut && after(TCP_SKB_CB(shortcut)->seq, seq)) {
++ /* Are we closer than the current best shortcut? */
++ if ((u32)(TCP_SKB_CB(shortcut)->seq - seq) < distance) {
++ distance = (u32)(TCP_SKB_CB(shortcut)->seq - seq);
++ best_shortcut = shortcut;
++ }
++ }
++ }
++
++ if (best_shortcut)
++ skb1 = best_shortcut;
++ else
++ skb1 = skb_peek_tail(head);
++
++ if (seq == TCP_SKB_CB(skb1)->end_seq) {
++ bool fragstolen = false;
++
++ if (!tcp_try_coalesce(meta_sk, skb1, skb, &fragstolen)) {
++ __skb_queue_after(&meta_tp->out_of_order_queue, skb1, skb);
++ } else {
++ kfree_skb_partial(skb, fragstolen);
++ skb = NULL;
++ }
++
++ goto end;
++ }
++
++ /* Find the insertion point, starting from best_shortcut if available.
++ *
++ * Inspired from tcp_data_queue_ofo.
++ */
++ while (1) {
++ /* skb1->seq <= seq */
++ if (!after(TCP_SKB_CB(skb1)->seq, seq))
++ break;
++ if (skb_queue_is_first(head, skb1)) {
++ skb1 = NULL;
++ break;
++ }
++ skb1 = skb_queue_prev(head, skb1);
++ }
++
++ /* Does skb overlap the previous one? */
++ if (skb1 && before(seq, TCP_SKB_CB(skb1)->end_seq)) {
++ if (!after(end_seq, TCP_SKB_CB(skb1)->end_seq)) {
++ /* All the bits are present. */
++ __kfree_skb(skb);
++ skb = NULL;
++ goto end;
++ }
++ if (seq == TCP_SKB_CB(skb1)->seq) {
++ if (skb_queue_is_first(head, skb1))
++ skb1 = NULL;
++ else
++ skb1 = skb_queue_prev(head, skb1);
++ }
++ }
++ if (!skb1)
++ __skb_queue_head(head, skb);
++ else
++ __skb_queue_after(head, skb1, skb);
++
++ /* And clean segments covered by new one as whole. */
++ while (!skb_queue_is_last(head, skb)) {
++ skb1 = skb_queue_next(head, skb);
++
++ if (!after(end_seq, TCP_SKB_CB(skb1)->seq))
++ break;
++
++ __skb_unlink(skb1, head);
++ mptcp_remove_shortcuts(mpcb, skb1);
++ __kfree_skb(skb1);
++ }
++
++end:
++ if (skb) {
++ skb_set_owner_r(skb, meta_sk);
++ tp->mptcp->shortcut_ofoqueue = skb;
++ }
++
++ return;
++}
++
++/**
++ * @sk: the subflow that received this skb.
++ */
++void mptcp_add_meta_ofo_queue(const struct sock *meta_sk, struct sk_buff *skb,
++ struct sock *sk)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++
++ try_shortcut(tp->mptcp->shortcut_ofoqueue, skb,
++ &tcp_sk(meta_sk)->out_of_order_queue, tp);
++}
++
++bool mptcp_prune_ofo_queue(struct sock *sk)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ bool res = false;
++
++ if (!skb_queue_empty(&tp->out_of_order_queue)) {
++ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_OFOPRUNED);
++ mptcp_purge_ofo_queue(tp);
++
++ /* No sack at the mptcp-level */
++ sk_mem_reclaim(sk);
++ res = true;
++ }
++
++ return res;
++}
++
++void mptcp_ofo_queue(struct sock *meta_sk)
++{
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct sk_buff *skb;
++
++ while ((skb = skb_peek(&meta_tp->out_of_order_queue)) != NULL) {
++ u32 old_rcv_nxt = meta_tp->rcv_nxt;
++ if (after(TCP_SKB_CB(skb)->seq, meta_tp->rcv_nxt))
++ break;
++
++ if (!after(TCP_SKB_CB(skb)->end_seq, meta_tp->rcv_nxt)) {
++ __skb_unlink(skb, &meta_tp->out_of_order_queue);
++ mptcp_remove_shortcuts(meta_tp->mpcb, skb);
++ __kfree_skb(skb);
++ continue;
++ }
++
++ __skb_unlink(skb, &meta_tp->out_of_order_queue);
++ mptcp_remove_shortcuts(meta_tp->mpcb, skb);
++
++ __skb_queue_tail(&meta_sk->sk_receive_queue, skb);
++ meta_tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq;
++ mptcp_check_rcvseq_wrap(meta_tp, old_rcv_nxt);
++
++ if (tcp_hdr(skb)->fin)
++ mptcp_fin(meta_sk);
++ }
++}
++
++void mptcp_purge_ofo_queue(struct tcp_sock *meta_tp)
++{
++ struct sk_buff_head *head = &meta_tp->out_of_order_queue;
++ struct sk_buff *skb, *tmp;
++
++ skb_queue_walk_safe(head, skb, tmp) {
++ __skb_unlink(skb, head);
++ mptcp_remove_shortcuts(meta_tp->mpcb, skb);
++ kfree_skb(skb);
++ }
++}
+diff --git a/net/mptcp/mptcp_olia.c b/net/mptcp/mptcp_olia.c
+new file mode 100644
+index 000000000000..53f5c43bb488
+--- /dev/null
++++ b/net/mptcp/mptcp_olia.c
+@@ -0,0 +1,311 @@
++/*
++ * MPTCP implementation - OPPORTUNISTIC LINKED INCREASES CONGESTION CONTROL:
++ *
++ * Algorithm design:
++ * Ramin Khalili <ramin.khalili@epfl.ch>
++ * Nicolas Gast <nicolas.gast@epfl.ch>
++ * Jean-Yves Le Boudec <jean-yves.leboudec@epfl.ch>
++ *
++ * Implementation:
++ * Ramin Khalili <ramin.khalili@epfl.ch>
++ *
++ * Ported to the official MPTCP-kernel:
++ * Christoph Paasch <christoph.paasch@uclouvain.be>
++ *
++ * This program is free software; you can redistribute it and/or
++ * modify it under the terms of the GNU General Public License
++ * as published by the Free Software Foundation; either version
++ * 2 of the License, or (at your option) any later version.
++ */
++
++
++#include <net/tcp.h>
++#include <net/mptcp.h>
++
++#include <linux/module.h>
++
++static int scale = 10;
++
++struct mptcp_olia {
++ u32 mptcp_loss1;
++ u32 mptcp_loss2;
++ u32 mptcp_loss3;
++ int epsilon_num;
++ u32 epsilon_den;
++ int mptcp_snd_cwnd_cnt;
++};
++
++static inline int mptcp_olia_sk_can_send(const struct sock *sk)
++{
++ return mptcp_sk_can_send(sk) && tcp_sk(sk)->srtt_us;
++}
++
++static inline u64 mptcp_olia_scale(u64 val, int scale)
++{
++ return (u64) val << scale;
++}
++
++/* account for the artificial inflation of cwnd (see RFC 5681)
++ * during the fast-retransmit phase
++ */
++static u32 mptcp_get_crt_cwnd(struct sock *sk)
++{
++ const struct inet_connection_sock *icsk = inet_csk(sk);
++
++ if (icsk->icsk_ca_state == TCP_CA_Recovery)
++ return tcp_sk(sk)->snd_ssthresh;
++ else
++ return tcp_sk(sk)->snd_cwnd;
++}
++
++/* return the denominator of the first term of the increase equation */
++static u64 mptcp_get_rate(const struct mptcp_cb *mpcb, u32 path_rtt)
++{
++ struct sock *sk;
++ u64 rate = 1; /* We have to avoid a zero-rate because it is used as a divisor */
++
++ mptcp_for_each_sk(mpcb, sk) {
++ struct tcp_sock *tp = tcp_sk(sk);
++ u64 scaled_num;
++ u32 tmp_cwnd;
++
++ if (!mptcp_olia_sk_can_send(sk))
++ continue;
++
++ tmp_cwnd = mptcp_get_crt_cwnd(sk);
++ scaled_num = mptcp_olia_scale(tmp_cwnd, scale) * path_rtt;
++ rate += div_u64(scaled_num , tp->srtt_us);
++ }
++ rate *= rate;
++ return rate;
++}
++
++/* find the maximum cwnd, used to find set M */
++static u32 mptcp_get_max_cwnd(const struct mptcp_cb *mpcb)
++{
++ struct sock *sk;
++ u32 best_cwnd = 0;
++
++ mptcp_for_each_sk(mpcb, sk) {
++ u32 tmp_cwnd;
++
++ if (!mptcp_olia_sk_can_send(sk))
++ continue;
++
++ tmp_cwnd = mptcp_get_crt_cwnd(sk);
++ if (tmp_cwnd > best_cwnd)
++ best_cwnd = tmp_cwnd;
++ }
++ return best_cwnd;
++}
++
++static void mptcp_get_epsilon(const struct mptcp_cb *mpcb)
++{
++ struct mptcp_olia *ca;
++ struct tcp_sock *tp;
++ struct sock *sk;
++ u64 tmp_int, tmp_rtt, best_int = 0, best_rtt = 1;
++ u32 max_cwnd = 1, best_cwnd = 1, tmp_cwnd;
++ u8 M = 0, B_not_M = 0;
++
++ /* TODO - integrate this in the following loop - we just want to iterate once */
++
++ max_cwnd = mptcp_get_max_cwnd(mpcb);
++
++ /* find the best path */
++ mptcp_for_each_sk(mpcb, sk) {
++ tp = tcp_sk(sk);
++ ca = inet_csk_ca(sk);
++
++ if (!mptcp_olia_sk_can_send(sk))
++ continue;
++
++ tmp_rtt = (u64)tp->srtt_us * tp->srtt_us;
++ /* TODO - check here and rename variables */
++ tmp_int = max(ca->mptcp_loss3 - ca->mptcp_loss2,
++ ca->mptcp_loss2 - ca->mptcp_loss1);
++
++ tmp_cwnd = mptcp_get_crt_cwnd(sk);
++ if ((u64)tmp_int * best_rtt >= (u64)best_int * tmp_rtt) {
++ best_rtt = tmp_rtt;
++ best_int = tmp_int;
++ best_cwnd = tmp_cwnd;
++ }
++ }
++
++ /* TODO - integrate this here in mptcp_get_max_cwnd and in the previous loop */
++ /* find the size of M and B_not_M */
++ mptcp_for_each_sk(mpcb, sk) {
++ tp = tcp_sk(sk);
++ ca = inet_csk_ca(sk);
++
++ if (!mptcp_olia_sk_can_send(sk))
++ continue;
++
++ tmp_cwnd = mptcp_get_crt_cwnd(sk);
++ if (tmp_cwnd == max_cwnd) {
++ M++;
++ } else {
++ tmp_rtt = (u64)tp->srtt_us * tp->srtt_us;
++ tmp_int = max(ca->mptcp_loss3 - ca->mptcp_loss2,
++ ca->mptcp_loss2 - ca->mptcp_loss1);
++
++ if ((u64)tmp_int * best_rtt == (u64)best_int * tmp_rtt)
++ B_not_M++;
++ }
++ }
++
++ /* check if the path is in M or B_not_M and set the value of epsilon accordingly */
++ mptcp_for_each_sk(mpcb, sk) {
++ tp = tcp_sk(sk);
++ ca = inet_csk_ca(sk);
++
++ if (!mptcp_olia_sk_can_send(sk))
++ continue;
++
++ if (B_not_M == 0) {
++ ca->epsilon_num = 0;
++ ca->epsilon_den = 1;
++ } else {
++ tmp_rtt = (u64)tp->srtt_us * tp->srtt_us;
++ tmp_int = max(ca->mptcp_loss3 - ca->mptcp_loss2,
++ ca->mptcp_loss2 - ca->mptcp_loss1);
++ tmp_cwnd = mptcp_get_crt_cwnd(sk);
++
++ if (tmp_cwnd < max_cwnd &&
++ (u64)tmp_int * best_rtt == (u64)best_int * tmp_rtt) {
++ ca->epsilon_num = 1;
++ ca->epsilon_den = mpcb->cnt_established * B_not_M;
++ } else if (tmp_cwnd == max_cwnd) {
++ ca->epsilon_num = -1;
++ ca->epsilon_den = mpcb->cnt_established * M;
++ } else {
++ ca->epsilon_num = 0;
++ ca->epsilon_den = 1;
++ }
++ }
++ }
++}
++
++/* setting the initial values */
++static void mptcp_olia_init(struct sock *sk)
++{
++ const struct tcp_sock *tp = tcp_sk(sk);
++ struct mptcp_olia *ca = inet_csk_ca(sk);
++
++ if (mptcp(tp)) {
++ ca->mptcp_loss1 = tp->snd_una;
++ ca->mptcp_loss2 = tp->snd_una;
++ ca->mptcp_loss3 = tp->snd_una;
++ ca->mptcp_snd_cwnd_cnt = 0;
++ ca->epsilon_num = 0;
++ ca->epsilon_den = 1;
++ }
++}
++
++/* updating inter-loss distance and ssthresh */
++static void mptcp_olia_set_state(struct sock *sk, u8 new_state)
++{
++ if (!mptcp(tcp_sk(sk)))
++ return;
++
++ if (new_state == TCP_CA_Loss ||
++ new_state == TCP_CA_Recovery || new_state == TCP_CA_CWR) {
++ struct mptcp_olia *ca = inet_csk_ca(sk);
++
++ if (ca->mptcp_loss3 != ca->mptcp_loss2 &&
++ !inet_csk(sk)->icsk_retransmits) {
++ ca->mptcp_loss1 = ca->mptcp_loss2;
++ ca->mptcp_loss2 = ca->mptcp_loss3;
++ }
++ }
++}
++
++/* main algorithm */
++static void mptcp_olia_cong_avoid(struct sock *sk, u32 ack, u32 acked)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct mptcp_olia *ca = inet_csk_ca(sk);
++ const struct mptcp_cb *mpcb = tp->mpcb;
++
++ u64 inc_num, inc_den, rate, cwnd_scaled;
++
++ if (!mptcp(tp)) {
++ tcp_reno_cong_avoid(sk, ack, acked);
++ return;
++ }
++
++ ca->mptcp_loss3 = tp->snd_una;
++
++ if (!tcp_is_cwnd_limited(sk))
++ return;
++
++ /* slow start if it is in the safe area */
++ if (tp->snd_cwnd <= tp->snd_ssthresh) {
++ tcp_slow_start(tp, acked);
++ return;
++ }
++
++ mptcp_get_epsilon(mpcb);
++ rate = mptcp_get_rate(mpcb, tp->srtt_us);
++ cwnd_scaled = mptcp_olia_scale(tp->snd_cwnd, scale);
++ inc_den = ca->epsilon_den * tp->snd_cwnd * rate ? : 1;
++
++ /* calculate the increasing term, scaling is used to reduce the rounding effect */
++ if (ca->epsilon_num == -1) {
++ if (ca->epsilon_den * cwnd_scaled * cwnd_scaled < rate) {
++ inc_num = rate - ca->epsilon_den *
++ cwnd_scaled * cwnd_scaled;
++ ca->mptcp_snd_cwnd_cnt -= div64_u64(
++ mptcp_olia_scale(inc_num, scale), inc_den);
++ } else {
++ inc_num = ca->epsilon_den *
++ cwnd_scaled * cwnd_scaled - rate;
++ ca->mptcp_snd_cwnd_cnt += div64_u64(
++ mptcp_olia_scale(inc_num, scale), inc_den);
++ }
++ } else {
++ inc_num = ca->epsilon_num * rate +
++ ca->epsilon_den * cwnd_scaled * cwnd_scaled;
++ ca->mptcp_snd_cwnd_cnt += div64_u64(
++ mptcp_olia_scale(inc_num, scale), inc_den);
++ }
++
++
++ if (ca->mptcp_snd_cwnd_cnt >= (1 << scale) - 1) {
++ if (tp->snd_cwnd < tp->snd_cwnd_clamp)
++ tp->snd_cwnd++;
++ ca->mptcp_snd_cwnd_cnt = 0;
++ } else if (ca->mptcp_snd_cwnd_cnt <= 0 - (1 << scale) + 1) {
++ tp->snd_cwnd = max((int)1, (int)tp->snd_cwnd - 1);
++ ca->mptcp_snd_cwnd_cnt = 0;
++ }
++}
++
++static struct tcp_congestion_ops mptcp_olia = {
++ .init = mptcp_olia_init,
++ .ssthresh = tcp_reno_ssthresh,
++ .cong_avoid = mptcp_olia_cong_avoid,
++ .set_state = mptcp_olia_set_state,
++ .owner = THIS_MODULE,
++ .name = "olia",
++};
++
++static int __init mptcp_olia_register(void)
++{
++ BUILD_BUG_ON(sizeof(struct mptcp_olia) > ICSK_CA_PRIV_SIZE);
++ return tcp_register_congestion_control(&mptcp_olia);
++}
++
++static void __exit mptcp_olia_unregister(void)
++{
++ tcp_unregister_congestion_control(&mptcp_olia);
++}
++
++module_init(mptcp_olia_register);
++module_exit(mptcp_olia_unregister);
++
++MODULE_AUTHOR("Ramin Khalili, Nicolas Gast, Jean-Yves Le Boudec");
++MODULE_LICENSE("GPL");
++MODULE_DESCRIPTION("MPTCP COUPLED CONGESTION CONTROL");
++MODULE_VERSION("0.1");
+diff --git a/net/mptcp/mptcp_output.c b/net/mptcp/mptcp_output.c
+new file mode 100644
+index 000000000000..400ea254c078
+--- /dev/null
++++ b/net/mptcp/mptcp_output.c
+@@ -0,0 +1,1743 @@
++/*
++ * MPTCP implementation - Sending side
++ *
++ * Initial Design & Implementation:
++ * Sébastien Barré <sebastien.barre@uclouvain.be>
++ *
++ * Current Maintainer & Author:
++ * Christoph Paasch <christoph.paasch@uclouvain.be>
++ *
++ * Additional authors:
++ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
++ * Gregory Detal <gregory.detal@uclouvain.be>
++ * Fabien Duchêne <fabien.duchene@uclouvain.be>
++ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
++ * Lavkesh Lahngir <lavkesh51@gmail.com>
++ * Andreas Ripke <ripke@neclab.eu>
++ * Vlad Dogaru <vlad.dogaru@intel.com>
++ * Octavian Purdila <octavian.purdila@intel.com>
++ * John Ronan <jronan@tssg.org>
++ * Catalin Nicutar <catalin.nicutar@gmail.com>
++ * Brandon Heller <brandonh@stanford.edu>
++ *
++ *
++ * This program is free software; you can redistribute it and/or
++ * modify it under the terms of the GNU General Public License
++ * as published by the Free Software Foundation; either version
++ * 2 of the License, or (at your option) any later version.
++ */
++
++#include <linux/kconfig.h>
++#include <linux/skbuff.h>
++#include <linux/tcp.h>
++
++#include <net/mptcp.h>
++#include <net/mptcp_v4.h>
++#include <net/mptcp_v6.h>
++#include <net/sock.h>
++
++static const int mptcp_dss_len = MPTCP_SUB_LEN_DSS_ALIGN +
++ MPTCP_SUB_LEN_ACK_ALIGN +
++ MPTCP_SUB_LEN_SEQ_ALIGN;
++
++static inline int mptcp_sub_len_remove_addr(u16 bitfield)
++{
++ unsigned int c;
++ for (c = 0; bitfield; c++)
++ bitfield &= bitfield - 1;
++ return MPTCP_SUB_LEN_REMOVE_ADDR + c - 1;
++}
++
++int mptcp_sub_len_remove_addr_align(u16 bitfield)
++{
++ return ALIGN(mptcp_sub_len_remove_addr(bitfield), 4);
++}
++EXPORT_SYMBOL(mptcp_sub_len_remove_addr_align);
++
++/* get the data-seq and end-data-seq and store them again in the
++ * tcp_skb_cb
++ */
++static int mptcp_reconstruct_mapping(struct sk_buff *skb)
++{
++ const struct mp_dss *mpdss = (struct mp_dss *)TCP_SKB_CB(skb)->dss;
++ u32 *p32;
++ u16 *p16;
++
++ if (!mpdss->M)
++ return 1;
++
++ /* Move the pointer to the data-seq */
++ p32 = (u32 *)mpdss;
++ p32++;
++ if (mpdss->A) {
++ p32++;
++ if (mpdss->a)
++ p32++;
++ }
++
++ TCP_SKB_CB(skb)->seq = ntohl(*p32);
++
++ /* Get the data_len to calculate the end_data_seq */
++ p32++;
++ p32++;
++ p16 = (u16 *)p32;
++ TCP_SKB_CB(skb)->end_seq = ntohs(*p16) + TCP_SKB_CB(skb)->seq;
++
++ return 0;
++}
++
++static void mptcp_find_and_set_pathmask(const struct sock *meta_sk, struct sk_buff *skb)
++{
++ struct sk_buff *skb_it;
++
++ skb_it = tcp_write_queue_head(meta_sk);
++
++ tcp_for_write_queue_from(skb_it, meta_sk) {
++ if (skb_it == tcp_send_head(meta_sk))
++ break;
++
++ if (TCP_SKB_CB(skb_it)->seq == TCP_SKB_CB(skb)->seq) {
++ TCP_SKB_CB(skb)->path_mask = TCP_SKB_CB(skb_it)->path_mask;
++ break;
++ }
++ }
++}
++
++/* Reinject data from one TCP subflow to the meta_sk. If sk == NULL, we are
++ * coming from the meta-retransmit-timer
++ */
++static void __mptcp_reinject_data(struct sk_buff *orig_skb, struct sock *meta_sk,
++ struct sock *sk, int clone_it)
++{
++ struct sk_buff *skb, *skb1;
++ const struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct mptcp_cb *mpcb = meta_tp->mpcb;
++ u32 seq, end_seq;
++
++ if (clone_it) {
++ /* pskb_copy is necessary here, because the TCP/IP-headers
++ * will be changed when it's going to be reinjected on another
++ * subflow.
++ */
++ skb = pskb_copy_for_clone(orig_skb, GFP_ATOMIC);
++ } else {
++ __skb_unlink(orig_skb, &sk->sk_write_queue);
++ sock_set_flag(sk, SOCK_QUEUE_SHRUNK);
++ sk->sk_wmem_queued -= orig_skb->truesize;
++ sk_mem_uncharge(sk, orig_skb->truesize);
++ skb = orig_skb;
++ }
++ if (unlikely(!skb))
++ return;
++
++ if (sk && mptcp_reconstruct_mapping(skb)) {
++ __kfree_skb(skb);
++ return;
++ }
++
++ skb->sk = meta_sk;
++
++ /* If it has already reached the destination, we don't have to reinject it */
++ if (!after(TCP_SKB_CB(skb)->end_seq, meta_tp->snd_una)) {
++ __kfree_skb(skb);
++ return;
++ }
++
++ /* Only reinject segments that are fully covered by the mapping */
++ if (skb->len + (mptcp_is_data_fin(skb) ? 1 : 0) !=
++ TCP_SKB_CB(skb)->end_seq - TCP_SKB_CB(skb)->seq) {
++ u32 seq = TCP_SKB_CB(skb)->seq;
++ u32 end_seq = TCP_SKB_CB(skb)->end_seq;
++
++ __kfree_skb(skb);
++
++ /* Ok, now we have to look for the full mapping in the meta
++ * send-queue :S
++ */
++ tcp_for_write_queue(skb, meta_sk) {
++ /* Not yet at the mapping? */
++ if (before(TCP_SKB_CB(skb)->seq, seq))
++ continue;
++ /* We have passed by the mapping */
++ if (after(TCP_SKB_CB(skb)->end_seq, end_seq))
++ return;
++
++ __mptcp_reinject_data(skb, meta_sk, NULL, 1);
++ }
++ return;
++ }
++
++ /* Segment goes back to the MPTCP-layer. So, we need to zero the
++ * path_mask/dss.
++ */
++ memset(TCP_SKB_CB(skb)->dss, 0, mptcp_dss_len);
++
++ /* We need to find out the path-mask from the meta-write-queue
++ * to properly select a subflow.
++ */
++ mptcp_find_and_set_pathmask(meta_sk, skb);
++
++ /* If it's empty, just add */
++ if (skb_queue_empty(&mpcb->reinject_queue)) {
++ skb_queue_head(&mpcb->reinject_queue, skb);
++ return;
++ }
++
++ /* Find place to insert skb - or even we can 'drop' it, as the
++ * data is already covered by other skb's in the reinject-queue.
++ *
++ * This is inspired by code from tcp_data_queue.
++ */
++
++ skb1 = skb_peek_tail(&mpcb->reinject_queue);
++ seq = TCP_SKB_CB(skb)->seq;
++ while (1) {
++ if (!after(TCP_SKB_CB(skb1)->seq, seq))
++ break;
++ if (skb_queue_is_first(&mpcb->reinject_queue, skb1)) {
++ skb1 = NULL;
++ break;
++ }
++ skb1 = skb_queue_prev(&mpcb->reinject_queue, skb1);
++ }
++
++ /* Does the skb overlap the previous one? */
++ end_seq = TCP_SKB_CB(skb)->end_seq;
++ if (skb1 && before(seq, TCP_SKB_CB(skb1)->end_seq)) {
++ if (!after(end_seq, TCP_SKB_CB(skb1)->end_seq)) {
++ /* All the bits are present. Don't reinject */
++ __kfree_skb(skb);
++ return;
++ }
++ if (seq == TCP_SKB_CB(skb1)->seq) {
++ if (skb_queue_is_first(&mpcb->reinject_queue, skb1))
++ skb1 = NULL;
++ else
++ skb1 = skb_queue_prev(&mpcb->reinject_queue, skb1);
++ }
++ }
++ if (!skb1)
++ __skb_queue_head(&mpcb->reinject_queue, skb);
++ else
++ __skb_queue_after(&mpcb->reinject_queue, skb1, skb);
++
++ /* And clean segments covered by new one as whole. */
++ while (!skb_queue_is_last(&mpcb->reinject_queue, skb)) {
++ skb1 = skb_queue_next(&mpcb->reinject_queue, skb);
++
++ if (!after(end_seq, TCP_SKB_CB(skb1)->seq))
++ break;
++
++ __skb_unlink(skb1, &mpcb->reinject_queue);
++ __kfree_skb(skb1);
++ }
++ return;
++}
++
++/* Inserts data into the reinject queue */
++void mptcp_reinject_data(struct sock *sk, int clone_it)
++{
++ struct sk_buff *skb_it, *tmp;
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct sock *meta_sk = tp->meta_sk;
++
++ /* It has already been closed - there is really no point in reinjecting */
++ if (meta_sk->sk_state == TCP_CLOSE)
++ return;
++
++ skb_queue_walk_safe(&sk->sk_write_queue, skb_it, tmp) {
++ struct tcp_skb_cb *tcb = TCP_SKB_CB(skb_it);
++ /* Subflow SYNs and FINs are not reinjected,
++ *
++ * nor are empty subflow-FINs carrying a data-fin;
++ * those are reinjected below (without the subflow-FIN flag).
++ */
++ if (tcb->tcp_flags & TCPHDR_SYN ||
++ (tcb->tcp_flags & TCPHDR_FIN && !mptcp_is_data_fin(skb_it)) ||
++ (tcb->tcp_flags & TCPHDR_FIN && mptcp_is_data_fin(skb_it) && !skb_it->len))
++ continue;
++
++ __mptcp_reinject_data(skb_it, meta_sk, sk, clone_it);
++ }
++
++ skb_it = tcp_write_queue_tail(meta_sk);
++ /* If sk has sent the empty data-fin, we have to reinject it too. */
++ if (skb_it && mptcp_is_data_fin(skb_it) && skb_it->len == 0 &&
++ TCP_SKB_CB(skb_it)->path_mask & mptcp_pi_to_flag(tp->mptcp->path_index)) {
++ __mptcp_reinject_data(skb_it, meta_sk, NULL, 1);
++ }
++
++ mptcp_push_pending_frames(meta_sk);
++
++ tp->pf = 1;
++}
++EXPORT_SYMBOL(mptcp_reinject_data);
++
++static void mptcp_combine_dfin(const struct sk_buff *skb, const struct sock *meta_sk,
++ struct sock *subsk)
++{
++ const struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct mptcp_cb *mpcb = meta_tp->mpcb;
++ struct sock *sk_it;
++ int all_empty = 1, all_acked;
++
++ /* In infinite mapping we always try to combine */
++ if (mpcb->infinite_mapping_snd && tcp_close_state(subsk)) {
++ subsk->sk_shutdown |= SEND_SHUTDOWN;
++ TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_FIN;
++ return;
++ }
++
++ /* Don't combine if they didn't combine - otherwise we end up in
++ * TIME_WAIT, even if our app is smart enough to avoid it
++ */
++ if (meta_sk->sk_shutdown & RCV_SHUTDOWN) {
++ if (!mpcb->dfin_combined)
++ return;
++ }
++
++ /* If no other subflow has data to send, we can combine */
++ mptcp_for_each_sk(mpcb, sk_it) {
++ if (!mptcp_sk_can_send(sk_it))
++ continue;
++
++ if (!tcp_write_queue_empty(sk_it))
++ all_empty = 0;
++ }
++
++ /* If all data has been DATA_ACKed, we can combine.
++ * -1, because the data_fin consumed one byte
++ */
++ all_acked = (meta_tp->snd_una == (meta_tp->write_seq - 1));
++
++ if ((all_empty || all_acked) && tcp_close_state(subsk)) {
++ subsk->sk_shutdown |= SEND_SHUTDOWN;
++ TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_FIN;
++ }
++}
++
++static int mptcp_write_dss_mapping(const struct tcp_sock *tp, const struct sk_buff *skb,
++ __be32 *ptr)
++{
++ const struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
++ __be32 *start = ptr;
++ __u16 data_len;
++
++ *ptr++ = htonl(tcb->seq); /* data_seq */
++
++ /* If it's a non-data DATA_FIN, we set subseq to 0 (draft v7) */
++ if (mptcp_is_data_fin(skb) && skb->len == 0)
++ *ptr++ = 0; /* subseq */
++ else
++ *ptr++ = htonl(tp->write_seq - tp->mptcp->snt_isn); /* subseq */
++
++ if (tcb->mptcp_flags & MPTCPHDR_INF)
++ data_len = 0;
++ else
++ data_len = tcb->end_seq - tcb->seq;
++
++ if (tp->mpcb->dss_csum && data_len) {
++ __be16 *p16 = (__be16 *)ptr;
++ __be32 hdseq = mptcp_get_highorder_sndbits(skb, tp->mpcb);
++ __wsum csum;
++
++ *ptr = htonl(((data_len) << 16) |
++ (TCPOPT_EOL << 8) |
++ (TCPOPT_EOL));
++ csum = csum_partial(ptr - 2, 12, skb->csum);
++ p16++;
++ *p16++ = csum_fold(csum_partial(&hdseq, sizeof(hdseq), csum));
++ } else {
++ *ptr++ = htonl(((data_len) << 16) |
++ (TCPOPT_NOP << 8) |
++ (TCPOPT_NOP));
++ }
++
++ return ptr - start;
++}
++
++static int mptcp_write_dss_data_ack(const struct tcp_sock *tp, const struct sk_buff *skb,
++ __be32 *ptr)
++{
++ struct mp_dss *mdss = (struct mp_dss *)ptr;
++ __be32 *start = ptr;
++
++ mdss->kind = TCPOPT_MPTCP;
++ mdss->sub = MPTCP_SUB_DSS;
++ mdss->rsv1 = 0;
++ mdss->rsv2 = 0;
++ mdss->F = mptcp_is_data_fin(skb) ? 1 : 0;
++ mdss->m = 0;
++ mdss->M = mptcp_is_data_seq(skb) ? 1 : 0;
++ mdss->a = 0;
++ mdss->A = 1;
++ mdss->len = mptcp_sub_len_dss(mdss, tp->mpcb->dss_csum);
++ ptr++;
++
++ *ptr++ = htonl(mptcp_meta_tp(tp)->rcv_nxt);
++
++ return ptr - start;
++}
++
++/* RFC6824 states that once a particular subflow mapping has been sent
++ * out it must never be changed. However, packets may be split while
++ * they are in the retransmission queue (due to SACK or ACKs) and that
++ * arguably means that we would change the mapping (e.g. it splits it,
++ * or sends out a subset of the initial mapping).
++ *
++ * Furthermore, the skb checksum is not always preserved across splits
++ * (e.g. mptcp_fragment) which would mean that we need to recompute
++ * the DSS checksum in this case.
++ *
++ * To avoid this we save the initial DSS mapping which allows us to
++ * send the same DSS mapping even for fragmented retransmits.
++ */
++static void mptcp_save_dss_data_seq(const struct tcp_sock *tp, struct sk_buff *skb)
++{
++ struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
++ __be32 *ptr = (__be32 *)tcb->dss;
++
++ tcb->mptcp_flags |= MPTCPHDR_SEQ;
++
++ ptr += mptcp_write_dss_data_ack(tp, skb, ptr);
++ ptr += mptcp_write_dss_mapping(tp, skb, ptr);
++}
++
++/* Write the saved DSS mapping to the header */
++static int mptcp_write_dss_data_seq(const struct tcp_sock *tp, struct sk_buff *skb,
++ __be32 *ptr)
++{
++ __be32 *start = ptr;
++
++ memcpy(ptr, TCP_SKB_CB(skb)->dss, mptcp_dss_len);
++
++ /* update the data_ack */
++ start[1] = htonl(mptcp_meta_tp(tp)->rcv_nxt);
++
++ /* dss is in a union with inet_skb_parm and
++ * the IP layer expects zeroed IPCB fields.
++ */
++ memset(TCP_SKB_CB(skb)->dss, 0, mptcp_dss_len);
++
++ return mptcp_dss_len/sizeof(*ptr);
++}
++
++static bool mptcp_skb_entail(struct sock *sk, struct sk_buff *skb, int reinject)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ const struct sock *meta_sk = mptcp_meta_sk(sk);
++ const struct mptcp_cb *mpcb = tp->mpcb;
++ struct tcp_skb_cb *tcb;
++ struct sk_buff *subskb = NULL;
++
++ if (!reinject)
++ TCP_SKB_CB(skb)->mptcp_flags |= (mpcb->snd_hiseq_index ?
++ MPTCPHDR_SEQ64_INDEX : 0);
++
++ subskb = pskb_copy_for_clone(skb, GFP_ATOMIC);
++ if (!subskb)
++ return false;
++
++ /* At the subflow-level we need to call tcp_init_tso_segs again. We
++ * force this, by setting gso_segs to 0. It has been set to 1 prior to
++ * the call to mptcp_skb_entail.
++ */
++ skb_shinfo(subskb)->gso_segs = 0;
++
++ TCP_SKB_CB(skb)->path_mask |= mptcp_pi_to_flag(tp->mptcp->path_index);
++
++ if (!(sk->sk_route_caps & NETIF_F_ALL_CSUM) &&
++ skb->ip_summed == CHECKSUM_PARTIAL) {
++ subskb->csum = skb->csum = skb_checksum(skb, 0, skb->len, 0);
++ subskb->ip_summed = skb->ip_summed = CHECKSUM_NONE;
++ }
++
++ tcb = TCP_SKB_CB(subskb);
++
++ if (tp->mpcb->send_infinite_mapping &&
++ !tp->mpcb->infinite_mapping_snd &&
++ !before(tcb->seq, mptcp_meta_tp(tp)->snd_nxt)) {
++ tp->mptcp->fully_established = 1;
++ tp->mpcb->infinite_mapping_snd = 1;
++ tp->mptcp->infinite_cutoff_seq = tp->write_seq;
++ tcb->mptcp_flags |= MPTCPHDR_INF;
++ }
++
++ if (mptcp_is_data_fin(subskb))
++ mptcp_combine_dfin(subskb, meta_sk, sk);
++
++ mptcp_save_dss_data_seq(tp, subskb);
++
++ tcb->seq = tp->write_seq;
++ tcb->sacked = 0; /* reset the sacked field: from the point of view
++ * of this subflow, we are sending a brand new
++ * segment
++ */
++ /* Take into account seg len */
++ tp->write_seq += subskb->len + ((tcb->tcp_flags & TCPHDR_FIN) ? 1 : 0);
++ tcb->end_seq = tp->write_seq;
++
++ /* If it's a non-payload DATA_FIN (also no subflow-fin), the
++ * segment is not part of the subflow but on a meta-only-level.
++ */
++ if (!mptcp_is_data_fin(subskb) || tcb->end_seq != tcb->seq) {
++ tcp_add_write_queue_tail(sk, subskb);
++ sk->sk_wmem_queued += subskb->truesize;
++ sk_mem_charge(sk, subskb->truesize);
++ } else {
++ int err;
++
++ /* Necessary to initialize for tcp_transmit_skb. mss of 1, as
++ * skb->len = 0 will force tso_segs to 1.
++ */
++ tcp_init_tso_segs(sk, subskb, 1);
++ /* Empty data-fins are sent immediately on the subflow */
++ TCP_SKB_CB(subskb)->when = tcp_time_stamp;
++ err = tcp_transmit_skb(sk, subskb, 1, GFP_ATOMIC);
++
++ /* It has not been queued, we can free it now. */
++ kfree_skb(subskb);
++
++ if (err)
++ return false;
++ }
++
++ if (!tp->mptcp->fully_established) {
++ tp->mptcp->second_packet = 1;
++ tp->mptcp->last_end_data_seq = TCP_SKB_CB(skb)->end_seq;
++ }
++
++ return true;
++}
++
++/* Fragment an skb and update the mptcp meta-data. Due to reinject, we
++ * might need to undo some operations done by tcp_fragment.
++ */
++static int mptcp_fragment(struct sock *meta_sk, struct sk_buff *skb, u32 len,
++ gfp_t gfp, int reinject)
++{
++ int ret, diff, old_factor;
++ struct sk_buff *buff;
++ u8 flags;
++
++ if (skb_headlen(skb) < len)
++ diff = skb->len - len;
++ else
++ diff = skb->data_len;
++ old_factor = tcp_skb_pcount(skb);
++
++ /* The mss_now in tcp_fragment is used to set the tso_segs of the skb.
++ * At the MPTCP-level we do not care about the absolute value. All we
++ * care about is that it is set to 1 for accurate packets_out
++ * accounting.
++ */
++ ret = tcp_fragment(meta_sk, skb, len, UINT_MAX, gfp);
++ if (ret)
++ return ret;
++
++ buff = skb->next;
++
++ flags = TCP_SKB_CB(skb)->mptcp_flags;
++ TCP_SKB_CB(skb)->mptcp_flags = flags & ~(MPTCPHDR_FIN);
++ TCP_SKB_CB(buff)->mptcp_flags = flags;
++ TCP_SKB_CB(buff)->path_mask = TCP_SKB_CB(skb)->path_mask;
++
++ /* If reinject == 1, the buff will be added to the reinject
++ * queue, which is currently not part of memory accounting. So
++ * undo the changes done by tcp_fragment and update the
++ * reinject queue. Also, undo changes to the packet counters.
++ */
++ if (reinject == 1) {
++ int undo = buff->truesize - diff;
++ meta_sk->sk_wmem_queued -= undo;
++ sk_mem_uncharge(meta_sk, undo);
++
++ tcp_sk(meta_sk)->mpcb->reinject_queue.qlen++;
++ meta_sk->sk_write_queue.qlen--;
++
++ if (!before(tcp_sk(meta_sk)->snd_nxt, TCP_SKB_CB(buff)->end_seq)) {
++ undo = old_factor - tcp_skb_pcount(skb) -
++ tcp_skb_pcount(buff);
++ if (undo)
++ tcp_adjust_pcount(meta_sk, skb, -undo);
++ }
++ }
++
++ return 0;
++}
++
++/* Inspired by tcp_write_wakeup */
++int mptcp_write_wakeup(struct sock *meta_sk)
++{
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct sk_buff *skb;
++ struct sock *sk_it;
++ int ans = 0;
++
++ if (meta_sk->sk_state == TCP_CLOSE)
++ return -1;
++
++ skb = tcp_send_head(meta_sk);
++ if (skb &&
++ before(TCP_SKB_CB(skb)->seq, tcp_wnd_end(meta_tp))) {
++ unsigned int mss;
++ unsigned int seg_size = tcp_wnd_end(meta_tp) - TCP_SKB_CB(skb)->seq;
++ struct sock *subsk = meta_tp->mpcb->sched_ops->get_subflow(meta_sk, skb, true);
++ struct tcp_sock *subtp;
++ if (!subsk)
++ goto window_probe;
++ subtp = tcp_sk(subsk);
++ mss = tcp_current_mss(subsk);
++
++ seg_size = min(tcp_wnd_end(meta_tp) - TCP_SKB_CB(skb)->seq,
++ tcp_wnd_end(subtp) - subtp->write_seq);
++
++ if (before(meta_tp->pushed_seq, TCP_SKB_CB(skb)->end_seq))
++ meta_tp->pushed_seq = TCP_SKB_CB(skb)->end_seq;
++
++ /* We are probing the opening of a window
++ * but the window size is != 0; this must have been
++ * the result of SWS avoidance (sender).
++ */
++ if (seg_size < TCP_SKB_CB(skb)->end_seq - TCP_SKB_CB(skb)->seq ||
++ skb->len > mss) {
++ seg_size = min(seg_size, mss);
++ TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_PSH;
++ if (mptcp_fragment(meta_sk, skb, seg_size,
++ GFP_ATOMIC, 0))
++ return -1;
++ } else if (!tcp_skb_pcount(skb)) {
++ /* see mptcp_write_xmit on why we use UINT_MAX */
++ tcp_set_skb_tso_segs(meta_sk, skb, UINT_MAX);
++ }
++
++ TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_PSH;
++ if (!mptcp_skb_entail(subsk, skb, 0))
++ return -1;
++ TCP_SKB_CB(skb)->when = tcp_time_stamp;
++
++ mptcp_check_sndseq_wrap(meta_tp, TCP_SKB_CB(skb)->end_seq -
++ TCP_SKB_CB(skb)->seq);
++ tcp_event_new_data_sent(meta_sk, skb);
++
++ __tcp_push_pending_frames(subsk, mss, TCP_NAGLE_PUSH);
++
++ return 0;
++ } else {
++window_probe:
++ if (between(meta_tp->snd_up, meta_tp->snd_una + 1,
++ meta_tp->snd_una + 0xFFFF)) {
++ mptcp_for_each_sk(meta_tp->mpcb, sk_it) {
++ if (mptcp_sk_can_send_ack(sk_it))
++ tcp_xmit_probe_skb(sk_it, 1);
++ }
++ }
++
++ /* At least one of the tcp_xmit_probe_skb's has to succeed */
++ mptcp_for_each_sk(meta_tp->mpcb, sk_it) {
++ int ret;
++
++ if (!mptcp_sk_can_send_ack(sk_it))
++ continue;
++
++ ret = tcp_xmit_probe_skb(sk_it, 0);
++ if (unlikely(ret > 0))
++ ans = ret;
++ }
++ return ans;
++ }
++}
++
++bool mptcp_write_xmit(struct sock *meta_sk, unsigned int mss_now, int nonagle,
++ int push_one, gfp_t gfp)
++{
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk), *subtp;
++ struct sock *subsk = NULL;
++ struct mptcp_cb *mpcb = meta_tp->mpcb;
++ struct sk_buff *skb;
++ unsigned int sent_pkts;
++ int reinject = 0;
++ unsigned int sublimit;
++
++ sent_pkts = 0;
++
++ while ((skb = mpcb->sched_ops->next_segment(meta_sk, &reinject, &subsk,
++ &sublimit))) {
++ unsigned int limit;
++
++ subtp = tcp_sk(subsk);
++ mss_now = tcp_current_mss(subsk);
++
++ if (reinject == 1) {
++ if (!after(TCP_SKB_CB(skb)->end_seq, meta_tp->snd_una)) {
++ /* Segment already reached the peer, take the next one */
++ __skb_unlink(skb, &mpcb->reinject_queue);
++ __kfree_skb(skb);
++ continue;
++ }
++ }
++
++ /* If the segment was cloned (e.g. a meta retransmission),
++ * the header must be expanded/copied so that there is no
++ * corruption of TSO information.
++ */
++ if (skb_unclone(skb, GFP_ATOMIC))
++ break;
++
++ if (unlikely(!tcp_snd_wnd_test(meta_tp, skb, mss_now)))
++ break;
++
++ /* Force tso_segs to 1 by using UINT_MAX.
++ * We actually don't care about the exact number of segments
++ * emitted on the subflow. We need just to set tso_segs, because
++ * we still need an accurate packets_out count in
++ * tcp_event_new_data_sent.
++ */
++ tcp_set_skb_tso_segs(meta_sk, skb, UINT_MAX);
++
++ /* Check for nagle, regardless of tso_segs. If the segment is
++ * actually larger than mss_now (TSO segment), then
++ * tcp_nagle_check will have partial == false and always trigger
++ * the transmission.
++ * tcp_write_xmit has a TSO-level nagle check which is not
++ * subject to the MPTCP-level. It is based on the properties of
++ * the subflow, not the MPTCP-level.
++ */
++ if (unlikely(!tcp_nagle_test(meta_tp, skb, mss_now,
++ (tcp_skb_is_last(meta_sk, skb) ?
++ nonagle : TCP_NAGLE_PUSH))))
++ break;
++
++ limit = mss_now;
++ /* skb->len > mss_now is the equivalent of tso_segs > 1 in
++ * tcp_write_xmit. Otherwise split-point would return 0.
++ */
++ if (skb->len > mss_now && !tcp_urg_mode(meta_tp))
++ /* We limit the size of the skb so that it fits into the
++ * window. Call tcp_mss_split_point to avoid duplicating
++ * code.
++ * We really only care about fitting the skb into the
++ * window. That's why we use UINT_MAX. If the skb does
++ * not fit into the cwnd_quota or the NIC's max-segs
++ * limitation, it will be split by the subflow's
++ * tcp_write_xmit which does the appropriate call to
++ * tcp_mss_split_point.
++ */
++ limit = tcp_mss_split_point(meta_sk, skb, mss_now,
++ UINT_MAX / mss_now,
++ nonagle);
++
++ if (sublimit)
++ limit = min(limit, sublimit);
++
++ if (skb->len > limit &&
++ unlikely(mptcp_fragment(meta_sk, skb, limit, gfp, reinject)))
++ break;
++
++ if (!mptcp_skb_entail(subsk, skb, reinject))
++ break;
++ /* Nagle is handled at the MPTCP-layer, so
++ * always push on the subflow
++ */
++ __tcp_push_pending_frames(subsk, mss_now, TCP_NAGLE_PUSH);
++ TCP_SKB_CB(skb)->when = tcp_time_stamp;
++
++ if (!reinject) {
++ mptcp_check_sndseq_wrap(meta_tp,
++ TCP_SKB_CB(skb)->end_seq -
++ TCP_SKB_CB(skb)->seq);
++ tcp_event_new_data_sent(meta_sk, skb);
++ }
++
++ tcp_minshall_update(meta_tp, mss_now, skb);
++ sent_pkts += tcp_skb_pcount(skb);
++
++ if (reinject > 0) {
++ __skb_unlink(skb, &mpcb->reinject_queue);
++ kfree_skb(skb);
++ }
++
++ if (push_one)
++ break;
++ }
++
++ return !meta_tp->packets_out && tcp_send_head(meta_sk);
++}
++
++void mptcp_write_space(struct sock *sk)
++{
++ mptcp_push_pending_frames(mptcp_meta_sk(sk));
++}
++
++u32 __mptcp_select_window(struct sock *sk)
++{
++ struct inet_connection_sock *icsk = inet_csk(sk);
++ struct tcp_sock *tp = tcp_sk(sk), *meta_tp = mptcp_meta_tp(tp);
++ int mss, free_space, full_space, window;
++
++ /* MSS for the peer's data. Previous versions used mss_clamp
++ * here. I don't know if the value based on our guesses
++ * of peer's MSS is better for the performance. It's more correct
++ * but may be worse for the performance because of rcv_mss
++ * fluctuations. --SAW 1998/11/1
++ */
++ mss = icsk->icsk_ack.rcv_mss;
++ free_space = tcp_space(sk);
++ full_space = min_t(int, meta_tp->window_clamp,
++ tcp_full_space(sk));
++
++ if (mss > full_space)
++ mss = full_space;
++
++ if (free_space < (full_space >> 1)) {
++ icsk->icsk_ack.quick = 0;
++
++ if (tcp_memory_pressure)
++ /* TODO this has to be adapted when we support different
++ * MSS's among the subflows.
++ */
++ meta_tp->rcv_ssthresh = min(meta_tp->rcv_ssthresh,
++ 4U * meta_tp->advmss);
++
++ if (free_space < mss)
++ return 0;
++ }
++
++ if (free_space > meta_tp->rcv_ssthresh)
++ free_space = meta_tp->rcv_ssthresh;
++
++ /* Don't do rounding if we are using window scaling, since the
++ * scaled window will not line up with the MSS boundary anyway.
++ */
++ window = meta_tp->rcv_wnd;
++ if (tp->rx_opt.rcv_wscale) {
++ window = free_space;
++
++ /* Advertise enough space so that it won't get scaled away.
++ * Important case: prevent zero window announcement if
++ * 1<<rcv_wscale > mss.
++ */
++ if (((window >> tp->rx_opt.rcv_wscale) << tp->
++ rx_opt.rcv_wscale) != window)
++ window = (((window >> tp->rx_opt.rcv_wscale) + 1)
++ << tp->rx_opt.rcv_wscale);
++ } else {
++ /* Get the largest window that is a nice multiple of mss.
++ * Window clamp already applied above.
++ * If our current window offering is within 1 mss of the
++ * free space we just keep it. This prevents the divide
++ * and multiply from happening most of the time.
++ * We also don't do any window rounding when the free space
++ * is too small.
++ */
++ if (window <= free_space - mss || window > free_space)
++ window = (free_space / mss) * mss;
++ else if (mss == full_space &&
++ free_space > window + (full_space >> 1))
++ window = free_space;
++ }
++
++ return window;
++}
++
++void mptcp_syn_options(const struct sock *sk, struct tcp_out_options *opts,
++ unsigned *remaining)
++{
++ const struct tcp_sock *tp = tcp_sk(sk);
++
++ opts->options |= OPTION_MPTCP;
++ if (is_master_tp(tp)) {
++ opts->mptcp_options |= OPTION_MP_CAPABLE | OPTION_TYPE_SYN;
++ *remaining -= MPTCP_SUB_LEN_CAPABLE_SYN_ALIGN;
++ opts->mp_capable.sender_key = tp->mptcp_loc_key;
++ opts->dss_csum = !!sysctl_mptcp_checksum;
++ } else {
++ const struct mptcp_cb *mpcb = tp->mpcb;
++
++ opts->mptcp_options |= OPTION_MP_JOIN | OPTION_TYPE_SYN;
++ *remaining -= MPTCP_SUB_LEN_JOIN_SYN_ALIGN;
++ opts->mp_join_syns.token = mpcb->mptcp_rem_token;
++ opts->mp_join_syns.low_prio = tp->mptcp->low_prio;
++ opts->addr_id = tp->mptcp->loc_id;
++ opts->mp_join_syns.sender_nonce = tp->mptcp->mptcp_loc_nonce;
++ }
++}
++
++void mptcp_synack_options(struct request_sock *req,
++ struct tcp_out_options *opts, unsigned *remaining)
++{
++ struct mptcp_request_sock *mtreq;
++ mtreq = mptcp_rsk(req);
++
++ opts->options |= OPTION_MPTCP;
++ /* MPCB not yet set - thus it's a new MPTCP-session */
++ if (!mtreq->is_sub) {
++ opts->mptcp_options |= OPTION_MP_CAPABLE | OPTION_TYPE_SYNACK;
++ opts->mp_capable.sender_key = mtreq->mptcp_loc_key;
++ opts->dss_csum = !!sysctl_mptcp_checksum || mtreq->dss_csum;
++ *remaining -= MPTCP_SUB_LEN_CAPABLE_SYN_ALIGN;
++ } else {
++ opts->mptcp_options |= OPTION_MP_JOIN | OPTION_TYPE_SYNACK;
++ opts->mp_join_syns.sender_truncated_mac =
++ mtreq->mptcp_hash_tmac;
++ opts->mp_join_syns.sender_nonce = mtreq->mptcp_loc_nonce;
++ opts->mp_join_syns.low_prio = mtreq->low_prio;
++ opts->addr_id = mtreq->loc_id;
++ *remaining -= MPTCP_SUB_LEN_JOIN_SYNACK_ALIGN;
++ }
++}
++
++void mptcp_established_options(struct sock *sk, struct sk_buff *skb,
++ struct tcp_out_options *opts, unsigned *size)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct mptcp_cb *mpcb = tp->mpcb;
++ const struct tcp_skb_cb *tcb = skb ? TCP_SKB_CB(skb) : NULL;
++
++ /* We are coming from tcp_current_mss with the meta_sk as an argument.
++ * It does not make sense to check for the options, because when the
++ * segment gets sent, another subflow will be chosen.
++ */
++ if (!skb && is_meta_sk(sk))
++ return;
++
++ /* In fallback mp_fail-mode, we have to repeat it until the fallback
++ * has been done by the sender
++ */
++ if (unlikely(tp->mptcp->send_mp_fail)) {
++ opts->options |= OPTION_MPTCP;
++ opts->mptcp_options |= OPTION_MP_FAIL;
++ *size += MPTCP_SUB_LEN_FAIL;
++ return;
++ }
++
++ if (unlikely(tp->send_mp_fclose)) {
++ opts->options |= OPTION_MPTCP;
++ opts->mptcp_options |= OPTION_MP_FCLOSE;
++ opts->mp_capable.receiver_key = mpcb->mptcp_rem_key;
++ *size += MPTCP_SUB_LEN_FCLOSE_ALIGN;
++ return;
++ }
++
++ /* 1. If we are the sender of the infinite-mapping, we need the
++ * MPTCPHDR_INF-flag, because a retransmission of the
++ * infinite-announcement still needs the mptcp-option.
++ *
++ * We need infinite_cutoff_seq, because retransmissions from before
++ * the infinite-cutoff-moment still need the MPTCP-signalling to stay
++ * consistent.
++ *
++ * 2. If we are the receiver of the infinite-mapping, we always skip
++ * mptcp-options, because acknowledgments from before the
++ * infinite-mapping point have already been sent out.
++ *
++ * I know, the whole infinite-mapping stuff is ugly...
++ *
++ * TODO: Handle wrapped data-sequence numbers
++ * (even if it's very unlikely)
++ */
++ if (unlikely(mpcb->infinite_mapping_snd) &&
++ ((mpcb->send_infinite_mapping && tcb &&
++ mptcp_is_data_seq(skb) &&
++ !(tcb->mptcp_flags & MPTCPHDR_INF) &&
++ !before(tcb->seq, tp->mptcp->infinite_cutoff_seq)) ||
++ !mpcb->send_infinite_mapping))
++ return;
++
++ if (unlikely(tp->mptcp->include_mpc)) {
++ opts->options |= OPTION_MPTCP;
++ opts->mptcp_options |= OPTION_MP_CAPABLE |
++ OPTION_TYPE_ACK;
++ *size += MPTCP_SUB_LEN_CAPABLE_ACK_ALIGN;
++ opts->mp_capable.sender_key = mpcb->mptcp_loc_key;
++ opts->mp_capable.receiver_key = mpcb->mptcp_rem_key;
++ opts->dss_csum = mpcb->dss_csum;
++
++ if (skb)
++ tp->mptcp->include_mpc = 0;
++ }
++ if (unlikely(tp->mptcp->pre_established)) {
++ opts->options |= OPTION_MPTCP;
++ opts->mptcp_options |= OPTION_MP_JOIN | OPTION_TYPE_ACK;
++ *size += MPTCP_SUB_LEN_JOIN_ACK_ALIGN;
++ }
++
++ if (!tp->mptcp->include_mpc && !tp->mptcp->pre_established) {
++ opts->options |= OPTION_MPTCP;
++ opts->mptcp_options |= OPTION_DATA_ACK;
++ /* If !skb, we come from tcp_current_mss and thus we always
++ * assume that the DSS-option will be set for the data-packet.
++ */
++ if (skb && !mptcp_is_data_seq(skb)) {
++ *size += MPTCP_SUB_LEN_ACK_ALIGN;
++ } else {
++ /* It doesn't matter whether the csum is included or not.
++ * The length will be either 10 or 12, and thus aligned = 12
++ */
++ *size += MPTCP_SUB_LEN_ACK_ALIGN +
++ MPTCP_SUB_LEN_SEQ_ALIGN;
++ }
++
++ *size += MPTCP_SUB_LEN_DSS_ALIGN;
++ }
++
++ if (unlikely(mpcb->addr_signal) && mpcb->pm_ops->addr_signal)
++ mpcb->pm_ops->addr_signal(sk, size, opts, skb);
++
++ if (unlikely(tp->mptcp->send_mp_prio) &&
++ MAX_TCP_OPTION_SPACE - *size >= MPTCP_SUB_LEN_PRIO_ALIGN) {
++ opts->options |= OPTION_MPTCP;
++ opts->mptcp_options |= OPTION_MP_PRIO;
++ if (skb)
++ tp->mptcp->send_mp_prio = 0;
++ *size += MPTCP_SUB_LEN_PRIO_ALIGN;
++ }
++
++ return;
++}
++
++u16 mptcp_select_window(struct sock *sk)
++{
++ u16 new_win = tcp_select_window(sk);
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct tcp_sock *meta_tp = mptcp_meta_tp(tp);
++
++ meta_tp->rcv_wnd = tp->rcv_wnd;
++ meta_tp->rcv_wup = meta_tp->rcv_nxt;
++
++ return new_win;
++}
++
++void mptcp_options_write(__be32 *ptr, struct tcp_sock *tp,
++ const struct tcp_out_options *opts,
++ struct sk_buff *skb)
++{
++ if (unlikely(OPTION_MP_CAPABLE & opts->mptcp_options)) {
++ struct mp_capable *mpc = (struct mp_capable *)ptr;
++
++ mpc->kind = TCPOPT_MPTCP;
++
++ if ((OPTION_TYPE_SYN & opts->mptcp_options) ||
++ (OPTION_TYPE_SYNACK & opts->mptcp_options)) {
++ mpc->sender_key = opts->mp_capable.sender_key;
++ mpc->len = MPTCP_SUB_LEN_CAPABLE_SYN;
++ ptr += MPTCP_SUB_LEN_CAPABLE_SYN_ALIGN >> 2;
++ } else if (OPTION_TYPE_ACK & opts->mptcp_options) {
++ mpc->sender_key = opts->mp_capable.sender_key;
++ mpc->receiver_key = opts->mp_capable.receiver_key;
++ mpc->len = MPTCP_SUB_LEN_CAPABLE_ACK;
++ ptr += MPTCP_SUB_LEN_CAPABLE_ACK_ALIGN >> 2;
++ }
++
++ mpc->sub = MPTCP_SUB_CAPABLE;
++ mpc->ver = 0;
++ mpc->a = opts->dss_csum;
++ mpc->b = 0;
++ mpc->rsv = 0;
++ mpc->h = 1;
++ }
++
++ if (unlikely(OPTION_MP_JOIN & opts->mptcp_options)) {
++ struct mp_join *mpj = (struct mp_join *)ptr;
++
++ mpj->kind = TCPOPT_MPTCP;
++ mpj->sub = MPTCP_SUB_JOIN;
++ mpj->rsv = 0;
++
++ if (OPTION_TYPE_SYN & opts->mptcp_options) {
++ mpj->len = MPTCP_SUB_LEN_JOIN_SYN;
++ mpj->u.syn.token = opts->mp_join_syns.token;
++ mpj->u.syn.nonce = opts->mp_join_syns.sender_nonce;
++ mpj->b = opts->mp_join_syns.low_prio;
++ mpj->addr_id = opts->addr_id;
++ ptr += MPTCP_SUB_LEN_JOIN_SYN_ALIGN >> 2;
++ } else if (OPTION_TYPE_SYNACK & opts->mptcp_options) {
++ mpj->len = MPTCP_SUB_LEN_JOIN_SYNACK;
++ mpj->u.synack.mac =
++ opts->mp_join_syns.sender_truncated_mac;
++ mpj->u.synack.nonce = opts->mp_join_syns.sender_nonce;
++ mpj->b = opts->mp_join_syns.low_prio;
++ mpj->addr_id = opts->addr_id;
++ ptr += MPTCP_SUB_LEN_JOIN_SYNACK_ALIGN >> 2;
++ } else if (OPTION_TYPE_ACK & opts->mptcp_options) {
++ mpj->len = MPTCP_SUB_LEN_JOIN_ACK;
++ mpj->addr_id = 0; /* addr_id is rsv (RFC 6824, p. 21) */
++ memcpy(mpj->u.ack.mac, &tp->mptcp->sender_mac[0], 20);
++ ptr += MPTCP_SUB_LEN_JOIN_ACK_ALIGN >> 2;
++ }
++ }
++ if (unlikely(OPTION_ADD_ADDR & opts->mptcp_options)) {
++ struct mp_add_addr *mpadd = (struct mp_add_addr *)ptr;
++
++ mpadd->kind = TCPOPT_MPTCP;
++ if (opts->add_addr_v4) {
++ mpadd->len = MPTCP_SUB_LEN_ADD_ADDR4;
++ mpadd->sub = MPTCP_SUB_ADD_ADDR;
++ mpadd->ipver = 4;
++ mpadd->addr_id = opts->add_addr4.addr_id;
++ mpadd->u.v4.addr = opts->add_addr4.addr;
++ ptr += MPTCP_SUB_LEN_ADD_ADDR4_ALIGN >> 2;
++ } else if (opts->add_addr_v6) {
++ mpadd->len = MPTCP_SUB_LEN_ADD_ADDR6;
++ mpadd->sub = MPTCP_SUB_ADD_ADDR;
++ mpadd->ipver = 6;
++ mpadd->addr_id = opts->add_addr6.addr_id;
++ memcpy(&mpadd->u.v6.addr, &opts->add_addr6.addr,
++ sizeof(mpadd->u.v6.addr));
++ ptr += MPTCP_SUB_LEN_ADD_ADDR6_ALIGN >> 2;
++ }
++ }
++ if (unlikely(OPTION_REMOVE_ADDR & opts->mptcp_options)) {
++ struct mp_remove_addr *mprem = (struct mp_remove_addr *)ptr;
++ u8 *addrs_id;
++ int id, len, len_align;
++
++ len = mptcp_sub_len_remove_addr(opts->remove_addrs);
++ len_align = mptcp_sub_len_remove_addr_align(opts->remove_addrs);
++
++ mprem->kind = TCPOPT_MPTCP;
++ mprem->len = len;
++ mprem->sub = MPTCP_SUB_REMOVE_ADDR;
++ mprem->rsv = 0;
++ addrs_id = &mprem->addrs_id;
++
++ mptcp_for_each_bit_set(opts->remove_addrs, id)
++ *(addrs_id++) = id;
++
++ /* Fill the rest with NOP's */
++ if (len_align > len) {
++ int i;
++ for (i = 0; i < len_align - len; i++)
++ *(addrs_id++) = TCPOPT_NOP;
++ }
++
++ ptr += len_align >> 2;
++ }
++ if (unlikely(OPTION_MP_FAIL & opts->mptcp_options)) {
++ struct mp_fail *mpfail = (struct mp_fail *)ptr;
++
++ mpfail->kind = TCPOPT_MPTCP;
++ mpfail->len = MPTCP_SUB_LEN_FAIL;
++ mpfail->sub = MPTCP_SUB_FAIL;
++ mpfail->rsv1 = 0;
++ mpfail->rsv2 = 0;
++ mpfail->data_seq = htonll(tp->mpcb->csum_cutoff_seq);
++
++ ptr += MPTCP_SUB_LEN_FAIL_ALIGN >> 2;
++ }
++ if (unlikely(OPTION_MP_FCLOSE & opts->mptcp_options)) {
++ struct mp_fclose *mpfclose = (struct mp_fclose *)ptr;
++
++ mpfclose->kind = TCPOPT_MPTCP;
++ mpfclose->len = MPTCP_SUB_LEN_FCLOSE;
++ mpfclose->sub = MPTCP_SUB_FCLOSE;
++ mpfclose->rsv1 = 0;
++ mpfclose->rsv2 = 0;
++ mpfclose->key = opts->mp_capable.receiver_key;
++
++ ptr += MPTCP_SUB_LEN_FCLOSE_ALIGN >> 2;
++ }
++
++ if (OPTION_DATA_ACK & opts->mptcp_options) {
++ if (!mptcp_is_data_seq(skb))
++ ptr += mptcp_write_dss_data_ack(tp, skb, ptr);
++ else
++ ptr += mptcp_write_dss_data_seq(tp, skb, ptr);
++ }
++ if (unlikely(OPTION_MP_PRIO & opts->mptcp_options)) {
++ struct mp_prio *mpprio = (struct mp_prio *)ptr;
++
++ mpprio->kind = TCPOPT_MPTCP;
++ mpprio->len = MPTCP_SUB_LEN_PRIO;
++ mpprio->sub = MPTCP_SUB_PRIO;
++ mpprio->rsv = 0;
++ mpprio->b = tp->mptcp->low_prio;
++ mpprio->addr_id = TCPOPT_NOP;
++
++ ptr += MPTCP_SUB_LEN_PRIO_ALIGN >> 2;
++ }
++}
++
++/* Sends the datafin */
++void mptcp_send_fin(struct sock *meta_sk)
++{
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct sk_buff *skb = tcp_write_queue_tail(meta_sk);
++ int mss_now;
++
++ if ((1 << meta_sk->sk_state) & (TCPF_CLOSE_WAIT | TCPF_LAST_ACK))
++ meta_tp->mpcb->passive_close = 1;
++
++ /* Optimization, tack on the FIN if we have a queue of
++ * unsent frames. But be careful about outgoing SACKS
++ * and IP options.
++ */
++ mss_now = mptcp_current_mss(meta_sk);
++
++ if (tcp_send_head(meta_sk) != NULL) {
++ TCP_SKB_CB(skb)->mptcp_flags |= MPTCPHDR_FIN;
++ TCP_SKB_CB(skb)->end_seq++;
++ meta_tp->write_seq++;
++ } else {
++ /* Socket is locked, keep trying until memory is available. */
++ for (;;) {
++ skb = alloc_skb_fclone(MAX_TCP_HEADER,
++ meta_sk->sk_allocation);
++ if (skb)
++ break;
++ yield();
++ }
++ /* Reserve space for headers and prepare control bits. */
++ skb_reserve(skb, MAX_TCP_HEADER);
++
++ tcp_init_nondata_skb(skb, meta_tp->write_seq, TCPHDR_ACK);
++ TCP_SKB_CB(skb)->end_seq++;
++ TCP_SKB_CB(skb)->mptcp_flags |= MPTCPHDR_FIN;
++ tcp_queue_skb(meta_sk, skb);
++ }
++ __tcp_push_pending_frames(meta_sk, mss_now, TCP_NAGLE_OFF);
++}
++
++void mptcp_send_active_reset(struct sock *meta_sk, gfp_t priority)
++{
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct mptcp_cb *mpcb = meta_tp->mpcb;
++ struct sock *sk = NULL, *sk_it = NULL, *tmpsk;
++
++ if (!mpcb->cnt_subflows)
++ return;
++
++ WARN_ON(meta_tp->send_mp_fclose);
++
++ /* First - select a socket */
++ sk = mptcp_select_ack_sock(meta_sk);
++
++ /* May happen if no subflow is in an appropriate state */
++ if (!sk)
++ return;
++
++ /* We are in infinite mode - just send a reset */
++ if (mpcb->infinite_mapping_snd || mpcb->infinite_mapping_rcv) {
++ sk->sk_err = ECONNRESET;
++ if (tcp_need_reset(sk->sk_state))
++ tcp_send_active_reset(sk, priority);
++ mptcp_sub_force_close(sk);
++ return;
++ }
++
++
++ tcp_sk(sk)->send_mp_fclose = 1;
++ /* Reset all other subflows */
++
++ /* tcp_done must be handled with bh disabled */
++ if (!in_serving_softirq())
++ local_bh_disable();
++
++ mptcp_for_each_sk_safe(mpcb, sk_it, tmpsk) {
++ if (tcp_sk(sk_it)->send_mp_fclose)
++ continue;
++
++ sk_it->sk_err = ECONNRESET;
++ if (tcp_need_reset(sk_it->sk_state))
++ tcp_send_active_reset(sk_it, GFP_ATOMIC);
++ mptcp_sub_force_close(sk_it);
++ }
++
++ if (!in_serving_softirq())
++ local_bh_enable();
++
++ tcp_send_ack(sk);
++ inet_csk_reset_keepalive_timer(sk, inet_csk(sk)->icsk_rto);
++
++ meta_tp->send_mp_fclose = 1;
++}
++
++static void mptcp_ack_retransmit_timer(struct sock *sk)
++{
++ struct sk_buff *skb;
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct inet_connection_sock *icsk = inet_csk(sk);
++
++ if (inet_csk(sk)->icsk_af_ops->rebuild_header(sk))
++ goto out; /* Routing failure or similar */
++
++ if (!tp->retrans_stamp)
++ tp->retrans_stamp = tcp_time_stamp ? : 1;
++
++ if (tcp_write_timeout(sk)) {
++ tp->mptcp->pre_established = 0;
++ sk_stop_timer(sk, &tp->mptcp->mptcp_ack_timer);
++ tp->ops->send_active_reset(sk, GFP_ATOMIC);
++ goto out;
++ }
++
++ skb = alloc_skb(MAX_TCP_HEADER, GFP_ATOMIC);
++ if (skb == NULL) {
++ sk_reset_timer(sk, &tp->mptcp->mptcp_ack_timer,
++ jiffies + icsk->icsk_rto);
++ return;
++ }
++
++ /* Reserve space for headers and prepare control bits */
++ skb_reserve(skb, MAX_TCP_HEADER);
++ tcp_init_nondata_skb(skb, tp->snd_una, TCPHDR_ACK);
++
++ TCP_SKB_CB(skb)->when = tcp_time_stamp;
++ if (tcp_transmit_skb(sk, skb, 0, GFP_ATOMIC) > 0) {
++ /* Retransmission failed because of local congestion,
++ * do not backoff.
++ */
++ if (!icsk->icsk_retransmits)
++ icsk->icsk_retransmits = 1;
++ sk_reset_timer(sk, &tp->mptcp->mptcp_ack_timer,
++ jiffies + icsk->icsk_rto);
++ return;
++ }
++
++
++ icsk->icsk_retransmits++;
++ icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX);
++ sk_reset_timer(sk, &tp->mptcp->mptcp_ack_timer,
++ jiffies + icsk->icsk_rto);
++ if (retransmits_timed_out(sk, sysctl_tcp_retries1 + 1, 0, 0))
++ __sk_dst_reset(sk);
++
++out:;
++}
++
++void mptcp_ack_handler(unsigned long data)
++{
++ struct sock *sk = (struct sock *)data;
++ struct sock *meta_sk = mptcp_meta_sk(sk);
++
++ bh_lock_sock(meta_sk);
++ if (sock_owned_by_user(meta_sk)) {
++ /* Try again later */
++ sk_reset_timer(sk, &tcp_sk(sk)->mptcp->mptcp_ack_timer,
++ jiffies + (HZ / 20));
++ goto out_unlock;
++ }
++
++ if (sk->sk_state == TCP_CLOSE)
++ goto out_unlock;
++ if (!tcp_sk(sk)->mptcp->pre_established)
++ goto out_unlock;
++
++ mptcp_ack_retransmit_timer(sk);
++
++ sk_mem_reclaim(sk);
++
++out_unlock:
++ bh_unlock_sock(meta_sk);
++ sock_put(sk);
++}
++
++/* Similar to tcp_retransmit_skb
++ *
++ * The diff is that we handle the retransmission-stats (retrans_stamp) at the
++ * meta-level.
++ */
++int mptcp_retransmit_skb(struct sock *meta_sk, struct sk_buff *skb)
++{
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct sock *subsk;
++ unsigned int limit, mss_now;
++ int err = -1;
++
++ /* Do not send more than we queued. 1/4 is reserved for possible
++ * copying overhead: fragmentation, tunneling, mangling etc.
++ *
++ * This is a meta-retransmission thus we check on the meta-socket.
++ */
++ if (atomic_read(&meta_sk->sk_wmem_alloc) >
++ min(meta_sk->sk_wmem_queued + (meta_sk->sk_wmem_queued >> 2), meta_sk->sk_sndbuf)) {
++ return -EAGAIN;
++ }
++
++ /* We need to make sure that the retransmitted segment can be sent on a
++ * subflow right now. If it is too big, it needs to be fragmented.
++ */
++ subsk = meta_tp->mpcb->sched_ops->get_subflow(meta_sk, skb, false);
++ if (!subsk) {
++ /* We want to increase icsk_retransmits, thus return 0, so that
++ * mptcp_retransmit_timer enters the desired branch.
++ */
++ err = 0;
++ goto failed;
++ }
++ mss_now = tcp_current_mss(subsk);
++
++ /* If the segment was cloned (e.g. a meta retransmission), the header
++ * must be expanded/copied so that there is no corruption of TSO
++ * information.
++ */
++ if (skb_unclone(skb, GFP_ATOMIC)) {
++ err = -ENOMEM;
++ goto failed;
++ }
++
++ /* Must have been set by mptcp_write_xmit before */
++ BUG_ON(!tcp_skb_pcount(skb));
++
++ limit = mss_now;
++ /* skb->len > mss_now is the equivalent of tso_segs > 1 in
++ * tcp_write_xmit. Otherwise split-point would return 0.
++ */
++ if (skb->len > mss_now && !tcp_urg_mode(meta_tp))
++ limit = tcp_mss_split_point(meta_sk, skb, mss_now,
++ UINT_MAX / mss_now,
++ TCP_NAGLE_OFF);
++
++ if (skb->len > limit &&
++ unlikely(mptcp_fragment(meta_sk, skb, limit,
++ GFP_ATOMIC, 0)))
++ goto failed;
++
++ if (!mptcp_skb_entail(subsk, skb, -1))
++ goto failed;
++ TCP_SKB_CB(skb)->when = tcp_time_stamp;
++
++ /* Update global TCP statistics. */
++ TCP_INC_STATS(sock_net(meta_sk), TCP_MIB_RETRANSSEGS);
++
++ /* Diff to tcp_retransmit_skb */
++
++ /* Save stamp of the first retransmit. */
++ if (!meta_tp->retrans_stamp)
++ meta_tp->retrans_stamp = TCP_SKB_CB(skb)->when;
++
++ __tcp_push_pending_frames(subsk, mss_now, TCP_NAGLE_PUSH);
++
++ return 0;
++
++failed:
++ NET_INC_STATS_BH(sock_net(meta_sk), LINUX_MIB_TCPRETRANSFAIL);
++ return err;
++}
++
++/* Similar to tcp_retransmit_timer
++ *
++ * The diff is that we have to handle retransmissions of the FAST_CLOSE-message
++ * and that we don't have an srtt estimation at the meta-level.
++ */
++void mptcp_retransmit_timer(struct sock *meta_sk)
++{
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct mptcp_cb *mpcb = meta_tp->mpcb;
++ struct inet_connection_sock *meta_icsk = inet_csk(meta_sk);
++ int err;
++
++ /* In fallback, retransmission is handled at the subflow-level */
++ if (!meta_tp->packets_out || mpcb->infinite_mapping_snd ||
++ mpcb->send_infinite_mapping)
++ return;
++
++ WARN_ON(tcp_write_queue_empty(meta_sk));
++
++ if (!meta_tp->snd_wnd && !sock_flag(meta_sk, SOCK_DEAD) &&
++ !((1 << meta_sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV))) {
++ /* Receiver dastardly shrinks window. Our retransmits
++ * become zero probes, but we should not timeout this
++ * connection. If the socket is an orphan, time it out,
++ * we cannot allow such beasts to hang infinitely.
++ */
++ struct inet_sock *meta_inet = inet_sk(meta_sk);
++ if (meta_sk->sk_family == AF_INET) {
++ LIMIT_NETDEBUG(KERN_DEBUG "MPTCP: Peer %pI4:%u/%u unexpectedly shrunk window %u:%u (repaired)\n",
++ &meta_inet->inet_daddr,
++ ntohs(meta_inet->inet_dport),
++ meta_inet->inet_num, meta_tp->snd_una,
++ meta_tp->snd_nxt);
++ }
++#if IS_ENABLED(CONFIG_IPV6)
++ else if (meta_sk->sk_family == AF_INET6) {
++ LIMIT_NETDEBUG(KERN_DEBUG "MPTCP: Peer %pI6:%u/%u unexpectedly shrunk window %u:%u (repaired)\n",
++ &meta_sk->sk_v6_daddr,
++ ntohs(meta_inet->inet_dport),
++ meta_inet->inet_num, meta_tp->snd_una,
++ meta_tp->snd_nxt);
++ }
++#endif
++ if (tcp_time_stamp - meta_tp->rcv_tstamp > TCP_RTO_MAX) {
++ tcp_write_err(meta_sk);
++ return;
++ }
++
++ mptcp_retransmit_skb(meta_sk, tcp_write_queue_head(meta_sk));
++ goto out_reset_timer;
++ }
++
++ if (tcp_write_timeout(meta_sk))
++ return;
++
++ if (meta_icsk->icsk_retransmits == 0)
++ NET_INC_STATS_BH(sock_net(meta_sk), LINUX_MIB_TCPTIMEOUTS);
++
++ meta_icsk->icsk_ca_state = TCP_CA_Loss;
++
++ err = mptcp_retransmit_skb(meta_sk, tcp_write_queue_head(meta_sk));
++ if (err > 0) {
++ /* Retransmission failed because of local congestion,
++ * do not backoff.
++ */
++ if (!meta_icsk->icsk_retransmits)
++ meta_icsk->icsk_retransmits = 1;
++ inet_csk_reset_xmit_timer(meta_sk, ICSK_TIME_RETRANS,
++ min(meta_icsk->icsk_rto, TCP_RESOURCE_PROBE_INTERVAL),
++ TCP_RTO_MAX);
++ return;
++ }
++
++ /* Increase the timeout each time we retransmit. Note that
++ * we do not increase the rtt estimate. rto is initialized
++ * from rtt, but increases here. Jacobson (SIGCOMM 88) suggests
++ * that doubling rto each time is the least we can get away with.
++ * In KA9Q, Karn uses this for the first few times, and then
++ * goes to quadratic. netBSD doubles, but only goes up to *64,
++ * and clamps at 1 to 64 sec afterwards. Note that 120 sec is
++ * defined in the protocol as the maximum possible RTT. I guess
++ * we'll have to use something other than TCP to talk to the
++ * University of Mars.
++ *
++ * PAWS allows us longer timeouts and large windows, so once
++ * implemented ftp to mars will work nicely. We will have to fix
++ * the 120 second clamps though!
++ */
++ meta_icsk->icsk_backoff++;
++ meta_icsk->icsk_retransmits++;
++
++out_reset_timer:
++ /* If stream is thin, use linear timeouts. Since 'icsk_backoff' is
++ * used to reset timer, set to 0. Recalculate 'icsk_rto' as this
++ * might be increased if the stream oscillates between thin and thick,
++ * thus the old value might already be too high compared to the value
++ * set by 'tcp_set_rto' in tcp_input.c which resets the rto without
++ * backoff. Limit to TCP_THIN_LINEAR_RETRIES before initiating
++ * exponential backoff behaviour, to avoid continuing to hammer
++ * linear-timeout retransmissions into a black hole.
++ */
++ if (meta_sk->sk_state == TCP_ESTABLISHED &&
++ (meta_tp->thin_lto || sysctl_tcp_thin_linear_timeouts) &&
++ tcp_stream_is_thin(meta_tp) &&
++ meta_icsk->icsk_retransmits <= TCP_THIN_LINEAR_RETRIES) {
++ meta_icsk->icsk_backoff = 0;
++ /* We cannot do the same as in tcp_write_timer because the
++ * srtt is not set here.
++ */
++ mptcp_set_rto(meta_sk);
++ } else {
++ /* Use normal (exponential) backoff */
++ meta_icsk->icsk_rto = min(meta_icsk->icsk_rto << 1, TCP_RTO_MAX);
++ }
++ inet_csk_reset_xmit_timer(meta_sk, ICSK_TIME_RETRANS, meta_icsk->icsk_rto, TCP_RTO_MAX);
++
++ return;
++}
++
++/* Modify values to an mptcp-level for the initial window of new subflows */
++void mptcp_select_initial_window(int __space, __u32 mss, __u32 *rcv_wnd,
++ __u32 *window_clamp, int wscale_ok,
++ __u8 *rcv_wscale, __u32 init_rcv_wnd,
++ const struct sock *sk)
++{
++ struct mptcp_cb *mpcb = tcp_sk(sk)->mpcb;
++
++ *window_clamp = mpcb->orig_window_clamp;
++ __space = tcp_win_from_space(mpcb->orig_sk_rcvbuf);
++
++ tcp_select_initial_window(__space, mss, rcv_wnd, window_clamp,
++ wscale_ok, rcv_wscale, init_rcv_wnd, sk);
++}
++
++static inline u64 mptcp_calc_rate(const struct sock *meta_sk, unsigned int mss,
++ unsigned int (*mss_cb)(struct sock *sk))
++{
++ struct sock *sk;
++ u64 rate = 0;
++
++ mptcp_for_each_sk(tcp_sk(meta_sk)->mpcb, sk) {
++ struct tcp_sock *tp = tcp_sk(sk);
++ int this_mss;
++ u64 this_rate;
++
++ if (!mptcp_sk_can_send(sk))
++ continue;
++
++ /* Do not consider subflows without a RTT estimation yet
++ * otherwise this_rate >>> rate.
++ */
++ if (unlikely(!tp->srtt_us))
++ continue;
++
++ this_mss = mss_cb(sk);
++
++ /* If this_mss is smaller than mss, it means that a segment will
++ * be split in two (or more) when pushed on this subflow. If
++ * you consider that mss = 1428 and this_mss = 1420, then two
++ * segments will be generated: a 1420-byte and an 8-byte segment.
++ * The latter introduces a large overhead, as a single data
++ * segment then occupies 2 slots in the congestion window,
++ * thereby reducing the potential throughput of this subflow
++ * by ~2. Indeed, 1428 bytes will be sent while 2840 could have
++ * been sent if mss == 1420, reducing the throughput by 2840 / 1428.
++ *
++ * The following algorithm takes this overhead into account
++ * when computing the potential throughput that MPTCP can
++ * achieve when generating mss-byte segments.
++ *
++ * The formula is the following:
++ * \sum_{\forall sub} ratio * \frac{mss * cwnd_sub}{rtt_sub}
++ * Where ratio is computed as follows:
++ * \frac{mss}{\ceil{mss / mss_sub} * mss_sub}
++ *
++ * ratio gives the reduction factor of the theoretical
++ * throughput a subflow can achieve if MPTCP uses a specific
++ * MSS value.
++ */
++ this_rate = div64_u64((u64)mss * mss * (USEC_PER_SEC << 3) *
++ max(tp->snd_cwnd, tp->packets_out),
++ (u64)tp->srtt_us *
++ DIV_ROUND_UP(mss, this_mss) * this_mss);
++ rate += this_rate;
++ }
++
++ return rate;
++}
++
++static unsigned int __mptcp_current_mss(const struct sock *meta_sk,
++ unsigned int (*mss_cb)(struct sock *sk))
++{
++ unsigned int mss = 0;
++ u64 rate = 0;
++ struct sock *sk;
++
++ mptcp_for_each_sk(tcp_sk(meta_sk)->mpcb, sk) {
++ int this_mss;
++ u64 this_rate;
++
++ if (!mptcp_sk_can_send(sk))
++ continue;
++
++ this_mss = mss_cb(sk);
++
++ /* Same mss values will produce the same throughput. */
++ if (this_mss == mss)
++ continue;
++
++ /* See whether using this mss value can theoretically improve
++ * the performance.
++ */
++ this_rate = mptcp_calc_rate(meta_sk, this_mss, mss_cb);
++ if (this_rate >= rate) {
++ mss = this_mss;
++ rate = this_rate;
++ }
++ }
++
++ return mss;
++}
++
++unsigned int mptcp_current_mss(struct sock *meta_sk)
++{
++ unsigned int mss = __mptcp_current_mss(meta_sk, tcp_current_mss);
++
++ /* If no subflow is available, we take a default-mss from the
++ * meta-socket.
++ */
++ return !mss ? tcp_current_mss(meta_sk) : mss;
++}
++
++static unsigned int mptcp_select_size_mss(struct sock *sk)
++{
++ return tcp_sk(sk)->mss_cache;
++}
++
++int mptcp_select_size(const struct sock *meta_sk, bool sg)
++{
++ unsigned int mss = __mptcp_current_mss(meta_sk, mptcp_select_size_mss);
++
++ if (sg) {
++ if (mptcp_sk_can_gso(meta_sk)) {
++ mss = SKB_WITH_OVERHEAD(2048 - MAX_TCP_HEADER);
++ } else {
++ int pgbreak = SKB_MAX_HEAD(MAX_TCP_HEADER);
++
++ if (mss >= pgbreak &&
++ mss <= pgbreak + (MAX_SKB_FRAGS - 1) * PAGE_SIZE)
++ mss = pgbreak;
++ }
++ }
++
++ return !mss ? tcp_sk(meta_sk)->mss_cache : mss;
++}
++
++int mptcp_check_snd_buf(const struct tcp_sock *tp)
++{
++ const struct sock *sk;
++ u32 rtt_max = tp->srtt_us;
++ u64 bw_est;
++
++ if (!tp->srtt_us)
++ return tp->reordering + 1;
++
++ mptcp_for_each_sk(tp->mpcb, sk) {
++ if (!mptcp_sk_can_send(sk))
++ continue;
++
++ if (rtt_max < tcp_sk(sk)->srtt_us)
++ rtt_max = tcp_sk(sk)->srtt_us;
++ }
++
++ bw_est = div64_u64(((u64)tp->snd_cwnd * rtt_max) << 16,
++ (u64)tp->srtt_us);
++
++ return max_t(unsigned int, (u32)(bw_est >> 16),
++ tp->reordering + 1);
++}
++
++unsigned int mptcp_xmit_size_goal(const struct sock *meta_sk, u32 mss_now,
++ int large_allowed)
++{
++ struct sock *sk;
++ u32 xmit_size_goal = 0;
++
++ if (large_allowed && mptcp_sk_can_gso(meta_sk)) {
++ mptcp_for_each_sk(tcp_sk(meta_sk)->mpcb, sk) {
++ int this_size_goal;
++
++ if (!mptcp_sk_can_send(sk))
++ continue;
++
++ this_size_goal = tcp_xmit_size_goal(sk, mss_now, 1);
++ if (this_size_goal > xmit_size_goal)
++ xmit_size_goal = this_size_goal;
++ }
++ }
++
++ return max(xmit_size_goal, mss_now);
++}
++
++/* Similar to tcp_trim_head - but we correctly copy the DSS-option */
++int mptcp_trim_head(struct sock *sk, struct sk_buff *skb, u32 len)
++{
++ if (skb_cloned(skb)) {
++ if (pskb_expand_head(skb, 0, 0, GFP_ATOMIC))
++ return -ENOMEM;
++ }
++
++ __pskb_trim_head(skb, len);
++
++ TCP_SKB_CB(skb)->seq += len;
++ skb->ip_summed = CHECKSUM_PARTIAL;
++
++ skb->truesize -= len;
++ sk->sk_wmem_queued -= len;
++ sk_mem_uncharge(sk, len);
++ sock_set_flag(sk, SOCK_QUEUE_SHRUNK);
++
++ /* Any change of skb->len requires recalculation of tso factor. */
++ if (tcp_skb_pcount(skb) > 1)
++ tcp_set_skb_tso_segs(sk, skb, tcp_skb_mss(skb));
++
++ return 0;
++}
+diff --git a/net/mptcp/mptcp_pm.c b/net/mptcp/mptcp_pm.c
+new file mode 100644
+index 000000000000..9542f950729f
+--- /dev/null
++++ b/net/mptcp/mptcp_pm.c
+@@ -0,0 +1,169 @@
++/*
++ * MPTCP implementation - MPTCP-subflow-management
++ *
++ * Initial Design & Implementation:
++ * Sébastien Barré <sebastien.barre@uclouvain.be>
++ *
++ * Current Maintainer & Author:
++ * Christoph Paasch <christoph.paasch@uclouvain.be>
++ *
++ * Additional authors:
++ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
++ * Gregory Detal <gregory.detal@uclouvain.be>
++ * Fabien Duchêne <fabien.duchene@uclouvain.be>
++ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
++ * Lavkesh Lahngir <lavkesh51@gmail.com>
++ * Andreas Ripke <ripke@neclab.eu>
++ * Vlad Dogaru <vlad.dogaru@intel.com>
++ * Octavian Purdila <octavian.purdila@intel.com>
++ * John Ronan <jronan@tssg.org>
++ * Catalin Nicutar <catalin.nicutar@gmail.com>
++ * Brandon Heller <brandonh@stanford.edu>
++ *
++ *
++ * This program is free software; you can redistribute it and/or
++ * modify it under the terms of the GNU General Public License
++ * as published by the Free Software Foundation; either version
++ * 2 of the License, or (at your option) any later version.
++ */
++
++
++#include <linux/module.h>
++#include <net/mptcp.h>
++
++static DEFINE_SPINLOCK(mptcp_pm_list_lock);
++static LIST_HEAD(mptcp_pm_list);
++
++static int mptcp_default_id(sa_family_t family, union inet_addr *addr,
++ struct net *net, bool *low_prio)
++{
++ return 0;
++}
++
++struct mptcp_pm_ops mptcp_pm_default = {
++ .get_local_id = mptcp_default_id, /* We do not care */
++ .name = "default",
++ .owner = THIS_MODULE,
++};
++
++static struct mptcp_pm_ops *mptcp_pm_find(const char *name)
++{
++ struct mptcp_pm_ops *e;
++
++ list_for_each_entry_rcu(e, &mptcp_pm_list, list) {
++ if (strcmp(e->name, name) == 0)
++ return e;
++ }
++
++ return NULL;
++}
++
++int mptcp_register_path_manager(struct mptcp_pm_ops *pm)
++{
++ int ret = 0;
++
++ if (!pm->get_local_id)
++ return -EINVAL;
++
++ spin_lock(&mptcp_pm_list_lock);
++ if (mptcp_pm_find(pm->name)) {
++ pr_notice("%s already registered\n", pm->name);
++ ret = -EEXIST;
++ } else {
++ list_add_tail_rcu(&pm->list, &mptcp_pm_list);
++ pr_info("%s registered\n", pm->name);
++ }
++ spin_unlock(&mptcp_pm_list_lock);
++
++ return ret;
++}
++EXPORT_SYMBOL_GPL(mptcp_register_path_manager);
++
++void mptcp_unregister_path_manager(struct mptcp_pm_ops *pm)
++{
++ spin_lock(&mptcp_pm_list_lock);
++ list_del_rcu(&pm->list);
++ spin_unlock(&mptcp_pm_list_lock);
++}
++EXPORT_SYMBOL_GPL(mptcp_unregister_path_manager);
++
++void mptcp_get_default_path_manager(char *name)
++{
++ struct mptcp_pm_ops *pm;
++
++ BUG_ON(list_empty(&mptcp_pm_list));
++
++ rcu_read_lock();
++ pm = list_entry(mptcp_pm_list.next, struct mptcp_pm_ops, list);
++ strncpy(name, pm->name, MPTCP_PM_NAME_MAX);
++ rcu_read_unlock();
++}
++
++int mptcp_set_default_path_manager(const char *name)
++{
++ struct mptcp_pm_ops *pm;
++ int ret = -ENOENT;
++
++ spin_lock(&mptcp_pm_list_lock);
++ pm = mptcp_pm_find(name);
++#ifdef CONFIG_MODULES
++ if (!pm && capable(CAP_NET_ADMIN)) {
++ spin_unlock(&mptcp_pm_list_lock);
++
++ request_module("mptcp_%s", name);
++ spin_lock(&mptcp_pm_list_lock);
++ pm = mptcp_pm_find(name);
++ }
++#endif
++
++ if (pm) {
++ list_move(&pm->list, &mptcp_pm_list);
++ ret = 0;
++ } else {
++ pr_info("%s is not available\n", name);
++ }
++ spin_unlock(&mptcp_pm_list_lock);
++
++ return ret;
++}
++
++void mptcp_init_path_manager(struct mptcp_cb *mpcb)
++{
++ struct mptcp_pm_ops *pm;
++
++ rcu_read_lock();
++ list_for_each_entry_rcu(pm, &mptcp_pm_list, list) {
++ if (try_module_get(pm->owner)) {
++ mpcb->pm_ops = pm;
++ break;
++ }
++ }
++ rcu_read_unlock();
++}
++
++/* Manage refcounts on socket close. */
++void mptcp_cleanup_path_manager(struct mptcp_cb *mpcb)
++{
++ module_put(mpcb->pm_ops->owner);
++}
++
++/* Fallback to the default path-manager. */
++void mptcp_fallback_default(struct mptcp_cb *mpcb)
++{
++ struct mptcp_pm_ops *pm;
++
++ mptcp_cleanup_path_manager(mpcb);
++ pm = mptcp_pm_find("default");
++
++ /* Cannot fail - it's the default module */
++ try_module_get(pm->owner);
++ mpcb->pm_ops = pm;
++}
++EXPORT_SYMBOL_GPL(mptcp_fallback_default);
++
++/* Set default value from kernel configuration at bootup */
++static int __init mptcp_path_manager_default(void)
++{
++ return mptcp_set_default_path_manager(CONFIG_DEFAULT_MPTCP_PM);
++}
++late_initcall(mptcp_path_manager_default);
+diff --git a/net/mptcp/mptcp_rr.c b/net/mptcp/mptcp_rr.c
+new file mode 100644
+index 000000000000..93278f684069
+--- /dev/null
++++ b/net/mptcp/mptcp_rr.c
+@@ -0,0 +1,301 @@
++/* MPTCP Scheduler module selector. Highly inspired by tcp_cong.c */
++
++#include <linux/module.h>
++#include <net/mptcp.h>
++
++static unsigned char num_segments __read_mostly = 1;
++module_param(num_segments, byte, 0644);
++MODULE_PARM_DESC(num_segments, "The number of consecutive segments that are part of a burst");
++
++static bool cwnd_limited __read_mostly = 1;
++module_param(cwnd_limited, bool, 0644);
++MODULE_PARM_DESC(cwnd_limited, "if set to 1, the scheduler tries to fill the congestion-window on all subflows");
++
++struct rrsched_priv {
++ unsigned char quota;
++};
++
++static struct rrsched_priv *rrsched_get_priv(const struct tcp_sock *tp)
++{
++ return (struct rrsched_priv *)&tp->mptcp->mptcp_sched[0];
++}
++
++/* Is the sub-socket sk available to send the skb? */
++static bool mptcp_rr_is_available(const struct sock *sk, const struct sk_buff *skb,
++ bool zero_wnd_test, bool cwnd_test)
++{
++ const struct tcp_sock *tp = tcp_sk(sk);
++ unsigned int space, in_flight;
++
++ /* Set of states for which we are allowed to send data */
++ if (!mptcp_sk_can_send(sk))
++ return false;
++
++ /* We do not send data on this subflow unless it is
++ * fully established, i.e. the 4th ack has been received.
++ */
++ if (tp->mptcp->pre_established)
++ return false;
++
++ if (tp->pf)
++ return false;
++
++ if (inet_csk(sk)->icsk_ca_state == TCP_CA_Loss) {
++ /* If SACK is disabled, and we got a loss, TCP does not exit
++ * the loss-state until something above high_seq has been acked.
++ * (see tcp_try_undo_recovery)
++ *
++ * high_seq is the snd_nxt at the moment of the RTO. As soon
++ * as we have an RTO, we won't push data on the subflow.
++ * Thus, snd_una can never go beyond high_seq.
++ */
++ if (!tcp_is_reno(tp))
++ return false;
++ else if (tp->snd_una != tp->high_seq)
++ return false;
++ }
++
++ if (!tp->mptcp->fully_established) {
++ /* Make sure that we send in-order data */
++ if (skb && tp->mptcp->second_packet &&
++ tp->mptcp->last_end_data_seq != TCP_SKB_CB(skb)->seq)
++ return false;
++ }
++
++ if (!cwnd_test)
++ goto zero_wnd_test;
++
++ in_flight = tcp_packets_in_flight(tp);
++ /* Not even a single spot in the cwnd */
++ if (in_flight >= tp->snd_cwnd)
++ return false;
++
++ /* Now, check if what is queued in the subflow's send-queue
++ * already fills the cwnd.
++ */
++ space = (tp->snd_cwnd - in_flight) * tp->mss_cache;
++
++ if (tp->write_seq - tp->snd_nxt > space)
++ return false;
++
++zero_wnd_test:
++ if (zero_wnd_test && !before(tp->write_seq, tcp_wnd_end(tp)))
++ return false;
++
++ return true;
++}
++
++/* Are we not allowed to reinject this skb on tp? */
++static int mptcp_rr_dont_reinject_skb(const struct tcp_sock *tp, const struct sk_buff *skb)
++{
++ /* If the skb has already been enqueued in this sk, try to find
++ * another one.
++ */
++ return skb &&
++ /* Has the skb already been enqueued into this subsocket? */
++ mptcp_pi_to_flag(tp->mptcp->path_index) & TCP_SKB_CB(skb)->path_mask;
++}
++
++/* We just look for any subflow that is available */
++static struct sock *rr_get_available_subflow(struct sock *meta_sk,
++ struct sk_buff *skb,
++ bool zero_wnd_test)
++{
++ const struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct sock *sk, *bestsk = NULL, *backupsk = NULL;
++
++ /* Answer data_fin on same subflow!!! */
++ if (meta_sk->sk_shutdown & RCV_SHUTDOWN &&
++ skb && mptcp_is_data_fin(skb)) {
++ mptcp_for_each_sk(mpcb, sk) {
++ if (tcp_sk(sk)->mptcp->path_index == mpcb->dfin_path_index &&
++ mptcp_rr_is_available(sk, skb, zero_wnd_test, true))
++ return sk;
++ }
++ }
++
++ /* First, find the best subflow */
++ mptcp_for_each_sk(mpcb, sk) {
++ struct tcp_sock *tp = tcp_sk(sk);
++
++ if (!mptcp_rr_is_available(sk, skb, zero_wnd_test, true))
++ continue;
++
++ if (mptcp_rr_dont_reinject_skb(tp, skb)) {
++ backupsk = sk;
++ continue;
++ }
++
++ bestsk = sk;
++ }
++
++ if (bestsk) {
++ sk = bestsk;
++ } else if (backupsk) {
++ /* It has been sent on all subflows once - let's give it a
++ * chance again by restarting its pathmask.
++ */
++ if (skb)
++ TCP_SKB_CB(skb)->path_mask = 0;
++ sk = backupsk;
++ }
++
++ return sk;
++}
++
++/* Returns the next segment to be sent from the mptcp meta-queue.
++ * (chooses the reinject queue if any segment is waiting in it, otherwise,
++ * chooses the normal write queue).
++ * Sets *@reinject to 1 if the returned segment comes from the
++ * reinject queue. Sets it to 0 if it is the regular send-head of the meta-sk,
++ * and sets it to -1 if it is a meta-level retransmission to optimize the
++ * receive-buffer.
++ */
++static struct sk_buff *__mptcp_rr_next_segment(const struct sock *meta_sk, int *reinject)
++{
++ const struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct sk_buff *skb = NULL;
++
++ *reinject = 0;
++
++ /* If we are in fallback-mode, just take from the meta-send-queue */
++ if (mpcb->infinite_mapping_snd || mpcb->send_infinite_mapping)
++ return tcp_send_head(meta_sk);
++
++ skb = skb_peek(&mpcb->reinject_queue);
++
++ if (skb)
++ *reinject = 1;
++ else
++ skb = tcp_send_head(meta_sk);
++ return skb;
++}
++
++static struct sk_buff *mptcp_rr_next_segment(struct sock *meta_sk,
++ int *reinject,
++ struct sock **subsk,
++ unsigned int *limit)
++{
++ const struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct sock *sk_it, *choose_sk = NULL;
++ struct sk_buff *skb = __mptcp_rr_next_segment(meta_sk, reinject);
++ unsigned char split = num_segments;
++ unsigned char iter = 0, full_subs = 0;
++
++ /* As we set it, we have to reset it as well. */
++ *limit = 0;
++
++ if (!skb)
++ return NULL;
++
++ if (*reinject) {
++ *subsk = rr_get_available_subflow(meta_sk, skb, false);
++ if (!*subsk)
++ return NULL;
++
++ return skb;
++ }
++
++retry:
++
++ /* First, we look for a subflow that is currently being used */
++ mptcp_for_each_sk(mpcb, sk_it) {
++ struct tcp_sock *tp_it = tcp_sk(sk_it);
++ struct rrsched_priv *rsp = rrsched_get_priv(tp_it);
++
++ if (!mptcp_rr_is_available(sk_it, skb, false, cwnd_limited))
++ continue;
++
++ iter++;
++
++ /* Is this subflow currently being used? */
++ if (rsp->quota > 0 && rsp->quota < num_segments) {
++ split = num_segments - rsp->quota;
++ choose_sk = sk_it;
++ goto found;
++ }
++
++ /* Or, it's totally unused */
++ if (!rsp->quota) {
++ split = num_segments;
++ choose_sk = sk_it;
++ }
++
++ /* Or, it must then be fully used */
++ if (rsp->quota == num_segments)
++ full_subs++;
++ }
++
++ /* All considered subflows have a full quota, and we considered at
++ * least one.
++ */
++ if (iter && iter == full_subs) {
++ /* So, we restart this round by setting quota to 0 and retry
++ * to find a subflow.
++ */
++ mptcp_for_each_sk(mpcb, sk_it) {
++ struct tcp_sock *tp_it = tcp_sk(sk_it);
++ struct rrsched_priv *rsp = rrsched_get_priv(tp_it);
++
++ if (!mptcp_rr_is_available(sk_it, skb, false, cwnd_limited))
++ continue;
++
++ rsp->quota = 0;
++ }
++
++ goto retry;
++ }
++
++found:
++ if (choose_sk) {
++ unsigned int mss_now;
++ struct tcp_sock *choose_tp = tcp_sk(choose_sk);
++ struct rrsched_priv *rsp = rrsched_get_priv(choose_tp);
++
++ if (!mptcp_rr_is_available(choose_sk, skb, false, true))
++ return NULL;
++
++ *subsk = choose_sk;
++ mss_now = tcp_current_mss(*subsk);
++ *limit = split * mss_now;
++
++ if (skb->len > mss_now)
++ rsp->quota += DIV_ROUND_UP(skb->len, mss_now);
++ else
++ rsp->quota++;
++
++ return skb;
++ }
++
++ return NULL;
++}
++
++static struct mptcp_sched_ops mptcp_sched_rr = {
++ .get_subflow = rr_get_available_subflow,
++ .next_segment = mptcp_rr_next_segment,
++ .name = "roundrobin",
++ .owner = THIS_MODULE,
++};
++
++static int __init rr_register(void)
++{
++ BUILD_BUG_ON(sizeof(struct rrsched_priv) > MPTCP_SCHED_SIZE);
++
++ if (mptcp_register_scheduler(&mptcp_sched_rr))
++ return -1;
++
++ return 0;
++}
++
++static void rr_unregister(void)
++{
++ mptcp_unregister_scheduler(&mptcp_sched_rr);
++}
++
++module_init(rr_register);
++module_exit(rr_unregister);
++
++MODULE_AUTHOR("Christoph Paasch");
++MODULE_LICENSE("GPL");
++MODULE_DESCRIPTION("ROUNDROBIN MPTCP");
++MODULE_VERSION("0.89");
+diff --git a/net/mptcp/mptcp_sched.c b/net/mptcp/mptcp_sched.c
+new file mode 100644
+index 000000000000..6c7ff4eceac1
+--- /dev/null
++++ b/net/mptcp/mptcp_sched.c
+@@ -0,0 +1,493 @@
++/* MPTCP Scheduler module selector. Highly inspired by tcp_cong.c */
++
++#include <linux/module.h>
++#include <net/mptcp.h>
++
++static DEFINE_SPINLOCK(mptcp_sched_list_lock);
++static LIST_HEAD(mptcp_sched_list);
++
++struct defsched_priv {
++ u32 last_rbuf_opti;
++};
++
++static struct defsched_priv *defsched_get_priv(const struct tcp_sock *tp)
++{
++ return (struct defsched_priv *)&tp->mptcp->mptcp_sched[0];
++}
++
++/* Is the sub-socket sk available to send the skb? */
++static bool mptcp_is_available(struct sock *sk, const struct sk_buff *skb,
++ bool zero_wnd_test)
++{
++ const struct tcp_sock *tp = tcp_sk(sk);
++ unsigned int mss_now, space, in_flight;
++
++ /* Set of states for which we are allowed to send data */
++ if (!mptcp_sk_can_send(sk))
++ return false;
++
++ /* We do not send data on this subflow unless it is
++ * fully established, i.e. the 4th ack has been received.
++ */
++ if (tp->mptcp->pre_established)
++ return false;
++
++ if (tp->pf)
++ return false;
++
++ if (inet_csk(sk)->icsk_ca_state == TCP_CA_Loss) {
++ /* If SACK is disabled, and we got a loss, TCP does not exit
++ * the loss-state until something above high_seq has been acked.
++ * (see tcp_try_undo_recovery)
++ *
++ * high_seq is the snd_nxt at the moment of the RTO. As soon
++ * as we have an RTO, we won't push data on the subflow.
++ * Thus, snd_una can never go beyond high_seq.
++ */
++ if (!tcp_is_reno(tp))
++ return false;
++ else if (tp->snd_una != tp->high_seq)
++ return false;
++ }
++
++ if (!tp->mptcp->fully_established) {
++ /* Make sure that we send in-order data */
++ if (skb && tp->mptcp->second_packet &&
++ tp->mptcp->last_end_data_seq != TCP_SKB_CB(skb)->seq)
++ return false;
++ }
++
++ /* If TSQ is already throttling us, do not send on this subflow. When
++ * TSQ gets cleared the subflow becomes eligible again.
++ */
++ if (test_bit(TSQ_THROTTLED, &tp->tsq_flags))
++ return false;
++
++ in_flight = tcp_packets_in_flight(tp);
++ /* Not even a single spot in the cwnd */
++ if (in_flight >= tp->snd_cwnd)
++ return false;
++
++ /* Now, check if what is queued in the subflow's send-queue
++ * already fills the cwnd.
++ */
++ space = (tp->snd_cwnd - in_flight) * tp->mss_cache;
++
++ if (tp->write_seq - tp->snd_nxt > space)
++ return false;
++
++ if (zero_wnd_test && !before(tp->write_seq, tcp_wnd_end(tp)))
++ return false;
++
++ mss_now = tcp_current_mss(sk);
++
++ /* Don't send on this subflow if we bypass the allowed send-window at
++ * the per-subflow level. Similar to tcp_snd_wnd_test, but manually
++ * calculated end_seq (because here at this point end_seq is still at
++ * the meta-level).
++ */
++ if (skb && !zero_wnd_test &&
++ after(tp->write_seq + min(skb->len, mss_now), tcp_wnd_end(tp)))
++ return false;
++
++ return true;
++}
++
++/* Are we not allowed to reinject this skb on tp? */
++static int mptcp_dont_reinject_skb(const struct tcp_sock *tp, const struct sk_buff *skb)
++{
++ /* If the skb has already been enqueued in this sk, try to find
++ * another one.
++ */
++ return skb &&
++ /* Has the skb already been enqueued into this subsocket? */
++ mptcp_pi_to_flag(tp->mptcp->path_index) & TCP_SKB_CB(skb)->path_mask;
++}
++
++/* This is the scheduler. This function decides on which flow to send
++ * a given MSS. If all subflows are found to be busy, NULL is returned
++ * The flow is selected based on the shortest RTT.
++ * If all paths have full cong windows, we simply return NULL.
++ *
++ * Additionally, this function is aware of the backup-subflows.
++ */
++static struct sock *get_available_subflow(struct sock *meta_sk,
++ struct sk_buff *skb,
++ bool zero_wnd_test)
++{
++ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct sock *sk, *bestsk = NULL, *lowpriosk = NULL, *backupsk = NULL;
++ u32 min_time_to_peer = 0xffffffff, lowprio_min_time_to_peer = 0xffffffff;
++ int cnt_backups = 0;
++
++ /* if there is only one subflow, bypass the scheduling function */
++ if (mpcb->cnt_subflows == 1) {
++ bestsk = (struct sock *)mpcb->connection_list;
++ if (!mptcp_is_available(bestsk, skb, zero_wnd_test))
++ bestsk = NULL;
++ return bestsk;
++ }
++
++ /* Answer data_fin on same subflow!!! */
++ if (meta_sk->sk_shutdown & RCV_SHUTDOWN &&
++ skb && mptcp_is_data_fin(skb)) {
++ mptcp_for_each_sk(mpcb, sk) {
++ if (tcp_sk(sk)->mptcp->path_index == mpcb->dfin_path_index &&
++ mptcp_is_available(sk, skb, zero_wnd_test))
++ return sk;
++ }
++ }
++
++ /* First, find the best subflow */
++ mptcp_for_each_sk(mpcb, sk) {
++ struct tcp_sock *tp = tcp_sk(sk);
++
++ if (tp->mptcp->rcv_low_prio || tp->mptcp->low_prio)
++ cnt_backups++;
++
++ if ((tp->mptcp->rcv_low_prio || tp->mptcp->low_prio) &&
++ tp->srtt_us < lowprio_min_time_to_peer) {
++ if (!mptcp_is_available(sk, skb, zero_wnd_test))
++ continue;
++
++ if (mptcp_dont_reinject_skb(tp, skb)) {
++ backupsk = sk;
++ continue;
++ }
++
++ lowprio_min_time_to_peer = tp->srtt_us;
++ lowpriosk = sk;
++ } else if (!(tp->mptcp->rcv_low_prio || tp->mptcp->low_prio) &&
++ tp->srtt_us < min_time_to_peer) {
++ if (!mptcp_is_available(sk, skb, zero_wnd_test))
++ continue;
++
++ if (mptcp_dont_reinject_skb(tp, skb)) {
++ backupsk = sk;
++ continue;
++ }
++
++ min_time_to_peer = tp->srtt_us;
++ bestsk = sk;
++ }
++ }
++
++ if (mpcb->cnt_established == cnt_backups && lowpriosk) {
++ sk = lowpriosk;
++ } else if (bestsk) {
++ sk = bestsk;
++ } else if (backupsk) {
++ /* It has been sent on all subflows once - let's give it a
++ * chance again by restarting its pathmask.
++ */
++ if (skb)
++ TCP_SKB_CB(skb)->path_mask = 0;
++ sk = backupsk;
++ }
++
++ return sk;
++}
++
++static struct sk_buff *mptcp_rcv_buf_optimization(struct sock *sk, int penal)
++{
++ struct sock *meta_sk;
++ const struct tcp_sock *tp = tcp_sk(sk);
++ struct tcp_sock *tp_it;
++ struct sk_buff *skb_head;
++ struct defsched_priv *dsp = defsched_get_priv(tp);
++
++ if (tp->mpcb->cnt_subflows == 1)
++ return NULL;
++
++ meta_sk = mptcp_meta_sk(sk);
++ skb_head = tcp_write_queue_head(meta_sk);
++
++ if (!skb_head || skb_head == tcp_send_head(meta_sk))
++ return NULL;
++
++ /* If penalization is optional (coming from mptcp_next_segment()) and
++ * we are not send-buffer-limited, we do not penalize. The retransmission
++ * is just an optimization to fix the idle-time due to the delay before
++ * we wake up the application.
++ */
++ if (!penal && sk_stream_memory_free(meta_sk))
++ goto retrans;
++
++ /* Only penalize again after an RTT has elapsed */
++ if (tcp_time_stamp - dsp->last_rbuf_opti < usecs_to_jiffies(tp->srtt_us >> 3))
++ goto retrans;
++
++ /* Half the cwnd of the slow flow */
++ mptcp_for_each_tp(tp->mpcb, tp_it) {
++ if (tp_it != tp &&
++ TCP_SKB_CB(skb_head)->path_mask & mptcp_pi_to_flag(tp_it->mptcp->path_index)) {
++ if (tp->srtt_us < tp_it->srtt_us && inet_csk((struct sock *)tp_it)->icsk_ca_state == TCP_CA_Open) {
++ tp_it->snd_cwnd = max(tp_it->snd_cwnd >> 1U, 1U);
++ if (tp_it->snd_ssthresh != TCP_INFINITE_SSTHRESH)
++ tp_it->snd_ssthresh = max(tp_it->snd_ssthresh >> 1U, 2U);
++
++ dsp->last_rbuf_opti = tcp_time_stamp;
++ }
++ break;
++ }
++ }
++
++retrans:
++
++ /* Segment not yet injected into this path? Take it!!! */
++ if (!(TCP_SKB_CB(skb_head)->path_mask & mptcp_pi_to_flag(tp->mptcp->path_index))) {
++ bool do_retrans = false;
++ mptcp_for_each_tp(tp->mpcb, tp_it) {
++ if (tp_it != tp &&
++ TCP_SKB_CB(skb_head)->path_mask & mptcp_pi_to_flag(tp_it->mptcp->path_index)) {
++ if (tp_it->snd_cwnd <= 4) {
++ do_retrans = true;
++ break;
++ }
++
++ if (4 * tp->srtt_us >= tp_it->srtt_us) {
++ do_retrans = false;
++ break;
++ } else {
++ do_retrans = true;
++ }
++ }
++ }
++
++ if (do_retrans && mptcp_is_available(sk, skb_head, false))
++ return skb_head;
++ }
++ return NULL;
++}
++
++/* Returns the next segment to be sent from the mptcp meta-queue.
++ * (chooses the reinject queue if any segment is waiting in it, otherwise,
++ * chooses the normal write queue).
++ * Sets *@reinject to 1 if the returned segment comes from the
++ * reinject queue. Sets it to 0 if it is the regular send-head of the meta-sk,
++ * and sets it to -1 if it is a meta-level retransmission to optimize the
++ * receive-buffer.
++ */
++static struct sk_buff *__mptcp_next_segment(struct sock *meta_sk, int *reinject)
++{
++ const struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct sk_buff *skb = NULL;
++
++ *reinject = 0;
++
++ /* If we are in fallback-mode, just take from the meta-send-queue */
++ if (mpcb->infinite_mapping_snd || mpcb->send_infinite_mapping)
++ return tcp_send_head(meta_sk);
++
++ skb = skb_peek(&mpcb->reinject_queue);
++
++ if (skb) {
++ *reinject = 1;
++ } else {
++ skb = tcp_send_head(meta_sk);
++
++ if (!skb && meta_sk->sk_socket &&
++ test_bit(SOCK_NOSPACE, &meta_sk->sk_socket->flags) &&
++ sk_stream_wspace(meta_sk) < sk_stream_min_wspace(meta_sk)) {
++ struct sock *subsk = get_available_subflow(meta_sk, NULL,
++ false);
++ if (!subsk)
++ return NULL;
++
++ skb = mptcp_rcv_buf_optimization(subsk, 0);
++ if (skb)
++ *reinject = -1;
++ }
++ }
++ return skb;
++}
++
++static struct sk_buff *mptcp_next_segment(struct sock *meta_sk,
++ int *reinject,
++ struct sock **subsk,
++ unsigned int *limit)
++{
++ struct sk_buff *skb = __mptcp_next_segment(meta_sk, reinject);
++ unsigned int mss_now;
++ struct tcp_sock *subtp;
++ u16 gso_max_segs;
++ u32 max_len, max_segs, window, needed;
++
++ /* As we set it, we have to reset it as well. */
++ *limit = 0;
++
++ if (!skb)
++ return NULL;
++
++ *subsk = get_available_subflow(meta_sk, skb, false);
++ if (!*subsk)
++ return NULL;
++
++ subtp = tcp_sk(*subsk);
++ mss_now = tcp_current_mss(*subsk);
++
++ if (!*reinject && unlikely(!tcp_snd_wnd_test(tcp_sk(meta_sk), skb, mss_now))) {
++ skb = mptcp_rcv_buf_optimization(*subsk, 1);
++ if (skb)
++ *reinject = -1;
++ else
++ return NULL;
++ }
++
++ /* No splitting required, as we will only send a single segment */
++ if (skb->len <= mss_now)
++ return skb;
++
++ /* The following is similar to tcp_mss_split_point, but
++ * we do not care about nagle, because we will anyway
++ * use TCP_NAGLE_PUSH, which overrides this.
++ *
++ * So, we first limit according to the cwnd/gso-size and then according
++ * to the subflow's window.
++ */
++
++ gso_max_segs = (*subsk)->sk_gso_max_segs;
++ if (!gso_max_segs) /* No gso supported on the subflow's NIC */
++ gso_max_segs = 1;
++ max_segs = min_t(unsigned int, tcp_cwnd_test(subtp, skb), gso_max_segs);
++ if (!max_segs)
++ return NULL;
++
++ max_len = mss_now * max_segs;
++ window = tcp_wnd_end(subtp) - subtp->write_seq;
++
++ needed = min(skb->len, window);
++ if (max_len <= skb->len)
++ /* Take max_win, which is actually the cwnd/gso-size */
++ *limit = max_len;
++ else
++ /* Or, take the window */
++ *limit = needed;
++
++ return skb;
++}
++
++static void defsched_init(struct sock *sk)
++{
++ struct defsched_priv *dsp = defsched_get_priv(tcp_sk(sk));
++
++ dsp->last_rbuf_opti = tcp_time_stamp;
++}
++
++struct mptcp_sched_ops mptcp_sched_default = {
++ .get_subflow = get_available_subflow,
++ .next_segment = mptcp_next_segment,
++ .init = defsched_init,
++ .name = "default",
++ .owner = THIS_MODULE,
++};
++
++static struct mptcp_sched_ops *mptcp_sched_find(const char *name)
++{
++ struct mptcp_sched_ops *e;
++
++ list_for_each_entry_rcu(e, &mptcp_sched_list, list) {
++ if (strcmp(e->name, name) == 0)
++ return e;
++ }
++
++ return NULL;
++}
++
++int mptcp_register_scheduler(struct mptcp_sched_ops *sched)
++{
++ int ret = 0;
++
++ if (!sched->get_subflow || !sched->next_segment)
++ return -EINVAL;
++
++ spin_lock(&mptcp_sched_list_lock);
++ if (mptcp_sched_find(sched->name)) {
++ pr_notice("%s already registered\n", sched->name);
++ ret = -EEXIST;
++ } else {
++ list_add_tail_rcu(&sched->list, &mptcp_sched_list);
++ pr_info("%s registered\n", sched->name);
++ }
++ spin_unlock(&mptcp_sched_list_lock);
++
++ return ret;
++}
++EXPORT_SYMBOL_GPL(mptcp_register_scheduler);
++
++void mptcp_unregister_scheduler(struct mptcp_sched_ops *sched)
++{
++ spin_lock(&mptcp_sched_list_lock);
++ list_del_rcu(&sched->list);
++ spin_unlock(&mptcp_sched_list_lock);
++}
++EXPORT_SYMBOL_GPL(mptcp_unregister_scheduler);
++
++void mptcp_get_default_scheduler(char *name)
++{
++ struct mptcp_sched_ops *sched;
++
++ BUG_ON(list_empty(&mptcp_sched_list));
++
++ rcu_read_lock();
++ sched = list_entry(mptcp_sched_list.next, struct mptcp_sched_ops, list);
++ strncpy(name, sched->name, MPTCP_SCHED_NAME_MAX);
++ rcu_read_unlock();
++}
++
++int mptcp_set_default_scheduler(const char *name)
++{
++ struct mptcp_sched_ops *sched;
++ int ret = -ENOENT;
++
++ spin_lock(&mptcp_sched_list_lock);
++ sched = mptcp_sched_find(name);
++#ifdef CONFIG_MODULES
++ if (!sched && capable(CAP_NET_ADMIN)) {
++ spin_unlock(&mptcp_sched_list_lock);
++
++ request_module("mptcp_%s", name);
++ spin_lock(&mptcp_sched_list_lock);
++ sched = mptcp_sched_find(name);
++ }
++#endif
++
++ if (sched) {
++ list_move(&sched->list, &mptcp_sched_list);
++ ret = 0;
++ } else {
++ pr_info("%s is not available\n", name);
++ }
++ spin_unlock(&mptcp_sched_list_lock);
++
++ return ret;
++}
++
++void mptcp_init_scheduler(struct mptcp_cb *mpcb)
++{
++ struct mptcp_sched_ops *sched;
++
++ rcu_read_lock();
++ list_for_each_entry_rcu(sched, &mptcp_sched_list, list) {
++ if (try_module_get(sched->owner)) {
++ mpcb->sched_ops = sched;
++ break;
++ }
++ }
++ rcu_read_unlock();
++}
++
++/* Manage refcounts on socket close. */
++void mptcp_cleanup_scheduler(struct mptcp_cb *mpcb)
++{
++ module_put(mpcb->sched_ops->owner);
++}
++
++/* Set default value from kernel configuration at bootup */
++static int __init mptcp_scheduler_default(void)
++{
++ BUILD_BUG_ON(sizeof(struct defsched_priv) > MPTCP_SCHED_SIZE);
++
++ return mptcp_set_default_scheduler(CONFIG_DEFAULT_MPTCP_SCHED);
++}
++late_initcall(mptcp_scheduler_default);
+diff --git a/net/mptcp/mptcp_wvegas.c b/net/mptcp/mptcp_wvegas.c
+new file mode 100644
+index 000000000000..29ca1d868d17
+--- /dev/null
++++ b/net/mptcp/mptcp_wvegas.c
+@@ -0,0 +1,268 @@
++/*
++ * MPTCP implementation - WEIGHTED VEGAS
++ *
++ * Algorithm design:
++ * Yu Cao <cyAnalyst@126.com>
++ * Mingwei Xu <xmw@csnet1.cs.tsinghua.edu.cn>
++ * Xiaoming Fu <fu@cs.uni-goettinggen.de>
++ *
++ * Implementation:
++ * Yu Cao <cyAnalyst@126.com>
++ * Enhuan Dong <deh13@mails.tsinghua.edu.cn>
++ *
++ * Ported to the official MPTCP-kernel:
++ * Christoph Paasch <christoph.paasch@uclouvain.be>
++ *
++ * This program is free software; you can redistribute it and/or
++ * modify it under the terms of the GNU General Public License
++ * as published by the Free Software Foundation; either version
++ * 2 of the License, or (at your option) any later version.
++ */
++
++#include <linux/skbuff.h>
++#include <net/tcp.h>
++#include <net/mptcp.h>
++#include <linux/module.h>
++#include <linux/tcp.h>
++
++static int initial_alpha = 2;
++static int total_alpha = 10;
++static int gamma = 1;
++
++module_param(initial_alpha, int, 0644);
++MODULE_PARM_DESC(initial_alpha, "initial alpha for all subflows");
++module_param(total_alpha, int, 0644);
++MODULE_PARM_DESC(total_alpha, "total alpha for all subflows");
++module_param(gamma, int, 0644);
++MODULE_PARM_DESC(gamma, "limit on increase (scale by 2)");
++
++#define MPTCP_WVEGAS_SCALE 16
++
++/* wVegas variables */
++struct wvegas {
++ u32 beg_snd_nxt; /* right edge during last RTT */
++ u8 doing_wvegas_now;/* if true, do wvegas for this RTT */
++
++ u16 cnt_rtt; /* # of RTTs measured within last RTT */
++ u32 sampled_rtt; /* cumulative RTTs measured within last RTT (in usec) */
++ u32 base_rtt; /* the min of all wVegas RTT measurements seen (in usec) */
++
++ u64 instant_rate; /* cwnd / srtt_us, unit: pkts/us * 2^16 */
++ u64 weight; /* the ratio of subflow's rate to the total rate, * 2^16 */
++ int alpha; /* alpha for each subflows */
++
++ u32 queue_delay; /* queue delay */
++};
++
++
++static inline u64 mptcp_wvegas_scale(u32 val, int scale)
++{
++ return (u64) val << scale;
++}
++
++static void wvegas_enable(const struct sock *sk)
++{
++ const struct tcp_sock *tp = tcp_sk(sk);
++ struct wvegas *wvegas = inet_csk_ca(sk);
++
++ wvegas->doing_wvegas_now = 1;
++
++ wvegas->beg_snd_nxt = tp->snd_nxt;
++
++ wvegas->cnt_rtt = 0;
++ wvegas->sampled_rtt = 0;
++
++ wvegas->instant_rate = 0;
++ wvegas->alpha = initial_alpha;
++ wvegas->weight = mptcp_wvegas_scale(1, MPTCP_WVEGAS_SCALE);
++
++ wvegas->queue_delay = 0;
++}
++
++static inline void wvegas_disable(const struct sock *sk)
++{
++ struct wvegas *wvegas = inet_csk_ca(sk);
++
++ wvegas->doing_wvegas_now = 0;
++}
++
++static void mptcp_wvegas_init(struct sock *sk)
++{
++ struct wvegas *wvegas = inet_csk_ca(sk);
++
++ wvegas->base_rtt = 0x7fffffff;
++ wvegas_enable(sk);
++}
++
++static inline u64 mptcp_wvegas_rate(u32 cwnd, u32 rtt_us)
++{
++ return div_u64(mptcp_wvegas_scale(cwnd, MPTCP_WVEGAS_SCALE), rtt_us);
++}
++
++static void mptcp_wvegas_pkts_acked(struct sock *sk, u32 cnt, s32 rtt_us)
++{
++ struct wvegas *wvegas = inet_csk_ca(sk);
++ u32 vrtt;
++
++ if (rtt_us < 0)
++ return;
++
++ vrtt = rtt_us + 1;
++
++ if (vrtt < wvegas->base_rtt)
++ wvegas->base_rtt = vrtt;
++
++ wvegas->sampled_rtt += vrtt;
++ wvegas->cnt_rtt++;
++}
++
++static void mptcp_wvegas_state(struct sock *sk, u8 ca_state)
++{
++ if (ca_state == TCP_CA_Open)
++ wvegas_enable(sk);
++ else
++ wvegas_disable(sk);
++}
++
++static void mptcp_wvegas_cwnd_event(struct sock *sk, enum tcp_ca_event event)
++{
++ if (event == CA_EVENT_CWND_RESTART) {
++ mptcp_wvegas_init(sk);
++ } else if (event == CA_EVENT_LOSS) {
++ struct wvegas *wvegas = inet_csk_ca(sk);
++ wvegas->instant_rate = 0;
++ }
++}
++
++static inline u32 mptcp_wvegas_ssthresh(const struct tcp_sock *tp)
++{
++ return min(tp->snd_ssthresh, tp->snd_cwnd - 1);
++}
++
++static u64 mptcp_wvegas_weight(const struct mptcp_cb *mpcb, const struct sock *sk)
++{
++ u64 total_rate = 0;
++ struct sock *sub_sk;
++ const struct wvegas *wvegas = inet_csk_ca(sk);
++
++ if (!mpcb)
++ return wvegas->weight;
++
++
++ mptcp_for_each_sk(mpcb, sub_sk) {
++ struct wvegas *sub_wvegas = inet_csk_ca(sub_sk);
++
++ /* sampled_rtt is initialized to 0 */
++ if (mptcp_sk_can_send(sub_sk) && (sub_wvegas->sampled_rtt > 0))
++ total_rate += sub_wvegas->instant_rate;
++ }
++
++ if (total_rate && wvegas->instant_rate)
++ return div64_u64(mptcp_wvegas_scale(wvegas->instant_rate, MPTCP_WVEGAS_SCALE), total_rate);
++ else
++ return wvegas->weight;
++}
++
++static void mptcp_wvegas_cong_avoid(struct sock *sk, u32 ack, u32 acked)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct wvegas *wvegas = inet_csk_ca(sk);
++
++ if (!wvegas->doing_wvegas_now) {
++ tcp_reno_cong_avoid(sk, ack, acked);
++ return;
++ }
++
++ if (after(ack, wvegas->beg_snd_nxt)) {
++ wvegas->beg_snd_nxt = tp->snd_nxt;
++
++ if (wvegas->cnt_rtt <= 2) {
++ tcp_reno_cong_avoid(sk, ack, acked);
++ } else {
++ u32 rtt, diff, q_delay;
++ u64 target_cwnd;
++
++ rtt = wvegas->sampled_rtt / wvegas->cnt_rtt;
++ target_cwnd = div_u64(((u64)tp->snd_cwnd * wvegas->base_rtt), rtt);
++
++ diff = div_u64((u64)tp->snd_cwnd * (rtt - wvegas->base_rtt), rtt);
++
++ if (diff > gamma && tp->snd_cwnd <= tp->snd_ssthresh) {
++ tp->snd_cwnd = min(tp->snd_cwnd, (u32)target_cwnd+1);
++ tp->snd_ssthresh = mptcp_wvegas_ssthresh(tp);
++
++ } else if (tp->snd_cwnd <= tp->snd_ssthresh) {
++ tcp_slow_start(tp, acked);
++ } else {
++ if (diff >= wvegas->alpha) {
++ wvegas->instant_rate = mptcp_wvegas_rate(tp->snd_cwnd, rtt);
++ wvegas->weight = mptcp_wvegas_weight(tp->mpcb, sk);
++ wvegas->alpha = max(2U, (u32)((wvegas->weight * total_alpha) >> MPTCP_WVEGAS_SCALE));
++ }
++ if (diff > wvegas->alpha) {
++ tp->snd_cwnd--;
++ tp->snd_ssthresh = mptcp_wvegas_ssthresh(tp);
++ } else if (diff < wvegas->alpha) {
++ tp->snd_cwnd++;
++ }
++
++ /* Try to drain the link queue if needed */
++ q_delay = rtt - wvegas->base_rtt;
++ if ((wvegas->queue_delay == 0) || (wvegas->queue_delay > q_delay))
++ wvegas->queue_delay = q_delay;
++
++ if (q_delay >= 2 * wvegas->queue_delay) {
++ u32 backoff_factor = div_u64(mptcp_wvegas_scale(wvegas->base_rtt, MPTCP_WVEGAS_SCALE), 2 * rtt);
++ tp->snd_cwnd = ((u64)tp->snd_cwnd * backoff_factor) >> MPTCP_WVEGAS_SCALE;
++ wvegas->queue_delay = 0;
++ }
++ }
++
++ if (tp->snd_cwnd < 2)
++ tp->snd_cwnd = 2;
++ else if (tp->snd_cwnd > tp->snd_cwnd_clamp)
++ tp->snd_cwnd = tp->snd_cwnd_clamp;
++
++ tp->snd_ssthresh = tcp_current_ssthresh(sk);
++ }
++
++ wvegas->cnt_rtt = 0;
++ wvegas->sampled_rtt = 0;
++ }
++ /* Use normal slow start */
++ else if (tp->snd_cwnd <= tp->snd_ssthresh)
++ tcp_slow_start(tp, acked);
++}
++
++
++static struct tcp_congestion_ops mptcp_wvegas __read_mostly = {
++ .init = mptcp_wvegas_init,
++ .ssthresh = tcp_reno_ssthresh,
++ .cong_avoid = mptcp_wvegas_cong_avoid,
++ .pkts_acked = mptcp_wvegas_pkts_acked,
++ .set_state = mptcp_wvegas_state,
++ .cwnd_event = mptcp_wvegas_cwnd_event,
++
++ .owner = THIS_MODULE,
++ .name = "wvegas",
++};
++
++static int __init mptcp_wvegas_register(void)
++{
++ BUILD_BUG_ON(sizeof(struct wvegas) > ICSK_CA_PRIV_SIZE);
++ tcp_register_congestion_control(&mptcp_wvegas);
++ return 0;
++}
++
++static void __exit mptcp_wvegas_unregister(void)
++{
++ tcp_unregister_congestion_control(&mptcp_wvegas);
++}
++
++module_init(mptcp_wvegas_register);
++module_exit(mptcp_wvegas_unregister);
++
++MODULE_AUTHOR("Yu Cao, Enhuan Dong");
++MODULE_LICENSE("GPL");
++MODULE_DESCRIPTION("MPTCP wVegas");
++MODULE_VERSION("0.1");
* [gentoo-commits] proj/linux-patches:3.16 commit in: /
@ 2014-10-06 11:16 Anthony G. Basile
From: Anthony G. Basile @ 2014-10-06 11:16 UTC
To: gentoo-commits
commit: 767ed99241e0cc05f2ef12e42c95efcc2898492d
Author: Anthony G. Basile <blueness <AT> gentoo <DOT> org>
AuthorDate: Mon Oct 6 11:16:37 2014 +0000
Commit: Anthony G. Basile <blueness <AT> gentoo <DOT> org>
CommitDate: Mon Oct 6 11:16:37 2014 +0000
URL: http://sources.gentoo.org/gitweb/?p=proj/linux-patches.git;a=commit;h=767ed992
Linux patch 3.16.4
---
1003_linux-3.16.4.patch | 14205 ++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 14205 insertions(+)
diff --git a/1003_linux-3.16.4.patch b/1003_linux-3.16.4.patch
new file mode 100644
index 0000000..c50eb2d
--- /dev/null
+++ b/1003_linux-3.16.4.patch
@@ -0,0 +1,14205 @@
+diff --git a/Documentation/devicetree/bindings/interrupt-controller/interrupts.txt b/Documentation/devicetree/bindings/interrupt-controller/interrupts.txt
+index 1486497a24c1..ce6a1a072028 100644
+--- a/Documentation/devicetree/bindings/interrupt-controller/interrupts.txt
++++ b/Documentation/devicetree/bindings/interrupt-controller/interrupts.txt
+@@ -4,11 +4,13 @@ Specifying interrupt information for devices
+ 1) Interrupt client nodes
+ -------------------------
+
+-Nodes that describe devices which generate interrupts must contain an either an
+-"interrupts" property or an "interrupts-extended" property. These properties
+-contain a list of interrupt specifiers, one per output interrupt. The format of
+-the interrupt specifier is determined by the interrupt controller to which the
+-interrupts are routed; see section 2 below for details.
++Nodes that describe devices which generate interrupts must contain an
++"interrupts" property, an "interrupts-extended" property, or both. If both are
++present, the latter should take precedence; the former may be provided simply
++for compatibility with software that does not recognize the latter. These
++properties contain a list of interrupt specifiers, one per output interrupt. The
++format of the interrupt specifier is determined by the interrupt controller to
++which the interrupts are routed; see section 2 below for details.
+
+ Example:
+ interrupt-parent = <&intc1>;
+diff --git a/Documentation/devicetree/bindings/staging/imx-drm/ldb.txt b/Documentation/devicetree/bindings/staging/imx-drm/ldb.txt
+index 578a1fca366e..443bcb6134d5 100644
+--- a/Documentation/devicetree/bindings/staging/imx-drm/ldb.txt
++++ b/Documentation/devicetree/bindings/staging/imx-drm/ldb.txt
+@@ -56,6 +56,9 @@ Required properties:
+ - fsl,data-width : should be <18> or <24>
+ - port: A port node with endpoint definitions as defined in
+ Documentation/devicetree/bindings/media/video-interfaces.txt.
++ On i.MX5, the internal two-input-multiplexer is used.
++ Due to hardware limitations, only one port (port@[0,1])
++ can be used for each channel (lvds-channel@[0,1], respectively)
+ On i.MX6, there should be four ports (port@[0-3]) that correspond
+ to the four LVDS multiplexer inputs.
+
+@@ -78,6 +81,8 @@ ldb: ldb@53fa8008 {
+ "di0", "di1";
+
+ lvds-channel@0 {
++ #address-cells = <1>;
++ #size-cells = <0>;
+ reg = <0>;
+ fsl,data-mapping = "spwg";
+ fsl,data-width = <24>;
+@@ -86,7 +91,9 @@ ldb: ldb@53fa8008 {
+ /* ... */
+ };
+
+- port {
++ port@0 {
++ reg = <0>;
++
+ lvds0_in: endpoint {
+ remote-endpoint = <&ipu_di0_lvds0>;
+ };
+@@ -94,6 +101,8 @@ ldb: ldb@53fa8008 {
+ };
+
+ lvds-channel@1 {
++ #address-cells = <1>;
++ #size-cells = <0>;
+ reg = <1>;
+ fsl,data-mapping = "spwg";
+ fsl,data-width = <24>;
+@@ -102,7 +111,9 @@ ldb: ldb@53fa8008 {
+ /* ... */
+ };
+
+- port {
++ port@1 {
++ reg = <1>;
++
+ lvds1_in: endpoint {
+ remote-endpoint = <&ipu_di1_lvds1>;
+ };
+diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
+index b7fa2f599459..f896f68a3ba3 100644
+--- a/Documentation/kernel-parameters.txt
++++ b/Documentation/kernel-parameters.txt
+@@ -3478,6 +3478,7 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
+ bogus residue values);
+ s = SINGLE_LUN (the device has only one
+ Logical Unit);
++ u = IGNORE_UAS (don't bind to the uas driver);
+ w = NO_WP_DETECT (don't test whether the
+ medium is write-protected).
+ Example: quirks=0419:aaf5:rl,0421:0433:rc
+diff --git a/Makefile b/Makefile
+index 9b25a830a9d7..e75c75f0ec35 100644
+--- a/Makefile
++++ b/Makefile
+@@ -1,6 +1,6 @@
+ VERSION = 3
+ PATCHLEVEL = 16
+-SUBLEVEL = 3
++SUBLEVEL = 4
+ EXTRAVERSION =
+ NAME = Museum of Fishiegoodies
+
+diff --git a/arch/arm/boot/dts/dra7-evm.dts b/arch/arm/boot/dts/dra7-evm.dts
+index 83089540e324..780d66119f3c 100644
+--- a/arch/arm/boot/dts/dra7-evm.dts
++++ b/arch/arm/boot/dts/dra7-evm.dts
+@@ -50,13 +50,13 @@
+
+ mcspi1_pins: pinmux_mcspi1_pins {
+ pinctrl-single,pins = <
+- 0x3a4 (PIN_INPUT | MUX_MODE0) /* spi2_clk */
+- 0x3a8 (PIN_INPUT | MUX_MODE0) /* spi2_d1 */
+- 0x3ac (PIN_INPUT | MUX_MODE0) /* spi2_d0 */
+- 0x3b0 (PIN_INPUT_SLEW | MUX_MODE0) /* spi2_cs0 */
+- 0x3b4 (PIN_INPUT_SLEW | MUX_MODE0) /* spi2_cs1 */
+- 0x3b8 (PIN_INPUT_SLEW | MUX_MODE6) /* spi2_cs2 */
+- 0x3bc (PIN_INPUT_SLEW | MUX_MODE6) /* spi2_cs3 */
++ 0x3a4 (PIN_INPUT | MUX_MODE0) /* spi1_sclk */
++ 0x3a8 (PIN_INPUT | MUX_MODE0) /* spi1_d1 */
++ 0x3ac (PIN_INPUT | MUX_MODE0) /* spi1_d0 */
++ 0x3b0 (PIN_INPUT_SLEW | MUX_MODE0) /* spi1_cs0 */
++ 0x3b4 (PIN_INPUT_SLEW | MUX_MODE0) /* spi1_cs1 */
++ 0x3b8 (PIN_INPUT_SLEW | MUX_MODE6) /* spi1_cs2.hdmi1_hpd */
++ 0x3bc (PIN_INPUT_SLEW | MUX_MODE6) /* spi1_cs3.hdmi1_cec */
+ >;
+ };
+
+@@ -427,22 +427,19 @@
+ gpmc,device-width = <2>;
+ gpmc,sync-clk-ps = <0>;
+ gpmc,cs-on-ns = <0>;
+- gpmc,cs-rd-off-ns = <40>;
+- gpmc,cs-wr-off-ns = <40>;
++ gpmc,cs-rd-off-ns = <80>;
++ gpmc,cs-wr-off-ns = <80>;
+ gpmc,adv-on-ns = <0>;
+- gpmc,adv-rd-off-ns = <30>;
+- gpmc,adv-wr-off-ns = <30>;
+- gpmc,we-on-ns = <5>;
+- gpmc,we-off-ns = <25>;
+- gpmc,oe-on-ns = <2>;
+- gpmc,oe-off-ns = <20>;
+- gpmc,access-ns = <20>;
+- gpmc,wr-access-ns = <40>;
+- gpmc,rd-cycle-ns = <40>;
+- gpmc,wr-cycle-ns = <40>;
+- gpmc,wait-pin = <0>;
+- gpmc,wait-on-read;
+- gpmc,wait-on-write;
++ gpmc,adv-rd-off-ns = <60>;
++ gpmc,adv-wr-off-ns = <60>;
++ gpmc,we-on-ns = <10>;
++ gpmc,we-off-ns = <50>;
++ gpmc,oe-on-ns = <4>;
++ gpmc,oe-off-ns = <40>;
++ gpmc,access-ns = <40>;
++ gpmc,wr-access-ns = <80>;
++ gpmc,rd-cycle-ns = <80>;
++ gpmc,wr-cycle-ns = <80>;
+ gpmc,bus-turnaround-ns = <0>;
+ gpmc,cycle2cycle-delay-ns = <0>;
+ gpmc,clk-activation-ns = <0>;
+diff --git a/arch/arm/boot/dts/dra7.dtsi b/arch/arm/boot/dts/dra7.dtsi
+index 80127638b379..f21ef396902f 100644
+--- a/arch/arm/boot/dts/dra7.dtsi
++++ b/arch/arm/boot/dts/dra7.dtsi
+@@ -172,7 +172,7 @@
+ gpio-controller;
+ #gpio-cells = <2>;
+ interrupt-controller;
+- #interrupt-cells = <1>;
++ #interrupt-cells = <2>;
+ };
+
+ gpio2: gpio@48055000 {
+@@ -183,7 +183,7 @@
+ gpio-controller;
+ #gpio-cells = <2>;
+ interrupt-controller;
+- #interrupt-cells = <1>;
++ #interrupt-cells = <2>;
+ };
+
+ gpio3: gpio@48057000 {
+@@ -194,7 +194,7 @@
+ gpio-controller;
+ #gpio-cells = <2>;
+ interrupt-controller;
+- #interrupt-cells = <1>;
++ #interrupt-cells = <2>;
+ };
+
+ gpio4: gpio@48059000 {
+@@ -205,7 +205,7 @@
+ gpio-controller;
+ #gpio-cells = <2>;
+ interrupt-controller;
+- #interrupt-cells = <1>;
++ #interrupt-cells = <2>;
+ };
+
+ gpio5: gpio@4805b000 {
+@@ -216,7 +216,7 @@
+ gpio-controller;
+ #gpio-cells = <2>;
+ interrupt-controller;
+- #interrupt-cells = <1>;
++ #interrupt-cells = <2>;
+ };
+
+ gpio6: gpio@4805d000 {
+@@ -227,7 +227,7 @@
+ gpio-controller;
+ #gpio-cells = <2>;
+ interrupt-controller;
+- #interrupt-cells = <1>;
++ #interrupt-cells = <2>;
+ };
+
+ gpio7: gpio@48051000 {
+@@ -238,7 +238,7 @@
+ gpio-controller;
+ #gpio-cells = <2>;
+ interrupt-controller;
+- #interrupt-cells = <1>;
++ #interrupt-cells = <2>;
+ };
+
+ gpio8: gpio@48053000 {
+@@ -249,7 +249,7 @@
+ gpio-controller;
+ #gpio-cells = <2>;
+ interrupt-controller;
+- #interrupt-cells = <1>;
++ #interrupt-cells = <2>;
+ };
+
+ uart1: serial@4806a000 {
+diff --git a/arch/arm/boot/dts/imx53-qsrb.dts b/arch/arm/boot/dts/imx53-qsrb.dts
+index f1bbf9a32991..82d623d05915 100644
+--- a/arch/arm/boot/dts/imx53-qsrb.dts
++++ b/arch/arm/boot/dts/imx53-qsrb.dts
+@@ -28,6 +28,12 @@
+ MX53_PAD_CSI0_DAT9__I2C1_SCL 0x400001ec
+ >;
+ };
++
++ pinctrl_pmic: pmicgrp {
++ fsl,pins = <
++ MX53_PAD_CSI0_DAT5__GPIO5_23 0x1e4 /* IRQ */
++ >;
++ };
+ };
+ };
+
+@@ -38,6 +44,8 @@
+
+ pmic: mc34708@8 {
+ compatible = "fsl,mc34708";
++ pinctrl-names = "default";
++ pinctrl-0 = <&pinctrl_pmic>;
+ reg = <0x08>;
+ interrupt-parent = <&gpio5>;
+ interrupts = <23 0x8>;
+diff --git a/arch/arm/boot/dts/imx53.dtsi b/arch/arm/boot/dts/imx53.dtsi
+index 6456a0084388..7d42db36d6bb 100644
+--- a/arch/arm/boot/dts/imx53.dtsi
++++ b/arch/arm/boot/dts/imx53.dtsi
+@@ -419,10 +419,14 @@
+ status = "disabled";
+
+ lvds-channel@0 {
++ #address-cells = <1>;
++ #size-cells = <0>;
+ reg = <0>;
+ status = "disabled";
+
+- port {
++ port@0 {
++ reg = <0>;
++
+ lvds0_in: endpoint {
+ remote-endpoint = <&ipu_di0_lvds0>;
+ };
+@@ -430,10 +434,14 @@
+ };
+
+ lvds-channel@1 {
++ #address-cells = <1>;
++ #size-cells = <0>;
+ reg = <1>;
+ status = "disabled";
+
+- port {
++ port@1 {
++ reg = <1>;
++
+ lvds1_in: endpoint {
+ remote-endpoint = <&ipu_di1_lvds1>;
+ };
+@@ -724,7 +732,7 @@
+ compatible = "fsl,imx53-vpu";
+ reg = <0x63ff4000 0x1000>;
+ interrupts = <9>;
+- clocks = <&clks IMX5_CLK_VPU_GATE>,
++ clocks = <&clks IMX5_CLK_VPU_REFERENCE_GATE>,
+ <&clks IMX5_CLK_VPU_GATE>;
+ clock-names = "per", "ahb";
+ resets = <&src 1>;
+diff --git a/arch/arm/boot/dts/vf610-twr.dts b/arch/arm/boot/dts/vf610-twr.dts
+index 11d733406c7e..b8a5e8c68f06 100644
+--- a/arch/arm/boot/dts/vf610-twr.dts
++++ b/arch/arm/boot/dts/vf610-twr.dts
+@@ -168,7 +168,7 @@
+ };
+
+ pinctrl_esdhc1: esdhc1grp {
+- fsl,fsl,pins = <
++ fsl,pins = <
+ VF610_PAD_PTA24__ESDHC1_CLK 0x31ef
+ VF610_PAD_PTA25__ESDHC1_CMD 0x31ef
+ VF610_PAD_PTA26__ESDHC1_DAT0 0x31ef
+diff --git a/arch/arm/common/edma.c b/arch/arm/common/edma.c
+index 485be42519b9..ea97e14e1f0b 100644
+--- a/arch/arm/common/edma.c
++++ b/arch/arm/common/edma.c
+@@ -1415,14 +1415,14 @@ void edma_clear_event(unsigned channel)
+ EXPORT_SYMBOL(edma_clear_event);
+
+ static int edma_setup_from_hw(struct device *dev, struct edma_soc_info *pdata,
+- struct edma *edma_cc)
++ struct edma *edma_cc, int cc_id)
+ {
+ int i;
+ u32 value, cccfg;
+ s8 (*queue_priority_map)[2];
+
+ /* Decode the eDMA3 configuration from CCCFG register */
+- cccfg = edma_read(0, EDMA_CCCFG);
++ cccfg = edma_read(cc_id, EDMA_CCCFG);
+
+ value = GET_NUM_REGN(cccfg);
+ edma_cc->num_region = BIT(value);
+@@ -1436,7 +1436,8 @@ static int edma_setup_from_hw(struct device *dev, struct edma_soc_info *pdata,
+ value = GET_NUM_EVQUE(cccfg);
+ edma_cc->num_tc = value + 1;
+
+- dev_dbg(dev, "eDMA3 HW configuration (cccfg: 0x%08x):\n", cccfg);
++ dev_dbg(dev, "eDMA3 CC%d HW configuration (cccfg: 0x%08x):\n", cc_id,
++ cccfg);
+ dev_dbg(dev, "num_region: %u\n", edma_cc->num_region);
+ dev_dbg(dev, "num_channel: %u\n", edma_cc->num_channels);
+ dev_dbg(dev, "num_slot: %u\n", edma_cc->num_slots);
+@@ -1655,7 +1656,7 @@ static int edma_probe(struct platform_device *pdev)
+ return -ENOMEM;
+
+ /* Get eDMA3 configuration from IP */
+- ret = edma_setup_from_hw(dev, info[j], edma_cc[j]);
++ ret = edma_setup_from_hw(dev, info[j], edma_cc[j], j);
+ if (ret)
+ return ret;
+
+diff --git a/arch/arm/include/asm/cacheflush.h b/arch/arm/include/asm/cacheflush.h
+index fd43f7f55b70..79ecb4f34ffb 100644
+--- a/arch/arm/include/asm/cacheflush.h
++++ b/arch/arm/include/asm/cacheflush.h
+@@ -472,7 +472,6 @@ static inline void __sync_cache_range_r(volatile void *p, size_t size)
+ "mcr p15, 0, r0, c1, c0, 0 @ set SCTLR \n\t" \
+ "isb \n\t" \
+ "bl v7_flush_dcache_"__stringify(level)" \n\t" \
+- "clrex \n\t" \
+ "mrc p15, 0, r0, c1, c0, 1 @ get ACTLR \n\t" \
+ "bic r0, r0, #(1 << 6) @ disable local coherency \n\t" \
+ "mcr p15, 0, r0, c1, c0, 1 @ set ACTLR \n\t" \
+diff --git a/arch/arm/include/asm/tls.h b/arch/arm/include/asm/tls.h
+index 83259b873333..5f833f7adba1 100644
+--- a/arch/arm/include/asm/tls.h
++++ b/arch/arm/include/asm/tls.h
+@@ -1,6 +1,9 @@
+ #ifndef __ASMARM_TLS_H
+ #define __ASMARM_TLS_H
+
++#include <linux/compiler.h>
++#include <asm/thread_info.h>
++
+ #ifdef __ASSEMBLY__
+ #include <asm/asm-offsets.h>
+ .macro switch_tls_none, base, tp, tpuser, tmp1, tmp2
+@@ -50,6 +53,49 @@
+ #endif
+
+ #ifndef __ASSEMBLY__
++
++static inline void set_tls(unsigned long val)
++{
++ struct thread_info *thread;
++
++ thread = current_thread_info();
++
++ thread->tp_value[0] = val;
++
++ /*
++ * This code runs with preemption enabled and therefore must
++ * be reentrant with respect to switch_tls.
++ *
++ * We need to ensure ordering between the shadow state and the
++ * hardware state, so that we don't corrupt the hardware state
++ * with a stale shadow state during context switch.
++ *
++ * If we're preempted here, switch_tls will load TPIDRURO from
++ * thread_info upon resuming execution and the following mcr
++ * is merely redundant.
++ */
++ barrier();
++
++ if (!tls_emu) {
++ if (has_tls_reg) {
++ asm("mcr p15, 0, %0, c13, c0, 3"
++ : : "r" (val));
++ } else {
++#ifdef CONFIG_KUSER_HELPERS
++ /*
++ * User space must never try to access this
++ * directly. Expect your app to break
++ * eventually if you do so. The user helper
++ * at 0xffff0fe0 must be used instead. (see
++ * entry-armv.S for details)
++ */
++ *((unsigned int *)0xffff0ff0) = val;
++#endif
++ }
++
++ }
++}
++
+ static inline unsigned long get_tpuser(void)
+ {
+ unsigned long reg = 0;
+@@ -59,5 +105,23 @@ static inline unsigned long get_tpuser(void)
+
+ return reg;
+ }
++
++static inline void set_tpuser(unsigned long val)
++{
++ /* Since TPIDRURW is fully context-switched (unlike TPIDRURO),
++ * we need not update thread_info.
++ */
++ if (has_tls_reg && !tls_emu) {
++ asm("mcr p15, 0, %0, c13, c0, 2"
++ : : "r" (val));
++ }
++}
++
++static inline void flush_tls(void)
++{
++ set_tls(0);
++ set_tpuser(0);
++}
++
+ #endif
+ #endif /* __ASMARM_TLS_H */
+diff --git a/arch/arm/kernel/entry-header.S b/arch/arm/kernel/entry-header.S
+index 5d702f8900b1..0325dbf6e762 100644
+--- a/arch/arm/kernel/entry-header.S
++++ b/arch/arm/kernel/entry-header.S
+@@ -208,26 +208,21 @@
+ #endif
+ .endif
+ msr spsr_cxsf, \rpsr
+-#if defined(CONFIG_CPU_V6)
+- ldr r0, [sp]
+- strex r1, r2, [sp] @ clear the exclusive monitor
+- ldmib sp, {r1 - pc}^ @ load r1 - pc, cpsr
+-#elif defined(CONFIG_CPU_32v6K)
+- clrex @ clear the exclusive monitor
+- ldmia sp, {r0 - pc}^ @ load r0 - pc, cpsr
+-#else
+- ldmia sp, {r0 - pc}^ @ load r0 - pc, cpsr
++#if defined(CONFIG_CPU_V6) || defined(CONFIG_CPU_32v6K)
++ @ We must avoid clrex due to Cortex-A15 erratum #830321
++ sub r0, sp, #4 @ uninhabited address
++ strex r1, r2, [r0] @ clear the exclusive monitor
+ #endif
++ ldmia sp, {r0 - pc}^ @ load r0 - pc, cpsr
+ .endm
+
+ .macro restore_user_regs, fast = 0, offset = 0
+ ldr r1, [sp, #\offset + S_PSR] @ get calling cpsr
+ ldr lr, [sp, #\offset + S_PC]! @ get pc
+ msr spsr_cxsf, r1 @ save in spsr_svc
+-#if defined(CONFIG_CPU_V6)
++#if defined(CONFIG_CPU_V6) || defined(CONFIG_CPU_32v6K)
++ @ We must avoid clrex due to Cortex-A15 erratum #830321
+ strex r1, r2, [sp] @ clear the exclusive monitor
+-#elif defined(CONFIG_CPU_32v6K)
+- clrex @ clear the exclusive monitor
+ #endif
+ .if \fast
+ ldmdb sp, {r1 - lr}^ @ get calling r1 - lr
+@@ -267,7 +262,10 @@
+ .endif
+ ldr lr, [sp, #S_SP] @ top of the stack
+ ldrd r0, r1, [sp, #S_LR] @ calling lr and pc
+- clrex @ clear the exclusive monitor
++
++ @ We must avoid clrex due to Cortex-A15 erratum #830321
++ strex r2, r1, [sp, #S_LR] @ clear the exclusive monitor
++
+ stmdb lr!, {r0, r1, \rpsr} @ calling lr and rfe context
+ ldmia sp, {r0 - r12}
+ mov sp, lr
+@@ -288,13 +286,16 @@
+ .endm
+ #else /* ifdef CONFIG_CPU_V7M */
+ .macro restore_user_regs, fast = 0, offset = 0
+- clrex @ clear the exclusive monitor
+ mov r2, sp
+ load_user_sp_lr r2, r3, \offset + S_SP @ calling sp, lr
+ ldr r1, [sp, #\offset + S_PSR] @ get calling cpsr
+ ldr lr, [sp, #\offset + S_PC] @ get pc
+ add sp, sp, #\offset + S_SP
+ msr spsr_cxsf, r1 @ save in spsr_svc
++
++ @ We must avoid clrex due to Cortex-A15 erratum #830321
++ strex r1, r2, [sp] @ clear the exclusive monitor
++
+ .if \fast
+ ldmdb sp, {r1 - r12} @ get calling r1 - r12
+ .else
+diff --git a/arch/arm/kernel/irq.c b/arch/arm/kernel/irq.c
+index 2c4257604513..5c4d38e32a51 100644
+--- a/arch/arm/kernel/irq.c
++++ b/arch/arm/kernel/irq.c
+@@ -175,7 +175,7 @@ static bool migrate_one_irq(struct irq_desc *desc)
+ c = irq_data_get_irq_chip(d);
+ if (!c->irq_set_affinity)
+ pr_debug("IRQ%u: unable to set affinity\n", d->irq);
+- else if (c->irq_set_affinity(d, affinity, true) == IRQ_SET_MASK_OK && ret)
++ else if (c->irq_set_affinity(d, affinity, false) == IRQ_SET_MASK_OK && ret)
+ cpumask_copy(d->affinity, affinity);
+
+ return ret;
+diff --git a/arch/arm/kernel/perf_event_cpu.c b/arch/arm/kernel/perf_event_cpu.c
+index af9e35e8836f..290ad8170d7a 100644
+--- a/arch/arm/kernel/perf_event_cpu.c
++++ b/arch/arm/kernel/perf_event_cpu.c
+@@ -76,21 +76,15 @@ static struct pmu_hw_events *cpu_pmu_get_cpu_events(void)
+
+ static void cpu_pmu_enable_percpu_irq(void *data)
+ {
+- struct arm_pmu *cpu_pmu = data;
+- struct platform_device *pmu_device = cpu_pmu->plat_device;
+- int irq = platform_get_irq(pmu_device, 0);
++ int irq = *(int *)data;
+
+ enable_percpu_irq(irq, IRQ_TYPE_NONE);
+- cpumask_set_cpu(smp_processor_id(), &cpu_pmu->active_irqs);
+ }
+
+ static void cpu_pmu_disable_percpu_irq(void *data)
+ {
+- struct arm_pmu *cpu_pmu = data;
+- struct platform_device *pmu_device = cpu_pmu->plat_device;
+- int irq = platform_get_irq(pmu_device, 0);
++ int irq = *(int *)data;
+
+- cpumask_clear_cpu(smp_processor_id(), &cpu_pmu->active_irqs);
+ disable_percpu_irq(irq);
+ }
+
+@@ -103,7 +97,7 @@ static void cpu_pmu_free_irq(struct arm_pmu *cpu_pmu)
+
+ irq = platform_get_irq(pmu_device, 0);
+ if (irq >= 0 && irq_is_percpu(irq)) {
+- on_each_cpu(cpu_pmu_disable_percpu_irq, cpu_pmu, 1);
++ on_each_cpu(cpu_pmu_disable_percpu_irq, &irq, 1);
+ free_percpu_irq(irq, &percpu_pmu);
+ } else {
+ for (i = 0; i < irqs; ++i) {
+@@ -138,7 +132,7 @@ static int cpu_pmu_request_irq(struct arm_pmu *cpu_pmu, irq_handler_t handler)
+ irq);
+ return err;
+ }
+- on_each_cpu(cpu_pmu_enable_percpu_irq, cpu_pmu, 1);
++ on_each_cpu(cpu_pmu_enable_percpu_irq, &irq, 1);
+ } else {
+ for (i = 0; i < irqs; ++i) {
+ err = 0;
+diff --git a/arch/arm/kernel/perf_event_v7.c b/arch/arm/kernel/perf_event_v7.c
+index 1d37568c547a..ac8dc747264c 100644
+--- a/arch/arm/kernel/perf_event_v7.c
++++ b/arch/arm/kernel/perf_event_v7.c
+@@ -157,6 +157,7 @@ static const unsigned armv7_a8_perf_map[PERF_COUNT_HW_MAX] = {
+ [PERF_COUNT_HW_BUS_CYCLES] = HW_OP_UNSUPPORTED,
+ [PERF_COUNT_HW_STALLED_CYCLES_FRONTEND] = ARMV7_A8_PERFCTR_STALL_ISIDE,
+ [PERF_COUNT_HW_STALLED_CYCLES_BACKEND] = HW_OP_UNSUPPORTED,
++ [PERF_COUNT_HW_REF_CPU_CYCLES] = HW_OP_UNSUPPORTED,
+ };
+
+ static const unsigned armv7_a8_perf_cache_map[PERF_COUNT_HW_CACHE_MAX]
+@@ -281,6 +282,7 @@ static const unsigned armv7_a9_perf_map[PERF_COUNT_HW_MAX] = {
+ [PERF_COUNT_HW_BUS_CYCLES] = HW_OP_UNSUPPORTED,
+ [PERF_COUNT_HW_STALLED_CYCLES_FRONTEND] = ARMV7_A9_PERFCTR_STALL_ICACHE,
+ [PERF_COUNT_HW_STALLED_CYCLES_BACKEND] = ARMV7_A9_PERFCTR_STALL_DISPATCH,
++ [PERF_COUNT_HW_REF_CPU_CYCLES] = HW_OP_UNSUPPORTED,
+ };
+
+ static const unsigned armv7_a9_perf_cache_map[PERF_COUNT_HW_CACHE_MAX]
+@@ -405,6 +407,7 @@ static const unsigned armv7_a5_perf_map[PERF_COUNT_HW_MAX] = {
+ [PERF_COUNT_HW_BUS_CYCLES] = HW_OP_UNSUPPORTED,
+ [PERF_COUNT_HW_STALLED_CYCLES_FRONTEND] = HW_OP_UNSUPPORTED,
+ [PERF_COUNT_HW_STALLED_CYCLES_BACKEND] = HW_OP_UNSUPPORTED,
++ [PERF_COUNT_HW_REF_CPU_CYCLES] = HW_OP_UNSUPPORTED,
+ };
+
+ static const unsigned armv7_a5_perf_cache_map[PERF_COUNT_HW_CACHE_MAX]
+@@ -527,6 +530,7 @@ static const unsigned armv7_a15_perf_map[PERF_COUNT_HW_MAX] = {
+ [PERF_COUNT_HW_BUS_CYCLES] = ARMV7_PERFCTR_BUS_CYCLES,
+ [PERF_COUNT_HW_STALLED_CYCLES_FRONTEND] = HW_OP_UNSUPPORTED,
+ [PERF_COUNT_HW_STALLED_CYCLES_BACKEND] = HW_OP_UNSUPPORTED,
++ [PERF_COUNT_HW_REF_CPU_CYCLES] = HW_OP_UNSUPPORTED,
+ };
+
+ static const unsigned armv7_a15_perf_cache_map[PERF_COUNT_HW_CACHE_MAX]
+@@ -651,6 +655,7 @@ static const unsigned armv7_a7_perf_map[PERF_COUNT_HW_MAX] = {
+ [PERF_COUNT_HW_BUS_CYCLES] = ARMV7_PERFCTR_BUS_CYCLES,
+ [PERF_COUNT_HW_STALLED_CYCLES_FRONTEND] = HW_OP_UNSUPPORTED,
+ [PERF_COUNT_HW_STALLED_CYCLES_BACKEND] = HW_OP_UNSUPPORTED,
++ [PERF_COUNT_HW_REF_CPU_CYCLES] = HW_OP_UNSUPPORTED,
+ };
+
+ static const unsigned armv7_a7_perf_cache_map[PERF_COUNT_HW_CACHE_MAX]
+diff --git a/arch/arm/kernel/process.c b/arch/arm/kernel/process.c
+index 81ef686a91ca..a35f6ebbd2c2 100644
+--- a/arch/arm/kernel/process.c
++++ b/arch/arm/kernel/process.c
+@@ -334,6 +334,8 @@ void flush_thread(void)
+ memset(&tsk->thread.debug, 0, sizeof(struct debug_info));
+ memset(&thread->fpstate, 0, sizeof(union fp_state));
+
++ flush_tls();
++
+ thread_notify(THREAD_NOTIFY_FLUSH, thread);
+ }
+
+diff --git a/arch/arm/kernel/thumbee.c b/arch/arm/kernel/thumbee.c
+index 7b8403b76666..80f0d69205e7 100644
+--- a/arch/arm/kernel/thumbee.c
++++ b/arch/arm/kernel/thumbee.c
+@@ -45,7 +45,7 @@ static int thumbee_notifier(struct notifier_block *self, unsigned long cmd, void
+
+ switch (cmd) {
+ case THREAD_NOTIFY_FLUSH:
+- thread->thumbee_state = 0;
++ teehbr_write(0);
+ break;
+ case THREAD_NOTIFY_SWITCH:
+ current_thread_info()->thumbee_state = teehbr_read();
+diff --git a/arch/arm/kernel/traps.c b/arch/arm/kernel/traps.c
+index abd2fc067736..da11b28a72da 100644
+--- a/arch/arm/kernel/traps.c
++++ b/arch/arm/kernel/traps.c
+@@ -579,7 +579,6 @@ do_cache_op(unsigned long start, unsigned long end, int flags)
+ #define NR(x) ((__ARM_NR_##x) - __ARM_NR_BASE)
+ asmlinkage int arm_syscall(int no, struct pt_regs *regs)
+ {
+- struct thread_info *thread = current_thread_info();
+ siginfo_t info;
+
+ if ((no >> 16) != (__ARM_NR_BASE>> 16))
+@@ -630,21 +629,7 @@ asmlinkage int arm_syscall(int no, struct pt_regs *regs)
+ return regs->ARM_r0;
+
+ case NR(set_tls):
+- thread->tp_value[0] = regs->ARM_r0;
+- if (tls_emu)
+- return 0;
+- if (has_tls_reg) {
+- asm ("mcr p15, 0, %0, c13, c0, 3"
+- : : "r" (regs->ARM_r0));
+- } else {
+- /*
+- * User space must never try to access this directly.
+- * Expect your app to break eventually if you do so.
+- * The user helper at 0xffff0fe0 must be used instead.
+- * (see entry-armv.S for details)
+- */
+- *((unsigned int *)0xffff0ff0) = regs->ARM_r0;
+- }
++ set_tls(regs->ARM_r0);
+ return 0;
+
+ #ifdef CONFIG_NEEDS_SYSCALL_FOR_CMPXCHG
+diff --git a/arch/arm/kvm/handle_exit.c b/arch/arm/kvm/handle_exit.c
+index 4c979d466cc1..a96a8043277c 100644
+--- a/arch/arm/kvm/handle_exit.c
++++ b/arch/arm/kvm/handle_exit.c
+@@ -93,6 +93,8 @@ static int kvm_handle_wfx(struct kvm_vcpu *vcpu, struct kvm_run *run)
+ else
+ kvm_vcpu_block(vcpu);
+
++ kvm_skip_instr(vcpu, kvm_vcpu_trap_il_is32bit(vcpu));
++
+ return 1;
+ }
+
+diff --git a/arch/arm/kvm/init.S b/arch/arm/kvm/init.S
+index 1b9844d369cc..ee4f7447a1d3 100644
+--- a/arch/arm/kvm/init.S
++++ b/arch/arm/kvm/init.S
+@@ -98,6 +98,10 @@ __do_hyp_init:
+ mrc p15, 0, r0, c10, c2, 1
+ mcr p15, 4, r0, c10, c2, 1
+
++ @ Invalidate the stale TLBs from Bootloader
++ mcr p15, 4, r0, c8, c7, 0 @ TLBIALLH
++ dsb ish
++
+ @ Set the HSCTLR to:
+ @ - ARM/THUMB exceptions: Kernel config (Thumb-2 kernel)
+ @ - Endianness: Kernel config
+diff --git a/arch/arm/mach-exynos/mcpm-exynos.c b/arch/arm/mach-exynos/mcpm-exynos.c
+index ace0ed617476..25ef73278a26 100644
+--- a/arch/arm/mach-exynos/mcpm-exynos.c
++++ b/arch/arm/mach-exynos/mcpm-exynos.c
+@@ -39,7 +39,6 @@
+ "mcr p15, 0, r0, c1, c0, 0 @ set SCTLR\n\t" \
+ "isb\n\t"\
+ "bl v7_flush_dcache_"__stringify(level)"\n\t" \
+- "clrex\n\t"\
+ "mrc p15, 0, r0, c1, c0, 1 @ get ACTLR\n\t" \
+ "bic r0, r0, #(1 << 6) @ disable local coherency\n\t" \
+ /* Dummy Load of a device register to avoid Erratum 799270 */ \
+diff --git a/arch/arm/mach-imx/clk-gate2.c b/arch/arm/mach-imx/clk-gate2.c
+index 84acdfd1d715..5a75cdc81891 100644
+--- a/arch/arm/mach-imx/clk-gate2.c
++++ b/arch/arm/mach-imx/clk-gate2.c
+@@ -97,7 +97,7 @@ static int clk_gate2_is_enabled(struct clk_hw *hw)
+ struct clk_gate2 *gate = to_clk_gate2(hw);
+
+ if (gate->share_count)
+- return !!(*gate->share_count);
++ return !!__clk_get_enable_count(hw->clk);
+ else
+ return clk_gate2_reg_is_enabled(gate->reg, gate->bit_idx);
+ }
+@@ -127,10 +127,6 @@ struct clk *clk_register_gate2(struct device *dev, const char *name,
+ gate->bit_idx = bit_idx;
+ gate->flags = clk_gate2_flags;
+ gate->lock = lock;
+-
+- /* Initialize share_count per hardware state */
+- if (share_count)
+- *share_count = clk_gate2_reg_is_enabled(reg, bit_idx) ? 1 : 0;
+ gate->share_count = share_count;
+
+ init.name = name;
+diff --git a/arch/arm/mach-imx/suspend-imx6.S b/arch/arm/mach-imx/suspend-imx6.S
+index fe123b079c05..87bdf7a629a5 100644
+--- a/arch/arm/mach-imx/suspend-imx6.S
++++ b/arch/arm/mach-imx/suspend-imx6.S
+@@ -172,6 +172,8 @@ ENTRY(imx6_suspend)
+ ldr r6, [r11, #0x0]
+ ldr r11, [r0, #PM_INFO_MX6Q_GPC_V_OFFSET]
+ ldr r6, [r11, #0x0]
++ ldr r11, [r0, #PM_INFO_MX6Q_IOMUXC_V_OFFSET]
++ ldr r6, [r11, #0x0]
+
+ /* use r11 to store the IO address */
+ ldr r11, [r0, #PM_INFO_MX6Q_SRC_V_OFFSET]
+diff --git a/arch/arm/mach-omap2/omap_hwmod.c b/arch/arm/mach-omap2/omap_hwmod.c
+index da1b256caccc..8fd87a3055bf 100644
+--- a/arch/arm/mach-omap2/omap_hwmod.c
++++ b/arch/arm/mach-omap2/omap_hwmod.c
+@@ -3349,6 +3349,9 @@ int __init omap_hwmod_register_links(struct omap_hwmod_ocp_if **ois)
+ if (!ois)
+ return 0;
+
++ if (ois[0] == NULL) /* Empty list */
++ return 0;
++
+ if (!linkspace) {
+ if (_alloc_linkspace(ois)) {
+ pr_err("omap_hwmod: could not allocate link space\n");
+diff --git a/arch/arm/mach-omap2/omap_hwmod_7xx_data.c b/arch/arm/mach-omap2/omap_hwmod_7xx_data.c
+index 284324f2b98a..c95033c1029b 100644
+--- a/arch/arm/mach-omap2/omap_hwmod_7xx_data.c
++++ b/arch/arm/mach-omap2/omap_hwmod_7xx_data.c
+@@ -35,6 +35,7 @@
+ #include "i2c.h"
+ #include "mmc.h"
+ #include "wd_timer.h"
++#include "soc.h"
+
+ /* Base offset for all DRA7XX interrupts external to MPUSS */
+ #define DRA7XX_IRQ_GIC_START 32
+@@ -2705,7 +2706,6 @@ static struct omap_hwmod_ocp_if *dra7xx_hwmod_ocp_ifs[] __initdata = {
+ &dra7xx_l4_per3__usb_otg_ss1,
+ &dra7xx_l4_per3__usb_otg_ss2,
+ &dra7xx_l4_per3__usb_otg_ss3,
+- &dra7xx_l4_per3__usb_otg_ss4,
+ &dra7xx_l3_main_1__vcp1,
+ &dra7xx_l4_per2__vcp1,
+ &dra7xx_l3_main_1__vcp2,
+@@ -2714,8 +2714,26 @@ static struct omap_hwmod_ocp_if *dra7xx_hwmod_ocp_ifs[] __initdata = {
+ NULL,
+ };
+
++static struct omap_hwmod_ocp_if *dra74x_hwmod_ocp_ifs[] __initdata = {
++ &dra7xx_l4_per3__usb_otg_ss4,
++ NULL,
++};
++
++static struct omap_hwmod_ocp_if *dra72x_hwmod_ocp_ifs[] __initdata = {
++ NULL,
++};
++
+ int __init dra7xx_hwmod_init(void)
+ {
++ int ret;
++
+ omap_hwmod_init();
+- return omap_hwmod_register_links(dra7xx_hwmod_ocp_ifs);
++ ret = omap_hwmod_register_links(dra7xx_hwmod_ocp_ifs);
++
++ if (!ret && soc_is_dra74x())
++ return omap_hwmod_register_links(dra74x_hwmod_ocp_ifs);
++ else if (!ret && soc_is_dra72x())
++ return omap_hwmod_register_links(dra72x_hwmod_ocp_ifs);
++
++ return ret;
+ }
+diff --git a/arch/arm/mach-omap2/soc.h b/arch/arm/mach-omap2/soc.h
+index 01ca8086fb6c..4376f59626d1 100644
+--- a/arch/arm/mach-omap2/soc.h
++++ b/arch/arm/mach-omap2/soc.h
+@@ -245,6 +245,8 @@ IS_AM_SUBCLASS(437x, 0x437)
+ #define soc_is_omap54xx() 0
+ #define soc_is_omap543x() 0
+ #define soc_is_dra7xx() 0
++#define soc_is_dra74x() 0
++#define soc_is_dra72x() 0
+
+ #if defined(MULTI_OMAP2)
+ # if defined(CONFIG_ARCH_OMAP2)
+@@ -393,7 +395,11 @@ IS_OMAP_TYPE(3430, 0x3430)
+
+ #if defined(CONFIG_SOC_DRA7XX)
+ #undef soc_is_dra7xx
++#undef soc_is_dra74x
++#undef soc_is_dra72x
+ #define soc_is_dra7xx() (of_machine_is_compatible("ti,dra7"))
++#define soc_is_dra74x() (of_machine_is_compatible("ti,dra74"))
++#define soc_is_dra72x() (of_machine_is_compatible("ti,dra72"))
+ #endif
+
+ /* Various silicon revisions for omap2 */
+diff --git a/arch/arm/mm/abort-ev6.S b/arch/arm/mm/abort-ev6.S
+index 3815a8262af0..8c48c5c22a33 100644
+--- a/arch/arm/mm/abort-ev6.S
++++ b/arch/arm/mm/abort-ev6.S
+@@ -17,12 +17,6 @@
+ */
+ .align 5
+ ENTRY(v6_early_abort)
+-#ifdef CONFIG_CPU_V6
+- sub r1, sp, #4 @ Get unused stack location
+- strex r0, r1, [r1] @ Clear the exclusive monitor
+-#elif defined(CONFIG_CPU_32v6K)
+- clrex
+-#endif
+ mrc p15, 0, r1, c5, c0, 0 @ get FSR
+ mrc p15, 0, r0, c6, c0, 0 @ get FAR
+ /*
+diff --git a/arch/arm/mm/abort-ev7.S b/arch/arm/mm/abort-ev7.S
+index 703375277ba6..4812ad054214 100644
+--- a/arch/arm/mm/abort-ev7.S
++++ b/arch/arm/mm/abort-ev7.S
+@@ -13,12 +13,6 @@
+ */
+ .align 5
+ ENTRY(v7_early_abort)
+- /*
+- * The effect of data aborts on on the exclusive access monitor are
+- * UNPREDICTABLE. Do a CLREX to clear the state
+- */
+- clrex
+-
+ mrc p15, 0, r1, c5, c0, 0 @ get FSR
+ mrc p15, 0, r0, c6, c0, 0 @ get FAR
+
+diff --git a/arch/arm/mm/alignment.c b/arch/arm/mm/alignment.c
+index b8cb1a2688a0..33ca98085cd5 100644
+--- a/arch/arm/mm/alignment.c
++++ b/arch/arm/mm/alignment.c
+@@ -41,6 +41,7 @@
+ * This code is not portable to processors with late data abort handling.
+ */
+ #define CODING_BITS(i) (i & 0x0e000000)
++#define COND_BITS(i) (i & 0xf0000000)
+
+ #define LDST_I_BIT(i) (i & (1 << 26)) /* Immediate constant */
+ #define LDST_P_BIT(i) (i & (1 << 24)) /* Preindex */
+@@ -819,6 +820,8 @@ do_alignment(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
+ break;
+
+ case 0x04000000: /* ldr or str immediate */
++ if (COND_BITS(instr) == 0xf0000000) /* NEON VLDn, VSTn */
++ goto bad;
+ offset.un = OFFSET_BITS(instr);
+ handler = do_alignment_ldrstr;
+ break;
+diff --git a/arch/arm64/include/asm/hw_breakpoint.h b/arch/arm64/include/asm/hw_breakpoint.h
+index d064047612b1..52b484b6aa1a 100644
+--- a/arch/arm64/include/asm/hw_breakpoint.h
++++ b/arch/arm64/include/asm/hw_breakpoint.h
+@@ -79,7 +79,6 @@ static inline void decode_ctrl_reg(u32 reg,
+ */
+ #define ARM_MAX_BRP 16
+ #define ARM_MAX_WRP 16
+-#define ARM_MAX_HBP_SLOTS (ARM_MAX_BRP + ARM_MAX_WRP)
+
+ /* Virtual debug register bases. */
+ #define AARCH64_DBG_REG_BVR 0
+diff --git a/arch/arm64/include/asm/ptrace.h b/arch/arm64/include/asm/ptrace.h
+index 501000fadb6f..41ed9e13795e 100644
+--- a/arch/arm64/include/asm/ptrace.h
++++ b/arch/arm64/include/asm/ptrace.h
+@@ -137,7 +137,7 @@ struct pt_regs {
+ (!((regs)->pstate & PSR_F_BIT))
+
+ #define user_stack_pointer(regs) \
+- (!compat_user_mode(regs)) ? ((regs)->sp) : ((regs)->compat_sp)
++ (!compat_user_mode(regs) ? (regs)->sp : (regs)->compat_sp)
+
+ static inline unsigned long regs_return_value(struct pt_regs *regs)
+ {
+diff --git a/arch/arm64/kernel/irq.c b/arch/arm64/kernel/irq.c
+index 0f08dfd69ebc..dfa6e3e74fdd 100644
+--- a/arch/arm64/kernel/irq.c
++++ b/arch/arm64/kernel/irq.c
+@@ -97,19 +97,15 @@ static bool migrate_one_irq(struct irq_desc *desc)
+ if (irqd_is_per_cpu(d) || !cpumask_test_cpu(smp_processor_id(), affinity))
+ return false;
+
+- if (cpumask_any_and(affinity, cpu_online_mask) >= nr_cpu_ids)
++ if (cpumask_any_and(affinity, cpu_online_mask) >= nr_cpu_ids) {
++ affinity = cpu_online_mask;
+ ret = true;
++ }
+
+- /*
+- * when using forced irq_set_affinity we must ensure that the cpu
+- * being offlined is not present in the affinity mask, it may be
+- * selected as the target CPU otherwise
+- */
+- affinity = cpu_online_mask;
+ c = irq_data_get_irq_chip(d);
+ if (!c->irq_set_affinity)
+ pr_debug("IRQ%u: unable to set affinity\n", d->irq);
+- else if (c->irq_set_affinity(d, affinity, true) == IRQ_SET_MASK_OK && ret)
++ else if (c->irq_set_affinity(d, affinity, false) == IRQ_SET_MASK_OK && ret)
+ cpumask_copy(d->affinity, affinity);
+
+ return ret;
+diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
+index 43b7c34f92cb..7b0827ae402d 100644
+--- a/arch/arm64/kernel/process.c
++++ b/arch/arm64/kernel/process.c
+@@ -224,9 +224,27 @@ void exit_thread(void)
+ {
+ }
+
++static void tls_thread_flush(void)
++{
++ asm ("msr tpidr_el0, xzr");
++
++ if (is_compat_task()) {
++ current->thread.tp_value = 0;
++
++ /*
++ * We need to ensure ordering between the shadow state and the
++ * hardware state, so that we don't corrupt the hardware state
++ * with a stale shadow state during context switch.
++ */
++ barrier();
++ asm ("msr tpidrro_el0, xzr");
++ }
++}
++
+ void flush_thread(void)
+ {
+ fpsimd_flush_thread();
++ tls_thread_flush();
+ flush_ptrace_hw_breakpoint(current);
+ }
+
+diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
+index 9fde010c945f..167c5edecad4 100644
+--- a/arch/arm64/kernel/ptrace.c
++++ b/arch/arm64/kernel/ptrace.c
+@@ -85,7 +85,8 @@ static void ptrace_hbptriggered(struct perf_event *bp,
+ break;
+ }
+ }
+- for (i = ARM_MAX_BRP; i < ARM_MAX_HBP_SLOTS && !bp; ++i) {
++
++ for (i = 0; i < ARM_MAX_WRP; ++i) {
+ if (current->thread.debug.hbp_watch[i] == bp) {
+ info.si_errno = -((i << 1) + 1);
+ break;
+diff --git a/arch/arm64/kernel/sys_compat.c b/arch/arm64/kernel/sys_compat.c
+index 26e9c4eeaba8..78039927c807 100644
+--- a/arch/arm64/kernel/sys_compat.c
++++ b/arch/arm64/kernel/sys_compat.c
+@@ -79,6 +79,12 @@ long compat_arm_syscall(struct pt_regs *regs)
+
+ case __ARM_NR_compat_set_tls:
+ current->thread.tp_value = regs->regs[0];
++
++ /*
++ * Protect against register corruption from context switch.
++ * See comment in tls_thread_flush.
++ */
++ barrier();
+ asm ("msr tpidrro_el0, %0" : : "r" (regs->regs[0]));
+ return 0;
+
+diff --git a/arch/arm64/kvm/handle_exit.c b/arch/arm64/kvm/handle_exit.c
+index 182415e1a952..2ca885c3eb0f 100644
+--- a/arch/arm64/kvm/handle_exit.c
++++ b/arch/arm64/kvm/handle_exit.c
+@@ -66,6 +66,8 @@ static int kvm_handle_wfx(struct kvm_vcpu *vcpu, struct kvm_run *run)
+ else
+ kvm_vcpu_block(vcpu);
+
++ kvm_skip_instr(vcpu, kvm_vcpu_trap_il_is32bit(vcpu));
++
+ return 1;
+ }
+
+diff --git a/arch/arm64/kvm/hyp-init.S b/arch/arm64/kvm/hyp-init.S
+index d968796f4b2d..c3191168a994 100644
+--- a/arch/arm64/kvm/hyp-init.S
++++ b/arch/arm64/kvm/hyp-init.S
+@@ -80,6 +80,10 @@ __do_hyp_init:
+ msr mair_el2, x4
+ isb
+
++ /* Invalidate the stale TLBs from Bootloader */
++ tlbi alle2
++ dsb sy
++
+ mrs x4, sctlr_el2
+ and x4, x4, #SCTLR_EL2_EE // preserve endianness of EL2
+ ldr x5, =SCTLR_EL2_FLAGS
+diff --git a/arch/mips/boot/compressed/decompress.c b/arch/mips/boot/compressed/decompress.c
+index c00c4ddf4514..5244cecf1e45 100644
+--- a/arch/mips/boot/compressed/decompress.c
++++ b/arch/mips/boot/compressed/decompress.c
+@@ -13,6 +13,7 @@
+
+ #include <linux/types.h>
+ #include <linux/kernel.h>
++#include <linux/string.h>
+
+ #include <asm/addrspace.h>
+
+diff --git a/arch/mips/kernel/mcount.S b/arch/mips/kernel/mcount.S
+index 539b6294b613..8f89ff4ed524 100644
+--- a/arch/mips/kernel/mcount.S
++++ b/arch/mips/kernel/mcount.S
+@@ -123,7 +123,11 @@ NESTED(_mcount, PT_SIZE, ra)
+ nop
+ #endif
+ b ftrace_stub
++#ifdef CONFIG_32BIT
++ addiu sp, sp, 8
++#else
+ nop
++#endif
+
+ static_trace:
+ MCOUNT_SAVE_REGS
+@@ -133,6 +137,9 @@ static_trace:
+ move a1, AT /* arg2: parent's return address */
+
+ MCOUNT_RESTORE_REGS
++#ifdef CONFIG_32BIT
++ addiu sp, sp, 8
++#endif
+ .globl ftrace_stub
+ ftrace_stub:
+ RETURN_BACK
+@@ -177,6 +184,11 @@ NESTED(ftrace_graph_caller, PT_SIZE, ra)
+ jal prepare_ftrace_return
+ nop
+ MCOUNT_RESTORE_REGS
++#ifndef CONFIG_DYNAMIC_FTRACE
++#ifdef CONFIG_32BIT
++ addiu sp, sp, 8
++#endif
++#endif
+ RETURN_BACK
+ END(ftrace_graph_caller)
+
+diff --git a/arch/mips/math-emu/cp1emu.c b/arch/mips/math-emu/cp1emu.c
+index bf0fc6b16ad9..7a4727795a70 100644
+--- a/arch/mips/math-emu/cp1emu.c
++++ b/arch/mips/math-emu/cp1emu.c
+@@ -650,9 +650,9 @@ static inline int cop1_64bit(struct pt_regs *xcp)
+ #define SIFROMREG(si, x) \
+ do { \
+ if (cop1_64bit(xcp)) \
+- (si) = get_fpr32(&ctx->fpr[x], 0); \
++ (si) = (int)get_fpr32(&ctx->fpr[x], 0); \
+ else \
+- (si) = get_fpr32(&ctx->fpr[(x) & ~1], (x) & 1); \
++ (si) = (int)get_fpr32(&ctx->fpr[(x) & ~1], (x) & 1); \
+ } while (0)
+
+ #define SITOREG(si, x) \
+@@ -667,7 +667,7 @@ do { \
+ } \
+ } while (0)
+
+-#define SIFROMHREG(si, x) ((si) = get_fpr32(&ctx->fpr[x], 1))
++#define SIFROMHREG(si, x) ((si) = (int)get_fpr32(&ctx->fpr[x], 1))
+
+ #define SITOHREG(si, x) \
+ do { \
+diff --git a/arch/parisc/Makefile b/arch/parisc/Makefile
+index 7187664034c3..5db8882f732c 100644
+--- a/arch/parisc/Makefile
++++ b/arch/parisc/Makefile
+@@ -48,7 +48,12 @@ cflags-y := -pipe
+
+ # These flags should be implied by an hppa-linux configuration, but they
+ # are not in gcc 3.2.
+-cflags-y += -mno-space-regs -mfast-indirect-calls
++cflags-y += -mno-space-regs
++
++# -mfast-indirect-calls is only relevant for 32-bit kernels.
++ifndef CONFIG_64BIT
++cflags-y += -mfast-indirect-calls
++endif
+
+ # Currently we save and restore fpregs on all kernel entry/interruption paths.
+ # If that gets optimized, we might need to disable the use of fpregs in the
+diff --git a/arch/parisc/kernel/syscall.S b/arch/parisc/kernel/syscall.S
+index 838786011037..7ef22e3387e0 100644
+--- a/arch/parisc/kernel/syscall.S
++++ b/arch/parisc/kernel/syscall.S
+@@ -74,7 +74,7 @@ ENTRY(linux_gateway_page)
+ /* ADDRESS 0xb0 to 0xb8, lws uses two insns for entry */
+ /* Light-weight-syscall entry must always be located at 0xb0 */
+ /* WARNING: Keep this number updated with table size changes */
+-#define __NR_lws_entries (2)
++#define __NR_lws_entries (3)
+
+ lws_entry:
+ gate lws_start, %r0 /* increase privilege */
+@@ -502,7 +502,7 @@ lws_exit:
+
+
+ /***************************************************
+- Implementing CAS as an atomic operation:
++ Implementing 32bit CAS as an atomic operation:
+
+ %r26 - Address to examine
+ %r25 - Old value to check (old)
+@@ -659,6 +659,230 @@ cas_action:
+ ASM_EXCEPTIONTABLE_ENTRY(2b-linux_gateway_page, 3b-linux_gateway_page)
+
+
++ /***************************************************
++ New CAS implementation which uses pointers and variable size
++ information. The values pointed to by old and new MUST NOT change
++ while performing CAS. The lock only protects the value at %r26.
++
++ %r26 - Address to examine
++ %r25 - Pointer to the value to check (old)
++ %r24 - Pointer to the value to set (new)
++ %r23 - Size of the variable (0/1/2/3 for 8/16/32/64 bit)
++ %r28 - Return non-zero on failure
++ %r21 - Kernel error code
++
++ %r21 has the following meanings:
++
++ EAGAIN - CAS is busy, ldcw failed, try again.
++ EFAULT - Read or write failed.
++
++ Scratch: r20, r22, r28, r29, r1, fr4 (32bit for 64bit CAS only)
++
++ ****************************************************/
++
++ /* ELF32 Process entry path */
++lws_compare_and_swap_2:
++#ifdef CONFIG_64BIT
++ /* Clip the input registers */
++ depdi 0, 31, 32, %r26
++ depdi 0, 31, 32, %r25
++ depdi 0, 31, 32, %r24
++ depdi 0, 31, 32, %r23
++#endif
++
++ /* Check the validity of the size pointer */
++ subi,>>= 4, %r23, %r0
++ b,n lws_exit_nosys
++
++ /* Jump to the functions which will load the old and new values into
++ registers depending on their size */
++ shlw %r23, 2, %r29
++ blr %r29, %r0
++ nop
++
++ /* 8bit load */
++4: ldb 0(%sr3,%r25), %r25
++ b cas2_lock_start
++5: ldb 0(%sr3,%r24), %r24
++ nop
++ nop
++ nop
++ nop
++ nop
++
++ /* 16bit load */
++6: ldh 0(%sr3,%r25), %r25
++ b cas2_lock_start
++7: ldh 0(%sr3,%r24), %r24
++ nop
++ nop
++ nop
++ nop
++ nop
++
++ /* 32bit load */
++8: ldw 0(%sr3,%r25), %r25
++ b cas2_lock_start
++9: ldw 0(%sr3,%r24), %r24
++ nop
++ nop
++ nop
++ nop
++ nop
++
++ /* 64bit load */
++#ifdef CONFIG_64BIT
++10: ldd 0(%sr3,%r25), %r25
++11: ldd 0(%sr3,%r24), %r24
++#else
++ /* Load old value into r22/r23 - high/low */
++10: ldw 0(%sr3,%r25), %r22
++11: ldw 4(%sr3,%r25), %r23
++ /* Load new value into fr4 for atomic store later */
++12: flddx 0(%sr3,%r24), %fr4
++#endif
++
++cas2_lock_start:
++ /* Load start of lock table */
++ ldil L%lws_lock_start, %r20
++ ldo R%lws_lock_start(%r20), %r28
++
++ /* Extract four bits from r26 and hash lock (Bits 4-7) */
++ extru %r26, 27, 4, %r20
++
++ /* Find lock to use, the hash is either one of 0 to
++ 15, multiplied by 16 (keep it 16-byte aligned)
++ and add to the lock table offset. */
++ shlw %r20, 4, %r20
++ add %r20, %r28, %r20
++
++ rsm PSW_SM_I, %r0 /* Disable interrupts */
++ /* COW breaks can cause contention on UP systems */
++ LDCW 0(%sr2,%r20), %r28 /* Try to acquire the lock */
++ cmpb,<>,n %r0, %r28, cas2_action /* Did we get it? */
++cas2_wouldblock:
++ ldo 2(%r0), %r28 /* 2nd case */
++ ssm PSW_SM_I, %r0
++ b lws_exit /* Contended... */
++ ldo -EAGAIN(%r0), %r21 /* Spin in userspace */
++
++ /*
++ prev = *addr;
++ if ( prev == old )
++ *addr = new;
++ return prev;
++ */
++
++ /* NOTES:
++ This all works because intr_do_signal
++ and schedule both check the return iasq
++ and see that we are on the kernel page
++ so this process is never scheduled off
++ nor is it ever sent any signal of any sort,
++ thus it is wholly atomic from userspace's
++ perspective
++ */
++cas2_action:
++ /* Jump to the correct function */
++ blr %r29, %r0
++ /* Set %r28 as non-zero for now */
++ ldo 1(%r0),%r28
++
++ /* 8bit CAS */
++13: ldb,ma 0(%sr3,%r26), %r29
++ sub,= %r29, %r25, %r0
++ b,n cas2_end
++14: stb,ma %r24, 0(%sr3,%r26)
++ b cas2_end
++ copy %r0, %r28
++ nop
++ nop
++
++ /* 16bit CAS */
++15: ldh,ma 0(%sr3,%r26), %r29
++ sub,= %r29, %r25, %r0
++ b,n cas2_end
++16: sth,ma %r24, 0(%sr3,%r26)
++ b cas2_end
++ copy %r0, %r28
++ nop
++ nop
++
++ /* 32bit CAS */
++17: ldw,ma 0(%sr3,%r26), %r29
++ sub,= %r29, %r25, %r0
++ b,n cas2_end
++18: stw,ma %r24, 0(%sr3,%r26)
++ b cas2_end
++ copy %r0, %r28
++ nop
++ nop
++
++ /* 64bit CAS */
++#ifdef CONFIG_64BIT
++19: ldd,ma 0(%sr3,%r26), %r29
++ sub,= %r29, %r25, %r0
++ b,n cas2_end
++20: std,ma %r24, 0(%sr3,%r26)
++ copy %r0, %r28
++#else
++ /* Compare first word */
++19: ldw,ma 0(%sr3,%r26), %r29
++ sub,= %r29, %r22, %r0
++ b,n cas2_end
++ /* Compare second word */
++20: ldw,ma 4(%sr3,%r26), %r29
++ sub,= %r29, %r23, %r0
++ b,n cas2_end
++ /* Perform the store */
++21: fstdx %fr4, 0(%sr3,%r26)
++ copy %r0, %r28
++#endif
++
++cas2_end:
++ /* Free lock */
++ stw,ma %r20, 0(%sr2,%r20)
++ /* Enable interrupts */
++ ssm PSW_SM_I, %r0
++ /* Return to userspace, set no error */
++ b lws_exit
++ copy %r0, %r21
++
++22:
++ /* Error occurred on load or store */
++ /* Free lock */
++ stw %r20, 0(%sr2,%r20)
++ ssm PSW_SM_I, %r0
++ ldo 1(%r0),%r28
++ b lws_exit
++ ldo -EFAULT(%r0),%r21 /* set errno */
++ nop
++ nop
++ nop
++
++ /* Exception table entries, for the load and store, return EFAULT.
++ Each of the entries must be relocated. */
++ ASM_EXCEPTIONTABLE_ENTRY(4b-linux_gateway_page, 22b-linux_gateway_page)
++ ASM_EXCEPTIONTABLE_ENTRY(5b-linux_gateway_page, 22b-linux_gateway_page)
++ ASM_EXCEPTIONTABLE_ENTRY(6b-linux_gateway_page, 22b-linux_gateway_page)
++ ASM_EXCEPTIONTABLE_ENTRY(7b-linux_gateway_page, 22b-linux_gateway_page)
++ ASM_EXCEPTIONTABLE_ENTRY(8b-linux_gateway_page, 22b-linux_gateway_page)
++ ASM_EXCEPTIONTABLE_ENTRY(9b-linux_gateway_page, 22b-linux_gateway_page)
++ ASM_EXCEPTIONTABLE_ENTRY(10b-linux_gateway_page, 22b-linux_gateway_page)
++ ASM_EXCEPTIONTABLE_ENTRY(11b-linux_gateway_page, 22b-linux_gateway_page)
++ ASM_EXCEPTIONTABLE_ENTRY(13b-linux_gateway_page, 22b-linux_gateway_page)
++ ASM_EXCEPTIONTABLE_ENTRY(14b-linux_gateway_page, 22b-linux_gateway_page)
++ ASM_EXCEPTIONTABLE_ENTRY(15b-linux_gateway_page, 22b-linux_gateway_page)
++ ASM_EXCEPTIONTABLE_ENTRY(16b-linux_gateway_page, 22b-linux_gateway_page)
++ ASM_EXCEPTIONTABLE_ENTRY(17b-linux_gateway_page, 22b-linux_gateway_page)
++ ASM_EXCEPTIONTABLE_ENTRY(18b-linux_gateway_page, 22b-linux_gateway_page)
++ ASM_EXCEPTIONTABLE_ENTRY(19b-linux_gateway_page, 22b-linux_gateway_page)
++ ASM_EXCEPTIONTABLE_ENTRY(20b-linux_gateway_page, 22b-linux_gateway_page)
++#ifndef CONFIG_64BIT
++ ASM_EXCEPTIONTABLE_ENTRY(12b-linux_gateway_page, 22b-linux_gateway_page)
++ ASM_EXCEPTIONTABLE_ENTRY(21b-linux_gateway_page, 22b-linux_gateway_page)
++#endif
++
+ /* Make sure nothing else is placed on this page */
+ .align PAGE_SIZE
+ END(linux_gateway_page)
+@@ -675,8 +899,9 @@ ENTRY(end_linux_gateway_page)
+ /* Light-weight-syscall table */
+ /* Start of lws table. */
+ ENTRY(lws_table)
+- LWS_ENTRY(compare_and_swap32) /* 0 - ELF32 Atomic compare and swap */
+- LWS_ENTRY(compare_and_swap64) /* 1 - ELF64 Atomic compare and swap */
++ LWS_ENTRY(compare_and_swap32) /* 0 - ELF32 Atomic 32bit CAS */
++ LWS_ENTRY(compare_and_swap64) /* 1 - ELF64 Atomic 32bit CAS */
++ LWS_ENTRY(compare_and_swap_2) /* 2 - ELF32 Atomic 64bit CAS */
+ END(lws_table)
+ /* End of lws table */
+
+diff --git a/arch/powerpc/include/asm/ptrace.h b/arch/powerpc/include/asm/ptrace.h
+index 279b80f3bb29..c0c61fa9cd9e 100644
+--- a/arch/powerpc/include/asm/ptrace.h
++++ b/arch/powerpc/include/asm/ptrace.h
+@@ -47,6 +47,12 @@
+ STACK_FRAME_OVERHEAD + KERNEL_REDZONE_SIZE)
+ #define STACK_FRAME_MARKER 12
+
++#if defined(_CALL_ELF) && _CALL_ELF == 2
++#define STACK_FRAME_MIN_SIZE 32
++#else
++#define STACK_FRAME_MIN_SIZE STACK_FRAME_OVERHEAD
++#endif
++
+ /* Size of dummy stack frame allocated when calling signal handler. */
+ #define __SIGNAL_FRAMESIZE 128
+ #define __SIGNAL_FRAMESIZE32 64
+@@ -60,6 +66,7 @@
+ #define STACK_FRAME_REGS_MARKER ASM_CONST(0x72656773)
+ #define STACK_INT_FRAME_SIZE (sizeof(struct pt_regs) + STACK_FRAME_OVERHEAD)
+ #define STACK_FRAME_MARKER 2
++#define STACK_FRAME_MIN_SIZE STACK_FRAME_OVERHEAD
+
+ /* Size of stack frame allocated when calling signal handler. */
+ #define __SIGNAL_FRAMESIZE 64
+diff --git a/arch/powerpc/include/asm/spinlock.h b/arch/powerpc/include/asm/spinlock.h
+index 35aa339410bd..4dbe072eecbe 100644
+--- a/arch/powerpc/include/asm/spinlock.h
++++ b/arch/powerpc/include/asm/spinlock.h
+@@ -61,6 +61,7 @@ static __always_inline int arch_spin_value_unlocked(arch_spinlock_t lock)
+
+ static inline int arch_spin_is_locked(arch_spinlock_t *lock)
+ {
++ smp_mb();
+ return !arch_spin_value_unlocked(*lock);
+ }
+
+diff --git a/arch/powerpc/lib/locks.c b/arch/powerpc/lib/locks.c
+index 0c9c8d7d0734..170a0346f756 100644
+--- a/arch/powerpc/lib/locks.c
++++ b/arch/powerpc/lib/locks.c
+@@ -70,12 +70,16 @@ void __rw_yield(arch_rwlock_t *rw)
+
+ void arch_spin_unlock_wait(arch_spinlock_t *lock)
+ {
++ smp_mb();
++
+ while (lock->slock) {
+ HMT_low();
+ if (SHARED_PROCESSOR)
+ __spin_yield(lock);
+ }
+ HMT_medium();
++
++ smp_mb();
+ }
+
+ EXPORT_SYMBOL(arch_spin_unlock_wait);
+diff --git a/arch/powerpc/perf/callchain.c b/arch/powerpc/perf/callchain.c
+index 74d1e780748b..2396dda282cd 100644
+--- a/arch/powerpc/perf/callchain.c
++++ b/arch/powerpc/perf/callchain.c
+@@ -35,7 +35,7 @@ static int valid_next_sp(unsigned long sp, unsigned long prev_sp)
+ return 0; /* must be 16-byte aligned */
+ if (!validate_sp(sp, current, STACK_FRAME_OVERHEAD))
+ return 0;
+- if (sp >= prev_sp + STACK_FRAME_OVERHEAD)
++ if (sp >= prev_sp + STACK_FRAME_MIN_SIZE)
+ return 1;
+ /*
+ * sp could decrease when we jump off an interrupt stack
+diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
+index fcba5e03839f..8904e1282562 100644
+--- a/arch/s390/include/asm/pgtable.h
++++ b/arch/s390/include/asm/pgtable.h
+@@ -1115,7 +1115,7 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
+ unsigned long addr, pte_t *ptep)
+ {
+ pgste_t pgste;
+- pte_t pte;
++ pte_t pte, oldpte;
+ int young;
+
+ if (mm_has_pgste(vma->vm_mm)) {
+@@ -1123,12 +1123,13 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
+ pgste = pgste_ipte_notify(vma->vm_mm, ptep, pgste);
+ }
+
+- pte = *ptep;
++ oldpte = pte = *ptep;
+ ptep_flush_direct(vma->vm_mm, addr, ptep);
+ young = pte_young(pte);
+ pte = pte_mkold(pte);
+
+ if (mm_has_pgste(vma->vm_mm)) {
++ pgste = pgste_update_all(&oldpte, pgste, vma->vm_mm);
+ pgste = pgste_set_pte(ptep, pgste, pte);
+ pgste_set_unlock(ptep, pgste);
+ } else
+@@ -1318,6 +1319,7 @@ static inline int ptep_set_access_flags(struct vm_area_struct *vma,
+ ptep_flush_direct(vma->vm_mm, address, ptep);
+
+ if (mm_has_pgste(vma->vm_mm)) {
++ pgste_set_key(ptep, pgste, entry, vma->vm_mm);
+ pgste = pgste_set_pte(ptep, pgste, entry);
+ pgste_set_unlock(ptep, pgste);
+ } else
+diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
+index 2f3e14fe91a4..0eaf87281f45 100644
+--- a/arch/s390/kvm/kvm-s390.c
++++ b/arch/s390/kvm/kvm-s390.c
+@@ -1286,19 +1286,6 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
+
+ kvm_s390_vcpu_start(vcpu);
+
+- switch (kvm_run->exit_reason) {
+- case KVM_EXIT_S390_SIEIC:
+- case KVM_EXIT_UNKNOWN:
+- case KVM_EXIT_INTR:
+- case KVM_EXIT_S390_RESET:
+- case KVM_EXIT_S390_UCONTROL:
+- case KVM_EXIT_S390_TSCH:
+- case KVM_EXIT_DEBUG:
+- break;
+- default:
+- BUG();
+- }
+-
+ vcpu->arch.sie_block->gpsw.mask = kvm_run->psw_mask;
+ vcpu->arch.sie_block->gpsw.addr = kvm_run->psw_addr;
+ if (kvm_run->kvm_dirty_regs & KVM_SYNC_PREFIX) {
+diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
+index f90ad8592b36..98eeb823342c 100644
+--- a/arch/s390/mm/pgtable.c
++++ b/arch/s390/mm/pgtable.c
+@@ -986,11 +986,21 @@ int set_guest_storage_key(struct mm_struct *mm, unsigned long addr,
+ pte_t *ptep;
+
+ down_read(&mm->mmap_sem);
++retry:
+ ptep = get_locked_pte(current->mm, addr, &ptl);
+ if (unlikely(!ptep)) {
+ up_read(&mm->mmap_sem);
+ return -EFAULT;
+ }
++ if (!(pte_val(*ptep) & _PAGE_INVALID) &&
++ (pte_val(*ptep) & _PAGE_PROTECT)) {
++ pte_unmap_unlock(*ptep, ptl);
++ if (fixup_user_fault(current, mm, addr, FAULT_FLAG_WRITE)) {
++ up_read(&mm->mmap_sem);
++ return -EFAULT;
++ }
++ goto retry;
++ }
+
+ new = old = pgste_get_lock(ptep);
+ pgste_val(new) &= ~(PGSTE_GR_BIT | PGSTE_GC_BIT |
+diff --git a/arch/x86/boot/compressed/aslr.c b/arch/x86/boot/compressed/aslr.c
+index fc6091abedb7..d39189ba7f8e 100644
+--- a/arch/x86/boot/compressed/aslr.c
++++ b/arch/x86/boot/compressed/aslr.c
+@@ -183,12 +183,27 @@ static void mem_avoid_init(unsigned long input, unsigned long input_size,
+ static bool mem_avoid_overlap(struct mem_vector *img)
+ {
+ int i;
++ struct setup_data *ptr;
+
+ for (i = 0; i < MEM_AVOID_MAX; i++) {
+ if (mem_overlaps(img, &mem_avoid[i]))
+ return true;
+ }
+
++ /* Avoid all entries in the setup_data linked list. */
++ ptr = (struct setup_data *)(unsigned long)real_mode->hdr.setup_data;
++ while (ptr) {
++ struct mem_vector avoid;
++
++ avoid.start = (u64)ptr;
++ avoid.size = sizeof(*ptr) + ptr->len;
++
++ if (mem_overlaps(img, &avoid))
++ return true;
++
++ ptr = (struct setup_data *)(unsigned long)ptr->next;
++ }
++
+ return false;
+ }
+
+diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h
+index b0910f97a3ea..ffb1733ac91f 100644
+--- a/arch/x86/include/asm/fixmap.h
++++ b/arch/x86/include/asm/fixmap.h
+@@ -106,14 +106,14 @@ enum fixed_addresses {
+ __end_of_permanent_fixed_addresses,
+
+ /*
+- * 256 temporary boot-time mappings, used by early_ioremap(),
++ * 512 temporary boot-time mappings, used by early_ioremap(),
+ * before ioremap() is functional.
+ *
+- * If necessary we round it up to the next 256 pages boundary so
++ * If necessary we round it up to the next 512 pages boundary so
+ * that we can have a single pgd entry and a single pte table:
+ */
+ #define NR_FIX_BTMAPS 64
+-#define FIX_BTMAPS_SLOTS 4
++#define FIX_BTMAPS_SLOTS 8
+ #define TOTAL_FIX_BTMAPS (NR_FIX_BTMAPS * FIX_BTMAPS_SLOTS)
+ FIX_BTMAP_END =
+ (__end_of_permanent_fixed_addresses ^
+diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
+index 5be9063545d2..3874693c0e53 100644
+--- a/arch/x86/include/asm/pgtable_64.h
++++ b/arch/x86/include/asm/pgtable_64.h
+@@ -19,6 +19,7 @@ extern pud_t level3_ident_pgt[512];
+ extern pmd_t level2_kernel_pgt[512];
+ extern pmd_t level2_fixmap_pgt[512];
+ extern pmd_t level2_ident_pgt[512];
++extern pte_t level1_fixmap_pgt[512];
+ extern pgd_t init_level4_pgt[];
+
+ #define swapper_pg_dir init_level4_pgt
+diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
+index 5492798930ef..215815b6407c 100644
+--- a/arch/x86/kernel/smpboot.c
++++ b/arch/x86/kernel/smpboot.c
+@@ -1292,6 +1292,9 @@ static void remove_siblinginfo(int cpu)
+
+ for_each_cpu(sibling, cpu_sibling_mask(cpu))
+ cpumask_clear_cpu(cpu, cpu_sibling_mask(sibling));
++ for_each_cpu(sibling, cpu_llc_shared_mask(cpu))
++ cpumask_clear_cpu(cpu, cpu_llc_shared_mask(sibling));
++ cpumask_clear(cpu_llc_shared_mask(cpu));
+ cpumask_clear(cpu_sibling_mask(cpu));
+ cpumask_clear(cpu_core_mask(cpu));
+ c->phys_proc_id = 0;
+diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
+index e8a1201c3293..16fb0099b7f2 100644
+--- a/arch/x86/xen/mmu.c
++++ b/arch/x86/xen/mmu.c
+@@ -1866,12 +1866,11 @@ static void __init check_pt_base(unsigned long *pt_base, unsigned long *pt_end,
+ *
+ * We can construct this by grafting the Xen provided pagetable into
+ * head_64.S's preconstructed pagetables. We copy the Xen L2's into
+- * level2_ident_pgt, level2_kernel_pgt and level2_fixmap_pgt. This
+- * means that only the kernel has a physical mapping to start with -
+- * but that's enough to get __va working. We need to fill in the rest
+- * of the physical mapping once some sort of allocator has been set
+- * up.
+- * NOTE: for PVH, the page tables are native.
++ * level2_ident_pgt, and level2_kernel_pgt. This means that only the
++ * kernel has a physical mapping to start with - but that's enough to
++ * get __va working. We need to fill in the rest of the physical
++ * mapping once some sort of allocator has been set up. NOTE: for
++ * PVH, the page tables are native.
+ */
+ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
+ {
+@@ -1902,8 +1901,11 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
+ /* L3_i[0] -> level2_ident_pgt */
+ convert_pfn_mfn(level3_ident_pgt);
+ /* L3_k[510] -> level2_kernel_pgt
+- * L3_i[511] -> level2_fixmap_pgt */
++ * L3_k[511] -> level2_fixmap_pgt */
+ convert_pfn_mfn(level3_kernel_pgt);
++
++ /* L3_k[511][506] -> level1_fixmap_pgt */
++ convert_pfn_mfn(level2_fixmap_pgt);
+ }
+ /* We get [511][511] and have Xen's version of level2_kernel_pgt */
+ l3 = m2v(pgd[pgd_index(__START_KERNEL_map)].pgd);
+@@ -1913,21 +1915,15 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
+ addr[1] = (unsigned long)l3;
+ addr[2] = (unsigned long)l2;
+ /* Graft it onto L4[272][0]. Note that we creating an aliasing problem:
+- * Both L4[272][0] and L4[511][511] have entries that point to the same
++ * Both L4[272][0] and L4[511][510] have entries that point to the same
+ * L2 (PMD) tables. Meaning that if you modify it in __va space
+ * it will be also modified in the __ka space! (But if you just
+ * modify the PMD table to point to other PTE's or none, then you
+ * are OK - which is what cleanup_highmap does) */
+ copy_page(level2_ident_pgt, l2);
+- /* Graft it onto L4[511][511] */
++ /* Graft it onto L4[511][510] */
+ copy_page(level2_kernel_pgt, l2);
+
+- /* Get [511][510] and graft that in level2_fixmap_pgt */
+- l3 = m2v(pgd[pgd_index(__START_KERNEL_map + PMD_SIZE)].pgd);
+- l2 = m2v(l3[pud_index(__START_KERNEL_map + PMD_SIZE)].pud);
+- copy_page(level2_fixmap_pgt, l2);
+- /* Note that we don't do anything with level1_fixmap_pgt which
+- * we don't need. */
+ if (!xen_feature(XENFEAT_auto_translated_physmap)) {
+ /* Make pagetable pieces RO */
+ set_page_prot(init_level4_pgt, PAGE_KERNEL_RO);
+@@ -1937,6 +1933,7 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
+ set_page_prot(level2_ident_pgt, PAGE_KERNEL_RO);
+ set_page_prot(level2_kernel_pgt, PAGE_KERNEL_RO);
+ set_page_prot(level2_fixmap_pgt, PAGE_KERNEL_RO);
++ set_page_prot(level1_fixmap_pgt, PAGE_KERNEL_RO);
+
+ /* Pin down new L4 */
+ pin_pagetable_pfn(MMUEXT_PIN_L4_TABLE,
+diff --git a/arch/xtensa/include/asm/pgtable.h b/arch/xtensa/include/asm/pgtable.h
+index 4b0ca35a93b1..b2173e5da601 100644
+--- a/arch/xtensa/include/asm/pgtable.h
++++ b/arch/xtensa/include/asm/pgtable.h
+@@ -67,7 +67,12 @@
+ #define VMALLOC_START 0xC0000000
+ #define VMALLOC_END 0xC7FEFFFF
+ #define TLBTEMP_BASE_1 0xC7FF0000
+-#define TLBTEMP_BASE_2 0xC7FF8000
++#define TLBTEMP_BASE_2 (TLBTEMP_BASE_1 + DCACHE_WAY_SIZE)
++#if 2 * DCACHE_WAY_SIZE > ICACHE_WAY_SIZE
++#define TLBTEMP_SIZE (2 * DCACHE_WAY_SIZE)
++#else
++#define TLBTEMP_SIZE ICACHE_WAY_SIZE
++#endif
+
+ /*
+ * For the Xtensa architecture, the PTE layout is as follows:
+diff --git a/arch/xtensa/include/asm/uaccess.h b/arch/xtensa/include/asm/uaccess.h
+index fd686dc45d1a..c7211e7e182d 100644
+--- a/arch/xtensa/include/asm/uaccess.h
++++ b/arch/xtensa/include/asm/uaccess.h
+@@ -52,7 +52,12 @@
+ */
+ .macro get_fs ad, sp
+ GET_CURRENT(\ad,\sp)
++#if THREAD_CURRENT_DS > 1020
++ addi \ad, \ad, TASK_THREAD
++ l32i \ad, \ad, THREAD_CURRENT_DS - TASK_THREAD
++#else
+ l32i \ad, \ad, THREAD_CURRENT_DS
++#endif
+ .endm
+
+ /*
+diff --git a/arch/xtensa/include/uapi/asm/ioctls.h b/arch/xtensa/include/uapi/asm/ioctls.h
+index b4cb1100c0fb..a47909f0c34b 100644
+--- a/arch/xtensa/include/uapi/asm/ioctls.h
++++ b/arch/xtensa/include/uapi/asm/ioctls.h
+@@ -28,17 +28,17 @@
+ #define TCSETSW 0x5403
+ #define TCSETSF 0x5404
+
+-#define TCGETA _IOR('t', 23, struct termio)
+-#define TCSETA _IOW('t', 24, struct termio)
+-#define TCSETAW _IOW('t', 25, struct termio)
+-#define TCSETAF _IOW('t', 28, struct termio)
++#define TCGETA 0x80127417 /* _IOR('t', 23, struct termio) */
++#define TCSETA 0x40127418 /* _IOW('t', 24, struct termio) */
++#define TCSETAW 0x40127419 /* _IOW('t', 25, struct termio) */
++#define TCSETAF 0x4012741C /* _IOW('t', 28, struct termio) */
+
+ #define TCSBRK _IO('t', 29)
+ #define TCXONC _IO('t', 30)
+ #define TCFLSH _IO('t', 31)
+
+-#define TIOCSWINSZ _IOW('t', 103, struct winsize)
+-#define TIOCGWINSZ _IOR('t', 104, struct winsize)
++#define TIOCSWINSZ 0x40087467 /* _IOW('t', 103, struct winsize) */
++#define TIOCGWINSZ 0x80087468 /* _IOR('t', 104, struct winsize) */
+ #define TIOCSTART _IO('t', 110) /* start output, like ^Q */
+ #define TIOCSTOP _IO('t', 111) /* stop output, like ^S */
+ #define TIOCOUTQ _IOR('t', 115, int) /* output queue size */
+@@ -88,7 +88,6 @@
+ #define TIOCSETD _IOW('T', 35, int)
+ #define TIOCGETD _IOR('T', 36, int)
+ #define TCSBRKP _IOW('T', 37, int) /* Needed for POSIX tcsendbreak()*/
+-#define TIOCTTYGSTRUCT _IOR('T', 38, struct tty_struct) /* For debugging only*/
+ #define TIOCSBRK _IO('T', 39) /* BSD compatibility */
+ #define TIOCCBRK _IO('T', 40) /* BSD compatibility */
+ #define TIOCGSID _IOR('T', 41, pid_t) /* Return the session ID of FD*/
+@@ -114,8 +113,10 @@
+ #define TIOCSERGETLSR _IOR('T', 89, unsigned int) /* Get line status reg. */
+ /* ioctl (fd, TIOCSERGETLSR, &result) where result may be as below */
+ # define TIOCSER_TEMT 0x01 /* Transmitter physically empty */
+-#define TIOCSERGETMULTI _IOR('T', 90, struct serial_multiport_struct) /* Get multiport config */
+-#define TIOCSERSETMULTI _IOW('T', 91, struct serial_multiport_struct) /* Set multiport config */
++#define TIOCSERGETMULTI 0x80a8545a /* Get multiport config */
++ /* _IOR('T', 90, struct serial_multiport_struct) */
++#define TIOCSERSETMULTI 0x40a8545b /* Set multiport config */
++ /* _IOW('T', 91, struct serial_multiport_struct) */
+
+ #define TIOCMIWAIT _IO('T', 92) /* wait for a change on serial input line(s) */
+ #define TIOCGICOUNT 0x545D /* read serial port inline interrupt counts */
+diff --git a/arch/xtensa/kernel/entry.S b/arch/xtensa/kernel/entry.S
+index ef7f4990722b..a06b7efaae82 100644
+--- a/arch/xtensa/kernel/entry.S
++++ b/arch/xtensa/kernel/entry.S
+@@ -1001,9 +1001,8 @@ ENTRY(fast_syscall_xtensa)
+ movi a7, 4 # sizeof(unsigned int)
+ access_ok a3, a7, a0, a2, .Leac # a0: scratch reg, a2: sp
+
+- addi a6, a6, -1 # assuming SYS_XTENSA_ATOMIC_SET = 1
+- _bgeui a6, SYS_XTENSA_COUNT - 1, .Lill
+- _bnei a6, SYS_XTENSA_ATOMIC_CMP_SWP - 1, .Lnswp
++ _bgeui a6, SYS_XTENSA_COUNT, .Lill
++ _bnei a6, SYS_XTENSA_ATOMIC_CMP_SWP, .Lnswp
+
+ /* Fall through for ATOMIC_CMP_SWP. */
+
+@@ -1015,27 +1014,26 @@ TRY s32i a5, a3, 0 # different, modify value
+ l32i a7, a2, PT_AREG7 # restore a7
+ l32i a0, a2, PT_AREG0 # restore a0
+ movi a2, 1 # and return 1
+- addi a6, a6, 1 # restore a6 (really necessary?)
+ rfe
+
+ 1: l32i a7, a2, PT_AREG7 # restore a7
+ l32i a0, a2, PT_AREG0 # restore a0
+ movi a2, 0 # return 0 (note that we cannot set
+- addi a6, a6, 1 # restore a6 (really necessary?)
+ rfe
+
+ .Lnswp: /* Atomic set, add, and exg_add. */
+
+ TRY l32i a7, a3, 0 # orig
++ addi a6, a6, -SYS_XTENSA_ATOMIC_SET
+ add a0, a4, a7 # + arg
+ moveqz a0, a4, a6 # set
++ addi a6, a6, SYS_XTENSA_ATOMIC_SET
+ TRY s32i a0, a3, 0 # write new value
+
+ mov a0, a2
+ mov a2, a7
+ l32i a7, a0, PT_AREG7 # restore a7
+ l32i a0, a0, PT_AREG0 # restore a0
+- addi a6, a6, 1 # restore a6 (really necessary?)
+ rfe
+
+ CATCH
+@@ -1044,7 +1042,7 @@ CATCH
+ movi a2, -EFAULT
+ rfe
+
+-.Lill: l32i a7, a2, PT_AREG0 # restore a7
++.Lill: l32i a7, a2, PT_AREG7 # restore a7
+ l32i a0, a2, PT_AREG0 # restore a0
+ movi a2, -EINVAL
+ rfe
+@@ -1565,7 +1563,7 @@ ENTRY(fast_second_level_miss)
+ rsr a0, excvaddr
+ bltu a0, a3, 2f
+
+- addi a1, a0, -(2 << (DCACHE_ALIAS_ORDER + PAGE_SHIFT))
++ addi a1, a0, -TLBTEMP_SIZE
+ bgeu a1, a3, 2f
+
+ /* Check if we have to restore an ITLB mapping. */
+@@ -1820,7 +1818,6 @@ ENTRY(_switch_to)
+
+ entry a1, 16
+
+- mov a10, a2 # preserve 'prev' (a2)
+ mov a11, a3 # and 'next' (a3)
+
+ l32i a4, a2, TASK_THREAD_INFO
+@@ -1828,8 +1825,14 @@ ENTRY(_switch_to)
+
+ save_xtregs_user a4 a6 a8 a9 a12 a13 THREAD_XTREGS_USER
+
+- s32i a0, a10, THREAD_RA # save return address
+- s32i a1, a10, THREAD_SP # save stack pointer
++#if THREAD_RA > 1020 || THREAD_SP > 1020
++ addi a10, a2, TASK_THREAD
++ s32i a0, a10, THREAD_RA - TASK_THREAD # save return address
++ s32i a1, a10, THREAD_SP - TASK_THREAD # save stack pointer
++#else
++ s32i a0, a2, THREAD_RA # save return address
++ s32i a1, a2, THREAD_SP # save stack pointer
++#endif
+
+ /* Disable ints while we manipulate the stack pointer. */
+
+@@ -1870,7 +1873,6 @@ ENTRY(_switch_to)
+ load_xtregs_user a5 a6 a8 a9 a12 a13 THREAD_XTREGS_USER
+
+ wsr a14, ps
+- mov a2, a10 # return 'prev'
+ rsync
+
+ retw
+diff --git a/arch/xtensa/kernel/pci-dma.c b/arch/xtensa/kernel/pci-dma.c
+index 2d9cc6dbfd78..e8b76b8e4b29 100644
+--- a/arch/xtensa/kernel/pci-dma.c
++++ b/arch/xtensa/kernel/pci-dma.c
+@@ -49,9 +49,8 @@ dma_alloc_coherent(struct device *dev,size_t size,dma_addr_t *handle,gfp_t flag)
+
+ /* We currently don't support coherent memory outside KSEG */
+
+- if (ret < XCHAL_KSEG_CACHED_VADDR
+- || ret >= XCHAL_KSEG_CACHED_VADDR + XCHAL_KSEG_SIZE)
+- BUG();
++ BUG_ON(ret < XCHAL_KSEG_CACHED_VADDR ||
++ ret > XCHAL_KSEG_CACHED_VADDR + XCHAL_KSEG_SIZE - 1);
+
+
+ if (ret != 0) {
+@@ -68,10 +67,11 @@ EXPORT_SYMBOL(dma_alloc_coherent);
+ void dma_free_coherent(struct device *hwdev, size_t size,
+ void *vaddr, dma_addr_t dma_handle)
+ {
+- long addr=(long)vaddr+XCHAL_KSEG_CACHED_VADDR-XCHAL_KSEG_BYPASS_VADDR;
++ unsigned long addr = (unsigned long)vaddr +
++ XCHAL_KSEG_CACHED_VADDR - XCHAL_KSEG_BYPASS_VADDR;
+
+- if (addr < 0 || addr >= XCHAL_KSEG_SIZE)
+- BUG();
++ BUG_ON(addr < XCHAL_KSEG_CACHED_VADDR ||
++ addr > XCHAL_KSEG_CACHED_VADDR + XCHAL_KSEG_SIZE - 1);
+
+ free_pages(addr, get_order(size));
+ }
+diff --git a/block/blk-mq.c b/block/blk-mq.c
+index ad69ef657e85..06ac59f5bb5a 100644
+--- a/block/blk-mq.c
++++ b/block/blk-mq.c
+@@ -219,7 +219,6 @@ __blk_mq_alloc_request(struct blk_mq_alloc_data *data, int rw)
+ if (tag != BLK_MQ_TAG_FAIL) {
+ rq = data->hctx->tags->rqs[tag];
+
+- rq->cmd_flags = 0;
+ if (blk_mq_tag_busy(data->hctx)) {
+ rq->cmd_flags = REQ_MQ_INFLIGHT;
+ atomic_inc(&data->hctx->nr_active);
+@@ -274,6 +273,7 @@ static void __blk_mq_free_request(struct blk_mq_hw_ctx *hctx,
+
+ if (rq->cmd_flags & REQ_MQ_INFLIGHT)
+ atomic_dec(&hctx->nr_active);
++ rq->cmd_flags = 0;
+
+ clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags);
+ blk_mq_put_tag(hctx, tag, &ctx->last_tag);
+@@ -1411,6 +1411,8 @@ static struct blk_mq_tags *blk_mq_init_rq_map(struct blk_mq_tag_set *set,
+ left -= to_do * rq_size;
+ for (j = 0; j < to_do; j++) {
+ tags->rqs[i] = p;
++ tags->rqs[i]->atomic_flags = 0;
++ tags->rqs[i]->cmd_flags = 0;
+ if (set->ops->init_request) {
+ if (set->ops->init_request(set->driver_data,
+ tags->rqs[i], hctx_idx, i,
+diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
+index cadc37841744..d7494637c5db 100644
+--- a/block/cfq-iosched.c
++++ b/block/cfq-iosched.c
+@@ -1275,12 +1275,16 @@ __cfq_group_service_tree_add(struct cfq_rb_root *st, struct cfq_group *cfqg)
+ static void
+ cfq_update_group_weight(struct cfq_group *cfqg)
+ {
+- BUG_ON(!RB_EMPTY_NODE(&cfqg->rb_node));
+-
+ if (cfqg->new_weight) {
+ cfqg->weight = cfqg->new_weight;
+ cfqg->new_weight = 0;
+ }
++}
++
++static void
++cfq_update_group_leaf_weight(struct cfq_group *cfqg)
++{
++ BUG_ON(!RB_EMPTY_NODE(&cfqg->rb_node));
+
+ if (cfqg->new_leaf_weight) {
+ cfqg->leaf_weight = cfqg->new_leaf_weight;
+@@ -1299,7 +1303,7 @@ cfq_group_service_tree_add(struct cfq_rb_root *st, struct cfq_group *cfqg)
+ /* add to the service tree */
+ BUG_ON(!RB_EMPTY_NODE(&cfqg->rb_node));
+
+- cfq_update_group_weight(cfqg);
++ cfq_update_group_leaf_weight(cfqg);
+ __cfq_group_service_tree_add(st, cfqg);
+
+ /*
+@@ -1323,6 +1327,7 @@ cfq_group_service_tree_add(struct cfq_rb_root *st, struct cfq_group *cfqg)
+ */
+ while ((parent = cfqg_parent(pos))) {
+ if (propagate) {
++ cfq_update_group_weight(pos);
+ propagate = !parent->nr_active++;
+ parent->children_weight += pos->weight;
+ }
+diff --git a/block/genhd.c b/block/genhd.c
+index 791f41943132..e6723bd4d7a1 100644
+--- a/block/genhd.c
++++ b/block/genhd.c
+@@ -28,10 +28,10 @@ struct kobject *block_depr;
+ /* for extended dynamic devt allocation, currently only one major is used */
+ #define NR_EXT_DEVT (1 << MINORBITS)
+
+-/* For extended devt allocation. ext_devt_mutex prevents look up
++/* For extended devt allocation. ext_devt_lock prevents look up
+ * results from going away underneath its user.
+ */
+-static DEFINE_MUTEX(ext_devt_mutex);
++static DEFINE_SPINLOCK(ext_devt_lock);
+ static DEFINE_IDR(ext_devt_idr);
+
+ static struct device_type disk_type;
+@@ -420,9 +420,13 @@ int blk_alloc_devt(struct hd_struct *part, dev_t *devt)
+ }
+
+ /* allocate ext devt */
+- mutex_lock(&ext_devt_mutex);
+- idx = idr_alloc(&ext_devt_idr, part, 0, NR_EXT_DEVT, GFP_KERNEL);
+- mutex_unlock(&ext_devt_mutex);
++ idr_preload(GFP_KERNEL);
++
++ spin_lock(&ext_devt_lock);
++ idx = idr_alloc(&ext_devt_idr, part, 0, NR_EXT_DEVT, GFP_NOWAIT);
++ spin_unlock(&ext_devt_lock);
++
++ idr_preload_end();
+ if (idx < 0)
+ return idx == -ENOSPC ? -EBUSY : idx;
+
+@@ -441,15 +445,13 @@ int blk_alloc_devt(struct hd_struct *part, dev_t *devt)
+ */
+ void blk_free_devt(dev_t devt)
+ {
+- might_sleep();
+-
+ if (devt == MKDEV(0, 0))
+ return;
+
+ if (MAJOR(devt) == BLOCK_EXT_MAJOR) {
+- mutex_lock(&ext_devt_mutex);
++ spin_lock(&ext_devt_lock);
+ idr_remove(&ext_devt_idr, blk_mangle_minor(MINOR(devt)));
+- mutex_unlock(&ext_devt_mutex);
++ spin_unlock(&ext_devt_lock);
+ }
+ }
+
+@@ -665,7 +667,6 @@ void del_gendisk(struct gendisk *disk)
+ sysfs_remove_link(block_depr, dev_name(disk_to_dev(disk)));
+ pm_runtime_set_memalloc_noio(disk_to_dev(disk), false);
+ device_del(disk_to_dev(disk));
+- blk_free_devt(disk_to_dev(disk)->devt);
+ }
+ EXPORT_SYMBOL(del_gendisk);
+
+@@ -690,13 +691,13 @@ struct gendisk *get_gendisk(dev_t devt, int *partno)
+ } else {
+ struct hd_struct *part;
+
+- mutex_lock(&ext_devt_mutex);
++ spin_lock(&ext_devt_lock);
+ part = idr_find(&ext_devt_idr, blk_mangle_minor(MINOR(devt)));
+ if (part && get_disk(part_to_disk(part))) {
+ *partno = part->partno;
+ disk = part_to_disk(part);
+ }
+- mutex_unlock(&ext_devt_mutex);
++ spin_unlock(&ext_devt_lock);
+ }
+
+ return disk;
+@@ -1098,6 +1099,7 @@ static void disk_release(struct device *dev)
+ {
+ struct gendisk *disk = dev_to_disk(dev);
+
++ blk_free_devt(dev->devt);
+ disk_release_events(disk);
+ kfree(disk->random);
+ disk_replace_part_tbl(disk, NULL);
+diff --git a/block/partition-generic.c b/block/partition-generic.c
+index 789cdea05893..0d9e5f97f0a8 100644
+--- a/block/partition-generic.c
++++ b/block/partition-generic.c
+@@ -211,6 +211,7 @@ static const struct attribute_group *part_attr_groups[] = {
+ static void part_release(struct device *dev)
+ {
+ struct hd_struct *p = dev_to_part(dev);
++ blk_free_devt(dev->devt);
+ free_part_stats(p);
+ free_part_info(p);
+ kfree(p);
+@@ -253,7 +254,6 @@ void delete_partition(struct gendisk *disk, int partno)
+ rcu_assign_pointer(ptbl->last_lookup, NULL);
+ kobject_put(part->holder_dir);
+ device_del(part_to_dev(part));
+- blk_free_devt(part_devt(part));
+
+ hd_struct_put(part);
+ }
+diff --git a/block/partitions/aix.c b/block/partitions/aix.c
+index 43be471d9b1d..0931f5136ab2 100644
+--- a/block/partitions/aix.c
++++ b/block/partitions/aix.c
+@@ -253,7 +253,7 @@ int aix_partition(struct parsed_partitions *state)
+ continue;
+ }
+ lv_ix = be16_to_cpu(p->lv_ix) - 1;
+- if (lv_ix > state->limit) {
++ if (lv_ix >= state->limit) {
+ cur_lv_ix = -1;
+ continue;
+ }
+diff --git a/drivers/acpi/acpi_cmos_rtc.c b/drivers/acpi/acpi_cmos_rtc.c
+index 2da8660262e5..81dc75033f15 100644
+--- a/drivers/acpi/acpi_cmos_rtc.c
++++ b/drivers/acpi/acpi_cmos_rtc.c
+@@ -33,7 +33,7 @@ acpi_cmos_rtc_space_handler(u32 function, acpi_physical_address address,
+ void *handler_context, void *region_context)
+ {
+ int i;
+- u8 *value = (u8 *)&value64;
++ u8 *value = (u8 *)value64;
+
+ if (address > 0xff || !value64)
+ return AE_BAD_PARAMETER;
+diff --git a/drivers/acpi/acpi_lpss.c b/drivers/acpi/acpi_lpss.c
+index 9cb65b0e7597..2f65b0969edb 100644
+--- a/drivers/acpi/acpi_lpss.c
++++ b/drivers/acpi/acpi_lpss.c
+@@ -392,7 +392,6 @@ static int acpi_lpss_create_device(struct acpi_device *adev,
+ adev->driver_data = pdata;
+ pdev = acpi_create_platform_device(adev);
+ if (!IS_ERR_OR_NULL(pdev)) {
+- device_enable_async_suspend(&pdev->dev);
+ return 1;
+ }
+
+@@ -583,7 +582,7 @@ static int acpi_lpss_suspend_late(struct device *dev)
+ return acpi_dev_suspend_late(dev);
+ }
+
+-static int acpi_lpss_restore_early(struct device *dev)
++static int acpi_lpss_resume_early(struct device *dev)
+ {
+ int ret = acpi_dev_resume_early(dev);
+
+@@ -623,15 +622,15 @@ static int acpi_lpss_runtime_resume(struct device *dev)
+ static struct dev_pm_domain acpi_lpss_pm_domain = {
+ .ops = {
+ #ifdef CONFIG_PM_SLEEP
+- .suspend_late = acpi_lpss_suspend_late,
+- .restore_early = acpi_lpss_restore_early,
+ .prepare = acpi_subsys_prepare,
+ .complete = acpi_subsys_complete,
+ .suspend = acpi_subsys_suspend,
+- .resume_early = acpi_subsys_resume_early,
++ .suspend_late = acpi_lpss_suspend_late,
++ .resume_early = acpi_lpss_resume_early,
+ .freeze = acpi_subsys_freeze,
+ .poweroff = acpi_subsys_suspend,
+- .poweroff_late = acpi_subsys_suspend_late,
++ .poweroff_late = acpi_lpss_suspend_late,
++ .restore_early = acpi_lpss_resume_early,
+ #endif
+ #ifdef CONFIG_PM_RUNTIME
+ .runtime_suspend = acpi_lpss_runtime_suspend,
+diff --git a/drivers/acpi/acpica/aclocal.h b/drivers/acpi/acpica/aclocal.h
+index 91f801a2e689..494775a67ffa 100644
+--- a/drivers/acpi/acpica/aclocal.h
++++ b/drivers/acpi/acpica/aclocal.h
+@@ -254,6 +254,7 @@ struct acpi_create_field_info {
+ u32 field_bit_position;
+ u32 field_bit_length;
+ u16 resource_length;
++ u16 pin_number_index;
+ u8 field_flags;
+ u8 attribute;
+ u8 field_type;
+diff --git a/drivers/acpi/acpica/acobject.h b/drivers/acpi/acpica/acobject.h
+index 22fb6449d3d6..8abb393dafab 100644
+--- a/drivers/acpi/acpica/acobject.h
++++ b/drivers/acpi/acpica/acobject.h
+@@ -264,6 +264,7 @@ struct acpi_object_region_field {
+ ACPI_OBJECT_COMMON_HEADER ACPI_COMMON_FIELD_INFO u16 resource_length;
+ union acpi_operand_object *region_obj; /* Containing op_region object */
+ u8 *resource_buffer; /* resource_template for serial regions/fields */
++ u16 pin_number_index; /* Index relative to previous Connection/Template */
+ };
+
+ struct acpi_object_bank_field {
+diff --git a/drivers/acpi/acpica/dsfield.c b/drivers/acpi/acpica/dsfield.c
+index 3661c8e90540..c57666196672 100644
+--- a/drivers/acpi/acpica/dsfield.c
++++ b/drivers/acpi/acpica/dsfield.c
+@@ -360,6 +360,7 @@ acpi_ds_get_field_names(struct acpi_create_field_info *info,
+ */
+ info->resource_buffer = NULL;
+ info->connection_node = NULL;
++ info->pin_number_index = 0;
+
+ /*
+ * A Connection() is either an actual resource descriptor (buffer)
+@@ -437,6 +438,7 @@ acpi_ds_get_field_names(struct acpi_create_field_info *info,
+ }
+
+ info->field_bit_position += info->field_bit_length;
++ info->pin_number_index++; /* Index relative to previous Connection() */
+ break;
+
+ default:
+diff --git a/drivers/acpi/acpica/evregion.c b/drivers/acpi/acpica/evregion.c
+index 9957297d1580..8eb8575e8c16 100644
+--- a/drivers/acpi/acpica/evregion.c
++++ b/drivers/acpi/acpica/evregion.c
+@@ -142,6 +142,7 @@ acpi_ev_address_space_dispatch(union acpi_operand_object *region_obj,
+ union acpi_operand_object *region_obj2;
+ void *region_context = NULL;
+ struct acpi_connection_info *context;
++ acpi_physical_address address;
+
+ ACPI_FUNCTION_TRACE(ev_address_space_dispatch);
+
+@@ -231,25 +232,23 @@ acpi_ev_address_space_dispatch(union acpi_operand_object *region_obj,
+ /* We have everything we need, we can invoke the address space handler */
+
+ handler = handler_desc->address_space.handler;
+-
+- ACPI_DEBUG_PRINT((ACPI_DB_OPREGION,
+- "Handler %p (@%p) Address %8.8X%8.8X [%s]\n",
+-			  &region_obj->region.handler->address_space, handler,
+- ACPI_FORMAT_NATIVE_UINT(region_obj->region.address +
+- region_offset),
+- acpi_ut_get_region_name(region_obj->region.
+- space_id)));
++ address = (region_obj->region.address + region_offset);
+
+ /*
+ * Special handling for generic_serial_bus and general_purpose_io:
+ * There are three extra parameters that must be passed to the
+ * handler via the context:
+- * 1) Connection buffer, a resource template from Connection() op.
+- * 2) Length of the above buffer.
+- * 3) Actual access length from the access_as() op.
++ * 1) Connection buffer, a resource template from Connection() op
++ * 2) Length of the above buffer
++ * 3) Actual access length from the access_as() op
++ *
++ * In addition, for general_purpose_io, the Address and bit_width fields
++ * are defined as follows:
++ * 1) Address is the pin number index of the field (bit offset from
++ * the previous Connection)
++ * 2) bit_width is the actual bit length of the field (number of pins)
+ */
+- if (((region_obj->region.space_id == ACPI_ADR_SPACE_GSBUS) ||
+- (region_obj->region.space_id == ACPI_ADR_SPACE_GPIO)) &&
++ if ((region_obj->region.space_id == ACPI_ADR_SPACE_GSBUS) &&
+ context && field_obj) {
+
+ /* Get the Connection (resource_template) buffer */
+@@ -258,6 +257,24 @@ acpi_ev_address_space_dispatch(union acpi_operand_object *region_obj,
+ context->length = field_obj->field.resource_length;
+ context->access_length = field_obj->field.access_length;
+ }
++ if ((region_obj->region.space_id == ACPI_ADR_SPACE_GPIO) &&
++ context && field_obj) {
++
++ /* Get the Connection (resource_template) buffer */
++
++ context->connection = field_obj->field.resource_buffer;
++ context->length = field_obj->field.resource_length;
++ context->access_length = field_obj->field.access_length;
++ address = field_obj->field.pin_number_index;
++ bit_width = field_obj->field.bit_length;
++ }
++
++ ACPI_DEBUG_PRINT((ACPI_DB_OPREGION,
++ "Handler %p (@%p) Address %8.8X%8.8X [%s]\n",
++			  &region_obj->region.handler->address_space, handler,
++ ACPI_FORMAT_NATIVE_UINT(address),
++ acpi_ut_get_region_name(region_obj->region.
++ space_id)));
+
+ if (!(handler_desc->address_space.handler_flags &
+ ACPI_ADDR_HANDLER_DEFAULT_INSTALLED)) {
+@@ -271,9 +288,7 @@ acpi_ev_address_space_dispatch(union acpi_operand_object *region_obj,
+
+ /* Call the handler */
+
+- status = handler(function,
+- (region_obj->region.address + region_offset),
+- bit_width, value, context,
++ status = handler(function, address, bit_width, value, context,
+ region_obj2->extra.region_context);
+
+ if (ACPI_FAILURE(status)) {
+diff --git a/drivers/acpi/acpica/exfield.c b/drivers/acpi/acpica/exfield.c
+index 12878e1982f7..9dabfd2acd4d 100644
+--- a/drivers/acpi/acpica/exfield.c
++++ b/drivers/acpi/acpica/exfield.c
+@@ -254,6 +254,37 @@ acpi_ex_read_data_from_field(struct acpi_walk_state * walk_state,
+ buffer = &buffer_desc->integer.value;
+ }
+
++ if ((obj_desc->common.type == ACPI_TYPE_LOCAL_REGION_FIELD) &&
++ (obj_desc->field.region_obj->region.space_id ==
++ ACPI_ADR_SPACE_GPIO)) {
++ /*
++ * For GPIO (general_purpose_io), the Address will be the bit offset
++ * from the previous Connection() operator, making it effectively a
++ * pin number index. The bit_length is the length of the field, which
++ * is thus the number of pins.
++ */
++ ACPI_DEBUG_PRINT((ACPI_DB_BFIELD,
++ "GPIO FieldRead [FROM]: Pin %u Bits %u\n",
++ obj_desc->field.pin_number_index,
++ obj_desc->field.bit_length));
++
++ /* Lock entire transaction if requested */
++
++ acpi_ex_acquire_global_lock(obj_desc->common_field.field_flags);
++
++ /* Perform the write */
++
++ status = acpi_ex_access_region(obj_desc, 0,
++ (u64 *)buffer, ACPI_READ);
++ acpi_ex_release_global_lock(obj_desc->common_field.field_flags);
++ if (ACPI_FAILURE(status)) {
++ acpi_ut_remove_reference(buffer_desc);
++ } else {
++ *ret_buffer_desc = buffer_desc;
++ }
++ return_ACPI_STATUS(status);
++ }
++
+ ACPI_DEBUG_PRINT((ACPI_DB_BFIELD,
+ "FieldRead [TO]: Obj %p, Type %X, Buf %p, ByteLen %X\n",
+ obj_desc, obj_desc->common.type, buffer,
+@@ -415,6 +446,42 @@ acpi_ex_write_data_to_field(union acpi_operand_object *source_desc,
+
+ *result_desc = buffer_desc;
+ return_ACPI_STATUS(status);
++ } else if ((obj_desc->common.type == ACPI_TYPE_LOCAL_REGION_FIELD) &&
++ (obj_desc->field.region_obj->region.space_id ==
++ ACPI_ADR_SPACE_GPIO)) {
++ /*
++ * For GPIO (general_purpose_io), we will bypass the entire field
++ * mechanism and handoff the bit address and bit width directly to
++ * the handler. The Address will be the bit offset
++ * from the previous Connection() operator, making it effectively a
++ * pin number index. The bit_length is the length of the field, which
++ * is thus the number of pins.
++ */
++ if (source_desc->common.type != ACPI_TYPE_INTEGER) {
++ return_ACPI_STATUS(AE_AML_OPERAND_TYPE);
++ }
++
++ ACPI_DEBUG_PRINT((ACPI_DB_BFIELD,
++ "GPIO FieldWrite [FROM]: (%s:%X), Val %.8X [TO]: Pin %u Bits %u\n",
++ acpi_ut_get_type_name(source_desc->common.
++ type),
++ source_desc->common.type,
++ (u32)source_desc->integer.value,
++ obj_desc->field.pin_number_index,
++ obj_desc->field.bit_length));
++
++ buffer = &source_desc->integer.value;
++
++ /* Lock entire transaction if requested */
++
++ acpi_ex_acquire_global_lock(obj_desc->common_field.field_flags);
++
++ /* Perform the write */
++
++ status = acpi_ex_access_region(obj_desc, 0,
++ (u64 *)buffer, ACPI_WRITE);
++ acpi_ex_release_global_lock(obj_desc->common_field.field_flags);
++ return_ACPI_STATUS(status);
+ }
+
+ /* Get a pointer to the data to be written */
+diff --git a/drivers/acpi/acpica/exprep.c b/drivers/acpi/acpica/exprep.c
+index ee3f872870bc..118e942005e5 100644
+--- a/drivers/acpi/acpica/exprep.c
++++ b/drivers/acpi/acpica/exprep.c
+@@ -484,6 +484,8 @@ acpi_status acpi_ex_prep_field_value(struct acpi_create_field_info *info)
+ obj_desc->field.resource_length = info->resource_length;
+ }
+
++ obj_desc->field.pin_number_index = info->pin_number_index;
++
+ /* Allow full data read from EC address space */
+
+ if ((obj_desc->field.region_obj->region.space_id ==
+diff --git a/drivers/acpi/battery.c b/drivers/acpi/battery.c
+index 130f513e08c9..bc0b286ff2ba 100644
+--- a/drivers/acpi/battery.c
++++ b/drivers/acpi/battery.c
+@@ -535,20 +535,6 @@ static int acpi_battery_get_state(struct acpi_battery *battery)
+ " invalid.\n");
+ }
+
+- /*
+- * When fully charged, some batteries wrongly report
+- * capacity_now = design_capacity instead of = full_charge_capacity
+- */
+- if (battery->capacity_now > battery->full_charge_capacity
+- && battery->full_charge_capacity != ACPI_BATTERY_VALUE_UNKNOWN) {
+- battery->capacity_now = battery->full_charge_capacity;
+- if (battery->capacity_now != battery->design_capacity)
+- printk_once(KERN_WARNING FW_BUG
+- "battery: reported current charge level (%d) "
+- "is higher than reported maximum charge level (%d).\n",
+- battery->capacity_now, battery->full_charge_capacity);
+- }
+-
+ if (test_bit(ACPI_BATTERY_QUIRK_PERCENTAGE_CAPACITY, &battery->flags)
+ && battery->capacity_now >= 0 && battery->capacity_now <= 100)
+ battery->capacity_now = (battery->capacity_now *
+diff --git a/drivers/acpi/container.c b/drivers/acpi/container.c
+index 76f7cff64594..c8ead9f97375 100644
+--- a/drivers/acpi/container.c
++++ b/drivers/acpi/container.c
+@@ -99,6 +99,13 @@ static void container_device_detach(struct acpi_device *adev)
+ device_unregister(dev);
+ }
+
++static void container_device_online(struct acpi_device *adev)
++{
++ struct device *dev = acpi_driver_data(adev);
++
++ kobject_uevent(&dev->kobj, KOBJ_ONLINE);
++}
++
+ static struct acpi_scan_handler container_handler = {
+ .ids = container_device_ids,
+ .attach = container_device_attach,
+@@ -106,6 +113,7 @@ static struct acpi_scan_handler container_handler = {
+ .hotplug = {
+ .enabled = true,
+ .demand_offline = true,
++ .notify_online = container_device_online,
+ },
+ };
+
+diff --git a/drivers/acpi/scan.c b/drivers/acpi/scan.c
+index 551f29127369..2e9ed9a4f13f 100644
+--- a/drivers/acpi/scan.c
++++ b/drivers/acpi/scan.c
+@@ -128,7 +128,7 @@ static int create_modalias(struct acpi_device *acpi_dev, char *modalias,
+ list_for_each_entry(id, &acpi_dev->pnp.ids, list) {
+ count = snprintf(&modalias[len], size, "%s:", id->id);
+ if (count < 0)
+- return EINVAL;
++ return -EINVAL;
+ if (count >= size)
+ return -ENOMEM;
+ len += count;
+@@ -2184,6 +2184,9 @@ static void acpi_bus_attach(struct acpi_device *device)
+ ok:
+ list_for_each_entry(child, &device->children, node)
+ acpi_bus_attach(child);
++
++ if (device->handler && device->handler->hotplug.notify_online)
++ device->handler->hotplug.notify_online(device);
+ }
+
+ /**
+diff --git a/drivers/acpi/video.c b/drivers/acpi/video.c
+index 4834b4cae540..f1e3496c00c7 100644
+--- a/drivers/acpi/video.c
++++ b/drivers/acpi/video.c
+@@ -675,6 +675,14 @@ static struct dmi_system_id video_dmi_table[] __initdata = {
+ DMI_MATCH(DMI_PRODUCT_VERSION, "ThinkPad T520"),
+ },
+ },
++ {
++ .callback = video_disable_native_backlight,
++ .ident = "ThinkPad X201s",
++ .matches = {
++ DMI_MATCH(DMI_SYS_VENDOR, "LENOVO"),
++ DMI_MATCH(DMI_PRODUCT_VERSION, "ThinkPad X201s"),
++ },
++ },
+
+ /* The native backlight controls do not work on some older machines */
+ {
+diff --git a/drivers/ata/ahci.c b/drivers/ata/ahci.c
+index 4cd52a4541a9..f0f8ae1197e2 100644
+--- a/drivers/ata/ahci.c
++++ b/drivers/ata/ahci.c
+@@ -305,6 +305,14 @@ static const struct pci_device_id ahci_pci_tbl[] = {
+ { PCI_VDEVICE(INTEL, 0x9c85), board_ahci }, /* Wildcat Point-LP RAID */
+ { PCI_VDEVICE(INTEL, 0x9c87), board_ahci }, /* Wildcat Point-LP RAID */
+ { PCI_VDEVICE(INTEL, 0x9c8f), board_ahci }, /* Wildcat Point-LP RAID */
++ { PCI_VDEVICE(INTEL, 0x8c82), board_ahci }, /* 9 Series AHCI */
++ { PCI_VDEVICE(INTEL, 0x8c83), board_ahci }, /* 9 Series AHCI */
++ { PCI_VDEVICE(INTEL, 0x8c84), board_ahci }, /* 9 Series RAID */
++ { PCI_VDEVICE(INTEL, 0x8c85), board_ahci }, /* 9 Series RAID */
++ { PCI_VDEVICE(INTEL, 0x8c86), board_ahci }, /* 9 Series RAID */
++ { PCI_VDEVICE(INTEL, 0x8c87), board_ahci }, /* 9 Series RAID */
++ { PCI_VDEVICE(INTEL, 0x8c8e), board_ahci }, /* 9 Series RAID */
++ { PCI_VDEVICE(INTEL, 0x8c8f), board_ahci }, /* 9 Series RAID */
+
+ /* JMicron 360/1/3/5/6, match class to avoid IDE function */
+ { PCI_VENDOR_ID_JMICRON, PCI_ANY_ID, PCI_ANY_ID, PCI_ANY_ID,
+@@ -442,6 +450,8 @@ static const struct pci_device_id ahci_pci_tbl[] = {
+ { PCI_DEVICE(PCI_VENDOR_ID_MARVELL_EXT, 0x917a),
+ .driver_data = board_ahci_yes_fbs }, /* 88se9172 */
+ { PCI_DEVICE(PCI_VENDOR_ID_MARVELL_EXT, 0x9172),
++ .driver_data = board_ahci_yes_fbs }, /* 88se9182 */
++ { PCI_DEVICE(PCI_VENDOR_ID_MARVELL_EXT, 0x9182),
+ .driver_data = board_ahci_yes_fbs }, /* 88se9172 */
+ { PCI_DEVICE(PCI_VENDOR_ID_MARVELL_EXT, 0x9192),
+ .driver_data = board_ahci_yes_fbs }, /* 88se9172 on some Gigabyte */
+diff --git a/drivers/ata/ahci_xgene.c b/drivers/ata/ahci_xgene.c
+index ee3a3659bd9e..10d524699676 100644
+--- a/drivers/ata/ahci_xgene.c
++++ b/drivers/ata/ahci_xgene.c
+@@ -337,7 +337,7 @@ static struct ata_port_operations xgene_ahci_ops = {
+ };
+
+ static const struct ata_port_info xgene_ahci_port_info = {
+- .flags = AHCI_FLAG_COMMON | ATA_FLAG_NCQ,
++ .flags = AHCI_FLAG_COMMON,
+ .pio_mask = ATA_PIO4,
+ .udma_mask = ATA_UDMA6,
+ .port_ops = &xgene_ahci_ops,
+@@ -484,7 +484,7 @@ static int xgene_ahci_probe(struct platform_device *pdev)
+ goto disable_resources;
+ }
+
+- hflags = AHCI_HFLAG_NO_PMP | AHCI_HFLAG_YES_NCQ;
++ hflags = AHCI_HFLAG_NO_PMP | AHCI_HFLAG_NO_NCQ;
+
+ rc = ahci_platform_init_host(pdev, hpriv, &xgene_ahci_port_info,
+ hflags, 0, 0);
+diff --git a/drivers/ata/ata_piix.c b/drivers/ata/ata_piix.c
+index 893e30e9a9ef..ffbe625e6fd2 100644
+--- a/drivers/ata/ata_piix.c
++++ b/drivers/ata/ata_piix.c
+@@ -340,6 +340,14 @@ static const struct pci_device_id piix_pci_tbl[] = {
+ { 0x8086, 0x0F21, PCI_ANY_ID, PCI_ANY_ID, 0, 0, ich8_2port_sata_byt },
+ /* SATA Controller IDE (Coleto Creek) */
+ { 0x8086, 0x23a6, PCI_ANY_ID, PCI_ANY_ID, 0, 0, ich8_2port_sata },
++ /* SATA Controller IDE (9 Series) */
++ { 0x8086, 0x8c88, PCI_ANY_ID, PCI_ANY_ID, 0, 0, ich8_2port_sata_snb },
++ /* SATA Controller IDE (9 Series) */
++ { 0x8086, 0x8c89, PCI_ANY_ID, PCI_ANY_ID, 0, 0, ich8_2port_sata_snb },
++ /* SATA Controller IDE (9 Series) */
++ { 0x8086, 0x8c80, PCI_ANY_ID, PCI_ANY_ID, 0, 0, ich8_sata_snb },
++ /* SATA Controller IDE (9 Series) */
++ { 0x8086, 0x8c81, PCI_ANY_ID, PCI_ANY_ID, 0, 0, ich8_sata_snb },
+
+ { } /* terminate list */
+ };
+diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
+index 677c0c1b03bd..e7f30b59bc8b 100644
+--- a/drivers/ata/libata-core.c
++++ b/drivers/ata/libata-core.c
+@@ -4227,7 +4227,7 @@ static const struct ata_blacklist_entry ata_device_blacklist [] = {
+ { "Micron_M500*", NULL, ATA_HORKAGE_NO_NCQ_TRIM, },
+ { "Crucial_CT???M500SSD*", NULL, ATA_HORKAGE_NO_NCQ_TRIM, },
+ { "Micron_M550*", NULL, ATA_HORKAGE_NO_NCQ_TRIM, },
+- { "Crucial_CT???M550SSD*", NULL, ATA_HORKAGE_NO_NCQ_TRIM, },
++ { "Crucial_CT*M550SSD*", NULL, ATA_HORKAGE_NO_NCQ_TRIM, },
+
+ /*
+ * Some WD SATA-I drives spin up and down erratically when the link
+diff --git a/drivers/ata/pata_scc.c b/drivers/ata/pata_scc.c
+index 4e006d74bef8..7f4cb76ed9fa 100644
+--- a/drivers/ata/pata_scc.c
++++ b/drivers/ata/pata_scc.c
+@@ -585,7 +585,7 @@ static int scc_wait_after_reset(struct ata_link *link, unsigned int devmask,
+ * Note: Original code is ata_bus_softreset().
+ */
+
+-static unsigned int scc_bus_softreset(struct ata_port *ap, unsigned int devmask,
++static int scc_bus_softreset(struct ata_port *ap, unsigned int devmask,
+ unsigned long deadline)
+ {
+ struct ata_ioports *ioaddr = &ap->ioaddr;
+@@ -599,9 +599,7 @@ static unsigned int scc_bus_softreset(struct ata_port *ap, unsigned int devmask,
+ udelay(20);
+ out_be32(ioaddr->ctl_addr, ap->ctl);
+
+- scc_wait_after_reset(&ap->link, devmask, deadline);
+-
+- return 0;
++ return scc_wait_after_reset(&ap->link, devmask, deadline);
+ }
+
+ /**
+@@ -618,7 +616,8 @@ static int scc_softreset(struct ata_link *link, unsigned int *classes,
+ {
+ struct ata_port *ap = link->ap;
+ unsigned int slave_possible = ap->flags & ATA_FLAG_SLAVE_POSS;
+- unsigned int devmask = 0, err_mask;
++ unsigned int devmask = 0;
++ int rc;
+ u8 err;
+
+ DPRINTK("ENTER\n");
+@@ -634,9 +633,9 @@ static int scc_softreset(struct ata_link *link, unsigned int *classes,
+
+ /* issue bus reset */
+ DPRINTK("about to softreset, devmask=%x\n", devmask);
+- err_mask = scc_bus_softreset(ap, devmask, deadline);
+- if (err_mask) {
+- ata_port_err(ap, "SRST failed (err_mask=0x%x)\n", err_mask);
++ rc = scc_bus_softreset(ap, devmask, deadline);
++ if (rc) {
++ ata_port_err(ap, "SRST failed (err_mask=0x%x)\n", rc);
+ return -EIO;
+ }
+
+diff --git a/drivers/base/regmap/internal.h b/drivers/base/regmap/internal.h
+index 7d1326985bee..bfc90b8547f2 100644
+--- a/drivers/base/regmap/internal.h
++++ b/drivers/base/regmap/internal.h
+@@ -146,6 +146,9 @@ struct regcache_ops {
+ enum regcache_type type;
+ int (*init)(struct regmap *map);
+ int (*exit)(struct regmap *map);
++#ifdef CONFIG_DEBUG_FS
++ void (*debugfs_init)(struct regmap *map);
++#endif
+ int (*read)(struct regmap *map, unsigned int reg, unsigned int *value);
+ int (*write)(struct regmap *map, unsigned int reg, unsigned int value);
+ int (*sync)(struct regmap *map, unsigned int min, unsigned int max);
+diff --git a/drivers/base/regmap/regcache-rbtree.c b/drivers/base/regmap/regcache-rbtree.c
+index 6a7e4fa12854..f3e8fe0cc650 100644
+--- a/drivers/base/regmap/regcache-rbtree.c
++++ b/drivers/base/regmap/regcache-rbtree.c
+@@ -194,10 +194,6 @@ static void rbtree_debugfs_init(struct regmap *map)
+ {
+ debugfs_create_file("rbtree", 0400, map->debugfs, map, &rbtree_fops);
+ }
+-#else
+-static void rbtree_debugfs_init(struct regmap *map)
+-{
+-}
+ #endif
+
+ static int regcache_rbtree_init(struct regmap *map)
+@@ -222,8 +218,6 @@ static int regcache_rbtree_init(struct regmap *map)
+ goto err;
+ }
+
+- rbtree_debugfs_init(map);
+-
+ return 0;
+
+ err:
+@@ -532,6 +526,9 @@ struct regcache_ops regcache_rbtree_ops = {
+ .name = "rbtree",
+ .init = regcache_rbtree_init,
+ .exit = regcache_rbtree_exit,
++#ifdef CONFIG_DEBUG_FS
++ .debugfs_init = rbtree_debugfs_init,
++#endif
+ .read = regcache_rbtree_read,
+ .write = regcache_rbtree_write,
+ .sync = regcache_rbtree_sync,
+diff --git a/drivers/base/regmap/regcache.c b/drivers/base/regmap/regcache.c
+index 29b4128da0b0..5617da6dc898 100644
+--- a/drivers/base/regmap/regcache.c
++++ b/drivers/base/regmap/regcache.c
+@@ -698,7 +698,7 @@ int regcache_sync_block(struct regmap *map, void *block,
+ unsigned int block_base, unsigned int start,
+ unsigned int end)
+ {
+- if (regmap_can_raw_write(map))
++ if (regmap_can_raw_write(map) && !map->use_single_rw)
+ return regcache_sync_block_raw(map, block, cache_present,
+ block_base, start, end);
+ else
+diff --git a/drivers/base/regmap/regmap-debugfs.c b/drivers/base/regmap/regmap-debugfs.c
+index 45d812c0ea77..65ea7b256b3e 100644
+--- a/drivers/base/regmap/regmap-debugfs.c
++++ b/drivers/base/regmap/regmap-debugfs.c
+@@ -538,6 +538,9 @@ void regmap_debugfs_init(struct regmap *map, const char *name)
+
+ next = rb_next(&range_node->node);
+ }
++
++ if (map->cache_ops && map->cache_ops->debugfs_init)
++ map->cache_ops->debugfs_init(map);
+ }
+
+ void regmap_debugfs_exit(struct regmap *map)
+diff --git a/drivers/base/regmap/regmap.c b/drivers/base/regmap/regmap.c
+index 74d8c0672cf6..283644e5d31f 100644
+--- a/drivers/base/regmap/regmap.c
++++ b/drivers/base/regmap/regmap.c
+@@ -109,7 +109,7 @@ bool regmap_readable(struct regmap *map, unsigned int reg)
+
+ bool regmap_volatile(struct regmap *map, unsigned int reg)
+ {
+- if (!regmap_readable(map, reg))
++ if (!map->format.format_write && !regmap_readable(map, reg))
+ return false;
+
+ if (map->volatile_reg)
+diff --git a/drivers/char/hw_random/core.c b/drivers/char/hw_random/core.c
+index c4419ea1ab07..2a451b14b3cc 100644
+--- a/drivers/char/hw_random/core.c
++++ b/drivers/char/hw_random/core.c
+@@ -68,12 +68,6 @@ static void add_early_randomness(struct hwrng *rng)
+ unsigned char bytes[16];
+ int bytes_read;
+
+- /*
+- * Currently only virtio-rng cannot return data during device
+- * probe, and that's handled in virtio-rng.c itself. If there
+- * are more such devices, this call to rng_get_data can be
+- * made conditional here instead of doing it per-device.
+- */
+ bytes_read = rng_get_data(rng, bytes, sizeof(bytes), 1);
+ if (bytes_read > 0)
+ add_device_randomness(bytes, bytes_read);
+diff --git a/drivers/char/hw_random/virtio-rng.c b/drivers/char/hw_random/virtio-rng.c
+index e9b15bc18b4d..f1aa13b21f74 100644
+--- a/drivers/char/hw_random/virtio-rng.c
++++ b/drivers/char/hw_random/virtio-rng.c
+@@ -36,9 +36,9 @@ struct virtrng_info {
+ bool busy;
+ char name[25];
+ int index;
++ bool hwrng_register_done;
+ };
+
+-static bool probe_done;
+
+ static void random_recv_done(struct virtqueue *vq)
+ {
+@@ -69,13 +69,6 @@ static int virtio_read(struct hwrng *rng, void *buf, size_t size, bool wait)
+ int ret;
+ struct virtrng_info *vi = (struct virtrng_info *)rng->priv;
+
+- /*
+- * Don't ask host for data till we're setup. This call can
+- * happen during hwrng_register(), after commit d9e7972619.
+- */
+- if (unlikely(!probe_done))
+- return 0;
+-
+ if (!vi->busy) {
+ vi->busy = true;
+ init_completion(&vi->have_data);
+@@ -137,25 +130,17 @@ static int probe_common(struct virtio_device *vdev)
+ return err;
+ }
+
+- err = hwrng_register(&vi->hwrng);
+- if (err) {
+- vdev->config->del_vqs(vdev);
+- vi->vq = NULL;
+- kfree(vi);
+- ida_simple_remove(&rng_index_ida, index);
+- return err;
+- }
+-
+- probe_done = true;
+ return 0;
+ }
+
+ static void remove_common(struct virtio_device *vdev)
+ {
+ struct virtrng_info *vi = vdev->priv;
++
+ vdev->config->reset(vdev);
+ vi->busy = false;
+- hwrng_unregister(&vi->hwrng);
++ if (vi->hwrng_register_done)
++ hwrng_unregister(&vi->hwrng);
+ vdev->config->del_vqs(vdev);
+ ida_simple_remove(&rng_index_ida, vi->index);
+ kfree(vi);
+@@ -171,6 +156,16 @@ static void virtrng_remove(struct virtio_device *vdev)
+ remove_common(vdev);
+ }
+
++static void virtrng_scan(struct virtio_device *vdev)
++{
++ struct virtrng_info *vi = vdev->priv;
++ int err;
++
++ err = hwrng_register(&vi->hwrng);
++ if (!err)
++ vi->hwrng_register_done = true;
++}
++
+ #ifdef CONFIG_PM_SLEEP
+ static int virtrng_freeze(struct virtio_device *vdev)
+ {
+@@ -195,6 +190,7 @@ static struct virtio_driver virtio_rng_driver = {
+ .id_table = id_table,
+ .probe = virtrng_probe,
+ .remove = virtrng_remove,
++ .scan = virtrng_scan,
+ #ifdef CONFIG_PM_SLEEP
+ .freeze = virtrng_freeze,
+ .restore = virtrng_restore,
+diff --git a/drivers/clk/clk.c b/drivers/clk/clk.c
+index 8b73edef151d..4cc83ef7ef61 100644
+--- a/drivers/clk/clk.c
++++ b/drivers/clk/clk.c
+@@ -1495,6 +1495,7 @@ static struct clk *clk_propagate_rate_change(struct clk *clk, unsigned long even
+ static void clk_change_rate(struct clk *clk)
+ {
+ struct clk *child;
++ struct hlist_node *tmp;
+ unsigned long old_rate;
+ unsigned long best_parent_rate = 0;
+ bool skip_set_rate = false;
+@@ -1530,7 +1531,11 @@ static void clk_change_rate(struct clk *clk)
+ if (clk->notifier_count && old_rate != clk->rate)
+ __clk_notify(clk, POST_RATE_CHANGE, old_rate, clk->rate);
+
+- hlist_for_each_entry(child, &clk->children, child_node) {
++ /*
++ * Use safe iteration, as change_rate can actually swap parents
++ * for certain clock types.
++ */
++ hlist_for_each_entry_safe(child, tmp, &clk->children, child_node) {
+ /* Skip children who will be reparented to another clock */
+ if (child->new_parent && child->new_parent != clk)
+ continue;
+diff --git a/drivers/clk/qcom/common.c b/drivers/clk/qcom/common.c
+index 9b5a1cfc6b91..eeb3eea01f4c 100644
+--- a/drivers/clk/qcom/common.c
++++ b/drivers/clk/qcom/common.c
+@@ -27,30 +27,35 @@ struct qcom_cc {
+ struct clk *clks[];
+ };
+
+-int qcom_cc_probe(struct platform_device *pdev, const struct qcom_cc_desc *desc)
++struct regmap *
++qcom_cc_map(struct platform_device *pdev, const struct qcom_cc_desc *desc)
+ {
+ void __iomem *base;
+ struct resource *res;
++ struct device *dev = &pdev->dev;
++
++ res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
++ base = devm_ioremap_resource(dev, res);
++ if (IS_ERR(base))
++ return ERR_CAST(base);
++
++ return devm_regmap_init_mmio(dev, base, desc->config);
++}
++EXPORT_SYMBOL_GPL(qcom_cc_map);
++
++int qcom_cc_really_probe(struct platform_device *pdev,
++ const struct qcom_cc_desc *desc, struct regmap *regmap)
++{
+ int i, ret;
+ struct device *dev = &pdev->dev;
+ struct clk *clk;
+ struct clk_onecell_data *data;
+ struct clk **clks;
+- struct regmap *regmap;
+ struct qcom_reset_controller *reset;
+ struct qcom_cc *cc;
+ size_t num_clks = desc->num_clks;
+ struct clk_regmap **rclks = desc->clks;
+
+- res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
+- base = devm_ioremap_resource(dev, res);
+- if (IS_ERR(base))
+- return PTR_ERR(base);
+-
+- regmap = devm_regmap_init_mmio(dev, base, desc->config);
+- if (IS_ERR(regmap))
+- return PTR_ERR(regmap);
+-
+ cc = devm_kzalloc(dev, sizeof(*cc) + sizeof(*clks) * num_clks,
+ GFP_KERNEL);
+ if (!cc)
+@@ -91,6 +96,18 @@ int qcom_cc_probe(struct platform_device *pdev, const struct qcom_cc_desc *desc)
+
+ return ret;
+ }
++EXPORT_SYMBOL_GPL(qcom_cc_really_probe);
++
++int qcom_cc_probe(struct platform_device *pdev, const struct qcom_cc_desc *desc)
++{
++ struct regmap *regmap;
++
++ regmap = qcom_cc_map(pdev, desc);
++ if (IS_ERR(regmap))
++ return PTR_ERR(regmap);
++
++ return qcom_cc_really_probe(pdev, desc, regmap);
++}
+ EXPORT_SYMBOL_GPL(qcom_cc_probe);
+
+ void qcom_cc_remove(struct platform_device *pdev)
+diff --git a/drivers/clk/qcom/common.h b/drivers/clk/qcom/common.h
+index 2c3cfc860348..2765e9d3da97 100644
+--- a/drivers/clk/qcom/common.h
++++ b/drivers/clk/qcom/common.h
+@@ -17,6 +17,7 @@ struct platform_device;
+ struct regmap_config;
+ struct clk_regmap;
+ struct qcom_reset_map;
++struct regmap;
+
+ struct qcom_cc_desc {
+ const struct regmap_config *config;
+@@ -26,6 +27,11 @@ struct qcom_cc_desc {
+ size_t num_resets;
+ };
+
++extern struct regmap *qcom_cc_map(struct platform_device *pdev,
++ const struct qcom_cc_desc *desc);
++extern int qcom_cc_really_probe(struct platform_device *pdev,
++ const struct qcom_cc_desc *desc,
++ struct regmap *regmap);
+ extern int qcom_cc_probe(struct platform_device *pdev,
+ const struct qcom_cc_desc *desc);
+
+diff --git a/drivers/clk/qcom/mmcc-msm8960.c b/drivers/clk/qcom/mmcc-msm8960.c
+index 4c449b3170f6..9bf6d925dd1a 100644
+--- a/drivers/clk/qcom/mmcc-msm8960.c
++++ b/drivers/clk/qcom/mmcc-msm8960.c
+@@ -38,6 +38,8 @@
+ #define P_PLL2 2
+ #define P_PLL3 3
+
++#define F_MN(f, s, _m, _n) { .freq = f, .src = s, .m = _m, .n = _n }
++
+ static u8 mmcc_pxo_pll8_pll2_map[] = {
+ [P_PXO] = 0,
+ [P_PLL8] = 2,
+@@ -59,8 +61,8 @@ static u8 mmcc_pxo_pll8_pll2_pll3_map[] = {
+
+ static const char *mmcc_pxo_pll8_pll2_pll3[] = {
+ "pxo",
+- "pll2",
+ "pll8_vote",
++ "pll2",
+ "pll3",
+ };
+
+@@ -710,18 +712,18 @@ static struct clk_branch csiphy2_timer_clk = {
+ };
+
+ static struct freq_tbl clk_tbl_gfx2d[] = {
+- { 27000000, P_PXO, 1, 0 },
+- { 48000000, P_PLL8, 1, 8 },
+- { 54857000, P_PLL8, 1, 7 },
+- { 64000000, P_PLL8, 1, 6 },
+- { 76800000, P_PLL8, 1, 5 },
+- { 96000000, P_PLL8, 1, 4 },
+- { 128000000, P_PLL8, 1, 3 },
+- { 145455000, P_PLL2, 2, 11 },
+- { 160000000, P_PLL2, 1, 5 },
+- { 177778000, P_PLL2, 2, 9 },
+- { 200000000, P_PLL2, 1, 4 },
+- { 228571000, P_PLL2, 2, 7 },
++ F_MN( 27000000, P_PXO, 1, 0),
++ F_MN( 48000000, P_PLL8, 1, 8),
++ F_MN( 54857000, P_PLL8, 1, 7),
++ F_MN( 64000000, P_PLL8, 1, 6),
++ F_MN( 76800000, P_PLL8, 1, 5),
++ F_MN( 96000000, P_PLL8, 1, 4),
++ F_MN(128000000, P_PLL8, 1, 3),
++ F_MN(145455000, P_PLL2, 2, 11),
++ F_MN(160000000, P_PLL2, 1, 5),
++ F_MN(177778000, P_PLL2, 2, 9),
++ F_MN(200000000, P_PLL2, 1, 4),
++ F_MN(228571000, P_PLL2, 2, 7),
+ { }
+ };
+
+@@ -842,22 +844,22 @@ static struct clk_branch gfx2d1_clk = {
+ };
+
+ static struct freq_tbl clk_tbl_gfx3d[] = {
+- { 27000000, P_PXO, 1, 0 },
+- { 48000000, P_PLL8, 1, 8 },
+- { 54857000, P_PLL8, 1, 7 },
+- { 64000000, P_PLL8, 1, 6 },
+- { 76800000, P_PLL8, 1, 5 },
+- { 96000000, P_PLL8, 1, 4 },
+- { 128000000, P_PLL8, 1, 3 },
+- { 145455000, P_PLL2, 2, 11 },
+- { 160000000, P_PLL2, 1, 5 },
+- { 177778000, P_PLL2, 2, 9 },
+- { 200000000, P_PLL2, 1, 4 },
+- { 228571000, P_PLL2, 2, 7 },
+- { 266667000, P_PLL2, 1, 3 },
+- { 300000000, P_PLL3, 1, 4 },
+- { 320000000, P_PLL2, 2, 5 },
+- { 400000000, P_PLL2, 1, 2 },
++ F_MN( 27000000, P_PXO, 1, 0),
++ F_MN( 48000000, P_PLL8, 1, 8),
++ F_MN( 54857000, P_PLL8, 1, 7),
++ F_MN( 64000000, P_PLL8, 1, 6),
++ F_MN( 76800000, P_PLL8, 1, 5),
++ F_MN( 96000000, P_PLL8, 1, 4),
++ F_MN(128000000, P_PLL8, 1, 3),
++ F_MN(145455000, P_PLL2, 2, 11),
++ F_MN(160000000, P_PLL2, 1, 5),
++ F_MN(177778000, P_PLL2, 2, 9),
++ F_MN(200000000, P_PLL2, 1, 4),
++ F_MN(228571000, P_PLL2, 2, 7),
++ F_MN(266667000, P_PLL2, 1, 3),
++ F_MN(300000000, P_PLL3, 1, 4),
++ F_MN(320000000, P_PLL2, 2, 5),
++ F_MN(400000000, P_PLL2, 1, 2),
+ { }
+ };
+
+@@ -897,7 +899,7 @@ static struct clk_dyn_rcg gfx3d_src = {
+ .hw.init = &(struct clk_init_data){
+ .name = "gfx3d_src",
+ .parent_names = mmcc_pxo_pll8_pll2_pll3,
+- .num_parents = 3,
++ .num_parents = 4,
+ .ops = &clk_dyn_rcg_ops,
+ },
+ },
+@@ -995,7 +997,7 @@ static struct clk_rcg jpegd_src = {
+ .ns_reg = 0x00ac,
+ .p = {
+ .pre_div_shift = 12,
+- .pre_div_width = 2,
++ .pre_div_width = 4,
+ },
+ .s = {
+ .src_sel_shift = 0,
+@@ -1115,7 +1117,7 @@ static struct clk_branch mdp_lut_clk = {
+ .enable_reg = 0x016c,
+ .enable_mask = BIT(0),
+ .hw.init = &(struct clk_init_data){
+- .parent_names = (const char *[]){ "mdp_clk" },
++ .parent_names = (const char *[]){ "mdp_src" },
+ .num_parents = 1,
+ .name = "mdp_lut_clk",
+ .ops = &clk_branch_ops,
+@@ -1342,15 +1344,15 @@ static struct clk_branch hdmi_app_clk = {
+ };
+
+ static struct freq_tbl clk_tbl_vcodec[] = {
+- { 27000000, P_PXO, 1, 0 },
+- { 32000000, P_PLL8, 1, 12 },
+- { 48000000, P_PLL8, 1, 8 },
+- { 54860000, P_PLL8, 1, 7 },
+- { 96000000, P_PLL8, 1, 4 },
+- { 133330000, P_PLL2, 1, 6 },
+- { 200000000, P_PLL2, 1, 4 },
+- { 228570000, P_PLL2, 2, 7 },
+- { 266670000, P_PLL2, 1, 3 },
++ F_MN( 27000000, P_PXO, 1, 0),
++ F_MN( 32000000, P_PLL8, 1, 12),
++ F_MN( 48000000, P_PLL8, 1, 8),
++ F_MN( 54860000, P_PLL8, 1, 7),
++ F_MN( 96000000, P_PLL8, 1, 4),
++ F_MN(133330000, P_PLL2, 1, 6),
++ F_MN(200000000, P_PLL2, 1, 4),
++ F_MN(228570000, P_PLL2, 2, 7),
++ F_MN(266670000, P_PLL2, 1, 3),
+ { }
+ };
+
+diff --git a/drivers/clk/qcom/mmcc-msm8974.c b/drivers/clk/qcom/mmcc-msm8974.c
+index c65b90515872..bc8f519c47aa 100644
+--- a/drivers/clk/qcom/mmcc-msm8974.c
++++ b/drivers/clk/qcom/mmcc-msm8974.c
+@@ -2547,18 +2547,16 @@ MODULE_DEVICE_TABLE(of, mmcc_msm8974_match_table);
+
+ static int mmcc_msm8974_probe(struct platform_device *pdev)
+ {
+- int ret;
+ struct regmap *regmap;
+
+- ret = qcom_cc_probe(pdev, &mmcc_msm8974_desc);
+- if (ret)
+- return ret;
++ regmap = qcom_cc_map(pdev, &mmcc_msm8974_desc);
++ if (IS_ERR(regmap))
++ return PTR_ERR(regmap);
+
+- regmap = dev_get_regmap(&pdev->dev, NULL);
+ clk_pll_configure_sr_hpm_lp(&mmpll1, regmap, &mmpll1_config, true);
+ clk_pll_configure_sr_hpm_lp(&mmpll3, regmap, &mmpll3_config, false);
+
+- return 0;
++ return qcom_cc_really_probe(pdev, &mmcc_msm8974_desc, regmap);
+ }
+
+ static int mmcc_msm8974_remove(struct platform_device *pdev)
+diff --git a/drivers/clk/ti/clk-dra7-atl.c b/drivers/clk/ti/clk-dra7-atl.c
+index 4a65b410e4d5..af29359677da 100644
+--- a/drivers/clk/ti/clk-dra7-atl.c
++++ b/drivers/clk/ti/clk-dra7-atl.c
+@@ -139,9 +139,13 @@ static long atl_clk_round_rate(struct clk_hw *hw, unsigned long rate,
+ static int atl_clk_set_rate(struct clk_hw *hw, unsigned long rate,
+ unsigned long parent_rate)
+ {
+- struct dra7_atl_desc *cdesc = to_atl_desc(hw);
++ struct dra7_atl_desc *cdesc;
+ u32 divider;
+
++ if (!hw || !rate)
++ return -EINVAL;
++
++ cdesc = to_atl_desc(hw);
+ divider = ((parent_rate + rate / 2) / rate) - 1;
+ if (divider > DRA7_ATL_DIVIDER_MASK)
+ divider = DRA7_ATL_DIVIDER_MASK;
+diff --git a/drivers/clk/ti/divider.c b/drivers/clk/ti/divider.c
+index e6aa10db7bba..a837f703be65 100644
+--- a/drivers/clk/ti/divider.c
++++ b/drivers/clk/ti/divider.c
+@@ -211,11 +211,16 @@ static long ti_clk_divider_round_rate(struct clk_hw *hw, unsigned long rate,
+ static int ti_clk_divider_set_rate(struct clk_hw *hw, unsigned long rate,
+ unsigned long parent_rate)
+ {
+- struct clk_divider *divider = to_clk_divider(hw);
++ struct clk_divider *divider;
+ unsigned int div, value;
+ unsigned long flags = 0;
+ u32 val;
+
++ if (!hw || !rate)
++ return -EINVAL;
++
++ divider = to_clk_divider(hw);
++
+ div = DIV_ROUND_UP(parent_rate, rate);
+ value = _get_val(divider, div);
+
+diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
+index 6f024852c6fb..21ab8bcd4d20 100644
+--- a/drivers/cpufreq/cpufreq.c
++++ b/drivers/cpufreq/cpufreq.c
+@@ -1279,6 +1279,8 @@ err_get_freq:
+ per_cpu(cpufreq_cpu_data, j) = NULL;
+ write_unlock_irqrestore(&cpufreq_driver_lock, flags);
+
++ up_write(&policy->rwsem);
++
+ if (cpufreq_driver->exit)
+ cpufreq_driver->exit(policy);
+ err_set_policy_cpu:
+@@ -1665,7 +1667,7 @@ void cpufreq_suspend(void)
+ return;
+
+ if (!has_target())
+- return;
++ goto suspend;
+
+ pr_debug("%s: Suspending Governors\n", __func__);
+
+@@ -1679,6 +1681,7 @@ void cpufreq_suspend(void)
+ policy);
+ }
+
++suspend:
+ cpufreq_suspended = true;
+ }
+
+@@ -1695,13 +1698,13 @@ void cpufreq_resume(void)
+ if (!cpufreq_driver)
+ return;
+
++ cpufreq_suspended = false;
++
+ if (!has_target())
+ return;
+
+ pr_debug("%s: Resuming Governors\n", __func__);
+
+- cpufreq_suspended = false;
+-
+ list_for_each_entry(policy, &cpufreq_policy_list, policy_list) {
+ if (cpufreq_driver->resume && cpufreq_driver->resume(policy))
+ pr_err("%s: Failed to resume driver: %p\n", __func__,
+diff --git a/drivers/cpufreq/cpufreq_opp.c b/drivers/cpufreq/cpufreq_opp.c
+index c0c6f4a4eccf..f7a32d2326c6 100644
+--- a/drivers/cpufreq/cpufreq_opp.c
++++ b/drivers/cpufreq/cpufreq_opp.c
+@@ -60,7 +60,7 @@ int dev_pm_opp_init_cpufreq_table(struct device *dev,
+ goto out;
+ }
+
+- freq_table = kzalloc(sizeof(*freq_table) * (max_opps + 1), GFP_KERNEL);
++ freq_table = kcalloc(sizeof(*freq_table), (max_opps + 1), GFP_ATOMIC);
+ if (!freq_table) {
+ ret = -ENOMEM;
+ goto out;
+diff --git a/drivers/crypto/ccp/ccp-crypto-main.c b/drivers/crypto/ccp/ccp-crypto-main.c
+index 20dc848481e7..4d4e016d755b 100644
+--- a/drivers/crypto/ccp/ccp-crypto-main.c
++++ b/drivers/crypto/ccp/ccp-crypto-main.c
+@@ -367,6 +367,10 @@ static int ccp_crypto_init(void)
+ {
+ int ret;
+
++ ret = ccp_present();
++ if (ret)
++ return ret;
++
+ spin_lock_init(&req_queue_lock);
+ INIT_LIST_HEAD(&req_queue.cmds);
+ req_queue.backlog = &req_queue.cmds;
+diff --git a/drivers/crypto/ccp/ccp-dev.c b/drivers/crypto/ccp/ccp-dev.c
+index 2c7816149b01..c08151eb54c1 100644
+--- a/drivers/crypto/ccp/ccp-dev.c
++++ b/drivers/crypto/ccp/ccp-dev.c
+@@ -53,6 +53,20 @@ static inline void ccp_del_device(struct ccp_device *ccp)
+ }
+
+ /**
++ * ccp_present - check if a CCP device is present
++ *
++ * Returns zero if a CCP device is present, -ENODEV otherwise.
++ */
++int ccp_present(void)
++{
++ if (ccp_get_device())
++ return 0;
++
++ return -ENODEV;
++}
++EXPORT_SYMBOL_GPL(ccp_present);
++
++/**
+ * ccp_enqueue_cmd - queue an operation for processing by the CCP
+ *
+ * @cmd: ccp_cmd struct to be processed
+diff --git a/drivers/dma/TODO b/drivers/dma/TODO
+index 734ed0206cd5..b8045cd42ee1 100644
+--- a/drivers/dma/TODO
++++ b/drivers/dma/TODO
+@@ -7,7 +7,6 @@ TODO for slave dma
+ - imx-dma
+ - imx-sdma
+ - mxs-dma.c
+- - dw_dmac
+ - intel_mid_dma
+ 4. Check other subsystems for dma drivers and merge/move to dmaengine
+ 5. Remove dma_slave_config's dma direction.
+diff --git a/drivers/dma/dw/core.c b/drivers/dma/dw/core.c
+index a27ded53ab4f..525b4654bd90 100644
+--- a/drivers/dma/dw/core.c
++++ b/drivers/dma/dw/core.c
+@@ -279,6 +279,15 @@ static void dwc_dostart(struct dw_dma_chan *dwc, struct dw_desc *first)
+ channel_set_bit(dw, CH_EN, dwc->mask);
+ }
+
++static void dwc_dostart_first_queued(struct dw_dma_chan *dwc)
++{
++ if (list_empty(&dwc->queue))
++ return;
++
++ list_move(dwc->queue.next, &dwc->active_list);
++ dwc_dostart(dwc, dwc_first_active(dwc));
++}
++
+ /*----------------------------------------------------------------------*/
+
+ static void
+@@ -335,10 +344,7 @@ static void dwc_complete_all(struct dw_dma *dw, struct dw_dma_chan *dwc)
+ * the completed ones.
+ */
+ list_splice_init(&dwc->active_list, &list);
+- if (!list_empty(&dwc->queue)) {
+- list_move(dwc->queue.next, &dwc->active_list);
+- dwc_dostart(dwc, dwc_first_active(dwc));
+- }
++ dwc_dostart_first_queued(dwc);
+
+ spin_unlock_irqrestore(&dwc->lock, flags);
+
+@@ -467,10 +473,7 @@ static void dwc_scan_descriptors(struct dw_dma *dw, struct dw_dma_chan *dwc)
+ /* Try to continue after resetting the channel... */
+ dwc_chan_disable(dw, dwc);
+
+- if (!list_empty(&dwc->queue)) {
+- list_move(dwc->queue.next, &dwc->active_list);
+- dwc_dostart(dwc, dwc_first_active(dwc));
+- }
++ dwc_dostart_first_queued(dwc);
+ spin_unlock_irqrestore(&dwc->lock, flags);
+ }
+
+@@ -677,17 +680,9 @@ static dma_cookie_t dwc_tx_submit(struct dma_async_tx_descriptor *tx)
+ * possible, perhaps even appending to those already submitted
+ * for DMA. But this is hard to do in a race-free manner.
+ */
+- if (list_empty(&dwc->active_list)) {
+- dev_vdbg(chan2dev(tx->chan), "%s: started %u\n", __func__,
+- desc->txd.cookie);
+- list_add_tail(&desc->desc_node, &dwc->active_list);
+- dwc_dostart(dwc, dwc_first_active(dwc));
+- } else {
+- dev_vdbg(chan2dev(tx->chan), "%s: queued %u\n", __func__,
+- desc->txd.cookie);
+
+- list_add_tail(&desc->desc_node, &dwc->queue);
+- }
++ dev_vdbg(chan2dev(tx->chan), "%s: queued %u\n", __func__, desc->txd.cookie);
++ list_add_tail(&desc->desc_node, &dwc->queue);
+
+ spin_unlock_irqrestore(&dwc->lock, flags);
+
+@@ -1092,9 +1087,12 @@ dwc_tx_status(struct dma_chan *chan,
+ static void dwc_issue_pending(struct dma_chan *chan)
+ {
+ struct dw_dma_chan *dwc = to_dw_dma_chan(chan);
++ unsigned long flags;
+
+- if (!list_empty(&dwc->queue))
+- dwc_scan_descriptors(to_dw_dma(chan->device), dwc);
++ spin_lock_irqsave(&dwc->lock, flags);
++ if (list_empty(&dwc->active_list))
++ dwc_dostart_first_queued(dwc);
++ spin_unlock_irqrestore(&dwc->lock, flags);
+ }
+
+ static int dwc_alloc_chan_resources(struct dma_chan *chan)
+diff --git a/drivers/gpio/gpiolib-acpi.c b/drivers/gpio/gpiolib-acpi.c
+index 4a987917c186..86608585ec00 100644
+--- a/drivers/gpio/gpiolib-acpi.c
++++ b/drivers/gpio/gpiolib-acpi.c
+@@ -357,8 +357,10 @@ acpi_gpio_adr_space_handler(u32 function, acpi_physical_address address,
+ struct gpio_chip *chip = achip->chip;
+ struct acpi_resource_gpio *agpio;
+ struct acpi_resource *ares;
++ int pin_index = (int)address;
+ acpi_status status;
+ bool pull_up;
++ int length;
+ int i;
+
+ status = acpi_buffer_to_resource(achip->conn_info.connection,
+@@ -380,7 +382,8 @@ acpi_gpio_adr_space_handler(u32 function, acpi_physical_address address,
+ return AE_BAD_PARAMETER;
+ }
+
+- for (i = 0; i < agpio->pin_table_length; i++) {
++ length = min(agpio->pin_table_length, (u16)(pin_index + bits));
++ for (i = pin_index; i < length; ++i) {
+ unsigned pin = agpio->pin_table[i];
+ struct acpi_gpio_connection *conn;
+ struct gpio_desc *desc;
+diff --git a/drivers/gpio/gpiolib.c b/drivers/gpio/gpiolib.c
+index 2ebc9071e354..810c84fd00c4 100644
+--- a/drivers/gpio/gpiolib.c
++++ b/drivers/gpio/gpiolib.c
+@@ -1368,12 +1368,12 @@ void gpiochip_set_chained_irqchip(struct gpio_chip *gpiochip,
+ return;
+ }
+
+- irq_set_chained_handler(parent_irq, parent_handler);
+ /*
+ * The parent irqchip is already using the chip_data for this
+ * irqchip, so our callbacks simply use the handler_data.
+ */
+ irq_set_handler_data(parent_irq, gpiochip);
++ irq_set_chained_handler(parent_irq, parent_handler);
+ }
+ EXPORT_SYMBOL_GPL(gpiochip_set_chained_irqchip);
+
+diff --git a/drivers/gpu/drm/ast/ast_main.c b/drivers/gpu/drm/ast/ast_main.c
+index a2cc6be97983..b792194e0d9c 100644
+--- a/drivers/gpu/drm/ast/ast_main.c
++++ b/drivers/gpu/drm/ast/ast_main.c
+@@ -67,6 +67,7 @@ static int ast_detect_chip(struct drm_device *dev)
+ {
+ struct ast_private *ast = dev->dev_private;
+ uint32_t data, jreg;
++ ast_open_key(ast);
+
+ if (dev->pdev->device == PCI_CHIP_AST1180) {
+ ast->chip = AST1100;
+@@ -104,7 +105,7 @@ static int ast_detect_chip(struct drm_device *dev)
+ }
+ ast->vga2_clone = false;
+ } else {
+- ast->chip = 2000;
++ ast->chip = AST2000;
+ DRM_INFO("AST 2000 detected\n");
+ }
+ }
+diff --git a/drivers/gpu/drm/i915/i915_cmd_parser.c b/drivers/gpu/drm/i915/i915_cmd_parser.c
+index 9d7954366bd2..fa9764a2e080 100644
+--- a/drivers/gpu/drm/i915/i915_cmd_parser.c
++++ b/drivers/gpu/drm/i915/i915_cmd_parser.c
+@@ -706,11 +706,13 @@ int i915_cmd_parser_init_ring(struct intel_engine_cs *ring)
+ BUG_ON(!validate_cmds_sorted(ring, cmd_tables, cmd_table_count));
+ BUG_ON(!validate_regs_sorted(ring));
+
+- ret = init_hash_table(ring, cmd_tables, cmd_table_count);
+- if (ret) {
+- DRM_ERROR("CMD: cmd_parser_init failed!\n");
+- fini_hash_table(ring);
+- return ret;
++ if (hash_empty(ring->cmd_hash)) {
++ ret = init_hash_table(ring, cmd_tables, cmd_table_count);
++ if (ret) {
++ DRM_ERROR("CMD: cmd_parser_init failed!\n");
++ fini_hash_table(ring);
++ return ret;
++ }
+ }
+
+ ring->needs_cmd_parser = true;
+diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
+index d893e4da5dce..ef3b4798da02 100644
+--- a/drivers/gpu/drm/i915/i915_gem.c
++++ b/drivers/gpu/drm/i915/i915_gem.c
+@@ -1576,10 +1576,13 @@ unlock:
+ out:
+ switch (ret) {
+ case -EIO:
+- /* If this -EIO is due to a gpu hang, give the reset code a
+- * chance to clean up the mess. Otherwise return the proper
+- * SIGBUS. */
+- if (i915_terminally_wedged(&dev_priv->gpu_error)) {
++ /*
++ * We eat errors when the gpu is terminally wedged to avoid
++ * userspace unduly crashing (gl has no provisions for mmaps to
++ * fail). But any other -EIO isn't ours (e.g. swap in failure)
++ * and so needs to be reported.
++ */
++ if (!i915_terminally_wedged(&dev_priv->gpu_error)) {
+ ret = VM_FAULT_SIGBUS;
+ break;
+ }
+diff --git a/drivers/gpu/drm/i915/intel_bios.c b/drivers/gpu/drm/i915/intel_bios.c
+index 827498e081df..2e0a2feb4cda 100644
+--- a/drivers/gpu/drm/i915/intel_bios.c
++++ b/drivers/gpu/drm/i915/intel_bios.c
+@@ -877,7 +877,7 @@ err:
+
+ /* error during parsing so set all pointers to null
+ * because of partial parsing */
+- memset(dev_priv->vbt.dsi.sequence, 0, MIPI_SEQ_MAX);
++ memset(dev_priv->vbt.dsi.sequence, 0, sizeof(dev_priv->vbt.dsi.sequence));
+ }
+
+ static void parse_ddi_port(struct drm_i915_private *dev_priv, enum port port,
+@@ -1122,7 +1122,7 @@ init_vbt_defaults(struct drm_i915_private *dev_priv)
+ }
+ }
+
+-static int __init intel_no_opregion_vbt_callback(const struct dmi_system_id *id)
++static int intel_no_opregion_vbt_callback(const struct dmi_system_id *id)
+ {
+ DRM_DEBUG_KMS("Falling back to manually reading VBT from "
+ "VBIOS ROM for %s\n",
+diff --git a/drivers/gpu/drm/i915/intel_crt.c b/drivers/gpu/drm/i915/intel_crt.c
+index 5a045d3bd77e..3e1edbfa8e07 100644
+--- a/drivers/gpu/drm/i915/intel_crt.c
++++ b/drivers/gpu/drm/i915/intel_crt.c
+@@ -673,16 +673,21 @@ intel_crt_detect(struct drm_connector *connector, bool force)
+ goto out;
+ }
+
++ drm_modeset_acquire_init(&ctx, 0);
++
+ /* for pre-945g platforms use load detect */
+ if (intel_get_load_detect_pipe(connector, NULL, &tmp, &ctx)) {
+ if (intel_crt_detect_ddc(connector))
+ status = connector_status_connected;
+ else
+ status = intel_crt_load_detect(crt);
+- intel_release_load_detect_pipe(connector, &tmp, &ctx);
++ intel_release_load_detect_pipe(connector, &tmp);
+ } else
+ status = connector_status_unknown;
+
++ drm_modeset_drop_locks(&ctx);
++ drm_modeset_acquire_fini(&ctx);
++
+ out:
+ intel_display_power_put(dev_priv, power_domain);
+ intel_runtime_pm_put(dev_priv);
+@@ -775,7 +780,7 @@ static const struct drm_encoder_funcs intel_crt_enc_funcs = {
+ .destroy = intel_encoder_destroy,
+ };
+
+-static int __init intel_no_crt_dmi_callback(const struct dmi_system_id *id)
++static int intel_no_crt_dmi_callback(const struct dmi_system_id *id)
+ {
+ DRM_INFO("Skipping CRT initialization for %s\n", id->ident);
+ return 1;
+diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c
+index f0be855ddf45..ffaf8be939f1 100644
+--- a/drivers/gpu/drm/i915/intel_display.c
++++ b/drivers/gpu/drm/i915/intel_display.c
+@@ -2200,6 +2200,15 @@ intel_pin_and_fence_fb_obj(struct drm_device *dev,
+ if (need_vtd_wa(dev) && alignment < 256 * 1024)
+ alignment = 256 * 1024;
+
++ /*
++ * Global gtt pte registers are special registers which actually forward
++ * writes to a chunk of system memory. Which means that there is no risk
++ * that the register values disappear as soon as we call
++ * intel_runtime_pm_put(), so it is correct to wrap only the
++ * pin/unpin/fence and not more.
++ */
++ intel_runtime_pm_get(dev_priv);
++
+ dev_priv->mm.interruptible = false;
+ ret = i915_gem_object_pin_to_display_plane(obj, alignment, pipelined);
+ if (ret)
+@@ -2217,12 +2226,14 @@ intel_pin_and_fence_fb_obj(struct drm_device *dev,
+ i915_gem_object_pin_fence(obj);
+
+ dev_priv->mm.interruptible = true;
++ intel_runtime_pm_put(dev_priv);
+ return 0;
+
+ err_unpin:
+ i915_gem_object_unpin_from_display_plane(obj);
+ err_interruptible:
+ dev_priv->mm.interruptible = true;
++ intel_runtime_pm_put(dev_priv);
+ return ret;
+ }
+
+@@ -8087,6 +8098,15 @@ static int intel_crtc_cursor_set(struct drm_crtc *crtc,
+ goto fail_locked;
+ }
+
++ /*
++ * Global gtt pte registers are special registers which actually
++ * forward writes to a chunk of system memory. Which means that
++ * there is no risk that the register values disappear as soon
++ * as we call intel_runtime_pm_put(), so it is correct to wrap
++ * only the pin/unpin/fence and not more.
++ */
++ intel_runtime_pm_get(dev_priv);
++
+ /* Note that the w/a also requires 2 PTE of padding following
+ * the bo. We currently fill all unused PTE with the shadow
+ * page and so we should always have valid PTE following the
+@@ -8099,16 +8119,20 @@ static int intel_crtc_cursor_set(struct drm_crtc *crtc,
+ ret = i915_gem_object_pin_to_display_plane(obj, alignment, NULL);
+ if (ret) {
+ DRM_DEBUG_KMS("failed to move cursor bo into the GTT\n");
++ intel_runtime_pm_put(dev_priv);
+ goto fail_locked;
+ }
+
+ ret = i915_gem_object_put_fence(obj);
+ if (ret) {
+ DRM_DEBUG_KMS("failed to release fence for cursor");
++ intel_runtime_pm_put(dev_priv);
+ goto fail_unpin;
+ }
+
+ addr = i915_gem_obj_ggtt_offset(obj);
++
++ intel_runtime_pm_put(dev_priv);
+ } else {
+ int align = IS_I830(dev) ? 16 * 1024 : 256;
+ ret = i915_gem_object_attach_phys(obj, align);
+@@ -8319,8 +8343,6 @@ bool intel_get_load_detect_pipe(struct drm_connector *connector,
+ connector->base.id, connector->name,
+ encoder->base.id, encoder->name);
+
+- drm_modeset_acquire_init(ctx, 0);
+-
+ retry:
+ ret = drm_modeset_lock(&config->connection_mutex, ctx);
+ if (ret)
+@@ -8359,10 +8381,14 @@ retry:
+ i++;
+ if (!(encoder->possible_crtcs & (1 << i)))
+ continue;
+- if (!possible_crtc->enabled) {
+- crtc = possible_crtc;
+- break;
+- }
++ if (possible_crtc->enabled)
++ continue;
++ /* This can occur when applying the pipe A quirk on resume. */
++ if (to_intel_crtc(possible_crtc)->new_enabled)
++ continue;
++
++ crtc = possible_crtc;
++ break;
+ }
+
+ /*
+@@ -8431,15 +8457,11 @@ fail_unlock:
+ goto retry;
+ }
+
+- drm_modeset_drop_locks(ctx);
+- drm_modeset_acquire_fini(ctx);
+-
+ return false;
+ }
+
+ void intel_release_load_detect_pipe(struct drm_connector *connector,
+- struct intel_load_detect_pipe *old,
+- struct drm_modeset_acquire_ctx *ctx)
++ struct intel_load_detect_pipe *old)
+ {
+ struct intel_encoder *intel_encoder =
+ intel_attached_encoder(connector);
+@@ -8463,17 +8485,12 @@ void intel_release_load_detect_pipe(struct drm_connector *connector,
+ drm_framebuffer_unreference(old->release_fb);
+ }
+
+- goto unlock;
+ return;
+ }
+
+ /* Switch crtc and encoder back off if necessary */
+ if (old->dpms_mode != DRM_MODE_DPMS_ON)
+ connector->funcs->dpms(connector, old->dpms_mode);
+-
+-unlock:
+- drm_modeset_drop_locks(ctx);
+- drm_modeset_acquire_fini(ctx);
+ }
+
+ static int i9xx_pll_refclk(struct drm_device *dev,
+@@ -9294,6 +9311,8 @@ static int intel_crtc_page_flip(struct drm_crtc *crtc,
+
+ if (IS_VALLEYVIEW(dev)) {
+ ring = &dev_priv->ring[BCS];
++ } else if (IS_IVYBRIDGE(dev)) {
++ ring = &dev_priv->ring[BCS];
+ } else if (INTEL_INFO(dev)->gen >= 7) {
+ ring = obj->ring;
+ if (ring == NULL || ring->id != RCS)
+@@ -11671,6 +11690,9 @@ static struct intel_quirk intel_quirks[] = {
+ /* Acer C720 and C720P Chromebooks (Celeron 2955U) have backlights */
+ { 0x0a06, 0x1025, 0x0a11, quirk_backlight_present },
+
++ /* Acer C720 Chromebook (Core i3 4005U) */
++ { 0x0a16, 0x1025, 0x0a11, quirk_backlight_present },
++
+ /* Toshiba CB35 Chromebook (Celeron 2955U) */
+ { 0x0a06, 0x1179, 0x0a88, quirk_backlight_present },
+
+@@ -11840,7 +11862,7 @@ static void intel_enable_pipe_a(struct drm_device *dev)
+ struct intel_connector *connector;
+ struct drm_connector *crt = NULL;
+ struct intel_load_detect_pipe load_detect_temp;
+- struct drm_modeset_acquire_ctx ctx;
++ struct drm_modeset_acquire_ctx *ctx = dev->mode_config.acquire_ctx;
+
+ /* We can't just switch on the pipe A, we need to set things up with a
+ * proper mode and output configuration. As a gross hack, enable pipe A
+@@ -11857,10 +11879,8 @@ static void intel_enable_pipe_a(struct drm_device *dev)
+ if (!crt)
+ return;
+
+- if (intel_get_load_detect_pipe(crt, NULL, &load_detect_temp, &ctx))
+- intel_release_load_detect_pipe(crt, &load_detect_temp, &ctx);
+-
+-
++ if (intel_get_load_detect_pipe(crt, NULL, &load_detect_temp, ctx))
++ intel_release_load_detect_pipe(crt, &load_detect_temp);
+ }
+
+ static bool
+diff --git a/drivers/gpu/drm/i915/intel_dp.c b/drivers/gpu/drm/i915/intel_dp.c
+index 8a1a4fbc06ac..fbffcbb9a0f8 100644
+--- a/drivers/gpu/drm/i915/intel_dp.c
++++ b/drivers/gpu/drm/i915/intel_dp.c
+@@ -3313,6 +3313,9 @@ intel_dp_check_link_status(struct intel_dp *intel_dp)
+ if (WARN_ON(!intel_encoder->base.crtc))
+ return;
+
++ if (!to_intel_crtc(intel_encoder->base.crtc)->active)
++ return;
++
+ /* Try to read receiver status if the link appears to be up */
+ if (!intel_dp_get_link_status(intel_dp, link_status)) {
+ return;
+diff --git a/drivers/gpu/drm/i915/intel_drv.h b/drivers/gpu/drm/i915/intel_drv.h
+index f67340ed2c12..e0f88a0669c1 100644
+--- a/drivers/gpu/drm/i915/intel_drv.h
++++ b/drivers/gpu/drm/i915/intel_drv.h
+@@ -754,8 +754,7 @@ bool intel_get_load_detect_pipe(struct drm_connector *connector,
+ struct intel_load_detect_pipe *old,
+ struct drm_modeset_acquire_ctx *ctx);
+ void intel_release_load_detect_pipe(struct drm_connector *connector,
+- struct intel_load_detect_pipe *old,
+- struct drm_modeset_acquire_ctx *ctx);
++ struct intel_load_detect_pipe *old);
+ int intel_pin_and_fence_fb_obj(struct drm_device *dev,
+ struct drm_i915_gem_object *obj,
+ struct intel_engine_cs *pipelined);
+diff --git a/drivers/gpu/drm/i915/intel_hdmi.c b/drivers/gpu/drm/i915/intel_hdmi.c
+index eee2bbec2958..057366453d27 100644
+--- a/drivers/gpu/drm/i915/intel_hdmi.c
++++ b/drivers/gpu/drm/i915/intel_hdmi.c
+@@ -728,7 +728,7 @@ static void intel_hdmi_get_config(struct intel_encoder *encoder,
+ if (tmp & HDMI_MODE_SELECT_HDMI)
+ pipe_config->has_hdmi_sink = true;
+
+- if (tmp & HDMI_MODE_SELECT_HDMI)
++ if (tmp & SDVO_AUDIO_ENABLE)
+ pipe_config->has_audio = true;
+
+ pipe_config->adjusted_mode.flags |= flags;
+diff --git a/drivers/gpu/drm/i915/intel_lvds.c b/drivers/gpu/drm/i915/intel_lvds.c
+index 5e5a72fca5fb..0fb230949f81 100644
+--- a/drivers/gpu/drm/i915/intel_lvds.c
++++ b/drivers/gpu/drm/i915/intel_lvds.c
+@@ -531,7 +531,7 @@ static const struct drm_encoder_funcs intel_lvds_enc_funcs = {
+ .destroy = intel_encoder_destroy,
+ };
+
+-static int __init intel_no_lvds_dmi_callback(const struct dmi_system_id *id)
++static int intel_no_lvds_dmi_callback(const struct dmi_system_id *id)
+ {
+ DRM_INFO("Skipping LVDS initialization for %s\n", id->ident);
+ return 1;
+diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
+index 279488addf3f..7add7eead21d 100644
+--- a/drivers/gpu/drm/i915/intel_ringbuffer.c
++++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
+@@ -517,6 +517,9 @@ static int init_ring_common(struct intel_engine_cs *ring)
+ else
+ ring_setup_phys_status_page(ring);
+
++ /* Enforce ordering by reading HEAD register back */
++ I915_READ_HEAD(ring);
++
+ /* Initialize the ring. This must happen _after_ we've cleared the ring
+ * registers with the above sequence (the readback of the HEAD registers
+ * also enforces ordering), otherwise the hw might lose the new ring
+diff --git a/drivers/gpu/drm/i915/intel_tv.c b/drivers/gpu/drm/i915/intel_tv.c
+index 67c6c9a2eb1c..5c6f7e2417e4 100644
+--- a/drivers/gpu/drm/i915/intel_tv.c
++++ b/drivers/gpu/drm/i915/intel_tv.c
+@@ -854,6 +854,10 @@ intel_enable_tv(struct intel_encoder *encoder)
+ struct drm_device *dev = encoder->base.dev;
+ struct drm_i915_private *dev_priv = dev->dev_private;
+
++ /* Prevents vblank waits from timing out in intel_tv_detect_type() */
++ intel_wait_for_vblank(encoder->base.dev,
++ to_intel_crtc(encoder->base.crtc)->pipe);
++
+ I915_WRITE(TV_CTL, I915_READ(TV_CTL) | TV_ENC_ENABLE);
+ }
+
+@@ -1311,6 +1315,7 @@ intel_tv_detect(struct drm_connector *connector, bool force)
+ {
+ struct drm_display_mode mode;
+ struct intel_tv *intel_tv = intel_attached_tv(connector);
++ enum drm_connector_status status;
+ int type;
+
+ DRM_DEBUG_KMS("[CONNECTOR:%d:%s] force=%d\n",
+@@ -1323,16 +1328,24 @@ intel_tv_detect(struct drm_connector *connector, bool force)
+ struct intel_load_detect_pipe tmp;
+ struct drm_modeset_acquire_ctx ctx;
+
++ drm_modeset_acquire_init(&ctx, 0);
++
+ if (intel_get_load_detect_pipe(connector, &mode, &tmp, &ctx)) {
+ type = intel_tv_detect_type(intel_tv, connector);
+- intel_release_load_detect_pipe(connector, &tmp, &ctx);
++ intel_release_load_detect_pipe(connector, &tmp);
++ status = type < 0 ?
++ connector_status_disconnected :
++ connector_status_connected;
+ } else
+- return connector_status_unknown;
++ status = connector_status_unknown;
++
++ drm_modeset_drop_locks(&ctx);
++ drm_modeset_acquire_fini(&ctx);
+ } else
+ return connector->status;
+
+- if (type < 0)
+- return connector_status_disconnected;
++ if (status != connector_status_connected)
++ return status;
+
+ intel_tv->type = type;
+ intel_tv_find_better_format(connector);
+diff --git a/drivers/gpu/drm/nouveau/nouveau_drm.c b/drivers/gpu/drm/nouveau/nouveau_drm.c
+index 5425ffe3931d..594c3f54102e 100644
+--- a/drivers/gpu/drm/nouveau/nouveau_drm.c
++++ b/drivers/gpu/drm/nouveau/nouveau_drm.c
+@@ -596,6 +596,7 @@ int nouveau_pmops_suspend(struct device *dev)
+
+ pci_save_state(pdev);
+ pci_disable_device(pdev);
++ pci_ignore_hotplug(pdev);
+ pci_set_power_state(pdev, PCI_D3hot);
+ return 0;
+ }
+diff --git a/drivers/gpu/drm/nouveau/nouveau_ttm.c b/drivers/gpu/drm/nouveau/nouveau_ttm.c
+index ab0228f640a5..7e185c122750 100644
+--- a/drivers/gpu/drm/nouveau/nouveau_ttm.c
++++ b/drivers/gpu/drm/nouveau/nouveau_ttm.c
+@@ -76,6 +76,7 @@ static int
+ nouveau_vram_manager_new(struct ttm_mem_type_manager *man,
+ struct ttm_buffer_object *bo,
+ struct ttm_placement *placement,
++ uint32_t flags,
+ struct ttm_mem_reg *mem)
+ {
+ struct nouveau_drm *drm = nouveau_bdev(man->bdev);
+@@ -162,6 +163,7 @@ static int
+ nouveau_gart_manager_new(struct ttm_mem_type_manager *man,
+ struct ttm_buffer_object *bo,
+ struct ttm_placement *placement,
++ uint32_t flags,
+ struct ttm_mem_reg *mem)
+ {
+ struct nouveau_drm *drm = nouveau_bdev(bo->bdev);
+@@ -242,6 +244,7 @@ static int
+ nv04_gart_manager_new(struct ttm_mem_type_manager *man,
+ struct ttm_buffer_object *bo,
+ struct ttm_placement *placement,
++ uint32_t flags,
+ struct ttm_mem_reg *mem)
+ {
+ struct nouveau_mem *node;
+diff --git a/drivers/gpu/drm/nouveau/nouveau_vga.c b/drivers/gpu/drm/nouveau/nouveau_vga.c
+index 4f4c3fec6916..c110b2cfc3eb 100644
+--- a/drivers/gpu/drm/nouveau/nouveau_vga.c
++++ b/drivers/gpu/drm/nouveau/nouveau_vga.c
+@@ -106,7 +106,16 @@ void
+ nouveau_vga_fini(struct nouveau_drm *drm)
+ {
+ struct drm_device *dev = drm->dev;
++ bool runtime = false;
++
++ if (nouveau_runtime_pm == 1)
++ runtime = true;
++ if ((nouveau_runtime_pm == -1) && (nouveau_is_optimus() || nouveau_is_v1_dsm()))
++ runtime = true;
++
+ vga_switcheroo_unregister_client(dev->pdev);
++ if (runtime && nouveau_is_v1_dsm() && !nouveau_is_optimus())
++ vga_switcheroo_fini_domain_pm_ops(drm->dev->dev);
+ vga_client_register(dev->pdev, NULL, NULL, NULL);
+ }
+
+diff --git a/drivers/gpu/drm/radeon/ci_dpm.c b/drivers/gpu/drm/radeon/ci_dpm.c
+index 584090ac3eb9..d416bb2ff48d 100644
+--- a/drivers/gpu/drm/radeon/ci_dpm.c
++++ b/drivers/gpu/drm/radeon/ci_dpm.c
+@@ -869,6 +869,9 @@ static int ci_set_thermal_temperature_range(struct radeon_device *rdev,
+ WREG32_SMC(CG_THERMAL_CTRL, tmp);
+ #endif
+
++ rdev->pm.dpm.thermal.min_temp = low_temp;
++ rdev->pm.dpm.thermal.max_temp = high_temp;
++
+ return 0;
+ }
+
+@@ -940,7 +943,18 @@ static void ci_get_leakage_voltages(struct radeon_device *rdev)
+ pi->vddc_leakage.count = 0;
+ pi->vddci_leakage.count = 0;
+
+- if (radeon_atom_get_leakage_id_from_vbios(rdev, &leakage_id) == 0) {
++ if (rdev->pm.dpm.platform_caps & ATOM_PP_PLATFORM_CAP_EVV) {
++ for (i = 0; i < CISLANDS_MAX_LEAKAGE_COUNT; i++) {
++ virtual_voltage_id = ATOM_VIRTUAL_VOLTAGE_ID0 + i;
++ if (radeon_atom_get_voltage_evv(rdev, virtual_voltage_id, &vddc) != 0)
++ continue;
++ if (vddc != 0 && vddc != virtual_voltage_id) {
++ pi->vddc_leakage.actual_voltage[pi->vddc_leakage.count] = vddc;
++ pi->vddc_leakage.leakage_id[pi->vddc_leakage.count] = virtual_voltage_id;
++ pi->vddc_leakage.count++;
++ }
++ }
++ } else if (radeon_atom_get_leakage_id_from_vbios(rdev, &leakage_id) == 0) {
+ for (i = 0; i < CISLANDS_MAX_LEAKAGE_COUNT; i++) {
+ virtual_voltage_id = ATOM_VIRTUAL_VOLTAGE_ID0 + i;
+ if (radeon_atom_get_leakage_vddc_based_on_leakage_params(rdev, &vddc, &vddci,
+diff --git a/drivers/gpu/drm/radeon/cik.c b/drivers/gpu/drm/radeon/cik.c
+index 65a8cca603a4..5ea01de617ab 100644
+--- a/drivers/gpu/drm/radeon/cik.c
++++ b/drivers/gpu/drm/radeon/cik.c
+@@ -3259,7 +3259,7 @@ static void cik_gpu_init(struct radeon_device *rdev)
+ u32 mc_shared_chmap, mc_arb_ramcfg;
+ u32 hdp_host_path_cntl;
+ u32 tmp;
+- int i, j, k;
++ int i, j;
+
+ switch (rdev->family) {
+ case CHIP_BONAIRE:
+@@ -3449,12 +3449,11 @@ static void cik_gpu_init(struct radeon_device *rdev)
+ rdev->config.cik.max_sh_per_se,
+ rdev->config.cik.max_backends_per_se);
+
++ rdev->config.cik.active_cus = 0;
+ for (i = 0; i < rdev->config.cik.max_shader_engines; i++) {
+ for (j = 0; j < rdev->config.cik.max_sh_per_se; j++) {
+- for (k = 0; k < rdev->config.cik.max_cu_per_sh; k++) {
+- rdev->config.cik.active_cus +=
+- hweight32(cik_get_cu_active_bitmap(rdev, i, j));
+- }
++ rdev->config.cik.active_cus +=
++ hweight32(cik_get_cu_active_bitmap(rdev, i, j));
+ }
+ }
+
+@@ -4490,7 +4489,7 @@ struct bonaire_mqd
+ */
+ static int cik_cp_compute_resume(struct radeon_device *rdev)
+ {
+- int r, i, idx;
++ int r, i, j, idx;
+ u32 tmp;
+ bool use_doorbell = true;
+ u64 hqd_gpu_addr;
+@@ -4609,7 +4608,7 @@ static int cik_cp_compute_resume(struct radeon_device *rdev)
+ mqd->queue_state.cp_hqd_pq_wptr= 0;
+ if (RREG32(CP_HQD_ACTIVE) & 1) {
+ WREG32(CP_HQD_DEQUEUE_REQUEST, 1);
+- for (i = 0; i < rdev->usec_timeout; i++) {
++ for (j = 0; j < rdev->usec_timeout; j++) {
+ if (!(RREG32(CP_HQD_ACTIVE) & 1))
+ break;
+ udelay(1);
+@@ -5643,12 +5642,13 @@ static void cik_vm_decode_fault(struct radeon_device *rdev,
+ void cik_vm_flush(struct radeon_device *rdev, int ridx, struct radeon_vm *vm)
+ {
+ struct radeon_ring *ring = &rdev->ring[ridx];
++ int usepfp = (ridx == RADEON_RING_TYPE_GFX_INDEX);
+
+ if (vm == NULL)
+ return;
+
+ radeon_ring_write(ring, PACKET3(PACKET3_WRITE_DATA, 3));
+- radeon_ring_write(ring, (WRITE_DATA_ENGINE_SEL(0) |
++ radeon_ring_write(ring, (WRITE_DATA_ENGINE_SEL(usepfp) |
+ WRITE_DATA_DST_SEL(0)));
+ if (vm->id < 8) {
+ radeon_ring_write(ring,
+@@ -5698,7 +5698,7 @@ void cik_vm_flush(struct radeon_device *rdev, int ridx, struct radeon_vm *vm)
+ radeon_ring_write(ring, 1 << vm->id);
+
+ /* compute doesn't have PFP */
+- if (ridx == RADEON_RING_TYPE_GFX_INDEX) {
++ if (usepfp) {
+ /* sync PFP to ME, otherwise we might get invalid PFP reads */
+ radeon_ring_write(ring, PACKET3(PACKET3_PFP_SYNC_ME, 0));
+ radeon_ring_write(ring, 0x0);
+diff --git a/drivers/gpu/drm/radeon/cik_sdma.c b/drivers/gpu/drm/radeon/cik_sdma.c
+index 8e9d0f1d858e..72bff72c036d 100644
+--- a/drivers/gpu/drm/radeon/cik_sdma.c
++++ b/drivers/gpu/drm/radeon/cik_sdma.c
+@@ -459,13 +459,6 @@ int cik_sdma_resume(struct radeon_device *rdev)
+ {
+ int r;
+
+- /* Reset dma */
+- WREG32(SRBM_SOFT_RESET, SOFT_RESET_SDMA | SOFT_RESET_SDMA1);
+- RREG32(SRBM_SOFT_RESET);
+- udelay(50);
+- WREG32(SRBM_SOFT_RESET, 0);
+- RREG32(SRBM_SOFT_RESET);
+-
+ r = cik_sdma_load_microcode(rdev);
+ if (r)
+ return r;
+diff --git a/drivers/gpu/drm/radeon/kv_dpm.c b/drivers/gpu/drm/radeon/kv_dpm.c
+index 9ef8c38f2d66..f00e6a6c254a 100644
+--- a/drivers/gpu/drm/radeon/kv_dpm.c
++++ b/drivers/gpu/drm/radeon/kv_dpm.c
+@@ -33,6 +33,8 @@
+ #define KV_MINIMUM_ENGINE_CLOCK 800
+ #define SMC_RAM_END 0x40000
+
++static int kv_enable_nb_dpm(struct radeon_device *rdev,
++ bool enable);
+ static void kv_init_graphics_levels(struct radeon_device *rdev);
+ static int kv_calculate_ds_divider(struct radeon_device *rdev);
+ static int kv_calculate_nbps_level_settings(struct radeon_device *rdev);
+@@ -1295,6 +1297,9 @@ void kv_dpm_disable(struct radeon_device *rdev)
+ {
+ kv_smc_bapm_enable(rdev, false);
+
++ if (rdev->family == CHIP_MULLINS)
++ kv_enable_nb_dpm(rdev, false);
++
+ /* powerup blocks */
+ kv_dpm_powergate_acp(rdev, false);
+ kv_dpm_powergate_samu(rdev, false);
+@@ -1438,14 +1443,14 @@ static int kv_update_uvd_dpm(struct radeon_device *rdev, bool gate)
+ return kv_enable_uvd_dpm(rdev, !gate);
+ }
+
+-static u8 kv_get_vce_boot_level(struct radeon_device *rdev)
++static u8 kv_get_vce_boot_level(struct radeon_device *rdev, u32 evclk)
+ {
+ u8 i;
+ struct radeon_vce_clock_voltage_dependency_table *table =
+ &rdev->pm.dpm.dyn_state.vce_clock_voltage_dependency_table;
+
+ for (i = 0; i < table->count; i++) {
+- if (table->entries[i].evclk >= 0) /* XXX */
++ if (table->entries[i].evclk >= evclk)
+ break;
+ }
+
+@@ -1468,7 +1473,7 @@ static int kv_update_vce_dpm(struct radeon_device *rdev,
+ if (pi->caps_stable_p_state)
+ pi->vce_boot_level = table->count - 1;
+ else
+- pi->vce_boot_level = kv_get_vce_boot_level(rdev);
++ pi->vce_boot_level = kv_get_vce_boot_level(rdev, radeon_new_state->evclk);
+
+ ret = kv_copy_bytes_to_smc(rdev,
+ pi->dpm_table_start +
+@@ -1769,15 +1774,24 @@ static int kv_update_dfs_bypass_settings(struct radeon_device *rdev,
+ return ret;
+ }
+
+-static int kv_enable_nb_dpm(struct radeon_device *rdev)
++static int kv_enable_nb_dpm(struct radeon_device *rdev,
++ bool enable)
+ {
+ struct kv_power_info *pi = kv_get_pi(rdev);
+ int ret = 0;
+
+- if (pi->enable_nb_dpm && !pi->nb_dpm_enabled) {
+- ret = kv_notify_message_to_smu(rdev, PPSMC_MSG_NBDPM_Enable);
+- if (ret == 0)
+- pi->nb_dpm_enabled = true;
++ if (enable) {
++ if (pi->enable_nb_dpm && !pi->nb_dpm_enabled) {
++ ret = kv_notify_message_to_smu(rdev, PPSMC_MSG_NBDPM_Enable);
++ if (ret == 0)
++ pi->nb_dpm_enabled = true;
++ }
++ } else {
++ if (pi->enable_nb_dpm && pi->nb_dpm_enabled) {
++ ret = kv_notify_message_to_smu(rdev, PPSMC_MSG_NBDPM_Disable);
++ if (ret == 0)
++ pi->nb_dpm_enabled = false;
++ }
+ }
+
+ return ret;
+@@ -1864,7 +1878,7 @@ int kv_dpm_set_power_state(struct radeon_device *rdev)
+ }
+ kv_update_sclk_t(rdev);
+ if (rdev->family == CHIP_MULLINS)
+- kv_enable_nb_dpm(rdev);
++ kv_enable_nb_dpm(rdev, true);
+ }
+ } else {
+ if (pi->enable_dpm) {
+@@ -1889,7 +1903,7 @@ int kv_dpm_set_power_state(struct radeon_device *rdev)
+ }
+ kv_update_acp_boot_level(rdev);
+ kv_update_sclk_t(rdev);
+- kv_enable_nb_dpm(rdev);
++ kv_enable_nb_dpm(rdev, true);
+ }
+ }
+
+diff --git a/drivers/gpu/drm/radeon/ni_dma.c b/drivers/gpu/drm/radeon/ni_dma.c
+index 6378e0276691..6927db4d8db7 100644
+--- a/drivers/gpu/drm/radeon/ni_dma.c
++++ b/drivers/gpu/drm/radeon/ni_dma.c
+@@ -191,12 +191,6 @@ int cayman_dma_resume(struct radeon_device *rdev)
+ u32 reg_offset, wb_offset;
+ int i, r;
+
+- /* Reset dma */
+- WREG32(SRBM_SOFT_RESET, SOFT_RESET_DMA | SOFT_RESET_DMA1);
+- RREG32(SRBM_SOFT_RESET);
+- udelay(50);
+- WREG32(SRBM_SOFT_RESET, 0);
+-
+ for (i = 0; i < 2; i++) {
+ if (i == 0) {
+ ring = &rdev->ring[R600_RING_TYPE_DMA_INDEX];
+diff --git a/drivers/gpu/drm/radeon/r600.c b/drivers/gpu/drm/radeon/r600.c
+index 3c69f58e46ef..44b046b4056f 100644
+--- a/drivers/gpu/drm/radeon/r600.c
++++ b/drivers/gpu/drm/radeon/r600.c
+@@ -1813,7 +1813,6 @@ static void r600_gpu_init(struct radeon_device *rdev)
+ {
+ u32 tiling_config;
+ u32 ramcfg;
+- u32 cc_rb_backend_disable;
+ u32 cc_gc_shader_pipe_config;
+ u32 tmp;
+ int i, j;
+@@ -1940,29 +1939,20 @@ static void r600_gpu_init(struct radeon_device *rdev)
+ }
+ tiling_config |= BANK_SWAPS(1);
+
+- cc_rb_backend_disable = RREG32(CC_RB_BACKEND_DISABLE) & 0x00ff0000;
+- tmp = R6XX_MAX_BACKENDS -
+- r600_count_pipe_bits((cc_rb_backend_disable >> 16) & R6XX_MAX_BACKENDS_MASK);
+- if (tmp < rdev->config.r600.max_backends) {
+- rdev->config.r600.max_backends = tmp;
+- }
+-
+ cc_gc_shader_pipe_config = RREG32(CC_GC_SHADER_PIPE_CONFIG) & 0x00ffff00;
+- tmp = R6XX_MAX_PIPES -
+- r600_count_pipe_bits((cc_gc_shader_pipe_config >> 8) & R6XX_MAX_PIPES_MASK);
+- if (tmp < rdev->config.r600.max_pipes) {
+- rdev->config.r600.max_pipes = tmp;
+- }
+- tmp = R6XX_MAX_SIMDS -
+- r600_count_pipe_bits((cc_gc_shader_pipe_config >> 16) & R6XX_MAX_SIMDS_MASK);
+- if (tmp < rdev->config.r600.max_simds) {
+- rdev->config.r600.max_simds = tmp;
+- }
+ tmp = rdev->config.r600.max_simds -
+ r600_count_pipe_bits((cc_gc_shader_pipe_config >> 16) & R6XX_MAX_SIMDS_MASK);
+ rdev->config.r600.active_simds = tmp;
+
+ disabled_rb_mask = (RREG32(CC_RB_BACKEND_DISABLE) >> 16) & R6XX_MAX_BACKENDS_MASK;
++ tmp = 0;
++ for (i = 0; i < rdev->config.r600.max_backends; i++)
++ tmp |= (1 << i);
++ /* if all the backends are disabled, fix it up here */
++ if ((disabled_rb_mask & tmp) == tmp) {
++ for (i = 0; i < rdev->config.r600.max_backends; i++)
++ disabled_rb_mask &= ~(1 << i);
++ }
+ tmp = (tiling_config & PIPE_TILING__MASK) >> PIPE_TILING__SHIFT;
+ tmp = r6xx_remap_render_backend(rdev, tmp, rdev->config.r600.max_backends,
+ R6XX_MAX_BACKENDS, disabled_rb_mask);
+diff --git a/drivers/gpu/drm/radeon/r600_dma.c b/drivers/gpu/drm/radeon/r600_dma.c
+index 4969cef44a19..b766e052d91f 100644
+--- a/drivers/gpu/drm/radeon/r600_dma.c
++++ b/drivers/gpu/drm/radeon/r600_dma.c
+@@ -124,15 +124,6 @@ int r600_dma_resume(struct radeon_device *rdev)
+ u32 rb_bufsz;
+ int r;
+
+- /* Reset dma */
+- if (rdev->family >= CHIP_RV770)
+- WREG32(SRBM_SOFT_RESET, RV770_SOFT_RESET_DMA);
+- else
+- WREG32(SRBM_SOFT_RESET, SOFT_RESET_DMA);
+- RREG32(SRBM_SOFT_RESET);
+- udelay(50);
+- WREG32(SRBM_SOFT_RESET, 0);
+-
+ WREG32(DMA_SEM_INCOMPLETE_TIMER_CNTL, 0);
+ WREG32(DMA_SEM_WAIT_FAIL_TIMER_CNTL, 0);
+
+diff --git a/drivers/gpu/drm/radeon/radeon.h b/drivers/gpu/drm/radeon/radeon.h
+index 60c47f829122..2d6b55d8461e 100644
+--- a/drivers/gpu/drm/radeon/radeon.h
++++ b/drivers/gpu/drm/radeon/radeon.h
+@@ -304,6 +304,9 @@ int radeon_atom_get_leakage_vddc_based_on_leakage_params(struct radeon_device *r
+ u16 *vddc, u16 *vddci,
+ u16 virtual_voltage_id,
+ u16 vbios_voltage_id);
++int radeon_atom_get_voltage_evv(struct radeon_device *rdev,
++ u16 virtual_voltage_id,
++ u16 *voltage);
+ int radeon_atom_round_to_true_voltage(struct radeon_device *rdev,
+ u8 voltage_type,
+ u16 nominal_voltage,
+diff --git a/drivers/gpu/drm/radeon/radeon_atombios.c b/drivers/gpu/drm/radeon/radeon_atombios.c
+index 173f378428a9..be6705eeb649 100644
+--- a/drivers/gpu/drm/radeon/radeon_atombios.c
++++ b/drivers/gpu/drm/radeon/radeon_atombios.c
+@@ -447,6 +447,13 @@ static bool radeon_atom_apply_quirks(struct drm_device *dev,
+ }
+ }
+
++ /* Fujitsu D3003-S2 board lists DVI-I as DVI-I and VGA */
++ if ((dev->pdev->device == 0x9805) &&
++ (dev->pdev->subsystem_vendor == 0x1734) &&
++ (dev->pdev->subsystem_device == 0x11bd)) {
++ if (*connector_type == DRM_MODE_CONNECTOR_VGA)
++ return false;
++ }
+
+ return true;
+ }
+@@ -1963,7 +1970,7 @@ static const char *thermal_controller_names[] = {
+ "adm1032",
+ "adm1030",
+ "max6649",
+- "lm64",
++ "lm63", /* lm64 */
+ "f75375",
+ "asc7xxx",
+ };
+@@ -1974,7 +1981,7 @@ static const char *pp_lib_thermal_controller_names[] = {
+ "adm1032",
+ "adm1030",
+ "max6649",
+- "lm64",
++ "lm63", /* lm64 */
+ "f75375",
+ "RV6xx",
+ "RV770",
+@@ -2281,19 +2288,31 @@ static void radeon_atombios_add_pplib_thermal_controller(struct radeon_device *r
+ (controller->ucFanParameters &
+ ATOM_PP_FANPARAMETERS_NOFAN) ? "without" : "with");
+ rdev->pm.int_thermal_type = THERMAL_TYPE_KV;
+- } else if ((controller->ucType ==
+- ATOM_PP_THERMALCONTROLLER_EXTERNAL_GPIO) ||
+- (controller->ucType ==
+- ATOM_PP_THERMALCONTROLLER_ADT7473_WITH_INTERNAL) ||
+- (controller->ucType ==
+- ATOM_PP_THERMALCONTROLLER_EMC2103_WITH_INTERNAL)) {
+- DRM_INFO("Special thermal controller config\n");
++ } else if (controller->ucType ==
++ ATOM_PP_THERMALCONTROLLER_EXTERNAL_GPIO) {
++ DRM_INFO("External GPIO thermal controller %s fan control\n",
++ (controller->ucFanParameters &
++ ATOM_PP_FANPARAMETERS_NOFAN) ? "without" : "with");
++ rdev->pm.int_thermal_type = THERMAL_TYPE_EXTERNAL_GPIO;
++ } else if (controller->ucType ==
++ ATOM_PP_THERMALCONTROLLER_ADT7473_WITH_INTERNAL) {
++ DRM_INFO("ADT7473 with internal thermal controller %s fan control\n",
++ (controller->ucFanParameters &
++ ATOM_PP_FANPARAMETERS_NOFAN) ? "without" : "with");
++ rdev->pm.int_thermal_type = THERMAL_TYPE_ADT7473_WITH_INTERNAL;
++ } else if (controller->ucType ==
++ ATOM_PP_THERMALCONTROLLER_EMC2103_WITH_INTERNAL) {
++ DRM_INFO("EMC2103 with internal thermal controller %s fan control\n",
++ (controller->ucFanParameters &
++ ATOM_PP_FANPARAMETERS_NOFAN) ? "without" : "with");
++ rdev->pm.int_thermal_type = THERMAL_TYPE_EMC2103_WITH_INTERNAL;
+ } else if (controller->ucType < ARRAY_SIZE(pp_lib_thermal_controller_names)) {
+ DRM_INFO("Possible %s thermal controller at 0x%02x %s fan control\n",
+ pp_lib_thermal_controller_names[controller->ucType],
+ controller->ucI2cAddress >> 1,
+ (controller->ucFanParameters &
+ ATOM_PP_FANPARAMETERS_NOFAN) ? "without" : "with");
++ rdev->pm.int_thermal_type = THERMAL_TYPE_EXTERNAL;
+ i2c_bus = radeon_lookup_i2c_gpio(rdev, controller->ucI2cLine);
+ rdev->pm.i2c_bus = radeon_i2c_lookup(rdev, &i2c_bus);
+ if (rdev->pm.i2c_bus) {
+@@ -3236,6 +3255,41 @@ int radeon_atom_get_leakage_vddc_based_on_leakage_params(struct radeon_device *r
+ return 0;
+ }
+
++union get_voltage_info {
++ struct _GET_VOLTAGE_INFO_INPUT_PARAMETER_V1_2 in;
++ struct _GET_EVV_VOLTAGE_INFO_OUTPUT_PARAMETER_V1_2 evv_out;
++};
++
++int radeon_atom_get_voltage_evv(struct radeon_device *rdev,
++ u16 virtual_voltage_id,
++ u16 *voltage)
++{
++ int index = GetIndexIntoMasterTable(COMMAND, GetVoltageInfo);
++ u32 entry_id;
++ u32 count = rdev->pm.dpm.dyn_state.vddc_dependency_on_sclk.count;
++ union get_voltage_info args;
++
++ for (entry_id = 0; entry_id < count; entry_id++) {
++ if (rdev->pm.dpm.dyn_state.vddc_dependency_on_sclk.entries[entry_id].v ==
++ virtual_voltage_id)
++ break;
++ }
++
++ if (entry_id >= count)
++ return -EINVAL;
++
++ args.in.ucVoltageType = VOLTAGE_TYPE_VDDC;
++ args.in.ucVoltageMode = ATOM_GET_VOLTAGE_EVV_VOLTAGE;
++ args.in.ulSCLKFreq =
++ cpu_to_le32(rdev->pm.dpm.dyn_state.vddc_dependency_on_sclk.entries[entry_id].clk);
++
++ atom_execute_table(rdev->mode_info.atom_context, index, (uint32_t *)&args);
++
++ *voltage = le16_to_cpu(args.evv_out.usVoltageLevel);
++
++ return 0;
++}
++
+ int radeon_atom_get_voltage_gpio_settings(struct radeon_device *rdev,
+ u16 voltage_level, u8 voltage_type,
+ u32 *gpio_value, u32 *gpio_mask)
+diff --git a/drivers/gpu/drm/radeon/radeon_cs.c b/drivers/gpu/drm/radeon/radeon_cs.c
+index ae763f60c8a0..8f7d56f342f1 100644
+--- a/drivers/gpu/drm/radeon/radeon_cs.c
++++ b/drivers/gpu/drm/radeon/radeon_cs.c
+@@ -132,7 +132,8 @@ static int radeon_cs_parser_relocs(struct radeon_cs_parser *p)
+ * the buffers used for read only, which doubles the range
+ * to 0 to 31. 32 is reserved for the kernel driver.
+ */
+- priority = (r->flags & 0xf) * 2 + !!r->write_domain;
++ priority = (r->flags & RADEON_RELOC_PRIO_MASK) * 2
++ + !!r->write_domain;
+
+ /* the first reloc of an UVD job is the msg and that must be in
+ VRAM, also but everything into VRAM on AGP cards to avoid
+diff --git a/drivers/gpu/drm/radeon/radeon_device.c b/drivers/gpu/drm/radeon/radeon_device.c
+index 697add2cd4e3..52a0cfd0276a 100644
+--- a/drivers/gpu/drm/radeon/radeon_device.c
++++ b/drivers/gpu/drm/radeon/radeon_device.c
+@@ -1350,7 +1350,7 @@ int radeon_device_init(struct radeon_device *rdev,
+
+ r = radeon_init(rdev);
+ if (r)
+- return r;
++ goto failed;
+
+ r = radeon_ib_ring_tests(rdev);
+ if (r)
+@@ -1370,7 +1370,7 @@ int radeon_device_init(struct radeon_device *rdev,
+ radeon_agp_disable(rdev);
+ r = radeon_init(rdev);
+ if (r)
+- return r;
++ goto failed;
+ }
+
+ if ((radeon_testing & 1)) {
+@@ -1392,6 +1392,11 @@ int radeon_device_init(struct radeon_device *rdev,
+ DRM_INFO("radeon: acceleration disabled, skipping benchmarks\n");
+ }
+ return 0;
++
++failed:
++ if (runtime)
++ vga_switcheroo_fini_domain_pm_ops(rdev->dev);
++ return r;
+ }
+
+ static void radeon_debugfs_remove_files(struct radeon_device *rdev);
+@@ -1412,6 +1417,8 @@ void radeon_device_fini(struct radeon_device *rdev)
+ radeon_bo_evict_vram(rdev);
+ radeon_fini(rdev);
+ vga_switcheroo_unregister_client(rdev->pdev);
++ if (rdev->flags & RADEON_IS_PX)
++ vga_switcheroo_fini_domain_pm_ops(rdev->dev);
+ vga_client_register(rdev->pdev, NULL, NULL, NULL);
+ if (rdev->rio_mem)
+ pci_iounmap(rdev->pdev, rdev->rio_mem);
+@@ -1637,7 +1644,6 @@ int radeon_gpu_reset(struct radeon_device *rdev)
+ radeon_save_bios_scratch_regs(rdev);
+ /* block TTM */
+ resched = ttm_bo_lock_delayed_workqueue(&rdev->mman.bdev);
+- radeon_pm_suspend(rdev);
+ radeon_suspend(rdev);
+
+ for (i = 0; i < RADEON_NUM_RINGS; ++i) {
+@@ -1683,9 +1689,24 @@ retry:
+ }
+ }
+
+- radeon_pm_resume(rdev);
++ if ((rdev->pm.pm_method == PM_METHOD_DPM) && rdev->pm.dpm_enabled) {
++ /* do dpm late init */
++ r = radeon_pm_late_init(rdev);
++ if (r) {
++ rdev->pm.dpm_enabled = false;
++ DRM_ERROR("radeon_pm_late_init failed, disabling dpm\n");
++ }
++ } else {
++ /* resume old pm late */
++ radeon_pm_resume(rdev);
++ }
++
+ drm_helper_resume_force_mode(rdev->ddev);
+
++ /* set the power state here in case we are a PX system or headless */
++ if ((rdev->pm.pm_method == PM_METHOD_DPM) && rdev->pm.dpm_enabled)
++ radeon_pm_compute_clocks(rdev);
++
+ ttm_bo_unlock_delayed_workqueue(&rdev->mman.bdev, resched);
+ if (r) {
+ /* bad news, how to tell it to userspace ? */
+diff --git a/drivers/gpu/drm/radeon/radeon_drv.c b/drivers/gpu/drm/radeon/radeon_drv.c
+index e9e361084249..a089abb76363 100644
+--- a/drivers/gpu/drm/radeon/radeon_drv.c
++++ b/drivers/gpu/drm/radeon/radeon_drv.c
+@@ -429,6 +429,7 @@ static int radeon_pmops_runtime_suspend(struct device *dev)
+ ret = radeon_suspend_kms(drm_dev, false, false);
+ pci_save_state(pdev);
+ pci_disable_device(pdev);
++ pci_ignore_hotplug(pdev);
+ pci_set_power_state(pdev, PCI_D3cold);
+ drm_dev->switch_power_state = DRM_SWITCH_POWER_DYNAMIC_OFF;
+
+diff --git a/drivers/gpu/drm/radeon/radeon_kms.c b/drivers/gpu/drm/radeon/radeon_kms.c
+index d25ae6acfd5a..c1a206dd859d 100644
+--- a/drivers/gpu/drm/radeon/radeon_kms.c
++++ b/drivers/gpu/drm/radeon/radeon_kms.c
+@@ -254,7 +254,14 @@ static int radeon_info_ioctl(struct drm_device *dev, void *data, struct drm_file
+ }
+ break;
+ case RADEON_INFO_ACCEL_WORKING2:
+- *value = rdev->accel_working;
++ if (rdev->family == CHIP_HAWAII) {
++ if (rdev->accel_working)
++ *value = 2;
++ else
++ *value = 0;
++ } else {
++ *value = rdev->accel_working;
++ }
+ break;
+ case RADEON_INFO_TILING_CONFIG:
+ if (rdev->family >= CHIP_BONAIRE)
+diff --git a/drivers/gpu/drm/radeon/radeon_pm.c b/drivers/gpu/drm/radeon/radeon_pm.c
+index e447e390d09a..50d6ff9d7656 100644
+--- a/drivers/gpu/drm/radeon/radeon_pm.c
++++ b/drivers/gpu/drm/radeon/radeon_pm.c
+@@ -460,10 +460,6 @@ static ssize_t radeon_get_dpm_state(struct device *dev,
+ struct radeon_device *rdev = ddev->dev_private;
+ enum radeon_pm_state_type pm = rdev->pm.dpm.user_state;
+
+- if ((rdev->flags & RADEON_IS_PX) &&
+- (ddev->switch_power_state != DRM_SWITCH_POWER_ON))
+- return snprintf(buf, PAGE_SIZE, "off\n");
+-
+ return snprintf(buf, PAGE_SIZE, "%s\n",
+ (pm == POWER_STATE_TYPE_BATTERY) ? "battery" :
+ (pm == POWER_STATE_TYPE_BALANCED) ? "balanced" : "performance");
+@@ -477,11 +473,6 @@ static ssize_t radeon_set_dpm_state(struct device *dev,
+ struct drm_device *ddev = dev_get_drvdata(dev);
+ struct radeon_device *rdev = ddev->dev_private;
+
+- /* Can't set dpm state when the card is off */
+- if ((rdev->flags & RADEON_IS_PX) &&
+- (ddev->switch_power_state != DRM_SWITCH_POWER_ON))
+- return -EINVAL;
+-
+ mutex_lock(&rdev->pm.mutex);
+ if (strncmp("battery", buf, strlen("battery")) == 0)
+ rdev->pm.dpm.user_state = POWER_STATE_TYPE_BATTERY;
+@@ -495,7 +486,12 @@ static ssize_t radeon_set_dpm_state(struct device *dev,
+ goto fail;
+ }
+ mutex_unlock(&rdev->pm.mutex);
+- radeon_pm_compute_clocks(rdev);
++
++ /* Can't set dpm state when the card is off */
++ if (!(rdev->flags & RADEON_IS_PX) ||
++ (ddev->switch_power_state == DRM_SWITCH_POWER_ON))
++ radeon_pm_compute_clocks(rdev);
++
+ fail:
+ return count;
+ }
+@@ -1303,10 +1299,6 @@ int radeon_pm_init(struct radeon_device *rdev)
+ case CHIP_RS780:
+ case CHIP_RS880:
+ case CHIP_RV770:
+- case CHIP_BARTS:
+- case CHIP_TURKS:
+- case CHIP_CAICOS:
+- case CHIP_CAYMAN:
+ /* DPM requires the RLC, RV770+ dGPU requires SMC */
+ if (!rdev->rlc_fw)
+ rdev->pm.pm_method = PM_METHOD_PROFILE;
+@@ -1330,6 +1322,10 @@ int radeon_pm_init(struct radeon_device *rdev)
+ case CHIP_PALM:
+ case CHIP_SUMO:
+ case CHIP_SUMO2:
++ case CHIP_BARTS:
++ case CHIP_TURKS:
++ case CHIP_CAICOS:
++ case CHIP_CAYMAN:
+ case CHIP_ARUBA:
+ case CHIP_TAHITI:
+ case CHIP_PITCAIRN:
+diff --git a/drivers/gpu/drm/radeon/radeon_semaphore.c b/drivers/gpu/drm/radeon/radeon_semaphore.c
+index dbd6bcde92de..e6101c18c457 100644
+--- a/drivers/gpu/drm/radeon/radeon_semaphore.c
++++ b/drivers/gpu/drm/radeon/radeon_semaphore.c
+@@ -34,7 +34,7 @@
+ int radeon_semaphore_create(struct radeon_device *rdev,
+ struct radeon_semaphore **semaphore)
+ {
+- uint32_t *cpu_addr;
++ uint64_t *cpu_addr;
+ int i, r;
+
+ *semaphore = kmalloc(sizeof(struct radeon_semaphore), GFP_KERNEL);
+diff --git a/drivers/gpu/drm/radeon/rv770.c b/drivers/gpu/drm/radeon/rv770.c
+index da8703d8d455..11cd3d887428 100644
+--- a/drivers/gpu/drm/radeon/rv770.c
++++ b/drivers/gpu/drm/radeon/rv770.c
+@@ -1178,7 +1178,6 @@ static void rv770_gpu_init(struct radeon_device *rdev)
+ u32 hdp_host_path_cntl;
+ u32 sq_dyn_gpr_size_simd_ab_0;
+ u32 gb_tiling_config = 0;
+- u32 cc_rb_backend_disable = 0;
+ u32 cc_gc_shader_pipe_config = 0;
+ u32 mc_arb_ramcfg;
+ u32 db_debug4, tmp;
+@@ -1312,21 +1311,7 @@ static void rv770_gpu_init(struct radeon_device *rdev)
+ WREG32(SPI_CONFIG_CNTL, 0);
+ }
+
+- cc_rb_backend_disable = RREG32(CC_RB_BACKEND_DISABLE) & 0x00ff0000;
+- tmp = R7XX_MAX_BACKENDS - r600_count_pipe_bits(cc_rb_backend_disable >> 16);
+- if (tmp < rdev->config.rv770.max_backends) {
+- rdev->config.rv770.max_backends = tmp;
+- }
+-
+ cc_gc_shader_pipe_config = RREG32(CC_GC_SHADER_PIPE_CONFIG) & 0xffffff00;
+- tmp = R7XX_MAX_PIPES - r600_count_pipe_bits((cc_gc_shader_pipe_config >> 8) & R7XX_MAX_PIPES_MASK);
+- if (tmp < rdev->config.rv770.max_pipes) {
+- rdev->config.rv770.max_pipes = tmp;
+- }
+- tmp = R7XX_MAX_SIMDS - r600_count_pipe_bits((cc_gc_shader_pipe_config >> 16) & R7XX_MAX_SIMDS_MASK);
+- if (tmp < rdev->config.rv770.max_simds) {
+- rdev->config.rv770.max_simds = tmp;
+- }
+ tmp = rdev->config.rv770.max_simds -
+ r600_count_pipe_bits((cc_gc_shader_pipe_config >> 16) & R7XX_MAX_SIMDS_MASK);
+ rdev->config.rv770.active_simds = tmp;
+@@ -1349,6 +1334,14 @@ static void rv770_gpu_init(struct radeon_device *rdev)
+ rdev->config.rv770.tiling_npipes = rdev->config.rv770.max_tile_pipes;
+
+ disabled_rb_mask = (RREG32(CC_RB_BACKEND_DISABLE) >> 16) & R7XX_MAX_BACKENDS_MASK;
++ tmp = 0;
++ for (i = 0; i < rdev->config.rv770.max_backends; i++)
++ tmp |= (1 << i);
++ /* if all the backends are disabled, fix it up here */
++ if ((disabled_rb_mask & tmp) == tmp) {
++ for (i = 0; i < rdev->config.rv770.max_backends; i++)
++ disabled_rb_mask &= ~(1 << i);
++ }
+ tmp = (gb_tiling_config & PIPE_TILING__MASK) >> PIPE_TILING__SHIFT;
+ tmp = r6xx_remap_render_backend(rdev, tmp, rdev->config.rv770.max_backends,
+ R7XX_MAX_BACKENDS, disabled_rb_mask);
+diff --git a/drivers/gpu/drm/radeon/si.c b/drivers/gpu/drm/radeon/si.c
+index 9e854fd016da..6c17d3b0be8b 100644
+--- a/drivers/gpu/drm/radeon/si.c
++++ b/drivers/gpu/drm/radeon/si.c
+@@ -2901,7 +2901,7 @@ static void si_gpu_init(struct radeon_device *rdev)
+ u32 sx_debug_1;
+ u32 hdp_host_path_cntl;
+ u32 tmp;
+- int i, j, k;
++ int i, j;
+
+ switch (rdev->family) {
+ case CHIP_TAHITI:
+@@ -3099,12 +3099,11 @@ static void si_gpu_init(struct radeon_device *rdev)
+ rdev->config.si.max_sh_per_se,
+ rdev->config.si.max_cu_per_sh);
+
++ rdev->config.si.active_cus = 0;
+ for (i = 0; i < rdev->config.si.max_shader_engines; i++) {
+ for (j = 0; j < rdev->config.si.max_sh_per_se; j++) {
+- for (k = 0; k < rdev->config.si.max_cu_per_sh; k++) {
+- rdev->config.si.active_cus +=
+- hweight32(si_get_cu_active_bitmap(rdev, i, j));
+- }
++ rdev->config.si.active_cus +=
++ hweight32(si_get_cu_active_bitmap(rdev, i, j));
+ }
+ }
+
+@@ -4815,7 +4814,7 @@ void si_vm_flush(struct radeon_device *rdev, int ridx, struct radeon_vm *vm)
+
+ /* write new base address */
+ radeon_ring_write(ring, PACKET3(PACKET3_WRITE_DATA, 3));
+- radeon_ring_write(ring, (WRITE_DATA_ENGINE_SEL(0) |
++ radeon_ring_write(ring, (WRITE_DATA_ENGINE_SEL(1) |
+ WRITE_DATA_DST_SEL(0)));
+
+ if (vm->id < 8) {
+diff --git a/drivers/gpu/drm/tegra/dc.c b/drivers/gpu/drm/tegra/dc.c
+index ef40381f3909..48c3bc460eef 100644
+--- a/drivers/gpu/drm/tegra/dc.c
++++ b/drivers/gpu/drm/tegra/dc.c
+@@ -1303,6 +1303,7 @@ static const struct of_device_id tegra_dc_of_match[] = {
+ /* sentinel */
+ }
+ };
++MODULE_DEVICE_TABLE(of, tegra_dc_of_match);
+
+ static int tegra_dc_parse_dt(struct tegra_dc *dc)
+ {
+diff --git a/drivers/gpu/drm/tegra/dpaux.c b/drivers/gpu/drm/tegra/dpaux.c
+index 3f132e356e9c..708f783ead47 100644
+--- a/drivers/gpu/drm/tegra/dpaux.c
++++ b/drivers/gpu/drm/tegra/dpaux.c
+@@ -382,6 +382,7 @@ static const struct of_device_id tegra_dpaux_of_match[] = {
+ { .compatible = "nvidia,tegra124-dpaux", },
+ { },
+ };
++MODULE_DEVICE_TABLE(of, tegra_dpaux_of_match);
+
+ struct platform_driver tegra_dpaux_driver = {
+ .driver = {
+diff --git a/drivers/gpu/drm/tegra/dsi.c b/drivers/gpu/drm/tegra/dsi.c
+index bd56f2affa78..97c409f10456 100644
+--- a/drivers/gpu/drm/tegra/dsi.c
++++ b/drivers/gpu/drm/tegra/dsi.c
+@@ -982,6 +982,7 @@ static const struct of_device_id tegra_dsi_of_match[] = {
+ { .compatible = "nvidia,tegra114-dsi", },
+ { },
+ };
++MODULE_DEVICE_TABLE(of, tegra_dsi_of_match);
+
+ struct platform_driver tegra_dsi_driver = {
+ .driver = {
+diff --git a/drivers/gpu/drm/tegra/gr2d.c b/drivers/gpu/drm/tegra/gr2d.c
+index 7c53941f2a9e..02cd3e37a6ec 100644
+--- a/drivers/gpu/drm/tegra/gr2d.c
++++ b/drivers/gpu/drm/tegra/gr2d.c
+@@ -121,6 +121,7 @@ static const struct of_device_id gr2d_match[] = {
+ { .compatible = "nvidia,tegra20-gr2d" },
+ { },
+ };
++MODULE_DEVICE_TABLE(of, gr2d_match);
+
+ static const u32 gr2d_addr_regs[] = {
+ GR2D_UA_BASE_ADDR,
+diff --git a/drivers/gpu/drm/tegra/gr3d.c b/drivers/gpu/drm/tegra/gr3d.c
+index 30f5ba9bd6d0..2bea2b2d204e 100644
+--- a/drivers/gpu/drm/tegra/gr3d.c
++++ b/drivers/gpu/drm/tegra/gr3d.c
+@@ -130,6 +130,7 @@ static const struct of_device_id tegra_gr3d_match[] = {
+ { .compatible = "nvidia,tegra20-gr3d" },
+ { }
+ };
++MODULE_DEVICE_TABLE(of, tegra_gr3d_match);
+
+ static const u32 gr3d_addr_regs[] = {
+ GR3D_IDX_ATTRIBUTE( 0),
+diff --git a/drivers/gpu/drm/tegra/hdmi.c b/drivers/gpu/drm/tegra/hdmi.c
+index ba067bb767e3..ffe26547328d 100644
+--- a/drivers/gpu/drm/tegra/hdmi.c
++++ b/drivers/gpu/drm/tegra/hdmi.c
+@@ -1450,6 +1450,7 @@ static const struct of_device_id tegra_hdmi_of_match[] = {
+ { .compatible = "nvidia,tegra20-hdmi", .data = &tegra20_hdmi_config },
+ { },
+ };
++MODULE_DEVICE_TABLE(of, tegra_hdmi_of_match);
+
+ static int tegra_hdmi_probe(struct platform_device *pdev)
+ {
+diff --git a/drivers/gpu/drm/tegra/sor.c b/drivers/gpu/drm/tegra/sor.c
+index 27c979b50111..061a5c501124 100644
+--- a/drivers/gpu/drm/tegra/sor.c
++++ b/drivers/gpu/drm/tegra/sor.c
+@@ -1455,6 +1455,7 @@ static const struct of_device_id tegra_sor_of_match[] = {
+ { .compatible = "nvidia,tegra124-sor", },
+ { },
+ };
++MODULE_DEVICE_TABLE(of, tegra_sor_of_match);
+
+ struct platform_driver tegra_sor_driver = {
+ .driver = {
+diff --git a/drivers/gpu/drm/tilcdc/tilcdc_drv.c b/drivers/gpu/drm/tilcdc/tilcdc_drv.c
+index b20b69488dc9..006a30e90390 100644
+--- a/drivers/gpu/drm/tilcdc/tilcdc_drv.c
++++ b/drivers/gpu/drm/tilcdc/tilcdc_drv.c
+@@ -122,6 +122,7 @@ static int tilcdc_unload(struct drm_device *dev)
+ struct tilcdc_drm_private *priv = dev->dev_private;
+ struct tilcdc_module *mod, *cur;
+
++ drm_fbdev_cma_fini(priv->fbdev);
+ drm_kms_helper_poll_fini(dev);
+ drm_mode_config_cleanup(dev);
+ drm_vblank_cleanup(dev);
+@@ -628,10 +629,10 @@ static int __init tilcdc_drm_init(void)
+ static void __exit tilcdc_drm_fini(void)
+ {
+ DBG("fini");
+- tilcdc_tfp410_fini();
+- tilcdc_slave_fini();
+- tilcdc_panel_fini();
+ platform_driver_unregister(&tilcdc_platform_driver);
++ tilcdc_panel_fini();
++ tilcdc_slave_fini();
++ tilcdc_tfp410_fini();
+ }
+
+ late_initcall(tilcdc_drm_init);
+diff --git a/drivers/gpu/drm/tilcdc/tilcdc_panel.c b/drivers/gpu/drm/tilcdc/tilcdc_panel.c
+index 86c67329b605..b085dcc54fb5 100644
+--- a/drivers/gpu/drm/tilcdc/tilcdc_panel.c
++++ b/drivers/gpu/drm/tilcdc/tilcdc_panel.c
+@@ -151,6 +151,7 @@ struct panel_connector {
+ static void panel_connector_destroy(struct drm_connector *connector)
+ {
+ struct panel_connector *panel_connector = to_panel_connector(connector);
++ drm_sysfs_connector_remove(connector);
+ drm_connector_cleanup(connector);
+ kfree(panel_connector);
+ }
+@@ -285,10 +286,8 @@ static void panel_destroy(struct tilcdc_module *mod)
+ {
+ struct panel_module *panel_mod = to_panel_module(mod);
+
+- if (panel_mod->timings) {
++ if (panel_mod->timings)
+ display_timings_release(panel_mod->timings);
+- kfree(panel_mod->timings);
+- }
+
+ tilcdc_module_cleanup(mod);
+ kfree(panel_mod->info);
+diff --git a/drivers/gpu/drm/tilcdc/tilcdc_slave.c b/drivers/gpu/drm/tilcdc/tilcdc_slave.c
+index 595068ba2d5e..2f83ffb7f37e 100644
+--- a/drivers/gpu/drm/tilcdc/tilcdc_slave.c
++++ b/drivers/gpu/drm/tilcdc/tilcdc_slave.c
+@@ -166,6 +166,7 @@ struct slave_connector {
+ static void slave_connector_destroy(struct drm_connector *connector)
+ {
+ struct slave_connector *slave_connector = to_slave_connector(connector);
++ drm_sysfs_connector_remove(connector);
+ drm_connector_cleanup(connector);
+ kfree(slave_connector);
+ }
+diff --git a/drivers/gpu/drm/tilcdc/tilcdc_tfp410.c b/drivers/gpu/drm/tilcdc/tilcdc_tfp410.c
+index c38b56b268ac..ce75ac8de4f8 100644
+--- a/drivers/gpu/drm/tilcdc/tilcdc_tfp410.c
++++ b/drivers/gpu/drm/tilcdc/tilcdc_tfp410.c
+@@ -167,6 +167,7 @@ struct tfp410_connector {
+ static void tfp410_connector_destroy(struct drm_connector *connector)
+ {
+ struct tfp410_connector *tfp410_connector = to_tfp410_connector(connector);
++ drm_sysfs_connector_remove(connector);
+ drm_connector_cleanup(connector);
+ kfree(tfp410_connector);
+ }
+diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
+index 4ab9f7171c4f..a13a10025ec7 100644
+--- a/drivers/gpu/drm/ttm/ttm_bo.c
++++ b/drivers/gpu/drm/ttm/ttm_bo.c
+@@ -784,7 +784,7 @@ static int ttm_bo_mem_force_space(struct ttm_buffer_object *bo,
+ int ret;
+
+ do {
+- ret = (*man->func->get_node)(man, bo, placement, mem);
++ ret = (*man->func->get_node)(man, bo, placement, 0, mem);
+ if (unlikely(ret != 0))
+ return ret;
+ if (mem->mm_node)
+@@ -897,7 +897,8 @@ int ttm_bo_mem_space(struct ttm_buffer_object *bo,
+
+ if (man->has_type && man->use_type) {
+ type_found = true;
+- ret = (*man->func->get_node)(man, bo, placement, mem);
++ ret = (*man->func->get_node)(man, bo, placement,
++ cur_flags, mem);
+ if (unlikely(ret))
+ return ret;
+ }
+@@ -937,7 +938,6 @@ int ttm_bo_mem_space(struct ttm_buffer_object *bo,
+ ttm_flag_masked(&cur_flags, placement->busy_placement[i],
+ ~TTM_PL_MASK_MEMTYPE);
+
+-
+ if (mem_type == TTM_PL_SYSTEM) {
+ mem->mem_type = mem_type;
+ mem->placement = cur_flags;
+diff --git a/drivers/gpu/drm/ttm/ttm_bo_manager.c b/drivers/gpu/drm/ttm/ttm_bo_manager.c
+index bd850c9f4bca..9e103a4875c8 100644
+--- a/drivers/gpu/drm/ttm/ttm_bo_manager.c
++++ b/drivers/gpu/drm/ttm/ttm_bo_manager.c
+@@ -50,6 +50,7 @@ struct ttm_range_manager {
+ static int ttm_bo_man_get_node(struct ttm_mem_type_manager *man,
+ struct ttm_buffer_object *bo,
+ struct ttm_placement *placement,
++ uint32_t flags,
+ struct ttm_mem_reg *mem)
+ {
+ struct ttm_range_manager *rman = (struct ttm_range_manager *) man->priv;
+@@ -67,7 +68,7 @@ static int ttm_bo_man_get_node(struct ttm_mem_type_manager *man,
+ if (!node)
+ return -ENOMEM;
+
+- if (bo->mem.placement & TTM_PL_FLAG_TOPDOWN)
++ if (flags & TTM_PL_FLAG_TOPDOWN)
+ aflags = DRM_MM_CREATE_TOP;
+
+ spin_lock(&rman->lock);
+diff --git a/drivers/gpu/drm/ttm/ttm_page_alloc.c b/drivers/gpu/drm/ttm/ttm_page_alloc.c
+index 863bef9f9234..cf4bad2c1d59 100644
+--- a/drivers/gpu/drm/ttm/ttm_page_alloc.c
++++ b/drivers/gpu/drm/ttm/ttm_page_alloc.c
+@@ -297,8 +297,10 @@ static void ttm_pool_update_free_locked(struct ttm_page_pool *pool,
+ *
+ * @pool: to free the pages from
+ * @free_all: If set to true will free all pages in pool
++ * @gfp: GFP flags.
+ **/
+-static int ttm_page_pool_free(struct ttm_page_pool *pool, unsigned nr_free)
++static int ttm_page_pool_free(struct ttm_page_pool *pool, unsigned nr_free,
++ gfp_t gfp)
+ {
+ unsigned long irq_flags;
+ struct page *p;
+@@ -309,8 +311,7 @@ static int ttm_page_pool_free(struct ttm_page_pool *pool, unsigned nr_free)
+ if (NUM_PAGES_TO_ALLOC < nr_free)
+ npages_to_free = NUM_PAGES_TO_ALLOC;
+
+- pages_to_free = kmalloc(npages_to_free * sizeof(struct page *),
+- GFP_KERNEL);
++ pages_to_free = kmalloc(npages_to_free * sizeof(struct page *), gfp);
+ if (!pages_to_free) {
+ pr_err("Failed to allocate memory for pool free operation\n");
+ return 0;
+@@ -382,32 +383,35 @@ out:
+ *
+ * XXX: (dchinner) Deadlock warning!
+ *
+- * ttm_page_pool_free() does memory allocation using GFP_KERNEL. that means
+- * this can deadlock when called a sc->gfp_mask that is not equal to
+- * GFP_KERNEL.
++ * We need to pass sc->gfp_mask to ttm_page_pool_free().
+ *
+ * This code is crying out for a shrinker per pool....
+ */
+ static unsigned long
+ ttm_pool_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
+ {
+- static atomic_t start_pool = ATOMIC_INIT(0);
++ static DEFINE_MUTEX(lock);
++ static unsigned start_pool;
+ unsigned i;
+- unsigned pool_offset = atomic_add_return(1, &start_pool);
++ unsigned pool_offset;
+ struct ttm_page_pool *pool;
+ int shrink_pages = sc->nr_to_scan;
+ unsigned long freed = 0;
+
+- pool_offset = pool_offset % NUM_POOLS;
++ if (!mutex_trylock(&lock))
++ return SHRINK_STOP;
++ pool_offset = ++start_pool % NUM_POOLS;
+ /* select start pool in round robin fashion */
+ for (i = 0; i < NUM_POOLS; ++i) {
+ unsigned nr_free = shrink_pages;
+ if (shrink_pages == 0)
+ break;
+ pool = &_manager->pools[(i + pool_offset)%NUM_POOLS];
+- shrink_pages = ttm_page_pool_free(pool, nr_free);
++ shrink_pages = ttm_page_pool_free(pool, nr_free,
++ sc->gfp_mask);
+ freed += nr_free - shrink_pages;
+ }
++ mutex_unlock(&lock);
+ return freed;
+ }
+
+@@ -706,7 +710,7 @@ static void ttm_put_pages(struct page **pages, unsigned npages, int flags,
+ }
+ spin_unlock_irqrestore(&pool->lock, irq_flags);
+ if (npages)
+- ttm_page_pool_free(pool, npages);
++ ttm_page_pool_free(pool, npages, GFP_KERNEL);
+ }
+
+ /*
+@@ -846,7 +850,8 @@ void ttm_page_alloc_fini(void)
+ ttm_pool_mm_shrink_fini(_manager);
+
+ for (i = 0; i < NUM_POOLS; ++i)
+- ttm_page_pool_free(&_manager->pools[i], FREE_ALL_PAGES);
++ ttm_page_pool_free(&_manager->pools[i], FREE_ALL_PAGES,
++ GFP_KERNEL);
+
+ kobject_put(&_manager->kobj);
+ _manager = NULL;
+diff --git a/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c b/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
+index fb8259f69839..ca65df144765 100644
+--- a/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
++++ b/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
+@@ -411,8 +411,10 @@ static void ttm_dma_page_put(struct dma_pool *pool, struct dma_page *d_page)
+ *
+ * @pool: to free the pages from
+ * @nr_free: If set to true will free all pages in pool
++ * @gfp: GFP flags.
+ **/
+-static unsigned ttm_dma_page_pool_free(struct dma_pool *pool, unsigned nr_free)
++static unsigned ttm_dma_page_pool_free(struct dma_pool *pool, unsigned nr_free,
++ gfp_t gfp)
+ {
+ unsigned long irq_flags;
+ struct dma_page *dma_p, *tmp;
+@@ -430,8 +432,7 @@ static unsigned ttm_dma_page_pool_free(struct dma_pool *pool, unsigned nr_free)
+ npages_to_free, nr_free);
+ }
+ #endif
+- pages_to_free = kmalloc(npages_to_free * sizeof(struct page *),
+- GFP_KERNEL);
++ pages_to_free = kmalloc(npages_to_free * sizeof(struct page *), gfp);
+
+ if (!pages_to_free) {
+ pr_err("%s: Failed to allocate memory for pool free operation\n",
+@@ -530,7 +531,7 @@ static void ttm_dma_free_pool(struct device *dev, enum pool_type type)
+ if (pool->type != type)
+ continue;
+ /* Takes a spinlock.. */
+- ttm_dma_page_pool_free(pool, FREE_ALL_PAGES);
++ ttm_dma_page_pool_free(pool, FREE_ALL_PAGES, GFP_KERNEL);
+ WARN_ON(((pool->npages_in_use + pool->npages_free) != 0));
+ /* This code path is called after _all_ references to the
+ * struct device has been dropped - so nobody should be
+@@ -983,7 +984,7 @@ void ttm_dma_unpopulate(struct ttm_dma_tt *ttm_dma, struct device *dev)
+
+ /* shrink pool if necessary (only on !is_cached pools)*/
+ if (npages)
+- ttm_dma_page_pool_free(pool, npages);
++ ttm_dma_page_pool_free(pool, npages, GFP_KERNEL);
+ ttm->state = tt_unpopulated;
+ }
+ EXPORT_SYMBOL_GPL(ttm_dma_unpopulate);
+@@ -993,10 +994,7 @@ EXPORT_SYMBOL_GPL(ttm_dma_unpopulate);
+ *
+ * XXX: (dchinner) Deadlock warning!
+ *
+- * ttm_dma_page_pool_free() does GFP_KERNEL memory allocation, and so attention
+- * needs to be paid to sc->gfp_mask to determine if this can be done or not.
+- * GFP_KERNEL memory allocation in a GFP_ATOMIC reclaim context woul dbe really
+- * bad.
++ * We need to pass sc->gfp_mask to ttm_dma_page_pool_free().
+ *
+ * I'm getting sadder as I hear more pathetical whimpers about needing per-pool
+ * shrinkers
+@@ -1004,9 +1002,9 @@ EXPORT_SYMBOL_GPL(ttm_dma_unpopulate);
+ static unsigned long
+ ttm_dma_pool_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
+ {
+- static atomic_t start_pool = ATOMIC_INIT(0);
++ static unsigned start_pool;
+ unsigned idx = 0;
+- unsigned pool_offset = atomic_add_return(1, &start_pool);
++ unsigned pool_offset;
+ unsigned shrink_pages = sc->nr_to_scan;
+ struct device_pools *p;
+ unsigned long freed = 0;
+@@ -1014,8 +1012,11 @@ ttm_dma_pool_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
+ if (list_empty(&_manager->pools))
+ return SHRINK_STOP;
+
+- mutex_lock(&_manager->lock);
+- pool_offset = pool_offset % _manager->npools;
++ if (!mutex_trylock(&_manager->lock))
++ return SHRINK_STOP;
++ if (!_manager->npools)
++ goto out;
++ pool_offset = ++start_pool % _manager->npools;
+ list_for_each_entry(p, &_manager->pools, pools) {
+ unsigned nr_free;
+
+@@ -1027,13 +1028,15 @@ ttm_dma_pool_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
+ if (++idx < pool_offset)
+ continue;
+ nr_free = shrink_pages;
+- shrink_pages = ttm_dma_page_pool_free(p->pool, nr_free);
++ shrink_pages = ttm_dma_page_pool_free(p->pool, nr_free,
++ sc->gfp_mask);
+ freed += nr_free - shrink_pages;
+
+ pr_debug("%s: (%s:%d) Asked to shrink %d, have %d more to go\n",
+ p->pool->dev_name, p->pool->name, current->pid,
+ nr_free, shrink_pages);
+ }
++out:
+ mutex_unlock(&_manager->lock);
+ return freed;
+ }
+@@ -1044,7 +1047,8 @@ ttm_dma_pool_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
+ struct device_pools *p;
+ unsigned long count = 0;
+
+- mutex_lock(&_manager->lock);
++ if (!mutex_trylock(&_manager->lock))
++ return 0;
+ list_for_each_entry(p, &_manager->pools, pools)
+ count += p->pool->npages_free;
+ mutex_unlock(&_manager->lock);
+diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_fifo.c b/drivers/gpu/drm/vmwgfx/vmwgfx_fifo.c
+index 6ccd993e26bf..6eae14d2a3f7 100644
+--- a/drivers/gpu/drm/vmwgfx/vmwgfx_fifo.c
++++ b/drivers/gpu/drm/vmwgfx/vmwgfx_fifo.c
+@@ -180,8 +180,9 @@ void vmw_fifo_release(struct vmw_private *dev_priv, struct vmw_fifo_state *fifo)
+
+ mutex_lock(&dev_priv->hw_mutex);
+
++ vmw_write(dev_priv, SVGA_REG_SYNC, SVGA_SYNC_GENERIC);
+ while (vmw_read(dev_priv, SVGA_REG_BUSY) != 0)
+- vmw_write(dev_priv, SVGA_REG_SYNC, SVGA_SYNC_GENERIC);
++ ;
+
+ dev_priv->last_read_seqno = ioread32(fifo_mem + SVGA_FIFO_FENCE);
+
+diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_gmrid_manager.c b/drivers/gpu/drm/vmwgfx/vmwgfx_gmrid_manager.c
+index b1273e8e9a69..26f8bdde3529 100644
+--- a/drivers/gpu/drm/vmwgfx/vmwgfx_gmrid_manager.c
++++ b/drivers/gpu/drm/vmwgfx/vmwgfx_gmrid_manager.c
+@@ -47,6 +47,7 @@ struct vmwgfx_gmrid_man {
+ static int vmw_gmrid_man_get_node(struct ttm_mem_type_manager *man,
+ struct ttm_buffer_object *bo,
+ struct ttm_placement *placement,
++ uint32_t flags,
+ struct ttm_mem_reg *mem)
+ {
+ struct vmwgfx_gmrid_man *gman =
+diff --git a/drivers/gpu/vga/vga_switcheroo.c b/drivers/gpu/vga/vga_switcheroo.c
+index 6866448083b2..37ac7b5dbd06 100644
+--- a/drivers/gpu/vga/vga_switcheroo.c
++++ b/drivers/gpu/vga/vga_switcheroo.c
+@@ -660,6 +660,12 @@ int vga_switcheroo_init_domain_pm_ops(struct device *dev, struct dev_pm_domain *
+ }
+ EXPORT_SYMBOL(vga_switcheroo_init_domain_pm_ops);
+
++void vga_switcheroo_fini_domain_pm_ops(struct device *dev)
++{
++ dev->pm_domain = NULL;
++}
++EXPORT_SYMBOL(vga_switcheroo_fini_domain_pm_ops);
++
+ static int vga_switcheroo_runtime_resume_hdmi_audio(struct device *dev)
+ {
+ struct pci_dev *pdev = to_pci_dev(dev);
+diff --git a/drivers/hid/hid-logitech-dj.c b/drivers/hid/hid-logitech-dj.c
+index b7ba82960c79..9bf8637747a5 100644
+--- a/drivers/hid/hid-logitech-dj.c
++++ b/drivers/hid/hid-logitech-dj.c
+@@ -656,7 +656,6 @@ static int logi_dj_raw_event(struct hid_device *hdev,
+ struct dj_receiver_dev *djrcv_dev = hid_get_drvdata(hdev);
+ struct dj_report *dj_report = (struct dj_report *) data;
+ unsigned long flags;
+- bool report_processed = false;
+
+ dbg_hid("%s, size:%d\n", __func__, size);
+
+@@ -683,34 +682,42 @@ static int logi_dj_raw_event(struct hid_device *hdev,
+ * device (via hid_input_report() ) and return 1 so hid-core does not do
+ * anything else with it.
+ */
++
++ /* case 1) */
++ if (data[0] != REPORT_ID_DJ_SHORT)
++ return false;
++
+ if ((dj_report->device_index < DJ_DEVICE_INDEX_MIN) ||
+ (dj_report->device_index > DJ_DEVICE_INDEX_MAX)) {
+- dev_err(&hdev->dev, "%s: invalid device index:%d\n",
++ /*
++ * Device index is wrong, bail out.
++ * This driver can ignore safely the receiver notifications,
++ * so ignore those reports too.
++ */
++ if (dj_report->device_index != DJ_RECEIVER_INDEX)
++ dev_err(&hdev->dev, "%s: invalid device index:%d\n",
+ __func__, dj_report->device_index);
+ return false;
+ }
+
+ spin_lock_irqsave(&djrcv_dev->lock, flags);
+- if (dj_report->report_id == REPORT_ID_DJ_SHORT) {
+- switch (dj_report->report_type) {
+- case REPORT_TYPE_NOTIF_DEVICE_PAIRED:
+- case REPORT_TYPE_NOTIF_DEVICE_UNPAIRED:
+- logi_dj_recv_queue_notification(djrcv_dev, dj_report);
+- break;
+- case REPORT_TYPE_NOTIF_CONNECTION_STATUS:
+- if (dj_report->report_params[CONNECTION_STATUS_PARAM_STATUS] ==
+- STATUS_LINKLOSS) {
+- logi_dj_recv_forward_null_report(djrcv_dev, dj_report);
+- }
+- break;
+- default:
+- logi_dj_recv_forward_report(djrcv_dev, dj_report);
++ switch (dj_report->report_type) {
++ case REPORT_TYPE_NOTIF_DEVICE_PAIRED:
++ case REPORT_TYPE_NOTIF_DEVICE_UNPAIRED:
++ logi_dj_recv_queue_notification(djrcv_dev, dj_report);
++ break;
++ case REPORT_TYPE_NOTIF_CONNECTION_STATUS:
++ if (dj_report->report_params[CONNECTION_STATUS_PARAM_STATUS] ==
++ STATUS_LINKLOSS) {
++ logi_dj_recv_forward_null_report(djrcv_dev, dj_report);
+ }
+- report_processed = true;
++ break;
++ default:
++ logi_dj_recv_forward_report(djrcv_dev, dj_report);
+ }
+ spin_unlock_irqrestore(&djrcv_dev->lock, flags);
+
+- return report_processed;
++ return true;
+ }
+
+ static int logi_dj_probe(struct hid_device *hdev,
+diff --git a/drivers/hid/hid-logitech-dj.h b/drivers/hid/hid-logitech-dj.h
+index 4a4000340ce1..daeb0aa4bee9 100644
+--- a/drivers/hid/hid-logitech-dj.h
++++ b/drivers/hid/hid-logitech-dj.h
+@@ -27,6 +27,7 @@
+
+ #define DJ_MAX_PAIRED_DEVICES 6
+ #define DJ_MAX_NUMBER_NOTIFICATIONS 8
++#define DJ_RECEIVER_INDEX 0
+ #define DJ_DEVICE_INDEX_MIN 1
+ #define DJ_DEVICE_INDEX_MAX 6
+
+diff --git a/drivers/hid/hid-magicmouse.c b/drivers/hid/hid-magicmouse.c
+index ecc2cbf300cc..29a74c1efcb8 100644
+--- a/drivers/hid/hid-magicmouse.c
++++ b/drivers/hid/hid-magicmouse.c
+@@ -290,6 +290,11 @@ static int magicmouse_raw_event(struct hid_device *hdev,
+ if (size < 4 || ((size - 4) % 9) != 0)
+ return 0;
+ npoints = (size - 4) / 9;
++ if (npoints > 15) {
++ hid_warn(hdev, "invalid size value (%d) for TRACKPAD_REPORT_ID\n",
++ size);
++ return 0;
++ }
+ msc->ntouches = 0;
+ for (ii = 0; ii < npoints; ii++)
+ magicmouse_emit_touch(msc, ii, data + ii * 9 + 4);
+@@ -307,6 +312,11 @@ static int magicmouse_raw_event(struct hid_device *hdev,
+ if (size < 6 || ((size - 6) % 8) != 0)
+ return 0;
+ npoints = (size - 6) / 8;
++ if (npoints > 15) {
++ hid_warn(hdev, "invalid size value (%d) for MOUSE_REPORT_ID\n",
++ size);
++ return 0;
++ }
+ msc->ntouches = 0;
+ for (ii = 0; ii < npoints; ii++)
+ magicmouse_emit_touch(msc, ii, data + ii * 8 + 6);
+diff --git a/drivers/hid/hid-picolcd_core.c b/drivers/hid/hid-picolcd_core.c
+index acbb021065ec..020df3c2e8b4 100644
+--- a/drivers/hid/hid-picolcd_core.c
++++ b/drivers/hid/hid-picolcd_core.c
+@@ -350,6 +350,12 @@ static int picolcd_raw_event(struct hid_device *hdev,
+ if (!data)
+ return 1;
+
++ if (size > 64) {
++ hid_warn(hdev, "invalid size value (%d) for picolcd raw event\n",
++ size);
++ return 0;
++ }
++
+ if (report->id == REPORT_KEY_STATE) {
+ if (data->input_keys)
+ ret = picolcd_raw_keypad(data, report, raw_data+1, size-1);
+diff --git a/drivers/hwmon/ds1621.c b/drivers/hwmon/ds1621.c
+index fc6f5d54e7f7..8890870309e4 100644
+--- a/drivers/hwmon/ds1621.c
++++ b/drivers/hwmon/ds1621.c
+@@ -309,6 +309,7 @@ static ssize_t set_convrate(struct device *dev, struct device_attribute *da,
+ data->conf |= (resol << DS1621_REG_CONFIG_RESOL_SHIFT);
+ i2c_smbus_write_byte_data(client, DS1621_REG_CONF, data->conf);
+ data->update_interval = ds1721_convrates[resol];
++ data->zbits = 7 - resol;
+ mutex_unlock(&data->update_lock);
+
+ return count;
+diff --git a/drivers/i2c/busses/i2c-at91.c b/drivers/i2c/busses/i2c-at91.c
+index 83c989382be9..e96edab2e30b 100644
+--- a/drivers/i2c/busses/i2c-at91.c
++++ b/drivers/i2c/busses/i2c-at91.c
+@@ -101,6 +101,7 @@ struct at91_twi_dev {
+ unsigned twi_cwgr_reg;
+ struct at91_twi_pdata *pdata;
+ bool use_dma;
++ bool recv_len_abort;
+ struct at91_twi_dma dma;
+ };
+
+@@ -267,12 +268,24 @@ static void at91_twi_read_next_byte(struct at91_twi_dev *dev)
+ *dev->buf = at91_twi_read(dev, AT91_TWI_RHR) & 0xff;
+ --dev->buf_len;
+
++ /* return if aborting, we only needed to read RHR to clear RXRDY*/
++ if (dev->recv_len_abort)
++ return;
++
+ /* handle I2C_SMBUS_BLOCK_DATA */
+ if (unlikely(dev->msg->flags & I2C_M_RECV_LEN)) {
+- dev->msg->flags &= ~I2C_M_RECV_LEN;
+- dev->buf_len += *dev->buf;
+- dev->msg->len = dev->buf_len + 1;
+- dev_dbg(dev->dev, "received block length %d\n", dev->buf_len);
++ /* ensure length byte is a valid value */
++ if (*dev->buf <= I2C_SMBUS_BLOCK_MAX && *dev->buf > 0) {
++ dev->msg->flags &= ~I2C_M_RECV_LEN;
++ dev->buf_len += *dev->buf;
++ dev->msg->len = dev->buf_len + 1;
++ dev_dbg(dev->dev, "received block length %d\n",
++ dev->buf_len);
++ } else {
++ /* abort and send the stop by reading one more byte */
++ dev->recv_len_abort = true;
++ dev->buf_len = 1;
++ }
+ }
+
+ /* send stop if second but last byte has been read */
+@@ -421,8 +434,8 @@ static int at91_do_twi_transfer(struct at91_twi_dev *dev)
+ }
+ }
+
+- ret = wait_for_completion_interruptible_timeout(&dev->cmd_complete,
+- dev->adapter.timeout);
++ ret = wait_for_completion_io_timeout(&dev->cmd_complete,
++ dev->adapter.timeout);
+ if (ret == 0) {
+ dev_err(dev->dev, "controller timed out\n");
+ at91_init_twi_bus(dev);
+@@ -444,6 +457,12 @@ static int at91_do_twi_transfer(struct at91_twi_dev *dev)
+ ret = -EIO;
+ goto error;
+ }
++ if (dev->recv_len_abort) {
++ dev_err(dev->dev, "invalid smbus block length recvd\n");
++ ret = -EPROTO;
++ goto error;
++ }
++
+ dev_dbg(dev->dev, "transfer complete\n");
+
+ return 0;
+@@ -500,6 +519,7 @@ static int at91_twi_xfer(struct i2c_adapter *adap, struct i2c_msg *msg, int num)
+ dev->buf_len = m_start->len;
+ dev->buf = m_start->buf;
+ dev->msg = m_start;
++ dev->recv_len_abort = false;
+
+ ret = at91_do_twi_transfer(dev);
+
+diff --git a/drivers/i2c/busses/i2c-ismt.c b/drivers/i2c/busses/i2c-ismt.c
+index 984492553e95..d9ee43c80cde 100644
+--- a/drivers/i2c/busses/i2c-ismt.c
++++ b/drivers/i2c/busses/i2c-ismt.c
+@@ -497,7 +497,7 @@ static int ismt_access(struct i2c_adapter *adap, u16 addr,
+ desc->wr_len_cmd = dma_size;
+ desc->control |= ISMT_DESC_BLK;
+ priv->dma_buffer[0] = command;
+- memcpy(&priv->dma_buffer[1], &data->block[1], dma_size);
++ memcpy(&priv->dma_buffer[1], &data->block[1], dma_size - 1);
+ } else {
+ /* Block Read */
+ dev_dbg(dev, "I2C_SMBUS_BLOCK_DATA: READ\n");
+@@ -525,7 +525,7 @@ static int ismt_access(struct i2c_adapter *adap, u16 addr,
+ desc->wr_len_cmd = dma_size;
+ desc->control |= ISMT_DESC_I2C;
+ priv->dma_buffer[0] = command;
+- memcpy(&priv->dma_buffer[1], &data->block[1], dma_size);
++ memcpy(&priv->dma_buffer[1], &data->block[1], dma_size - 1);
+ } else {
+ /* i2c Block Read */
+ dev_dbg(dev, "I2C_SMBUS_I2C_BLOCK_DATA: READ\n");
+diff --git a/drivers/i2c/busses/i2c-mv64xxx.c b/drivers/i2c/busses/i2c-mv64xxx.c
+index 9f4b775e2e39..e21e206d94e7 100644
+--- a/drivers/i2c/busses/i2c-mv64xxx.c
++++ b/drivers/i2c/busses/i2c-mv64xxx.c
+@@ -746,8 +746,7 @@ mv64xxx_of_config(struct mv64xxx_i2c_data *drv_data,
+ }
+ tclk = clk_get_rate(drv_data->clk);
+
+- rc = of_property_read_u32(np, "clock-frequency", &bus_freq);
+- if (rc)
++ if (of_property_read_u32(np, "clock-frequency", &bus_freq))
+ bus_freq = 100000; /* 100kHz by default */
+
+ if (!mv64xxx_find_baud_factors(bus_freq, tclk,
+diff --git a/drivers/i2c/busses/i2c-rcar.c b/drivers/i2c/busses/i2c-rcar.c
+index 899405923678..772d76ad036f 100644
+--- a/drivers/i2c/busses/i2c-rcar.c
++++ b/drivers/i2c/busses/i2c-rcar.c
+@@ -34,6 +34,7 @@
+ #include <linux/platform_device.h>
+ #include <linux/pm_runtime.h>
+ #include <linux/slab.h>
++#include <linux/spinlock.h>
+
+ /* register offsets */
+ #define ICSCR 0x00 /* slave ctrl */
+@@ -75,8 +76,8 @@
+ #define RCAR_IRQ_RECV (MNR | MAL | MST | MAT | MDR)
+ #define RCAR_IRQ_STOP (MST)
+
+-#define RCAR_IRQ_ACK_SEND (~(MAT | MDE))
+-#define RCAR_IRQ_ACK_RECV (~(MAT | MDR))
++#define RCAR_IRQ_ACK_SEND (~(MAT | MDE) & 0xFF)
++#define RCAR_IRQ_ACK_RECV (~(MAT | MDR) & 0xFF)
+
+ #define ID_LAST_MSG (1 << 0)
+ #define ID_IOERROR (1 << 1)
+@@ -95,6 +96,7 @@ struct rcar_i2c_priv {
+ struct i2c_msg *msg;
+ struct clk *clk;
+
++ spinlock_t lock;
+ wait_queue_head_t wait;
+
+ int pos;
+@@ -365,20 +367,20 @@ static irqreturn_t rcar_i2c_irq(int irq, void *ptr)
+ struct rcar_i2c_priv *priv = ptr;
+ u32 msr;
+
++ /*-------------- spin lock -----------------*/
++ spin_lock(&priv->lock);
++
+ msr = rcar_i2c_read(priv, ICMSR);
+
++ /* Only handle interrupts that are currently enabled */
++ msr &= rcar_i2c_read(priv, ICMIER);
++
+ /* Arbitration lost */
+ if (msr & MAL) {
+ rcar_i2c_flags_set(priv, (ID_DONE | ID_ARBLOST));
+ goto out;
+ }
+
+- /* Stop */
+- if (msr & MST) {
+- rcar_i2c_flags_set(priv, ID_DONE);
+- goto out;
+- }
+-
+ /* Nack */
+ if (msr & MNR) {
+ /* go to stop phase */
+@@ -388,6 +390,12 @@ static irqreturn_t rcar_i2c_irq(int irq, void *ptr)
+ goto out;
+ }
+
++ /* Stop */
++ if (msr & MST) {
++ rcar_i2c_flags_set(priv, ID_DONE);
++ goto out;
++ }
++
+ if (rcar_i2c_is_recv(priv))
+ rcar_i2c_flags_set(priv, rcar_i2c_irq_recv(priv, msr));
+ else
+@@ -400,6 +408,9 @@ out:
+ wake_up(&priv->wait);
+ }
+
++ spin_unlock(&priv->lock);
++ /*-------------- spin unlock -----------------*/
++
+ return IRQ_HANDLED;
+ }
+
+@@ -409,14 +420,21 @@ static int rcar_i2c_master_xfer(struct i2c_adapter *adap,
+ {
+ struct rcar_i2c_priv *priv = i2c_get_adapdata(adap);
+ struct device *dev = rcar_i2c_priv_to_dev(priv);
++ unsigned long flags;
+ int i, ret, timeout;
+
+ pm_runtime_get_sync(dev);
+
++ /*-------------- spin lock -----------------*/
++ spin_lock_irqsave(&priv->lock, flags);
++
+ rcar_i2c_init(priv);
+ /* start clock */
+ rcar_i2c_write(priv, ICCCR, priv->icccr);
+
++ spin_unlock_irqrestore(&priv->lock, flags);
++ /*-------------- spin unlock -----------------*/
++
+ ret = rcar_i2c_bus_barrier(priv);
+ if (ret < 0)
+ goto out;
+@@ -428,6 +446,9 @@ static int rcar_i2c_master_xfer(struct i2c_adapter *adap,
+ break;
+ }
+
++ /*-------------- spin lock -----------------*/
++ spin_lock_irqsave(&priv->lock, flags);
++
+ /* init each data */
+ priv->msg = &msgs[i];
+ priv->pos = 0;
+@@ -437,6 +458,9 @@ static int rcar_i2c_master_xfer(struct i2c_adapter *adap,
+
+ ret = rcar_i2c_prepare_msg(priv);
+
++ spin_unlock_irqrestore(&priv->lock, flags);
++ /*-------------- spin unlock -----------------*/
++
+ if (ret < 0)
+ break;
+
+@@ -540,6 +564,7 @@ static int rcar_i2c_probe(struct platform_device *pdev)
+
+ irq = platform_get_irq(pdev, 0);
+ init_waitqueue_head(&priv->wait);
++ spin_lock_init(&priv->lock);
+
+ adap = &priv->adap;
+ adap->nr = pdev->id;
+diff --git a/drivers/i2c/busses/i2c-rk3x.c b/drivers/i2c/busses/i2c-rk3x.c
+index 69e11853e8bf..93cfc837200b 100644
+--- a/drivers/i2c/busses/i2c-rk3x.c
++++ b/drivers/i2c/busses/i2c-rk3x.c
+@@ -323,6 +323,10 @@ static void rk3x_i2c_handle_read(struct rk3x_i2c *i2c, unsigned int ipd)
+ /* ack interrupt */
+ i2c_writel(i2c, REG_INT_MBRF, REG_IPD);
+
++ /* Can only handle a maximum of 32 bytes at a time */
++ if (len > 32)
++ len = 32;
++
+ /* read the data from receive buffer */
+ for (i = 0; i < len; ++i) {
+ if (i % 4 == 0)
+@@ -429,12 +433,11 @@ static void rk3x_i2c_set_scl_rate(struct rk3x_i2c *i2c, unsigned long scl_rate)
+ unsigned long i2c_rate = clk_get_rate(i2c->clk);
+ unsigned int div;
+
+- /* SCL rate = (clk rate) / (8 * DIV) */
+- div = DIV_ROUND_UP(i2c_rate, scl_rate * 8);
+-
+- /* The lower and upper half of the CLKDIV reg describe the length of
+- * SCL low & high periods. */
+- div = DIV_ROUND_UP(div, 2);
++ /* set DIV = DIVH = DIVL
++ * SCL rate = (clk rate) / (8 * (DIVH + 1 + DIVL + 1))
++ * = (clk rate) / (16 * (DIV + 1))
++ */
++ div = DIV_ROUND_UP(i2c_rate, scl_rate * 16) - 1;
+
+ i2c_writel(i2c, (div << 16) | (div & 0xffff), REG_CLKDIV);
+ }
+diff --git a/drivers/iio/accel/bma180.c b/drivers/iio/accel/bma180.c
+index a077cc86421b..19100fddd2ed 100644
+--- a/drivers/iio/accel/bma180.c
++++ b/drivers/iio/accel/bma180.c
+@@ -571,7 +571,7 @@ static int bma180_probe(struct i2c_client *client,
+ trig->ops = &bma180_trigger_ops;
+ iio_trigger_set_drvdata(trig, indio_dev);
+ data->trig = trig;
+- indio_dev->trig = trig;
++ indio_dev->trig = iio_trigger_get(trig);
+
+ ret = iio_trigger_register(trig);
+ if (ret)
+diff --git a/drivers/iio/adc/ad_sigma_delta.c b/drivers/iio/adc/ad_sigma_delta.c
+index 9a4e0e32a771..eb799a43aef0 100644
+--- a/drivers/iio/adc/ad_sigma_delta.c
++++ b/drivers/iio/adc/ad_sigma_delta.c
+@@ -472,7 +472,7 @@ static int ad_sd_probe_trigger(struct iio_dev *indio_dev)
+ goto error_free_irq;
+
+ /* select default trigger */
+- indio_dev->trig = sigma_delta->trig;
++ indio_dev->trig = iio_trigger_get(sigma_delta->trig);
+
+ return 0;
+
+diff --git a/drivers/iio/adc/at91_adc.c b/drivers/iio/adc/at91_adc.c
+index 2b6a9ce9927c..f508bd6b46e3 100644
+--- a/drivers/iio/adc/at91_adc.c
++++ b/drivers/iio/adc/at91_adc.c
+@@ -196,6 +196,7 @@ struct at91_adc_state {
+ bool done;
+ int irq;
+ u16 last_value;
++ int chnb;
+ struct mutex lock;
+ u8 num_channels;
+ void __iomem *reg_base;
+@@ -274,7 +275,7 @@ void handle_adc_eoc_trigger(int irq, struct iio_dev *idev)
+ disable_irq_nosync(irq);
+ iio_trigger_poll(idev->trig, iio_get_time_ns());
+ } else {
+- st->last_value = at91_adc_readl(st, AT91_ADC_LCDR);
++ st->last_value = at91_adc_readl(st, AT91_ADC_CHAN(st, st->chnb));
+ st->done = true;
+ wake_up_interruptible(&st->wq_data_avail);
+ }
+@@ -351,7 +352,7 @@ static irqreturn_t at91_adc_rl_interrupt(int irq, void *private)
+ unsigned int reg;
+
+ status &= at91_adc_readl(st, AT91_ADC_IMR);
+- if (status & st->registers->drdy_mask)
++ if (status & GENMASK(st->num_channels - 1, 0))
+ handle_adc_eoc_trigger(irq, idev);
+
+ if (status & AT91RL_ADC_IER_PEN) {
+@@ -418,7 +419,7 @@ static irqreturn_t at91_adc_9x5_interrupt(int irq, void *private)
+ AT91_ADC_IER_YRDY |
+ AT91_ADC_IER_PRDY;
+
+- if (status & st->registers->drdy_mask)
++ if (status & GENMASK(st->num_channels - 1, 0))
+ handle_adc_eoc_trigger(irq, idev);
+
+ if (status & AT91_ADC_IER_PEN) {
+@@ -689,9 +690,10 @@ static int at91_adc_read_raw(struct iio_dev *idev,
+ case IIO_CHAN_INFO_RAW:
+ mutex_lock(&st->lock);
+
++ st->chnb = chan->channel;
+ at91_adc_writel(st, AT91_ADC_CHER,
+ AT91_ADC_CH(chan->channel));
+- at91_adc_writel(st, AT91_ADC_IER, st->registers->drdy_mask);
++ at91_adc_writel(st, AT91_ADC_IER, BIT(chan->channel));
+ at91_adc_writel(st, AT91_ADC_CR, AT91_ADC_START);
+
+ ret = wait_event_interruptible_timeout(st->wq_data_avail,
+@@ -708,7 +710,7 @@ static int at91_adc_read_raw(struct iio_dev *idev,
+
+ at91_adc_writel(st, AT91_ADC_CHDR,
+ AT91_ADC_CH(chan->channel));
+- at91_adc_writel(st, AT91_ADC_IDR, st->registers->drdy_mask);
++ at91_adc_writel(st, AT91_ADC_IDR, BIT(chan->channel));
+
+ st->last_value = 0;
+ st->done = false;
+diff --git a/drivers/iio/adc/xilinx-xadc-core.c b/drivers/iio/adc/xilinx-xadc-core.c
+index ab52be29141b..41d3a5efd62c 100644
+--- a/drivers/iio/adc/xilinx-xadc-core.c
++++ b/drivers/iio/adc/xilinx-xadc-core.c
+@@ -1126,7 +1126,7 @@ static int xadc_parse_dt(struct iio_dev *indio_dev, struct device_node *np,
+ chan->address = XADC_REG_VPVN;
+ } else {
+ chan->scan_index = 15 + reg;
+- chan->scan_index = XADC_REG_VAUX(reg - 1);
++ chan->address = XADC_REG_VAUX(reg - 1);
+ }
+ num_channels++;
+ chan++;
+diff --git a/drivers/iio/common/hid-sensors/hid-sensor-trigger.c b/drivers/iio/common/hid-sensors/hid-sensor-trigger.c
+index a3109a6f4d86..92068cdbf8c7 100644
+--- a/drivers/iio/common/hid-sensors/hid-sensor-trigger.c
++++ b/drivers/iio/common/hid-sensors/hid-sensor-trigger.c
+@@ -122,7 +122,8 @@ int hid_sensor_setup_trigger(struct iio_dev *indio_dev, const char *name,
+ dev_err(&indio_dev->dev, "Trigger Register Failed\n");
+ goto error_free_trig;
+ }
+- indio_dev->trig = attrb->trigger = trig;
++ attrb->trigger = trig;
++ indio_dev->trig = iio_trigger_get(trig);
+
+ return ret;
+
+diff --git a/drivers/iio/common/st_sensors/st_sensors_trigger.c b/drivers/iio/common/st_sensors/st_sensors_trigger.c
+index 8fc3a97eb266..8d8ca6f1e16a 100644
+--- a/drivers/iio/common/st_sensors/st_sensors_trigger.c
++++ b/drivers/iio/common/st_sensors/st_sensors_trigger.c
+@@ -49,7 +49,7 @@ int st_sensors_allocate_trigger(struct iio_dev *indio_dev,
+ dev_err(&indio_dev->dev, "failed to register iio trigger.\n");
+ goto iio_trigger_register_error;
+ }
+- indio_dev->trig = sdata->trig;
++ indio_dev->trig = iio_trigger_get(sdata->trig);
+
+ return 0;
+
+diff --git a/drivers/iio/gyro/itg3200_buffer.c b/drivers/iio/gyro/itg3200_buffer.c
+index e3b3c5084070..eef50e91f17c 100644
+--- a/drivers/iio/gyro/itg3200_buffer.c
++++ b/drivers/iio/gyro/itg3200_buffer.c
+@@ -132,7 +132,7 @@ int itg3200_probe_trigger(struct iio_dev *indio_dev)
+ goto error_free_irq;
+
+ /* select default trigger */
+- indio_dev->trig = st->trig;
++ indio_dev->trig = iio_trigger_get(st->trig);
+
+ return 0;
+
+diff --git a/drivers/iio/imu/inv_mpu6050/inv_mpu_trigger.c b/drivers/iio/imu/inv_mpu6050/inv_mpu_trigger.c
+index 03b9372c1212..926fccea8de0 100644
+--- a/drivers/iio/imu/inv_mpu6050/inv_mpu_trigger.c
++++ b/drivers/iio/imu/inv_mpu6050/inv_mpu_trigger.c
+@@ -135,7 +135,7 @@ int inv_mpu6050_probe_trigger(struct iio_dev *indio_dev)
+ ret = iio_trigger_register(st->trig);
+ if (ret)
+ goto error_free_irq;
+- indio_dev->trig = st->trig;
++ indio_dev->trig = iio_trigger_get(st->trig);
+
+ return 0;
+
+diff --git a/drivers/iio/inkern.c b/drivers/iio/inkern.c
+index c7497009d60a..f0846108d006 100644
+--- a/drivers/iio/inkern.c
++++ b/drivers/iio/inkern.c
+@@ -178,7 +178,7 @@ static struct iio_channel *of_iio_channel_get_by_name(struct device_node *np,
+ index = of_property_match_string(np, "io-channel-names",
+ name);
+ chan = of_iio_channel_get(np, index);
+- if (!IS_ERR(chan))
++ if (!IS_ERR(chan) || PTR_ERR(chan) == -EPROBE_DEFER)
+ break;
+ else if (name && index >= 0) {
+ pr_err("ERROR: could not get IIO channel %s:%s(%i)\n",
+diff --git a/drivers/iio/magnetometer/st_magn_core.c b/drivers/iio/magnetometer/st_magn_core.c
+index 240a21dd0c61..4d55151893af 100644
+--- a/drivers/iio/magnetometer/st_magn_core.c
++++ b/drivers/iio/magnetometer/st_magn_core.c
+@@ -42,7 +42,8 @@
+ #define ST_MAGN_FS_AVL_5600MG 5600
+ #define ST_MAGN_FS_AVL_8000MG 8000
+ #define ST_MAGN_FS_AVL_8100MG 8100
+-#define ST_MAGN_FS_AVL_10000MG 10000
++#define ST_MAGN_FS_AVL_12000MG 12000
++#define ST_MAGN_FS_AVL_16000MG 16000
+
+ /* CUSTOM VALUES FOR SENSOR 1 */
+ #define ST_MAGN_1_WAI_EXP 0x3c
+@@ -69,20 +70,20 @@
+ #define ST_MAGN_1_FS_AVL_4700_VAL 0x05
+ #define ST_MAGN_1_FS_AVL_5600_VAL 0x06
+ #define ST_MAGN_1_FS_AVL_8100_VAL 0x07
+-#define ST_MAGN_1_FS_AVL_1300_GAIN_XY 1100
+-#define ST_MAGN_1_FS_AVL_1900_GAIN_XY 855
+-#define ST_MAGN_1_FS_AVL_2500_GAIN_XY 670
+-#define ST_MAGN_1_FS_AVL_4000_GAIN_XY 450
+-#define ST_MAGN_1_FS_AVL_4700_GAIN_XY 400
+-#define ST_MAGN_1_FS_AVL_5600_GAIN_XY 330
+-#define ST_MAGN_1_FS_AVL_8100_GAIN_XY 230
+-#define ST_MAGN_1_FS_AVL_1300_GAIN_Z 980
+-#define ST_MAGN_1_FS_AVL_1900_GAIN_Z 760
+-#define ST_MAGN_1_FS_AVL_2500_GAIN_Z 600
+-#define ST_MAGN_1_FS_AVL_4000_GAIN_Z 400
+-#define ST_MAGN_1_FS_AVL_4700_GAIN_Z 355
+-#define ST_MAGN_1_FS_AVL_5600_GAIN_Z 295
+-#define ST_MAGN_1_FS_AVL_8100_GAIN_Z 205
++#define ST_MAGN_1_FS_AVL_1300_GAIN_XY 909
++#define ST_MAGN_1_FS_AVL_1900_GAIN_XY 1169
++#define ST_MAGN_1_FS_AVL_2500_GAIN_XY 1492
++#define ST_MAGN_1_FS_AVL_4000_GAIN_XY 2222
++#define ST_MAGN_1_FS_AVL_4700_GAIN_XY 2500
++#define ST_MAGN_1_FS_AVL_5600_GAIN_XY 3030
++#define ST_MAGN_1_FS_AVL_8100_GAIN_XY 4347
++#define ST_MAGN_1_FS_AVL_1300_GAIN_Z 1020
++#define ST_MAGN_1_FS_AVL_1900_GAIN_Z 1315
++#define ST_MAGN_1_FS_AVL_2500_GAIN_Z 1666
++#define ST_MAGN_1_FS_AVL_4000_GAIN_Z 2500
++#define ST_MAGN_1_FS_AVL_4700_GAIN_Z 2816
++#define ST_MAGN_1_FS_AVL_5600_GAIN_Z 3389
++#define ST_MAGN_1_FS_AVL_8100_GAIN_Z 4878
+ #define ST_MAGN_1_MULTIREAD_BIT false
+
+ /* CUSTOM VALUES FOR SENSOR 2 */
+@@ -105,10 +106,12 @@
+ #define ST_MAGN_2_FS_MASK 0x60
+ #define ST_MAGN_2_FS_AVL_4000_VAL 0x00
+ #define ST_MAGN_2_FS_AVL_8000_VAL 0x01
+-#define ST_MAGN_2_FS_AVL_10000_VAL 0x02
+-#define ST_MAGN_2_FS_AVL_4000_GAIN 430
+-#define ST_MAGN_2_FS_AVL_8000_GAIN 230
+-#define ST_MAGN_2_FS_AVL_10000_GAIN 230
++#define ST_MAGN_2_FS_AVL_12000_VAL 0x02
++#define ST_MAGN_2_FS_AVL_16000_VAL 0x03
++#define ST_MAGN_2_FS_AVL_4000_GAIN 146
++#define ST_MAGN_2_FS_AVL_8000_GAIN 292
++#define ST_MAGN_2_FS_AVL_12000_GAIN 438
++#define ST_MAGN_2_FS_AVL_16000_GAIN 584
+ #define ST_MAGN_2_MULTIREAD_BIT false
+ #define ST_MAGN_2_OUT_X_L_ADDR 0x28
+ #define ST_MAGN_2_OUT_Y_L_ADDR 0x2a
+@@ -266,9 +269,14 @@ static const struct st_sensors st_magn_sensors[] = {
+ .gain = ST_MAGN_2_FS_AVL_8000_GAIN,
+ },
+ [2] = {
+- .num = ST_MAGN_FS_AVL_10000MG,
+- .value = ST_MAGN_2_FS_AVL_10000_VAL,
+- .gain = ST_MAGN_2_FS_AVL_10000_GAIN,
++ .num = ST_MAGN_FS_AVL_12000MG,
++ .value = ST_MAGN_2_FS_AVL_12000_VAL,
++ .gain = ST_MAGN_2_FS_AVL_12000_GAIN,
++ },
++ [3] = {
++ .num = ST_MAGN_FS_AVL_16000MG,
++ .value = ST_MAGN_2_FS_AVL_16000_VAL,
++ .gain = ST_MAGN_2_FS_AVL_16000_GAIN,
+ },
+ },
+ },
+diff --git a/drivers/infiniband/core/uverbs_marshall.c b/drivers/infiniband/core/uverbs_marshall.c
+index e7bee46868d1..abd97247443e 100644
+--- a/drivers/infiniband/core/uverbs_marshall.c
++++ b/drivers/infiniband/core/uverbs_marshall.c
+@@ -140,5 +140,9 @@ void ib_copy_path_rec_from_user(struct ib_sa_path_rec *dst,
+ dst->packet_life_time = src->packet_life_time;
+ dst->preference = src->preference;
+ dst->packet_life_time_selector = src->packet_life_time_selector;
++
++ memset(dst->smac, 0, sizeof(dst->smac));
++ memset(dst->dmac, 0, sizeof(dst->dmac));
++ dst->vlan_id = 0xffff;
+ }
+ EXPORT_SYMBOL(ib_copy_path_rec_from_user);
+diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c
+index 0f7027e7db13..91eeb5edff80 100644
+--- a/drivers/infiniband/hw/mlx4/main.c
++++ b/drivers/infiniband/hw/mlx4/main.c
+@@ -1678,6 +1678,7 @@ static void mlx4_ib_get_dev_addr(struct net_device *dev,
+ struct inet6_dev *in6_dev;
+ union ib_gid *pgid;
+ struct inet6_ifaddr *ifp;
++ union ib_gid default_gid;
+ #endif
+ union ib_gid gid;
+
+@@ -1698,12 +1699,15 @@ static void mlx4_ib_get_dev_addr(struct net_device *dev,
+ in_dev_put(in_dev);
+ }
+ #if IS_ENABLED(CONFIG_IPV6)
++ mlx4_make_default_gid(dev, &default_gid);
+ /* IPv6 gids */
+ in6_dev = in6_dev_get(dev);
+ if (in6_dev) {
+ read_lock_bh(&in6_dev->lock);
+ list_for_each_entry(ifp, &in6_dev->addr_list, if_list) {
+ pgid = (union ib_gid *)&ifp->addr;
++ if (!memcmp(pgid, &default_gid, sizeof(*pgid)))
++ continue;
+ update_gid_table(ibdev, port, pgid, 0, 0);
+ }
+ read_unlock_bh(&in6_dev->lock);
+@@ -1788,31 +1792,34 @@ static void mlx4_ib_scan_netdevs(struct mlx4_ib_dev *ibdev,
+ port_state = (netif_running(curr_netdev) && netif_carrier_ok(curr_netdev)) ?
+ IB_PORT_ACTIVE : IB_PORT_DOWN;
+ mlx4_ib_set_default_gid(ibdev, curr_netdev, port);
+- } else {
+- reset_gid_table(ibdev, port);
+- }
+- /* if using bonding/team and a slave port is down, we don't the bond IP
+- * based gids in the table since flows that select port by gid may get
+- * the down port.
+- */
+- if (curr_master && (port_state == IB_PORT_DOWN)) {
+- reset_gid_table(ibdev, port);
+- mlx4_ib_set_default_gid(ibdev, curr_netdev, port);
+- }
+- /* if bonding is used it is possible that we add it to masters
+- * only after IP address is assigned to the net bonding
+- * interface.
+- */
+- if (curr_master && (old_master != curr_master)) {
+- reset_gid_table(ibdev, port);
+- mlx4_ib_set_default_gid(ibdev, curr_netdev, port);
+- mlx4_ib_get_dev_addr(curr_master, ibdev, port);
+- }
++ /* if using bonding/team and a slave port is down, we
++ * don't the bond IP based gids in the table since
++ * flows that select port by gid may get the down port.
++ */
++ if (curr_master && (port_state == IB_PORT_DOWN)) {
++ reset_gid_table(ibdev, port);
++ mlx4_ib_set_default_gid(ibdev,
++ curr_netdev, port);
++ }
++ /* if bonding is used it is possible that we add it to
++ * masters only after IP address is assigned to the
++ * net bonding interface.
++ */
++ if (curr_master && (old_master != curr_master)) {
++ reset_gid_table(ibdev, port);
++ mlx4_ib_set_default_gid(ibdev,
++ curr_netdev, port);
++ mlx4_ib_get_dev_addr(curr_master, ibdev, port);
++ }
+
+- if (!curr_master && (old_master != curr_master)) {
++ if (!curr_master && (old_master != curr_master)) {
++ reset_gid_table(ibdev, port);
++ mlx4_ib_set_default_gid(ibdev,
++ curr_netdev, port);
++ mlx4_ib_get_dev_addr(curr_netdev, ibdev, port);
++ }
++ } else {
+ reset_gid_table(ibdev, port);
+- mlx4_ib_set_default_gid(ibdev, curr_netdev, port);
+- mlx4_ib_get_dev_addr(curr_netdev, ibdev, port);
+ }
+ }
+
+diff --git a/drivers/infiniband/hw/qib/qib_debugfs.c b/drivers/infiniband/hw/qib/qib_debugfs.c
+index 799a0c3bffc4..6abd3ed3cd51 100644
+--- a/drivers/infiniband/hw/qib/qib_debugfs.c
++++ b/drivers/infiniband/hw/qib/qib_debugfs.c
+@@ -193,6 +193,7 @@ static void *_qp_stats_seq_start(struct seq_file *s, loff_t *pos)
+ struct qib_qp_iter *iter;
+ loff_t n = *pos;
+
++ rcu_read_lock();
+ iter = qib_qp_iter_init(s->private);
+ if (!iter)
+ return NULL;
+@@ -224,7 +225,7 @@ static void *_qp_stats_seq_next(struct seq_file *s, void *iter_ptr,
+
+ static void _qp_stats_seq_stop(struct seq_file *s, void *iter_ptr)
+ {
+- /* nothing for now */
++ rcu_read_unlock();
+ }
+
+ static int _qp_stats_seq_show(struct seq_file *s, void *iter_ptr)
+diff --git a/drivers/infiniband/hw/qib/qib_qp.c b/drivers/infiniband/hw/qib/qib_qp.c
+index 7fcc150d603c..6ddc0264aad2 100644
+--- a/drivers/infiniband/hw/qib/qib_qp.c
++++ b/drivers/infiniband/hw/qib/qib_qp.c
+@@ -1325,7 +1325,6 @@ int qib_qp_iter_next(struct qib_qp_iter *iter)
+ struct qib_qp *pqp = iter->qp;
+ struct qib_qp *qp;
+
+- rcu_read_lock();
+ for (; n < dev->qp_table_size; n++) {
+ if (pqp)
+ qp = rcu_dereference(pqp->next);
+@@ -1333,18 +1332,11 @@ int qib_qp_iter_next(struct qib_qp_iter *iter)
+ qp = rcu_dereference(dev->qp_table[n]);
+ pqp = qp;
+ if (qp) {
+- if (iter->qp)
+- atomic_dec(&iter->qp->refcount);
+- atomic_inc(&qp->refcount);
+- rcu_read_unlock();
+ iter->qp = qp;
+ iter->n = n;
+ return 0;
+ }
+ }
+- rcu_read_unlock();
+- if (iter->qp)
+- atomic_dec(&iter->qp->refcount);
+ return ret;
+ }
+
+diff --git a/drivers/infiniband/ulp/isert/ib_isert.c b/drivers/infiniband/ulp/isert/ib_isert.c
+index d4c7928a0f36..9959cd1faad9 100644
+--- a/drivers/infiniband/ulp/isert/ib_isert.c
++++ b/drivers/infiniband/ulp/isert/ib_isert.c
+@@ -586,7 +586,6 @@ isert_connect_request(struct rdma_cm_id *cma_id, struct rdma_cm_event *event)
+ init_completion(&isert_conn->conn_wait);
+ init_completion(&isert_conn->conn_wait_comp_err);
+ kref_init(&isert_conn->conn_kref);
+- kref_get(&isert_conn->conn_kref);
+ mutex_init(&isert_conn->conn_mutex);
+ spin_lock_init(&isert_conn->conn_lock);
+ INIT_LIST_HEAD(&isert_conn->conn_fr_pool);
+@@ -746,7 +745,9 @@ isert_connect_release(struct isert_conn *isert_conn)
+ static void
+ isert_connected_handler(struct rdma_cm_id *cma_id)
+ {
+- return;
++ struct isert_conn *isert_conn = cma_id->context;
++
++ kref_get(&isert_conn->conn_kref);
+ }
+
+ static void
+@@ -798,7 +799,6 @@ isert_disconnect_work(struct work_struct *work)
+
+ wake_up:
+ complete(&isert_conn->conn_wait);
+- isert_put_conn(isert_conn);
+ }
+
+ static void
+@@ -3234,6 +3234,7 @@ static void isert_wait_conn(struct iscsi_conn *conn)
+ wait_for_completion(&isert_conn->conn_wait_comp_err);
+
+ wait_for_completion(&isert_conn->conn_wait);
++ isert_put_conn(isert_conn);
+ }
+
+ static void isert_free_conn(struct iscsi_conn *conn)
+diff --git a/drivers/input/keyboard/atkbd.c b/drivers/input/keyboard/atkbd.c
+index 2dd1d0dd4f7d..6f5d79569136 100644
+--- a/drivers/input/keyboard/atkbd.c
++++ b/drivers/input/keyboard/atkbd.c
+@@ -1791,14 +1791,6 @@ static const struct dmi_system_id atkbd_dmi_quirk_table[] __initconst = {
+ {
+ .matches = {
+ DMI_MATCH(DMI_SYS_VENDOR, "LG Electronics"),
+- DMI_MATCH(DMI_PRODUCT_NAME, "LW25-B7HV"),
+- },
+- .callback = atkbd_deactivate_fixup,
+- },
+- {
+- .matches = {
+- DMI_MATCH(DMI_SYS_VENDOR, "LG Electronics"),
+- DMI_MATCH(DMI_PRODUCT_NAME, "P1-J273B"),
+ },
+ .callback = atkbd_deactivate_fixup,
+ },
+diff --git a/drivers/input/mouse/elantech.c b/drivers/input/mouse/elantech.c
+index ee2a04d90d20..0ec186d256fb 100644
+--- a/drivers/input/mouse/elantech.c
++++ b/drivers/input/mouse/elantech.c
+@@ -1253,6 +1253,13 @@ static bool elantech_is_signature_valid(const unsigned char *param)
+ if (param[1] == 0)
+ return true;
+
++ /*
++ * Some models have a revision higher than 20. Meaning param[2] may
++ * be 10 or 20, skip the rates check for these.
++ */
++ if (param[0] == 0x46 && (param[1] & 0xef) == 0x0f && param[2] < 40)
++ return true;
++
+ for (i = 0; i < ARRAY_SIZE(rates); i++)
+ if (param[2] == rates[i])
+ return false;
+diff --git a/drivers/input/mouse/synaptics.c b/drivers/input/mouse/synaptics.c
+index ef9e0b8a9aa7..a50a2a7a43f7 100644
+--- a/drivers/input/mouse/synaptics.c
++++ b/drivers/input/mouse/synaptics.c
+@@ -626,10 +626,61 @@ static int synaptics_parse_hw_state(const unsigned char buf[],
+ ((buf[0] & 0x04) >> 1) |
+ ((buf[3] & 0x04) >> 2));
+
++ if ((SYN_CAP_ADV_GESTURE(priv->ext_cap_0c) ||
++ SYN_CAP_IMAGE_SENSOR(priv->ext_cap_0c)) &&
++ hw->w == 2) {
++ synaptics_parse_agm(buf, priv, hw);
++ return 1;
++ }
++
++ hw->x = (((buf[3] & 0x10) << 8) |
++ ((buf[1] & 0x0f) << 8) |
++ buf[4]);
++ hw->y = (((buf[3] & 0x20) << 7) |
++ ((buf[1] & 0xf0) << 4) |
++ buf[5]);
++ hw->z = buf[2];
++
+ hw->left = (buf[0] & 0x01) ? 1 : 0;
+ hw->right = (buf[0] & 0x02) ? 1 : 0;
+
+- if (SYN_CAP_CLICKPAD(priv->ext_cap_0c)) {
++ if (SYN_CAP_FORCEPAD(priv->ext_cap_0c)) {
++ /*
++ * ForcePads, like Clickpads, use middle button
++ * bits to report primary button clicks.
++ * Unfortunately they report primary button not
++ * only when user presses on the pad above certain
++ * threshold, but also when there are more than one
++ * finger on the touchpad, which interferes with
++ * our multi-finger gestures.
++ */
++ if (hw->z == 0) {
++ /* No contacts */
++ priv->press = priv->report_press = false;
++ } else if (hw->w >= 4 && ((buf[0] ^ buf[3]) & 0x01)) {
++ /*
++ * Single-finger touch with pressure above
++ * the threshold. If pressure stays long
++ * enough, we'll start reporting primary
++ * button. We rely on the device continuing
++ * sending data even if finger does not
++ * move.
++ */
++ if (!priv->press) {
++ priv->press_start = jiffies;
++ priv->press = true;
++ } else if (time_after(jiffies,
++ priv->press_start +
++ msecs_to_jiffies(50))) {
++ priv->report_press = true;
++ }
++ } else {
++ priv->press = false;
++ }
++
++ hw->left = priv->report_press;
++
++ } else if (SYN_CAP_CLICKPAD(priv->ext_cap_0c)) {
+ /*
+ * Clickpad's button is transmitted as middle button,
+ * however, since it is primary button, we will report
+@@ -648,21 +699,6 @@ static int synaptics_parse_hw_state(const unsigned char buf[],
+ hw->down = ((buf[0] ^ buf[3]) & 0x02) ? 1 : 0;
+ }
+
+- if ((SYN_CAP_ADV_GESTURE(priv->ext_cap_0c) ||
+- SYN_CAP_IMAGE_SENSOR(priv->ext_cap_0c)) &&
+- hw->w == 2) {
+- synaptics_parse_agm(buf, priv, hw);
+- return 1;
+- }
+-
+- hw->x = (((buf[3] & 0x10) << 8) |
+- ((buf[1] & 0x0f) << 8) |
+- buf[4]);
+- hw->y = (((buf[3] & 0x20) << 7) |
+- ((buf[1] & 0xf0) << 4) |
+- buf[5]);
+- hw->z = buf[2];
+-
+ if (SYN_CAP_MULTI_BUTTON_NO(priv->ext_cap) &&
+ ((buf[0] ^ buf[3]) & 0x02)) {
+ switch (SYN_CAP_MULTI_BUTTON_NO(priv->ext_cap) & ~0x01) {
+diff --git a/drivers/input/mouse/synaptics.h b/drivers/input/mouse/synaptics.h
+index e594af0b264b..fb2e076738ae 100644
+--- a/drivers/input/mouse/synaptics.h
++++ b/drivers/input/mouse/synaptics.h
+@@ -78,6 +78,11 @@
+ * 2 0x08 image sensor image sensor tracks 5 fingers, but only
+ * reports 2.
+ * 2 0x20 report min query 0x0f gives min coord reported
++ * 2 0x80 forcepad forcepad is a variant of clickpad that
++ * does not have physical buttons but rather
++ * uses pressure above a certain threshold to
++ * report primary clicks. Forcepads also have
++ * clickpad bit set.
+ */
+ #define SYN_CAP_CLICKPAD(ex0c) ((ex0c) & 0x100000) /* 1-button ClickPad */
+ #define SYN_CAP_CLICKPAD2BTN(ex0c) ((ex0c) & 0x000100) /* 2-button ClickPad */
+@@ -86,6 +91,7 @@
+ #define SYN_CAP_ADV_GESTURE(ex0c) ((ex0c) & 0x080000)
+ #define SYN_CAP_REDUCED_FILTERING(ex0c) ((ex0c) & 0x000400)
+ #define SYN_CAP_IMAGE_SENSOR(ex0c) ((ex0c) & 0x000800)
++#define SYN_CAP_FORCEPAD(ex0c) ((ex0c) & 0x008000)
+
+ /* synaptics modes query bits */
+ #define SYN_MODE_ABSOLUTE(m) ((m) & (1 << 7))
+@@ -177,6 +183,11 @@ struct synaptics_data {
+ */
+ struct synaptics_hw_state agm;
+ bool agm_pending; /* new AGM packet received */
++
++ /* ForcePad handling */
++ unsigned long press_start;
++ bool press;
++ bool report_press;
+ };
+
+ void synaptics_module_init(void);
+diff --git a/drivers/input/serio/i8042-x86ia64io.h b/drivers/input/serio/i8042-x86ia64io.h
+index 136b7b204f56..713e3ddb43bd 100644
+--- a/drivers/input/serio/i8042-x86ia64io.h
++++ b/drivers/input/serio/i8042-x86ia64io.h
+@@ -465,6 +465,13 @@ static const struct dmi_system_id __initconst i8042_dmi_nomux_table[] = {
+ DMI_MATCH(DMI_PRODUCT_NAME, "HP Pavilion dv4 Notebook PC"),
+ },
+ },
++ {
++ /* Avatar AVIU-145A6 */
++ .matches = {
++ DMI_MATCH(DMI_SYS_VENDOR, "Intel"),
++ DMI_MATCH(DMI_PRODUCT_NAME, "IC4I"),
++ },
++ },
+ { }
+ };
+
+@@ -608,6 +615,14 @@ static const struct dmi_system_id __initconst i8042_dmi_notimeout_table[] = {
+ DMI_MATCH(DMI_PRODUCT_NAME, "HP Pavilion dv4 Notebook PC"),
+ },
+ },
++ {
++ /* Fujitsu U574 laptop */
++ /* https://bugzilla.kernel.org/show_bug.cgi?id=69731 */
++ .matches = {
++ DMI_MATCH(DMI_SYS_VENDOR, "FUJITSU"),
++ DMI_MATCH(DMI_PRODUCT_NAME, "LIFEBOOK U574"),
++ },
++ },
+ { }
+ };
+
+diff --git a/drivers/input/serio/serport.c b/drivers/input/serio/serport.c
+index 0cb7ef59071b..69175b825346 100644
+--- a/drivers/input/serio/serport.c
++++ b/drivers/input/serio/serport.c
+@@ -21,6 +21,7 @@
+ #include <linux/init.h>
+ #include <linux/serio.h>
+ #include <linux/tty.h>
++#include <linux/compat.h>
+
+ MODULE_AUTHOR("Vojtech Pavlik <vojtech@ucw.cz>");
+ MODULE_DESCRIPTION("Input device TTY line discipline");
+@@ -198,28 +199,55 @@ static ssize_t serport_ldisc_read(struct tty_struct * tty, struct file * file, u
+ return 0;
+ }
+
++static void serport_set_type(struct tty_struct *tty, unsigned long type)
++{
++ struct serport *serport = tty->disc_data;
++
++ serport->id.proto = type & 0x000000ff;
++ serport->id.id = (type & 0x0000ff00) >> 8;
++ serport->id.extra = (type & 0x00ff0000) >> 16;
++}
++
+ /*
+ * serport_ldisc_ioctl() allows to set the port protocol, and device ID
+ */
+
+-static int serport_ldisc_ioctl(struct tty_struct * tty, struct file * file, unsigned int cmd, unsigned long arg)
++static int serport_ldisc_ioctl(struct tty_struct *tty, struct file *file,
++ unsigned int cmd, unsigned long arg)
+ {
+- struct serport *serport = (struct serport*) tty->disc_data;
+- unsigned long type;
+-
+ if (cmd == SPIOCSTYPE) {
++ unsigned long type;
++
+ if (get_user(type, (unsigned long __user *) arg))
+ return -EFAULT;
+
+- serport->id.proto = type & 0x000000ff;
+- serport->id.id = (type & 0x0000ff00) >> 8;
+- serport->id.extra = (type & 0x00ff0000) >> 16;
++ serport_set_type(tty, type);
++ return 0;
++ }
++
++ return -EINVAL;
++}
++
++#ifdef CONFIG_COMPAT
++#define COMPAT_SPIOCSTYPE _IOW('q', 0x01, compat_ulong_t)
++static long serport_ldisc_compat_ioctl(struct tty_struct *tty,
++ struct file *file,
++ unsigned int cmd, unsigned long arg)
++{
++ if (cmd == COMPAT_SPIOCSTYPE) {
++ void __user *uarg = compat_ptr(arg);
++ compat_ulong_t compat_type;
++
++ if (get_user(compat_type, (compat_ulong_t __user *)uarg))
++ return -EFAULT;
+
++ serport_set_type(tty, compat_type);
+ return 0;
+ }
+
+ return -EINVAL;
+ }
++#endif
+
+ static void serport_ldisc_write_wakeup(struct tty_struct * tty)
+ {
+@@ -243,6 +271,9 @@ static struct tty_ldisc_ops serport_ldisc = {
+ .close = serport_ldisc_close,
+ .read = serport_ldisc_read,
+ .ioctl = serport_ldisc_ioctl,
++#ifdef CONFIG_COMPAT
++ .compat_ioctl = serport_ldisc_compat_ioctl,
++#endif
+ .receive_buf = serport_ldisc_receive,
+ .write_wakeup = serport_ldisc_write_wakeup
+ };
+diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
+index 1599354e974d..9a35baf1caed 100644
+--- a/drivers/iommu/arm-smmu.c
++++ b/drivers/iommu/arm-smmu.c
+@@ -830,8 +830,11 @@ static void arm_smmu_init_context_bank(struct arm_smmu_domain *smmu_domain)
+ reg |= TTBCR_EAE |
+ (TTBCR_SH_IS << TTBCR_SH0_SHIFT) |
+ (TTBCR_RGN_WBWA << TTBCR_ORGN0_SHIFT) |
+- (TTBCR_RGN_WBWA << TTBCR_IRGN0_SHIFT) |
+- (TTBCR_SL0_LVL_1 << TTBCR_SL0_SHIFT);
++ (TTBCR_RGN_WBWA << TTBCR_IRGN0_SHIFT);
++
++ if (!stage1)
++ reg |= (TTBCR_SL0_LVL_1 << TTBCR_SL0_SHIFT);
++
+ writel_relaxed(reg, cb_base + ARM_SMMU_CB_TTBCR);
+
+ /* MAIR0 (stage-1 only) */
+diff --git a/drivers/iommu/dmar.c b/drivers/iommu/dmar.c
+index 9a4f05e5b23f..55f1515d54c9 100644
+--- a/drivers/iommu/dmar.c
++++ b/drivers/iommu/dmar.c
+@@ -677,8 +677,7 @@ static int __init dmar_acpi_dev_scope_init(void)
+ andd->object_name);
+ continue;
+ }
+- acpi_bus_get_device(h, &adev);
+- if (!adev) {
++ if (acpi_bus_get_device(h, &adev)) {
+ pr_err("Failed to get device for ACPI object %s\n",
+ andd->object_name);
+ continue;
+diff --git a/drivers/iommu/fsl_pamu_domain.c b/drivers/iommu/fsl_pamu_domain.c
+index af47648301a9..87f94d597e6e 100644
+--- a/drivers/iommu/fsl_pamu_domain.c
++++ b/drivers/iommu/fsl_pamu_domain.c
+@@ -1048,7 +1048,7 @@ static int fsl_pamu_add_device(struct device *dev)
+ struct iommu_group *group = ERR_PTR(-ENODEV);
+ struct pci_dev *pdev;
+ const u32 *prop;
+- int ret, len;
++ int ret = 0, len;
+
+ /*
+ * For platform devices we allocate a separate group for
+@@ -1071,7 +1071,13 @@ static int fsl_pamu_add_device(struct device *dev)
+ if (IS_ERR(group))
+ return PTR_ERR(group);
+
+- ret = iommu_group_add_device(group, dev);
++ /*
++ * Check if device has already been added to an iommu group.
++ * Group could have already been created for a PCI device in
++ * the iommu_group_get_for_dev path.
++ */
++ if (!dev->iommu_group)
++ ret = iommu_group_add_device(group, dev);
+
+ iommu_group_put(group);
+ return ret;
+diff --git a/drivers/md/dm-cache-target.c b/drivers/md/dm-cache-target.c
+index 2c63326638b6..c892e48655c2 100644
+--- a/drivers/md/dm-cache-target.c
++++ b/drivers/md/dm-cache-target.c
+@@ -873,8 +873,8 @@ static void migration_success_pre_commit(struct dm_cache_migration *mg)
+ struct cache *cache = mg->cache;
+
+ if (mg->writeback) {
+- cell_defer(cache, mg->old_ocell, false);
+ clear_dirty(cache, mg->old_oblock, mg->cblock);
++ cell_defer(cache, mg->old_ocell, false);
+ cleanup_migration(mg);
+ return;
+
+@@ -929,13 +929,13 @@ static void migration_success_post_commit(struct dm_cache_migration *mg)
+ }
+
+ } else {
++ clear_dirty(cache, mg->new_oblock, mg->cblock);
+ if (mg->requeue_holder)
+ cell_defer(cache, mg->new_ocell, true);
+ else {
+ bio_endio(mg->new_ocell->holder, 0);
+ cell_defer(cache, mg->new_ocell, false);
+ }
+- clear_dirty(cache, mg->new_oblock, mg->cblock);
+ cleanup_migration(mg);
+ }
+ }
+diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
+index 4cba2d808afb..3e6ef4b1fb46 100644
+--- a/drivers/md/dm-crypt.c
++++ b/drivers/md/dm-crypt.c
+@@ -1681,6 +1681,7 @@ static int crypt_ctr(struct dm_target *ti, unsigned int argc, char **argv)
+ unsigned int key_size, opt_params;
+ unsigned long long tmpll;
+ int ret;
++ size_t iv_size_padding;
+ struct dm_arg_set as;
+ const char *opt_string;
+ char dummy;
+@@ -1717,12 +1718,23 @@ static int crypt_ctr(struct dm_target *ti, unsigned int argc, char **argv)
+
+ cc->dmreq_start = sizeof(struct ablkcipher_request);
+ cc->dmreq_start += crypto_ablkcipher_reqsize(any_tfm(cc));
+- cc->dmreq_start = ALIGN(cc->dmreq_start, crypto_tfm_ctx_alignment());
+- cc->dmreq_start += crypto_ablkcipher_alignmask(any_tfm(cc)) &
+- ~(crypto_tfm_ctx_alignment() - 1);
++ cc->dmreq_start = ALIGN(cc->dmreq_start, __alignof__(struct dm_crypt_request));
++
++ if (crypto_ablkcipher_alignmask(any_tfm(cc)) < CRYPTO_MINALIGN) {
++ /* Allocate the padding exactly */
++ iv_size_padding = -(cc->dmreq_start + sizeof(struct dm_crypt_request))
++ & crypto_ablkcipher_alignmask(any_tfm(cc));
++ } else {
++ /*
++ * If the cipher requires greater alignment than kmalloc
++ * alignment, we don't know the exact position of the
++ * initialization vector. We must assume worst case.
++ */
++ iv_size_padding = crypto_ablkcipher_alignmask(any_tfm(cc));
++ }
+
+ cc->req_pool = mempool_create_kmalloc_pool(MIN_IOS, cc->dmreq_start +
+- sizeof(struct dm_crypt_request) + cc->iv_size);
++ sizeof(struct dm_crypt_request) + iv_size_padding + cc->iv_size);
+ if (!cc->req_pool) {
+ ti->error = "Cannot allocate crypt request mempool";
+ goto bad;
+diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
+index d7690f86fdb9..55de4f6f7eaf 100644
+--- a/drivers/md/raid1.c
++++ b/drivers/md/raid1.c
+@@ -540,11 +540,7 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
+ has_nonrot_disk = 0;
+ choose_next_idle = 0;
+
+- if (conf->mddev->recovery_cp < MaxSector &&
+- (this_sector + sectors >= conf->next_resync))
+- choose_first = 1;
+- else
+- choose_first = 0;
++ choose_first = (conf->mddev->recovery_cp < this_sector + sectors);
+
+ for (disk = 0 ; disk < conf->raid_disks * 2 ; disk++) {
+ sector_t dist;
+@@ -831,7 +827,7 @@ static void flush_pending_writes(struct r1conf *conf)
+ * there is no normal IO happeing. It must arrange to call
+ * lower_barrier when the particular background IO completes.
+ */
+-static void raise_barrier(struct r1conf *conf)
++static void raise_barrier(struct r1conf *conf, sector_t sector_nr)
+ {
+ spin_lock_irq(&conf->resync_lock);
+
+@@ -841,6 +837,7 @@ static void raise_barrier(struct r1conf *conf)
+
+ /* block any new IO from starting */
+ conf->barrier++;
++ conf->next_resync = sector_nr;
+
+ /* For these conditions we must wait:
+ * A: while the array is in frozen state
+@@ -849,14 +846,17 @@ static void raise_barrier(struct r1conf *conf)
+ * C: next_resync + RESYNC_SECTORS > start_next_window, meaning
+ * next resync will reach to the window which normal bios are
+ * handling.
++ * D: while there are any active requests in the current window.
+ */
+ wait_event_lock_irq(conf->wait_barrier,
+ !conf->array_frozen &&
+ conf->barrier < RESYNC_DEPTH &&
++ conf->current_window_requests == 0 &&
+ (conf->start_next_window >=
+ conf->next_resync + RESYNC_SECTORS),
+ conf->resync_lock);
+
++ conf->nr_pending++;
+ spin_unlock_irq(&conf->resync_lock);
+ }
+
+@@ -866,6 +866,7 @@ static void lower_barrier(struct r1conf *conf)
+ BUG_ON(conf->barrier <= 0);
+ spin_lock_irqsave(&conf->resync_lock, flags);
+ conf->barrier--;
++ conf->nr_pending--;
+ spin_unlock_irqrestore(&conf->resync_lock, flags);
+ wake_up(&conf->wait_barrier);
+ }
+@@ -877,12 +878,10 @@ static bool need_to_wait_for_sync(struct r1conf *conf, struct bio *bio)
+ if (conf->array_frozen || !bio)
+ wait = true;
+ else if (conf->barrier && bio_data_dir(bio) == WRITE) {
+- if (conf->next_resync < RESYNC_WINDOW_SECTORS)
+- wait = true;
+- else if ((conf->next_resync - RESYNC_WINDOW_SECTORS
+- >= bio_end_sector(bio)) ||
+- (conf->next_resync + NEXT_NORMALIO_DISTANCE
+- <= bio->bi_iter.bi_sector))
++ if ((conf->mddev->curr_resync_completed
++ >= bio_end_sector(bio)) ||
++ (conf->next_resync + NEXT_NORMALIO_DISTANCE
++ <= bio->bi_iter.bi_sector))
+ wait = false;
+ else
+ wait = true;
+@@ -919,8 +918,8 @@ static sector_t wait_barrier(struct r1conf *conf, struct bio *bio)
+ }
+
+ if (bio && bio_data_dir(bio) == WRITE) {
+- if (conf->next_resync + NEXT_NORMALIO_DISTANCE
+- <= bio->bi_iter.bi_sector) {
++ if (bio->bi_iter.bi_sector >=
++ conf->mddev->curr_resync_completed) {
+ if (conf->start_next_window == MaxSector)
+ conf->start_next_window =
+ conf->next_resync +
+@@ -1186,6 +1185,7 @@ read_again:
+ atomic_read(&bitmap->behind_writes) == 0);
+ }
+ r1_bio->read_disk = rdisk;
++ r1_bio->start_next_window = 0;
+
+ read_bio = bio_clone_mddev(bio, GFP_NOIO, mddev);
+ bio_trim(read_bio, r1_bio->sector - bio->bi_iter.bi_sector,
+@@ -1548,8 +1548,13 @@ static void close_sync(struct r1conf *conf)
+ mempool_destroy(conf->r1buf_pool);
+ conf->r1buf_pool = NULL;
+
++ spin_lock_irq(&conf->resync_lock);
+ conf->next_resync = 0;
+ conf->start_next_window = MaxSector;
++ conf->current_window_requests +=
++ conf->next_window_requests;
++ conf->next_window_requests = 0;
++ spin_unlock_irq(&conf->resync_lock);
+ }
+
+ static int raid1_spare_active(struct mddev *mddev)
+@@ -2150,7 +2155,7 @@ static void fix_read_error(struct r1conf *conf, int read_disk,
+ d--;
+ rdev = conf->mirrors[d].rdev;
+ if (rdev &&
+- test_bit(In_sync, &rdev->flags))
++ !test_bit(Faulty, &rdev->flags))
+ r1_sync_page_io(rdev, sect, s,
+ conf->tmppage, WRITE);
+ }
+@@ -2162,7 +2167,7 @@ static void fix_read_error(struct r1conf *conf, int read_disk,
+ d--;
+ rdev = conf->mirrors[d].rdev;
+ if (rdev &&
+- test_bit(In_sync, &rdev->flags)) {
++ !test_bit(Faulty, &rdev->flags)) {
+ if (r1_sync_page_io(rdev, sect, s,
+ conf->tmppage, READ)) {
+ atomic_add(s, &rdev->corrected_errors);
+@@ -2541,9 +2546,8 @@ static sector_t sync_request(struct mddev *mddev, sector_t sector_nr, int *skipp
+
+ bitmap_cond_end_sync(mddev->bitmap, sector_nr);
+ r1_bio = mempool_alloc(conf->r1buf_pool, GFP_NOIO);
+- raise_barrier(conf);
+
+- conf->next_resync = sector_nr;
++ raise_barrier(conf, sector_nr);
+
+ rcu_read_lock();
+ /*
+diff --git a/drivers/media/dvb-core/dvb-usb-ids.h b/drivers/media/dvb-core/dvb-usb-ids.h
+index 11d2bea23b02..26674e12133b 100644
+--- a/drivers/media/dvb-core/dvb-usb-ids.h
++++ b/drivers/media/dvb-core/dvb-usb-ids.h
+@@ -279,6 +279,8 @@
+ #define USB_PID_PCTV_400E 0x020f
+ #define USB_PID_PCTV_450E 0x0222
+ #define USB_PID_PCTV_452E 0x021f
++#define USB_PID_PCTV_78E 0x025a
++#define USB_PID_PCTV_79E 0x0262
+ #define USB_PID_REALTEK_RTL2831U 0x2831
+ #define USB_PID_REALTEK_RTL2832U 0x2832
+ #define USB_PID_TECHNOTREND_CONNECT_S2_3600 0x3007
+diff --git a/drivers/media/dvb-frontends/af9033.c b/drivers/media/dvb-frontends/af9033.c
+index be4bec2a9640..5c90ea683a7e 100644
+--- a/drivers/media/dvb-frontends/af9033.c
++++ b/drivers/media/dvb-frontends/af9033.c
+@@ -314,6 +314,19 @@ static int af9033_init(struct dvb_frontend *fe)
+ goto err;
+ }
+
++ /* feed clock to RF tuner */
++ switch (state->cfg.tuner) {
++ case AF9033_TUNER_IT9135_38:
++ case AF9033_TUNER_IT9135_51:
++ case AF9033_TUNER_IT9135_52:
++ case AF9033_TUNER_IT9135_60:
++ case AF9033_TUNER_IT9135_61:
++ case AF9033_TUNER_IT9135_62:
++ ret = af9033_wr_reg(state, 0x80fba8, 0x00);
++ if (ret < 0)
++ goto err;
++ }
++
+ /* settings for TS interface */
+ if (state->cfg.ts_mode == AF9033_TS_MODE_USB) {
+ ret = af9033_wr_reg_mask(state, 0x80f9a5, 0x00, 0x01);
+diff --git a/drivers/media/dvb-frontends/af9033_priv.h b/drivers/media/dvb-frontends/af9033_priv.h
+index fc2ad581e302..ded7b67d7526 100644
+--- a/drivers/media/dvb-frontends/af9033_priv.h
++++ b/drivers/media/dvb-frontends/af9033_priv.h
+@@ -1418,7 +1418,7 @@ static const struct reg_val tuner_init_it9135_60[] = {
+ { 0x800068, 0x0a },
+ { 0x80006a, 0x03 },
+ { 0x800070, 0x0a },
+- { 0x800071, 0x05 },
++ { 0x800071, 0x0a },
+ { 0x800072, 0x02 },
+ { 0x800075, 0x8c },
+ { 0x800076, 0x8c },
+@@ -1484,7 +1484,6 @@ static const struct reg_val tuner_init_it9135_60[] = {
+ { 0x800104, 0x02 },
+ { 0x800105, 0xbe },
+ { 0x800106, 0x00 },
+- { 0x800109, 0x02 },
+ { 0x800115, 0x0a },
+ { 0x800116, 0x03 },
+ { 0x80011a, 0xbe },
+@@ -1510,7 +1509,6 @@ static const struct reg_val tuner_init_it9135_60[] = {
+ { 0x80014b, 0x8c },
+ { 0x80014d, 0xac },
+ { 0x80014e, 0xc6 },
+- { 0x80014f, 0x03 },
+ { 0x800151, 0x1e },
+ { 0x800153, 0xbc },
+ { 0x800178, 0x09 },
+@@ -1522,9 +1520,10 @@ static const struct reg_val tuner_init_it9135_60[] = {
+ { 0x80018d, 0x5f },
+ { 0x80018f, 0xa0 },
+ { 0x800190, 0x5a },
+- { 0x80ed02, 0xff },
+- { 0x80ee42, 0xff },
+- { 0x80ee82, 0xff },
++ { 0x800191, 0x00 },
++ { 0x80ed02, 0x40 },
++ { 0x80ee42, 0x40 },
++ { 0x80ee82, 0x40 },
+ { 0x80f000, 0x0f },
+ { 0x80f01f, 0x8c },
+ { 0x80f020, 0x00 },
+@@ -1699,7 +1698,6 @@ static const struct reg_val tuner_init_it9135_61[] = {
+ { 0x800104, 0x02 },
+ { 0x800105, 0xc8 },
+ { 0x800106, 0x00 },
+- { 0x800109, 0x02 },
+ { 0x800115, 0x0a },
+ { 0x800116, 0x03 },
+ { 0x80011a, 0xc6 },
+@@ -1725,7 +1723,6 @@ static const struct reg_val tuner_init_it9135_61[] = {
+ { 0x80014b, 0x8c },
+ { 0x80014d, 0xa8 },
+ { 0x80014e, 0xc6 },
+- { 0x80014f, 0x03 },
+ { 0x800151, 0x28 },
+ { 0x800153, 0xcc },
+ { 0x800178, 0x09 },
+@@ -1737,9 +1734,10 @@ static const struct reg_val tuner_init_it9135_61[] = {
+ { 0x80018d, 0x5f },
+ { 0x80018f, 0xfb },
+ { 0x800190, 0x5c },
+- { 0x80ed02, 0xff },
+- { 0x80ee42, 0xff },
+- { 0x80ee82, 0xff },
++ { 0x800191, 0x00 },
++ { 0x80ed02, 0x40 },
++ { 0x80ee42, 0x40 },
++ { 0x80ee82, 0x40 },
+ { 0x80f000, 0x0f },
+ { 0x80f01f, 0x8c },
+ { 0x80f020, 0x00 },
+diff --git a/drivers/media/i2c/adv7604.c b/drivers/media/i2c/adv7604.c
+index 1778d320272e..67403b94f0a2 100644
+--- a/drivers/media/i2c/adv7604.c
++++ b/drivers/media/i2c/adv7604.c
+@@ -2325,7 +2325,7 @@ static int adv7604_log_status(struct v4l2_subdev *sd)
+ v4l2_info(sd, "HDCP keys read: %s%s\n",
+ (hdmi_read(sd, 0x04) & 0x20) ? "yes" : "no",
+ (hdmi_read(sd, 0x04) & 0x10) ? "ERROR" : "");
+- if (!is_hdmi(sd)) {
++ if (is_hdmi(sd)) {
+ bool audio_pll_locked = hdmi_read(sd, 0x04) & 0x01;
+ bool audio_sample_packet_detect = hdmi_read(sd, 0x18) & 0x01;
+ bool audio_mute = io_read(sd, 0x65) & 0x40;
+diff --git a/drivers/media/pci/cx18/cx18-driver.c b/drivers/media/pci/cx18/cx18-driver.c
+index 716bdc57fac6..83f5074706f9 100644
+--- a/drivers/media/pci/cx18/cx18-driver.c
++++ b/drivers/media/pci/cx18/cx18-driver.c
+@@ -1091,6 +1091,7 @@ static int cx18_probe(struct pci_dev *pci_dev,
+ setup.addr = ADDR_UNSET;
+ setup.type = cx->options.tuner;
+ setup.mode_mask = T_ANALOG_TV; /* matches TV tuners */
++ setup.config = NULL;
+ if (cx->options.radio > 0)
+ setup.mode_mask |= T_RADIO;
+ setup.tuner_callback = (setup.type == TUNER_XC2028) ?
+diff --git a/drivers/media/tuners/tuner_it913x.c b/drivers/media/tuners/tuner_it913x.c
+index 6f30d7e535b8..3d83c425bccf 100644
+--- a/drivers/media/tuners/tuner_it913x.c
++++ b/drivers/media/tuners/tuner_it913x.c
+@@ -396,6 +396,7 @@ struct dvb_frontend *it913x_attach(struct dvb_frontend *fe,
+ struct i2c_adapter *i2c_adap, u8 i2c_addr, u8 config)
+ {
+ struct it913x_state *state = NULL;
++ int ret;
+
+ /* allocate memory for the internal state */
+ state = kzalloc(sizeof(struct it913x_state), GFP_KERNEL);
+@@ -425,6 +426,11 @@ struct dvb_frontend *it913x_attach(struct dvb_frontend *fe,
+ state->tuner_type = config;
+ state->firmware_ver = 1;
+
++ /* tuner RF initial */
++ ret = it913x_wr_reg(state, PRO_DMOD, 0xec4c, 0x68);
++ if (ret < 0)
++ goto error;
++
+ fe->tuner_priv = state;
+ memcpy(&fe->ops.tuner_ops, &it913x_tuner_ops,
+ sizeof(struct dvb_tuner_ops));
+diff --git a/drivers/media/usb/dvb-usb-v2/af9035.c b/drivers/media/usb/dvb-usb-v2/af9035.c
+index 7b9b75f60774..04d8e951de0d 100644
+--- a/drivers/media/usb/dvb-usb-v2/af9035.c
++++ b/drivers/media/usb/dvb-usb-v2/af9035.c
+@@ -1555,6 +1555,10 @@ static const struct usb_device_id af9035_id_table[] = {
+ &af9035_props, "Leadtek WinFast DTV Dongle Dual", NULL) },
+ { DVB_USB_DEVICE(USB_VID_HAUPPAUGE, 0xf900,
+ &af9035_props, "Hauppauge WinTV-MiniStick 2", NULL) },
++ { DVB_USB_DEVICE(USB_VID_PCTV, USB_PID_PCTV_78E,
++ &af9035_props, "PCTV 78e", RC_MAP_IT913X_V1) },
++ { DVB_USB_DEVICE(USB_VID_PCTV, USB_PID_PCTV_79E,
++ &af9035_props, "PCTV 79e", RC_MAP_IT913X_V2) },
+ { }
+ };
+ MODULE_DEVICE_TABLE(usb, af9035_id_table);
+diff --git a/drivers/media/usb/em28xx/em28xx-video.c b/drivers/media/usb/em28xx/em28xx-video.c
+index f6b49c98e2c9..408c072ce228 100644
+--- a/drivers/media/usb/em28xx/em28xx-video.c
++++ b/drivers/media/usb/em28xx/em28xx-video.c
+@@ -1344,7 +1344,7 @@ static int vidioc_s_fmt_vid_cap(struct file *file, void *priv,
+ struct em28xx *dev = video_drvdata(file);
+ struct em28xx_v4l2 *v4l2 = dev->v4l2;
+
+- if (v4l2->streaming_users > 0)
++ if (vb2_is_busy(&v4l2->vb_vidq))
+ return -EBUSY;
+
+ vidioc_try_fmt_vid_cap(file, priv, f);
+diff --git a/drivers/media/v4l2-core/videobuf2-core.c b/drivers/media/v4l2-core/videobuf2-core.c
+index 1d67e95311d6..dcdceae30ab0 100644
+--- a/drivers/media/v4l2-core/videobuf2-core.c
++++ b/drivers/media/v4l2-core/videobuf2-core.c
+@@ -1126,7 +1126,7 @@ EXPORT_SYMBOL_GPL(vb2_plane_vaddr);
+ */
+ void *vb2_plane_cookie(struct vb2_buffer *vb, unsigned int plane_no)
+ {
+- if (plane_no > vb->num_planes || !vb->planes[plane_no].mem_priv)
++ if (plane_no >= vb->num_planes || !vb->planes[plane_no].mem_priv)
+ return NULL;
+
+ return call_ptr_memop(vb, cookie, vb->planes[plane_no].mem_priv);
+@@ -1161,13 +1161,10 @@ void vb2_buffer_done(struct vb2_buffer *vb, enum vb2_buffer_state state)
+ if (WARN_ON(vb->state != VB2_BUF_STATE_ACTIVE))
+ return;
+
+- if (!q->start_streaming_called) {
+- if (WARN_ON(state != VB2_BUF_STATE_QUEUED))
+- state = VB2_BUF_STATE_QUEUED;
+- } else if (WARN_ON(state != VB2_BUF_STATE_DONE &&
+- state != VB2_BUF_STATE_ERROR)) {
+- state = VB2_BUF_STATE_ERROR;
+- }
++ if (WARN_ON(state != VB2_BUF_STATE_DONE &&
++ state != VB2_BUF_STATE_ERROR &&
++ state != VB2_BUF_STATE_QUEUED))
++ state = VB2_BUF_STATE_ERROR;
+
+ #ifdef CONFIG_VIDEO_ADV_DEBUG
+ /*
+@@ -1774,6 +1771,12 @@ static int vb2_start_streaming(struct vb2_queue *q)
+ /* Must be zero now */
+ WARN_ON(atomic_read(&q->owned_by_drv_count));
+ }
++ /*
++ * If done_list is not empty, then start_streaming() didn't call
++ * vb2_buffer_done(vb, VB2_BUF_STATE_QUEUED) but STATE_ERROR or
++ * STATE_DONE.
++ */
++ WARN_ON(!list_empty(&q->done_list));
+ return ret;
+ }
+
+diff --git a/drivers/media/v4l2-core/videobuf2-dma-sg.c b/drivers/media/v4l2-core/videobuf2-dma-sg.c
+index adefc31bb853..9b163a440f89 100644
+--- a/drivers/media/v4l2-core/videobuf2-dma-sg.c
++++ b/drivers/media/v4l2-core/videobuf2-dma-sg.c
+@@ -113,7 +113,7 @@ static void *vb2_dma_sg_alloc(void *alloc_ctx, unsigned long size, gfp_t gfp_fla
+ goto fail_pages_alloc;
+
+ ret = sg_alloc_table_from_pages(&buf->sg_table, buf->pages,
+- buf->num_pages, 0, size, gfp_flags);
++ buf->num_pages, 0, size, GFP_KERNEL);
+ if (ret)
+ goto fail_table_alloc;
+
+diff --git a/drivers/mmc/host/mmci.c b/drivers/mmc/host/mmci.c
+index 249ab80cbb45..d3f05ad33f09 100644
+--- a/drivers/mmc/host/mmci.c
++++ b/drivers/mmc/host/mmci.c
+@@ -65,6 +65,7 @@ static unsigned int fmax = 515633;
+ * @pwrreg_clkgate: MMCIPOWER register must be used to gate the clock
+ * @busy_detect: true if busy detection on dat0 is supported
+ * @pwrreg_nopower: bits in MMCIPOWER don't controls ext. power supply
++ * @reversed_irq_handling: handle data irq before cmd irq.
+ */
+ struct variant_data {
+ unsigned int clkreg;
+@@ -80,6 +81,7 @@ struct variant_data {
+ bool pwrreg_clkgate;
+ bool busy_detect;
+ bool pwrreg_nopower;
++ bool reversed_irq_handling;
+ };
+
+ static struct variant_data variant_arm = {
+@@ -87,6 +89,7 @@ static struct variant_data variant_arm = {
+ .fifohalfsize = 8 * 4,
+ .datalength_bits = 16,
+ .pwrreg_powerup = MCI_PWR_UP,
++ .reversed_irq_handling = true,
+ };
+
+ static struct variant_data variant_arm_extended_fifo = {
+@@ -1163,8 +1166,13 @@ static irqreturn_t mmci_irq(int irq, void *dev_id)
+
+ dev_dbg(mmc_dev(host->mmc), "irq0 (data+cmd) %08x\n", status);
+
+- mmci_cmd_irq(host, host->cmd, status);
+- mmci_data_irq(host, host->data, status);
++ if (host->variant->reversed_irq_handling) {
++ mmci_data_irq(host, host->data, status);
++ mmci_cmd_irq(host, host->cmd, status);
++ } else {
++ mmci_cmd_irq(host, host->cmd, status);
++ mmci_data_irq(host, host->data, status);
++ }
+
+ /* Don't poll for busy completion in irq context. */
+ if (host->busy_status)
+diff --git a/drivers/net/ethernet/ibm/ibmveth.c b/drivers/net/ethernet/ibm/ibmveth.c
+index c9127562bd22..21978cc019e7 100644
+--- a/drivers/net/ethernet/ibm/ibmveth.c
++++ b/drivers/net/ethernet/ibm/ibmveth.c
+@@ -292,6 +292,18 @@ failure:
+ atomic_add(buffers_added, &(pool->available));
+ }
+
++/*
++ * The final 8 bytes of the buffer list is a counter of frames dropped
++ * because there was not a buffer in the buffer list capable of holding
++ * the frame.
++ */
++static void ibmveth_update_rx_no_buffer(struct ibmveth_adapter *adapter)
++{
++ __be64 *p = adapter->buffer_list_addr + 4096 - 8;
++
++ adapter->rx_no_buffer = be64_to_cpup(p);
++}
++
+ /* replenish routine */
+ static void ibmveth_replenish_task(struct ibmveth_adapter *adapter)
+ {
+@@ -307,8 +319,7 @@ static void ibmveth_replenish_task(struct ibmveth_adapter *adapter)
+ ibmveth_replenish_buffer_pool(adapter, pool);
+ }
+
+- adapter->rx_no_buffer = *(u64 *)(((char*)adapter->buffer_list_addr) +
+- 4096 - 8);
++ ibmveth_update_rx_no_buffer(adapter);
+ }
+
+ /* empty and free ana buffer pool - also used to do cleanup in error paths */
+@@ -698,8 +709,7 @@ static int ibmveth_close(struct net_device *netdev)
+
+ free_irq(netdev->irq, netdev);
+
+- adapter->rx_no_buffer = *(u64 *)(((char *)adapter->buffer_list_addr) +
+- 4096 - 8);
++ ibmveth_update_rx_no_buffer(adapter);
+
+ ibmveth_cleanup(adapter);
+
+diff --git a/drivers/net/wireless/ath/ath9k/htc_drv_txrx.c b/drivers/net/wireless/ath/ath9k/htc_drv_txrx.c
+index bb86eb2ffc95..f0484b1b617e 100644
+--- a/drivers/net/wireless/ath/ath9k/htc_drv_txrx.c
++++ b/drivers/net/wireless/ath/ath9k/htc_drv_txrx.c
+@@ -978,7 +978,7 @@ static bool ath9k_rx_prepare(struct ath9k_htc_priv *priv,
+ struct ath_hw *ah = common->ah;
+ struct ath_htc_rx_status *rxstatus;
+ struct ath_rx_status rx_stats;
+- bool decrypt_error;
++ bool decrypt_error = false;
+
+ if (skb->len < HTC_RX_FRAME_HEADER_SIZE) {
+ ath_err(common, "Corrupted RX frame, dropping (len: %d)\n",
+diff --git a/drivers/net/wireless/ath/carl9170/carl9170.h b/drivers/net/wireless/ath/carl9170/carl9170.h
+index 8596aba34f96..237d0cda1bcb 100644
+--- a/drivers/net/wireless/ath/carl9170/carl9170.h
++++ b/drivers/net/wireless/ath/carl9170/carl9170.h
+@@ -256,6 +256,7 @@ struct ar9170 {
+ atomic_t rx_work_urbs;
+ atomic_t rx_pool_urbs;
+ kernel_ulong_t features;
++ bool usb_ep_cmd_is_bulk;
+
+ /* firmware settings */
+ struct completion fw_load_wait;
+diff --git a/drivers/net/wireless/ath/carl9170/usb.c b/drivers/net/wireless/ath/carl9170/usb.c
+index f35c7f30f9a6..c9f93310c0d6 100644
+--- a/drivers/net/wireless/ath/carl9170/usb.c
++++ b/drivers/net/wireless/ath/carl9170/usb.c
+@@ -621,9 +621,16 @@ int __carl9170_exec_cmd(struct ar9170 *ar, struct carl9170_cmd *cmd,
+ goto err_free;
+ }
+
+- usb_fill_int_urb(urb, ar->udev, usb_sndintpipe(ar->udev,
+- AR9170_USB_EP_CMD), cmd, cmd->hdr.len + 4,
+- carl9170_usb_cmd_complete, ar, 1);
++ if (ar->usb_ep_cmd_is_bulk)
++ usb_fill_bulk_urb(urb, ar->udev,
++ usb_sndbulkpipe(ar->udev, AR9170_USB_EP_CMD),
++ cmd, cmd->hdr.len + 4,
++ carl9170_usb_cmd_complete, ar);
++ else
++ usb_fill_int_urb(urb, ar->udev,
++ usb_sndintpipe(ar->udev, AR9170_USB_EP_CMD),
++ cmd, cmd->hdr.len + 4,
++ carl9170_usb_cmd_complete, ar, 1);
+
+ if (free_buf)
+ urb->transfer_flags |= URB_FREE_BUFFER;
+@@ -1032,9 +1039,10 @@ static void carl9170_usb_firmware_step2(const struct firmware *fw,
+ static int carl9170_usb_probe(struct usb_interface *intf,
+ const struct usb_device_id *id)
+ {
++ struct usb_endpoint_descriptor *ep;
+ struct ar9170 *ar;
+ struct usb_device *udev;
+- int err;
++ int i, err;
+
+ err = usb_reset_device(interface_to_usbdev(intf));
+ if (err)
+@@ -1050,6 +1058,21 @@ static int carl9170_usb_probe(struct usb_interface *intf,
+ ar->intf = intf;
+ ar->features = id->driver_info;
+
++ /* We need to remember the type of endpoint 4 because it differs
++ * between high- and full-speed configuration. The high-speed
++ * configuration specifies it as interrupt and the full-speed
++ * configuration as bulk endpoint. This information is required
++ * later when sending urbs to that endpoint.
++ */
++ for (i = 0; i < intf->cur_altsetting->desc.bNumEndpoints; ++i) {
++ ep = &intf->cur_altsetting->endpoint[i].desc;
++
++ if (usb_endpoint_num(ep) == AR9170_USB_EP_CMD &&
++ usb_endpoint_dir_out(ep) &&
++ usb_endpoint_type(ep) == USB_ENDPOINT_XFER_BULK)
++ ar->usb_ep_cmd_is_bulk = true;
++ }
++
+ usb_set_intfdata(intf, ar);
+ SET_IEEE80211_DEV(ar->hw, &intf->dev);
+
+diff --git a/drivers/net/wireless/brcm80211/brcmfmac/fweh.c b/drivers/net/wireless/brcm80211/brcmfmac/fweh.c
+index fad77dd2a3a5..3f9cb894d001 100644
+--- a/drivers/net/wireless/brcm80211/brcmfmac/fweh.c
++++ b/drivers/net/wireless/brcm80211/brcmfmac/fweh.c
+@@ -185,7 +185,13 @@ static void brcmf_fweh_handle_if_event(struct brcmf_pub *drvr,
+ ifevent->action, ifevent->ifidx, ifevent->bssidx,
+ ifevent->flags, ifevent->role);
+
+- if (ifevent->flags & BRCMF_E_IF_FLAG_NOIF) {
++ /* The P2P Device interface event must not be ignored
++ * contrary to what firmware tells us. The only way to
++ * distinguish the P2P Device is by looking at the ifidx
++ * and bssidx received.
++ */
++ if (!(ifevent->ifidx == 0 && ifevent->bssidx == 1) &&
++ (ifevent->flags & BRCMF_E_IF_FLAG_NOIF)) {
+ brcmf_dbg(EVENT, "event can be ignored\n");
+ return;
+ }
+@@ -210,12 +216,12 @@ static void brcmf_fweh_handle_if_event(struct brcmf_pub *drvr,
+ return;
+ }
+
+- if (ifevent->action == BRCMF_E_IF_CHANGE)
++ if (ifp && ifevent->action == BRCMF_E_IF_CHANGE)
+ brcmf_fws_reset_interface(ifp);
+
+ err = brcmf_fweh_call_event_handler(ifp, emsg->event_code, emsg, data);
+
+- if (ifevent->action == BRCMF_E_IF_DEL) {
++ if (ifp && ifevent->action == BRCMF_E_IF_DEL) {
+ brcmf_fws_del_interface(ifp);
+ brcmf_del_if(drvr, ifevent->bssidx);
+ }
+diff --git a/drivers/net/wireless/brcm80211/brcmfmac/fweh.h b/drivers/net/wireless/brcm80211/brcmfmac/fweh.h
+index 51b53a73d074..d26b47698f68 100644
+--- a/drivers/net/wireless/brcm80211/brcmfmac/fweh.h
++++ b/drivers/net/wireless/brcm80211/brcmfmac/fweh.h
+@@ -167,6 +167,8 @@ enum brcmf_fweh_event_code {
+ #define BRCMF_E_IF_ROLE_STA 0
+ #define BRCMF_E_IF_ROLE_AP 1
+ #define BRCMF_E_IF_ROLE_WDS 2
++#define BRCMF_E_IF_ROLE_P2P_GO 3
++#define BRCMF_E_IF_ROLE_P2P_CLIENT 4
+
+ /**
+ * definitions for event packet validation.
+diff --git a/drivers/net/wireless/iwlwifi/dvm/rxon.c b/drivers/net/wireless/iwlwifi/dvm/rxon.c
+index 6dc5dd3ced44..ed50de6362ed 100644
+--- a/drivers/net/wireless/iwlwifi/dvm/rxon.c
++++ b/drivers/net/wireless/iwlwifi/dvm/rxon.c
+@@ -1068,6 +1068,13 @@ int iwlagn_commit_rxon(struct iwl_priv *priv, struct iwl_rxon_context *ctx)
+ /* recalculate basic rates */
+ iwl_calc_basic_rates(priv, ctx);
+
++ /*
++ * force CTS-to-self frames protection if RTS-CTS is not preferred
++ * one aggregation protection method
++ */
++ if (!priv->hw_params.use_rts_for_aggregation)
++ ctx->staging.flags |= RXON_FLG_SELF_CTS_EN;
++
+ if ((ctx->vif && ctx->vif->bss_conf.use_short_slot) ||
+ !(ctx->staging.flags & RXON_FLG_BAND_24G_MSK))
+ ctx->staging.flags |= RXON_FLG_SHORT_SLOT_MSK;
+@@ -1473,6 +1480,11 @@ void iwlagn_bss_info_changed(struct ieee80211_hw *hw,
+ else
+ ctx->staging.flags &= ~RXON_FLG_TGG_PROTECT_MSK;
+
++ if (bss_conf->use_cts_prot)
++ ctx->staging.flags |= RXON_FLG_SELF_CTS_EN;
++ else
++ ctx->staging.flags &= ~RXON_FLG_SELF_CTS_EN;
++
+ memcpy(ctx->staging.bssid_addr, bss_conf->bssid, ETH_ALEN);
+
+ if (vif->type == NL80211_IFTYPE_AP ||
+diff --git a/drivers/net/wireless/iwlwifi/iwl-config.h b/drivers/net/wireless/iwlwifi/iwl-config.h
+index b7047905f41a..6ac1bedd2876 100644
+--- a/drivers/net/wireless/iwlwifi/iwl-config.h
++++ b/drivers/net/wireless/iwlwifi/iwl-config.h
+@@ -120,6 +120,8 @@ enum iwl_led_mode {
+ #define IWL_LONG_WD_TIMEOUT 10000
+ #define IWL_MAX_WD_TIMEOUT 120000
+
++#define IWL_DEFAULT_MAX_TX_POWER 22
++
+ /* Antenna presence definitions */
+ #define ANT_NONE 0x0
+ #define ANT_A BIT(0)
+diff --git a/drivers/net/wireless/iwlwifi/iwl-nvm-parse.c b/drivers/net/wireless/iwlwifi/iwl-nvm-parse.c
+index 85eee79c495c..0c75fc140bf6 100644
+--- a/drivers/net/wireless/iwlwifi/iwl-nvm-parse.c
++++ b/drivers/net/wireless/iwlwifi/iwl-nvm-parse.c
+@@ -143,8 +143,6 @@ static const u8 iwl_nvm_channels_family_8000[] = {
+ #define LAST_2GHZ_HT_PLUS 9
+ #define LAST_5GHZ_HT 161
+
+-#define DEFAULT_MAX_TX_POWER 16
+-
+ /* rate data (static) */
+ static struct ieee80211_rate iwl_cfg80211_rates[] = {
+ { .bitrate = 1 * 10, .hw_value = 0, .hw_value_short = 0, },
+@@ -279,7 +277,7 @@ static int iwl_init_channel_map(struct device *dev, const struct iwl_cfg *cfg,
+ * Default value - highest tx power value. max_power
+ * is not used in mvm, and is used for backwards compatibility
+ */
+- channel->max_power = DEFAULT_MAX_TX_POWER;
++ channel->max_power = IWL_DEFAULT_MAX_TX_POWER;
+ is_5ghz = channel->band == IEEE80211_BAND_5GHZ;
+ IWL_DEBUG_EEPROM(dev,
+ "Ch. %d [%sGHz] %s%s%s%s%s%s(0x%02x %ddBm): Ad-Hoc %ssupported\n",
+diff --git a/drivers/net/wireless/iwlwifi/mvm/fw-api.h b/drivers/net/wireless/iwlwifi/mvm/fw-api.h
+index 309a9b9a94fe..67363080f83d 100644
+--- a/drivers/net/wireless/iwlwifi/mvm/fw-api.h
++++ b/drivers/net/wireless/iwlwifi/mvm/fw-api.h
+@@ -1487,14 +1487,14 @@ enum iwl_sf_scenario {
+
+ /**
+ * Smart Fifo configuration command.
+- * @state: smart fifo state, types listed in iwl_sf_sate.
++ * @state: smart fifo state, types listed in enum %iwl_sf_sate.
+ * @watermark: Minimum allowed availabe free space in RXF for transient state.
+ * @long_delay_timeouts: aging and idle timer values for each scenario
+ * in long delay state.
+ * @full_on_timeouts: timer values for each scenario in full on state.
+ */
+ struct iwl_sf_cfg_cmd {
+- enum iwl_sf_state state;
++ __le32 state;
+ __le32 watermark[SF_TRANSIENT_STATES_NUMBER];
+ __le32 long_delay_timeouts[SF_NUM_SCENARIO][SF_NUM_TIMEOUT_TYPES];
+ __le32 full_on_timeouts[SF_NUM_SCENARIO][SF_NUM_TIMEOUT_TYPES];
+diff --git a/drivers/net/wireless/iwlwifi/mvm/mac-ctxt.c b/drivers/net/wireless/iwlwifi/mvm/mac-ctxt.c
+index 8b79081d4885..db84533eff5d 100644
+--- a/drivers/net/wireless/iwlwifi/mvm/mac-ctxt.c
++++ b/drivers/net/wireless/iwlwifi/mvm/mac-ctxt.c
+@@ -720,11 +720,6 @@ static int iwl_mvm_mac_ctxt_cmd_sta(struct iwl_mvm *mvm,
+ !force_assoc_off) {
+ u32 dtim_offs;
+
+- /* Allow beacons to pass through as long as we are not
+- * associated, or we do not have dtim period information.
+- */
+- cmd.filter_flags |= cpu_to_le32(MAC_FILTER_IN_BEACON);
+-
+ /*
+ * The DTIM count counts down, so when it is N that means N
+ * more beacon intervals happen until the DTIM TBTT. Therefore
+@@ -758,6 +753,11 @@ static int iwl_mvm_mac_ctxt_cmd_sta(struct iwl_mvm *mvm,
+ ctxt_sta->is_assoc = cpu_to_le32(1);
+ } else {
+ ctxt_sta->is_assoc = cpu_to_le32(0);
++
++ /* Allow beacons to pass through as long as we are not
++ * associated, or we do not have dtim period information.
++ */
++ cmd.filter_flags |= cpu_to_le32(MAC_FILTER_IN_BEACON);
+ }
+
+ ctxt_sta->bi = cpu_to_le32(vif->bss_conf.beacon_int);
+diff --git a/drivers/net/wireless/iwlwifi/mvm/sf.c b/drivers/net/wireless/iwlwifi/mvm/sf.c
+index 7edfd15efc9d..e843b67f2201 100644
+--- a/drivers/net/wireless/iwlwifi/mvm/sf.c
++++ b/drivers/net/wireless/iwlwifi/mvm/sf.c
+@@ -172,7 +172,7 @@ static int iwl_mvm_sf_config(struct iwl_mvm *mvm, u8 sta_id,
+ enum iwl_sf_state new_state)
+ {
+ struct iwl_sf_cfg_cmd sf_cmd = {
+- .state = new_state,
++ .state = cpu_to_le32(new_state),
+ };
+ struct ieee80211_sta *sta;
+ int ret = 0;
+diff --git a/drivers/net/wireless/iwlwifi/mvm/tx.c b/drivers/net/wireless/iwlwifi/mvm/tx.c
+index 3846a6c41eb1..f2465f60122e 100644
+--- a/drivers/net/wireless/iwlwifi/mvm/tx.c
++++ b/drivers/net/wireless/iwlwifi/mvm/tx.c
+@@ -169,10 +169,14 @@ static void iwl_mvm_set_tx_cmd_rate(struct iwl_mvm *mvm,
+
+ /*
+ * for data packets, rate info comes from the table inside the fw. This
+- * table is controlled by LINK_QUALITY commands
++ * table is controlled by LINK_QUALITY commands. Exclude ctrl port
++ * frames like EAPOLs which should be treated as mgmt frames. This
++ * avoids them being sent initially in high rates which increases the
++ * chances for completion of the 4-Way handshake.
+ */
+
+- if (ieee80211_is_data(fc) && sta) {
++ if (ieee80211_is_data(fc) && sta &&
++ !(info->control.flags & IEEE80211_TX_CTRL_PORT_CTRL_PROTO)) {
+ tx_cmd->initial_rate_index = 0;
+ tx_cmd->tx_flags |= cpu_to_le32(TX_CMD_FLG_STA_RATE);
+ return;
+diff --git a/drivers/net/wireless/rtlwifi/rtl8192cu/sw.c b/drivers/net/wireless/rtlwifi/rtl8192cu/sw.c
+index 361435f8608a..1ac6383e7947 100644
+--- a/drivers/net/wireless/rtlwifi/rtl8192cu/sw.c
++++ b/drivers/net/wireless/rtlwifi/rtl8192cu/sw.c
+@@ -317,6 +317,7 @@ static struct usb_device_id rtl8192c_usb_ids[] = {
+ {RTL_USB_DEVICE(0x0bda, 0x5088, rtl92cu_hal_cfg)}, /*Thinkware-CC&C*/
+ {RTL_USB_DEVICE(0x0df6, 0x0052, rtl92cu_hal_cfg)}, /*Sitecom - Edimax*/
+ {RTL_USB_DEVICE(0x0df6, 0x005c, rtl92cu_hal_cfg)}, /*Sitecom - Edimax*/
++ {RTL_USB_DEVICE(0x0df6, 0x0070, rtl92cu_hal_cfg)}, /*Sitecom - 150N */
+ {RTL_USB_DEVICE(0x0df6, 0x0077, rtl92cu_hal_cfg)}, /*Sitecom-WLA2100V2*/
+ {RTL_USB_DEVICE(0x0eb0, 0x9071, rtl92cu_hal_cfg)}, /*NO Brand - Etop*/
+ {RTL_USB_DEVICE(0x4856, 0x0091, rtl92cu_hal_cfg)}, /*NetweeN - Feixun*/
+diff --git a/drivers/nfc/microread/microread.c b/drivers/nfc/microread/microread.c
+index f868333271aa..963a4a5dc88e 100644
+--- a/drivers/nfc/microread/microread.c
++++ b/drivers/nfc/microread/microread.c
+@@ -501,9 +501,13 @@ static void microread_target_discovered(struct nfc_hci_dev *hdev, u8 gate,
+ targets->sens_res =
+ be16_to_cpu(*(u16 *)&skb->data[MICROREAD_EMCF_A_ATQA]);
+ targets->sel_res = skb->data[MICROREAD_EMCF_A_SAK];
+- memcpy(targets->nfcid1, &skb->data[MICROREAD_EMCF_A_UID],
+- skb->data[MICROREAD_EMCF_A_LEN]);
+ targets->nfcid1_len = skb->data[MICROREAD_EMCF_A_LEN];
++ if (targets->nfcid1_len > sizeof(targets->nfcid1)) {
++ r = -EINVAL;
++ goto exit_free;
++ }
++ memcpy(targets->nfcid1, &skb->data[MICROREAD_EMCF_A_UID],
++ targets->nfcid1_len);
+ break;
+ case MICROREAD_GATE_ID_MREAD_ISO_A_3:
+ targets->supported_protocols =
+@@ -511,9 +515,13 @@ static void microread_target_discovered(struct nfc_hci_dev *hdev, u8 gate,
+ targets->sens_res =
+ be16_to_cpu(*(u16 *)&skb->data[MICROREAD_EMCF_A3_ATQA]);
+ targets->sel_res = skb->data[MICROREAD_EMCF_A3_SAK];
+- memcpy(targets->nfcid1, &skb->data[MICROREAD_EMCF_A3_UID],
+- skb->data[MICROREAD_EMCF_A3_LEN]);
+ targets->nfcid1_len = skb->data[MICROREAD_EMCF_A3_LEN];
++ if (targets->nfcid1_len > sizeof(targets->nfcid1)) {
++ r = -EINVAL;
++ goto exit_free;
++ }
++ memcpy(targets->nfcid1, &skb->data[MICROREAD_EMCF_A3_UID],
++ targets->nfcid1_len);
+ break;
+ case MICROREAD_GATE_ID_MREAD_ISO_B:
+ targets->supported_protocols = NFC_PROTO_ISO14443_B_MASK;
+diff --git a/drivers/of/fdt.c b/drivers/of/fdt.c
+index 9aa012e6ea0a..379ad4fa9665 100644
+--- a/drivers/of/fdt.c
++++ b/drivers/of/fdt.c
+@@ -453,7 +453,7 @@ static int __init __reserved_mem_reserve_reg(unsigned long node,
+ base = dt_mem_next_cell(dt_root_addr_cells, &prop);
+ size = dt_mem_next_cell(dt_root_size_cells, &prop);
+
+- if (base && size &&
++ if (size &&
+ early_init_dt_reserve_memory_arch(base, size, nomap) == 0)
+ pr_debug("Reserved memory: reserved region for node '%s': base %pa, size %ld MiB\n",
+ uname, &base, (unsigned long)size / SZ_1M);
+diff --git a/drivers/of/irq.c b/drivers/of/irq.c
+index 3e06a699352d..1471e0a223a5 100644
+--- a/drivers/of/irq.c
++++ b/drivers/of/irq.c
+@@ -301,16 +301,17 @@ int of_irq_parse_one(struct device_node *device, int index, struct of_phandle_ar
+ /* Get the reg property (if any) */
+ addr = of_get_property(device, "reg", NULL);
+
++ /* Try the new-style interrupts-extended first */
++ res = of_parse_phandle_with_args(device, "interrupts-extended",
++ "#interrupt-cells", index, out_irq);
++ if (!res)
++ return of_irq_parse_raw(addr, out_irq);
++
+ /* Get the interrupts property */
+ intspec = of_get_property(device, "interrupts", &intlen);
+- if (intspec == NULL) {
+- /* Try the new-style interrupts-extended */
+- res = of_parse_phandle_with_args(device, "interrupts-extended",
+- "#interrupt-cells", index, out_irq);
+- if (res)
+- return -EINVAL;
+- return of_irq_parse_raw(addr, out_irq);
+- }
++ if (intspec == NULL)
++ return -EINVAL;
++
+ intlen /= sizeof(*intspec);
+
+ pr_debug(" intspec=%d intlen=%d\n", be32_to_cpup(intspec), intlen);
+diff --git a/drivers/pci/hotplug/acpiphp_glue.c b/drivers/pci/hotplug/acpiphp_glue.c
+index 602d153c7055..c074b262a492 100644
+--- a/drivers/pci/hotplug/acpiphp_glue.c
++++ b/drivers/pci/hotplug/acpiphp_glue.c
+@@ -573,19 +573,15 @@ static void disable_slot(struct acpiphp_slot *slot)
+ slot->flags &= (~SLOT_ENABLED);
+ }
+
+-static bool acpiphp_no_hotplug(struct acpi_device *adev)
+-{
+- return adev && adev->flags.no_hotplug;
+-}
+-
+ static bool slot_no_hotplug(struct acpiphp_slot *slot)
+ {
+- struct acpiphp_func *func;
++ struct pci_bus *bus = slot->bus;
++ struct pci_dev *dev;
+
+- list_for_each_entry(func, &slot->funcs, sibling)
+- if (acpiphp_no_hotplug(func_to_acpi_device(func)))
++ list_for_each_entry(dev, &bus->devices, bus_list) {
++ if (PCI_SLOT(dev->devfn) == slot->device && dev->ignore_hotplug)
+ return true;
+-
++ }
+ return false;
+ }
+
+@@ -658,7 +654,7 @@ static void trim_stale_devices(struct pci_dev *dev)
+
+ status = acpi_evaluate_integer(adev->handle, "_STA", NULL, &sta);
+ alive = (ACPI_SUCCESS(status) && device_status_valid(sta))
+- || acpiphp_no_hotplug(adev);
++ || dev->ignore_hotplug;
+ }
+ if (!alive)
+ alive = pci_device_is_present(dev);
+diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c
+index 056841651a80..fa6a320b4d58 100644
+--- a/drivers/pci/hotplug/pciehp_hpc.c
++++ b/drivers/pci/hotplug/pciehp_hpc.c
+@@ -508,6 +508,8 @@ static irqreturn_t pcie_isr(int irq, void *dev_id)
+ {
+ struct controller *ctrl = (struct controller *)dev_id;
+ struct pci_dev *pdev = ctrl_dev(ctrl);
++ struct pci_bus *subordinate = pdev->subordinate;
++ struct pci_dev *dev;
+ struct slot *slot = ctrl->slot;
+ u16 detected, intr_loc;
+
+@@ -541,6 +543,16 @@ static irqreturn_t pcie_isr(int irq, void *dev_id)
+ wake_up(&ctrl->queue);
+ }
+
++ if (subordinate) {
++ list_for_each_entry(dev, &subordinate->devices, bus_list) {
++ if (dev->ignore_hotplug) {
++ ctrl_dbg(ctrl, "ignoring hotplug event %#06x (%s requested no hotplug)\n",
++ intr_loc, pci_name(dev));
++ return IRQ_HANDLED;
++ }
++ }
++ }
++
+ if (!(intr_loc & ~PCI_EXP_SLTSTA_CC))
+ return IRQ_HANDLED;
+
+diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
+index e3cf8a2e6292..4170113cde61 100644
+--- a/drivers/pci/probe.c
++++ b/drivers/pci/probe.c
+@@ -775,7 +775,7 @@ int pci_scan_bridge(struct pci_bus *bus, struct pci_dev *dev, int max, int pass)
+ /* Check if setup is sensible at all */
+ if (!pass &&
+ (primary != bus->number || secondary <= bus->number ||
+- secondary > subordinate || subordinate > bus->busn_res.end)) {
++ secondary > subordinate)) {
+ dev_info(&dev->dev, "bridge configuration invalid ([bus %02x-%02x]), reconfiguring\n",
+ secondary, subordinate);
+ broken = 1;
+@@ -838,23 +838,18 @@ int pci_scan_bridge(struct pci_bus *bus, struct pci_dev *dev, int max, int pass)
+ goto out;
+ }
+
+- if (max >= bus->busn_res.end) {
+- dev_warn(&dev->dev, "can't allocate child bus %02x from %pR\n",
+- max, &bus->busn_res);
+- goto out;
+- }
+-
+ /* Clear errors */
+ pci_write_config_word(dev, PCI_STATUS, 0xffff);
+
+- /* The bus will already exist if we are rescanning */
++ /* Prevent assigning a bus number that already exists.
++ * This can happen when a bridge is hot-plugged, so in
++ * this case we only re-scan this bus. */
+ child = pci_find_bus(pci_domain_nr(bus), max+1);
+ if (!child) {
+ child = pci_add_new_bus(bus, dev, max+1);
+ if (!child)
+ goto out;
+- pci_bus_insert_busn_res(child, max+1,
+- bus->busn_res.end);
++ pci_bus_insert_busn_res(child, max+1, 0xff);
+ }
+ max++;
+ buses = (buses & 0xff000000)
+@@ -913,11 +908,6 @@ int pci_scan_bridge(struct pci_bus *bus, struct pci_dev *dev, int max, int pass)
+ /*
+ * Set the subordinate bus number to its real value.
+ */
+- if (max > bus->busn_res.end) {
+- dev_warn(&dev->dev, "max busn %02x is outside %pR\n",
+- max, &bus->busn_res);
+- max = bus->busn_res.end;
+- }
+ pci_bus_update_busn_res_end(child, max);
+ pci_write_config_byte(dev, PCI_SUBORDINATE_BUS, max);
+ }
+diff --git a/drivers/phy/phy-twl4030-usb.c b/drivers/phy/phy-twl4030-usb.c
+index 2e0e9b3774c8..ef4f3350faa1 100644
+--- a/drivers/phy/phy-twl4030-usb.c
++++ b/drivers/phy/phy-twl4030-usb.c
+@@ -34,6 +34,7 @@
+ #include <linux/delay.h>
+ #include <linux/usb/otg.h>
+ #include <linux/phy/phy.h>
++#include <linux/pm_runtime.h>
+ #include <linux/usb/musb-omap.h>
+ #include <linux/usb/ulpi.h>
+ #include <linux/i2c/twl.h>
+@@ -422,37 +423,55 @@ static void twl4030_phy_power(struct twl4030_usb *twl, int on)
+ }
+ }
+
+-static int twl4030_phy_power_off(struct phy *phy)
++static int twl4030_usb_runtime_suspend(struct device *dev)
+ {
+- struct twl4030_usb *twl = phy_get_drvdata(phy);
++ struct twl4030_usb *twl = dev_get_drvdata(dev);
+
++ dev_dbg(twl->dev, "%s\n", __func__);
+ if (twl->asleep)
+ return 0;
+
+ twl4030_phy_power(twl, 0);
+ twl->asleep = 1;
+- dev_dbg(twl->dev, "%s\n", __func__);
++
+ return 0;
+ }
+
+-static void __twl4030_phy_power_on(struct twl4030_usb *twl)
++static int twl4030_usb_runtime_resume(struct device *dev)
+ {
++ struct twl4030_usb *twl = dev_get_drvdata(dev);
++
++ dev_dbg(twl->dev, "%s\n", __func__);
++ if (!twl->asleep)
++ return 0;
++
+ twl4030_phy_power(twl, 1);
+- twl4030_i2c_access(twl, 1);
+- twl4030_usb_set_mode(twl, twl->usb_mode);
+- if (twl->usb_mode == T2_USB_MODE_ULPI)
+- twl4030_i2c_access(twl, 0);
++ twl->asleep = 0;
++
++ return 0;
++}
++
++static int twl4030_phy_power_off(struct phy *phy)
++{
++ struct twl4030_usb *twl = phy_get_drvdata(phy);
++
++ dev_dbg(twl->dev, "%s\n", __func__);
++ pm_runtime_mark_last_busy(twl->dev);
++ pm_runtime_put_autosuspend(twl->dev);
++
++ return 0;
+ }
+
+ static int twl4030_phy_power_on(struct phy *phy)
+ {
+ struct twl4030_usb *twl = phy_get_drvdata(phy);
+
+- if (!twl->asleep)
+- return 0;
+- __twl4030_phy_power_on(twl);
+- twl->asleep = 0;
+ dev_dbg(twl->dev, "%s\n", __func__);
++ pm_runtime_get_sync(twl->dev);
++ twl4030_i2c_access(twl, 1);
++ twl4030_usb_set_mode(twl, twl->usb_mode);
++ if (twl->usb_mode == T2_USB_MODE_ULPI)
++ twl4030_i2c_access(twl, 0);
+
+ /*
+ * XXX When VBUS gets driven after musb goes to A mode,
+@@ -558,9 +577,27 @@ static irqreturn_t twl4030_usb_irq(int irq, void *_twl)
+ * USB_LINK_VBUS state. musb_hdrc won't care until it
+ * starts to handle softconnect right.
+ */
++ if ((status == OMAP_MUSB_VBUS_VALID) ||
++ (status == OMAP_MUSB_ID_GROUND)) {
++ if (twl->asleep)
++ pm_runtime_get_sync(twl->dev);
++ } else {
++ if (!twl->asleep) {
++ pm_runtime_mark_last_busy(twl->dev);
++ pm_runtime_put_autosuspend(twl->dev);
++ }
++ }
+ omap_musb_mailbox(status);
+ }
+- sysfs_notify(&twl->dev->kobj, NULL, "vbus");
++
++ /* don't schedule during sleep - irq works right then */
++ if (status == OMAP_MUSB_ID_GROUND && !twl->asleep) {
++ cancel_delayed_work(&twl->id_workaround_work);
++ schedule_delayed_work(&twl->id_workaround_work, HZ);
++ }
++
++ if (irq)
++ sysfs_notify(&twl->dev->kobj, NULL, "vbus");
+
+ return IRQ_HANDLED;
+ }
+@@ -569,29 +606,8 @@ static void twl4030_id_workaround_work(struct work_struct *work)
+ {
+ struct twl4030_usb *twl = container_of(work, struct twl4030_usb,
+ id_workaround_work.work);
+- enum omap_musb_vbus_id_status status;
+- bool status_changed = false;
+-
+- status = twl4030_usb_linkstat(twl);
+-
+- spin_lock_irq(&twl->lock);
+- if (status >= 0 && status != twl->linkstat) {
+- twl->linkstat = status;
+- status_changed = true;
+- }
+- spin_unlock_irq(&twl->lock);
+-
+- if (status_changed) {
+- dev_dbg(twl->dev, "handle missing status change to %d\n",
+- status);
+- omap_musb_mailbox(status);
+- }
+
+- /* don't schedule during sleep - irq works right then */
+- if (status == OMAP_MUSB_ID_GROUND && !twl->asleep) {
+- cancel_delayed_work(&twl->id_workaround_work);
+- schedule_delayed_work(&twl->id_workaround_work, HZ);
+- }
++ twl4030_usb_irq(0, twl);
+ }
+
+ static int twl4030_phy_init(struct phy *phy)
+@@ -599,22 +615,17 @@ static int twl4030_phy_init(struct phy *phy)
+ struct twl4030_usb *twl = phy_get_drvdata(phy);
+ enum omap_musb_vbus_id_status status;
+
+- /*
+- * Start in sleep state, we'll get called through set_suspend()
+- * callback when musb is runtime resumed and it's time to start.
+- */
+- __twl4030_phy_power(twl, 0);
+- twl->asleep = 1;
+-
++ pm_runtime_get_sync(twl->dev);
+ status = twl4030_usb_linkstat(twl);
+ twl->linkstat = status;
+
+- if (status == OMAP_MUSB_ID_GROUND || status == OMAP_MUSB_VBUS_VALID) {
++ if (status == OMAP_MUSB_ID_GROUND || status == OMAP_MUSB_VBUS_VALID)
+ omap_musb_mailbox(twl->linkstat);
+- twl4030_phy_power_on(phy);
+- }
+
+ sysfs_notify(&twl->dev->kobj, NULL, "vbus");
++ pm_runtime_mark_last_busy(twl->dev);
++ pm_runtime_put_autosuspend(twl->dev);
++
+ return 0;
+ }
+
+@@ -650,6 +661,11 @@ static const struct phy_ops ops = {
+ .owner = THIS_MODULE,
+ };
+
++static const struct dev_pm_ops twl4030_usb_pm_ops = {
++ SET_RUNTIME_PM_OPS(twl4030_usb_runtime_suspend,
++ twl4030_usb_runtime_resume, NULL)
++};
++
+ static int twl4030_usb_probe(struct platform_device *pdev)
+ {
+ struct twl4030_usb_data *pdata = dev_get_platdata(&pdev->dev);
+@@ -726,6 +742,11 @@ static int twl4030_usb_probe(struct platform_device *pdev)
+
+ ATOMIC_INIT_NOTIFIER_HEAD(&twl->phy.notifier);
+
++ pm_runtime_use_autosuspend(&pdev->dev);
++ pm_runtime_set_autosuspend_delay(&pdev->dev, 2000);
++ pm_runtime_enable(&pdev->dev);
++ pm_runtime_get_sync(&pdev->dev);
++
+ /* Our job is to use irqs and status from the power module
+ * to keep the transceiver disabled when nothing's connected.
+ *
+@@ -744,6 +765,9 @@ static int twl4030_usb_probe(struct platform_device *pdev)
+ return status;
+ }
+
++ pm_runtime_mark_last_busy(&pdev->dev);
++ pm_runtime_put_autosuspend(twl->dev);
++
+ dev_info(&pdev->dev, "Initialized TWL4030 USB module\n");
+ return 0;
+ }
+@@ -753,6 +777,7 @@ static int twl4030_usb_remove(struct platform_device *pdev)
+ struct twl4030_usb *twl = platform_get_drvdata(pdev);
+ int val;
+
++ pm_runtime_get_sync(twl->dev);
+ cancel_delayed_work(&twl->id_workaround_work);
+ device_remove_file(twl->dev, &dev_attr_vbus);
+
+@@ -772,9 +797,8 @@ static int twl4030_usb_remove(struct platform_device *pdev)
+
+ /* disable complete OTG block */
+ twl4030_usb_clear_bits(twl, POWER_CTRL, POWER_CTRL_OTG_ENAB);
+-
+- if (!twl->asleep)
+- twl4030_phy_power(twl, 0);
++ pm_runtime_mark_last_busy(twl->dev);
++ pm_runtime_put(twl->dev);
+
+ return 0;
+ }
+@@ -792,6 +816,7 @@ static struct platform_driver twl4030_usb_driver = {
+ .remove = twl4030_usb_remove,
+ .driver = {
+ .name = "twl4030_usb",
++ .pm = &twl4030_usb_pm_ops,
+ .owner = THIS_MODULE,
+ .of_match_table = of_match_ptr(twl4030_usb_id_table),
+ },
+diff --git a/drivers/pwm/core.c b/drivers/pwm/core.c
+index 4b66bf09ee55..d2c35920ff08 100644
+--- a/drivers/pwm/core.c
++++ b/drivers/pwm/core.c
+@@ -606,6 +606,8 @@ struct pwm_device *pwm_get(struct device *dev, const char *con_id)
+ unsigned int best = 0;
+ struct pwm_lookup *p;
+ unsigned int match;
++ unsigned int period;
++ enum pwm_polarity polarity;
+
+ /* look up via DT first */
+ if (IS_ENABLED(CONFIG_OF) && dev && dev->of_node)
+@@ -653,6 +655,8 @@ struct pwm_device *pwm_get(struct device *dev, const char *con_id)
+ if (match > best) {
+ chip = pwmchip_find_by_name(p->provider);
+ index = p->index;
++ period = p->period;
++ polarity = p->polarity;
+
+ if (match != 3)
+ best = match;
+@@ -668,8 +672,8 @@ struct pwm_device *pwm_get(struct device *dev, const char *con_id)
+ if (IS_ERR(pwm))
+ return pwm;
+
+- pwm_set_period(pwm, p->period);
+- pwm_set_polarity(pwm, p->polarity);
++ pwm_set_period(pwm, period);
++ pwm_set_polarity(pwm, polarity);
+
+
+ return pwm;
+diff --git a/drivers/scsi/libiscsi.c b/drivers/scsi/libiscsi.c
+index 3d1bc67bac9d..874bc950b9f6 100644
+--- a/drivers/scsi/libiscsi.c
++++ b/drivers/scsi/libiscsi.c
+@@ -717,11 +717,21 @@ __iscsi_conn_send_pdu(struct iscsi_conn *conn, struct iscsi_hdr *hdr,
+ return NULL;
+ }
+
++ if (data_size > ISCSI_DEF_MAX_RECV_SEG_LEN) {
++ iscsi_conn_printk(KERN_ERR, conn, "Invalid buffer len of %u for login task. Max len is %u\n", data_size, ISCSI_DEF_MAX_RECV_SEG_LEN);
++ return NULL;
++ }
++
+ task = conn->login_task;
+ } else {
+ if (session->state != ISCSI_STATE_LOGGED_IN)
+ return NULL;
+
++ if (data_size != 0) {
++ iscsi_conn_printk(KERN_ERR, conn, "Can not send data buffer of len %u for op 0x%x\n", data_size, opcode);
++ return NULL;
++ }
++
+ BUG_ON(conn->c_stage == ISCSI_CONN_INITIAL_STAGE);
+ BUG_ON(conn->c_stage == ISCSI_CONN_STOPPED);
+
+diff --git a/drivers/spi/spi-dw-pci.c b/drivers/spi/spi-dw-pci.c
+index 3f3dc1226edf..e14960470d8d 100644
+--- a/drivers/spi/spi-dw-pci.c
++++ b/drivers/spi/spi-dw-pci.c
+@@ -62,6 +62,8 @@ static int spi_pci_probe(struct pci_dev *pdev,
+ if (ret)
+ return ret;
+
++ dws->regs = pcim_iomap_table(pdev)[pci_bar];
++
+ dws->bus_num = 0;
+ dws->num_cs = 4;
+ dws->irq = pdev->irq;
+diff --git a/drivers/spi/spi-dw.c b/drivers/spi/spi-dw.c
+index 29f33143b795..0dd0623319b0 100644
+--- a/drivers/spi/spi-dw.c
++++ b/drivers/spi/spi-dw.c
+@@ -271,7 +271,7 @@ static void giveback(struct dw_spi *dws)
+ transfer_list);
+
+ if (!last_transfer->cs_change)
+- spi_chip_sel(dws, dws->cur_msg->spi, 0);
++ spi_chip_sel(dws, msg->spi, 0);
+
+ spi_finalize_current_message(dws->master);
+ }
+@@ -547,8 +547,7 @@ static int dw_spi_setup(struct spi_device *spi)
+ /* Only alloc on first setup */
+ chip = spi_get_ctldata(spi);
+ if (!chip) {
+- chip = devm_kzalloc(&spi->dev, sizeof(struct chip_data),
+- GFP_KERNEL);
++ chip = kzalloc(sizeof(struct chip_data), GFP_KERNEL);
+ if (!chip)
+ return -ENOMEM;
+ spi_set_ctldata(spi, chip);
+@@ -606,6 +605,14 @@ static int dw_spi_setup(struct spi_device *spi)
+ return 0;
+ }
+
++static void dw_spi_cleanup(struct spi_device *spi)
++{
++ struct chip_data *chip = spi_get_ctldata(spi);
++
++ kfree(chip);
++ spi_set_ctldata(spi, NULL);
++}
++
+ /* Restart the controller, disable all interrupts, clean rx fifo */
+ static void spi_hw_init(struct dw_spi *dws)
+ {
+@@ -661,6 +668,7 @@ int dw_spi_add_host(struct device *dev, struct dw_spi *dws)
+ master->bus_num = dws->bus_num;
+ master->num_chipselect = dws->num_cs;
+ master->setup = dw_spi_setup;
++ master->cleanup = dw_spi_cleanup;
+ master->transfer_one_message = dw_spi_transfer_one_message;
+ master->max_speed_hz = dws->max_freq;
+
+diff --git a/drivers/spi/spi-fsl-espi.c b/drivers/spi/spi-fsl-espi.c
+index 8ebd724e4c59..429e11190265 100644
+--- a/drivers/spi/spi-fsl-espi.c
++++ b/drivers/spi/spi-fsl-espi.c
+@@ -452,16 +452,16 @@ static int fsl_espi_setup(struct spi_device *spi)
+ int retval;
+ u32 hw_mode;
+ u32 loop_mode;
+- struct spi_mpc8xxx_cs *cs = spi->controller_state;
++ struct spi_mpc8xxx_cs *cs = spi_get_ctldata(spi);
+
+ if (!spi->max_speed_hz)
+ return -EINVAL;
+
+ if (!cs) {
+- cs = devm_kzalloc(&spi->dev, sizeof(*cs), GFP_KERNEL);
++ cs = kzalloc(sizeof(*cs), GFP_KERNEL);
+ if (!cs)
+ return -ENOMEM;
+- spi->controller_state = cs;
++ spi_set_ctldata(spi, cs);
+ }
+
+ mpc8xxx_spi = spi_master_get_devdata(spi->master);
+@@ -496,6 +496,14 @@ static int fsl_espi_setup(struct spi_device *spi)
+ return 0;
+ }
+
++static void fsl_espi_cleanup(struct spi_device *spi)
++{
++ struct spi_mpc8xxx_cs *cs = spi_get_ctldata(spi);
++
++ kfree(cs);
++ spi_set_ctldata(spi, NULL);
++}
++
+ void fsl_espi_cpu_irq(struct mpc8xxx_spi *mspi, u32 events)
+ {
+ struct fsl_espi_reg *reg_base = mspi->reg_base;
+@@ -605,6 +613,7 @@ static struct spi_master * fsl_espi_probe(struct device *dev,
+
+ master->bits_per_word_mask = SPI_BPW_RANGE_MASK(4, 16);
+ master->setup = fsl_espi_setup;
++ master->cleanup = fsl_espi_cleanup;
+
+ mpc8xxx_spi = spi_master_get_devdata(master);
+ mpc8xxx_spi->spi_do_one_msg = fsl_espi_do_one_msg;
+diff --git a/drivers/spi/spi-fsl-spi.c b/drivers/spi/spi-fsl-spi.c
+index 98ccd231bf00..bea26b719361 100644
+--- a/drivers/spi/spi-fsl-spi.c
++++ b/drivers/spi/spi-fsl-spi.c
+@@ -425,16 +425,16 @@ static int fsl_spi_setup(struct spi_device *spi)
+ struct fsl_spi_reg *reg_base;
+ int retval;
+ u32 hw_mode;
+- struct spi_mpc8xxx_cs *cs = spi->controller_state;
++ struct spi_mpc8xxx_cs *cs = spi_get_ctldata(spi);
+
+ if (!spi->max_speed_hz)
+ return -EINVAL;
+
+ if (!cs) {
+- cs = devm_kzalloc(&spi->dev, sizeof(*cs), GFP_KERNEL);
++ cs = kzalloc(sizeof(*cs), GFP_KERNEL);
+ if (!cs)
+ return -ENOMEM;
+- spi->controller_state = cs;
++ spi_set_ctldata(spi, cs);
+ }
+ mpc8xxx_spi = spi_master_get_devdata(spi->master);
+
+@@ -496,9 +496,13 @@ static int fsl_spi_setup(struct spi_device *spi)
+ static void fsl_spi_cleanup(struct spi_device *spi)
+ {
+ struct mpc8xxx_spi *mpc8xxx_spi = spi_master_get_devdata(spi->master);
++ struct spi_mpc8xxx_cs *cs = spi_get_ctldata(spi);
+
+ if (mpc8xxx_spi->type == TYPE_GRLIB && gpio_is_valid(spi->cs_gpio))
+ gpio_free(spi->cs_gpio);
++
++ kfree(cs);
++ spi_set_ctldata(spi, NULL);
+ }
+
+ static void fsl_spi_cpu_irq(struct mpc8xxx_spi *mspi, u32 events)
+diff --git a/drivers/spi/spi-omap2-mcspi.c b/drivers/spi/spi-omap2-mcspi.c
+index 68441fa448de..352eed7463ac 100644
+--- a/drivers/spi/spi-omap2-mcspi.c
++++ b/drivers/spi/spi-omap2-mcspi.c
+@@ -329,7 +329,8 @@ static void omap2_mcspi_set_fifo(const struct spi_device *spi,
+ disable_fifo:
+ if (t->rx_buf != NULL)
+ chconf &= ~OMAP2_MCSPI_CHCONF_FFER;
+- else
++
++ if (t->tx_buf != NULL)
+ chconf &= ~OMAP2_MCSPI_CHCONF_FFET;
+
+ mcspi_write_chconf0(spi, chconf);
+diff --git a/drivers/spi/spi-sirf.c b/drivers/spi/spi-sirf.c
+index 95ac276eaafe..1a5161336730 100644
+--- a/drivers/spi/spi-sirf.c
++++ b/drivers/spi/spi-sirf.c
+@@ -438,7 +438,8 @@ static void spi_sirfsoc_pio_transfer(struct spi_device *spi,
+ sspi->tx_word(sspi);
+ writel(SIRFSOC_SPI_TXFIFO_EMPTY_INT_EN |
+ SIRFSOC_SPI_TX_UFLOW_INT_EN |
+- SIRFSOC_SPI_RX_OFLOW_INT_EN,
++ SIRFSOC_SPI_RX_OFLOW_INT_EN |
++ SIRFSOC_SPI_RX_IO_DMA_INT_EN,
+ sspi->base + SIRFSOC_SPI_INT_EN);
+ writel(SIRFSOC_SPI_RX_EN | SIRFSOC_SPI_TX_EN,
+ sspi->base + SIRFSOC_SPI_TX_RX_EN);
+diff --git a/drivers/staging/iio/meter/ade7758_trigger.c b/drivers/staging/iio/meter/ade7758_trigger.c
+index 7a94ddd42f59..8c4f2896cd0d 100644
+--- a/drivers/staging/iio/meter/ade7758_trigger.c
++++ b/drivers/staging/iio/meter/ade7758_trigger.c
+@@ -85,7 +85,7 @@ int ade7758_probe_trigger(struct iio_dev *indio_dev)
+ ret = iio_trigger_register(st->trig);
+
+ /* select default trigger */
+- indio_dev->trig = st->trig;
++ indio_dev->trig = iio_trigger_get(st->trig);
+ if (ret)
+ goto error_free_irq;
+
+diff --git a/drivers/staging/imx-drm/imx-ldb.c b/drivers/staging/imx-drm/imx-ldb.c
+index 7e3f019d7e72..4662e00b456a 100644
+--- a/drivers/staging/imx-drm/imx-ldb.c
++++ b/drivers/staging/imx-drm/imx-ldb.c
+@@ -574,6 +574,9 @@ static void imx_ldb_unbind(struct device *dev, struct device *master,
+ for (i = 0; i < 2; i++) {
+ struct imx_ldb_channel *channel = &imx_ldb->channel[i];
+
++ if (!channel->connector.funcs)
++ continue;
++
+ channel->connector.funcs->destroy(&channel->connector);
+ channel->encoder.funcs->destroy(&channel->encoder);
+ }
+diff --git a/drivers/staging/imx-drm/ipuv3-plane.c b/drivers/staging/imx-drm/ipuv3-plane.c
+index 6f393a11f44d..50de10a550e9 100644
+--- a/drivers/staging/imx-drm/ipuv3-plane.c
++++ b/drivers/staging/imx-drm/ipuv3-plane.c
+@@ -281,7 +281,8 @@ static void ipu_plane_dpms(struct ipu_plane *ipu_plane, int mode)
+
+ ipu_idmac_put(ipu_plane->ipu_ch);
+ ipu_dmfc_put(ipu_plane->dmfc);
+- ipu_dp_put(ipu_plane->dp);
++ if (ipu_plane->dp)
++ ipu_dp_put(ipu_plane->dp);
+ }
+ }
+
+diff --git a/drivers/staging/lustre/lustre/Kconfig b/drivers/staging/lustre/lustre/Kconfig
+index 209e4c7e6f8a..4f65ba1158bf 100644
+--- a/drivers/staging/lustre/lustre/Kconfig
++++ b/drivers/staging/lustre/lustre/Kconfig
+@@ -57,4 +57,5 @@ config LUSTRE_TRANSLATE_ERRNOS
+ config LUSTRE_LLITE_LLOOP
+ tristate "Lustre virtual block device"
+ depends on LUSTRE_FS && BLOCK
++ depends on !PPC_64K_PAGES && !ARM64_64K_PAGES
+ default m
+diff --git a/drivers/target/iscsi/iscsi_target.c b/drivers/target/iscsi/iscsi_target.c
+index 1f4c794f5fcc..260c3e1e312c 100644
+--- a/drivers/target/iscsi/iscsi_target.c
++++ b/drivers/target/iscsi/iscsi_target.c
+@@ -4540,6 +4540,7 @@ static void iscsit_logout_post_handler_diffcid(
+ {
+ struct iscsi_conn *l_conn;
+ struct iscsi_session *sess = conn->sess;
++ bool conn_found = false;
+
+ if (!sess)
+ return;
+@@ -4548,12 +4549,13 @@ static void iscsit_logout_post_handler_diffcid(
+ list_for_each_entry(l_conn, &sess->sess_conn_list, conn_list) {
+ if (l_conn->cid == cid) {
+ iscsit_inc_conn_usage_count(l_conn);
++ conn_found = true;
+ break;
+ }
+ }
+ spin_unlock_bh(&sess->conn_lock);
+
+- if (!l_conn)
++ if (!conn_found)
+ return;
+
+ if (l_conn->sock)
+diff --git a/drivers/target/iscsi/iscsi_target_parameters.c b/drivers/target/iscsi/iscsi_target_parameters.c
+index 02f9de26f38a..18c29260b4a2 100644
+--- a/drivers/target/iscsi/iscsi_target_parameters.c
++++ b/drivers/target/iscsi/iscsi_target_parameters.c
+@@ -601,7 +601,7 @@ int iscsi_copy_param_list(
+ param_list = kzalloc(sizeof(struct iscsi_param_list), GFP_KERNEL);
+ if (!param_list) {
+ pr_err("Unable to allocate memory for struct iscsi_param_list.\n");
+- goto err_out;
++ return -1;
+ }
+ INIT_LIST_HEAD(&param_list->param_list);
+ INIT_LIST_HEAD(&param_list->extra_response_list);
+diff --git a/drivers/target/target_core_configfs.c b/drivers/target/target_core_configfs.c
+index bf55c5a04cfa..756def38c77a 100644
+--- a/drivers/target/target_core_configfs.c
++++ b/drivers/target/target_core_configfs.c
+@@ -2363,7 +2363,7 @@ static ssize_t target_core_alua_tg_pt_gp_store_attr_alua_support_##_name(\
+ pr_err("Invalid value '%ld', must be '0' or '1'\n", tmp); \
+ return -EINVAL; \
+ } \
+- if (!tmp) \
++ if (tmp) \
+ t->_var |= _bit; \
+ else \
+ t->_var &= ~_bit; \
+diff --git a/drivers/tty/serial/atmel_serial.c b/drivers/tty/serial/atmel_serial.c
+index c4f750314100..ffefec83a02f 100644
+--- a/drivers/tty/serial/atmel_serial.c
++++ b/drivers/tty/serial/atmel_serial.c
+@@ -527,6 +527,45 @@ static void atmel_enable_ms(struct uart_port *port)
+ }
+
+ /*
++ * Disable modem status interrupts
++ */
++static void atmel_disable_ms(struct uart_port *port)
++{
++ struct atmel_uart_port *atmel_port = to_atmel_uart_port(port);
++ uint32_t idr = 0;
++
++ /*
++ * Interrupt should not be disabled twice
++ */
++ if (!atmel_port->ms_irq_enabled)
++ return;
++
++ atmel_port->ms_irq_enabled = false;
++
++ if (atmel_port->gpio_irq[UART_GPIO_CTS] >= 0)
++ disable_irq(atmel_port->gpio_irq[UART_GPIO_CTS]);
++ else
++ idr |= ATMEL_US_CTSIC;
++
++ if (atmel_port->gpio_irq[UART_GPIO_DSR] >= 0)
++ disable_irq(atmel_port->gpio_irq[UART_GPIO_DSR]);
++ else
++ idr |= ATMEL_US_DSRIC;
++
++ if (atmel_port->gpio_irq[UART_GPIO_RI] >= 0)
++ disable_irq(atmel_port->gpio_irq[UART_GPIO_RI]);
++ else
++ idr |= ATMEL_US_RIIC;
++
++ if (atmel_port->gpio_irq[UART_GPIO_DCD] >= 0)
++ disable_irq(atmel_port->gpio_irq[UART_GPIO_DCD]);
++ else
++ idr |= ATMEL_US_DCDIC;
++
++ UART_PUT_IDR(port, idr);
++}
++
++/*
+ * Control the transmission of a break signal
+ */
+ static void atmel_break_ctl(struct uart_port *port, int break_state)
+@@ -1993,7 +2032,9 @@ static void atmel_set_termios(struct uart_port *port, struct ktermios *termios,
+
+ /* CTS flow-control and modem-status interrupts */
+ if (UART_ENABLE_MS(port, termios->c_cflag))
+- port->ops->enable_ms(port);
++ atmel_enable_ms(port);
++ else
++ atmel_disable_ms(port);
+
+ spin_unlock_irqrestore(&port->lock, flags);
+ }
+diff --git a/drivers/usb/chipidea/ci_hdrc_msm.c b/drivers/usb/chipidea/ci_hdrc_msm.c
+index d72b9d2de2c5..4935ac38fd00 100644
+--- a/drivers/usb/chipidea/ci_hdrc_msm.c
++++ b/drivers/usb/chipidea/ci_hdrc_msm.c
+@@ -20,13 +20,13 @@
+ static void ci_hdrc_msm_notify_event(struct ci_hdrc *ci, unsigned event)
+ {
+ struct device *dev = ci->gadget.dev.parent;
+- int val;
+
+ switch (event) {
+ case CI_HDRC_CONTROLLER_RESET_EVENT:
+ dev_dbg(dev, "CI_HDRC_CONTROLLER_RESET_EVENT received\n");
+ writel(0, USB_AHBBURST);
+ writel(0, USB_AHBMODE);
++ usb_phy_init(ci->transceiver);
+ break;
+ case CI_HDRC_CONTROLLER_STOPPED_EVENT:
+ dev_dbg(dev, "CI_HDRC_CONTROLLER_STOPPED_EVENT received\n");
+@@ -34,10 +34,7 @@ static void ci_hdrc_msm_notify_event(struct ci_hdrc *ci, unsigned event)
+ * Put the transceiver in non-driving mode. Otherwise host
+ * may not detect soft-disconnection.
+ */
+- val = usb_phy_io_read(ci->transceiver, ULPI_FUNC_CTRL);
+- val &= ~ULPI_FUNC_CTRL_OPMODE_MASK;
+- val |= ULPI_FUNC_CTRL_OPMODE_NONDRIVING;
+- usb_phy_io_write(ci->transceiver, val, ULPI_FUNC_CTRL);
++ usb_phy_notify_disconnect(ci->transceiver, USB_SPEED_UNKNOWN);
+ break;
+ default:
+ dev_dbg(dev, "unknown ci_hdrc event\n");
+diff --git a/drivers/usb/core/hub.c b/drivers/usb/core/hub.c
+index 27f217107ef1..50e854509f55 100644
+--- a/drivers/usb/core/hub.c
++++ b/drivers/usb/core/hub.c
+@@ -5008,9 +5008,10 @@ static void hub_events(void)
+
+ hub = list_entry(tmp, struct usb_hub, event_list);
+ kref_get(&hub->kref);
++ hdev = hub->hdev;
++ usb_get_dev(hdev);
+ spin_unlock_irq(&hub_event_lock);
+
+- hdev = hub->hdev;
+ hub_dev = hub->intfdev;
+ intf = to_usb_interface(hub_dev);
+ dev_dbg(hub_dev, "state %d ports %d chg %04x evt %04x\n",
+@@ -5123,6 +5124,7 @@ static void hub_events(void)
+ usb_autopm_put_interface(intf);
+ loop_disconnected:
+ usb_unlock_device(hdev);
++ usb_put_dev(hdev);
+ kref_put(&hub->kref, hub_release);
+
+ } /* end while (1) */
+diff --git a/drivers/usb/dwc2/gadget.c b/drivers/usb/dwc2/gadget.c
+index f3c56a2fed5b..a0d2f31b30cc 100644
+--- a/drivers/usb/dwc2/gadget.c
++++ b/drivers/usb/dwc2/gadget.c
+@@ -1650,6 +1650,7 @@ static void s3c_hsotg_txfifo_flush(struct s3c_hsotg *hsotg, unsigned int idx)
+ dev_err(hsotg->dev,
+ "%s: timeout flushing fifo (GRSTCTL=%08x)\n",
+ __func__, val);
++ break;
+ }
+
+ udelay(1);
+@@ -2748,13 +2749,14 @@ static void s3c_hsotg_phy_enable(struct s3c_hsotg *hsotg)
+
+ dev_dbg(hsotg->dev, "pdev 0x%p\n", pdev);
+
+- if (hsotg->phy) {
+- phy_init(hsotg->phy);
+- phy_power_on(hsotg->phy);
+- } else if (hsotg->uphy)
++ if (hsotg->uphy)
+ usb_phy_init(hsotg->uphy);
+- else if (hsotg->plat->phy_init)
++ else if (hsotg->plat && hsotg->plat->phy_init)
+ hsotg->plat->phy_init(pdev, hsotg->plat->phy_type);
++ else {
++ phy_init(hsotg->phy);
++ phy_power_on(hsotg->phy);
++ }
+ }
+
+ /**
+@@ -2768,13 +2770,14 @@ static void s3c_hsotg_phy_disable(struct s3c_hsotg *hsotg)
+ {
+ struct platform_device *pdev = to_platform_device(hsotg->dev);
+
+- if (hsotg->phy) {
+- phy_power_off(hsotg->phy);
+- phy_exit(hsotg->phy);
+- } else if (hsotg->uphy)
++ if (hsotg->uphy)
+ usb_phy_shutdown(hsotg->uphy);
+- else if (hsotg->plat->phy_exit)
++ else if (hsotg->plat && hsotg->plat->phy_exit)
+ hsotg->plat->phy_exit(pdev, hsotg->plat->phy_type);
++ else {
++ phy_power_off(hsotg->phy);
++ phy_exit(hsotg->phy);
++ }
+ }
+
+ /**
+@@ -2893,13 +2896,11 @@ static int s3c_hsotg_udc_stop(struct usb_gadget *gadget,
+ return -ENODEV;
+
+ /* all endpoints should be shutdown */
+- for (ep = 0; ep < hsotg->num_of_eps; ep++)
++ for (ep = 1; ep < hsotg->num_of_eps; ep++)
+ s3c_hsotg_ep_disable(&hsotg->eps[ep].ep);
+
+ spin_lock_irqsave(&hsotg->lock, flags);
+
+- s3c_hsotg_phy_disable(hsotg);
+-
+ if (!driver)
+ hsotg->driver = NULL;
+
+@@ -2942,7 +2943,6 @@ static int s3c_hsotg_pullup(struct usb_gadget *gadget, int is_on)
+ s3c_hsotg_phy_enable(hsotg);
+ s3c_hsotg_core_init(hsotg);
+ } else {
+- s3c_hsotg_disconnect(hsotg);
+ s3c_hsotg_phy_disable(hsotg);
+ }
+
+@@ -3444,13 +3444,6 @@ static int s3c_hsotg_probe(struct platform_device *pdev)
+
+ hsotg->irq = ret;
+
+- ret = devm_request_irq(&pdev->dev, hsotg->irq, s3c_hsotg_irq, 0,
+- dev_name(dev), hsotg);
+- if (ret < 0) {
+- dev_err(dev, "cannot claim IRQ\n");
+- goto err_clk;
+- }
+-
+ dev_info(dev, "regs %p, irq %d\n", hsotg->regs, hsotg->irq);
+
+ hsotg->gadget.max_speed = USB_SPEED_HIGH;
+@@ -3491,9 +3484,6 @@ static int s3c_hsotg_probe(struct platform_device *pdev)
+ if (hsotg->phy && (phy_get_bus_width(phy) == 8))
+ hsotg->phyif = GUSBCFG_PHYIF8;
+
+- if (hsotg->phy)
+- phy_init(hsotg->phy);
+-
+ /* usb phy enable */
+ s3c_hsotg_phy_enable(hsotg);
+
+@@ -3501,6 +3491,17 @@ static int s3c_hsotg_probe(struct platform_device *pdev)
+ s3c_hsotg_init(hsotg);
+ s3c_hsotg_hw_cfg(hsotg);
+
++ ret = devm_request_irq(&pdev->dev, hsotg->irq, s3c_hsotg_irq, 0,
++ dev_name(dev), hsotg);
++ if (ret < 0) {
++ s3c_hsotg_phy_disable(hsotg);
++ clk_disable_unprepare(hsotg->clk);
++ regulator_bulk_disable(ARRAY_SIZE(hsotg->supplies),
++ hsotg->supplies);
++ dev_err(dev, "cannot claim IRQ\n");
++ goto err_clk;
++ }
++
+ /* hsotg->num_of_eps holds number of EPs other than ep0 */
+
+ if (hsotg->num_of_eps == 0) {
+@@ -3586,9 +3587,6 @@ static int s3c_hsotg_remove(struct platform_device *pdev)
+ usb_gadget_unregister_driver(hsotg->driver);
+ }
+
+- s3c_hsotg_phy_disable(hsotg);
+- if (hsotg->phy)
+- phy_exit(hsotg->phy);
+ clk_disable_unprepare(hsotg->clk);
+
+ return 0;
+diff --git a/drivers/usb/dwc3/core.c b/drivers/usb/dwc3/core.c
+index eb69eb9f06c8..52b30c5b000e 100644
+--- a/drivers/usb/dwc3/core.c
++++ b/drivers/usb/dwc3/core.c
+@@ -786,20 +786,21 @@ static int dwc3_remove(struct platform_device *pdev)
+ {
+ struct dwc3 *dwc = platform_get_drvdata(pdev);
+
++ dwc3_debugfs_exit(dwc);
++ dwc3_core_exit_mode(dwc);
++ dwc3_event_buffers_cleanup(dwc);
++ dwc3_free_event_buffers(dwc);
++
+ usb_phy_set_suspend(dwc->usb2_phy, 1);
+ usb_phy_set_suspend(dwc->usb3_phy, 1);
+ phy_power_off(dwc->usb2_generic_phy);
+ phy_power_off(dwc->usb3_generic_phy);
+
++ dwc3_core_exit(dwc);
++
+ pm_runtime_put_sync(&pdev->dev);
+ pm_runtime_disable(&pdev->dev);
+
+- dwc3_debugfs_exit(dwc);
+- dwc3_core_exit_mode(dwc);
+- dwc3_event_buffers_cleanup(dwc);
+- dwc3_free_event_buffers(dwc);
+- dwc3_core_exit(dwc);
+-
+ return 0;
+ }
+
+diff --git a/drivers/usb/dwc3/dwc3-omap.c b/drivers/usb/dwc3/dwc3-omap.c
+index 07a736acd0f2..3536ad7f1346 100644
+--- a/drivers/usb/dwc3/dwc3-omap.c
++++ b/drivers/usb/dwc3/dwc3-omap.c
+@@ -576,9 +576,9 @@ static int dwc3_omap_remove(struct platform_device *pdev)
+ if (omap->extcon_id_dev.edev)
+ extcon_unregister_interest(&omap->extcon_id_dev);
+ dwc3_omap_disable_irqs(omap);
++ device_for_each_child(&pdev->dev, NULL, dwc3_omap_remove_core);
+ pm_runtime_put_sync(&pdev->dev);
+ pm_runtime_disable(&pdev->dev);
+- device_for_each_child(&pdev->dev, NULL, dwc3_omap_remove_core);
+
+ return 0;
+ }
+diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c
+index dab7927d1009..f5b352a19eb0 100644
+--- a/drivers/usb/dwc3/gadget.c
++++ b/drivers/usb/dwc3/gadget.c
+@@ -527,7 +527,7 @@ static int dwc3_gadget_set_ep_config(struct dwc3 *dwc, struct dwc3_ep *dep,
+ dep->stream_capable = true;
+ }
+
+- if (usb_endpoint_xfer_isoc(desc))
++ if (!usb_endpoint_xfer_control(desc))
+ params.param1 |= DWC3_DEPCFG_XFER_IN_PROGRESS_EN;
+
+ /*
+@@ -2042,12 +2042,6 @@ static void dwc3_endpoint_interrupt(struct dwc3 *dwc,
+ dwc3_endpoint_transfer_complete(dwc, dep, event, 1);
+ break;
+ case DWC3_DEPEVT_XFERINPROGRESS:
+- if (!usb_endpoint_xfer_isoc(dep->endpoint.desc)) {
+- dev_dbg(dwc->dev, "%s is not an Isochronous endpoint\n",
+- dep->name);
+- return;
+- }
+-
+ dwc3_endpoint_transfer_complete(dwc, dep, event, 0);
+ break;
+ case DWC3_DEPEVT_XFERNOTREADY:
+diff --git a/drivers/usb/gadget/f_rndis.c b/drivers/usb/gadget/f_rndis.c
+index 9c41e9515b8e..ddb09dc6d1f2 100644
+--- a/drivers/usb/gadget/f_rndis.c
++++ b/drivers/usb/gadget/f_rndis.c
+@@ -727,6 +727,10 @@ rndis_bind(struct usb_configuration *c, struct usb_function *f)
+ rndis_control_intf.bInterfaceNumber = status;
+ rndis_union_desc.bMasterInterface0 = status;
+
++ if (cdev->use_os_string)
++ f->os_desc_table[0].if_id =
++ rndis_iad_descriptor.bFirstInterface;
++
+ status = usb_interface_id(c, f);
+ if (status < 0)
+ goto fail;
+diff --git a/drivers/usb/host/ehci-hcd.c b/drivers/usb/host/ehci-hcd.c
+index 81cda09b47e3..488a30836c36 100644
+--- a/drivers/usb/host/ehci-hcd.c
++++ b/drivers/usb/host/ehci-hcd.c
+@@ -965,8 +965,6 @@ rescan:
+ }
+
+ qh->exception = 1;
+- if (ehci->rh_state < EHCI_RH_RUNNING)
+- qh->qh_state = QH_STATE_IDLE;
+ switch (qh->qh_state) {
+ case QH_STATE_LINKED:
+ WARN_ON(!list_empty(&qh->qtd_list));
+diff --git a/drivers/usb/host/xhci-hub.c b/drivers/usb/host/xhci-hub.c
+index aa79e8749040..69aece31143a 100644
+--- a/drivers/usb/host/xhci-hub.c
++++ b/drivers/usb/host/xhci-hub.c
+@@ -468,7 +468,8 @@ static void xhci_hub_report_usb2_link_state(u32 *status, u32 status_reg)
+ }
+
+ /* Updates Link Status for super Speed port */
+-static void xhci_hub_report_usb3_link_state(u32 *status, u32 status_reg)
++static void xhci_hub_report_usb3_link_state(struct xhci_hcd *xhci,
++ u32 *status, u32 status_reg)
+ {
+ u32 pls = status_reg & PORT_PLS_MASK;
+
+@@ -507,7 +508,8 @@ static void xhci_hub_report_usb3_link_state(u32 *status, u32 status_reg)
+ * in which sometimes the port enters compliance mode
+ * caused by a delay on the host-device negotiation.
+ */
+- if (pls == USB_SS_PORT_LS_COMP_MOD)
++ if ((xhci->quirks & XHCI_COMP_MODE_QUIRK) &&
++ (pls == USB_SS_PORT_LS_COMP_MOD))
+ pls |= USB_PORT_STAT_CONNECTION;
+ }
+
+@@ -666,7 +668,7 @@ static u32 xhci_get_port_status(struct usb_hcd *hcd,
+ }
+ /* Update Port Link State */
+ if (hcd->speed == HCD_USB3) {
+- xhci_hub_report_usb3_link_state(&status, raw_port_status);
++ xhci_hub_report_usb3_link_state(xhci, &status, raw_port_status);
+ /*
+ * Verify if all USB3 Ports Have entered U0 already.
+ * Delete Compliance Mode Timer if so.
+diff --git a/drivers/usb/host/xhci-mem.c b/drivers/usb/host/xhci-mem.c
+index 8056d90690ee..8936211b161d 100644
+--- a/drivers/usb/host/xhci-mem.c
++++ b/drivers/usb/host/xhci-mem.c
+@@ -1812,6 +1812,7 @@ void xhci_mem_cleanup(struct xhci_hcd *xhci)
+
+ if (xhci->lpm_command)
+ xhci_free_command(xhci, xhci->lpm_command);
++ xhci->lpm_command = NULL;
+ if (xhci->cmd_ring)
+ xhci_ring_free(xhci, xhci->cmd_ring);
+ xhci->cmd_ring = NULL;
+@@ -1819,7 +1820,7 @@ void xhci_mem_cleanup(struct xhci_hcd *xhci)
+ xhci_cleanup_command_queue(xhci);
+
+ num_ports = HCS_MAX_PORTS(xhci->hcs_params1);
+- for (i = 0; i < num_ports; i++) {
++ for (i = 0; i < num_ports && xhci->rh_bw; i++) {
+ struct xhci_interval_bw_table *bwt = &xhci->rh_bw[i].bw_table;
+ for (j = 0; j < XHCI_MAX_INTERVAL; j++) {
+ struct list_head *ep = &bwt->interval_bw[j].endpoints;
+diff --git a/drivers/usb/host/xhci.c b/drivers/usb/host/xhci.c
+index e32cc6cf86dc..2d1284adc987 100644
+--- a/drivers/usb/host/xhci.c
++++ b/drivers/usb/host/xhci.c
+@@ -3982,13 +3982,21 @@ static int __maybe_unused xhci_change_max_exit_latency(struct xhci_hcd *xhci,
+ int ret;
+
+ spin_lock_irqsave(&xhci->lock, flags);
+- if (max_exit_latency == xhci->devs[udev->slot_id]->current_mel) {
++
++ virt_dev = xhci->devs[udev->slot_id];
++
++ /*
++ * virt_dev might not exists yet if xHC resumed from hibernate (S4) and
++ * xHC was re-initialized. Exit latency will be set later after
++ * hub_port_finish_reset() is done and xhci->devs[] are re-allocated
++ */
++
++ if (!virt_dev || max_exit_latency == virt_dev->current_mel) {
+ spin_unlock_irqrestore(&xhci->lock, flags);
+ return 0;
+ }
+
+ /* Attempt to issue an Evaluate Context command to change the MEL. */
+- virt_dev = xhci->devs[udev->slot_id];
+ command = xhci->lpm_command;
+ ctrl_ctx = xhci_get_input_control_ctx(xhci, command->in_ctx);
+ if (!ctrl_ctx) {
+diff --git a/drivers/usb/misc/sisusbvga/sisusb.c b/drivers/usb/misc/sisusbvga/sisusb.c
+index 06b5d77cd9ad..633caf643122 100644
+--- a/drivers/usb/misc/sisusbvga/sisusb.c
++++ b/drivers/usb/misc/sisusbvga/sisusb.c
+@@ -3250,6 +3250,7 @@ static const struct usb_device_id sisusb_table[] = {
+ { USB_DEVICE(0x0711, 0x0918) },
+ { USB_DEVICE(0x0711, 0x0920) },
+ { USB_DEVICE(0x0711, 0x0950) },
++ { USB_DEVICE(0x0711, 0x5200) },
+ { USB_DEVICE(0x182d, 0x021c) },
+ { USB_DEVICE(0x182d, 0x0269) },
+ { }
+diff --git a/drivers/usb/phy/phy-tegra-usb.c b/drivers/usb/phy/phy-tegra-usb.c
+index bbe4f8e6e8d7..8834b70c868c 100644
+--- a/drivers/usb/phy/phy-tegra-usb.c
++++ b/drivers/usb/phy/phy-tegra-usb.c
+@@ -881,8 +881,8 @@ static int utmi_phy_probe(struct tegra_usb_phy *tegra_phy,
+ return -ENOMEM;
+ }
+
+- tegra_phy->config = devm_kzalloc(&pdev->dev,
+- sizeof(*tegra_phy->config), GFP_KERNEL);
++ tegra_phy->config = devm_kzalloc(&pdev->dev, sizeof(*config),
++ GFP_KERNEL);
+ if (!tegra_phy->config) {
+ dev_err(&pdev->dev,
+ "unable to allocate memory for USB UTMIP config\n");
+diff --git a/drivers/usb/serial/ftdi_sio.c b/drivers/usb/serial/ftdi_sio.c
+index 8b0f517abb6b..3614620e09e1 100644
+--- a/drivers/usb/serial/ftdi_sio.c
++++ b/drivers/usb/serial/ftdi_sio.c
+@@ -741,6 +741,7 @@ static const struct usb_device_id id_table_combined[] = {
+ { USB_DEVICE(FTDI_VID, FTDI_NDI_AURORA_SCU_PID),
+ .driver_info = (kernel_ulong_t)&ftdi_NDI_device_quirk },
+ { USB_DEVICE(TELLDUS_VID, TELLDUS_TELLSTICK_PID) },
++ { USB_DEVICE(NOVITUS_VID, NOVITUS_BONO_E_PID) },
+ { USB_DEVICE(RTSYSTEMS_VID, RTSYSTEMS_USB_S03_PID) },
+ { USB_DEVICE(RTSYSTEMS_VID, RTSYSTEMS_USB_59_PID) },
+ { USB_DEVICE(RTSYSTEMS_VID, RTSYSTEMS_USB_57A_PID) },
+@@ -952,6 +953,8 @@ static const struct usb_device_id id_table_combined[] = {
+ { USB_DEVICE(FTDI_VID, FTDI_EKEY_CONV_USB_PID) },
+ /* Infineon Devices */
+ { USB_DEVICE_INTERFACE_NUMBER(INFINEON_VID, INFINEON_TRIBOARD_PID, 1) },
++ /* GE Healthcare devices */
++ { USB_DEVICE(GE_HEALTHCARE_VID, GE_HEALTHCARE_NEMO_TRACKER_PID) },
+ { } /* Terminating entry */
+ };
+
+diff --git a/drivers/usb/serial/ftdi_sio_ids.h b/drivers/usb/serial/ftdi_sio_ids.h
+index 70b0b1d88ae9..5937b2d242f2 100644
+--- a/drivers/usb/serial/ftdi_sio_ids.h
++++ b/drivers/usb/serial/ftdi_sio_ids.h
+@@ -837,6 +837,12 @@
+ #define TELLDUS_TELLSTICK_PID 0x0C30 /* RF control dongle 433 MHz using FT232RL */
+
+ /*
++ * NOVITUS printers
++ */
++#define NOVITUS_VID 0x1a28
++#define NOVITUS_BONO_E_PID 0x6010
++
++/*
+ * RT Systems programming cables for various ham radios
+ */
+ #define RTSYSTEMS_VID 0x2100 /* Vendor ID */
+@@ -1385,3 +1391,9 @@
+ * ekey biometric systems GmbH (http://ekey.net/)
+ */
+ #define FTDI_EKEY_CONV_USB_PID 0xCB08 /* Converter USB */
++
++/*
++ * GE Healthcare devices
++ */
++#define GE_HEALTHCARE_VID 0x1901
++#define GE_HEALTHCARE_NEMO_TRACKER_PID 0x0015
+diff --git a/drivers/usb/serial/option.c b/drivers/usb/serial/option.c
+index a9688940543d..54a8120897a6 100644
+--- a/drivers/usb/serial/option.c
++++ b/drivers/usb/serial/option.c
+@@ -275,8 +275,12 @@ static void option_instat_callback(struct urb *urb);
+ #define ZTE_PRODUCT_MF622 0x0001
+ #define ZTE_PRODUCT_MF628 0x0015
+ #define ZTE_PRODUCT_MF626 0x0031
+-#define ZTE_PRODUCT_MC2718 0xffe8
+ #define ZTE_PRODUCT_AC2726 0xfff1
++#define ZTE_PRODUCT_CDMA_TECH 0xfffe
++#define ZTE_PRODUCT_AC8710T 0xffff
++#define ZTE_PRODUCT_MC2718 0xffe8
++#define ZTE_PRODUCT_AD3812 0xffeb
++#define ZTE_PRODUCT_MC2716 0xffed
+
+ #define BENQ_VENDOR_ID 0x04a5
+ #define BENQ_PRODUCT_H10 0x4068
+@@ -494,6 +498,10 @@ static void option_instat_callback(struct urb *urb);
+ #define INOVIA_VENDOR_ID 0x20a6
+ #define INOVIA_SEW858 0x1105
+
++/* VIA Telecom */
++#define VIATELECOM_VENDOR_ID 0x15eb
++#define VIATELECOM_PRODUCT_CDS7 0x0001
++
+ /* some devices interfaces need special handling due to a number of reasons */
+ enum option_blacklist_reason {
+ OPTION_BLACKLIST_NONE = 0,
+@@ -527,10 +535,18 @@ static const struct option_blacklist_info zte_k3765_z_blacklist = {
+ .reserved = BIT(4),
+ };
+
++static const struct option_blacklist_info zte_ad3812_z_blacklist = {
++ .sendsetup = BIT(0) | BIT(1) | BIT(2),
++};
++
+ static const struct option_blacklist_info zte_mc2718_z_blacklist = {
+ .sendsetup = BIT(1) | BIT(2) | BIT(3) | BIT(4),
+ };
+
++static const struct option_blacklist_info zte_mc2716_z_blacklist = {
++ .sendsetup = BIT(1) | BIT(2) | BIT(3),
++};
++
+ static const struct option_blacklist_info huawei_cdc12_blacklist = {
+ .reserved = BIT(1) | BIT(2),
+ };
+@@ -1070,6 +1086,7 @@ static const struct usb_device_id option_ids[] = {
+ { USB_DEVICE_INTERFACE_CLASS(BANDRICH_VENDOR_ID, BANDRICH_PRODUCT_1012, 0xff) },
+ { USB_DEVICE(KYOCERA_VENDOR_ID, KYOCERA_PRODUCT_KPC650) },
+ { USB_DEVICE(KYOCERA_VENDOR_ID, KYOCERA_PRODUCT_KPC680) },
++ { USB_DEVICE(QUALCOMM_VENDOR_ID, 0x6000)}, /* ZTE AC8700 */
+ { USB_DEVICE(QUALCOMM_VENDOR_ID, 0x6613)}, /* Onda H600/ZTE MF330 */
+ { USB_DEVICE(QUALCOMM_VENDOR_ID, 0x0023)}, /* ONYX 3G device */
+ { USB_DEVICE(QUALCOMM_VENDOR_ID, 0x9000)}, /* SIMCom SIM5218 */
+@@ -1544,13 +1561,18 @@ static const struct usb_device_id option_ids[] = {
+ { USB_DEVICE_AND_INTERFACE_INFO(ZTE_VENDOR_ID, 0xff93, 0xff, 0xff, 0xff) },
+ { USB_DEVICE_AND_INTERFACE_INFO(ZTE_VENDOR_ID, 0xff94, 0xff, 0xff, 0xff) },
+
+- /* NOTE: most ZTE CDMA devices should be driven by zte_ev, not option */
++ { USB_DEVICE_AND_INTERFACE_INFO(ZTE_VENDOR_ID, ZTE_PRODUCT_CDMA_TECH, 0xff, 0xff, 0xff) },
++ { USB_DEVICE_AND_INTERFACE_INFO(ZTE_VENDOR_ID, ZTE_PRODUCT_AC2726, 0xff, 0xff, 0xff) },
++ { USB_DEVICE_AND_INTERFACE_INFO(ZTE_VENDOR_ID, ZTE_PRODUCT_AC8710T, 0xff, 0xff, 0xff) },
+ { USB_DEVICE_AND_INTERFACE_INFO(ZTE_VENDOR_ID, ZTE_PRODUCT_MC2718, 0xff, 0xff, 0xff),
+ .driver_info = (kernel_ulong_t)&zte_mc2718_z_blacklist },
++ { USB_DEVICE_AND_INTERFACE_INFO(ZTE_VENDOR_ID, ZTE_PRODUCT_AD3812, 0xff, 0xff, 0xff),
++ .driver_info = (kernel_ulong_t)&zte_ad3812_z_blacklist },
++ { USB_DEVICE_AND_INTERFACE_INFO(ZTE_VENDOR_ID, ZTE_PRODUCT_MC2716, 0xff, 0xff, 0xff),
++ .driver_info = (kernel_ulong_t)&zte_mc2716_z_blacklist },
+ { USB_VENDOR_AND_INTERFACE_INFO(ZTE_VENDOR_ID, 0xff, 0x02, 0x01) },
+ { USB_VENDOR_AND_INTERFACE_INFO(ZTE_VENDOR_ID, 0xff, 0x02, 0x05) },
+ { USB_VENDOR_AND_INTERFACE_INFO(ZTE_VENDOR_ID, 0xff, 0x86, 0x10) },
+- { USB_DEVICE_AND_INTERFACE_INFO(ZTE_VENDOR_ID, ZTE_PRODUCT_AC2726, 0xff, 0xff, 0xff) },
+
+ { USB_DEVICE(BENQ_VENDOR_ID, BENQ_PRODUCT_H10) },
+ { USB_DEVICE(DLINK_VENDOR_ID, DLINK_PRODUCT_DWM_652) },
+@@ -1724,6 +1746,7 @@ static const struct usb_device_id option_ids[] = {
+ { USB_DEVICE_AND_INTERFACE_INFO(0x07d1, 0x3e01, 0xff, 0xff, 0xff) }, /* D-Link DWM-152/C1 */
+ { USB_DEVICE_AND_INTERFACE_INFO(0x07d1, 0x3e02, 0xff, 0xff, 0xff) }, /* D-Link DWM-156/C1 */
+ { USB_DEVICE(INOVIA_VENDOR_ID, INOVIA_SEW858) },
++ { USB_DEVICE(VIATELECOM_VENDOR_ID, VIATELECOM_PRODUCT_CDS7) },
+ { } /* Terminating entry */
+ };
+ MODULE_DEVICE_TABLE(usb, option_ids);
+@@ -1916,6 +1939,8 @@ static void option_instat_callback(struct urb *urb)
+ dev_dbg(dev, "%s: type %x req %x\n", __func__,
+ req_pkt->bRequestType, req_pkt->bRequest);
+ }
++ } else if (status == -ENOENT || status == -ESHUTDOWN) {
++ dev_dbg(dev, "%s: urb stopped: %d\n", __func__, status);
+ } else
+ dev_err(dev, "%s: error %d\n", __func__, status);
+
+diff --git a/drivers/usb/serial/pl2303.c b/drivers/usb/serial/pl2303.c
+index b3d5a35c0d4b..e9bad928039f 100644
+--- a/drivers/usb/serial/pl2303.c
++++ b/drivers/usb/serial/pl2303.c
+@@ -45,6 +45,7 @@ static const struct usb_device_id id_table[] = {
+ { USB_DEVICE(PL2303_VENDOR_ID, PL2303_PRODUCT_ID_GPRS) },
+ { USB_DEVICE(PL2303_VENDOR_ID, PL2303_PRODUCT_ID_HCR331) },
+ { USB_DEVICE(PL2303_VENDOR_ID, PL2303_PRODUCT_ID_MOTOROLA) },
++ { USB_DEVICE(PL2303_VENDOR_ID, PL2303_PRODUCT_ID_ZTEK) },
+ { USB_DEVICE(IODATA_VENDOR_ID, IODATA_PRODUCT_ID) },
+ { USB_DEVICE(IODATA_VENDOR_ID, IODATA_PRODUCT_ID_RSAQ5) },
+ { USB_DEVICE(ATEN_VENDOR_ID, ATEN_PRODUCT_ID) },
+diff --git a/drivers/usb/serial/pl2303.h b/drivers/usb/serial/pl2303.h
+index 42bc082896ac..71fd9da1d6e7 100644
+--- a/drivers/usb/serial/pl2303.h
++++ b/drivers/usb/serial/pl2303.h
+@@ -22,6 +22,7 @@
+ #define PL2303_PRODUCT_ID_GPRS 0x0609
+ #define PL2303_PRODUCT_ID_HCR331 0x331a
+ #define PL2303_PRODUCT_ID_MOTOROLA 0x0307
++#define PL2303_PRODUCT_ID_ZTEK 0xe1f1
+
+ #define ATEN_VENDOR_ID 0x0557
+ #define ATEN_VENDOR_ID2 0x0547
+diff --git a/drivers/usb/serial/sierra.c b/drivers/usb/serial/sierra.c
+index 6f7f01eb556a..46179a0828eb 100644
+--- a/drivers/usb/serial/sierra.c
++++ b/drivers/usb/serial/sierra.c
+@@ -282,14 +282,19 @@ static const struct usb_device_id id_table[] = {
+ /* Sierra Wireless HSPA Non-Composite Device */
+ { USB_DEVICE_AND_INTERFACE_INFO(0x1199, 0x6892, 0xFF, 0xFF, 0xFF)},
+ { USB_DEVICE(0x1199, 0x6893) }, /* Sierra Wireless Device */
+- { USB_DEVICE(0x1199, 0x68A3), /* Sierra Wireless Direct IP modems */
++ /* Sierra Wireless Direct IP modems */
++ { USB_DEVICE_AND_INTERFACE_INFO(0x1199, 0x68A3, 0xFF, 0xFF, 0xFF),
++ .driver_info = (kernel_ulong_t)&direct_ip_interface_blacklist
++ },
++ { USB_DEVICE_AND_INTERFACE_INFO(0x1199, 0x68AA, 0xFF, 0xFF, 0xFF),
+ .driver_info = (kernel_ulong_t)&direct_ip_interface_blacklist
+ },
+ /* AT&T Direct IP LTE modems */
+ { USB_DEVICE_AND_INTERFACE_INFO(0x0F3D, 0x68AA, 0xFF, 0xFF, 0xFF),
+ .driver_info = (kernel_ulong_t)&direct_ip_interface_blacklist
+ },
+- { USB_DEVICE(0x0f3d, 0x68A3), /* Airprime/Sierra Wireless Direct IP modems */
++ /* Airprime/Sierra Wireless Direct IP modems */
++ { USB_DEVICE_AND_INTERFACE_INFO(0x0F3D, 0x68A3, 0xFF, 0xFF, 0xFF),
+ .driver_info = (kernel_ulong_t)&direct_ip_interface_blacklist
+ },
+
+diff --git a/drivers/usb/serial/usb-serial.c b/drivers/usb/serial/usb-serial.c
+index 02de3110fe94..475723c006f9 100644
+--- a/drivers/usb/serial/usb-serial.c
++++ b/drivers/usb/serial/usb-serial.c
+@@ -764,29 +764,39 @@ static int usb_serial_probe(struct usb_interface *interface,
+ if (usb_endpoint_is_bulk_in(endpoint)) {
+ /* we found a bulk in endpoint */
+ dev_dbg(ddev, "found bulk in on endpoint %d\n", i);
+- bulk_in_endpoint[num_bulk_in] = endpoint;
+- ++num_bulk_in;
++ if (num_bulk_in < MAX_NUM_PORTS) {
++ bulk_in_endpoint[num_bulk_in] = endpoint;
++ ++num_bulk_in;
++ }
+ }
+
+ if (usb_endpoint_is_bulk_out(endpoint)) {
+ /* we found a bulk out endpoint */
+ dev_dbg(ddev, "found bulk out on endpoint %d\n", i);
+- bulk_out_endpoint[num_bulk_out] = endpoint;
+- ++num_bulk_out;
++ if (num_bulk_out < MAX_NUM_PORTS) {
++ bulk_out_endpoint[num_bulk_out] = endpoint;
++ ++num_bulk_out;
++ }
+ }
+
+ if (usb_endpoint_is_int_in(endpoint)) {
+ /* we found a interrupt in endpoint */
+ dev_dbg(ddev, "found interrupt in on endpoint %d\n", i);
+- interrupt_in_endpoint[num_interrupt_in] = endpoint;
+- ++num_interrupt_in;
++ if (num_interrupt_in < MAX_NUM_PORTS) {
++ interrupt_in_endpoint[num_interrupt_in] =
++ endpoint;
++ ++num_interrupt_in;
++ }
+ }
+
+ if (usb_endpoint_is_int_out(endpoint)) {
+ /* we found an interrupt out endpoint */
+ dev_dbg(ddev, "found interrupt out on endpoint %d\n", i);
+- interrupt_out_endpoint[num_interrupt_out] = endpoint;
+- ++num_interrupt_out;
++ if (num_interrupt_out < MAX_NUM_PORTS) {
++ interrupt_out_endpoint[num_interrupt_out] =
++ endpoint;
++ ++num_interrupt_out;
++ }
+ }
+ }
+
+@@ -809,8 +819,10 @@ static int usb_serial_probe(struct usb_interface *interface,
+ if (usb_endpoint_is_int_in(endpoint)) {
+ /* we found a interrupt in endpoint */
+ dev_dbg(ddev, "found interrupt in for Prolific device on separate interface\n");
+- interrupt_in_endpoint[num_interrupt_in] = endpoint;
+- ++num_interrupt_in;
++ if (num_interrupt_in < MAX_NUM_PORTS) {
++ interrupt_in_endpoint[num_interrupt_in] = endpoint;
++ ++num_interrupt_in;
++ }
+ }
+ }
+ }
+@@ -850,6 +862,11 @@ static int usb_serial_probe(struct usb_interface *interface,
+ num_ports = type->num_ports;
+ }
+
++ if (num_ports > MAX_NUM_PORTS) {
++ dev_warn(ddev, "too many ports requested: %d\n", num_ports);
++ num_ports = MAX_NUM_PORTS;
++ }
++
+ serial->num_ports = num_ports;
+ serial->num_bulk_in = num_bulk_in;
+ serial->num_bulk_out = num_bulk_out;
+diff --git a/drivers/usb/serial/zte_ev.c b/drivers/usb/serial/zte_ev.c
+index e40ab739c4a6..c9bb107d5e5c 100644
+--- a/drivers/usb/serial/zte_ev.c
++++ b/drivers/usb/serial/zte_ev.c
+@@ -272,28 +272,16 @@ static void zte_ev_usb_serial_close(struct usb_serial_port *port)
+ }
+
+ static const struct usb_device_id id_table[] = {
+- /* AC8710, AC8710T */
+- { USB_DEVICE_AND_INTERFACE_INFO(0x19d2, 0xffff, 0xff, 0xff, 0xff) },
+- /* AC8700 */
+- { USB_DEVICE_AND_INTERFACE_INFO(0x19d2, 0xfffe, 0xff, 0xff, 0xff) },
+- /* MG880 */
+- { USB_DEVICE(0x19d2, 0xfffd) },
+- { USB_DEVICE(0x19d2, 0xfffc) },
+- { USB_DEVICE(0x19d2, 0xfffb) },
+- /* AC8710_V3 */
++ { USB_DEVICE(0x19d2, 0xffec) },
++ { USB_DEVICE(0x19d2, 0xffee) },
+ { USB_DEVICE(0x19d2, 0xfff6) },
+ { USB_DEVICE(0x19d2, 0xfff7) },
+ { USB_DEVICE(0x19d2, 0xfff8) },
+ { USB_DEVICE(0x19d2, 0xfff9) },
+- { USB_DEVICE(0x19d2, 0xffee) },
+- /* AC2716, MC2716 */
+- { USB_DEVICE_AND_INTERFACE_INFO(0x19d2, 0xffed, 0xff, 0xff, 0xff) },
+- /* AD3812 */
+- { USB_DEVICE_AND_INTERFACE_INFO(0x19d2, 0xffeb, 0xff, 0xff, 0xff) },
+- { USB_DEVICE(0x19d2, 0xffec) },
+- { USB_DEVICE(0x05C6, 0x3197) },
+- { USB_DEVICE(0x05C6, 0x6000) },
+- { USB_DEVICE(0x05C6, 0x9008) },
++ { USB_DEVICE(0x19d2, 0xfffb) },
++ { USB_DEVICE(0x19d2, 0xfffc) },
++ /* MG880 */
++ { USB_DEVICE(0x19d2, 0xfffd) },
+ { },
+ };
+ MODULE_DEVICE_TABLE(usb, id_table);
+diff --git a/drivers/usb/storage/unusual_devs.h b/drivers/usb/storage/unusual_devs.h
+index 80a5b366255f..14137ee543a1 100644
+--- a/drivers/usb/storage/unusual_devs.h
++++ b/drivers/usb/storage/unusual_devs.h
+@@ -101,6 +101,12 @@ UNUSUAL_DEV( 0x03f0, 0x4002, 0x0001, 0x0001,
+ "PhotoSmart R707",
+ USB_SC_DEVICE, USB_PR_DEVICE, NULL, US_FL_FIX_CAPACITY),
+
++UNUSUAL_DEV( 0x03f3, 0x0001, 0x0000, 0x9999,
++ "Adaptec",
++ "USBConnect 2000",
++ USB_SC_DEVICE, USB_PR_DEVICE, usb_stor_euscsi_init,
++ US_FL_SCM_MULT_TARG ),
++
+ /* Reported by Sebastian Kapfer <sebastian_kapfer@gmx.net>
+ * and Olaf Hering <olh@suse.de> (different bcd's, same vendor/product)
+ * for USB floppies that need the SINGLE_LUN enforcement.
+@@ -741,6 +747,12 @@ UNUSUAL_DEV( 0x059b, 0x0001, 0x0100, 0x0100,
+ USB_SC_DEVICE, USB_PR_DEVICE, NULL,
+ US_FL_SINGLE_LUN ),
+
++UNUSUAL_DEV( 0x059b, 0x0040, 0x0100, 0x0100,
++ "Iomega",
++ "Jaz USB Adapter",
++ USB_SC_DEVICE, USB_PR_DEVICE, NULL,
++ US_FL_SINGLE_LUN ),
++
+ /* Reported by <Hendryk.Pfeiffer@gmx.de> */
+ UNUSUAL_DEV( 0x059f, 0x0643, 0x0000, 0x0000,
+ "LaCie",
+@@ -1113,6 +1125,18 @@ UNUSUAL_DEV( 0x0851, 0x1543, 0x0200, 0x0200,
+ USB_SC_DEVICE, USB_PR_DEVICE, NULL,
+ US_FL_NOT_LOCKABLE),
+
++UNUSUAL_DEV( 0x085a, 0x0026, 0x0100, 0x0133,
++ "Xircom",
++ "PortGear USB-SCSI (Mac USB Dock)",
++ USB_SC_DEVICE, USB_PR_DEVICE, usb_stor_euscsi_init,
++ US_FL_SCM_MULT_TARG ),
++
++UNUSUAL_DEV( 0x085a, 0x0028, 0x0100, 0x0133,
++ "Xircom",
++ "PortGear USB to SCSI Converter",
++ USB_SC_DEVICE, USB_PR_DEVICE, usb_stor_euscsi_init,
++ US_FL_SCM_MULT_TARG ),
++
+ /* Submitted by Jan De Luyck <lkml@kcore.org> */
+ UNUSUAL_DEV( 0x08bd, 0x1100, 0x0000, 0x0000,
+ "CITIZEN",
+@@ -1952,6 +1976,14 @@ UNUSUAL_DEV( 0x152d, 0x2329, 0x0100, 0x0100,
+ USB_SC_DEVICE, USB_PR_DEVICE, NULL,
+ US_FL_IGNORE_RESIDUE | US_FL_SANE_SENSE ),
+
++/* Entrega Technologies U1-SC25 (later Xircom PortGear PGSCSI)
++ * and Mac USB Dock USB-SCSI */
++UNUSUAL_DEV( 0x1645, 0x0007, 0x0100, 0x0133,
++ "Entrega Technologies",
++ "USB to SCSI Converter",
++ USB_SC_DEVICE, USB_PR_DEVICE, usb_stor_euscsi_init,
++ US_FL_SCM_MULT_TARG ),
++
+ /* Reported by Robert Schedel <r.schedel@yahoo.de>
+ * Note: this is a 'super top' device like the above 14cd/6600 device */
+ UNUSUAL_DEV( 0x1652, 0x6600, 0x0201, 0x0201,
+@@ -1974,6 +2006,12 @@ UNUSUAL_DEV( 0x177f, 0x0400, 0x0000, 0x0000,
+ USB_SC_DEVICE, USB_PR_DEVICE, NULL,
+ US_FL_BULK_IGNORE_TAG | US_FL_MAX_SECTORS_64 ),
+
++UNUSUAL_DEV( 0x1822, 0x0001, 0x0000, 0x9999,
++ "Ariston Technologies",
++ "iConnect USB to SCSI adapter",
++ USB_SC_DEVICE, USB_PR_DEVICE, usb_stor_euscsi_init,
++ US_FL_SCM_MULT_TARG ),
++
+ /* Reported by Hans de Goede <hdegoede@redhat.com>
+ * These Appotech controllers are found in Picture Frames, they provide a
+ * (buggy) emulation of a cdrom drive which contains the windows software
+diff --git a/drivers/uwb/lc-dev.c b/drivers/uwb/lc-dev.c
+index 80079b8fed15..d0303f0dbe15 100644
+--- a/drivers/uwb/lc-dev.c
++++ b/drivers/uwb/lc-dev.c
+@@ -431,16 +431,19 @@ void uwbd_dev_onair(struct uwb_rc *rc, struct uwb_beca_e *bce)
+ uwb_dev->mac_addr = *bce->mac_addr;
+ uwb_dev->dev_addr = bce->dev_addr;
+ dev_set_name(&uwb_dev->dev, "%s", macbuf);
++
++ /* plug the beacon cache */
++ bce->uwb_dev = uwb_dev;
++ uwb_dev->bce = bce;
++ uwb_bce_get(bce); /* released in uwb_dev_sys_release() */
++
+ result = uwb_dev_add(uwb_dev, &rc->uwb_dev.dev, rc);
+ if (result < 0) {
+ dev_err(dev, "new device %s: cannot instantiate device\n",
+ macbuf);
+ goto error_dev_add;
+ }
+- /* plug the beacon cache */
+- bce->uwb_dev = uwb_dev;
+- uwb_dev->bce = bce;
+- uwb_bce_get(bce); /* released in uwb_dev_sys_release() */
++
+ dev_info(dev, "uwb device (mac %s dev %s) connected to %s %s\n",
+ macbuf, devbuf, rc->uwb_dev.dev.parent->bus->name,
+ dev_name(rc->uwb_dev.dev.parent));
+@@ -448,6 +451,8 @@ void uwbd_dev_onair(struct uwb_rc *rc, struct uwb_beca_e *bce)
+ return;
+
+ error_dev_add:
++ bce->uwb_dev = NULL;
++ uwb_bce_put(bce);
+ kfree(uwb_dev);
+ return;
+ }
+diff --git a/drivers/xen/manage.c b/drivers/xen/manage.c
+index 5f1e1f3cd186..f8bb36f9d9ce 100644
+--- a/drivers/xen/manage.c
++++ b/drivers/xen/manage.c
+@@ -103,16 +103,11 @@ static void do_suspend(void)
+
+ shutting_down = SHUTDOWN_SUSPEND;
+
+-#ifdef CONFIG_PREEMPT
+- /* If the kernel is preemptible, we need to freeze all the processes
+- to prevent them from being in the middle of a pagetable update
+- during suspend. */
+ err = freeze_processes();
+ if (err) {
+ pr_err("%s: freeze failed %d\n", __func__, err);
+ goto out;
+ }
+-#endif
+
+ err = dpm_suspend_start(PMSG_FREEZE);
+ if (err) {
+@@ -157,10 +152,8 @@ out_resume:
+ dpm_resume_end(si.cancelled ? PMSG_THAW : PMSG_RESTORE);
+
+ out_thaw:
+-#ifdef CONFIG_PREEMPT
+ thaw_processes();
+ out:
+-#endif
+ shutting_down = SHUTDOWN_INVALID;
+ }
+ #endif /* CONFIG_HIBERNATE_CALLBACKS */
+diff --git a/fs/aio.c b/fs/aio.c
+index 1c9c5f0a9e2b..d72588a4c935 100644
+--- a/fs/aio.c
++++ b/fs/aio.c
+@@ -141,6 +141,7 @@ struct kioctx {
+
+ struct {
+ unsigned tail;
++ unsigned completed_events;
+ spinlock_t completion_lock;
+ } ____cacheline_aligned_in_smp;
+
+@@ -796,6 +797,9 @@ void exit_aio(struct mm_struct *mm)
+ unsigned i = 0;
+
+ while (1) {
++ struct completion requests_done =
++ COMPLETION_INITIALIZER_ONSTACK(requests_done);
++
+ rcu_read_lock();
+ table = rcu_dereference(mm->ioctx_table);
+
+@@ -823,7 +827,10 @@ void exit_aio(struct mm_struct *mm)
+ */
+ ctx->mmap_size = 0;
+
+- kill_ioctx(mm, ctx, NULL);
++ kill_ioctx(mm, ctx, &requests_done);
++
++ /* Wait until all IO for the context are done. */
++ wait_for_completion(&requests_done);
+ }
+ }
+
+@@ -880,6 +887,68 @@ out:
+ return ret;
+ }
+
++/* refill_reqs_available
++ * Updates the reqs_available reference counts used for tracking the
++ * number of free slots in the completion ring. This can be called
++ * from aio_complete() (to optimistically update reqs_available) or
++ * from aio_get_req() (the we're out of events case). It must be
++ * called holding ctx->completion_lock.
++ */
++static void refill_reqs_available(struct kioctx *ctx, unsigned head,
++ unsigned tail)
++{
++ unsigned events_in_ring, completed;
++
++ /* Clamp head since userland can write to it. */
++ head %= ctx->nr_events;
++ if (head <= tail)
++ events_in_ring = tail - head;
++ else
++ events_in_ring = ctx->nr_events - (head - tail);
++
++ completed = ctx->completed_events;
++ if (events_in_ring < completed)
++ completed -= events_in_ring;
++ else
++ completed = 0;
++
++ if (!completed)
++ return;
++
++ ctx->completed_events -= completed;
++ put_reqs_available(ctx, completed);
++}
++
++/* user_refill_reqs_available
++ * Called to refill reqs_available when aio_get_req() encounters an
++ * out of space in the completion ring.
++ */
++static void user_refill_reqs_available(struct kioctx *ctx)
++{
++ spin_lock_irq(&ctx->completion_lock);
++ if (ctx->completed_events) {
++ struct aio_ring *ring;
++ unsigned head;
++
++ /* Access of ring->head may race with aio_read_events_ring()
++ * here, but that's okay since whether we read the old version
++ * or the new version, and either will be valid. The important
++ * part is that head cannot pass tail since we prevent
++ * aio_complete() from updating tail by holding
++ * ctx->completion_lock. Even if head is invalid, the check
++ * against ctx->completed_events below will make sure we do the
++ * safe/right thing.
++ */
++ ring = kmap_atomic(ctx->ring_pages[0]);
++ head = ring->head;
++ kunmap_atomic(ring);
++
++ refill_reqs_available(ctx, head, ctx->tail);
++ }
++
++ spin_unlock_irq(&ctx->completion_lock);
++}
++
+ /* aio_get_req
+ * Allocate a slot for an aio request.
+ * Returns NULL if no requests are free.
+@@ -888,8 +957,11 @@ static inline struct kiocb *aio_get_req(struct kioctx *ctx)
+ {
+ struct kiocb *req;
+
+- if (!get_reqs_available(ctx))
+- return NULL;
++ if (!get_reqs_available(ctx)) {
++ user_refill_reqs_available(ctx);
++ if (!get_reqs_available(ctx))
++ return NULL;
++ }
+
+ req = kmem_cache_alloc(kiocb_cachep, GFP_KERNEL|__GFP_ZERO);
+ if (unlikely(!req))
+@@ -948,8 +1020,8 @@ void aio_complete(struct kiocb *iocb, long res, long res2)
+ struct kioctx *ctx = iocb->ki_ctx;
+ struct aio_ring *ring;
+ struct io_event *ev_page, *event;
++ unsigned tail, pos, head;
+ unsigned long flags;
+- unsigned tail, pos;
+
+ /*
+ * Special case handling for sync iocbs:
+@@ -1010,10 +1082,14 @@ void aio_complete(struct kiocb *iocb, long res, long res2)
+ ctx->tail = tail;
+
+ ring = kmap_atomic(ctx->ring_pages[0]);
++ head = ring->head;
+ ring->tail = tail;
+ kunmap_atomic(ring);
+ flush_dcache_page(ctx->ring_pages[0]);
+
++ ctx->completed_events++;
++ if (ctx->completed_events > 1)
++ refill_reqs_available(ctx, head, tail);
+ spin_unlock_irqrestore(&ctx->completion_lock, flags);
+
+ pr_debug("added to ring %p at [%u]\n", iocb, tail);
+@@ -1028,7 +1104,6 @@ void aio_complete(struct kiocb *iocb, long res, long res2)
+
+ /* everything turned out well, dispose of the aiocb. */
+ kiocb_free(iocb);
+- put_reqs_available(ctx, 1);
+
+ /*
+ * We have to order our ring_info tail store above and test
+@@ -1065,6 +1140,12 @@ static long aio_read_events_ring(struct kioctx *ctx,
+ tail = ring->tail;
+ kunmap_atomic(ring);
+
++ /*
++ * Ensure that once we've read the current tail pointer, that
++ * we also see the events that were stored up to the tail.
++ */
++ smp_rmb();
++
+ pr_debug("h%u t%u m%u\n", head, tail, ctx->nr_events);
+
+ if (head == tail)
+diff --git a/fs/buffer.c b/fs/buffer.c
+index eba6e4f621ce..36fdceb82635 100644
+--- a/fs/buffer.c
++++ b/fs/buffer.c
+@@ -1029,7 +1029,8 @@ grow_dev_page(struct block_device *bdev, sector_t block,
+ bh = page_buffers(page);
+ if (bh->b_size == size) {
+ end_block = init_page_buffers(page, bdev,
+- index << sizebits, size);
++ (sector_t)index << sizebits,
++ size);
+ goto done;
+ }
+ if (!try_to_free_buffers(page))
+@@ -1050,7 +1051,8 @@ grow_dev_page(struct block_device *bdev, sector_t block,
+ */
+ spin_lock(&inode->i_mapping->private_lock);
+ link_dev_buffers(page, bh);
+- end_block = init_page_buffers(page, bdev, index << sizebits, size);
++ end_block = init_page_buffers(page, bdev, (sector_t)index << sizebits,
++ size);
+ spin_unlock(&inode->i_mapping->private_lock);
+ done:
+ ret = (block < end_block) ? 1 : -ENXIO;
+diff --git a/fs/cachefiles/bind.c b/fs/cachefiles/bind.c
+index d749731dc0ee..fbb08e97438d 100644
+--- a/fs/cachefiles/bind.c
++++ b/fs/cachefiles/bind.c
+@@ -50,18 +50,18 @@ int cachefiles_daemon_bind(struct cachefiles_cache *cache, char *args)
+ cache->brun_percent < 100);
+
+ if (*args) {
+- pr_err("'bind' command doesn't take an argument");
++ pr_err("'bind' command doesn't take an argument\n");
+ return -EINVAL;
+ }
+
+ if (!cache->rootdirname) {
+- pr_err("No cache directory specified");
++ pr_err("No cache directory specified\n");
+ return -EINVAL;
+ }
+
+ /* don't permit already bound caches to be re-bound */
+ if (test_bit(CACHEFILES_READY, &cache->flags)) {
+- pr_err("Cache already bound");
++ pr_err("Cache already bound\n");
+ return -EBUSY;
+ }
+
+@@ -248,7 +248,7 @@ error_open_root:
+ kmem_cache_free(cachefiles_object_jar, fsdef);
+ error_root_object:
+ cachefiles_end_secure(cache, saved_cred);
+- pr_err("Failed to register: %d", ret);
++ pr_err("Failed to register: %d\n", ret);
+ return ret;
+ }
+
+diff --git a/fs/cachefiles/daemon.c b/fs/cachefiles/daemon.c
+index b078d3081d6c..ce1b115dcc28 100644
+--- a/fs/cachefiles/daemon.c
++++ b/fs/cachefiles/daemon.c
+@@ -315,7 +315,7 @@ static unsigned int cachefiles_daemon_poll(struct file *file,
+ static int cachefiles_daemon_range_error(struct cachefiles_cache *cache,
+ char *args)
+ {
+- pr_err("Free space limits must be in range 0%%<=stop<cull<run<100%%");
++ pr_err("Free space limits must be in range 0%%<=stop<cull<run<100%%\n");
+
+ return -EINVAL;
+ }
+@@ -475,12 +475,12 @@ static int cachefiles_daemon_dir(struct cachefiles_cache *cache, char *args)
+ _enter(",%s", args);
+
+ if (!*args) {
+- pr_err("Empty directory specified");
++ pr_err("Empty directory specified\n");
+ return -EINVAL;
+ }
+
+ if (cache->rootdirname) {
+- pr_err("Second cache directory specified");
++ pr_err("Second cache directory specified\n");
+ return -EEXIST;
+ }
+
+@@ -503,12 +503,12 @@ static int cachefiles_daemon_secctx(struct cachefiles_cache *cache, char *args)
+ _enter(",%s", args);
+
+ if (!*args) {
+- pr_err("Empty security context specified");
++ pr_err("Empty security context specified\n");
+ return -EINVAL;
+ }
+
+ if (cache->secctx) {
+- pr_err("Second security context specified");
++ pr_err("Second security context specified\n");
+ return -EINVAL;
+ }
+
+@@ -531,7 +531,7 @@ static int cachefiles_daemon_tag(struct cachefiles_cache *cache, char *args)
+ _enter(",%s", args);
+
+ if (!*args) {
+- pr_err("Empty tag specified");
++ pr_err("Empty tag specified\n");
+ return -EINVAL;
+ }
+
+@@ -562,12 +562,12 @@ static int cachefiles_daemon_cull(struct cachefiles_cache *cache, char *args)
+ goto inval;
+
+ if (!test_bit(CACHEFILES_READY, &cache->flags)) {
+- pr_err("cull applied to unready cache");
++ pr_err("cull applied to unready cache\n");
+ return -EIO;
+ }
+
+ if (test_bit(CACHEFILES_DEAD, &cache->flags)) {
+- pr_err("cull applied to dead cache");
++ pr_err("cull applied to dead cache\n");
+ return -EIO;
+ }
+
+@@ -587,11 +587,11 @@ static int cachefiles_daemon_cull(struct cachefiles_cache *cache, char *args)
+
+ notdir:
+ path_put(&path);
+- pr_err("cull command requires dirfd to be a directory");
++ pr_err("cull command requires dirfd to be a directory\n");
+ return -ENOTDIR;
+
+ inval:
+- pr_err("cull command requires dirfd and filename");
++ pr_err("cull command requires dirfd and filename\n");
+ return -EINVAL;
+ }
+
+@@ -614,7 +614,7 @@ static int cachefiles_daemon_debug(struct cachefiles_cache *cache, char *args)
+ return 0;
+
+ inval:
+- pr_err("debug command requires mask");
++ pr_err("debug command requires mask\n");
+ return -EINVAL;
+ }
+
+@@ -634,12 +634,12 @@ static int cachefiles_daemon_inuse(struct cachefiles_cache *cache, char *args)
+ goto inval;
+
+ if (!test_bit(CACHEFILES_READY, &cache->flags)) {
+- pr_err("inuse applied to unready cache");
++ pr_err("inuse applied to unready cache\n");
+ return -EIO;
+ }
+
+ if (test_bit(CACHEFILES_DEAD, &cache->flags)) {
+- pr_err("inuse applied to dead cache");
++ pr_err("inuse applied to dead cache\n");
+ return -EIO;
+ }
+
+@@ -659,11 +659,11 @@ static int cachefiles_daemon_inuse(struct cachefiles_cache *cache, char *args)
+
+ notdir:
+ path_put(&path);
+- pr_err("inuse command requires dirfd to be a directory");
++ pr_err("inuse command requires dirfd to be a directory\n");
+ return -ENOTDIR;
+
+ inval:
+- pr_err("inuse command requires dirfd and filename");
++ pr_err("inuse command requires dirfd and filename\n");
+ return -EINVAL;
+ }
+
+diff --git a/fs/cachefiles/internal.h b/fs/cachefiles/internal.h
+index 3d50998abf57..8c52472d2efa 100644
+--- a/fs/cachefiles/internal.h
++++ b/fs/cachefiles/internal.h
+@@ -255,7 +255,7 @@ extern int cachefiles_remove_object_xattr(struct cachefiles_cache *cache,
+
+ #define cachefiles_io_error(___cache, FMT, ...) \
+ do { \
+- pr_err("I/O Error: " FMT, ##__VA_ARGS__); \
++ pr_err("I/O Error: " FMT"\n", ##__VA_ARGS__); \
+ fscache_io_error(&(___cache)->cache); \
+ set_bit(CACHEFILES_DEAD, &(___cache)->flags); \
+ } while (0)
+diff --git a/fs/cachefiles/main.c b/fs/cachefiles/main.c
+index 180edfb45f66..711f13d8c2de 100644
+--- a/fs/cachefiles/main.c
++++ b/fs/cachefiles/main.c
+@@ -84,7 +84,7 @@ error_proc:
+ error_object_jar:
+ misc_deregister(&cachefiles_dev);
+ error_dev:
+- pr_err("failed to register: %d", ret);
++ pr_err("failed to register: %d\n", ret);
+ return ret;
+ }
+
+diff --git a/fs/cachefiles/namei.c b/fs/cachefiles/namei.c
+index 5bf2b41e66d3..55c0acb516d4 100644
+--- a/fs/cachefiles/namei.c
++++ b/fs/cachefiles/namei.c
+@@ -543,7 +543,7 @@ lookup_again:
+ next, next->d_inode, next->d_inode->i_ino);
+
+ } else if (!S_ISDIR(next->d_inode->i_mode)) {
+- pr_err("inode %lu is not a directory",
++ pr_err("inode %lu is not a directory\n",
+ next->d_inode->i_ino);
+ ret = -ENOBUFS;
+ goto error;
+@@ -574,7 +574,7 @@ lookup_again:
+ } else if (!S_ISDIR(next->d_inode->i_mode) &&
+ !S_ISREG(next->d_inode->i_mode)
+ ) {
+- pr_err("inode %lu is not a file or directory",
++ pr_err("inode %lu is not a file or directory\n",
+ next->d_inode->i_ino);
+ ret = -ENOBUFS;
+ goto error;
+@@ -768,7 +768,7 @@ struct dentry *cachefiles_get_directory(struct cachefiles_cache *cache,
+ ASSERT(subdir->d_inode);
+
+ if (!S_ISDIR(subdir->d_inode->i_mode)) {
+- pr_err("%s is not a directory", dirname);
++ pr_err("%s is not a directory\n", dirname);
+ ret = -EIO;
+ goto check_error;
+ }
+@@ -795,13 +795,13 @@ check_error:
+ mkdir_error:
+ mutex_unlock(&dir->d_inode->i_mutex);
+ dput(subdir);
+- pr_err("mkdir %s failed with error %d", dirname, ret);
++ pr_err("mkdir %s failed with error %d\n", dirname, ret);
+ return ERR_PTR(ret);
+
+ lookup_error:
+ mutex_unlock(&dir->d_inode->i_mutex);
+ ret = PTR_ERR(subdir);
+- pr_err("Lookup %s failed with error %d", dirname, ret);
++ pr_err("Lookup %s failed with error %d\n", dirname, ret);
+ return ERR_PTR(ret);
+
+ nomem_d_alloc:
+@@ -891,7 +891,7 @@ lookup_error:
+ if (ret == -EIO) {
+ cachefiles_io_error(cache, "Lookup failed");
+ } else if (ret != -ENOMEM) {
+- pr_err("Internal error: %d", ret);
++ pr_err("Internal error: %d\n", ret);
+ ret = -EIO;
+ }
+
+@@ -950,7 +950,7 @@ error:
+ }
+
+ if (ret != -ENOMEM) {
+- pr_err("Internal error: %d", ret);
++ pr_err("Internal error: %d\n", ret);
+ ret = -EIO;
+ }
+
+diff --git a/fs/cachefiles/xattr.c b/fs/cachefiles/xattr.c
+index 1ad51ffbb275..acbc1f094fb1 100644
+--- a/fs/cachefiles/xattr.c
++++ b/fs/cachefiles/xattr.c
+@@ -51,7 +51,7 @@ int cachefiles_check_object_type(struct cachefiles_object *object)
+ }
+
+ if (ret != -EEXIST) {
+- pr_err("Can't set xattr on %*.*s [%lu] (err %d)",
++ pr_err("Can't set xattr on %*.*s [%lu] (err %d)\n",
+ dentry->d_name.len, dentry->d_name.len,
+ dentry->d_name.name, dentry->d_inode->i_ino,
+ -ret);
+@@ -64,7 +64,7 @@ int cachefiles_check_object_type(struct cachefiles_object *object)
+ if (ret == -ERANGE)
+ goto bad_type_length;
+
+- pr_err("Can't read xattr on %*.*s [%lu] (err %d)",
++ pr_err("Can't read xattr on %*.*s [%lu] (err %d)\n",
+ dentry->d_name.len, dentry->d_name.len,
+ dentry->d_name.name, dentry->d_inode->i_ino,
+ -ret);
+@@ -85,14 +85,14 @@ error:
+ return ret;
+
+ bad_type_length:
+- pr_err("Cache object %lu type xattr length incorrect",
++ pr_err("Cache object %lu type xattr length incorrect\n",
+ dentry->d_inode->i_ino);
+ ret = -EIO;
+ goto error;
+
+ bad_type:
+ xtype[2] = 0;
+- pr_err("Cache object %*.*s [%lu] type %s not %s",
++ pr_err("Cache object %*.*s [%lu] type %s not %s\n",
+ dentry->d_name.len, dentry->d_name.len,
+ dentry->d_name.name, dentry->d_inode->i_ino,
+ xtype, type);
+@@ -293,7 +293,7 @@ error:
+ return ret;
+
+ bad_type_length:
+- pr_err("Cache object %lu xattr length incorrect",
++ pr_err("Cache object %lu xattr length incorrect\n",
+ dentry->d_inode->i_ino);
+ ret = -EIO;
+ goto error;
+diff --git a/fs/cifs/link.c b/fs/cifs/link.c
+index 68559fd557fb..a5c2812ead68 100644
+--- a/fs/cifs/link.c
++++ b/fs/cifs/link.c
+@@ -213,8 +213,12 @@ create_mf_symlink(const unsigned int xid, struct cifs_tcon *tcon,
+ if (rc)
+ goto out;
+
+- rc = tcon->ses->server->ops->create_mf_symlink(xid, tcon, cifs_sb,
+- fromName, buf, &bytes_written);
++ if (tcon->ses->server->ops->create_mf_symlink)
++ rc = tcon->ses->server->ops->create_mf_symlink(xid, tcon,
++ cifs_sb, fromName, buf, &bytes_written);
++ else
++ rc = -EOPNOTSUPP;
++
+ if (rc)
+ goto out;
+
+diff --git a/fs/eventpoll.c b/fs/eventpoll.c
+index b10b48c2a7af..7bcfff900f05 100644
+--- a/fs/eventpoll.c
++++ b/fs/eventpoll.c
+@@ -1852,7 +1852,8 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
+ goto error_tgt_fput;
+
+ /* Check if EPOLLWAKEUP is allowed */
+- ep_take_care_of_epollwakeup(&epds);
++ if (ep_op_has_event(op))
++ ep_take_care_of_epollwakeup(&epds);
+
+ /*
+ * We have to check that the file structure underneath the file descriptor
+diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
+index 1bbe7c315138..b6874405f0dc 100644
+--- a/fs/ext4/ext4.h
++++ b/fs/ext4/ext4.h
+@@ -1826,7 +1826,7 @@ ext4_group_first_block_no(struct super_block *sb, ext4_group_t group_no)
+ /*
+ * Special error return code only used by dx_probe() and its callers.
+ */
+-#define ERR_BAD_DX_DIR -75000
++#define ERR_BAD_DX_DIR (-(MAX_ERRNO - 1))
+
+ /*
+ * Timeout and state flag for lazy initialization inode thread.
+diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
+index 9e6eced1605b..5e127be91bb6 100644
+--- a/fs/ext4/namei.c
++++ b/fs/ext4/namei.c
+@@ -1227,7 +1227,7 @@ static struct buffer_head * ext4_find_entry (struct inode *dir,
+ buffer */
+ int num = 0;
+ ext4_lblk_t nblocks;
+- int i, err;
++ int i, err = 0;
+ int namelen;
+
+ *res_dir = NULL;
+@@ -1264,7 +1264,11 @@ static struct buffer_head * ext4_find_entry (struct inode *dir,
+ * return. Otherwise, fall back to doing a search the
+ * old fashioned way.
+ */
+- if (bh || (err != ERR_BAD_DX_DIR))
++ if (err == -ENOENT)
++ return NULL;
++ if (err && err != ERR_BAD_DX_DIR)
++ return ERR_PTR(err);
++ if (bh)
+ return bh;
+ dxtrace(printk(KERN_DEBUG "ext4_find_entry: dx failed, "
+ "falling back\n"));
+@@ -1295,6 +1299,11 @@ restart:
+ }
+ num++;
+ bh = ext4_getblk(NULL, dir, b++, 0, &err);
++ if (unlikely(err)) {
++ if (ra_max == 0)
++ return ERR_PTR(err);
++ break;
++ }
+ bh_use[ra_max] = bh;
+ if (bh)
+ ll_rw_block(READ | REQ_META | REQ_PRIO,
+@@ -1417,6 +1426,8 @@ static struct dentry *ext4_lookup(struct inode *dir, struct dentry *dentry, unsi
+ return ERR_PTR(-ENAMETOOLONG);
+
+ bh = ext4_find_entry(dir, &dentry->d_name, &de, NULL);
++ if (IS_ERR(bh))
++ return (struct dentry *) bh;
+ inode = NULL;
+ if (bh) {
+ __u32 ino = le32_to_cpu(de->inode);
+@@ -1450,6 +1461,8 @@ struct dentry *ext4_get_parent(struct dentry *child)
+ struct buffer_head *bh;
+
+ bh = ext4_find_entry(child->d_inode, &dotdot, &de, NULL);
++ if (IS_ERR(bh))
++ return (struct dentry *) bh;
+ if (!bh)
+ return ERR_PTR(-ENOENT);
+ ino = le32_to_cpu(de->inode);
+@@ -2727,6 +2740,8 @@ static int ext4_rmdir(struct inode *dir, struct dentry *dentry)
+
+ retval = -ENOENT;
+ bh = ext4_find_entry(dir, &dentry->d_name, &de, NULL);
++ if (IS_ERR(bh))
++ return PTR_ERR(bh);
+ if (!bh)
+ goto end_rmdir;
+
+@@ -2794,6 +2809,8 @@ static int ext4_unlink(struct inode *dir, struct dentry *dentry)
+
+ retval = -ENOENT;
+ bh = ext4_find_entry(dir, &dentry->d_name, &de, NULL);
++ if (IS_ERR(bh))
++ return PTR_ERR(bh);
+ if (!bh)
+ goto end_unlink;
+
+@@ -3121,6 +3138,8 @@ static int ext4_find_delete_entry(handle_t *handle, struct inode *dir,
+ struct ext4_dir_entry_2 *de;
+
+ bh = ext4_find_entry(dir, d_name, &de, NULL);
++ if (IS_ERR(bh))
++ return PTR_ERR(bh);
+ if (bh) {
+ retval = ext4_delete_entry(handle, dir, de, bh);
+ brelse(bh);
+@@ -3205,6 +3224,8 @@ static int ext4_rename(struct inode *old_dir, struct dentry *old_dentry,
+ dquot_initialize(new.inode);
+
+ old.bh = ext4_find_entry(old.dir, &old.dentry->d_name, &old.de, NULL);
++ if (IS_ERR(old.bh))
++ return PTR_ERR(old.bh);
+ /*
+ * Check for inode number is _not_ due to possible IO errors.
+ * We might rmdir the source, keep it as pwd of some process
+@@ -3217,6 +3238,11 @@ static int ext4_rename(struct inode *old_dir, struct dentry *old_dentry,
+
+ new.bh = ext4_find_entry(new.dir, &new.dentry->d_name,
+ &new.de, &new.inlined);
++ if (IS_ERR(new.bh)) {
++ retval = PTR_ERR(new.bh);
++ new.bh = NULL;
++ goto end_rename;
++ }
+ if (new.bh) {
+ if (!new.inode) {
+ brelse(new.bh);
+@@ -3345,6 +3371,8 @@ static int ext4_cross_rename(struct inode *old_dir, struct dentry *old_dentry,
+
+ old.bh = ext4_find_entry(old.dir, &old.dentry->d_name,
+ &old.de, &old.inlined);
++ if (IS_ERR(old.bh))
++ return PTR_ERR(old.bh);
+ /*
+ * Check for inode number is _not_ due to possible IO errors.
+ * We might rmdir the source, keep it as pwd of some process
+@@ -3357,6 +3385,11 @@ static int ext4_cross_rename(struct inode *old_dir, struct dentry *old_dentry,
+
+ new.bh = ext4_find_entry(new.dir, &new.dentry->d_name,
+ &new.de, &new.inlined);
++ if (IS_ERR(new.bh)) {
++ retval = PTR_ERR(new.bh);
++ new.bh = NULL;
++ goto end_rename;
++ }
+
+ /* RENAME_EXCHANGE case: old *and* new must both exist */
+ if (!new.bh || le32_to_cpu(new.de->inode) != new.inode->i_ino)
+diff --git a/fs/ext4/resize.c b/fs/ext4/resize.c
+index bb0e80f03e2e..1e43b905ff98 100644
+--- a/fs/ext4/resize.c
++++ b/fs/ext4/resize.c
+@@ -575,6 +575,7 @@ handle_bb:
+ bh = bclean(handle, sb, block);
+ if (IS_ERR(bh)) {
+ err = PTR_ERR(bh);
++ bh = NULL;
+ goto out;
+ }
+ overhead = ext4_group_overhead_blocks(sb, group);
+@@ -603,6 +604,7 @@ handle_ib:
+ bh = bclean(handle, sb, block);
+ if (IS_ERR(bh)) {
+ err = PTR_ERR(bh);
++ bh = NULL;
+ goto out;
+ }
+
+diff --git a/fs/gfs2/inode.c b/fs/gfs2/inode.c
+index e62e59477884..9c1a680ee468 100644
+--- a/fs/gfs2/inode.c
++++ b/fs/gfs2/inode.c
+@@ -626,8 +626,10 @@ static int gfs2_create_inode(struct inode *dir, struct dentry *dentry,
+ if (!IS_ERR(inode)) {
+ d = d_splice_alias(inode, dentry);
+ error = PTR_ERR(d);
+- if (IS_ERR(d))
++ if (IS_ERR(d)) {
++ inode = ERR_CAST(d);
+ goto fail_gunlock;
++ }
+ error = 0;
+ if (file) {
+ if (S_ISREG(inode->i_mode)) {
+@@ -854,7 +856,6 @@ static struct dentry *__gfs2_lookup(struct inode *dir, struct dentry *dentry,
+
+ d = d_splice_alias(inode, dentry);
+ if (IS_ERR(d)) {
+- iput(inode);
+ gfs2_glock_dq_uninit(&gh);
+ return d;
+ }
+diff --git a/fs/lockd/svc.c b/fs/lockd/svc.c
+index 8f27c93f8d2e..ec9e082f9ecd 100644
+--- a/fs/lockd/svc.c
++++ b/fs/lockd/svc.c
+@@ -253,13 +253,11 @@ static int lockd_up_net(struct svc_serv *serv, struct net *net)
+
+ error = make_socks(serv, net);
+ if (error < 0)
+- goto err_socks;
++ goto err_bind;
+ set_grace_period(net);
+ dprintk("lockd_up_net: per-net data created; net=%p\n", net);
+ return 0;
+
+-err_socks:
+- svc_rpcb_cleanup(serv, net);
+ err_bind:
+ ln->nlmsvc_users--;
+ return error;
+diff --git a/fs/locks.c b/fs/locks.c
+index 717fbc404e6b..be530f9b13ce 100644
+--- a/fs/locks.c
++++ b/fs/locks.c
+@@ -1595,7 +1595,7 @@ static int generic_add_lease(struct file *filp, long arg, struct file_lock **flp
+ smp_mb();
+ error = check_conflicting_open(dentry, arg);
+ if (error)
+- locks_unlink_lock(flp);
++ locks_unlink_lock(before);
+ out:
+ if (is_deleg)
+ mutex_unlock(&inode->i_mutex);
+diff --git a/fs/namei.c b/fs/namei.c
+index 17ca8b85c308..d4ca42085e1d 100644
+--- a/fs/namei.c
++++ b/fs/namei.c
+@@ -644,24 +644,22 @@ static int complete_walk(struct nameidata *nd)
+
+ static __always_inline void set_root(struct nameidata *nd)
+ {
+- if (!nd->root.mnt)
+- get_fs_root(current->fs, &nd->root);
++ get_fs_root(current->fs, &nd->root);
+ }
+
+ static int link_path_walk(const char *, struct nameidata *);
+
+-static __always_inline void set_root_rcu(struct nameidata *nd)
++static __always_inline unsigned set_root_rcu(struct nameidata *nd)
+ {
+- if (!nd->root.mnt) {
+- struct fs_struct *fs = current->fs;
+- unsigned seq;
++ struct fs_struct *fs = current->fs;
++ unsigned seq, res;
+
+- do {
+- seq = read_seqcount_begin(&fs->seq);
+- nd->root = fs->root;
+- nd->seq = __read_seqcount_begin(&nd->root.dentry->d_seq);
+- } while (read_seqcount_retry(&fs->seq, seq));
+- }
++ do {
++ seq = read_seqcount_begin(&fs->seq);
++ nd->root = fs->root;
++ res = __read_seqcount_begin(&nd->root.dentry->d_seq);
++ } while (read_seqcount_retry(&fs->seq, seq));
++ return res;
+ }
+
+ static void path_put_conditional(struct path *path, struct nameidata *nd)
+@@ -861,7 +859,8 @@ follow_link(struct path *link, struct nameidata *nd, void **p)
+ return PTR_ERR(s);
+ }
+ if (*s == '/') {
+- set_root(nd);
++ if (!nd->root.mnt)
++ set_root(nd);
+ path_put(&nd->path);
+ nd->path = nd->root;
+ path_get(&nd->root);
+@@ -1136,7 +1135,8 @@ static bool __follow_mount_rcu(struct nameidata *nd, struct path *path,
+
+ static int follow_dotdot_rcu(struct nameidata *nd)
+ {
+- set_root_rcu(nd);
++ if (!nd->root.mnt)
++ set_root_rcu(nd);
+
+ while (1) {
+ if (nd->path.dentry == nd->root.dentry &&
+@@ -1249,7 +1249,8 @@ static void follow_mount(struct path *path)
+
+ static void follow_dotdot(struct nameidata *nd)
+ {
+- set_root(nd);
++ if (!nd->root.mnt)
++ set_root(nd);
+
+ while(1) {
+ struct dentry *old = nd->path.dentry;
+@@ -1847,7 +1848,7 @@ static int path_init(int dfd, const char *name, unsigned int flags,
+ if (*name=='/') {
+ if (flags & LOOKUP_RCU) {
+ rcu_read_lock();
+- set_root_rcu(nd);
++ nd->seq = set_root_rcu(nd);
+ } else {
+ set_root(nd);
+ path_get(&nd->root);
+diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
+index 9b431f44fad9..c3ccfe440390 100644
+--- a/fs/nfs/blocklayout/blocklayout.c
++++ b/fs/nfs/blocklayout/blocklayout.c
+@@ -210,8 +210,7 @@ static void bl_end_io_read(struct bio *bio, int err)
+ SetPageUptodate(bvec->bv_page);
+
+ if (err) {
+- struct nfs_pgio_data *rdata = par->data;
+- struct nfs_pgio_header *header = rdata->header;
++ struct nfs_pgio_header *header = par->data;
+
+ if (!header->pnfs_error)
+ header->pnfs_error = -EIO;
+@@ -224,43 +223,44 @@ static void bl_end_io_read(struct bio *bio, int err)
+ static void bl_read_cleanup(struct work_struct *work)
+ {
+ struct rpc_task *task;
+- struct nfs_pgio_data *rdata;
++ struct nfs_pgio_header *hdr;
+ dprintk("%s enter\n", __func__);
+ task = container_of(work, struct rpc_task, u.tk_work);
+- rdata = container_of(task, struct nfs_pgio_data, task);
+- pnfs_ld_read_done(rdata);
++ hdr = container_of(task, struct nfs_pgio_header, task);
++ pnfs_ld_read_done(hdr);
+ }
+
+ static void
+ bl_end_par_io_read(void *data, int unused)
+ {
+- struct nfs_pgio_data *rdata = data;
++ struct nfs_pgio_header *hdr = data;
+
+- rdata->task.tk_status = rdata->header->pnfs_error;
+- INIT_WORK(&rdata->task.u.tk_work, bl_read_cleanup);
+- schedule_work(&rdata->task.u.tk_work);
++ hdr->task.tk_status = hdr->pnfs_error;
++ INIT_WORK(&hdr->task.u.tk_work, bl_read_cleanup);
++ schedule_work(&hdr->task.u.tk_work);
+ }
+
+ static enum pnfs_try_status
+-bl_read_pagelist(struct nfs_pgio_data *rdata)
++bl_read_pagelist(struct nfs_pgio_header *hdr)
+ {
+- struct nfs_pgio_header *header = rdata->header;
++ struct nfs_pgio_header *header = hdr;
+ int i, hole;
+ struct bio *bio = NULL;
+ struct pnfs_block_extent *be = NULL, *cow_read = NULL;
+ sector_t isect, extent_length = 0;
+ struct parallel_io *par;
+- loff_t f_offset = rdata->args.offset;
+- size_t bytes_left = rdata->args.count;
++ loff_t f_offset = hdr->args.offset;
++ size_t bytes_left = hdr->args.count;
+ unsigned int pg_offset, pg_len;
+- struct page **pages = rdata->args.pages;
+- int pg_index = rdata->args.pgbase >> PAGE_CACHE_SHIFT;
++ struct page **pages = hdr->args.pages;
++ int pg_index = hdr->args.pgbase >> PAGE_CACHE_SHIFT;
+ const bool is_dio = (header->dreq != NULL);
+
+ dprintk("%s enter nr_pages %u offset %lld count %u\n", __func__,
+- rdata->pages.npages, f_offset, (unsigned int)rdata->args.count);
++ hdr->page_array.npages, f_offset,
++ (unsigned int)hdr->args.count);
+
+- par = alloc_parallel(rdata);
++ par = alloc_parallel(hdr);
+ if (!par)
+ goto use_mds;
+ par->pnfs_callback = bl_end_par_io_read;
+@@ -268,7 +268,7 @@ bl_read_pagelist(struct nfs_pgio_data *rdata)
+
+ isect = (sector_t) (f_offset >> SECTOR_SHIFT);
+ /* Code assumes extents are page-aligned */
+- for (i = pg_index; i < rdata->pages.npages; i++) {
++ for (i = pg_index; i < hdr->page_array.npages; i++) {
+ if (!extent_length) {
+ /* We've used up the previous extent */
+ bl_put_extent(be);
+@@ -317,7 +317,8 @@ bl_read_pagelist(struct nfs_pgio_data *rdata)
+ struct pnfs_block_extent *be_read;
+
+ be_read = (hole && cow_read) ? cow_read : be;
+- bio = do_add_page_to_bio(bio, rdata->pages.npages - i,
++ bio = do_add_page_to_bio(bio,
++ hdr->page_array.npages - i,
+ READ,
+ isect, pages[i], be_read,
+ bl_end_io_read, par,
+@@ -332,10 +333,10 @@ bl_read_pagelist(struct nfs_pgio_data *rdata)
+ extent_length -= PAGE_CACHE_SECTORS;
+ }
+ if ((isect << SECTOR_SHIFT) >= header->inode->i_size) {
+- rdata->res.eof = 1;
+- rdata->res.count = header->inode->i_size - rdata->args.offset;
++ hdr->res.eof = 1;
++ hdr->res.count = header->inode->i_size - hdr->args.offset;
+ } else {
+- rdata->res.count = (isect << SECTOR_SHIFT) - rdata->args.offset;
++ hdr->res.count = (isect << SECTOR_SHIFT) - hdr->args.offset;
+ }
+ out:
+ bl_put_extent(be);
+@@ -390,8 +391,7 @@ static void bl_end_io_write_zero(struct bio *bio, int err)
+ }
+
+ if (unlikely(err)) {
+- struct nfs_pgio_data *data = par->data;
+- struct nfs_pgio_header *header = data->header;
++ struct nfs_pgio_header *header = par->data;
+
+ if (!header->pnfs_error)
+ header->pnfs_error = -EIO;
+@@ -405,8 +405,7 @@ static void bl_end_io_write(struct bio *bio, int err)
+ {
+ struct parallel_io *par = bio->bi_private;
+ const int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
+- struct nfs_pgio_data *data = par->data;
+- struct nfs_pgio_header *header = data->header;
++ struct nfs_pgio_header *header = par->data;
+
+ if (!uptodate) {
+ if (!header->pnfs_error)
+@@ -423,32 +422,32 @@ static void bl_end_io_write(struct bio *bio, int err)
+ static void bl_write_cleanup(struct work_struct *work)
+ {
+ struct rpc_task *task;
+- struct nfs_pgio_data *wdata;
++ struct nfs_pgio_header *hdr;
+ dprintk("%s enter\n", __func__);
+ task = container_of(work, struct rpc_task, u.tk_work);
+- wdata = container_of(task, struct nfs_pgio_data, task);
+- if (likely(!wdata->header->pnfs_error)) {
++ hdr = container_of(task, struct nfs_pgio_header, task);
++ if (likely(!hdr->pnfs_error)) {
+ /* Marks for LAYOUTCOMMIT */
+- mark_extents_written(BLK_LSEG2EXT(wdata->header->lseg),
+- wdata->args.offset, wdata->args.count);
++ mark_extents_written(BLK_LSEG2EXT(hdr->lseg),
++ hdr->args.offset, hdr->args.count);
+ }
+- pnfs_ld_write_done(wdata);
++ pnfs_ld_write_done(hdr);
+ }
+
+ /* Called when last of bios associated with a bl_write_pagelist call finishes */
+ static void bl_end_par_io_write(void *data, int num_se)
+ {
+- struct nfs_pgio_data *wdata = data;
++ struct nfs_pgio_header *hdr = data;
+
+- if (unlikely(wdata->header->pnfs_error)) {
+- bl_free_short_extents(&BLK_LSEG2EXT(wdata->header->lseg)->bl_inval,
++ if (unlikely(hdr->pnfs_error)) {
++ bl_free_short_extents(&BLK_LSEG2EXT(hdr->lseg)->bl_inval,
+ num_se);
+ }
+
+- wdata->task.tk_status = wdata->header->pnfs_error;
+- wdata->verf.committed = NFS_FILE_SYNC;
+- INIT_WORK(&wdata->task.u.tk_work, bl_write_cleanup);
+- schedule_work(&wdata->task.u.tk_work);
++ hdr->task.tk_status = hdr->pnfs_error;
++ hdr->writeverf.committed = NFS_FILE_SYNC;
++ INIT_WORK(&hdr->task.u.tk_work, bl_write_cleanup);
++ schedule_work(&hdr->task.u.tk_work);
+ }
+
+ /* FIXME STUB - mark intersection of layout and page as bad, so is not
+@@ -673,18 +672,17 @@ check_page:
+ }
+
+ static enum pnfs_try_status
+-bl_write_pagelist(struct nfs_pgio_data *wdata, int sync)
++bl_write_pagelist(struct nfs_pgio_header *header, int sync)
+ {
+- struct nfs_pgio_header *header = wdata->header;
+ int i, ret, npg_zero, pg_index, last = 0;
+ struct bio *bio = NULL;
+ struct pnfs_block_extent *be = NULL, *cow_read = NULL;
+ sector_t isect, last_isect = 0, extent_length = 0;
+ struct parallel_io *par = NULL;
+- loff_t offset = wdata->args.offset;
+- size_t count = wdata->args.count;
++ loff_t offset = header->args.offset;
++ size_t count = header->args.count;
+ unsigned int pg_offset, pg_len, saved_len;
+- struct page **pages = wdata->args.pages;
++ struct page **pages = header->args.pages;
+ struct page *page;
+ pgoff_t index;
+ u64 temp;
+@@ -699,11 +697,11 @@ bl_write_pagelist(struct nfs_pgio_data *wdata, int sync)
+ dprintk("pnfsblock nonblock aligned DIO writes. Resend MDS\n");
+ goto out_mds;
+ }
+- /* At this point, wdata->pages is a (sequential) list of nfs_pages.
++ /* At this point, header->page_array is a (sequential) list of nfs_pages.
+ * We want to write each, and if there is an error set pnfs_error
+ * to have it redone using nfs.
+ */
+- par = alloc_parallel(wdata);
++ par = alloc_parallel(header);
+ if (!par)
+ goto out_mds;
+ par->pnfs_callback = bl_end_par_io_write;
+@@ -790,8 +788,8 @@ next_page:
+ bio = bl_submit_bio(WRITE, bio);
+
+ /* Middle pages */
+- pg_index = wdata->args.pgbase >> PAGE_CACHE_SHIFT;
+- for (i = pg_index; i < wdata->pages.npages; i++) {
++ pg_index = header->args.pgbase >> PAGE_CACHE_SHIFT;
++ for (i = pg_index; i < header->page_array.npages; i++) {
+ if (!extent_length) {
+ /* We've used up the previous extent */
+ bl_put_extent(be);
+@@ -862,7 +860,8 @@ next_page:
+ }
+
+
+- bio = do_add_page_to_bio(bio, wdata->pages.npages - i, WRITE,
++ bio = do_add_page_to_bio(bio, header->page_array.npages - i,
++ WRITE,
+ isect, pages[i], be,
+ bl_end_io_write, par,
+ pg_offset, pg_len);
+@@ -890,7 +889,7 @@ next_page:
+ }
+
+ write_done:
+- wdata->res.count = wdata->args.count;
++ header->res.count = header->args.count;
+ out:
+ bl_put_extent(be);
+ bl_put_extent(cow_read);
+diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
+index f11b9eed0de1..1b34eeb0d8de 100644
+--- a/fs/nfs/direct.c
++++ b/fs/nfs/direct.c
+@@ -148,8 +148,8 @@ static void nfs_direct_set_hdr_verf(struct nfs_direct_req *dreq,
+ {
+ struct nfs_writeverf *verfp;
+
+- verfp = nfs_direct_select_verf(dreq, hdr->data->ds_clp,
+- hdr->data->ds_idx);
++ verfp = nfs_direct_select_verf(dreq, hdr->ds_clp,
++ hdr->ds_idx);
+ WARN_ON_ONCE(verfp->committed >= 0);
+ memcpy(verfp, &hdr->verf, sizeof(struct nfs_writeverf));
+ WARN_ON_ONCE(verfp->committed < 0);
+@@ -169,8 +169,8 @@ static int nfs_direct_set_or_cmp_hdr_verf(struct nfs_direct_req *dreq,
+ {
+ struct nfs_writeverf *verfp;
+
+- verfp = nfs_direct_select_verf(dreq, hdr->data->ds_clp,
+- hdr->data->ds_idx);
++ verfp = nfs_direct_select_verf(dreq, hdr->ds_clp,
++ hdr->ds_idx);
+ if (verfp->committed < 0) {
+ nfs_direct_set_hdr_verf(dreq, hdr);
+ return 0;
+diff --git a/fs/nfs/filelayout/filelayout.c b/fs/nfs/filelayout/filelayout.c
+index d2eba1c13b7e..a596a1938b52 100644
+--- a/fs/nfs/filelayout/filelayout.c
++++ b/fs/nfs/filelayout/filelayout.c
+@@ -84,19 +84,18 @@ filelayout_get_dserver_offset(struct pnfs_layout_segment *lseg, loff_t offset)
+ BUG();
+ }
+
+-static void filelayout_reset_write(struct nfs_pgio_data *data)
++static void filelayout_reset_write(struct nfs_pgio_header *hdr)
+ {
+- struct nfs_pgio_header *hdr = data->header;
+- struct rpc_task *task = &data->task;
++ struct rpc_task *task = &hdr->task;
+
+ if (!test_and_set_bit(NFS_IOHDR_REDO, &hdr->flags)) {
+ dprintk("%s Reset task %5u for i/o through MDS "
+ "(req %s/%llu, %u bytes @ offset %llu)\n", __func__,
+- data->task.tk_pid,
++ hdr->task.tk_pid,
+ hdr->inode->i_sb->s_id,
+ (unsigned long long)NFS_FILEID(hdr->inode),
+- data->args.count,
+- (unsigned long long)data->args.offset);
++ hdr->args.count,
++ (unsigned long long)hdr->args.offset);
+
+ task->tk_status = pnfs_write_done_resend_to_mds(hdr->inode,
+ &hdr->pages,
+@@ -105,19 +104,18 @@ static void filelayout_reset_write(struct nfs_pgio_data *data)
+ }
+ }
+
+-static void filelayout_reset_read(struct nfs_pgio_data *data)
++static void filelayout_reset_read(struct nfs_pgio_header *hdr)
+ {
+- struct nfs_pgio_header *hdr = data->header;
+- struct rpc_task *task = &data->task;
++ struct rpc_task *task = &hdr->task;
+
+ if (!test_and_set_bit(NFS_IOHDR_REDO, &hdr->flags)) {
+ dprintk("%s Reset task %5u for i/o through MDS "
+ "(req %s/%llu, %u bytes @ offset %llu)\n", __func__,
+- data->task.tk_pid,
++ hdr->task.tk_pid,
+ hdr->inode->i_sb->s_id,
+ (unsigned long long)NFS_FILEID(hdr->inode),
+- data->args.count,
+- (unsigned long long)data->args.offset);
++ hdr->args.count,
++ (unsigned long long)hdr->args.offset);
+
+ task->tk_status = pnfs_read_done_resend_to_mds(hdr->inode,
+ &hdr->pages,
+@@ -243,18 +241,17 @@ wait_on_recovery:
+ /* NFS_PROTO call done callback routines */
+
+ static int filelayout_read_done_cb(struct rpc_task *task,
+- struct nfs_pgio_data *data)
++ struct nfs_pgio_header *hdr)
+ {
+- struct nfs_pgio_header *hdr = data->header;
+ int err;
+
+- trace_nfs4_pnfs_read(data, task->tk_status);
+- err = filelayout_async_handle_error(task, data->args.context->state,
+- data->ds_clp, hdr->lseg);
++ trace_nfs4_pnfs_read(hdr, task->tk_status);
++ err = filelayout_async_handle_error(task, hdr->args.context->state,
++ hdr->ds_clp, hdr->lseg);
+
+ switch (err) {
+ case -NFS4ERR_RESET_TO_MDS:
+- filelayout_reset_read(data);
++ filelayout_reset_read(hdr);
+ return task->tk_status;
+ case -EAGAIN:
+ rpc_restart_call_prepare(task);
+@@ -270,15 +267,14 @@ static int filelayout_read_done_cb(struct rpc_task *task,
+ * rfc5661 is not clear about which credential should be used.
+ */
+ static void
+-filelayout_set_layoutcommit(struct nfs_pgio_data *wdata)
++filelayout_set_layoutcommit(struct nfs_pgio_header *hdr)
+ {
+- struct nfs_pgio_header *hdr = wdata->header;
+
+ if (FILELAYOUT_LSEG(hdr->lseg)->commit_through_mds ||
+- wdata->res.verf->committed == NFS_FILE_SYNC)
++ hdr->res.verf->committed == NFS_FILE_SYNC)
+ return;
+
+- pnfs_set_layoutcommit(wdata);
++ pnfs_set_layoutcommit(hdr);
+ dprintk("%s inode %lu pls_end_pos %lu\n", __func__, hdr->inode->i_ino,
+ (unsigned long) NFS_I(hdr->inode)->layout->plh_lwb);
+ }
+@@ -305,83 +301,82 @@ filelayout_reset_to_mds(struct pnfs_layout_segment *lseg)
+ */
+ static void filelayout_read_prepare(struct rpc_task *task, void *data)
+ {
+- struct nfs_pgio_data *rdata = data;
++ struct nfs_pgio_header *hdr = data;
+
+- if (unlikely(test_bit(NFS_CONTEXT_BAD, &rdata->args.context->flags))) {
++ if (unlikely(test_bit(NFS_CONTEXT_BAD, &hdr->args.context->flags))) {
+ rpc_exit(task, -EIO);
+ return;
+ }
+- if (filelayout_reset_to_mds(rdata->header->lseg)) {
++ if (filelayout_reset_to_mds(hdr->lseg)) {
+ dprintk("%s task %u reset io to MDS\n", __func__, task->tk_pid);
+- filelayout_reset_read(rdata);
++ filelayout_reset_read(hdr);
+ rpc_exit(task, 0);
+ return;
+ }
+- rdata->pgio_done_cb = filelayout_read_done_cb;
++ hdr->pgio_done_cb = filelayout_read_done_cb;
+
+- if (nfs41_setup_sequence(rdata->ds_clp->cl_session,
+- &rdata->args.seq_args,
+- &rdata->res.seq_res,
++ if (nfs41_setup_sequence(hdr->ds_clp->cl_session,
++ &hdr->args.seq_args,
++ &hdr->res.seq_res,
+ task))
+ return;
+- if (nfs4_set_rw_stateid(&rdata->args.stateid, rdata->args.context,
+- rdata->args.lock_context, FMODE_READ) == -EIO)
++ if (nfs4_set_rw_stateid(&hdr->args.stateid, hdr->args.context,
++ hdr->args.lock_context, FMODE_READ) == -EIO)
+ rpc_exit(task, -EIO); /* lost lock, terminate I/O */
+ }
+
+ static void filelayout_read_call_done(struct rpc_task *task, void *data)
+ {
+- struct nfs_pgio_data *rdata = data;
++ struct nfs_pgio_header *hdr = data;
+
+ dprintk("--> %s task->tk_status %d\n", __func__, task->tk_status);
+
+- if (test_bit(NFS_IOHDR_REDO, &rdata->header->flags) &&
++ if (test_bit(NFS_IOHDR_REDO, &hdr->flags) &&
+ task->tk_status == 0) {
+- nfs41_sequence_done(task, &rdata->res.seq_res);
++ nfs41_sequence_done(task, &hdr->res.seq_res);
+ return;
+ }
+
+ /* Note this may cause RPC to be resent */
+- rdata->header->mds_ops->rpc_call_done(task, data);
++ hdr->mds_ops->rpc_call_done(task, data);
+ }
+
+ static void filelayout_read_count_stats(struct rpc_task *task, void *data)
+ {
+- struct nfs_pgio_data *rdata = data;
++ struct nfs_pgio_header *hdr = data;
+
+- rpc_count_iostats(task, NFS_SERVER(rdata->header->inode)->client->cl_metrics);
++ rpc_count_iostats(task, NFS_SERVER(hdr->inode)->client->cl_metrics);
+ }
+
+ static void filelayout_read_release(void *data)
+ {
+- struct nfs_pgio_data *rdata = data;
+- struct pnfs_layout_hdr *lo = rdata->header->lseg->pls_layout;
++ struct nfs_pgio_header *hdr = data;
++ struct pnfs_layout_hdr *lo = hdr->lseg->pls_layout;
+
+ filelayout_fenceme(lo->plh_inode, lo);
+- nfs_put_client(rdata->ds_clp);
+- rdata->header->mds_ops->rpc_release(data);
++ nfs_put_client(hdr->ds_clp);
++ hdr->mds_ops->rpc_release(data);
+ }
+
+ static int filelayout_write_done_cb(struct rpc_task *task,
+- struct nfs_pgio_data *data)
++ struct nfs_pgio_header *hdr)
+ {
+- struct nfs_pgio_header *hdr = data->header;
+ int err;
+
+- trace_nfs4_pnfs_write(data, task->tk_status);
+- err = filelayout_async_handle_error(task, data->args.context->state,
+- data->ds_clp, hdr->lseg);
++ trace_nfs4_pnfs_write(hdr, task->tk_status);
++ err = filelayout_async_handle_error(task, hdr->args.context->state,
++ hdr->ds_clp, hdr->lseg);
+
+ switch (err) {
+ case -NFS4ERR_RESET_TO_MDS:
+- filelayout_reset_write(data);
++ filelayout_reset_write(hdr);
+ return task->tk_status;
+ case -EAGAIN:
+ rpc_restart_call_prepare(task);
+ return -EAGAIN;
+ }
+
+- filelayout_set_layoutcommit(data);
++ filelayout_set_layoutcommit(hdr);
+ return 0;
+ }
+
+@@ -419,57 +414,57 @@ static int filelayout_commit_done_cb(struct rpc_task *task,
+
+ static void filelayout_write_prepare(struct rpc_task *task, void *data)
+ {
+- struct nfs_pgio_data *wdata = data;
++ struct nfs_pgio_header *hdr = data;
+
+- if (unlikely(test_bit(NFS_CONTEXT_BAD, &wdata->args.context->flags))) {
++ if (unlikely(test_bit(NFS_CONTEXT_BAD, &hdr->args.context->flags))) {
+ rpc_exit(task, -EIO);
+ return;
+ }
+- if (filelayout_reset_to_mds(wdata->header->lseg)) {
++ if (filelayout_reset_to_mds(hdr->lseg)) {
+ dprintk("%s task %u reset io to MDS\n", __func__, task->tk_pid);
+- filelayout_reset_write(wdata);
++ filelayout_reset_write(hdr);
+ rpc_exit(task, 0);
+ return;
+ }
+- if (nfs41_setup_sequence(wdata->ds_clp->cl_session,
+- &wdata->args.seq_args,
+- &wdata->res.seq_res,
++ if (nfs41_setup_sequence(hdr->ds_clp->cl_session,
++ &hdr->args.seq_args,
++ &hdr->res.seq_res,
+ task))
+ return;
+- if (nfs4_set_rw_stateid(&wdata->args.stateid, wdata->args.context,
+- wdata->args.lock_context, FMODE_WRITE) == -EIO)
++ if (nfs4_set_rw_stateid(&hdr->args.stateid, hdr->args.context,
++ hdr->args.lock_context, FMODE_WRITE) == -EIO)
+ rpc_exit(task, -EIO); /* lost lock, terminate I/O */
+ }
+
+ static void filelayout_write_call_done(struct rpc_task *task, void *data)
+ {
+- struct nfs_pgio_data *wdata = data;
++ struct nfs_pgio_header *hdr = data;
+
+- if (test_bit(NFS_IOHDR_REDO, &wdata->header->flags) &&
++ if (test_bit(NFS_IOHDR_REDO, &hdr->flags) &&
+ task->tk_status == 0) {
+- nfs41_sequence_done(task, &wdata->res.seq_res);
++ nfs41_sequence_done(task, &hdr->res.seq_res);
+ return;
+ }
+
+ /* Note this may cause RPC to be resent */
+- wdata->header->mds_ops->rpc_call_done(task, data);
++ hdr->mds_ops->rpc_call_done(task, data);
+ }
+
+ static void filelayout_write_count_stats(struct rpc_task *task, void *data)
+ {
+- struct nfs_pgio_data *wdata = data;
++ struct nfs_pgio_header *hdr = data;
+
+- rpc_count_iostats(task, NFS_SERVER(wdata->header->inode)->client->cl_metrics);
++ rpc_count_iostats(task, NFS_SERVER(hdr->inode)->client->cl_metrics);
+ }
+
+ static void filelayout_write_release(void *data)
+ {
+- struct nfs_pgio_data *wdata = data;
+- struct pnfs_layout_hdr *lo = wdata->header->lseg->pls_layout;
++ struct nfs_pgio_header *hdr = data;
++ struct pnfs_layout_hdr *lo = hdr->lseg->pls_layout;
+
+ filelayout_fenceme(lo->plh_inode, lo);
+- nfs_put_client(wdata->ds_clp);
+- wdata->header->mds_ops->rpc_release(data);
++ nfs_put_client(hdr->ds_clp);
++ hdr->mds_ops->rpc_release(data);
+ }
+
+ static void filelayout_commit_prepare(struct rpc_task *task, void *data)
+@@ -529,19 +524,18 @@ static const struct rpc_call_ops filelayout_commit_call_ops = {
+ };
+
+ static enum pnfs_try_status
+-filelayout_read_pagelist(struct nfs_pgio_data *data)
++filelayout_read_pagelist(struct nfs_pgio_header *hdr)
+ {
+- struct nfs_pgio_header *hdr = data->header;
+ struct pnfs_layout_segment *lseg = hdr->lseg;
+ struct nfs4_pnfs_ds *ds;
+ struct rpc_clnt *ds_clnt;
+- loff_t offset = data->args.offset;
++ loff_t offset = hdr->args.offset;
+ u32 j, idx;
+ struct nfs_fh *fh;
+
+ dprintk("--> %s ino %lu pgbase %u req %Zu@%llu\n",
+ __func__, hdr->inode->i_ino,
+- data->args.pgbase, (size_t)data->args.count, offset);
++ hdr->args.pgbase, (size_t)hdr->args.count, offset);
+
+ /* Retrieve the correct rpc_client for the byte range */
+ j = nfs4_fl_calc_j_index(lseg, offset);
+@@ -559,30 +553,29 @@ filelayout_read_pagelist(struct nfs_pgio_data *data)
+
+ /* No multipath support. Use first DS */
+ atomic_inc(&ds->ds_clp->cl_count);
+- data->ds_clp = ds->ds_clp;
+- data->ds_idx = idx;
++ hdr->ds_clp = ds->ds_clp;
++ hdr->ds_idx = idx;
+ fh = nfs4_fl_select_ds_fh(lseg, j);
+ if (fh)
+- data->args.fh = fh;
++ hdr->args.fh = fh;
+
+- data->args.offset = filelayout_get_dserver_offset(lseg, offset);
+- data->mds_offset = offset;
++ hdr->args.offset = filelayout_get_dserver_offset(lseg, offset);
++ hdr->mds_offset = offset;
+
+ /* Perform an asynchronous read to ds */
+- nfs_initiate_pgio(ds_clnt, data,
++ nfs_initiate_pgio(ds_clnt, hdr,
+ &filelayout_read_call_ops, 0, RPC_TASK_SOFTCONN);
+ return PNFS_ATTEMPTED;
+ }
+
+ /* Perform async writes. */
+ static enum pnfs_try_status
+-filelayout_write_pagelist(struct nfs_pgio_data *data, int sync)
++filelayout_write_pagelist(struct nfs_pgio_header *hdr, int sync)
+ {
+- struct nfs_pgio_header *hdr = data->header;
+ struct pnfs_layout_segment *lseg = hdr->lseg;
+ struct nfs4_pnfs_ds *ds;
+ struct rpc_clnt *ds_clnt;
+- loff_t offset = data->args.offset;
++ loff_t offset = hdr->args.offset;
+ u32 j, idx;
+ struct nfs_fh *fh;
+
+@@ -598,21 +591,20 @@ filelayout_write_pagelist(struct nfs_pgio_data *data, int sync)
+ return PNFS_NOT_ATTEMPTED;
+
+ dprintk("%s ino %lu sync %d req %Zu@%llu DS: %s cl_count %d\n",
+- __func__, hdr->inode->i_ino, sync, (size_t) data->args.count,
++ __func__, hdr->inode->i_ino, sync, (size_t) hdr->args.count,
+ offset, ds->ds_remotestr, atomic_read(&ds->ds_clp->cl_count));
+
+- data->pgio_done_cb = filelayout_write_done_cb;
++ hdr->pgio_done_cb = filelayout_write_done_cb;
+ atomic_inc(&ds->ds_clp->cl_count);
+- data->ds_clp = ds->ds_clp;
+- data->ds_idx = idx;
++ hdr->ds_clp = ds->ds_clp;
++ hdr->ds_idx = idx;
+ fh = nfs4_fl_select_ds_fh(lseg, j);
+ if (fh)
+- data->args.fh = fh;
+-
+- data->args.offset = filelayout_get_dserver_offset(lseg, offset);
++ hdr->args.fh = fh;
++ hdr->args.offset = filelayout_get_dserver_offset(lseg, offset);
+
+ /* Perform an asynchronous write */
+- nfs_initiate_pgio(ds_clnt, data,
++ nfs_initiate_pgio(ds_clnt, hdr,
+ &filelayout_write_call_ops, sync,
+ RPC_TASK_SOFTCONN);
+ return PNFS_ATTEMPTED;
+@@ -1023,6 +1015,7 @@ static u32 select_bucket_index(struct nfs4_filelayout_segment *fl, u32 j)
+
+ /* The generic layer is about to remove the req from the commit list.
+ * If this will make the bucket empty, it will need to put the lseg reference.
++ * Note this must be called while holding the inode (/cinfo) lock
+ */
+ static void
+ filelayout_clear_request_commit(struct nfs_page *req,
+@@ -1030,7 +1023,6 @@ filelayout_clear_request_commit(struct nfs_page *req,
+ {
+ struct pnfs_layout_segment *freeme = NULL;
+
+- spin_lock(cinfo->lock);
+ if (!test_and_clear_bit(PG_COMMIT_TO_DS, &req->wb_flags))
+ goto out;
+ cinfo->ds->nwritten--;
+@@ -1045,8 +1037,7 @@ filelayout_clear_request_commit(struct nfs_page *req,
+ }
+ out:
+ nfs_request_remove_commit_list(req, cinfo);
+- spin_unlock(cinfo->lock);
+- pnfs_put_lseg(freeme);
++ pnfs_put_lseg_async(freeme);
+ }
+
+ static struct list_head *
+diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
+index f415cbf9f6c3..4d0eecbc98bc 100644
+--- a/fs/nfs/internal.h
++++ b/fs/nfs/internal.h
+@@ -238,11 +238,11 @@ void nfs_set_pgio_error(struct nfs_pgio_header *hdr, int error, loff_t pos);
+ int nfs_iocounter_wait(struct nfs_io_counter *c);
+
+ extern const struct nfs_pageio_ops nfs_pgio_rw_ops;
+-struct nfs_rw_header *nfs_rw_header_alloc(const struct nfs_rw_ops *);
+-void nfs_rw_header_free(struct nfs_pgio_header *);
+-void nfs_pgio_data_release(struct nfs_pgio_data *);
++struct nfs_pgio_header *nfs_pgio_header_alloc(const struct nfs_rw_ops *);
++void nfs_pgio_header_free(struct nfs_pgio_header *);
++void nfs_pgio_data_destroy(struct nfs_pgio_header *);
+ int nfs_generic_pgio(struct nfs_pageio_descriptor *, struct nfs_pgio_header *);
+-int nfs_initiate_pgio(struct rpc_clnt *, struct nfs_pgio_data *,
++int nfs_initiate_pgio(struct rpc_clnt *, struct nfs_pgio_header *,
+ const struct rpc_call_ops *, int, int);
+ void nfs_free_request(struct nfs_page *req);
+
+@@ -482,7 +482,7 @@ static inline void nfs_inode_dio_wait(struct inode *inode)
+ extern ssize_t nfs_dreq_bytes_left(struct nfs_direct_req *dreq);
+
+ /* nfs4proc.c */
+-extern void __nfs4_read_done_cb(struct nfs_pgio_data *);
++extern void __nfs4_read_done_cb(struct nfs_pgio_header *);
+ extern struct nfs_client *nfs4_init_client(struct nfs_client *clp,
+ const struct rpc_timeout *timeparms,
+ const char *ip_addr);
+diff --git a/fs/nfs/nfs3proc.c b/fs/nfs/nfs3proc.c
+index f0afa291fd58..809670eba52a 100644
+--- a/fs/nfs/nfs3proc.c
++++ b/fs/nfs/nfs3proc.c
+@@ -795,41 +795,44 @@ nfs3_proc_pathconf(struct nfs_server *server, struct nfs_fh *fhandle,
+ return status;
+ }
+
+-static int nfs3_read_done(struct rpc_task *task, struct nfs_pgio_data *data)
++static int nfs3_read_done(struct rpc_task *task, struct nfs_pgio_header *hdr)
+ {
+- struct inode *inode = data->header->inode;
++ struct inode *inode = hdr->inode;
+
+ if (nfs3_async_handle_jukebox(task, inode))
+ return -EAGAIN;
+
+ nfs_invalidate_atime(inode);
+- nfs_refresh_inode(inode, &data->fattr);
++ nfs_refresh_inode(inode, &hdr->fattr);
+ return 0;
+ }
+
+-static void nfs3_proc_read_setup(struct nfs_pgio_data *data, struct rpc_message *msg)
++static void nfs3_proc_read_setup(struct nfs_pgio_header *hdr,
++ struct rpc_message *msg)
+ {
+ msg->rpc_proc = &nfs3_procedures[NFS3PROC_READ];
+ }
+
+-static int nfs3_proc_pgio_rpc_prepare(struct rpc_task *task, struct nfs_pgio_data *data)
++static int nfs3_proc_pgio_rpc_prepare(struct rpc_task *task,
++ struct nfs_pgio_header *hdr)
+ {
+ rpc_call_start(task);
+ return 0;
+ }
+
+-static int nfs3_write_done(struct rpc_task *task, struct nfs_pgio_data *data)
++static int nfs3_write_done(struct rpc_task *task, struct nfs_pgio_header *hdr)
+ {
+- struct inode *inode = data->header->inode;
++ struct inode *inode = hdr->inode;
+
+ if (nfs3_async_handle_jukebox(task, inode))
+ return -EAGAIN;
+ if (task->tk_status >= 0)
+- nfs_post_op_update_inode_force_wcc(inode, data->res.fattr);
++ nfs_post_op_update_inode_force_wcc(inode, hdr->res.fattr);
+ return 0;
+ }
+
+-static void nfs3_proc_write_setup(struct nfs_pgio_data *data, struct rpc_message *msg)
++static void nfs3_proc_write_setup(struct nfs_pgio_header *hdr,
++ struct rpc_message *msg)
+ {
+ msg->rpc_proc = &nfs3_procedures[NFS3PROC_WRITE];
+ }
+diff --git a/fs/nfs/nfs4_fs.h b/fs/nfs/nfs4_fs.h
+index ba2affa51941..b8ea4a26998c 100644
+--- a/fs/nfs/nfs4_fs.h
++++ b/fs/nfs/nfs4_fs.h
+@@ -337,11 +337,11 @@ nfs4_state_protect(struct nfs_client *clp, unsigned long sp4_mode,
+ */
+ static inline void
+ nfs4_state_protect_write(struct nfs_client *clp, struct rpc_clnt **clntp,
+- struct rpc_message *msg, struct nfs_pgio_data *wdata)
++ struct rpc_message *msg, struct nfs_pgio_header *hdr)
+ {
+ if (_nfs4_state_protect(clp, NFS_SP4_MACH_CRED_WRITE, clntp, msg) &&
+ !test_bit(NFS_SP4_MACH_CRED_COMMIT, &clp->cl_sp4_flags))
+- wdata->args.stable = NFS_FILE_SYNC;
++ hdr->args.stable = NFS_FILE_SYNC;
+ }
+ #else /* CONFIG_NFS_v4_1 */
+ static inline struct nfs4_session *nfs4_get_session(const struct nfs_server *server)
+@@ -369,7 +369,7 @@ nfs4_state_protect(struct nfs_client *clp, unsigned long sp4_flags,
+
+ static inline void
+ nfs4_state_protect_write(struct nfs_client *clp, struct rpc_clnt **clntp,
+- struct rpc_message *msg, struct nfs_pgio_data *wdata)
++ struct rpc_message *msg, struct nfs_pgio_header *hdr)
+ {
+ }
+ #endif /* CONFIG_NFS_V4_1 */
+diff --git a/fs/nfs/nfs4client.c b/fs/nfs/nfs4client.c
+index aa9ef4876046..6e045d5ee950 100644
+--- a/fs/nfs/nfs4client.c
++++ b/fs/nfs/nfs4client.c
+@@ -482,6 +482,16 @@ int nfs40_walk_client_list(struct nfs_client *new,
+
+ spin_lock(&nn->nfs_client_lock);
+ list_for_each_entry(pos, &nn->nfs_client_list, cl_share_link) {
++
++ if (pos->rpc_ops != new->rpc_ops)
++ continue;
++
++ if (pos->cl_proto != new->cl_proto)
++ continue;
++
++ if (pos->cl_minorversion != new->cl_minorversion)
++ continue;
++
+ /* If "pos" isn't marked ready, we can't trust the
+ * remaining fields in "pos" */
+ if (pos->cl_cons_state > NFS_CS_READY) {
+@@ -501,15 +511,6 @@ int nfs40_walk_client_list(struct nfs_client *new,
+ if (pos->cl_cons_state != NFS_CS_READY)
+ continue;
+
+- if (pos->rpc_ops != new->rpc_ops)
+- continue;
+-
+- if (pos->cl_proto != new->cl_proto)
+- continue;
+-
+- if (pos->cl_minorversion != new->cl_minorversion)
+- continue;
+-
+ if (pos->cl_clientid != new->cl_clientid)
+ continue;
+
+@@ -622,6 +623,16 @@ int nfs41_walk_client_list(struct nfs_client *new,
+
+ spin_lock(&nn->nfs_client_lock);
+ list_for_each_entry(pos, &nn->nfs_client_list, cl_share_link) {
++
++ if (pos->rpc_ops != new->rpc_ops)
++ continue;
++
++ if (pos->cl_proto != new->cl_proto)
++ continue;
++
++ if (pos->cl_minorversion != new->cl_minorversion)
++ continue;
++
+ /* If "pos" isn't marked ready, we can't trust the
+ * remaining fields in "pos", especially the client
+ * ID and serverowner fields. Wait for CREATE_SESSION
+@@ -647,15 +658,6 @@ int nfs41_walk_client_list(struct nfs_client *new,
+ if (pos->cl_cons_state != NFS_CS_READY)
+ continue;
+
+- if (pos->rpc_ops != new->rpc_ops)
+- continue;
+-
+- if (pos->cl_proto != new->cl_proto)
+- continue;
+-
+- if (pos->cl_minorversion != new->cl_minorversion)
+- continue;
+-
+ if (!nfs4_match_clientids(pos, new))
+ continue;
+
+diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
+index dac979866f83..3275e94538e7 100644
+--- a/fs/nfs/nfs4proc.c
++++ b/fs/nfs/nfs4proc.c
+@@ -2599,23 +2599,23 @@ static void nfs4_close_prepare(struct rpc_task *task, void *data)
+ is_rdwr = test_bit(NFS_O_RDWR_STATE, &state->flags);
+ is_rdonly = test_bit(NFS_O_RDONLY_STATE, &state->flags);
+ is_wronly = test_bit(NFS_O_WRONLY_STATE, &state->flags);
+- /* Calculate the current open share mode */
+- calldata->arg.fmode = 0;
+- if (is_rdonly || is_rdwr)
+- calldata->arg.fmode |= FMODE_READ;
+- if (is_wronly || is_rdwr)
+- calldata->arg.fmode |= FMODE_WRITE;
+ /* Calculate the change in open mode */
++ calldata->arg.fmode = 0;
+ if (state->n_rdwr == 0) {
+- if (state->n_rdonly == 0) {
+- call_close |= is_rdonly || is_rdwr;
+- calldata->arg.fmode &= ~FMODE_READ;
+- }
+- if (state->n_wronly == 0) {
+- call_close |= is_wronly || is_rdwr;
+- calldata->arg.fmode &= ~FMODE_WRITE;
+- }
+- }
++ if (state->n_rdonly == 0)
++ call_close |= is_rdonly;
++ else if (is_rdonly)
++ calldata->arg.fmode |= FMODE_READ;
++ if (state->n_wronly == 0)
++ call_close |= is_wronly;
++ else if (is_wronly)
++ calldata->arg.fmode |= FMODE_WRITE;
++ } else if (is_rdwr)
++ calldata->arg.fmode |= FMODE_READ|FMODE_WRITE;
++
++ if (calldata->arg.fmode == 0)
++ call_close |= is_rdwr;
++
+ if (!nfs4_valid_open_stateid(state))
+ call_close = 0;
+ spin_unlock(&state->owner->so_lock);
+@@ -4041,24 +4041,25 @@ static bool nfs4_error_stateid_expired(int err)
+ return false;
+ }
+
+-void __nfs4_read_done_cb(struct nfs_pgio_data *data)
++void __nfs4_read_done_cb(struct nfs_pgio_header *hdr)
+ {
+- nfs_invalidate_atime(data->header->inode);
++ nfs_invalidate_atime(hdr->inode);
+ }
+
+-static int nfs4_read_done_cb(struct rpc_task *task, struct nfs_pgio_data *data)
++static int nfs4_read_done_cb(struct rpc_task *task, struct nfs_pgio_header *hdr)
+ {
+- struct nfs_server *server = NFS_SERVER(data->header->inode);
++ struct nfs_server *server = NFS_SERVER(hdr->inode);
+
+- trace_nfs4_read(data, task->tk_status);
+- if (nfs4_async_handle_error(task, server, data->args.context->state) == -EAGAIN) {
++ trace_nfs4_read(hdr, task->tk_status);
++ if (nfs4_async_handle_error(task, server,
++ hdr->args.context->state) == -EAGAIN) {
+ rpc_restart_call_prepare(task);
+ return -EAGAIN;
+ }
+
+- __nfs4_read_done_cb(data);
++ __nfs4_read_done_cb(hdr);
+ if (task->tk_status > 0)
+- renew_lease(server, data->timestamp);
++ renew_lease(server, hdr->timestamp);
+ return 0;
+ }
+
+@@ -4076,54 +4077,59 @@ static bool nfs4_read_stateid_changed(struct rpc_task *task,
+ return true;
+ }
+
+-static int nfs4_read_done(struct rpc_task *task, struct nfs_pgio_data *data)
++static int nfs4_read_done(struct rpc_task *task, struct nfs_pgio_header *hdr)
+ {
+
+ dprintk("--> %s\n", __func__);
+
+- if (!nfs4_sequence_done(task, &data->res.seq_res))
++ if (!nfs4_sequence_done(task, &hdr->res.seq_res))
+ return -EAGAIN;
+- if (nfs4_read_stateid_changed(task, &data->args))
++ if (nfs4_read_stateid_changed(task, &hdr->args))
+ return -EAGAIN;
+- return data->pgio_done_cb ? data->pgio_done_cb(task, data) :
+- nfs4_read_done_cb(task, data);
++ return hdr->pgio_done_cb ? hdr->pgio_done_cb(task, hdr) :
++ nfs4_read_done_cb(task, hdr);
+ }
+
+-static void nfs4_proc_read_setup(struct nfs_pgio_data *data, struct rpc_message *msg)
++static void nfs4_proc_read_setup(struct nfs_pgio_header *hdr,
++ struct rpc_message *msg)
+ {
+- data->timestamp = jiffies;
+- data->pgio_done_cb = nfs4_read_done_cb;
++ hdr->timestamp = jiffies;
++ hdr->pgio_done_cb = nfs4_read_done_cb;
+ msg->rpc_proc = &nfs4_procedures[NFSPROC4_CLNT_READ];
+- nfs4_init_sequence(&data->args.seq_args, &data->res.seq_res, 0);
++ nfs4_init_sequence(&hdr->args.seq_args, &hdr->res.seq_res, 0);
+ }
+
+-static int nfs4_proc_pgio_rpc_prepare(struct rpc_task *task, struct nfs_pgio_data *data)
++static int nfs4_proc_pgio_rpc_prepare(struct rpc_task *task,
++ struct nfs_pgio_header *hdr)
+ {
+- if (nfs4_setup_sequence(NFS_SERVER(data->header->inode),
+- &data->args.seq_args,
+- &data->res.seq_res,
++ if (nfs4_setup_sequence(NFS_SERVER(hdr->inode),
++ &hdr->args.seq_args,
++ &hdr->res.seq_res,
+ task))
+ return 0;
+- if (nfs4_set_rw_stateid(&data->args.stateid, data->args.context,
+- data->args.lock_context, data->header->rw_ops->rw_mode) == -EIO)
++ if (nfs4_set_rw_stateid(&hdr->args.stateid, hdr->args.context,
++ hdr->args.lock_context,
++ hdr->rw_ops->rw_mode) == -EIO)
+ return -EIO;
+- if (unlikely(test_bit(NFS_CONTEXT_BAD, &data->args.context->flags)))
++ if (unlikely(test_bit(NFS_CONTEXT_BAD, &hdr->args.context->flags)))
+ return -EIO;
+ return 0;
+ }
+
+-static int nfs4_write_done_cb(struct rpc_task *task, struct nfs_pgio_data *data)
++static int nfs4_write_done_cb(struct rpc_task *task,
++ struct nfs_pgio_header *hdr)
+ {
+- struct inode *inode = data->header->inode;
++ struct inode *inode = hdr->inode;
+
+- trace_nfs4_write(data, task->tk_status);
+- if (nfs4_async_handle_error(task, NFS_SERVER(inode), data->args.context->state) == -EAGAIN) {
++ trace_nfs4_write(hdr, task->tk_status);
++ if (nfs4_async_handle_error(task, NFS_SERVER(inode),
++ hdr->args.context->state) == -EAGAIN) {
+ rpc_restart_call_prepare(task);
+ return -EAGAIN;
+ }
+ if (task->tk_status >= 0) {
+- renew_lease(NFS_SERVER(inode), data->timestamp);
+- nfs_post_op_update_inode_force_wcc(inode, &data->fattr);
++ renew_lease(NFS_SERVER(inode), hdr->timestamp);
++ nfs_post_op_update_inode_force_wcc(inode, &hdr->fattr);
+ }
+ return 0;
+ }
+@@ -4142,23 +4148,21 @@ static bool nfs4_write_stateid_changed(struct rpc_task *task,
+ return true;
+ }
+
+-static int nfs4_write_done(struct rpc_task *task, struct nfs_pgio_data *data)
++static int nfs4_write_done(struct rpc_task *task, struct nfs_pgio_header *hdr)
+ {
+- if (!nfs4_sequence_done(task, &data->res.seq_res))
++ if (!nfs4_sequence_done(task, &hdr->res.seq_res))
+ return -EAGAIN;
+- if (nfs4_write_stateid_changed(task, &data->args))
++ if (nfs4_write_stateid_changed(task, &hdr->args))
+ return -EAGAIN;
+- return data->pgio_done_cb ? data->pgio_done_cb(task, data) :
+- nfs4_write_done_cb(task, data);
++ return hdr->pgio_done_cb ? hdr->pgio_done_cb(task, hdr) :
++ nfs4_write_done_cb(task, hdr);
+ }
+
+ static
+-bool nfs4_write_need_cache_consistency_data(const struct nfs_pgio_data *data)
++bool nfs4_write_need_cache_consistency_data(struct nfs_pgio_header *hdr)
+ {
+- const struct nfs_pgio_header *hdr = data->header;
+-
+ /* Don't request attributes for pNFS or O_DIRECT writes */
+- if (data->ds_clp != NULL || hdr->dreq != NULL)
++ if (hdr->ds_clp != NULL || hdr->dreq != NULL)
+ return false;
+ /* Otherwise, request attributes if and only if we don't hold
+ * a delegation
+@@ -4166,23 +4170,24 @@ bool nfs4_write_need_cache_consistency_data(const struct nfs_pgio_data *data)
+ return nfs4_have_delegation(hdr->inode, FMODE_READ) == 0;
+ }
+
+-static void nfs4_proc_write_setup(struct nfs_pgio_data *data, struct rpc_message *msg)
++static void nfs4_proc_write_setup(struct nfs_pgio_header *hdr,
++ struct rpc_message *msg)
+ {
+- struct nfs_server *server = NFS_SERVER(data->header->inode);
++ struct nfs_server *server = NFS_SERVER(hdr->inode);
+
+- if (!nfs4_write_need_cache_consistency_data(data)) {
+- data->args.bitmask = NULL;
+- data->res.fattr = NULL;
++ if (!nfs4_write_need_cache_consistency_data(hdr)) {
++ hdr->args.bitmask = NULL;
++ hdr->res.fattr = NULL;
+ } else
+- data->args.bitmask = server->cache_consistency_bitmask;
++ hdr->args.bitmask = server->cache_consistency_bitmask;
+
+- if (!data->pgio_done_cb)
+- data->pgio_done_cb = nfs4_write_done_cb;
+- data->res.server = server;
+- data->timestamp = jiffies;
++ if (!hdr->pgio_done_cb)
++ hdr->pgio_done_cb = nfs4_write_done_cb;
++ hdr->res.server = server;
++ hdr->timestamp = jiffies;
+
+ msg->rpc_proc = &nfs4_procedures[NFSPROC4_CLNT_WRITE];
+- nfs4_init_sequence(&data->args.seq_args, &data->res.seq_res, 1);
++ nfs4_init_sequence(&hdr->args.seq_args, &hdr->res.seq_res, 1);
+ }
+
+ static void nfs4_proc_commit_rpc_prepare(struct rpc_task *task, struct nfs_commit_data *data)
+diff --git a/fs/nfs/nfs4trace.h b/fs/nfs/nfs4trace.h
+index 0a744f3a86f6..1c32adbe728d 100644
+--- a/fs/nfs/nfs4trace.h
++++ b/fs/nfs/nfs4trace.h
+@@ -932,11 +932,11 @@ DEFINE_NFS4_IDMAP_EVENT(nfs4_map_gid_to_group);
+
+ DECLARE_EVENT_CLASS(nfs4_read_event,
+ TP_PROTO(
+- const struct nfs_pgio_data *data,
++ const struct nfs_pgio_header *hdr,
+ int error
+ ),
+
+- TP_ARGS(data, error),
++ TP_ARGS(hdr, error),
+
+ TP_STRUCT__entry(
+ __field(dev_t, dev)
+@@ -948,12 +948,12 @@ DECLARE_EVENT_CLASS(nfs4_read_event,
+ ),
+
+ TP_fast_assign(
+- const struct inode *inode = data->header->inode;
++ const struct inode *inode = hdr->inode;
+ __entry->dev = inode->i_sb->s_dev;
+ __entry->fileid = NFS_FILEID(inode);
+ __entry->fhandle = nfs_fhandle_hash(NFS_FH(inode));
+- __entry->offset = data->args.offset;
+- __entry->count = data->args.count;
++ __entry->offset = hdr->args.offset;
++ __entry->count = hdr->args.count;
+ __entry->error = error;
+ ),
+
+@@ -972,10 +972,10 @@ DECLARE_EVENT_CLASS(nfs4_read_event,
+ #define DEFINE_NFS4_READ_EVENT(name) \
+ DEFINE_EVENT(nfs4_read_event, name, \
+ TP_PROTO( \
+- const struct nfs_pgio_data *data, \
++ const struct nfs_pgio_header *hdr, \
+ int error \
+ ), \
+- TP_ARGS(data, error))
++ TP_ARGS(hdr, error))
+ DEFINE_NFS4_READ_EVENT(nfs4_read);
+ #ifdef CONFIG_NFS_V4_1
+ DEFINE_NFS4_READ_EVENT(nfs4_pnfs_read);
+@@ -983,11 +983,11 @@ DEFINE_NFS4_READ_EVENT(nfs4_pnfs_read);
+
+ DECLARE_EVENT_CLASS(nfs4_write_event,
+ TP_PROTO(
+- const struct nfs_pgio_data *data,
++ const struct nfs_pgio_header *hdr,
+ int error
+ ),
+
+- TP_ARGS(data, error),
++ TP_ARGS(hdr, error),
+
+ TP_STRUCT__entry(
+ __field(dev_t, dev)
+@@ -999,12 +999,12 @@ DECLARE_EVENT_CLASS(nfs4_write_event,
+ ),
+
+ TP_fast_assign(
+- const struct inode *inode = data->header->inode;
++ const struct inode *inode = hdr->inode;
+ __entry->dev = inode->i_sb->s_dev;
+ __entry->fileid = NFS_FILEID(inode);
+ __entry->fhandle = nfs_fhandle_hash(NFS_FH(inode));
+- __entry->offset = data->args.offset;
+- __entry->count = data->args.count;
++ __entry->offset = hdr->args.offset;
++ __entry->count = hdr->args.count;
+ __entry->error = error;
+ ),
+
+@@ -1024,10 +1024,10 @@ DECLARE_EVENT_CLASS(nfs4_write_event,
+ #define DEFINE_NFS4_WRITE_EVENT(name) \
+ DEFINE_EVENT(nfs4_write_event, name, \
+ TP_PROTO( \
+- const struct nfs_pgio_data *data, \
++ const struct nfs_pgio_header *hdr, \
+ int error \
+ ), \
+- TP_ARGS(data, error))
++ TP_ARGS(hdr, error))
+ DEFINE_NFS4_WRITE_EVENT(nfs4_write);
+ #ifdef CONFIG_NFS_V4_1
+ DEFINE_NFS4_WRITE_EVENT(nfs4_pnfs_write);
+diff --git a/fs/nfs/objlayout/objio_osd.c b/fs/nfs/objlayout/objio_osd.c
+index 611320753db2..ae05278b3761 100644
+--- a/fs/nfs/objlayout/objio_osd.c
++++ b/fs/nfs/objlayout/objio_osd.c
+@@ -439,22 +439,21 @@ static void _read_done(struct ore_io_state *ios, void *private)
+ objlayout_read_done(&objios->oir, status, objios->sync);
+ }
+
+-int objio_read_pagelist(struct nfs_pgio_data *rdata)
++int objio_read_pagelist(struct nfs_pgio_header *hdr)
+ {
+- struct nfs_pgio_header *hdr = rdata->header;
+ struct objio_state *objios;
+ int ret;
+
+ ret = objio_alloc_io_state(NFS_I(hdr->inode)->layout, true,
+- hdr->lseg, rdata->args.pages, rdata->args.pgbase,
+- rdata->args.offset, rdata->args.count, rdata,
++ hdr->lseg, hdr->args.pages, hdr->args.pgbase,
++ hdr->args.offset, hdr->args.count, hdr,
+ GFP_KERNEL, &objios);
+ if (unlikely(ret))
+ return ret;
+
+ objios->ios->done = _read_done;
+ dprintk("%s: offset=0x%llx length=0x%x\n", __func__,
+- rdata->args.offset, rdata->args.count);
++ hdr->args.offset, hdr->args.count);
+ ret = ore_read(objios->ios);
+ if (unlikely(ret))
+ objio_free_result(&objios->oir);
+@@ -487,11 +486,11 @@ static void _write_done(struct ore_io_state *ios, void *private)
+ static struct page *__r4w_get_page(void *priv, u64 offset, bool *uptodate)
+ {
+ struct objio_state *objios = priv;
+- struct nfs_pgio_data *wdata = objios->oir.rpcdata;
+- struct address_space *mapping = wdata->header->inode->i_mapping;
++ struct nfs_pgio_header *hdr = objios->oir.rpcdata;
++ struct address_space *mapping = hdr->inode->i_mapping;
+ pgoff_t index = offset / PAGE_SIZE;
+ struct page *page;
+- loff_t i_size = i_size_read(wdata->header->inode);
++ loff_t i_size = i_size_read(hdr->inode);
+
+ if (offset >= i_size) {
+ *uptodate = true;
+@@ -531,15 +530,14 @@ static const struct _ore_r4w_op _r4w_op = {
+ .put_page = &__r4w_put_page,
+ };
+
+-int objio_write_pagelist(struct nfs_pgio_data *wdata, int how)
++int objio_write_pagelist(struct nfs_pgio_header *hdr, int how)
+ {
+- struct nfs_pgio_header *hdr = wdata->header;
+ struct objio_state *objios;
+ int ret;
+
+ ret = objio_alloc_io_state(NFS_I(hdr->inode)->layout, false,
+- hdr->lseg, wdata->args.pages, wdata->args.pgbase,
+- wdata->args.offset, wdata->args.count, wdata, GFP_NOFS,
++ hdr->lseg, hdr->args.pages, hdr->args.pgbase,
++ hdr->args.offset, hdr->args.count, hdr, GFP_NOFS,
+ &objios);
+ if (unlikely(ret))
+ return ret;
+@@ -551,7 +549,7 @@ int objio_write_pagelist(struct nfs_pgio_data *wdata, int how)
+ objios->ios->done = _write_done;
+
+ dprintk("%s: offset=0x%llx length=0x%x\n", __func__,
+- wdata->args.offset, wdata->args.count);
++ hdr->args.offset, hdr->args.count);
+ ret = ore_write(objios->ios);
+ if (unlikely(ret)) {
+ objio_free_result(&objios->oir);
+diff --git a/fs/nfs/objlayout/objlayout.c b/fs/nfs/objlayout/objlayout.c
+index 765d3f54e986..86312787cee6 100644
+--- a/fs/nfs/objlayout/objlayout.c
++++ b/fs/nfs/objlayout/objlayout.c
+@@ -229,36 +229,36 @@ objlayout_io_set_result(struct objlayout_io_res *oir, unsigned index,
+ static void _rpc_read_complete(struct work_struct *work)
+ {
+ struct rpc_task *task;
+- struct nfs_pgio_data *rdata;
++ struct nfs_pgio_header *hdr;
+
+ dprintk("%s enter\n", __func__);
+ task = container_of(work, struct rpc_task, u.tk_work);
+- rdata = container_of(task, struct nfs_pgio_data, task);
++ hdr = container_of(task, struct nfs_pgio_header, task);
+
+- pnfs_ld_read_done(rdata);
++ pnfs_ld_read_done(hdr);
+ }
+
+ void
+ objlayout_read_done(struct objlayout_io_res *oir, ssize_t status, bool sync)
+ {
+- struct nfs_pgio_data *rdata = oir->rpcdata;
++ struct nfs_pgio_header *hdr = oir->rpcdata;
+
+- oir->status = rdata->task.tk_status = status;
++ oir->status = hdr->task.tk_status = status;
+ if (status >= 0)
+- rdata->res.count = status;
++ hdr->res.count = status;
+ else
+- rdata->header->pnfs_error = status;
++ hdr->pnfs_error = status;
+ objlayout_iodone(oir);
+ /* must not use oir after this point */
+
+ dprintk("%s: Return status=%zd eof=%d sync=%d\n", __func__,
+- status, rdata->res.eof, sync);
++ status, hdr->res.eof, sync);
+
+ if (sync)
+- pnfs_ld_read_done(rdata);
++ pnfs_ld_read_done(hdr);
+ else {
+- INIT_WORK(&rdata->task.u.tk_work, _rpc_read_complete);
+- schedule_work(&rdata->task.u.tk_work);
++ INIT_WORK(&hdr->task.u.tk_work, _rpc_read_complete);
++ schedule_work(&hdr->task.u.tk_work);
+ }
+ }
+
+@@ -266,12 +266,11 @@ objlayout_read_done(struct objlayout_io_res *oir, ssize_t status, bool sync)
+ * Perform sync or async reads.
+ */
+ enum pnfs_try_status
+-objlayout_read_pagelist(struct nfs_pgio_data *rdata)
++objlayout_read_pagelist(struct nfs_pgio_header *hdr)
+ {
+- struct nfs_pgio_header *hdr = rdata->header;
+ struct inode *inode = hdr->inode;
+- loff_t offset = rdata->args.offset;
+- size_t count = rdata->args.count;
++ loff_t offset = hdr->args.offset;
++ size_t count = hdr->args.count;
+ int err;
+ loff_t eof;
+
+@@ -279,23 +278,23 @@ objlayout_read_pagelist(struct nfs_pgio_data *rdata)
+ if (unlikely(offset + count > eof)) {
+ if (offset >= eof) {
+ err = 0;
+- rdata->res.count = 0;
+- rdata->res.eof = 1;
++ hdr->res.count = 0;
++ hdr->res.eof = 1;
+ /*FIXME: do we need to call pnfs_ld_read_done() */
+ goto out;
+ }
+ count = eof - offset;
+ }
+
+- rdata->res.eof = (offset + count) >= eof;
+- _fix_verify_io_params(hdr->lseg, &rdata->args.pages,
+- &rdata->args.pgbase,
+- rdata->args.offset, rdata->args.count);
++ hdr->res.eof = (offset + count) >= eof;
++ _fix_verify_io_params(hdr->lseg, &hdr->args.pages,
++ &hdr->args.pgbase,
++ hdr->args.offset, hdr->args.count);
+
+ dprintk("%s: inode(%lx) offset 0x%llx count 0x%Zx eof=%d\n",
+- __func__, inode->i_ino, offset, count, rdata->res.eof);
++ __func__, inode->i_ino, offset, count, hdr->res.eof);
+
+- err = objio_read_pagelist(rdata);
++ err = objio_read_pagelist(hdr);
+ out:
+ if (unlikely(err)) {
+ hdr->pnfs_error = err;
+@@ -312,38 +311,38 @@ objlayout_read_pagelist(struct nfs_pgio_data *rdata)
+ static void _rpc_write_complete(struct work_struct *work)
+ {
+ struct rpc_task *task;
+- struct nfs_pgio_data *wdata;
++ struct nfs_pgio_header *hdr;
+
+ dprintk("%s enter\n", __func__);
+ task = container_of(work, struct rpc_task, u.tk_work);
+- wdata = container_of(task, struct nfs_pgio_data, task);
++ hdr = container_of(task, struct nfs_pgio_header, task);
+
+- pnfs_ld_write_done(wdata);
++ pnfs_ld_write_done(hdr);
+ }
+
+ void
+ objlayout_write_done(struct objlayout_io_res *oir, ssize_t status, bool sync)
+ {
+- struct nfs_pgio_data *wdata = oir->rpcdata;
++ struct nfs_pgio_header *hdr = oir->rpcdata;
+
+- oir->status = wdata->task.tk_status = status;
++ oir->status = hdr->task.tk_status = status;
+ if (status >= 0) {
+- wdata->res.count = status;
+- wdata->verf.committed = oir->committed;
++ hdr->res.count = status;
++ hdr->writeverf.committed = oir->committed;
+ } else {
+- wdata->header->pnfs_error = status;
++ hdr->pnfs_error = status;
+ }
+ objlayout_iodone(oir);
+ /* must not use oir after this point */
+
+ dprintk("%s: Return status %zd committed %d sync=%d\n", __func__,
+- status, wdata->verf.committed, sync);
++ status, hdr->writeverf.committed, sync);
+
+ if (sync)
+- pnfs_ld_write_done(wdata);
++ pnfs_ld_write_done(hdr);
+ else {
+- INIT_WORK(&wdata->task.u.tk_work, _rpc_write_complete);
+- schedule_work(&wdata->task.u.tk_work);
++ INIT_WORK(&hdr->task.u.tk_work, _rpc_write_complete);
++ schedule_work(&hdr->task.u.tk_work);
+ }
+ }
+
+@@ -351,17 +350,15 @@ objlayout_write_done(struct objlayout_io_res *oir, ssize_t status, bool sync)
+ * Perform sync or async writes.
+ */
+ enum pnfs_try_status
+-objlayout_write_pagelist(struct nfs_pgio_data *wdata,
+- int how)
++objlayout_write_pagelist(struct nfs_pgio_header *hdr, int how)
+ {
+- struct nfs_pgio_header *hdr = wdata->header;
+ int err;
+
+- _fix_verify_io_params(hdr->lseg, &wdata->args.pages,
+- &wdata->args.pgbase,
+- wdata->args.offset, wdata->args.count);
++ _fix_verify_io_params(hdr->lseg, &hdr->args.pages,
++ &hdr->args.pgbase,
++ hdr->args.offset, hdr->args.count);
+
+- err = objio_write_pagelist(wdata, how);
++ err = objio_write_pagelist(hdr, how);
+ if (unlikely(err)) {
+ hdr->pnfs_error = err;
+ dprintk("%s: Returned Error %d\n", __func__, err);
+diff --git a/fs/nfs/objlayout/objlayout.h b/fs/nfs/objlayout/objlayout.h
+index 01e041029a6c..fd13f1d2f136 100644
+--- a/fs/nfs/objlayout/objlayout.h
++++ b/fs/nfs/objlayout/objlayout.h
+@@ -119,8 +119,8 @@ extern void objio_free_lseg(struct pnfs_layout_segment *lseg);
+ */
+ extern void objio_free_result(struct objlayout_io_res *oir);
+
+-extern int objio_read_pagelist(struct nfs_pgio_data *rdata);
+-extern int objio_write_pagelist(struct nfs_pgio_data *wdata, int how);
++extern int objio_read_pagelist(struct nfs_pgio_header *rdata);
++extern int objio_write_pagelist(struct nfs_pgio_header *wdata, int how);
+
+ /*
+ * callback API
+@@ -168,10 +168,10 @@ extern struct pnfs_layout_segment *objlayout_alloc_lseg(
+ extern void objlayout_free_lseg(struct pnfs_layout_segment *);
+
+ extern enum pnfs_try_status objlayout_read_pagelist(
+- struct nfs_pgio_data *);
++ struct nfs_pgio_header *);
+
+ extern enum pnfs_try_status objlayout_write_pagelist(
+- struct nfs_pgio_data *,
++ struct nfs_pgio_header *,
+ int how);
+
+ extern void objlayout_encode_layoutcommit(
+diff --git a/fs/nfs/pagelist.c b/fs/nfs/pagelist.c
+index 17fab89f6358..34136ff5abf0 100644
+--- a/fs/nfs/pagelist.c
++++ b/fs/nfs/pagelist.c
+@@ -145,19 +145,51 @@ static int nfs_wait_bit_uninterruptible(void *word)
+ /*
+ * nfs_page_group_lock - lock the head of the page group
+ * @req - request in group that is to be locked
++ * @nonblock - if true don't block waiting for lock
+ *
+ * this lock must be held if modifying the page group list
++ *
++ * return 0 on success, < 0 on error: -EAGAIN if nonblocking or the
++ * result from wait_on_bit_lock
++ *
++ * NOTE: calling with nonblock=false should always have set the
++ * lock bit (see fs/buffer.c and other uses of wait_on_bit_lock
++ * with TASK_UNINTERRUPTIBLE), so there is no need to check the result.
++ */
++int
++nfs_page_group_lock(struct nfs_page *req, bool nonblock)
++{
++ struct nfs_page *head = req->wb_head;
++
++ WARN_ON_ONCE(head != head->wb_head);
++
++ if (!test_and_set_bit(PG_HEADLOCK, &head->wb_flags))
++ return 0;
++
++ if (!nonblock)
++ return wait_on_bit_lock(&head->wb_flags, PG_HEADLOCK,
++ nfs_wait_bit_uninterruptible,
++ TASK_UNINTERRUPTIBLE);
++
++ return -EAGAIN;
++}
++
++/*
++ * nfs_page_group_lock_wait - wait for the lock to clear, but don't grab it
++ * @req - a request in the group
++ *
++ * This is a blocking call to wait for the group lock to be cleared.
+ */
+ void
+-nfs_page_group_lock(struct nfs_page *req)
++nfs_page_group_lock_wait(struct nfs_page *req)
+ {
+ struct nfs_page *head = req->wb_head;
+
+ WARN_ON_ONCE(head != head->wb_head);
+
+- wait_on_bit_lock(&head->wb_flags, PG_HEADLOCK,
+- nfs_wait_bit_uninterruptible,
+- TASK_UNINTERRUPTIBLE);
++ wait_on_bit(&head->wb_flags, PG_HEADLOCK,
++ nfs_wait_bit_uninterruptible,
++ TASK_UNINTERRUPTIBLE);
+ }
+
+ /*
+@@ -218,7 +250,7 @@ bool nfs_page_group_sync_on_bit(struct nfs_page *req, unsigned int bit)
+ {
+ bool ret;
+
+- nfs_page_group_lock(req);
++ nfs_page_group_lock(req, false);
+ ret = nfs_page_group_sync_on_bit_locked(req, bit);
+ nfs_page_group_unlock(req);
+
+@@ -462,123 +494,72 @@ size_t nfs_generic_pg_test(struct nfs_pageio_descriptor *desc,
+ }
+ EXPORT_SYMBOL_GPL(nfs_generic_pg_test);
+
+-static inline struct nfs_rw_header *NFS_RW_HEADER(struct nfs_pgio_header *hdr)
+-{
+- return container_of(hdr, struct nfs_rw_header, header);
+-}
+-
+-/**
+- * nfs_rw_header_alloc - Allocate a header for a read or write
+- * @ops: Read or write function vector
+- */
+-struct nfs_rw_header *nfs_rw_header_alloc(const struct nfs_rw_ops *ops)
++struct nfs_pgio_header *nfs_pgio_header_alloc(const struct nfs_rw_ops *ops)
+ {
+- struct nfs_rw_header *header = ops->rw_alloc_header();
+-
+- if (header) {
+- struct nfs_pgio_header *hdr = &header->header;
++ struct nfs_pgio_header *hdr = ops->rw_alloc_header();
+
++ if (hdr) {
+ INIT_LIST_HEAD(&hdr->pages);
+ spin_lock_init(&hdr->lock);
+- atomic_set(&hdr->refcnt, 0);
+ hdr->rw_ops = ops;
+ }
+- return header;
++ return hdr;
+ }
+-EXPORT_SYMBOL_GPL(nfs_rw_header_alloc);
++EXPORT_SYMBOL_GPL(nfs_pgio_header_alloc);
+
+ /*
+- * nfs_rw_header_free - Free a read or write header
++ * nfs_pgio_header_free - Free a read or write header
+ * @hdr: The header to free
+ */
+-void nfs_rw_header_free(struct nfs_pgio_header *hdr)
++void nfs_pgio_header_free(struct nfs_pgio_header *hdr)
+ {
+- hdr->rw_ops->rw_free_header(NFS_RW_HEADER(hdr));
+-}
+-EXPORT_SYMBOL_GPL(nfs_rw_header_free);
+-
+-/**
+- * nfs_pgio_data_alloc - Allocate pageio data
+- * @hdr: The header making a request
+- * @pagecount: Number of pages to create
+- */
+-static struct nfs_pgio_data *nfs_pgio_data_alloc(struct nfs_pgio_header *hdr,
+- unsigned int pagecount)
+-{
+- struct nfs_pgio_data *data, *prealloc;
+-
+- prealloc = &NFS_RW_HEADER(hdr)->rpc_data;
+- if (prealloc->header == NULL)
+- data = prealloc;
+- else
+- data = kzalloc(sizeof(*data), GFP_KERNEL);
+- if (!data)
+- goto out;
+-
+- if (nfs_pgarray_set(&data->pages, pagecount)) {
+- data->header = hdr;
+- atomic_inc(&hdr->refcnt);
+- } else {
+- if (data != prealloc)
+- kfree(data);
+- data = NULL;
+- }
+-out:
+- return data;
++ hdr->rw_ops->rw_free_header(hdr);
+ }
++EXPORT_SYMBOL_GPL(nfs_pgio_header_free);
+
+ /**
+- * nfs_pgio_data_release - Properly free pageio data
+- * @data: The data to release
++ * nfs_pgio_data_destroy - make @hdr suitable for reuse
++ *
++ * Frees memory and releases refs from nfs_generic_pgio, so that it may
++ * be called again.
++ *
++ * @hdr: A header that has had nfs_generic_pgio called
+ */
+-void nfs_pgio_data_release(struct nfs_pgio_data *data)
++void nfs_pgio_data_destroy(struct nfs_pgio_header *hdr)
+ {
+- struct nfs_pgio_header *hdr = data->header;
+- struct nfs_rw_header *pageio_header = NFS_RW_HEADER(hdr);
+-
+- put_nfs_open_context(data->args.context);
+- if (data->pages.pagevec != data->pages.page_array)
+- kfree(data->pages.pagevec);
+- if (data == &pageio_header->rpc_data) {
+- data->header = NULL;
+- data = NULL;
+- }
+- if (atomic_dec_and_test(&hdr->refcnt))
+- hdr->completion_ops->completion(hdr);
+- /* Note: we only free the rpc_task after callbacks are done.
+- * See the comment in rpc_free_task() for why
+- */
+- kfree(data);
++ put_nfs_open_context(hdr->args.context);
++ if (hdr->page_array.pagevec != hdr->page_array.page_array)
++ kfree(hdr->page_array.pagevec);
+ }
+-EXPORT_SYMBOL_GPL(nfs_pgio_data_release);
++EXPORT_SYMBOL_GPL(nfs_pgio_data_destroy);
+
+ /**
+ * nfs_pgio_rpcsetup - Set up arguments for a pageio call
+- * @data: The pageio data
++ * @hdr: The pageio hdr
+ * @count: Number of bytes to read
+ * @offset: Initial offset
+ * @how: How to commit data (writes only)
+ * @cinfo: Commit information for the call (writes only)
+ */
+-static void nfs_pgio_rpcsetup(struct nfs_pgio_data *data,
++static void nfs_pgio_rpcsetup(struct nfs_pgio_header *hdr,
+ unsigned int count, unsigned int offset,
+ int how, struct nfs_commit_info *cinfo)
+ {
+- struct nfs_page *req = data->header->req;
++ struct nfs_page *req = hdr->req;
+
+ /* Set up the RPC argument and reply structs
+- * NB: take care not to mess about with data->commit et al. */
++ * NB: take care not to mess about with hdr->commit et al. */
+
+- data->args.fh = NFS_FH(data->header->inode);
+- data->args.offset = req_offset(req) + offset;
++ hdr->args.fh = NFS_FH(hdr->inode);
++ hdr->args.offset = req_offset(req) + offset;
+ /* pnfs_set_layoutcommit needs this */
+- data->mds_offset = data->args.offset;
+- data->args.pgbase = req->wb_pgbase + offset;
+- data->args.pages = data->pages.pagevec;
+- data->args.count = count;
+- data->args.context = get_nfs_open_context(req->wb_context);
+- data->args.lock_context = req->wb_lock_context;
+- data->args.stable = NFS_UNSTABLE;
++ hdr->mds_offset = hdr->args.offset;
++ hdr->args.pgbase = req->wb_pgbase + offset;
++ hdr->args.pages = hdr->page_array.pagevec;
++ hdr->args.count = count;
++ hdr->args.context = get_nfs_open_context(req->wb_context);
++ hdr->args.lock_context = req->wb_lock_context;
++ hdr->args.stable = NFS_UNSTABLE;
+ switch (how & (FLUSH_STABLE | FLUSH_COND_STABLE)) {
+ case 0:
+ break;
+@@ -586,59 +567,60 @@ static void nfs_pgio_rpcsetup(struct nfs_pgio_data *data,
+ if (nfs_reqs_to_commit(cinfo))
+ break;
+ default:
+- data->args.stable = NFS_FILE_SYNC;
++ hdr->args.stable = NFS_FILE_SYNC;
+ }
+
+- data->res.fattr = &data->fattr;
+- data->res.count = count;
+- data->res.eof = 0;
+- data->res.verf = &data->verf;
+- nfs_fattr_init(&data->fattr);
++ hdr->res.fattr = &hdr->fattr;
++ hdr->res.count = count;
++ hdr->res.eof = 0;
++ hdr->res.verf = &hdr->writeverf;
++ nfs_fattr_init(&hdr->fattr);
+ }
+
+ /**
+- * nfs_pgio_prepare - Prepare pageio data to go over the wire
++ * nfs_pgio_prepare - Prepare pageio hdr to go over the wire
+ * @task: The current task
+- * @calldata: pageio data to prepare
++ * @calldata: pageio header to prepare
+ */
+ static void nfs_pgio_prepare(struct rpc_task *task, void *calldata)
+ {
+- struct nfs_pgio_data *data = calldata;
++ struct nfs_pgio_header *hdr = calldata;
+ int err;
+- err = NFS_PROTO(data->header->inode)->pgio_rpc_prepare(task, data);
++ err = NFS_PROTO(hdr->inode)->pgio_rpc_prepare(task, hdr);
+ if (err)
+ rpc_exit(task, err);
+ }
+
+-int nfs_initiate_pgio(struct rpc_clnt *clnt, struct nfs_pgio_data *data,
++int nfs_initiate_pgio(struct rpc_clnt *clnt, struct nfs_pgio_header *hdr,
+ const struct rpc_call_ops *call_ops, int how, int flags)
+ {
++ struct inode *inode = hdr->inode;
+ struct rpc_task *task;
+ struct rpc_message msg = {
+- .rpc_argp = &data->args,
+- .rpc_resp = &data->res,
+- .rpc_cred = data->header->cred,
++ .rpc_argp = &hdr->args,
++ .rpc_resp = &hdr->res,
++ .rpc_cred = hdr->cred,
+ };
+ struct rpc_task_setup task_setup_data = {
+ .rpc_client = clnt,
+- .task = &data->task,
++ .task = &hdr->task,
+ .rpc_message = &msg,
+ .callback_ops = call_ops,
+- .callback_data = data,
++ .callback_data = hdr,
+ .workqueue = nfsiod_workqueue,
+ .flags = RPC_TASK_ASYNC | flags,
+ };
+ int ret = 0;
+
+- data->header->rw_ops->rw_initiate(data, &msg, &task_setup_data, how);
++ hdr->rw_ops->rw_initiate(hdr, &msg, &task_setup_data, how);
+
+ dprintk("NFS: %5u initiated pgio call "
+ "(req %s/%llu, %u bytes @ offset %llu)\n",
+- data->task.tk_pid,
+- data->header->inode->i_sb->s_id,
+- (unsigned long long)NFS_FILEID(data->header->inode),
+- data->args.count,
+- (unsigned long long)data->args.offset);
++ hdr->task.tk_pid,
++ inode->i_sb->s_id,
++ (unsigned long long)NFS_FILEID(inode),
++ hdr->args.count,
++ (unsigned long long)hdr->args.offset);
+
+ task = rpc_run_task(&task_setup_data);
+ if (IS_ERR(task)) {
+@@ -665,22 +647,23 @@ static int nfs_pgio_error(struct nfs_pageio_descriptor *desc,
+ struct nfs_pgio_header *hdr)
+ {
+ set_bit(NFS_IOHDR_REDO, &hdr->flags);
+- nfs_pgio_data_release(hdr->data);
+- hdr->data = NULL;
++ nfs_pgio_data_destroy(hdr);
++ hdr->completion_ops->completion(hdr);
+ desc->pg_completion_ops->error_cleanup(&desc->pg_list);
+ return -ENOMEM;
+ }
+
+ /**
+ * nfs_pgio_release - Release pageio data
+- * @calldata: The pageio data to release
++ * @calldata: The pageio header to release
+ */
+ static void nfs_pgio_release(void *calldata)
+ {
+- struct nfs_pgio_data *data = calldata;
+- if (data->header->rw_ops->rw_release)
+- data->header->rw_ops->rw_release(data);
+- nfs_pgio_data_release(data);
++ struct nfs_pgio_header *hdr = calldata;
++ if (hdr->rw_ops->rw_release)
++ hdr->rw_ops->rw_release(hdr);
++ nfs_pgio_data_destroy(hdr);
++ hdr->completion_ops->completion(hdr);
+ }
+
+ /**
+@@ -721,22 +704,22 @@ EXPORT_SYMBOL_GPL(nfs_pageio_init);
+ /**
+ * nfs_pgio_result - Basic pageio error handling
+ * @task: The task that ran
+- * @calldata: Pageio data to check
++ * @calldata: Pageio header to check
+ */
+ static void nfs_pgio_result(struct rpc_task *task, void *calldata)
+ {
+- struct nfs_pgio_data *data = calldata;
+- struct inode *inode = data->header->inode;
++ struct nfs_pgio_header *hdr = calldata;
++ struct inode *inode = hdr->inode;
+
+ dprintk("NFS: %s: %5u, (status %d)\n", __func__,
+ task->tk_pid, task->tk_status);
+
+- if (data->header->rw_ops->rw_done(task, data, inode) != 0)
++ if (hdr->rw_ops->rw_done(task, hdr, inode) != 0)
+ return;
+ if (task->tk_status < 0)
+- nfs_set_pgio_error(data->header, task->tk_status, data->args.offset);
++ nfs_set_pgio_error(hdr, task->tk_status, hdr->args.offset);
+ else
+- data->header->rw_ops->rw_result(task, data);
++ hdr->rw_ops->rw_result(task, hdr);
+ }
+
+ /*
+@@ -751,32 +734,42 @@ int nfs_generic_pgio(struct nfs_pageio_descriptor *desc,
+ struct nfs_pgio_header *hdr)
+ {
+ struct nfs_page *req;
+- struct page **pages;
+- struct nfs_pgio_data *data;
++ struct page **pages,
++ *last_page;
+ struct list_head *head = &desc->pg_list;
+ struct nfs_commit_info cinfo;
++ unsigned int pagecount, pageused;
+
+- data = nfs_pgio_data_alloc(hdr, nfs_page_array_len(desc->pg_base,
+- desc->pg_count));
+- if (!data)
++ pagecount = nfs_page_array_len(desc->pg_base, desc->pg_count);
++ if (!nfs_pgarray_set(&hdr->page_array, pagecount))
+ return nfs_pgio_error(desc, hdr);
+
+ nfs_init_cinfo(&cinfo, desc->pg_inode, desc->pg_dreq);
+- pages = data->pages.pagevec;
++ pages = hdr->page_array.pagevec;
++ last_page = NULL;
++ pageused = 0;
+ while (!list_empty(head)) {
+ req = nfs_list_entry(head->next);
+ nfs_list_remove_request(req);
+ nfs_list_add_request(req, &hdr->pages);
+- *pages++ = req->wb_page;
++
++ if (WARN_ON_ONCE(pageused >= pagecount))
++ return nfs_pgio_error(desc, hdr);
++
++ if (!last_page || last_page != req->wb_page) {
++ *pages++ = last_page = req->wb_page;
++ pageused++;
++ }
+ }
++ if (WARN_ON_ONCE(pageused != pagecount))
++ return nfs_pgio_error(desc, hdr);
+
+ if ((desc->pg_ioflags & FLUSH_COND_STABLE) &&
+ (desc->pg_moreio || nfs_reqs_to_commit(&cinfo)))
+ desc->pg_ioflags &= ~FLUSH_COND_STABLE;
+
+ /* Set up the argument struct */
+- nfs_pgio_rpcsetup(data, desc->pg_count, 0, desc->pg_ioflags, &cinfo);
+- hdr->data = data;
++ nfs_pgio_rpcsetup(hdr, desc->pg_count, 0, desc->pg_ioflags, &cinfo);
+ desc->pg_rpc_callops = &nfs_pgio_common_ops;
+ return 0;
+ }
+@@ -784,25 +777,20 @@ EXPORT_SYMBOL_GPL(nfs_generic_pgio);
+
+ static int nfs_generic_pg_pgios(struct nfs_pageio_descriptor *desc)
+ {
+- struct nfs_rw_header *rw_hdr;
+ struct nfs_pgio_header *hdr;
+ int ret;
+
+- rw_hdr = nfs_rw_header_alloc(desc->pg_rw_ops);
+- if (!rw_hdr) {
++ hdr = nfs_pgio_header_alloc(desc->pg_rw_ops);
++ if (!hdr) {
+ desc->pg_completion_ops->error_cleanup(&desc->pg_list);
+ return -ENOMEM;
+ }
+- hdr = &rw_hdr->header;
+- nfs_pgheader_init(desc, hdr, nfs_rw_header_free);
+- atomic_inc(&hdr->refcnt);
++ nfs_pgheader_init(desc, hdr, nfs_pgio_header_free);
+ ret = nfs_generic_pgio(desc, hdr);
+ if (ret == 0)
+ ret = nfs_initiate_pgio(NFS_CLIENT(hdr->inode),
+- hdr->data, desc->pg_rpc_callops,
++ hdr, desc->pg_rpc_callops,
+ desc->pg_ioflags, 0);
+- if (atomic_dec_and_test(&hdr->refcnt))
+- hdr->completion_ops->completion(hdr);
+ return ret;
+ }
+
+@@ -845,6 +833,14 @@ static bool nfs_can_coalesce_requests(struct nfs_page *prev,
+ return false;
+ if (req_offset(req) != req_offset(prev) + prev->wb_bytes)
+ return false;
++ if (req->wb_page == prev->wb_page) {
++ if (req->wb_pgbase != prev->wb_pgbase + prev->wb_bytes)
++ return false;
++ } else {
++ if (req->wb_pgbase != 0 ||
++ prev->wb_pgbase + prev->wb_bytes != PAGE_CACHE_SIZE)
++ return false;
++ }
+ }
+ size = pgio->pg_ops->pg_test(pgio, prev, req);
+ WARN_ON_ONCE(size > req->wb_bytes);
+@@ -916,7 +912,7 @@ static int __nfs_pageio_add_request(struct nfs_pageio_descriptor *desc,
+ unsigned int bytes_left = 0;
+ unsigned int offset, pgbase;
+
+- nfs_page_group_lock(req);
++ nfs_page_group_lock(req, false);
+
+ subreq = req;
+ bytes_left = subreq->wb_bytes;
+@@ -938,7 +934,7 @@ static int __nfs_pageio_add_request(struct nfs_pageio_descriptor *desc,
+ if (desc->pg_recoalesce)
+ return 0;
+ /* retry add_request for this subreq */
+- nfs_page_group_lock(req);
++ nfs_page_group_lock(req, false);
+ continue;
+ }
+
+diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
+index 6fdcd233d6f7..5f3eb3df7c59 100644
+--- a/fs/nfs/pnfs.c
++++ b/fs/nfs/pnfs.c
+@@ -361,6 +361,23 @@ pnfs_put_lseg(struct pnfs_layout_segment *lseg)
+ }
+ EXPORT_SYMBOL_GPL(pnfs_put_lseg);
+
++static void pnfs_put_lseg_async_work(struct work_struct *work)
++{
++ struct pnfs_layout_segment *lseg;
++
++ lseg = container_of(work, struct pnfs_layout_segment, pls_work);
++
++ pnfs_put_lseg(lseg);
++}
++
++void
++pnfs_put_lseg_async(struct pnfs_layout_segment *lseg)
++{
++ INIT_WORK(&lseg->pls_work, pnfs_put_lseg_async_work);
++ schedule_work(&lseg->pls_work);
++}
++EXPORT_SYMBOL_GPL(pnfs_put_lseg_async);
++
+ static u64
+ end_offset(u64 start, u64 len)
+ {
+@@ -1502,9 +1519,8 @@ int pnfs_write_done_resend_to_mds(struct inode *inode,
+ }
+ EXPORT_SYMBOL_GPL(pnfs_write_done_resend_to_mds);
+
+-static void pnfs_ld_handle_write_error(struct nfs_pgio_data *data)
++static void pnfs_ld_handle_write_error(struct nfs_pgio_header *hdr)
+ {
+- struct nfs_pgio_header *hdr = data->header;
+
+ dprintk("pnfs write error = %d\n", hdr->pnfs_error);
+ if (NFS_SERVER(hdr->inode)->pnfs_curr_ld->flags &
+@@ -1512,7 +1528,7 @@ static void pnfs_ld_handle_write_error(struct nfs_pgio_data *data)
+ pnfs_return_layout(hdr->inode);
+ }
+ if (!test_and_set_bit(NFS_IOHDR_REDO, &hdr->flags))
+- data->task.tk_status = pnfs_write_done_resend_to_mds(hdr->inode,
++ hdr->task.tk_status = pnfs_write_done_resend_to_mds(hdr->inode,
+ &hdr->pages,
+ hdr->completion_ops,
+ hdr->dreq);
+@@ -1521,41 +1537,36 @@ static void pnfs_ld_handle_write_error(struct nfs_pgio_data *data)
+ /*
+ * Called by non rpc-based layout drivers
+ */
+-void pnfs_ld_write_done(struct nfs_pgio_data *data)
++void pnfs_ld_write_done(struct nfs_pgio_header *hdr)
+ {
+- struct nfs_pgio_header *hdr = data->header;
+-
+- trace_nfs4_pnfs_write(data, hdr->pnfs_error);
++ trace_nfs4_pnfs_write(hdr, hdr->pnfs_error);
+ if (!hdr->pnfs_error) {
+- pnfs_set_layoutcommit(data);
+- hdr->mds_ops->rpc_call_done(&data->task, data);
++ pnfs_set_layoutcommit(hdr);
++ hdr->mds_ops->rpc_call_done(&hdr->task, hdr);
+ } else
+- pnfs_ld_handle_write_error(data);
+- hdr->mds_ops->rpc_release(data);
++ pnfs_ld_handle_write_error(hdr);
++ hdr->mds_ops->rpc_release(hdr);
+ }
+ EXPORT_SYMBOL_GPL(pnfs_ld_write_done);
+
+ static void
+ pnfs_write_through_mds(struct nfs_pageio_descriptor *desc,
+- struct nfs_pgio_data *data)
++ struct nfs_pgio_header *hdr)
+ {
+- struct nfs_pgio_header *hdr = data->header;
+-
+ if (!test_and_set_bit(NFS_IOHDR_REDO, &hdr->flags)) {
+ list_splice_tail_init(&hdr->pages, &desc->pg_list);
+ nfs_pageio_reset_write_mds(desc);
+ desc->pg_recoalesce = 1;
+ }
+- nfs_pgio_data_release(data);
++ nfs_pgio_data_destroy(hdr);
+ }
+
+ static enum pnfs_try_status
+-pnfs_try_to_write_data(struct nfs_pgio_data *wdata,
++pnfs_try_to_write_data(struct nfs_pgio_header *hdr,
+ const struct rpc_call_ops *call_ops,
+ struct pnfs_layout_segment *lseg,
+ int how)
+ {
+- struct nfs_pgio_header *hdr = wdata->header;
+ struct inode *inode = hdr->inode;
+ enum pnfs_try_status trypnfs;
+ struct nfs_server *nfss = NFS_SERVER(inode);
+@@ -1563,8 +1574,8 @@ pnfs_try_to_write_data(struct nfs_pgio_data *wdata,
+ hdr->mds_ops = call_ops;
+
+ dprintk("%s: Writing ino:%lu %u@%llu (how %d)\n", __func__,
+- inode->i_ino, wdata->args.count, wdata->args.offset, how);
+- trypnfs = nfss->pnfs_curr_ld->write_pagelist(wdata, how);
++ inode->i_ino, hdr->args.count, hdr->args.offset, how);
++ trypnfs = nfss->pnfs_curr_ld->write_pagelist(hdr, how);
+ if (trypnfs != PNFS_NOT_ATTEMPTED)
+ nfs_inc_stats(inode, NFSIOS_PNFS_WRITE);
+ dprintk("%s End (trypnfs:%d)\n", __func__, trypnfs);
+@@ -1575,51 +1586,45 @@ static void
+ pnfs_do_write(struct nfs_pageio_descriptor *desc,
+ struct nfs_pgio_header *hdr, int how)
+ {
+- struct nfs_pgio_data *data = hdr->data;
+ const struct rpc_call_ops *call_ops = desc->pg_rpc_callops;
+ struct pnfs_layout_segment *lseg = desc->pg_lseg;
+ enum pnfs_try_status trypnfs;
+
+ desc->pg_lseg = NULL;
+- trypnfs = pnfs_try_to_write_data(data, call_ops, lseg, how);
++ trypnfs = pnfs_try_to_write_data(hdr, call_ops, lseg, how);
+ if (trypnfs == PNFS_NOT_ATTEMPTED)
+- pnfs_write_through_mds(desc, data);
++ pnfs_write_through_mds(desc, hdr);
+ pnfs_put_lseg(lseg);
+ }
+
+ static void pnfs_writehdr_free(struct nfs_pgio_header *hdr)
+ {
+ pnfs_put_lseg(hdr->lseg);
+- nfs_rw_header_free(hdr);
++ nfs_pgio_header_free(hdr);
+ }
+ EXPORT_SYMBOL_GPL(pnfs_writehdr_free);
+
+ int
+ pnfs_generic_pg_writepages(struct nfs_pageio_descriptor *desc)
+ {
+- struct nfs_rw_header *whdr;
+ struct nfs_pgio_header *hdr;
+ int ret;
+
+- whdr = nfs_rw_header_alloc(desc->pg_rw_ops);
+- if (!whdr) {
++ hdr = nfs_pgio_header_alloc(desc->pg_rw_ops);
++ if (!hdr) {
+ desc->pg_completion_ops->error_cleanup(&desc->pg_list);
+ pnfs_put_lseg(desc->pg_lseg);
+ desc->pg_lseg = NULL;
+ return -ENOMEM;
+ }
+- hdr = &whdr->header;
+ nfs_pgheader_init(desc, hdr, pnfs_writehdr_free);
+ hdr->lseg = pnfs_get_lseg(desc->pg_lseg);
+- atomic_inc(&hdr->refcnt);
+ ret = nfs_generic_pgio(desc, hdr);
+ if (ret != 0) {
+ pnfs_put_lseg(desc->pg_lseg);
+ desc->pg_lseg = NULL;
+ } else
+ pnfs_do_write(desc, hdr, desc->pg_ioflags);
+- if (atomic_dec_and_test(&hdr->refcnt))
+- hdr->completion_ops->completion(hdr);
+ return ret;
+ }
+ EXPORT_SYMBOL_GPL(pnfs_generic_pg_writepages);
+@@ -1652,17 +1657,15 @@ int pnfs_read_done_resend_to_mds(struct inode *inode,
+ }
+ EXPORT_SYMBOL_GPL(pnfs_read_done_resend_to_mds);
+
+-static void pnfs_ld_handle_read_error(struct nfs_pgio_data *data)
++static void pnfs_ld_handle_read_error(struct nfs_pgio_header *hdr)
+ {
+- struct nfs_pgio_header *hdr = data->header;
+-
+ dprintk("pnfs read error = %d\n", hdr->pnfs_error);
+ if (NFS_SERVER(hdr->inode)->pnfs_curr_ld->flags &
+ PNFS_LAYOUTRET_ON_ERROR) {
+ pnfs_return_layout(hdr->inode);
+ }
+ if (!test_and_set_bit(NFS_IOHDR_REDO, &hdr->flags))
+- data->task.tk_status = pnfs_read_done_resend_to_mds(hdr->inode,
++ hdr->task.tk_status = pnfs_read_done_resend_to_mds(hdr->inode,
+ &hdr->pages,
+ hdr->completion_ops,
+ hdr->dreq);
+@@ -1671,43 +1674,38 @@ static void pnfs_ld_handle_read_error(struct nfs_pgio_data *data)
+ /*
+ * Called by non rpc-based layout drivers
+ */
+-void pnfs_ld_read_done(struct nfs_pgio_data *data)
++void pnfs_ld_read_done(struct nfs_pgio_header *hdr)
+ {
+- struct nfs_pgio_header *hdr = data->header;
+-
+- trace_nfs4_pnfs_read(data, hdr->pnfs_error);
++ trace_nfs4_pnfs_read(hdr, hdr->pnfs_error);
+ if (likely(!hdr->pnfs_error)) {
+- __nfs4_read_done_cb(data);
+- hdr->mds_ops->rpc_call_done(&data->task, data);
++ __nfs4_read_done_cb(hdr);
++ hdr->mds_ops->rpc_call_done(&hdr->task, hdr);
+ } else
+- pnfs_ld_handle_read_error(data);
+- hdr->mds_ops->rpc_release(data);
++ pnfs_ld_handle_read_error(hdr);
++ hdr->mds_ops->rpc_release(hdr);
+ }
+ EXPORT_SYMBOL_GPL(pnfs_ld_read_done);
+
+ static void
+ pnfs_read_through_mds(struct nfs_pageio_descriptor *desc,
+- struct nfs_pgio_data *data)
++ struct nfs_pgio_header *hdr)
+ {
+- struct nfs_pgio_header *hdr = data->header;
+-
+ if (!test_and_set_bit(NFS_IOHDR_REDO, &hdr->flags)) {
+ list_splice_tail_init(&hdr->pages, &desc->pg_list);
+ nfs_pageio_reset_read_mds(desc);
+ desc->pg_recoalesce = 1;
+ }
+- nfs_pgio_data_release(data);
++ nfs_pgio_data_destroy(hdr);
+ }
+
+ /*
+ * Call the appropriate parallel I/O subsystem read function.
+ */
+ static enum pnfs_try_status
+-pnfs_try_to_read_data(struct nfs_pgio_data *rdata,
++pnfs_try_to_read_data(struct nfs_pgio_header *hdr,
+ const struct rpc_call_ops *call_ops,
+ struct pnfs_layout_segment *lseg)
+ {
+- struct nfs_pgio_header *hdr = rdata->header;
+ struct inode *inode = hdr->inode;
+ struct nfs_server *nfss = NFS_SERVER(inode);
+ enum pnfs_try_status trypnfs;
+@@ -1715,9 +1713,9 @@ pnfs_try_to_read_data(struct nfs_pgio_data *rdata,
+ hdr->mds_ops = call_ops;
+
+ dprintk("%s: Reading ino:%lu %u@%llu\n",
+- __func__, inode->i_ino, rdata->args.count, rdata->args.offset);
++ __func__, inode->i_ino, hdr->args.count, hdr->args.offset);
+
+- trypnfs = nfss->pnfs_curr_ld->read_pagelist(rdata);
++ trypnfs = nfss->pnfs_curr_ld->read_pagelist(hdr);
+ if (trypnfs != PNFS_NOT_ATTEMPTED)
+ nfs_inc_stats(inode, NFSIOS_PNFS_READ);
+ dprintk("%s End (trypnfs:%d)\n", __func__, trypnfs);
+@@ -1727,52 +1725,46 @@ pnfs_try_to_read_data(struct nfs_pgio_data *rdata,
+ static void
+ pnfs_do_read(struct nfs_pageio_descriptor *desc, struct nfs_pgio_header *hdr)
+ {
+- struct nfs_pgio_data *data = hdr->data;
+ const struct rpc_call_ops *call_ops = desc->pg_rpc_callops;
+ struct pnfs_layout_segment *lseg = desc->pg_lseg;
+ enum pnfs_try_status trypnfs;
+
+ desc->pg_lseg = NULL;
+- trypnfs = pnfs_try_to_read_data(data, call_ops, lseg);
++ trypnfs = pnfs_try_to_read_data(hdr, call_ops, lseg);
+ if (trypnfs == PNFS_NOT_ATTEMPTED)
+- pnfs_read_through_mds(desc, data);
++ pnfs_read_through_mds(desc, hdr);
+ pnfs_put_lseg(lseg);
+ }
+
+ static void pnfs_readhdr_free(struct nfs_pgio_header *hdr)
+ {
+ pnfs_put_lseg(hdr->lseg);
+- nfs_rw_header_free(hdr);
++ nfs_pgio_header_free(hdr);
+ }
+ EXPORT_SYMBOL_GPL(pnfs_readhdr_free);
+
+ int
+ pnfs_generic_pg_readpages(struct nfs_pageio_descriptor *desc)
+ {
+- struct nfs_rw_header *rhdr;
+ struct nfs_pgio_header *hdr;
+ int ret;
+
+- rhdr = nfs_rw_header_alloc(desc->pg_rw_ops);
+- if (!rhdr) {
++ hdr = nfs_pgio_header_alloc(desc->pg_rw_ops);
++ if (!hdr) {
+ desc->pg_completion_ops->error_cleanup(&desc->pg_list);
+ ret = -ENOMEM;
+ pnfs_put_lseg(desc->pg_lseg);
+ desc->pg_lseg = NULL;
+ return ret;
+ }
+- hdr = &rhdr->header;
+ nfs_pgheader_init(desc, hdr, pnfs_readhdr_free);
+ hdr->lseg = pnfs_get_lseg(desc->pg_lseg);
+- atomic_inc(&hdr->refcnt);
+ ret = nfs_generic_pgio(desc, hdr);
+ if (ret != 0) {
+ pnfs_put_lseg(desc->pg_lseg);
+ desc->pg_lseg = NULL;
+ } else
+ pnfs_do_read(desc, hdr);
+- if (atomic_dec_and_test(&hdr->refcnt))
+- hdr->completion_ops->completion(hdr);
+ return ret;
+ }
+ EXPORT_SYMBOL_GPL(pnfs_generic_pg_readpages);
+@@ -1820,12 +1812,11 @@ void pnfs_set_lo_fail(struct pnfs_layout_segment *lseg)
+ EXPORT_SYMBOL_GPL(pnfs_set_lo_fail);
+
+ void
+-pnfs_set_layoutcommit(struct nfs_pgio_data *wdata)
++pnfs_set_layoutcommit(struct nfs_pgio_header *hdr)
+ {
+- struct nfs_pgio_header *hdr = wdata->header;
+ struct inode *inode = hdr->inode;
+ struct nfs_inode *nfsi = NFS_I(inode);
+- loff_t end_pos = wdata->mds_offset + wdata->res.count;
++ loff_t end_pos = hdr->mds_offset + hdr->res.count;
+ bool mark_as_dirty = false;
+
+ spin_lock(&inode->i_lock);
+diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
+index 4fb309a2b4c4..ae22a9ccc1b9 100644
+--- a/fs/nfs/pnfs.h
++++ b/fs/nfs/pnfs.h
+@@ -32,6 +32,7 @@
+
+ #include <linux/nfs_fs.h>
+ #include <linux/nfs_page.h>
++#include <linux/workqueue.h>
+
+ enum {
+ NFS_LSEG_VALID = 0, /* cleared when lseg is recalled/returned */
+@@ -46,6 +47,7 @@ struct pnfs_layout_segment {
+ atomic_t pls_refcount;
+ unsigned long pls_flags;
+ struct pnfs_layout_hdr *pls_layout;
++ struct work_struct pls_work;
+ };
+
+ enum pnfs_try_status {
+@@ -113,8 +115,8 @@ struct pnfs_layoutdriver_type {
+ * Return PNFS_ATTEMPTED to indicate the layout code has attempted
+ * I/O, else return PNFS_NOT_ATTEMPTED to fall back to normal NFS
+ */
+- enum pnfs_try_status (*read_pagelist) (struct nfs_pgio_data *nfs_data);
+- enum pnfs_try_status (*write_pagelist) (struct nfs_pgio_data *nfs_data, int how);
++ enum pnfs_try_status (*read_pagelist)(struct nfs_pgio_header *);
++ enum pnfs_try_status (*write_pagelist)(struct nfs_pgio_header *, int);
+
+ void (*free_deviceid_node) (struct nfs4_deviceid_node *);
+
+@@ -179,6 +181,7 @@ extern int nfs4_proc_layoutreturn(struct nfs4_layoutreturn *lrp);
+ /* pnfs.c */
+ void pnfs_get_layout_hdr(struct pnfs_layout_hdr *lo);
+ void pnfs_put_lseg(struct pnfs_layout_segment *lseg);
++void pnfs_put_lseg_async(struct pnfs_layout_segment *lseg);
+
+ void set_pnfs_layoutdriver(struct nfs_server *, const struct nfs_fh *, u32);
+ void unset_pnfs_layoutdriver(struct nfs_server *);
+@@ -213,13 +216,13 @@ bool pnfs_roc(struct inode *ino);
+ void pnfs_roc_release(struct inode *ino);
+ void pnfs_roc_set_barrier(struct inode *ino, u32 barrier);
+ bool pnfs_roc_drain(struct inode *ino, u32 *barrier, struct rpc_task *task);
+-void pnfs_set_layoutcommit(struct nfs_pgio_data *wdata);
++void pnfs_set_layoutcommit(struct nfs_pgio_header *);
+ void pnfs_cleanup_layoutcommit(struct nfs4_layoutcommit_data *data);
+ int pnfs_layoutcommit_inode(struct inode *inode, bool sync);
+ int _pnfs_return_layout(struct inode *);
+ int pnfs_commit_and_return_layout(struct inode *);
+-void pnfs_ld_write_done(struct nfs_pgio_data *);
+-void pnfs_ld_read_done(struct nfs_pgio_data *);
++void pnfs_ld_write_done(struct nfs_pgio_header *);
++void pnfs_ld_read_done(struct nfs_pgio_header *);
+ struct pnfs_layout_segment *pnfs_update_layout(struct inode *ino,
+ struct nfs_open_context *ctx,
+ loff_t pos,
+@@ -410,6 +413,10 @@ static inline void pnfs_put_lseg(struct pnfs_layout_segment *lseg)
+ {
+ }
+
++static inline void pnfs_put_lseg_async(struct pnfs_layout_segment *lseg)
++{
++}
++
+ static inline int pnfs_return_layout(struct inode *ino)
+ {
+ return 0;
+diff --git a/fs/nfs/proc.c b/fs/nfs/proc.c
+index c171ce1a8a30..b09cc23d6f43 100644
+--- a/fs/nfs/proc.c
++++ b/fs/nfs/proc.c
+@@ -578,46 +578,49 @@ nfs_proc_pathconf(struct nfs_server *server, struct nfs_fh *fhandle,
+ return 0;
+ }
+
+-static int nfs_read_done(struct rpc_task *task, struct nfs_pgio_data *data)
++static int nfs_read_done(struct rpc_task *task, struct nfs_pgio_header *hdr)
+ {
+- struct inode *inode = data->header->inode;
++ struct inode *inode = hdr->inode;
+
+ nfs_invalidate_atime(inode);
+ if (task->tk_status >= 0) {
+- nfs_refresh_inode(inode, data->res.fattr);
++ nfs_refresh_inode(inode, hdr->res.fattr);
+ /* Emulate the eof flag, which isn't normally needed in NFSv2
+ * as it is guaranteed to always return the file attributes
+ */
+- if (data->args.offset + data->res.count >= data->res.fattr->size)
+- data->res.eof = 1;
++ if (hdr->args.offset + hdr->res.count >= hdr->res.fattr->size)
++ hdr->res.eof = 1;
+ }
+ return 0;
+ }
+
+-static void nfs_proc_read_setup(struct nfs_pgio_data *data, struct rpc_message *msg)
++static void nfs_proc_read_setup(struct nfs_pgio_header *hdr,
++ struct rpc_message *msg)
+ {
+ msg->rpc_proc = &nfs_procedures[NFSPROC_READ];
+ }
+
+-static int nfs_proc_pgio_rpc_prepare(struct rpc_task *task, struct nfs_pgio_data *data)
++static int nfs_proc_pgio_rpc_prepare(struct rpc_task *task,
++ struct nfs_pgio_header *hdr)
+ {
+ rpc_call_start(task);
+ return 0;
+ }
+
+-static int nfs_write_done(struct rpc_task *task, struct nfs_pgio_data *data)
++static int nfs_write_done(struct rpc_task *task, struct nfs_pgio_header *hdr)
+ {
+- struct inode *inode = data->header->inode;
++ struct inode *inode = hdr->inode;
+
+ if (task->tk_status >= 0)
+- nfs_post_op_update_inode_force_wcc(inode, data->res.fattr);
++ nfs_post_op_update_inode_force_wcc(inode, hdr->res.fattr);
+ return 0;
+ }
+
+-static void nfs_proc_write_setup(struct nfs_pgio_data *data, struct rpc_message *msg)
++static void nfs_proc_write_setup(struct nfs_pgio_header *hdr,
++ struct rpc_message *msg)
+ {
+ /* Note: NFSv2 ignores @stable and always uses NFS_FILE_SYNC */
+- data->args.stable = NFS_FILE_SYNC;
++ hdr->args.stable = NFS_FILE_SYNC;
+ msg->rpc_proc = &nfs_procedures[NFSPROC_WRITE];
+ }
+
+diff --git a/fs/nfs/read.c b/fs/nfs/read.c
+index e818a475ca64..b1532b73fea3 100644
+--- a/fs/nfs/read.c
++++ b/fs/nfs/read.c
+@@ -33,12 +33,12 @@ static const struct nfs_rw_ops nfs_rw_read_ops;
+
+ static struct kmem_cache *nfs_rdata_cachep;
+
+-static struct nfs_rw_header *nfs_readhdr_alloc(void)
++static struct nfs_pgio_header *nfs_readhdr_alloc(void)
+ {
+ return kmem_cache_zalloc(nfs_rdata_cachep, GFP_KERNEL);
+ }
+
+-static void nfs_readhdr_free(struct nfs_rw_header *rhdr)
++static void nfs_readhdr_free(struct nfs_pgio_header *rhdr)
+ {
+ kmem_cache_free(nfs_rdata_cachep, rhdr);
+ }
+@@ -172,14 +172,15 @@ out:
+ hdr->release(hdr);
+ }
+
+-static void nfs_initiate_read(struct nfs_pgio_data *data, struct rpc_message *msg,
++static void nfs_initiate_read(struct nfs_pgio_header *hdr,
++ struct rpc_message *msg,
+ struct rpc_task_setup *task_setup_data, int how)
+ {
+- struct inode *inode = data->header->inode;
++ struct inode *inode = hdr->inode;
+ int swap_flags = IS_SWAPFILE(inode) ? NFS_RPC_SWAPFLAGS : 0;
+
+ task_setup_data->flags |= swap_flags;
+- NFS_PROTO(inode)->read_setup(data, msg);
++ NFS_PROTO(inode)->read_setup(hdr, msg);
+ }
+
+ static void
+@@ -203,14 +204,15 @@ static const struct nfs_pgio_completion_ops nfs_async_read_completion_ops = {
+ * This is the callback from RPC telling us whether a reply was
+ * received or some error occurred (timeout or socket shutdown).
+ */
+-static int nfs_readpage_done(struct rpc_task *task, struct nfs_pgio_data *data,
++static int nfs_readpage_done(struct rpc_task *task,
++ struct nfs_pgio_header *hdr,
+ struct inode *inode)
+ {
+- int status = NFS_PROTO(inode)->read_done(task, data);
++ int status = NFS_PROTO(inode)->read_done(task, hdr);
+ if (status != 0)
+ return status;
+
+- nfs_add_stats(inode, NFSIOS_SERVERREADBYTES, data->res.count);
++ nfs_add_stats(inode, NFSIOS_SERVERREADBYTES, hdr->res.count);
+
+ if (task->tk_status == -ESTALE) {
+ set_bit(NFS_INO_STALE, &NFS_I(inode)->flags);
+@@ -219,34 +221,34 @@ static int nfs_readpage_done(struct rpc_task *task, struct nfs_pgio_data *data,
+ return 0;
+ }
+
+-static void nfs_readpage_retry(struct rpc_task *task, struct nfs_pgio_data *data)
++static void nfs_readpage_retry(struct rpc_task *task,
++ struct nfs_pgio_header *hdr)
+ {
+- struct nfs_pgio_args *argp = &data->args;
+- struct nfs_pgio_res *resp = &data->res;
++ struct nfs_pgio_args *argp = &hdr->args;
++ struct nfs_pgio_res *resp = &hdr->res;
+
+ /* This is a short read! */
+- nfs_inc_stats(data->header->inode, NFSIOS_SHORTREAD);
++ nfs_inc_stats(hdr->inode, NFSIOS_SHORTREAD);
+ /* Has the server at least made some progress? */
+ if (resp->count == 0) {
+- nfs_set_pgio_error(data->header, -EIO, argp->offset);
++ nfs_set_pgio_error(hdr, -EIO, argp->offset);
+ return;
+ }
+- /* Yes, so retry the read at the end of the data */
+- data->mds_offset += resp->count;
++ /* Yes, so retry the read at the end of the data */
++ hdr->mds_offset += resp->count;
+ argp->offset += resp->count;
+ argp->pgbase += resp->count;
+ argp->count -= resp->count;
+ rpc_restart_call_prepare(task);
+ }
+
+-static void nfs_readpage_result(struct rpc_task *task, struct nfs_pgio_data *data)
++static void nfs_readpage_result(struct rpc_task *task,
++ struct nfs_pgio_header *hdr)
+ {
+- struct nfs_pgio_header *hdr = data->header;
+-
+- if (data->res.eof) {
++ if (hdr->res.eof) {
+ loff_t bound;
+
+- bound = data->args.offset + data->res.count;
++ bound = hdr->args.offset + hdr->res.count;
+ spin_lock(&hdr->lock);
+ if (bound < hdr->io_start + hdr->good_bytes) {
+ set_bit(NFS_IOHDR_EOF, &hdr->flags);
+@@ -254,8 +256,8 @@ static void nfs_readpage_result(struct rpc_task *task, struct nfs_pgio_data *dat
+ hdr->good_bytes = bound - hdr->io_start;
+ }
+ spin_unlock(&hdr->lock);
+- } else if (data->res.count != data->args.count)
+- nfs_readpage_retry(task, data);
++ } else if (hdr->res.count != hdr->args.count)
++ nfs_readpage_retry(task, hdr);
+ }
+
+ /*
+@@ -404,7 +406,7 @@ out:
+ int __init nfs_init_readpagecache(void)
+ {
+ nfs_rdata_cachep = kmem_cache_create("nfs_read_data",
+- sizeof(struct nfs_rw_header),
++ sizeof(struct nfs_pgio_header),
+ 0, SLAB_HWCACHE_ALIGN,
+ NULL);
+ if (nfs_rdata_cachep == NULL)
+diff --git a/fs/nfs/write.c b/fs/nfs/write.c
+index 5e2f10304548..ecb0f9fd5632 100644
+--- a/fs/nfs/write.c
++++ b/fs/nfs/write.c
+@@ -71,18 +71,18 @@ void nfs_commit_free(struct nfs_commit_data *p)
+ }
+ EXPORT_SYMBOL_GPL(nfs_commit_free);
+
+-static struct nfs_rw_header *nfs_writehdr_alloc(void)
++static struct nfs_pgio_header *nfs_writehdr_alloc(void)
+ {
+- struct nfs_rw_header *p = mempool_alloc(nfs_wdata_mempool, GFP_NOIO);
++ struct nfs_pgio_header *p = mempool_alloc(nfs_wdata_mempool, GFP_NOIO);
+
+ if (p)
+ memset(p, 0, sizeof(*p));
+ return p;
+ }
+
+-static void nfs_writehdr_free(struct nfs_rw_header *whdr)
++static void nfs_writehdr_free(struct nfs_pgio_header *hdr)
+ {
+- mempool_free(whdr, nfs_wdata_mempool);
++ mempool_free(hdr, nfs_wdata_mempool);
+ }
+
+ static void nfs_context_set_write_error(struct nfs_open_context *ctx, int error)
+@@ -216,7 +216,7 @@ static bool nfs_page_group_covers_page(struct nfs_page *req)
+ unsigned int pos = 0;
+ unsigned int len = nfs_page_length(req->wb_page);
+
+- nfs_page_group_lock(req);
++ nfs_page_group_lock(req, false);
+
+ do {
+ tmp = nfs_page_group_search_locked(req->wb_head, pos);
+@@ -379,8 +379,6 @@ nfs_destroy_unlinked_subrequests(struct nfs_page *destroy_list,
+ subreq->wb_head = subreq;
+ subreq->wb_this_page = subreq;
+
+- nfs_clear_request_commit(subreq);
+-
+ /* subreq is now totally disconnected from page group or any
+ * write / commit lists. last chance to wake any waiters */
+ nfs_unlock_request(subreq);
+@@ -455,8 +453,23 @@ try_again:
+ return NULL;
+ }
+
++ /* holding inode lock, so always make a non-blocking call to try the
++ * page group lock */
++ ret = nfs_page_group_lock(head, true);
++ if (ret < 0) {
++ spin_unlock(&inode->i_lock);
++
++ if (!nonblock && ret == -EAGAIN) {
++ nfs_page_group_lock_wait(head);
++ nfs_release_request(head);
++ goto try_again;
++ }
++
++ nfs_release_request(head);
++ return ERR_PTR(ret);
++ }
++
+ /* lock each request in the page group */
+- nfs_page_group_lock(head);
+ subreq = head;
+ do {
+ /*
+@@ -488,7 +501,7 @@ try_again:
+ * Commit list removal accounting is done after locks are dropped */
+ subreq = head;
+ do {
+- nfs_list_remove_request(subreq);
++ nfs_clear_request_commit(subreq);
+ subreq = subreq->wb_this_page;
+ } while (subreq != head);
+
+@@ -518,15 +531,11 @@ try_again:
+
+ nfs_page_group_unlock(head);
+
+- /* drop lock to clear_request_commit the head req and clean up
+- * requests on destroy list */
++ /* drop lock to clean up requests on destroy list */
+ spin_unlock(&inode->i_lock);
+
+ nfs_destroy_unlinked_subrequests(destroy_list, head);
+
+- /* clean up commit list state */
+- nfs_clear_request_commit(head);
+-
+ /* still holds ref on head from nfs_page_find_head_request_locked
+ * and still has lock on head from lock loop */
+ return head;
+@@ -808,6 +817,7 @@ nfs_clear_page_commit(struct page *page)
+ dec_bdi_stat(page_file_mapping(page)->backing_dev_info, BDI_RECLAIMABLE);
+ }
+
++/* Called holding inode (/cinfo) lock */
+ static void
+ nfs_clear_request_commit(struct nfs_page *req)
+ {
+@@ -817,20 +827,18 @@ nfs_clear_request_commit(struct nfs_page *req)
+
+ nfs_init_cinfo_from_inode(&cinfo, inode);
+ if (!pnfs_clear_request_commit(req, &cinfo)) {
+- spin_lock(cinfo.lock);
+ nfs_request_remove_commit_list(req, &cinfo);
+- spin_unlock(cinfo.lock);
+ }
+ nfs_clear_page_commit(req->wb_page);
+ }
+ }
+
+ static inline
+-int nfs_write_need_commit(struct nfs_pgio_data *data)
++int nfs_write_need_commit(struct nfs_pgio_header *hdr)
+ {
+- if (data->verf.committed == NFS_DATA_SYNC)
+- return data->header->lseg == NULL;
+- return data->verf.committed != NFS_FILE_SYNC;
++ if (hdr->writeverf.committed == NFS_DATA_SYNC)
++ return hdr->lseg == NULL;
++ return hdr->writeverf.committed != NFS_FILE_SYNC;
+ }
+
+ #else
+@@ -857,7 +865,7 @@ nfs_clear_request_commit(struct nfs_page *req)
+ }
+
+ static inline
+-int nfs_write_need_commit(struct nfs_pgio_data *data)
++int nfs_write_need_commit(struct nfs_pgio_header *hdr)
+ {
+ return 0;
+ }
+@@ -1038,9 +1046,9 @@ static struct nfs_page *nfs_try_to_update_request(struct inode *inode,
+ else
+ req->wb_bytes = rqend - req->wb_offset;
+ out_unlock:
+- spin_unlock(&inode->i_lock);
+ if (req)
+ nfs_clear_request_commit(req);
++ spin_unlock(&inode->i_lock);
+ return req;
+ out_flushme:
+ spin_unlock(&inode->i_lock);
+@@ -1241,17 +1249,18 @@ static int flush_task_priority(int how)
+ return RPC_PRIORITY_NORMAL;
+ }
+
+-static void nfs_initiate_write(struct nfs_pgio_data *data, struct rpc_message *msg,
++static void nfs_initiate_write(struct nfs_pgio_header *hdr,
++ struct rpc_message *msg,
+ struct rpc_task_setup *task_setup_data, int how)
+ {
+- struct inode *inode = data->header->inode;
++ struct inode *inode = hdr->inode;
+ int priority = flush_task_priority(how);
+
+ task_setup_data->priority = priority;
+- NFS_PROTO(inode)->write_setup(data, msg);
++ NFS_PROTO(inode)->write_setup(hdr, msg);
+
+ nfs4_state_protect_write(NFS_SERVER(inode)->nfs_client,
+- &task_setup_data->rpc_client, msg, data);
++ &task_setup_data->rpc_client, msg, hdr);
+ }
+
+ /* If a nfs_flush_* function fails, it should remove reqs from @head and
+@@ -1313,18 +1322,17 @@ void nfs_commit_prepare(struct rpc_task *task, void *calldata)
+ NFS_PROTO(data->inode)->commit_rpc_prepare(task, data);
+ }
+
+-static void nfs_writeback_release_common(struct nfs_pgio_data *data)
++static void nfs_writeback_release_common(struct nfs_pgio_header *hdr)
+ {
+- struct nfs_pgio_header *hdr = data->header;
+- int status = data->task.tk_status;
++ int status = hdr->task.tk_status;
+
+- if ((status >= 0) && nfs_write_need_commit(data)) {
++ if ((status >= 0) && nfs_write_need_commit(hdr)) {
+ spin_lock(&hdr->lock);
+ if (test_bit(NFS_IOHDR_NEED_RESCHED, &hdr->flags))
+ ; /* Do nothing */
+ else if (!test_and_set_bit(NFS_IOHDR_NEED_COMMIT, &hdr->flags))
+- memcpy(&hdr->verf, &data->verf, sizeof(hdr->verf));
+- else if (memcmp(&hdr->verf, &data->verf, sizeof(hdr->verf)))
++ memcpy(&hdr->verf, &hdr->writeverf, sizeof(hdr->verf));
++ else if (memcmp(&hdr->verf, &hdr->writeverf, sizeof(hdr->verf)))
+ set_bit(NFS_IOHDR_NEED_RESCHED, &hdr->flags);
+ spin_unlock(&hdr->lock);
+ }
+@@ -1358,7 +1366,8 @@ static int nfs_should_remove_suid(const struct inode *inode)
+ /*
+ * This function is called when the WRITE call is complete.
+ */
+-static int nfs_writeback_done(struct rpc_task *task, struct nfs_pgio_data *data,
++static int nfs_writeback_done(struct rpc_task *task,
++ struct nfs_pgio_header *hdr,
+ struct inode *inode)
+ {
+ int status;
+@@ -1370,13 +1379,14 @@ static int nfs_writeback_done(struct rpc_task *task, struct nfs_pgio_data *data,
+ * another writer had changed the file, but some applications
+ * depend on tighter cache coherency when writing.
+ */
+- status = NFS_PROTO(inode)->write_done(task, data);
++ status = NFS_PROTO(inode)->write_done(task, hdr);
+ if (status != 0)
+ return status;
+- nfs_add_stats(inode, NFSIOS_SERVERWRITTENBYTES, data->res.count);
++ nfs_add_stats(inode, NFSIOS_SERVERWRITTENBYTES, hdr->res.count);
+
+ #if IS_ENABLED(CONFIG_NFS_V3) || IS_ENABLED(CONFIG_NFS_V4)
+- if (data->res.verf->committed < data->args.stable && task->tk_status >= 0) {
++ if (hdr->res.verf->committed < hdr->args.stable &&
++ task->tk_status >= 0) {
+ /* We tried a write call, but the server did not
+ * commit data to stable storage even though we
+ * requested it.
+@@ -1392,7 +1402,7 @@ static int nfs_writeback_done(struct rpc_task *task, struct nfs_pgio_data *data,
+ dprintk("NFS: faulty NFS server %s:"
+ " (committed = %d) != (stable = %d)\n",
+ NFS_SERVER(inode)->nfs_client->cl_hostname,
+- data->res.verf->committed, data->args.stable);
++ hdr->res.verf->committed, hdr->args.stable);
+ complain = jiffies + 300 * HZ;
+ }
+ }
+@@ -1407,16 +1417,17 @@ static int nfs_writeback_done(struct rpc_task *task, struct nfs_pgio_data *data,
+ /*
+ * This function is called when the WRITE call is complete.
+ */
+-static void nfs_writeback_result(struct rpc_task *task, struct nfs_pgio_data *data)
++static void nfs_writeback_result(struct rpc_task *task,
++ struct nfs_pgio_header *hdr)
+ {
+- struct nfs_pgio_args *argp = &data->args;
+- struct nfs_pgio_res *resp = &data->res;
++ struct nfs_pgio_args *argp = &hdr->args;
++ struct nfs_pgio_res *resp = &hdr->res;
+
+ if (resp->count < argp->count) {
+ static unsigned long complain;
+
+ /* This a short write! */
+- nfs_inc_stats(data->header->inode, NFSIOS_SHORTWRITE);
++ nfs_inc_stats(hdr->inode, NFSIOS_SHORTWRITE);
+
+ /* Has the server at least made some progress? */
+ if (resp->count == 0) {
+@@ -1426,14 +1437,14 @@ static void nfs_writeback_result(struct rpc_task *task, struct nfs_pgio_data *da
+ argp->count);
+ complain = jiffies + 300 * HZ;
+ }
+- nfs_set_pgio_error(data->header, -EIO, argp->offset);
++ nfs_set_pgio_error(hdr, -EIO, argp->offset);
+ task->tk_status = -EIO;
+ return;
+ }
+ /* Was this an NFSv2 write or an NFSv3 stable write? */
+ if (resp->verf->committed != NFS_UNSTABLE) {
+ /* Resend from where the server left off */
+- data->mds_offset += resp->count;
++ hdr->mds_offset += resp->count;
+ argp->offset += resp->count;
+ argp->pgbase += resp->count;
+ argp->count -= resp->count;
+@@ -1884,7 +1895,7 @@ int nfs_migrate_page(struct address_space *mapping, struct page *newpage,
+ int __init nfs_init_writepagecache(void)
+ {
+ nfs_wdata_cachep = kmem_cache_create("nfs_write_data",
+- sizeof(struct nfs_rw_header),
++ sizeof(struct nfs_pgio_header),
+ 0, SLAB_HWCACHE_ALIGN,
+ NULL);
+ if (nfs_wdata_cachep == NULL)
+diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
+index 944275c8f56d..1d5103dfc203 100644
+--- a/fs/nfsd/nfs4xdr.c
++++ b/fs/nfsd/nfs4xdr.c
+@@ -2662,6 +2662,7 @@ nfsd4_encode_dirent(void *ccdv, const char *name, int namlen,
+ struct xdr_stream *xdr = cd->xdr;
+ int start_offset = xdr->buf->len;
+ int cookie_offset;
++ u32 name_and_cookie;
+ int entry_bytes;
+ __be32 nfserr = nfserr_toosmall;
+ __be64 wire_offset;
+@@ -2723,7 +2724,14 @@ nfsd4_encode_dirent(void *ccdv, const char *name, int namlen,
+ cd->rd_maxcount -= entry_bytes;
+ if (!cd->rd_dircount)
+ goto fail;
+- cd->rd_dircount--;
++ /*
++ * RFC 3530 14.2.24 describes rd_dircount as only a "hint", so
++ * let's always let through the first entry, at least:
++ */
++ name_and_cookie = 4 * XDR_QUADLEN(namlen) + 8;
++ if (name_and_cookie > cd->rd_dircount && cd->cookie_offset)
++ goto fail;
++ cd->rd_dircount -= min(cd->rd_dircount, name_and_cookie);
+ cd->cookie_offset = cookie_offset;
+ skip_entry:
+ cd->common.err = nfs_ok;
+@@ -3104,7 +3112,8 @@ static __be32 nfsd4_encode_splice_read(
+
+ buf->page_len = maxcount;
+ buf->len += maxcount;
+- xdr->page_ptr += (maxcount + PAGE_SIZE - 1) / PAGE_SIZE;
++ xdr->page_ptr += (buf->page_base + maxcount + PAGE_SIZE - 1)
++ / PAGE_SIZE;
+
+ /* Use rest of head for padding and remaining ops: */
+ buf->tail[0].iov_base = xdr->p;
+@@ -3333,6 +3342,10 @@ nfsd4_encode_readdir(struct nfsd4_compoundres *resp, __be32 nfserr, struct nfsd4
+ }
+ maxcount = min_t(int, maxcount-16, bytes_left);
+
++ /* RFC 3530 14.2.24 allows us to ignore dircount when it's 0: */
++ if (!readdir->rd_dircount)
++ readdir->rd_dircount = INT_MAX;
++
+ readdir->xdr = xdr;
+ readdir->rd_maxcount = maxcount;
+ readdir->common.err = 0;
+diff --git a/fs/nilfs2/inode.c b/fs/nilfs2/inode.c
+index 6252b173a465..d071e7f23de2 100644
+--- a/fs/nilfs2/inode.c
++++ b/fs/nilfs2/inode.c
+@@ -24,6 +24,7 @@
+ #include <linux/buffer_head.h>
+ #include <linux/gfp.h>
+ #include <linux/mpage.h>
++#include <linux/pagemap.h>
+ #include <linux/writeback.h>
+ #include <linux/aio.h>
+ #include "nilfs.h"
+@@ -219,10 +220,10 @@ static int nilfs_writepage(struct page *page, struct writeback_control *wbc)
+
+ static int nilfs_set_page_dirty(struct page *page)
+ {
++ struct inode *inode = page->mapping->host;
+ int ret = __set_page_dirty_nobuffers(page);
+
+ if (page_has_buffers(page)) {
+- struct inode *inode = page->mapping->host;
+ unsigned nr_dirty = 0;
+ struct buffer_head *bh, *head;
+
+@@ -245,6 +246,10 @@ static int nilfs_set_page_dirty(struct page *page)
+
+ if (nr_dirty)
+ nilfs_set_file_dirty(inode, nr_dirty);
++ } else if (ret) {
++ unsigned nr_dirty = 1 << (PAGE_CACHE_SHIFT - inode->i_blkbits);
++
++ nilfs_set_file_dirty(inode, nr_dirty);
+ }
+ return ret;
+ }
+diff --git a/fs/notify/fdinfo.c b/fs/notify/fdinfo.c
+index 238a5930cb3c..9d7e2b9659cb 100644
+--- a/fs/notify/fdinfo.c
++++ b/fs/notify/fdinfo.c
+@@ -42,7 +42,7 @@ static int show_mark_fhandle(struct seq_file *m, struct inode *inode)
+ {
+ struct {
+ struct file_handle handle;
+- u8 pad[64];
++ u8 pad[MAX_HANDLE_SZ];
+ } f;
+ int size, ret, i;
+
+@@ -50,7 +50,7 @@ static int show_mark_fhandle(struct seq_file *m, struct inode *inode)
+ size = f.handle.handle_bytes >> 2;
+
+ ret = exportfs_encode_inode_fh(inode, (struct fid *)f.handle.f_handle, &size, 0);
+- if ((ret == 255) || (ret == -ENOSPC)) {
++ if ((ret == FILEID_INVALID) || (ret < 0)) {
+ WARN_ONCE(1, "Can't encode file handler for inotify: %d\n", ret);
+ return 0;
+ }
+diff --git a/fs/ocfs2/dlm/dlmmaster.c b/fs/ocfs2/dlm/dlmmaster.c
+index 82abf0cc9a12..9d405d6d2504 100644
+--- a/fs/ocfs2/dlm/dlmmaster.c
++++ b/fs/ocfs2/dlm/dlmmaster.c
+@@ -655,12 +655,9 @@ void dlm_lockres_clear_refmap_bit(struct dlm_ctxt *dlm,
+ clear_bit(bit, res->refmap);
+ }
+
+-
+-void dlm_lockres_grab_inflight_ref(struct dlm_ctxt *dlm,
++static void __dlm_lockres_grab_inflight_ref(struct dlm_ctxt *dlm,
+ struct dlm_lock_resource *res)
+ {
+- assert_spin_locked(&res->spinlock);
+-
+ res->inflight_locks++;
+
+ mlog(0, "%s: res %.*s, inflight++: now %u, %ps()\n", dlm->name,
+@@ -668,6 +665,13 @@ void dlm_lockres_grab_inflight_ref(struct dlm_ctxt *dlm,
+ __builtin_return_address(0));
+ }
+
++void dlm_lockres_grab_inflight_ref(struct dlm_ctxt *dlm,
++ struct dlm_lock_resource *res)
++{
++ assert_spin_locked(&res->spinlock);
++ __dlm_lockres_grab_inflight_ref(dlm, res);
++}
++
+ void dlm_lockres_drop_inflight_ref(struct dlm_ctxt *dlm,
+ struct dlm_lock_resource *res)
+ {
+@@ -894,10 +898,8 @@ lookup:
+ /* finally add the lockres to its hash bucket */
+ __dlm_insert_lockres(dlm, res);
+
+- /* Grab inflight ref to pin the resource */
+- spin_lock(&res->spinlock);
+- dlm_lockres_grab_inflight_ref(dlm, res);
+- spin_unlock(&res->spinlock);
++ /* since this lockres is new it does not require the spinlock */
++ __dlm_lockres_grab_inflight_ref(dlm, res);
+
+ /* get an extra ref on the mle in case this is a BLOCK
+ * if so, the creator of the BLOCK may try to put the last
+diff --git a/fs/ufs/inode.c b/fs/ufs/inode.c
+index 61e8a9b021dd..42234a871b22 100644
+--- a/fs/ufs/inode.c
++++ b/fs/ufs/inode.c
+@@ -902,9 +902,6 @@ void ufs_evict_inode(struct inode * inode)
+ invalidate_inode_buffers(inode);
+ clear_inode(inode);
+
+- if (want_delete) {
+- lock_ufs(inode->i_sb);
+- ufs_free_inode (inode);
+- unlock_ufs(inode->i_sb);
+- }
++ if (want_delete)
++ ufs_free_inode(inode);
+ }
+diff --git a/fs/ufs/namei.c b/fs/ufs/namei.c
+index 90d74b8f8eba..2df62a73f20c 100644
+--- a/fs/ufs/namei.c
++++ b/fs/ufs/namei.c
+@@ -126,12 +126,12 @@ static int ufs_symlink (struct inode * dir, struct dentry * dentry,
+ if (l > sb->s_blocksize)
+ goto out_notlocked;
+
+- lock_ufs(dir->i_sb);
+ inode = ufs_new_inode(dir, S_IFLNK | S_IRWXUGO);
+ err = PTR_ERR(inode);
+ if (IS_ERR(inode))
+- goto out;
++ goto out_notlocked;
+
++ lock_ufs(dir->i_sb);
+ if (l > UFS_SB(sb)->s_uspi->s_maxsymlinklen) {
+ /* slow symlink */
+ inode->i_op = &ufs_symlink_inode_operations;
+@@ -181,13 +181,9 @@ static int ufs_mkdir(struct inode * dir, struct dentry * dentry, umode_t mode)
+ struct inode * inode;
+ int err;
+
+- lock_ufs(dir->i_sb);
+- inode_inc_link_count(dir);
+-
+ inode = ufs_new_inode(dir, S_IFDIR|mode);
+- err = PTR_ERR(inode);
+ if (IS_ERR(inode))
+- goto out_dir;
++ return PTR_ERR(inode);
+
+ inode->i_op = &ufs_dir_inode_operations;
+ inode->i_fop = &ufs_dir_operations;
+@@ -195,6 +191,9 @@ static int ufs_mkdir(struct inode * dir, struct dentry * dentry, umode_t mode)
+
+ inode_inc_link_count(inode);
+
++ lock_ufs(dir->i_sb);
++ inode_inc_link_count(dir);
++
+ err = ufs_make_empty(inode, dir);
+ if (err)
+ goto out_fail;
+@@ -212,7 +211,6 @@ out_fail:
+ inode_dec_link_count(inode);
+ inode_dec_link_count(inode);
+ iput (inode);
+-out_dir:
+ inode_dec_link_count(dir);
+ unlock_ufs(dir->i_sb);
+ goto out;
+diff --git a/include/acpi/acpi_bus.h b/include/acpi/acpi_bus.h
+index 0826a4407e8e..d07aa9b7fb99 100644
+--- a/include/acpi/acpi_bus.h
++++ b/include/acpi/acpi_bus.h
+@@ -118,6 +118,7 @@ struct acpi_device;
+ struct acpi_hotplug_profile {
+ struct kobject kobj;
+ int (*scan_dependent)(struct acpi_device *adev);
++ void (*notify_online)(struct acpi_device *adev);
+ bool enabled:1;
+ bool demand_offline:1;
+ };
+diff --git a/include/drm/ttm/ttm_bo_driver.h b/include/drm/ttm/ttm_bo_driver.h
+index a5183da3ef92..f2fcd3ed5676 100644
+--- a/include/drm/ttm/ttm_bo_driver.h
++++ b/include/drm/ttm/ttm_bo_driver.h
+@@ -182,6 +182,7 @@ struct ttm_mem_type_manager_func {
+ * @man: Pointer to a memory type manager.
+ * @bo: Pointer to the buffer object we're allocating space for.
+ * @placement: Placement details.
++ * @flags: Additional placement flags.
+ * @mem: Pointer to a struct ttm_mem_reg to be filled in.
+ *
+ * This function should allocate space in the memory type managed
+@@ -206,6 +207,7 @@ struct ttm_mem_type_manager_func {
+ int (*get_node)(struct ttm_mem_type_manager *man,
+ struct ttm_buffer_object *bo,
+ struct ttm_placement *placement,
++ uint32_t flags,
+ struct ttm_mem_reg *mem);
+
+ /**
+diff --git a/include/linux/ccp.h b/include/linux/ccp.h
+index ebcc9d146219..7f437036baa4 100644
+--- a/include/linux/ccp.h
++++ b/include/linux/ccp.h
+@@ -27,6 +27,13 @@ struct ccp_cmd;
+ defined(CONFIG_CRYPTO_DEV_CCP_DD_MODULE)
+
+ /**
++ * ccp_present - check if a CCP device is present
++ *
++ * Returns zero if a CCP device is present, -ENODEV otherwise.
++ */
++int ccp_present(void);
++
++/**
+ * ccp_enqueue_cmd - queue an operation for processing by the CCP
+ *
+ * @cmd: ccp_cmd struct to be processed
+@@ -53,6 +60,11 @@ int ccp_enqueue_cmd(struct ccp_cmd *cmd);
+
+ #else /* CONFIG_CRYPTO_DEV_CCP_DD is not enabled */
+
++static inline int ccp_present(void)
++{
++ return -ENODEV;
++}
++
+ static inline int ccp_enqueue_cmd(struct ccp_cmd *cmd)
+ {
+ return -ENODEV;
+diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
+index 404a686a3644..721de254ba7a 100644
+--- a/include/linux/ftrace.h
++++ b/include/linux/ftrace.h
+@@ -103,6 +103,15 @@ enum {
+ FTRACE_OPS_FL_DELETED = 1 << 8,
+ };
+
++#ifdef CONFIG_DYNAMIC_FTRACE
++/* The hash used to know what functions callbacks trace */
++struct ftrace_ops_hash {
++ struct ftrace_hash *notrace_hash;
++ struct ftrace_hash *filter_hash;
++ struct mutex regex_lock;
++};
++#endif
++
+ /*
+ * Note, ftrace_ops can be referenced outside of RCU protection.
+ * (Although, for perf, the control ops prevent that). If ftrace_ops is
+@@ -121,8 +130,8 @@ struct ftrace_ops {
+ int __percpu *disabled;
+ void *private;
+ #ifdef CONFIG_DYNAMIC_FTRACE
+- struct ftrace_hash *notrace_hash;
+- struct ftrace_hash *filter_hash;
++ struct ftrace_ops_hash local_hash;
++ struct ftrace_ops_hash *func_hash;
+ struct mutex regex_lock;
+ #endif
+ };
+diff --git a/include/linux/iio/trigger.h b/include/linux/iio/trigger.h
+index 369cf2cd5144..68f46cd5d514 100644
+--- a/include/linux/iio/trigger.h
++++ b/include/linux/iio/trigger.h
+@@ -84,10 +84,12 @@ static inline void iio_trigger_put(struct iio_trigger *trig)
+ put_device(&trig->dev);
+ }
+
+-static inline void iio_trigger_get(struct iio_trigger *trig)
++static inline struct iio_trigger *iio_trigger_get(struct iio_trigger *trig)
+ {
+ get_device(&trig->dev);
+ __module_get(trig->ops->owner);
++
++ return trig;
+ }
+
+ /**
+diff --git a/include/linux/nfs_page.h b/include/linux/nfs_page.h
+index 7d9096d95d4a..55a486421fdd 100644
+--- a/include/linux/nfs_page.h
++++ b/include/linux/nfs_page.h
+@@ -62,12 +62,13 @@ struct nfs_pageio_ops {
+
+ struct nfs_rw_ops {
+ const fmode_t rw_mode;
+- struct nfs_rw_header *(*rw_alloc_header)(void);
+- void (*rw_free_header)(struct nfs_rw_header *);
+- void (*rw_release)(struct nfs_pgio_data *);
+- int (*rw_done)(struct rpc_task *, struct nfs_pgio_data *, struct inode *);
+- void (*rw_result)(struct rpc_task *, struct nfs_pgio_data *);
+- void (*rw_initiate)(struct nfs_pgio_data *, struct rpc_message *,
++ struct nfs_pgio_header *(*rw_alloc_header)(void);
++ void (*rw_free_header)(struct nfs_pgio_header *);
++ void (*rw_release)(struct nfs_pgio_header *);
++ int (*rw_done)(struct rpc_task *, struct nfs_pgio_header *,
++ struct inode *);
++ void (*rw_result)(struct rpc_task *, struct nfs_pgio_header *);
++ void (*rw_initiate)(struct nfs_pgio_header *, struct rpc_message *,
+ struct rpc_task_setup *, int);
+ };
+
+@@ -119,7 +120,8 @@ extern size_t nfs_generic_pg_test(struct nfs_pageio_descriptor *desc,
+ extern int nfs_wait_on_request(struct nfs_page *);
+ extern void nfs_unlock_request(struct nfs_page *req);
+ extern void nfs_unlock_and_release_request(struct nfs_page *);
+-extern void nfs_page_group_lock(struct nfs_page *);
++extern int nfs_page_group_lock(struct nfs_page *, bool);
++extern void nfs_page_group_lock_wait(struct nfs_page *);
+ extern void nfs_page_group_unlock(struct nfs_page *);
+ extern bool nfs_page_group_sync_on_bit(struct nfs_page *, unsigned int);
+
+diff --git a/include/linux/nfs_xdr.h b/include/linux/nfs_xdr.h
+index 9a1396e70310..2c35d524ffc6 100644
+--- a/include/linux/nfs_xdr.h
++++ b/include/linux/nfs_xdr.h
+@@ -1257,14 +1257,10 @@ enum {
+ NFS_IOHDR_NEED_RESCHED,
+ };
+
+-struct nfs_pgio_data;
+-
+ struct nfs_pgio_header {
+ struct inode *inode;
+ struct rpc_cred *cred;
+ struct list_head pages;
+- struct nfs_pgio_data *data;
+- atomic_t refcnt;
+ struct nfs_page *req;
+ struct nfs_writeverf verf; /* Used for writes */
+ struct pnfs_layout_segment *lseg;
+@@ -1281,28 +1277,23 @@ struct nfs_pgio_header {
+ int error; /* merge with pnfs_error */
+ unsigned long good_bytes; /* boundary of good data */
+ unsigned long flags;
+-};
+
+-struct nfs_pgio_data {
+- struct nfs_pgio_header *header;
++ /*
++ * rpc data
++ */
+ struct rpc_task task;
+ struct nfs_fattr fattr;
+- struct nfs_writeverf verf; /* Used for writes */
++ struct nfs_writeverf writeverf; /* Used for writes */
+ struct nfs_pgio_args args; /* argument struct */
+ struct nfs_pgio_res res; /* result struct */
+ unsigned long timestamp; /* For lease renewal */
+- int (*pgio_done_cb) (struct rpc_task *task, struct nfs_pgio_data *data);
++ int (*pgio_done_cb)(struct rpc_task *, struct nfs_pgio_header *);
+ __u64 mds_offset; /* Filelayout dense stripe */
+- struct nfs_page_array pages;
++ struct nfs_page_array page_array;
+ struct nfs_client *ds_clp; /* pNFS data server */
+ int ds_idx; /* ds index if ds_clp is set */
+ };
+
+-struct nfs_rw_header {
+- struct nfs_pgio_header header;
+- struct nfs_pgio_data rpc_data;
+-};
+-
+ struct nfs_mds_commit_info {
+ atomic_t rpcs_out;
+ unsigned long ncommit;
+@@ -1432,11 +1423,12 @@ struct nfs_rpc_ops {
+ struct nfs_pathconf *);
+ int (*set_capabilities)(struct nfs_server *, struct nfs_fh *);
+ int (*decode_dirent)(struct xdr_stream *, struct nfs_entry *, int);
+- int (*pgio_rpc_prepare)(struct rpc_task *, struct nfs_pgio_data *);
+- void (*read_setup) (struct nfs_pgio_data *, struct rpc_message *);
+- int (*read_done) (struct rpc_task *, struct nfs_pgio_data *);
+- void (*write_setup) (struct nfs_pgio_data *, struct rpc_message *);
+- int (*write_done) (struct rpc_task *, struct nfs_pgio_data *);
++ int (*pgio_rpc_prepare)(struct rpc_task *,
++ struct nfs_pgio_header *);
++ void (*read_setup)(struct nfs_pgio_header *, struct rpc_message *);
++ int (*read_done)(struct rpc_task *, struct nfs_pgio_header *);
++ void (*write_setup)(struct nfs_pgio_header *, struct rpc_message *);
++ int (*write_done)(struct rpc_task *, struct nfs_pgio_header *);
+ void (*commit_setup) (struct nfs_commit_data *, struct rpc_message *);
+ void (*commit_rpc_prepare)(struct rpc_task *, struct nfs_commit_data *);
+ int (*commit_done) (struct rpc_task *, struct nfs_commit_data *);
+diff --git a/include/linux/pci.h b/include/linux/pci.h
+index 466bcd111d85..97fe7ebf2e25 100644
+--- a/include/linux/pci.h
++++ b/include/linux/pci.h
+@@ -303,6 +303,7 @@ struct pci_dev {
+ D3cold, not set for devices
+ powered on/off by the
+ corresponding bridge */
++ unsigned int ignore_hotplug:1; /* Ignore hotplug events */
+ unsigned int d3_delay; /* D3->D0 transition time in ms */
+ unsigned int d3cold_delay; /* D3cold->D0 transition time in ms */
+
+@@ -1019,6 +1020,11 @@ bool pci_dev_run_wake(struct pci_dev *dev);
+ bool pci_check_pme_status(struct pci_dev *dev);
+ void pci_pme_wakeup_bus(struct pci_bus *bus);
+
++static inline void pci_ignore_hotplug(struct pci_dev *dev)
++{
++ dev->ignore_hotplug = 1;
++}
++
+ static inline int pci_enable_wake(struct pci_dev *dev, pci_power_t state,
+ bool enable)
+ {
+diff --git a/include/linux/seqlock.h b/include/linux/seqlock.h
+index 535f158977b9..8cf350325dc6 100644
+--- a/include/linux/seqlock.h
++++ b/include/linux/seqlock.h
+@@ -164,8 +164,6 @@ static inline unsigned read_seqcount_begin(const seqcount_t *s)
+ static inline unsigned raw_seqcount_begin(const seqcount_t *s)
+ {
+ unsigned ret = ACCESS_ONCE(s->sequence);
+-
+- seqcount_lockdep_reader_access(s);
+ smp_rmb();
+ return ret & ~1;
+ }
+diff --git a/include/linux/vga_switcheroo.h b/include/linux/vga_switcheroo.h
+index 502073a53dd3..b483abd34493 100644
+--- a/include/linux/vga_switcheroo.h
++++ b/include/linux/vga_switcheroo.h
+@@ -64,6 +64,7 @@ int vga_switcheroo_get_client_state(struct pci_dev *dev);
+ void vga_switcheroo_set_dynamic_switch(struct pci_dev *pdev, enum vga_switcheroo_state dynamic);
+
+ int vga_switcheroo_init_domain_pm_ops(struct device *dev, struct dev_pm_domain *domain);
++void vga_switcheroo_fini_domain_pm_ops(struct device *dev);
+ int vga_switcheroo_init_domain_pm_optimus_hdmi_audio(struct device *dev, struct dev_pm_domain *domain);
+ #else
+
+@@ -82,6 +83,7 @@ static inline int vga_switcheroo_get_client_state(struct pci_dev *dev) { return
+ static inline void vga_switcheroo_set_dynamic_switch(struct pci_dev *pdev, enum vga_switcheroo_state dynamic) {}
+
+ static inline int vga_switcheroo_init_domain_pm_ops(struct device *dev, struct dev_pm_domain *domain) { return -EINVAL; }
++static inline void vga_switcheroo_fini_domain_pm_ops(struct device *dev) {}
+ static inline int vga_switcheroo_init_domain_pm_optimus_hdmi_audio(struct device *dev, struct dev_pm_domain *domain) { return -EINVAL; }
+
+ #endif
+diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
+index a0cc2e95ed1b..b996e6cde6bb 100644
+--- a/include/linux/workqueue.h
++++ b/include/linux/workqueue.h
+@@ -419,7 +419,7 @@ __alloc_workqueue_key(const char *fmt, unsigned int flags, int max_active,
+ alloc_workqueue("%s", WQ_FREEZABLE | WQ_UNBOUND | WQ_MEM_RECLAIM, \
+ 1, (name))
+ #define create_singlethread_workqueue(name) \
+- alloc_workqueue("%s", WQ_UNBOUND | WQ_MEM_RECLAIM, 1, (name))
++ alloc_ordered_workqueue("%s", WQ_MEM_RECLAIM, name)
+
+ extern void destroy_workqueue(struct workqueue_struct *wq);
+
+diff --git a/include/net/regulatory.h b/include/net/regulatory.h
+index 259992444e80..dad7ab20a8cb 100644
+--- a/include/net/regulatory.h
++++ b/include/net/regulatory.h
+@@ -167,7 +167,7 @@ struct ieee80211_reg_rule {
+ struct ieee80211_regdomain {
+ struct rcu_head rcu_head;
+ u32 n_reg_rules;
+- char alpha2[2];
++ char alpha2[3];
+ enum nl80211_dfs_regions dfs_region;
+ struct ieee80211_reg_rule reg_rules[];
+ };
+diff --git a/include/uapi/drm/radeon_drm.h b/include/uapi/drm/radeon_drm.h
+index 1cc0b610f162..79719f940ea4 100644
+--- a/include/uapi/drm/radeon_drm.h
++++ b/include/uapi/drm/radeon_drm.h
+@@ -942,6 +942,7 @@ struct drm_radeon_cs_chunk {
+ };
+
+ /* drm_radeon_cs_reloc.flags */
++#define RADEON_RELOC_PRIO_MASK (0xf << 0)
+
+ struct drm_radeon_cs_reloc {
+ uint32_t handle;
+diff --git a/include/uapi/linux/xattr.h b/include/uapi/linux/xattr.h
+index c38355c1f3c9..1590c49cae57 100644
+--- a/include/uapi/linux/xattr.h
++++ b/include/uapi/linux/xattr.h
+@@ -13,7 +13,7 @@
+ #ifndef _UAPI_LINUX_XATTR_H
+ #define _UAPI_LINUX_XATTR_H
+
+-#ifdef __UAPI_DEF_XATTR
++#if __UAPI_DEF_XATTR
+ #define __USE_KERNEL_XATTR_DEFS
+
+ #define XATTR_CREATE 0x1 /* set value, fail if attr already exists */
+diff --git a/kernel/cgroup.c b/kernel/cgroup.c
+index 70776aec2562..0a46b2aa9dfb 100644
+--- a/kernel/cgroup.c
++++ b/kernel/cgroup.c
+@@ -1031,6 +1031,11 @@ static void cgroup_get(struct cgroup *cgrp)
+ css_get(&cgrp->self);
+ }
+
++static bool cgroup_tryget(struct cgroup *cgrp)
++{
++ return css_tryget(&cgrp->self);
++}
++
+ static void cgroup_put(struct cgroup *cgrp)
+ {
+ css_put(&cgrp->self);
+@@ -1091,7 +1096,8 @@ static struct cgroup *cgroup_kn_lock_live(struct kernfs_node *kn)
+ * protection against removal. Ensure @cgrp stays accessible and
+ * break the active_ref protection.
+ */
+- cgroup_get(cgrp);
++ if (!cgroup_tryget(cgrp))
++ return NULL;
+ kernfs_break_active_protection(kn);
+
+ mutex_lock(&cgroup_mutex);
+@@ -3827,7 +3833,6 @@ static int pidlist_array_load(struct cgroup *cgrp, enum cgroup_filetype type,
+
+ l = cgroup_pidlist_find_create(cgrp, type);
+ if (!l) {
+- mutex_unlock(&cgrp->pidlist_mutex);
+ pidlist_free(array);
+ return -ENOMEM;
+ }
+@@ -4236,6 +4241,15 @@ static void css_release_work_fn(struct work_struct *work)
+ /* cgroup release path */
+ cgroup_idr_remove(&cgrp->root->cgroup_idr, cgrp->id);
+ cgrp->id = -1;
++
++ /*
++ * There are two control paths which try to determine
++ * cgroup from dentry without going through kernfs -
++ * cgroupstats_build() and css_tryget_online_from_dir().
++ * Those are supported by RCU protecting clearing of
++ * cgrp->kn->priv backpointer.
++ */
++ RCU_INIT_POINTER(*(void __rcu __force **)&cgrp->kn->priv, NULL);
+ }
+
+ mutex_unlock(&cgroup_mutex);
+@@ -4387,6 +4401,11 @@ static int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name,
+ struct kernfs_node *kn;
+ int ssid, ret;
+
++ /* Do not accept '\n' to prevent making /proc/<pid>/cgroup unparsable.
++ */
++ if (strchr(name, '\n'))
++ return -EINVAL;
++
+ parent = cgroup_kn_lock_live(parent_kn);
+ if (!parent)
+ return -ENODEV;
+@@ -4656,16 +4675,6 @@ static int cgroup_rmdir(struct kernfs_node *kn)
+
+ cgroup_kn_unlock(kn);
+
+- /*
+- * There are two control paths which try to determine cgroup from
+- * dentry without going through kernfs - cgroupstats_build() and
+- * css_tryget_online_from_dir(). Those are supported by RCU
+- * protecting clearing of cgrp->kn->priv backpointer, which should
+- * happen after all files under it have been removed.
+- */
+- if (!ret)
+- RCU_INIT_POINTER(*(void __rcu __force **)&kn->priv, NULL);
+-
+ cgroup_put(cgrp);
+ return ret;
+ }
+@@ -5231,7 +5240,7 @@ struct cgroup_subsys_state *css_tryget_online_from_dir(struct dentry *dentry,
+ /*
+ * This path doesn't originate from kernfs and @kn could already
+ * have been or be removed at any point. @kn->priv is RCU
+- * protected for this access. See cgroup_rmdir() for details.
++ * protected for this access. See css_release_work_fn() for details.
+ */
+ cgrp = rcu_dereference(kn->priv);
+ if (cgrp)
+diff --git a/kernel/events/core.c b/kernel/events/core.c
+index 6b17ac1b0c2a..f626c9f1f3c0 100644
+--- a/kernel/events/core.c
++++ b/kernel/events/core.c
+@@ -1523,6 +1523,11 @@ retry:
+ */
+ if (ctx->is_active) {
+ raw_spin_unlock_irq(&ctx->lock);
++ /*
++ * Reload the task pointer, it might have been changed by
++ * a concurrent perf_event_context_sched_out().
++ */
++ task = ctx->task;
+ goto retry;
+ }
+
+@@ -1966,6 +1971,11 @@ retry:
+ */
+ if (ctx->is_active) {
+ raw_spin_unlock_irq(&ctx->lock);
++ /*
++ * Reload the task pointer, it might have been changed by
++ * a concurrent perf_event_context_sched_out().
++ */
++ task = ctx->task;
+ goto retry;
+ }
+
+diff --git a/kernel/futex.c b/kernel/futex.c
+index b632b5f3f094..c20fb395a672 100644
+--- a/kernel/futex.c
++++ b/kernel/futex.c
+@@ -2628,6 +2628,7 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
+ * shared futexes. We need to compare the keys:
+ */
+ if (match_futex(&q.key, &key2)) {
++ queue_unlock(hb);
+ ret = -EINVAL;
+ goto out_put_keys;
+ }
+diff --git a/kernel/kcmp.c b/kernel/kcmp.c
+index e30ac0fe61c3..0aa69ea1d8fd 100644
+--- a/kernel/kcmp.c
++++ b/kernel/kcmp.c
+@@ -44,11 +44,12 @@ static long kptr_obfuscate(long v, int type)
+ */
+ static int kcmp_ptr(void *v1, void *v2, enum kcmp_type type)
+ {
+- long ret;
++ long t1, t2;
+
+- ret = kptr_obfuscate((long)v1, type) - kptr_obfuscate((long)v2, type);
++ t1 = kptr_obfuscate((long)v1, type);
++ t2 = kptr_obfuscate((long)v2, type);
+
+- return (ret < 0) | ((ret > 0) << 1);
++ return (t1 < t2) | ((t1 > t2) << 1);
+ }
+
+ /* The caller must have pinned the task */
+diff --git a/kernel/module.c b/kernel/module.c
+index 81e727cf6df9..673aeb0c25dc 100644
+--- a/kernel/module.c
++++ b/kernel/module.c
+@@ -3308,6 +3308,11 @@ static int load_module(struct load_info *info, const char __user *uargs,
+ mutex_lock(&module_mutex);
+ module_bug_cleanup(mod);
+ mutex_unlock(&module_mutex);
++
++ /* we can't deallocate the module until we clear memory protection */
++ unset_module_init_ro_nx(mod);
++ unset_module_core_ro_nx(mod);
++
+ ddebug_cleanup:
+ dynamic_debug_remove(info->debug);
+ synchronize_sched();
+diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
+index 13e839dbca07..971285d5b7a0 100644
+--- a/kernel/printk/printk.c
++++ b/kernel/printk/printk.c
+@@ -1617,15 +1617,15 @@ asmlinkage int vprintk_emit(int facility, int level,
+ raw_spin_lock(&logbuf_lock);
+ logbuf_cpu = this_cpu;
+
+- if (recursion_bug) {
++ if (unlikely(recursion_bug)) {
+ static const char recursion_msg[] =
+ "BUG: recent printk recursion!";
+
+ recursion_bug = 0;
+- text_len = strlen(recursion_msg);
+ /* emit KERN_CRIT message */
+ printed_len += log_store(0, 2, LOG_PREFIX|LOG_NEWLINE, 0,
+- NULL, 0, recursion_msg, text_len);
++ NULL, 0, recursion_msg,
++ strlen(recursion_msg));
+ }
+
+ /*
+diff --git a/kernel/time/alarmtimer.c b/kernel/time/alarmtimer.c
+index fe75444ae7ec..cd45a0727a16 100644
+--- a/kernel/time/alarmtimer.c
++++ b/kernel/time/alarmtimer.c
+@@ -464,18 +464,26 @@ static enum alarmtimer_type clock2alarm(clockid_t clockid)
+ static enum alarmtimer_restart alarm_handle_timer(struct alarm *alarm,
+ ktime_t now)
+ {
++ unsigned long flags;
+ struct k_itimer *ptr = container_of(alarm, struct k_itimer,
+ it.alarm.alarmtimer);
+- if (posix_timer_event(ptr, 0) != 0)
+- ptr->it_overrun++;
++ enum alarmtimer_restart result = ALARMTIMER_NORESTART;
++
++ spin_lock_irqsave(&ptr->it_lock, flags);
++ if ((ptr->it_sigev_notify & ~SIGEV_THREAD_ID) != SIGEV_NONE) {
++ if (posix_timer_event(ptr, 0) != 0)
++ ptr->it_overrun++;
++ }
+
+ /* Re-add periodic timers */
+ if (ptr->it.alarm.interval.tv64) {
+ ptr->it_overrun += alarm_forward(alarm, now,
+ ptr->it.alarm.interval);
+- return ALARMTIMER_RESTART;
++ result = ALARMTIMER_RESTART;
+ }
+- return ALARMTIMER_NORESTART;
++ spin_unlock_irqrestore(&ptr->it_lock, flags);
++
++ return result;
+ }
+
+ /**
+@@ -541,18 +549,22 @@ static int alarm_timer_create(struct k_itimer *new_timer)
+ * @new_timer: k_itimer pointer
+ * @cur_setting: itimerspec data to fill
+ *
+- * Copies the itimerspec data out from the k_itimer
++ * Copies out the current itimerspec data
+ */
+ static void alarm_timer_get(struct k_itimer *timr,
+ struct itimerspec *cur_setting)
+ {
+- memset(cur_setting, 0, sizeof(struct itimerspec));
++ ktime_t relative_expiry_time =
++ alarm_expires_remaining(&(timr->it.alarm.alarmtimer));
++
++ if (ktime_to_ns(relative_expiry_time) > 0) {
++ cur_setting->it_value = ktime_to_timespec(relative_expiry_time);
++ } else {
++ cur_setting->it_value.tv_sec = 0;
++ cur_setting->it_value.tv_nsec = 0;
++ }
+
+- cur_setting->it_interval =
+- ktime_to_timespec(timr->it.alarm.interval);
+- cur_setting->it_value =
+- ktime_to_timespec(timr->it.alarm.alarmtimer.node.expires);
+- return;
++ cur_setting->it_interval = ktime_to_timespec(timr->it.alarm.interval);
+ }
+
+ /**
+diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
+index ac9d1dad630b..ca167e660e10 100644
+--- a/kernel/trace/ftrace.c
++++ b/kernel/trace/ftrace.c
+@@ -65,15 +65,21 @@
+ #define FL_GLOBAL_CONTROL_MASK (FTRACE_OPS_FL_CONTROL)
+
+ #ifdef CONFIG_DYNAMIC_FTRACE
+-#define INIT_REGEX_LOCK(opsname) \
+- .regex_lock = __MUTEX_INITIALIZER(opsname.regex_lock),
++#define INIT_OPS_HASH(opsname) \
++ .func_hash = &opsname.local_hash, \
++ .local_hash.regex_lock = __MUTEX_INITIALIZER(opsname.local_hash.regex_lock),
++#define ASSIGN_OPS_HASH(opsname, val) \
++ .func_hash = val, \
++ .local_hash.regex_lock = __MUTEX_INITIALIZER(opsname.local_hash.regex_lock),
+ #else
+-#define INIT_REGEX_LOCK(opsname)
++#define INIT_OPS_HASH(opsname)
++#define ASSIGN_OPS_HASH(opsname, val)
+ #endif
+
+ static struct ftrace_ops ftrace_list_end __read_mostly = {
+ .func = ftrace_stub,
+ .flags = FTRACE_OPS_FL_RECURSION_SAFE | FTRACE_OPS_FL_STUB,
++ INIT_OPS_HASH(ftrace_list_end)
+ };
+
+ /* ftrace_enabled is a method to turn ftrace on or off */
+@@ -108,6 +114,7 @@ static struct ftrace_ops *ftrace_ops_list __read_mostly = &ftrace_list_end;
+ ftrace_func_t ftrace_trace_function __read_mostly = ftrace_stub;
+ ftrace_func_t ftrace_pid_function __read_mostly = ftrace_stub;
+ static struct ftrace_ops global_ops;
++static struct ftrace_ops graph_ops;
+ static struct ftrace_ops control_ops;
+
+ #if ARCH_SUPPORTS_FTRACE_OPS
+@@ -143,7 +150,8 @@ static inline void ftrace_ops_init(struct ftrace_ops *ops)
+ {
+ #ifdef CONFIG_DYNAMIC_FTRACE
+ if (!(ops->flags & FTRACE_OPS_FL_INITIALIZED)) {
+- mutex_init(&ops->regex_lock);
++ mutex_init(&ops->local_hash.regex_lock);
++ ops->func_hash = &ops->local_hash;
+ ops->flags |= FTRACE_OPS_FL_INITIALIZED;
+ }
+ #endif
+@@ -902,7 +910,7 @@ static void unregister_ftrace_profiler(void)
+ static struct ftrace_ops ftrace_profile_ops __read_mostly = {
+ .func = function_profile_call,
+ .flags = FTRACE_OPS_FL_RECURSION_SAFE | FTRACE_OPS_FL_INITIALIZED,
+- INIT_REGEX_LOCK(ftrace_profile_ops)
++ INIT_OPS_HASH(ftrace_profile_ops)
+ };
+
+ static int register_ftrace_profiler(void)
+@@ -1082,11 +1090,12 @@ static const struct ftrace_hash empty_hash = {
+ #define EMPTY_HASH ((struct ftrace_hash *)&empty_hash)
+
+ static struct ftrace_ops global_ops = {
+- .func = ftrace_stub,
+- .notrace_hash = EMPTY_HASH,
+- .filter_hash = EMPTY_HASH,
+- .flags = FTRACE_OPS_FL_RECURSION_SAFE | FTRACE_OPS_FL_INITIALIZED,
+- INIT_REGEX_LOCK(global_ops)
++ .func = ftrace_stub,
++ .local_hash.notrace_hash = EMPTY_HASH,
++ .local_hash.filter_hash = EMPTY_HASH,
++ INIT_OPS_HASH(global_ops)
++ .flags = FTRACE_OPS_FL_RECURSION_SAFE |
++ FTRACE_OPS_FL_INITIALIZED,
+ };
+
+ struct ftrace_page {
+@@ -1227,8 +1236,8 @@ static void free_ftrace_hash_rcu(struct ftrace_hash *hash)
+ void ftrace_free_filter(struct ftrace_ops *ops)
+ {
+ ftrace_ops_init(ops);
+- free_ftrace_hash(ops->filter_hash);
+- free_ftrace_hash(ops->notrace_hash);
++ free_ftrace_hash(ops->func_hash->filter_hash);
++ free_ftrace_hash(ops->func_hash->notrace_hash);
+ }
+
+ static struct ftrace_hash *alloc_ftrace_hash(int size_bits)
+@@ -1289,9 +1298,9 @@ alloc_and_copy_ftrace_hash(int size_bits, struct ftrace_hash *hash)
+ }
+
+ static void
+-ftrace_hash_rec_disable(struct ftrace_ops *ops, int filter_hash);
++ftrace_hash_rec_disable_modify(struct ftrace_ops *ops, int filter_hash);
+ static void
+-ftrace_hash_rec_enable(struct ftrace_ops *ops, int filter_hash);
++ftrace_hash_rec_enable_modify(struct ftrace_ops *ops, int filter_hash);
+
+ static int
+ ftrace_hash_move(struct ftrace_ops *ops, int enable,
+@@ -1311,7 +1320,7 @@ ftrace_hash_move(struct ftrace_ops *ops, int enable,
+ * Remove the current set, update the hash and add
+ * them back.
+ */
+- ftrace_hash_rec_disable(ops, enable);
++ ftrace_hash_rec_disable_modify(ops, enable);
+
+ /*
+ * If the new source is empty, just free dst and assign it
+@@ -1360,7 +1369,7 @@ ftrace_hash_move(struct ftrace_ops *ops, int enable,
+ * On success, we enable the new hash.
+ * On failure, we re-enable the original hash.
+ */
+- ftrace_hash_rec_enable(ops, enable);
++ ftrace_hash_rec_enable_modify(ops, enable);
+
+ return ret;
+ }
+@@ -1394,8 +1403,8 @@ ftrace_ops_test(struct ftrace_ops *ops, unsigned long ip, void *regs)
+ return 0;
+ #endif
+
+- filter_hash = rcu_dereference_raw_notrace(ops->filter_hash);
+- notrace_hash = rcu_dereference_raw_notrace(ops->notrace_hash);
++ filter_hash = rcu_dereference_raw_notrace(ops->func_hash->filter_hash);
++ notrace_hash = rcu_dereference_raw_notrace(ops->func_hash->notrace_hash);
+
+ if ((ftrace_hash_empty(filter_hash) ||
+ ftrace_lookup_ip(filter_hash, ip)) &&
+@@ -1519,14 +1528,14 @@ static void __ftrace_hash_rec_update(struct ftrace_ops *ops,
+ * gets inversed.
+ */
+ if (filter_hash) {
+- hash = ops->filter_hash;
+- other_hash = ops->notrace_hash;
++ hash = ops->func_hash->filter_hash;
++ other_hash = ops->func_hash->notrace_hash;
+ if (ftrace_hash_empty(hash))
+ all = 1;
+ } else {
+ inc = !inc;
+- hash = ops->notrace_hash;
+- other_hash = ops->filter_hash;
++ hash = ops->func_hash->notrace_hash;
++ other_hash = ops->func_hash->filter_hash;
+ /*
+ * If the notrace hash has no items,
+ * then there's nothing to do.
+@@ -1604,6 +1613,41 @@ static void ftrace_hash_rec_enable(struct ftrace_ops *ops,
+ __ftrace_hash_rec_update(ops, filter_hash, 1);
+ }
+
++static void ftrace_hash_rec_update_modify(struct ftrace_ops *ops,
++ int filter_hash, int inc)
++{
++ struct ftrace_ops *op;
++
++ __ftrace_hash_rec_update(ops, filter_hash, inc);
++
++ if (ops->func_hash != &global_ops.local_hash)
++ return;
++
++ /*
++ * If the ops shares the global_ops hash, then we need to update
++ * all ops that are enabled and use this hash.
++ */
++ do_for_each_ftrace_op(op, ftrace_ops_list) {
++ /* Already done */
++ if (op == ops)
++ continue;
++ if (op->func_hash == &global_ops.local_hash)
++ __ftrace_hash_rec_update(op, filter_hash, inc);
++ } while_for_each_ftrace_op(op);
++}
++
++static void ftrace_hash_rec_disable_modify(struct ftrace_ops *ops,
++ int filter_hash)
++{
++ ftrace_hash_rec_update_modify(ops, filter_hash, 0);
++}
++
++static void ftrace_hash_rec_enable_modify(struct ftrace_ops *ops,
++ int filter_hash)
++{
++ ftrace_hash_rec_update_modify(ops, filter_hash, 1);
++}
++
+ static void print_ip_ins(const char *fmt, unsigned char *p)
+ {
+ int i;
+@@ -1809,7 +1853,7 @@ __ftrace_replace_code(struct dyn_ftrace *rec, int enable)
+ return ftrace_make_call(rec, ftrace_addr);
+
+ case FTRACE_UPDATE_MAKE_NOP:
+- return ftrace_make_nop(NULL, rec, ftrace_addr);
++ return ftrace_make_nop(NULL, rec, ftrace_old_addr);
+
+ case FTRACE_UPDATE_MODIFY_CALL:
+ return ftrace_modify_call(rec, ftrace_old_addr, ftrace_addr);
+@@ -2196,8 +2240,8 @@ static inline int ops_traces_mod(struct ftrace_ops *ops)
+ * Filter_hash being empty will default to trace module.
+ * But notrace hash requires a test of individual module functions.
+ */
+- return ftrace_hash_empty(ops->filter_hash) &&
+- ftrace_hash_empty(ops->notrace_hash);
++ return ftrace_hash_empty(ops->func_hash->filter_hash) &&
++ ftrace_hash_empty(ops->func_hash->notrace_hash);
+ }
+
+ /*
+@@ -2219,12 +2263,12 @@ ops_references_rec(struct ftrace_ops *ops, struct dyn_ftrace *rec)
+ return 0;
+
+ /* The function must be in the filter */
+- if (!ftrace_hash_empty(ops->filter_hash) &&
+- !ftrace_lookup_ip(ops->filter_hash, rec->ip))
++ if (!ftrace_hash_empty(ops->func_hash->filter_hash) &&
++ !ftrace_lookup_ip(ops->func_hash->filter_hash, rec->ip))
+ return 0;
+
+ /* If in notrace hash, we ignore it too */
+- if (ftrace_lookup_ip(ops->notrace_hash, rec->ip))
++ if (ftrace_lookup_ip(ops->func_hash->notrace_hash, rec->ip))
+ return 0;
+
+ return 1;
+@@ -2544,10 +2588,10 @@ t_next(struct seq_file *m, void *v, loff_t *pos)
+ } else {
+ rec = &iter->pg->records[iter->idx++];
+ if (((iter->flags & FTRACE_ITER_FILTER) &&
+- !(ftrace_lookup_ip(ops->filter_hash, rec->ip))) ||
++ !(ftrace_lookup_ip(ops->func_hash->filter_hash, rec->ip))) ||
+
+ ((iter->flags & FTRACE_ITER_NOTRACE) &&
+- !ftrace_lookup_ip(ops->notrace_hash, rec->ip)) ||
++ !ftrace_lookup_ip(ops->func_hash->notrace_hash, rec->ip)) ||
+
+ ((iter->flags & FTRACE_ITER_ENABLED) &&
+ !(rec->flags & FTRACE_FL_ENABLED))) {
+@@ -2596,7 +2640,7 @@ static void *t_start(struct seq_file *m, loff_t *pos)
+ * functions are enabled.
+ */
+ if (iter->flags & FTRACE_ITER_FILTER &&
+- ftrace_hash_empty(ops->filter_hash)) {
++ ftrace_hash_empty(ops->func_hash->filter_hash)) {
+ if (*pos > 0)
+ return t_hash_start(m, pos);
+ iter->flags |= FTRACE_ITER_PRINTALL;
+@@ -2750,12 +2794,12 @@ ftrace_regex_open(struct ftrace_ops *ops, int flag,
+ iter->ops = ops;
+ iter->flags = flag;
+
+- mutex_lock(&ops->regex_lock);
++ mutex_lock(&ops->func_hash->regex_lock);
+
+ if (flag & FTRACE_ITER_NOTRACE)
+- hash = ops->notrace_hash;
++ hash = ops->func_hash->notrace_hash;
+ else
+- hash = ops->filter_hash;
++ hash = ops->func_hash->filter_hash;
+
+ if (file->f_mode & FMODE_WRITE) {
+ iter->hash = alloc_and_copy_ftrace_hash(FTRACE_HASH_DEFAULT_BITS, hash);
+@@ -2788,7 +2832,7 @@ ftrace_regex_open(struct ftrace_ops *ops, int flag,
+ file->private_data = iter;
+
+ out_unlock:
+- mutex_unlock(&ops->regex_lock);
++ mutex_unlock(&ops->func_hash->regex_lock);
+
+ return ret;
+ }
+@@ -3026,7 +3070,7 @@ static struct ftrace_ops trace_probe_ops __read_mostly =
+ {
+ .func = function_trace_probe_call,
+ .flags = FTRACE_OPS_FL_INITIALIZED,
+- INIT_REGEX_LOCK(trace_probe_ops)
++ INIT_OPS_HASH(trace_probe_ops)
+ };
+
+ static int ftrace_probe_registered;
+@@ -3089,7 +3133,7 @@ register_ftrace_function_probe(char *glob, struct ftrace_probe_ops *ops,
+ void *data)
+ {
+ struct ftrace_func_probe *entry;
+- struct ftrace_hash **orig_hash = &trace_probe_ops.filter_hash;
++ struct ftrace_hash **orig_hash = &trace_probe_ops.func_hash->filter_hash;
+ struct ftrace_hash *hash;
+ struct ftrace_page *pg;
+ struct dyn_ftrace *rec;
+@@ -3106,7 +3150,7 @@ register_ftrace_function_probe(char *glob, struct ftrace_probe_ops *ops,
+ if (WARN_ON(not))
+ return -EINVAL;
+
+- mutex_lock(&trace_probe_ops.regex_lock);
++ mutex_lock(&trace_probe_ops.func_hash->regex_lock);
+
+ hash = alloc_and_copy_ftrace_hash(FTRACE_HASH_DEFAULT_BITS, *orig_hash);
+ if (!hash) {
+@@ -3175,7 +3219,7 @@ register_ftrace_function_probe(char *glob, struct ftrace_probe_ops *ops,
+ out_unlock:
+ mutex_unlock(&ftrace_lock);
+ out:
+- mutex_unlock(&trace_probe_ops.regex_lock);
++ mutex_unlock(&trace_probe_ops.func_hash->regex_lock);
+ free_ftrace_hash(hash);
+
+ return count;
+@@ -3193,7 +3237,7 @@ __unregister_ftrace_function_probe(char *glob, struct ftrace_probe_ops *ops,
+ struct ftrace_func_entry *rec_entry;
+ struct ftrace_func_probe *entry;
+ struct ftrace_func_probe *p;
+- struct ftrace_hash **orig_hash = &trace_probe_ops.filter_hash;
++ struct ftrace_hash **orig_hash = &trace_probe_ops.func_hash->filter_hash;
+ struct list_head free_list;
+ struct ftrace_hash *hash;
+ struct hlist_node *tmp;
+@@ -3215,7 +3259,7 @@ __unregister_ftrace_function_probe(char *glob, struct ftrace_probe_ops *ops,
+ return;
+ }
+
+- mutex_lock(&trace_probe_ops.regex_lock);
++ mutex_lock(&trace_probe_ops.func_hash->regex_lock);
+
+ hash = alloc_and_copy_ftrace_hash(FTRACE_HASH_DEFAULT_BITS, *orig_hash);
+ if (!hash)
+@@ -3268,7 +3312,7 @@ __unregister_ftrace_function_probe(char *glob, struct ftrace_probe_ops *ops,
+ mutex_unlock(&ftrace_lock);
+
+ out_unlock:
+- mutex_unlock(&trace_probe_ops.regex_lock);
++ mutex_unlock(&trace_probe_ops.func_hash->regex_lock);
+ free_ftrace_hash(hash);
+ }
+
+@@ -3464,12 +3508,12 @@ ftrace_set_hash(struct ftrace_ops *ops, unsigned char *buf, int len,
+ if (unlikely(ftrace_disabled))
+ return -ENODEV;
+
+- mutex_lock(&ops->regex_lock);
++ mutex_lock(&ops->func_hash->regex_lock);
+
+ if (enable)
+- orig_hash = &ops->filter_hash;
++ orig_hash = &ops->func_hash->filter_hash;
+ else
+- orig_hash = &ops->notrace_hash;
++ orig_hash = &ops->func_hash->notrace_hash;
+
+ hash = alloc_and_copy_ftrace_hash(FTRACE_HASH_DEFAULT_BITS, *orig_hash);
+ if (!hash) {
+@@ -3497,7 +3541,7 @@ ftrace_set_hash(struct ftrace_ops *ops, unsigned char *buf, int len,
+ mutex_unlock(&ftrace_lock);
+
+ out_regex_unlock:
+- mutex_unlock(&ops->regex_lock);
++ mutex_unlock(&ops->func_hash->regex_lock);
+
+ free_ftrace_hash(hash);
+ return ret;
+@@ -3704,15 +3748,15 @@ int ftrace_regex_release(struct inode *inode, struct file *file)
+
+ trace_parser_put(parser);
+
+- mutex_lock(&iter->ops->regex_lock);
++ mutex_lock(&iter->ops->func_hash->regex_lock);
+
+ if (file->f_mode & FMODE_WRITE) {
+ filter_hash = !!(iter->flags & FTRACE_ITER_FILTER);
+
+ if (filter_hash)
+- orig_hash = &iter->ops->filter_hash;
++ orig_hash = &iter->ops->func_hash->filter_hash;
+ else
+- orig_hash = &iter->ops->notrace_hash;
++ orig_hash = &iter->ops->func_hash->notrace_hash;
+
+ mutex_lock(&ftrace_lock);
+ ret = ftrace_hash_move(iter->ops, filter_hash,
+@@ -3723,7 +3767,7 @@ int ftrace_regex_release(struct inode *inode, struct file *file)
+ mutex_unlock(&ftrace_lock);
+ }
+
+- mutex_unlock(&iter->ops->regex_lock);
++ mutex_unlock(&iter->ops->func_hash->regex_lock);
+ free_ftrace_hash(iter->hash);
+ kfree(iter);
+
+@@ -4335,7 +4379,6 @@ void __init ftrace_init(void)
+ static struct ftrace_ops global_ops = {
+ .func = ftrace_stub,
+ .flags = FTRACE_OPS_FL_RECURSION_SAFE | FTRACE_OPS_FL_INITIALIZED,
+- INIT_REGEX_LOCK(global_ops)
+ };
+
+ static int __init ftrace_nodyn_init(void)
+@@ -4437,7 +4480,7 @@ ftrace_ops_control_func(unsigned long ip, unsigned long parent_ip,
+ static struct ftrace_ops control_ops = {
+ .func = ftrace_ops_control_func,
+ .flags = FTRACE_OPS_FL_RECURSION_SAFE | FTRACE_OPS_FL_INITIALIZED,
+- INIT_REGEX_LOCK(control_ops)
++ INIT_OPS_HASH(control_ops)
+ };
+
+ static inline void
+@@ -4873,6 +4916,14 @@ ftrace_enable_sysctl(struct ctl_table *table, int write,
+
+ #ifdef CONFIG_FUNCTION_GRAPH_TRACER
+
++static struct ftrace_ops graph_ops = {
++ .func = ftrace_stub,
++ .flags = FTRACE_OPS_FL_RECURSION_SAFE |
++ FTRACE_OPS_FL_INITIALIZED |
++ FTRACE_OPS_FL_STUB,
++ ASSIGN_OPS_HASH(graph_ops, &global_ops.local_hash)
++};
++
+ static int ftrace_graph_active;
+
+ int ftrace_graph_entry_stub(struct ftrace_graph_ent *trace)
+@@ -5035,12 +5086,28 @@ static int ftrace_graph_entry_test(struct ftrace_graph_ent *trace)
+ */
+ static void update_function_graph_func(void)
+ {
+- if (ftrace_ops_list == &ftrace_list_end ||
+- (ftrace_ops_list == &global_ops &&
+- global_ops.next == &ftrace_list_end))
+- ftrace_graph_entry = __ftrace_graph_entry;
+- else
++ struct ftrace_ops *op;
++ bool do_test = false;
++
++ /*
++ * The graph and global ops share the same set of functions
++ * to test. If any other ops is on the list, then
++ * the graph tracing needs to test if it's the function
++ * it should call.
++ */
++ do_for_each_ftrace_op(op, ftrace_ops_list) {
++ if (op != &global_ops && op != &graph_ops &&
++ op != &ftrace_list_end) {
++ do_test = true;
++ /* in double loop, break out with goto */
++ goto out;
++ }
++ } while_for_each_ftrace_op(op);
++ out:
++ if (do_test)
+ ftrace_graph_entry = ftrace_graph_entry_test;
++ else
++ ftrace_graph_entry = __ftrace_graph_entry;
+ }
+
+ static struct notifier_block ftrace_suspend_notifier = {
+@@ -5081,11 +5148,7 @@ int register_ftrace_graph(trace_func_graph_ret_t retfunc,
+ ftrace_graph_entry = ftrace_graph_entry_test;
+ update_function_graph_func();
+
+- /* Function graph doesn't use the .func field of global_ops */
+- global_ops.flags |= FTRACE_OPS_FL_STUB;
+-
+- ret = ftrace_startup(&global_ops, FTRACE_START_FUNC_RET);
+-
++ ret = ftrace_startup(&graph_ops, FTRACE_START_FUNC_RET);
+ out:
+ mutex_unlock(&ftrace_lock);
+ return ret;
+@@ -5102,8 +5165,7 @@ void unregister_ftrace_graph(void)
+ ftrace_graph_return = (trace_func_graph_ret_t)ftrace_stub;
+ ftrace_graph_entry = ftrace_graph_entry_stub;
+ __ftrace_graph_entry = ftrace_graph_entry_stub;
+- ftrace_shutdown(&global_ops, FTRACE_STOP_FUNC_RET);
+- global_ops.flags &= ~FTRACE_OPS_FL_STUB;
++ ftrace_shutdown(&graph_ops, FTRACE_STOP_FUNC_RET);
+ unregister_pm_notifier(&ftrace_suspend_notifier);
+ unregister_trace_sched_switch(ftrace_graph_probe_sched_switch, NULL);
+
+diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
+index b95381ebdd5e..2ff0580d3dcd 100644
+--- a/kernel/trace/ring_buffer.c
++++ b/kernel/trace/ring_buffer.c
+@@ -626,8 +626,22 @@ int ring_buffer_poll_wait(struct ring_buffer *buffer, int cpu,
+ work = &cpu_buffer->irq_work;
+ }
+
+- work->waiters_pending = true;
+ poll_wait(filp, &work->waiters, poll_table);
++ work->waiters_pending = true;
++ /*
++ * There's a tight race between setting the waiters_pending and
++ * checking if the ring buffer is empty. Once the waiters_pending bit
++ * is set, the next event will wake the task up, but we can get stuck
++ * if there's only a single event in.
++ *
++ * FIXME: Ideally, we need a memory barrier on the writer side as well,
++ * but adding a memory barrier to all events will cause too much of a
++ * performance hit in the fast path. We only need a memory barrier when
++ * the buffer goes from empty to having content. But as this race is
++ * extremely small, and it's not a problem if another event comes in, we
++ * will fix it later.
++ */
++ smp_mb();
+
+ if ((cpu == RING_BUFFER_ALL_CPUS && !ring_buffer_empty(buffer)) ||
+ (cpu != RING_BUFFER_ALL_CPUS && !ring_buffer_empty_cpu(buffer, cpu)))
+diff --git a/mm/dmapool.c b/mm/dmapool.c
+index 306baa594f95..ba8019b063e1 100644
+--- a/mm/dmapool.c
++++ b/mm/dmapool.c
+@@ -176,7 +176,7 @@ struct dma_pool *dma_pool_create(const char *name, struct device *dev,
+ if (list_empty(&dev->dma_pools) &&
+ device_create_file(dev, &dev_attr_pools)) {
+ kfree(retval);
+- return NULL;
++ retval = NULL;
+ } else
+ list_add(&retval->pools, &dev->dma_pools);
+ mutex_unlock(&pools_lock);
+diff --git a/mm/memblock.c b/mm/memblock.c
+index 6d2f219a48b0..70fad0c0dafb 100644
+--- a/mm/memblock.c
++++ b/mm/memblock.c
+@@ -192,8 +192,7 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t size,
+ phys_addr_t align, phys_addr_t start,
+ phys_addr_t end, int nid)
+ {
+- int ret;
+- phys_addr_t kernel_end;
++ phys_addr_t kernel_end, ret;
+
+ /* pump up @end */
+ if (end == MEMBLOCK_ALLOC_ACCESSIBLE)
+diff --git a/mm/memory.c b/mm/memory.c
+index 0a21f3d162ae..533023da2faa 100644
+--- a/mm/memory.c
++++ b/mm/memory.c
+@@ -1125,7 +1125,7 @@ again:
+ addr) != page->index) {
+ pte_t ptfile = pgoff_to_pte(page->index);
+ if (pte_soft_dirty(ptent))
+- pte_file_mksoft_dirty(ptfile);
++ ptfile = pte_file_mksoft_dirty(ptfile);
+ set_pte_at(mm, addr, pte, ptfile);
+ }
+ if (PageAnon(page))
+diff --git a/mm/percpu-vm.c b/mm/percpu-vm.c
+index 3707c71ae4cd..51108165f829 100644
+--- a/mm/percpu-vm.c
++++ b/mm/percpu-vm.c
+@@ -108,7 +108,7 @@ static int pcpu_alloc_pages(struct pcpu_chunk *chunk,
+ int page_start, int page_end)
+ {
+ const gfp_t gfp = GFP_KERNEL | __GFP_HIGHMEM | __GFP_COLD;
+- unsigned int cpu;
++ unsigned int cpu, tcpu;
+ int i;
+
+ for_each_possible_cpu(cpu) {
+@@ -116,14 +116,23 @@ static int pcpu_alloc_pages(struct pcpu_chunk *chunk,
+ struct page **pagep = &pages[pcpu_page_idx(cpu, i)];
+
+ *pagep = alloc_pages_node(cpu_to_node(cpu), gfp, 0);
+- if (!*pagep) {
+- pcpu_free_pages(chunk, pages, populated,
+- page_start, page_end);
+- return -ENOMEM;
+- }
++ if (!*pagep)
++ goto err;
+ }
+ }
+ return 0;
++
++err:
++ while (--i >= page_start)
++ __free_page(pages[pcpu_page_idx(cpu, i)]);
++
++ for_each_possible_cpu(tcpu) {
++ if (tcpu == cpu)
++ break;
++ for (i = page_start; i < page_end; i++)
++ __free_page(pages[pcpu_page_idx(tcpu, i)]);
++ }
++ return -ENOMEM;
+ }
+
+ /**
+@@ -263,6 +272,7 @@ err:
+ __pcpu_unmap_pages(pcpu_chunk_addr(chunk, tcpu, page_start),
+ page_end - page_start);
+ }
++ pcpu_post_unmap_tlb_flush(chunk, page_start, page_end);
+ return err;
+ }
+
+diff --git a/mm/percpu.c b/mm/percpu.c
+index 2ddf9a990dbd..492f601df473 100644
+--- a/mm/percpu.c
++++ b/mm/percpu.c
+@@ -1933,6 +1933,8 @@ void __init setup_per_cpu_areas(void)
+
+ if (pcpu_setup_first_chunk(ai, fc) < 0)
+ panic("Failed to initialize percpu areas.");
++
++ pcpu_free_alloc_info(ai);
+ }
+
+ #endif /* CONFIG_SMP */
+diff --git a/mm/shmem.c b/mm/shmem.c
+index af68b15a8fc1..e53ab3a8a8d3 100644
+--- a/mm/shmem.c
++++ b/mm/shmem.c
+@@ -2064,8 +2064,10 @@ static int shmem_rename(struct inode *old_dir, struct dentry *old_dentry, struct
+
+ if (new_dentry->d_inode) {
+ (void) shmem_unlink(new_dir, new_dentry);
+- if (they_are_dirs)
++ if (they_are_dirs) {
++ drop_nlink(new_dentry->d_inode);
+ drop_nlink(old_dir);
++ }
+ } else if (they_are_dirs) {
+ drop_nlink(old_dir);
+ inc_nlink(new_dir);
+diff --git a/mm/slab.c b/mm/slab.c
+index 3070b929a1bf..c9103e4cf2c2 100644
+--- a/mm/slab.c
++++ b/mm/slab.c
+@@ -2224,7 +2224,8 @@ static int __init_refok setup_cpu_cache(struct kmem_cache *cachep, gfp_t gfp)
+ int
+ __kmem_cache_create (struct kmem_cache *cachep, unsigned long flags)
+ {
+- size_t left_over, freelist_size, ralign;
++ size_t left_over, freelist_size;
++ size_t ralign = BYTES_PER_WORD;
+ gfp_t gfp;
+ int err;
+ size_t size = cachep->size;
+@@ -2257,14 +2258,6 @@ __kmem_cache_create (struct kmem_cache *cachep, unsigned long flags)
+ size &= ~(BYTES_PER_WORD - 1);
+ }
+
+- /*
+- * Redzoning and user store require word alignment or possibly larger.
+- * Note this will be overridden by architecture or caller mandated
+- * alignment if either is greater than BYTES_PER_WORD.
+- */
+- if (flags & SLAB_STORE_USER)
+- ralign = BYTES_PER_WORD;
+-
+ if (flags & SLAB_RED_ZONE) {
+ ralign = REDZONE_ALIGN;
+ /* If redzoning, ensure that the second redzone is suitably
+diff --git a/net/mac80211/mlme.c b/net/mac80211/mlme.c
+index 3345401be1b3..c8779f316d30 100644
+--- a/net/mac80211/mlme.c
++++ b/net/mac80211/mlme.c
+@@ -4355,8 +4355,7 @@ int ieee80211_mgd_assoc(struct ieee80211_sub_if_data *sdata,
+ rcu_read_unlock();
+
+ if (bss->wmm_used && bss->uapsd_supported &&
+- (sdata->local->hw.flags & IEEE80211_HW_SUPPORTS_UAPSD) &&
+- sdata->wmm_acm != 0xff) {
++ (sdata->local->hw.flags & IEEE80211_HW_SUPPORTS_UAPSD)) {
+ assoc_data->uapsd = true;
+ ifmgd->flags |= IEEE80211_STA_UAPSD_ENABLED;
+ } else {
+diff --git a/net/netfilter/ipvs/ip_vs_core.c b/net/netfilter/ipvs/ip_vs_core.c
+index e6836755c45d..5c34e8d42e01 100644
+--- a/net/netfilter/ipvs/ip_vs_core.c
++++ b/net/netfilter/ipvs/ip_vs_core.c
+@@ -1906,7 +1906,7 @@ static struct nf_hook_ops ip_vs_ops[] __read_mostly = {
+ {
+ .hook = ip_vs_local_reply6,
+ .owner = THIS_MODULE,
+- .pf = NFPROTO_IPV4,
++ .pf = NFPROTO_IPV6,
+ .hooknum = NF_INET_LOCAL_OUT,
+ .priority = NF_IP6_PRI_NAT_DST + 1,
+ },
+diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c
+index 73ba1cc7a88d..6f70bdd3a90a 100644
+--- a/net/netfilter/ipvs/ip_vs_xmit.c
++++ b/net/netfilter/ipvs/ip_vs_xmit.c
+@@ -967,8 +967,8 @@ ip_vs_tunnel_xmit_v6(struct sk_buff *skb, struct ip_vs_conn *cp,
+ iph->nexthdr = IPPROTO_IPV6;
+ iph->payload_len = old_iph->payload_len;
+ be16_add_cpu(&iph->payload_len, sizeof(*old_iph));
+- iph->priority = old_iph->priority;
+ memset(&iph->flow_lbl, 0, sizeof(iph->flow_lbl));
++ ipv6_change_dsfield(iph, 0, ipv6_get_dsfield(old_iph));
+ iph->daddr = cp->daddr.in6;
+ iph->saddr = saddr;
+ iph->hop_limit = old_iph->hop_limit;
+diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
+index 8746ff9a8357..62101ed0d2af 100644
+--- a/net/netfilter/nf_tables_api.c
++++ b/net/netfilter/nf_tables_api.c
+@@ -899,6 +899,9 @@ static struct nft_stats __percpu *nft_stats_alloc(const struct nlattr *attr)
+ static void nft_chain_stats_replace(struct nft_base_chain *chain,
+ struct nft_stats __percpu *newstats)
+ {
++ if (newstats == NULL)
++ return;
++
+ if (chain->stats) {
+ struct nft_stats __percpu *oldstats =
+ nft_dereference(chain->stats);
+diff --git a/net/netfilter/xt_cgroup.c b/net/netfilter/xt_cgroup.c
+index f4e833005320..7198d660b4de 100644
+--- a/net/netfilter/xt_cgroup.c
++++ b/net/netfilter/xt_cgroup.c
+@@ -31,7 +31,7 @@ static int cgroup_mt_check(const struct xt_mtchk_param *par)
+ if (info->invert & ~1)
+ return -EINVAL;
+
+- return info->id ? 0 : -EINVAL;
++ return 0;
+ }
+
+ static bool
+diff --git a/net/netfilter/xt_hashlimit.c b/net/netfilter/xt_hashlimit.c
+index a3910fc2122b..47dc6836830a 100644
+--- a/net/netfilter/xt_hashlimit.c
++++ b/net/netfilter/xt_hashlimit.c
+@@ -104,7 +104,7 @@ struct xt_hashlimit_htable {
+ spinlock_t lock; /* lock for list_head */
+ u_int32_t rnd; /* random seed for hash */
+ unsigned int count; /* number entries in table */
+- struct timer_list timer; /* timer for gc */
++ struct delayed_work gc_work;
+
+ /* seq_file stuff */
+ struct proc_dir_entry *pde;
+@@ -213,7 +213,7 @@ dsthash_free(struct xt_hashlimit_htable *ht, struct dsthash_ent *ent)
+ call_rcu_bh(&ent->rcu, dsthash_free_rcu);
+ ht->count--;
+ }
+-static void htable_gc(unsigned long htlong);
++static void htable_gc(struct work_struct *work);
+
+ static int htable_create(struct net *net, struct xt_hashlimit_mtinfo1 *minfo,
+ u_int8_t family)
+@@ -273,9 +273,9 @@ static int htable_create(struct net *net, struct xt_hashlimit_mtinfo1 *minfo,
+ }
+ hinfo->net = net;
+
+- setup_timer(&hinfo->timer, htable_gc, (unsigned long)hinfo);
+- hinfo->timer.expires = jiffies + msecs_to_jiffies(hinfo->cfg.gc_interval);
+- add_timer(&hinfo->timer);
++ INIT_DEFERRABLE_WORK(&hinfo->gc_work, htable_gc);
++ queue_delayed_work(system_power_efficient_wq, &hinfo->gc_work,
++ msecs_to_jiffies(hinfo->cfg.gc_interval));
+
+ hlist_add_head(&hinfo->node, &hashlimit_net->htables);
+
+@@ -300,29 +300,30 @@ static void htable_selective_cleanup(struct xt_hashlimit_htable *ht,
+ {
+ unsigned int i;
+
+- /* lock hash table and iterate over it */
+- spin_lock_bh(&ht->lock);
+ for (i = 0; i < ht->cfg.size; i++) {
+ struct dsthash_ent *dh;
+ struct hlist_node *n;
++
++ spin_lock_bh(&ht->lock);
+ hlist_for_each_entry_safe(dh, n, &ht->hash[i], node) {
+ if ((*select)(ht, dh))
+ dsthash_free(ht, dh);
+ }
++ spin_unlock_bh(&ht->lock);
++ cond_resched();
+ }
+- spin_unlock_bh(&ht->lock);
+ }
+
+-/* hash table garbage collector, run by timer */
+-static void htable_gc(unsigned long htlong)
++static void htable_gc(struct work_struct *work)
+ {
+- struct xt_hashlimit_htable *ht = (struct xt_hashlimit_htable *)htlong;
++ struct xt_hashlimit_htable *ht;
++
++ ht = container_of(work, struct xt_hashlimit_htable, gc_work.work);
+
+ htable_selective_cleanup(ht, select_gc);
+
+- /* re-add the timer accordingly */
+- ht->timer.expires = jiffies + msecs_to_jiffies(ht->cfg.gc_interval);
+- add_timer(&ht->timer);
++ queue_delayed_work(system_power_efficient_wq,
++ &ht->gc_work, msecs_to_jiffies(ht->cfg.gc_interval));
+ }
+
+ static void htable_remove_proc_entry(struct xt_hashlimit_htable *hinfo)
+@@ -341,7 +342,7 @@ static void htable_remove_proc_entry(struct xt_hashlimit_htable *hinfo)
+
+ static void htable_destroy(struct xt_hashlimit_htable *hinfo)
+ {
+- del_timer_sync(&hinfo->timer);
++ cancel_delayed_work_sync(&hinfo->gc_work);
+ htable_remove_proc_entry(hinfo);
+ htable_selective_cleanup(hinfo, select_all);
+ kfree(hinfo->name);
+diff --git a/net/wireless/nl80211.c b/net/wireless/nl80211.c
+index 6668daf69326..d702af40ddea 100644
+--- a/net/wireless/nl80211.c
++++ b/net/wireless/nl80211.c
+@@ -6978,6 +6978,9 @@ void __cfg80211_send_event_skb(struct sk_buff *skb, gfp_t gfp)
+ struct nlattr *data = ((void **)skb->cb)[2];
+ enum nl80211_multicast_groups mcgrp = NL80211_MCGRP_TESTMODE;
+
++ /* clear CB data for netlink core to own from now on */
++ memset(skb->cb, 0, sizeof(skb->cb));
++
+ nla_nest_end(skb, data);
+ genlmsg_end(skb, hdr);
+
+@@ -9300,6 +9303,9 @@ int cfg80211_vendor_cmd_reply(struct sk_buff *skb)
+ void *hdr = ((void **)skb->cb)[1];
+ struct nlattr *data = ((void **)skb->cb)[2];
+
++ /* clear CB data for netlink core to own from now on */
++ memset(skb->cb, 0, sizeof(skb->cb));
++
+ if (WARN_ON(!rdev->cur_cmd_info)) {
+ kfree_skb(skb);
+ return -EINVAL;
+diff --git a/sound/core/info.c b/sound/core/info.c
+index 051d55b05521..9f404e965ea2 100644
+--- a/sound/core/info.c
++++ b/sound/core/info.c
+@@ -684,7 +684,7 @@ int snd_info_card_free(struct snd_card *card)
+ * snd_info_get_line - read one line from the procfs buffer
+ * @buffer: the procfs buffer
+ * @line: the buffer to store
+- * @len: the max. buffer size - 1
++ * @len: the max. buffer size
+ *
+ * Reads one line from the buffer and stores the string.
+ *
+@@ -704,7 +704,7 @@ int snd_info_get_line(struct snd_info_buffer *buffer, char *line, int len)
+ buffer->stop = 1;
+ if (c == '\n')
+ break;
+- if (len) {
++ if (len > 1) {
+ len--;
+ *line++ = c;
+ }
+diff --git a/sound/core/pcm_lib.c b/sound/core/pcm_lib.c
+index 9acc77eae487..0032278567ad 100644
+--- a/sound/core/pcm_lib.c
++++ b/sound/core/pcm_lib.c
+@@ -1782,14 +1782,16 @@ static int snd_pcm_lib_ioctl_fifo_size(struct snd_pcm_substream *substream,
+ {
+ struct snd_pcm_hw_params *params = arg;
+ snd_pcm_format_t format;
+- int channels, width;
++ int channels;
++ ssize_t frame_size;
+
+ params->fifo_size = substream->runtime->hw.fifo_size;
+ if (!(substream->runtime->hw.info & SNDRV_PCM_INFO_FIFO_IN_FRAMES)) {
+ format = params_format(params);
+ channels = params_channels(params);
+- width = snd_pcm_format_physical_width(format);
+- params->fifo_size /= width * channels;
++ frame_size = snd_pcm_format_size(format, channels);
++ if (frame_size > 0)
++ params->fifo_size /= (unsigned)frame_size;
+ }
+ return 0;
+ }
+diff --git a/sound/firewire/amdtp.c b/sound/firewire/amdtp.c
+index f96bf4c7c232..95fc2eaf11dc 100644
+--- a/sound/firewire/amdtp.c
++++ b/sound/firewire/amdtp.c
+@@ -507,7 +507,16 @@ static void amdtp_pull_midi(struct amdtp_stream *s,
+ static void update_pcm_pointers(struct amdtp_stream *s,
+ struct snd_pcm_substream *pcm,
+ unsigned int frames)
+-{ unsigned int ptr;
++{
++ unsigned int ptr;
++
++ /*
++ * In IEC 61883-6, one data block represents one event. In ALSA, one
++ * event equals to one PCM frame. But Dice has a quirk to transfer
++ * two PCM frames in one data block.
++ */
++ if (s->double_pcm_frames)
++ frames *= 2;
+
+ ptr = s->pcm_buffer_pointer + frames;
+ if (ptr >= pcm->runtime->buffer_size)
+diff --git a/sound/firewire/amdtp.h b/sound/firewire/amdtp.h
+index d8ee7b0e9386..4823c08196ac 100644
+--- a/sound/firewire/amdtp.h
++++ b/sound/firewire/amdtp.h
+@@ -125,6 +125,7 @@ struct amdtp_stream {
+ unsigned int pcm_buffer_pointer;
+ unsigned int pcm_period_pointer;
+ bool pointer_flush;
++ bool double_pcm_frames;
+
+ struct snd_rawmidi_substream *midi[AMDTP_MAX_CHANNELS_FOR_MIDI * 8];
+
+diff --git a/sound/firewire/dice.c b/sound/firewire/dice.c
+index a9a30c0161f1..e3a04d69c853 100644
+--- a/sound/firewire/dice.c
++++ b/sound/firewire/dice.c
+@@ -567,10 +567,14 @@ static int dice_hw_params(struct snd_pcm_substream *substream,
+ return err;
+
+ /*
+- * At rates above 96 kHz, pretend that the stream runs at half the
+- * actual sample rate with twice the number of channels; two samples
+- * of a channel are stored consecutively in the packet. Requires
+- * blocking mode and PCM buffer size should be aligned to SYT_INTERVAL.
++ * At 176.4/192.0 kHz, Dice has a quirk to transfer two PCM frames in
++ * one data block of AMDTP packet. Thus sampling transfer frequency is
++ * a half of PCM sampling frequency, i.e. PCM frames at 192.0 kHz are
++ * transferred on AMDTP packets at 96 kHz. Two successive samples of a
++ * channel are stored consecutively in the packet. This quirk is called
++ * as 'Dual Wire'.
++ * For this quirk, blocking mode is required and PCM buffer size should
++ * be aligned to SYT_INTERVAL.
+ */
+ channels = params_channels(hw_params);
+ if (rate_index > 4) {
+@@ -579,18 +583,25 @@ static int dice_hw_params(struct snd_pcm_substream *substream,
+ return err;
+ }
+
+- for (i = 0; i < channels; i++) {
+- dice->stream.pcm_positions[i * 2] = i;
+- dice->stream.pcm_positions[i * 2 + 1] = i + channels;
+- }
+-
+ rate /= 2;
+ channels *= 2;
++ dice->stream.double_pcm_frames = true;
++ } else {
++ dice->stream.double_pcm_frames = false;
+ }
+
+ mode = rate_index_to_mode(rate_index);
+ amdtp_stream_set_parameters(&dice->stream, rate, channels,
+ dice->rx_midi_ports[mode]);
++ if (rate_index > 4) {
++ channels /= 2;
++
++ for (i = 0; i < channels; i++) {
++ dice->stream.pcm_positions[i] = i * 2;
++ dice->stream.pcm_positions[i + channels] = i * 2 + 1;
++ }
++ }
++
+ amdtp_stream_set_pcm_format(&dice->stream,
+ params_format(hw_params));
+
+diff --git a/sound/pci/hda/patch_conexant.c b/sound/pci/hda/patch_conexant.c
+index 1dc7e974f3b1..d5792653e77b 100644
+--- a/sound/pci/hda/patch_conexant.c
++++ b/sound/pci/hda/patch_conexant.c
+@@ -2822,6 +2822,7 @@ enum {
+ CXT_FIXUP_HEADPHONE_MIC_PIN,
+ CXT_FIXUP_HEADPHONE_MIC,
+ CXT_FIXUP_GPIO1,
++ CXT_FIXUP_ASPIRE_DMIC,
+ CXT_FIXUP_THINKPAD_ACPI,
+ CXT_FIXUP_OLPC_XO,
+ CXT_FIXUP_CAP_MIX_AMP,
+@@ -3269,6 +3270,12 @@ static const struct hda_fixup cxt_fixups[] = {
+ { }
+ },
+ },
++ [CXT_FIXUP_ASPIRE_DMIC] = {
++ .type = HDA_FIXUP_FUNC,
++ .v.func = cxt_fixup_stereo_dmic,
++ .chained = true,
++ .chain_id = CXT_FIXUP_GPIO1,
++ },
+ [CXT_FIXUP_THINKPAD_ACPI] = {
+ .type = HDA_FIXUP_FUNC,
+ .v.func = hda_fixup_thinkpad_acpi,
+@@ -3349,7 +3356,7 @@ static const struct hda_model_fixup cxt5051_fixup_models[] = {
+
+ static const struct snd_pci_quirk cxt5066_fixups[] = {
+ SND_PCI_QUIRK(0x1025, 0x0543, "Acer Aspire One 522", CXT_FIXUP_STEREO_DMIC),
+- SND_PCI_QUIRK(0x1025, 0x054c, "Acer Aspire 3830TG", CXT_FIXUP_GPIO1),
++ SND_PCI_QUIRK(0x1025, 0x054c, "Acer Aspire 3830TG", CXT_FIXUP_ASPIRE_DMIC),
+ SND_PCI_QUIRK(0x1043, 0x138d, "Asus", CXT_FIXUP_HEADPHONE_MIC_PIN),
+ SND_PCI_QUIRK(0x152d, 0x0833, "OLPC XO-1.5", CXT_FIXUP_OLPC_XO),
+ SND_PCI_QUIRK(0x17aa, 0x20f2, "Lenovo T400", CXT_PINCFG_LENOVO_TP410),
+@@ -3375,6 +3382,7 @@ static const struct hda_model_fixup cxt5066_fixup_models[] = {
+ { .id = CXT_PINCFG_LENOVO_TP410, .name = "tp410" },
+ { .id = CXT_FIXUP_THINKPAD_ACPI, .name = "thinkpad" },
+ { .id = CXT_PINCFG_LEMOTE_A1004, .name = "lemote-a1004" },
++ { .id = CXT_PINCFG_LEMOTE_A1205, .name = "lemote-a1205" },
+ { .id = CXT_FIXUP_OLPC_XO, .name = "olpc-xo" },
+ {}
+ };
+diff --git a/sound/pci/hda/patch_realtek.c b/sound/pci/hda/patch_realtek.c
+index 25728aaacc26..88e4623d4f97 100644
+--- a/sound/pci/hda/patch_realtek.c
++++ b/sound/pci/hda/patch_realtek.c
+@@ -327,6 +327,7 @@ static void alc_auto_init_amp(struct hda_codec *codec, int type)
+ case 0x10ec0885:
+ case 0x10ec0887:
+ /*case 0x10ec0889:*/ /* this causes an SPDIF problem */
++ case 0x10ec0900:
+ alc889_coef_init(codec);
+ break;
+ case 0x10ec0888:
+@@ -2349,6 +2350,7 @@ static int patch_alc882(struct hda_codec *codec)
+ switch (codec->vendor_id) {
+ case 0x10ec0882:
+ case 0x10ec0885:
++ case 0x10ec0900:
+ break;
+ default:
+ /* ALC883 and variants */
+diff --git a/sound/pci/hda/patch_sigmatel.c b/sound/pci/hda/patch_sigmatel.c
+index 4d3a3b932690..619aec71b1e2 100644
+--- a/sound/pci/hda/patch_sigmatel.c
++++ b/sound/pci/hda/patch_sigmatel.c
+@@ -565,8 +565,8 @@ static void stac_init_power_map(struct hda_codec *codec)
+ if (snd_hda_jack_tbl_get(codec, nid))
+ continue;
+ if (def_conf == AC_JACK_PORT_COMPLEX &&
+- !(spec->vref_mute_led_nid == nid ||
+- is_jack_detectable(codec, nid))) {
++ spec->vref_mute_led_nid != nid &&
++ is_jack_detectable(codec, nid)) {
+ snd_hda_jack_detect_enable_callback(codec, nid,
+ STAC_PWR_EVENT,
+ jack_update_power);
+@@ -4263,11 +4263,18 @@ static int stac_parse_auto_config(struct hda_codec *codec)
+ return err;
+ }
+
+- stac_init_power_map(codec);
+-
+ return 0;
+ }
+
++static int stac_build_controls(struct hda_codec *codec)
++{
++ int err = snd_hda_gen_build_controls(codec);
++
++ if (err < 0)
++ return err;
++ stac_init_power_map(codec);
++ return 0;
++}
+
+ static int stac_init(struct hda_codec *codec)
+ {
+@@ -4379,7 +4386,7 @@ static int stac_suspend(struct hda_codec *codec)
+ #endif /* CONFIG_PM */
+
+ static const struct hda_codec_ops stac_patch_ops = {
+- .build_controls = snd_hda_gen_build_controls,
++ .build_controls = stac_build_controls,
+ .build_pcms = snd_hda_gen_build_pcms,
+ .init = stac_init,
+ .free = stac_free,
+diff --git a/sound/soc/davinci/davinci-mcasp.c b/sound/soc/davinci/davinci-mcasp.c
+index 9afb14629a17..b7559bc49426 100644
+--- a/sound/soc/davinci/davinci-mcasp.c
++++ b/sound/soc/davinci/davinci-mcasp.c
+@@ -455,8 +455,17 @@ static int davinci_config_channel_size(struct davinci_mcasp *mcasp,
+ {
+ u32 fmt;
+ u32 tx_rotate = (word_length / 4) & 0x7;
+- u32 rx_rotate = (32 - word_length) / 4;
+ u32 mask = (1ULL << word_length) - 1;
++ /*
++ * For captured data we should not rotate, inversion and masking is
++ * enough to get the data to the right position:
++ * Format data from bus after reverse (XRBUF)
++ * S16_LE: |LSB|MSB|xxx|xxx| |xxx|xxx|MSB|LSB|
++ * S24_3LE: |LSB|DAT|MSB|xxx| |xxx|MSB|DAT|LSB|
++ * S24_LE: |LSB|DAT|MSB|xxx| |xxx|MSB|DAT|LSB|
++ * S32_LE: |LSB|DAT|DAT|MSB| |MSB|DAT|DAT|LSB|
++ */
++ u32 rx_rotate = 0;
+
+ /*
+ * if s BCLK-to-LRCLK ratio has been configured via the set_clkdiv()
* [gentoo-commits] proj/linux-patches:3.16 commit in: /
@ 2014-10-06 11:16 Anthony G. Basile
0 siblings, 0 replies; 26+ messages in thread
From: Anthony G. Basile @ 2014-10-06 11:16 UTC (permalink / raw
To: gentoo-commits
commit: 06a07c7f7ebb2c26793e4bf990975df43e6c9bf6
Author: Anthony G. Basile <blueness <AT> gentoo <DOT> org>
AuthorDate: Mon Oct 6 11:01:54 2014 +0000
Commit: Anthony G. Basile <blueness <AT> gentoo <DOT> org>
CommitDate: Mon Oct 6 11:01:54 2014 +0000
URL: http://sources.gentoo.org/gitweb/?p=proj/linux-patches.git;a=commit;h=06a07c7f
Remove duplicate of multipath-tcp patch.
---
5010_multipath-tcp-v3.16-872d7f6c6f4e.patch | 19230 --------------------------
1 file changed, 19230 deletions(-)
diff --git a/5010_multipath-tcp-v3.16-872d7f6c6f4e.patch b/5010_multipath-tcp-v3.16-872d7f6c6f4e.patch
deleted file mode 100644
index 3000da3..0000000
--- a/5010_multipath-tcp-v3.16-872d7f6c6f4e.patch
+++ /dev/null
@@ -1,19230 +0,0 @@
-diff --git a/drivers/infiniband/hw/cxgb4/cm.c b/drivers/infiniband/hw/cxgb4/cm.c
-index 768a0fb67dd6..5a46d91a8df9 100644
---- a/drivers/infiniband/hw/cxgb4/cm.c
-+++ b/drivers/infiniband/hw/cxgb4/cm.c
-@@ -3432,7 +3432,7 @@ static void build_cpl_pass_accept_req(struct sk_buff *skb, int stid , u8 tos)
- */
- memset(&tmp_opt, 0, sizeof(tmp_opt));
- tcp_clear_options(&tmp_opt);
-- tcp_parse_options(skb, &tmp_opt, 0, NULL);
-+ tcp_parse_options(skb, &tmp_opt, NULL, 0, NULL);
-
- req = (struct cpl_pass_accept_req *)__skb_push(skb, sizeof(*req));
- memset(req, 0, sizeof(*req));
-diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h
-index 2faef339d8f2..d86c853ffaad 100644
---- a/include/linux/ipv6.h
-+++ b/include/linux/ipv6.h
-@@ -256,16 +256,6 @@ static inline struct ipv6_pinfo * inet6_sk(const struct sock *__sk)
- return inet_sk(__sk)->pinet6;
- }
-
--static inline struct request_sock *inet6_reqsk_alloc(struct request_sock_ops *ops)
--{
-- struct request_sock *req = reqsk_alloc(ops);
--
-- if (req)
-- inet_rsk(req)->pktopts = NULL;
--
-- return req;
--}
--
- static inline struct raw6_sock *raw6_sk(const struct sock *sk)
- {
- return (struct raw6_sock *)sk;
-@@ -309,12 +299,6 @@ static inline struct ipv6_pinfo * inet6_sk(const struct sock *__sk)
- return NULL;
- }
-
--static inline struct inet6_request_sock *
-- inet6_rsk(const struct request_sock *rsk)
--{
-- return NULL;
--}
--
- static inline struct raw6_sock *raw6_sk(const struct sock *sk)
- {
- return NULL;
-diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
-index ec89301ada41..99ea4b0e3693 100644
---- a/include/linux/skbuff.h
-+++ b/include/linux/skbuff.h
-@@ -2784,8 +2784,10 @@ static inline bool __skb_checksum_validate_needed(struct sk_buff *skb,
- bool zero_okay,
- __sum16 check)
- {
-- if (skb_csum_unnecessary(skb) || (zero_okay && !check)) {
-- skb->csum_valid = 1;
-+ if (skb_csum_unnecessary(skb)) {
-+ return false;
-+ } else if (zero_okay && !check) {
-+ skb->ip_summed = CHECKSUM_UNNECESSARY;
- return false;
- }
-
-diff --git a/include/linux/tcp.h b/include/linux/tcp.h
-index a0513210798f..7bc2e078d6ca 100644
---- a/include/linux/tcp.h
-+++ b/include/linux/tcp.h
-@@ -53,7 +53,7 @@ static inline unsigned int tcp_optlen(const struct sk_buff *skb)
- /* TCP Fast Open */
- #define TCP_FASTOPEN_COOKIE_MIN 4 /* Min Fast Open Cookie size in bytes */
- #define TCP_FASTOPEN_COOKIE_MAX 16 /* Max Fast Open Cookie size in bytes */
--#define TCP_FASTOPEN_COOKIE_SIZE 8 /* the size employed by this impl. */
-+#define TCP_FASTOPEN_COOKIE_SIZE 4 /* the size employed by this impl. */
-
- /* TCP Fast Open Cookie as stored in memory */
- struct tcp_fastopen_cookie {
-@@ -72,6 +72,51 @@ struct tcp_sack_block {
- u32 end_seq;
- };
-
-+struct tcp_out_options {
-+ u16 options; /* bit field of OPTION_* */
-+ u8 ws; /* window scale, 0 to disable */
-+ u8 num_sack_blocks;/* number of SACK blocks to include */
-+ u8 hash_size; /* bytes in hash_location */
-+ u16 mss; /* 0 to disable */
-+ __u8 *hash_location; /* temporary pointer, overloaded */
-+ __u32 tsval, tsecr; /* need to include OPTION_TS */
-+ struct tcp_fastopen_cookie *fastopen_cookie; /* Fast open cookie */
-+#ifdef CONFIG_MPTCP
-+ u16 mptcp_options; /* bit field of MPTCP related OPTION_* */
-+ u8 dss_csum:1,
-+ add_addr_v4:1,
-+ add_addr_v6:1; /* dss-checksum required? */
-+
-+ union {
-+ struct {
-+ __u64 sender_key; /* sender's key for mptcp */
-+ __u64 receiver_key; /* receiver's key for mptcp */
-+ } mp_capable;
-+
-+ struct {
-+ __u64 sender_truncated_mac;
-+ __u32 sender_nonce;
-+ /* random number of the sender */
-+ __u32 token; /* token for mptcp */
-+ u8 low_prio:1;
-+ } mp_join_syns;
-+ };
-+
-+ struct {
-+ struct in_addr addr;
-+ u8 addr_id;
-+ } add_addr4;
-+
-+ struct {
-+ struct in6_addr addr;
-+ u8 addr_id;
-+ } add_addr6;
-+
-+ u16 remove_addrs; /* list of address id */
-+ u8 addr_id; /* address id (mp_join or add_address) */
-+#endif /* CONFIG_MPTCP */
-+};
-+
- /*These are used to set the sack_ok field in struct tcp_options_received */
- #define TCP_SACK_SEEN (1 << 0) /*1 = peer is SACK capable, */
- #define TCP_FACK_ENABLED (1 << 1) /*1 = FACK is enabled locally*/
-@@ -95,6 +140,9 @@ struct tcp_options_received {
- u16 mss_clamp; /* Maximal mss, negotiated at connection setup */
- };
-
-+struct mptcp_cb;
-+struct mptcp_tcp_sock;
-+
- static inline void tcp_clear_options(struct tcp_options_received *rx_opt)
- {
- rx_opt->tstamp_ok = rx_opt->sack_ok = 0;
-@@ -111,10 +159,7 @@ struct tcp_request_sock_ops;
-
- struct tcp_request_sock {
- struct inet_request_sock req;
--#ifdef CONFIG_TCP_MD5SIG
-- /* Only used by TCP MD5 Signature so far. */
- const struct tcp_request_sock_ops *af_specific;
--#endif
- struct sock *listener; /* needed for TFO */
- u32 rcv_isn;
- u32 snt_isn;
-@@ -130,6 +175,8 @@ static inline struct tcp_request_sock *tcp_rsk(const struct request_sock *req)
- return (struct tcp_request_sock *)req;
- }
-
-+struct tcp_md5sig_key;
-+
- struct tcp_sock {
- /* inet_connection_sock has to be the first member of tcp_sock */
- struct inet_connection_sock inet_conn;
-@@ -326,6 +373,37 @@ struct tcp_sock {
- * socket. Used to retransmit SYNACKs etc.
- */
- struct request_sock *fastopen_rsk;
-+
-+ /* MPTCP/TCP-specific callbacks */
-+ const struct tcp_sock_ops *ops;
-+
-+ struct mptcp_cb *mpcb;
-+ struct sock *meta_sk;
-+ /* We keep these flags even if CONFIG_MPTCP is not checked, because
-+ * it allows checking MPTCP capability just by checking the mpc flag,
-+ * rather than adding ifdefs everywhere.
-+ */
-+ u16 mpc:1, /* Other end is multipath capable */
-+ inside_tk_table:1, /* Is the tcp_sock inside the token-table? */
-+ send_mp_fclose:1,
-+ request_mptcp:1, /* Did we send out an MP_CAPABLE?
-+ * (this speeds up mptcp_doit() in tcp_recvmsg)
-+ */
-+	   mptcp_enabled:1, /* Is MPTCP enabled from the application? */
-+ pf:1, /* Potentially Failed state: when this flag is set, we
-+ * stop using the subflow
-+ */
-+ mp_killed:1, /* Killed with a tcp_done in mptcp? */
-+ was_meta_sk:1, /* This was a meta sk (in case of reuse) */
-+	   is_master_sk:1,
-+ close_it:1, /* Must close socket in mptcp_data_ready? */
-+ closing:1;
-+ struct mptcp_tcp_sock *mptcp;
-+#ifdef CONFIG_MPTCP
-+ struct hlist_nulls_node tk_table;
-+ u32 mptcp_loc_token;
-+ u64 mptcp_loc_key;
-+#endif /* CONFIG_MPTCP */
- };
-
- enum tsq_flags {
-@@ -337,6 +415,8 @@ enum tsq_flags {
- TCP_MTU_REDUCED_DEFERRED, /* tcp_v{4|6}_err() could not call
- * tcp_v{4|6}_mtu_reduced()
- */
-+ MPTCP_PATH_MANAGER, /* MPTCP deferred creation of new subflows */
-+ MPTCP_SUB_DEFERRED, /* A subflow got deferred - process them */
- };
-
- static inline struct tcp_sock *tcp_sk(const struct sock *sk)
-@@ -355,6 +435,7 @@ struct tcp_timewait_sock {
- #ifdef CONFIG_TCP_MD5SIG
- struct tcp_md5sig_key *tw_md5_key;
- #endif
-+ struct mptcp_tw *mptcp_tw;
- };
-
- static inline struct tcp_timewait_sock *tcp_twsk(const struct sock *sk)
-diff --git a/include/net/inet6_connection_sock.h b/include/net/inet6_connection_sock.h
-index 74af137304be..83f63033897a 100644
---- a/include/net/inet6_connection_sock.h
-+++ b/include/net/inet6_connection_sock.h
-@@ -27,6 +27,8 @@ int inet6_csk_bind_conflict(const struct sock *sk,
-
- struct dst_entry *inet6_csk_route_req(struct sock *sk, struct flowi6 *fl6,
- const struct request_sock *req);
-+u32 inet6_synq_hash(const struct in6_addr *raddr, const __be16 rport,
-+ const u32 rnd, const u32 synq_hsize);
-
- struct request_sock *inet6_csk_search_req(const struct sock *sk,
- struct request_sock ***prevp,
-diff --git a/include/net/inet_common.h b/include/net/inet_common.h
-index fe7994c48b75..780f229f46a8 100644
---- a/include/net/inet_common.h
-+++ b/include/net/inet_common.h
-@@ -1,6 +1,8 @@
- #ifndef _INET_COMMON_H
- #define _INET_COMMON_H
-
-+#include <net/sock.h>
-+
- extern const struct proto_ops inet_stream_ops;
- extern const struct proto_ops inet_dgram_ops;
-
-@@ -13,6 +15,8 @@ struct sock;
- struct sockaddr;
- struct socket;
-
-+int inet_create(struct net *net, struct socket *sock, int protocol, int kern);
-+int inet6_create(struct net *net, struct socket *sock, int protocol, int kern);
- int inet_release(struct socket *sock);
- int inet_stream_connect(struct socket *sock, struct sockaddr *uaddr,
- int addr_len, int flags);
-diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
-index 7a4313887568..f62159e39839 100644
---- a/include/net/inet_connection_sock.h
-+++ b/include/net/inet_connection_sock.h
-@@ -30,6 +30,7 @@
-
- struct inet_bind_bucket;
- struct tcp_congestion_ops;
-+struct tcp_options_received;
-
- /*
- * Pointers to address related TCP functions
-@@ -243,6 +244,9 @@ static inline void inet_csk_reset_xmit_timer(struct sock *sk, const int what,
-
- struct sock *inet_csk_accept(struct sock *sk, int flags, int *err);
-
-+u32 inet_synq_hash(const __be32 raddr, const __be16 rport, const u32 rnd,
-+ const u32 synq_hsize);
-+
- struct request_sock *inet_csk_search_req(const struct sock *sk,
- struct request_sock ***prevp,
- const __be16 rport,
-diff --git a/include/net/inet_sock.h b/include/net/inet_sock.h
-index b1edf17bec01..6a32d8d6b85e 100644
---- a/include/net/inet_sock.h
-+++ b/include/net/inet_sock.h
-@@ -86,10 +86,14 @@ struct inet_request_sock {
- wscale_ok : 1,
- ecn_ok : 1,
- acked : 1,
-- no_srccheck: 1;
-+ no_srccheck: 1,
-+ mptcp_rqsk : 1,
-+ saw_mpc : 1;
- kmemcheck_bitfield_end(flags);
-- struct ip_options_rcu *opt;
-- struct sk_buff *pktopts;
-+ union {
-+ struct ip_options_rcu *opt;
-+ struct sk_buff *pktopts;
-+ };
- u32 ir_mark;
- };
-
-diff --git a/include/net/mptcp.h b/include/net/mptcp.h
-new file mode 100644
-index 000000000000..712780fc39e4
---- /dev/null
-+++ b/include/net/mptcp.h
-@@ -0,0 +1,1439 @@
-+/*
-+ * MPTCP implementation
-+ *
-+ * Initial Design & Implementation:
-+ * Sébastien Barré <sebastien.barre@uclouvain.be>
-+ *
-+ * Current Maintainer & Author:
-+ * Christoph Paasch <christoph.paasch@uclouvain.be>
-+ *
-+ * Additional authors:
-+ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
-+ * Gregory Detal <gregory.detal@uclouvain.be>
-+ * Fabien Duchêne <fabien.duchene@uclouvain.be>
-+ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
-+ * Lavkesh Lahngir <lavkesh51@gmail.com>
-+ * Andreas Ripke <ripke@neclab.eu>
-+ * Vlad Dogaru <vlad.dogaru@intel.com>
-+ * Octavian Purdila <octavian.purdila@intel.com>
-+ * John Ronan <jronan@tssg.org>
-+ * Catalin Nicutar <catalin.nicutar@gmail.com>
-+ * Brandon Heller <brandonh@stanford.edu>
-+ *
-+ *
-+ * This program is free software; you can redistribute it and/or
-+ * modify it under the terms of the GNU General Public License
-+ * as published by the Free Software Foundation; either version
-+ * 2 of the License, or (at your option) any later version.
-+ */
-+
-+#ifndef _MPTCP_H
-+#define _MPTCP_H
-+
-+#include <linux/inetdevice.h>
-+#include <linux/ipv6.h>
-+#include <linux/list.h>
-+#include <linux/net.h>
-+#include <linux/netpoll.h>
-+#include <linux/skbuff.h>
-+#include <linux/socket.h>
-+#include <linux/tcp.h>
-+#include <linux/kernel.h>
-+
-+#include <asm/byteorder.h>
-+#include <asm/unaligned.h>
-+#include <crypto/hash.h>
-+#include <net/tcp.h>
-+
-+#if defined(__LITTLE_ENDIAN_BITFIELD)
-+ #define ntohll(x) be64_to_cpu(x)
-+ #define htonll(x) cpu_to_be64(x)
-+#elif defined(__BIG_ENDIAN_BITFIELD)
-+ #define ntohll(x) (x)
-+ #define htonll(x) (x)
-+#endif
-+
-+struct mptcp_loc4 {
-+ u8 loc4_id;
-+ u8 low_prio:1;
-+ struct in_addr addr;
-+};
-+
-+struct mptcp_rem4 {
-+ u8 rem4_id;
-+ __be16 port;
-+ struct in_addr addr;
-+};
-+
-+struct mptcp_loc6 {
-+ u8 loc6_id;
-+ u8 low_prio:1;
-+ struct in6_addr addr;
-+};
-+
-+struct mptcp_rem6 {
-+ u8 rem6_id;
-+ __be16 port;
-+ struct in6_addr addr;
-+};
-+
-+struct mptcp_request_sock {
-+ struct tcp_request_sock req;
-+	/* hlist-nulls entry to the hash-table. Depending on whether this is
-+	 * a new MPTCP connection or an additional subflow, the request-socket
-+ * is either in the mptcp_reqsk_tk_htb or mptcp_reqsk_htb.
-+ */
-+ struct hlist_nulls_node hash_entry;
-+
-+ union {
-+ struct {
-+ /* Only on initial subflows */
-+ u64 mptcp_loc_key;
-+ u64 mptcp_rem_key;
-+ u32 mptcp_loc_token;
-+ };
-+
-+ struct {
-+ /* Only on additional subflows */
-+ struct mptcp_cb *mptcp_mpcb;
-+ u32 mptcp_rem_nonce;
-+ u32 mptcp_loc_nonce;
-+ u64 mptcp_hash_tmac;
-+ };
-+ };
-+
-+ u8 loc_id;
-+ u8 rem_id; /* Address-id in the MP_JOIN */
-+ u8 dss_csum:1,
-+ is_sub:1, /* Is this a new subflow? */
-+ low_prio:1, /* Interface set to low-prio? */
-+ rcv_low_prio:1;
-+};
-+
-+struct mptcp_options_received {
-+ u16 saw_mpc:1,
-+ dss_csum:1,
-+ drop_me:1,
-+
-+ is_mp_join:1,
-+ join_ack:1,
-+
-+ saw_low_prio:2, /* 0x1 - low-prio set for this subflow
-+ * 0x2 - low-prio set for another subflow
-+ */
-+ low_prio:1,
-+
-+ saw_add_addr:2, /* Saw at least one add_addr option:
-+ * 0x1: IPv4 - 0x2: IPv6
-+ */
-+ more_add_addr:1, /* Saw one more add-addr. */
-+
-+ saw_rem_addr:1, /* Saw at least one rem_addr option */
-+ more_rem_addr:1, /* Saw one more rem-addr. */
-+
-+ mp_fail:1,
-+ mp_fclose:1;
-+ u8 rem_id; /* Address-id in the MP_JOIN */
-+ u8 prio_addr_id; /* Address-id in the MP_PRIO */
-+
-+ const unsigned char *add_addr_ptr; /* Pointer to add-address option */
-+ const unsigned char *rem_addr_ptr; /* Pointer to rem-address option */
-+
-+ u32 data_ack;
-+ u32 data_seq;
-+ u16 data_len;
-+
-+ u32 mptcp_rem_token;/* Remote token */
-+
-+ /* Key inside the option (from mp_capable or fast_close) */
-+ u64 mptcp_key;
-+
-+ u32 mptcp_recv_nonce;
-+ u64 mptcp_recv_tmac;
-+ u8 mptcp_recv_mac[20];
-+};
-+
-+struct mptcp_tcp_sock {
-+ struct tcp_sock *next; /* Next subflow socket */
-+ struct hlist_node cb_list;
-+ struct mptcp_options_received rx_opt;
-+
-+ /* Those three fields record the current mapping */
-+ u64 map_data_seq;
-+ u32 map_subseq;
-+ u16 map_data_len;
-+ u16 slave_sk:1,
-+ fully_established:1,
-+ establish_increased:1,
-+ second_packet:1,
-+ attached:1,
-+ send_mp_fail:1,
-+ include_mpc:1,
-+ mapping_present:1,
-+ map_data_fin:1,
-+ low_prio:1, /* use this socket as backup */
-+ rcv_low_prio:1, /* Peer sent low-prio option to us */
-+ send_mp_prio:1, /* Trigger to send mp_prio on this socket */
-+ pre_established:1; /* State between sending 3rd ACK and
-+ * receiving the fourth ack of new subflows.
-+ */
-+
-+ /* isn: needed to translate abs to relative subflow seqnums */
-+ u32 snt_isn;
-+ u32 rcv_isn;
-+ u8 path_index;
-+ u8 loc_id;
-+ u8 rem_id;
-+
-+#define MPTCP_SCHED_SIZE 4
-+ u8 mptcp_sched[MPTCP_SCHED_SIZE] __aligned(8);
-+
-+ struct sk_buff *shortcut_ofoqueue; /* Shortcut to the current modified
-+ * skb in the ofo-queue.
-+ */
-+
-+ int init_rcv_wnd;
-+ u32 infinite_cutoff_seq;
-+ struct delayed_work work;
-+ u32 mptcp_loc_nonce;
-+	struct tcp_sock	*tp; /* Back-pointer to the owning tcp_sock */
-+ u32 last_end_data_seq;
-+
-+ /* MP_JOIN subflow: timer for retransmitting the 3rd ack */
-+ struct timer_list mptcp_ack_timer;
-+
-+ /* HMAC of the third ack */
-+ char sender_mac[20];
-+};
-+
-+struct mptcp_tw {
-+ struct list_head list;
-+ u64 loc_key;
-+ u64 rcv_nxt;
-+ struct mptcp_cb __rcu *mpcb;
-+ u8 meta_tw:1,
-+ in_list:1;
-+};
-+
-+#define MPTCP_PM_NAME_MAX 16
-+struct mptcp_pm_ops {
-+ struct list_head list;
-+
-+ /* Signal the creation of a new MPTCP-session. */
-+ void (*new_session)(const struct sock *meta_sk);
-+ void (*release_sock)(struct sock *meta_sk);
-+ void (*fully_established)(struct sock *meta_sk);
-+ void (*new_remote_address)(struct sock *meta_sk);
-+ int (*get_local_id)(sa_family_t family, union inet_addr *addr,
-+ struct net *net, bool *low_prio);
-+ void (*addr_signal)(struct sock *sk, unsigned *size,
-+ struct tcp_out_options *opts, struct sk_buff *skb);
-+ void (*add_raddr)(struct mptcp_cb *mpcb, const union inet_addr *addr,
-+ sa_family_t family, __be16 port, u8 id);
-+ void (*rem_raddr)(struct mptcp_cb *mpcb, u8 rem_id);
-+ void (*init_subsocket_v4)(struct sock *sk, struct in_addr addr);
-+ void (*init_subsocket_v6)(struct sock *sk, struct in6_addr addr);
-+
-+ char name[MPTCP_PM_NAME_MAX];
-+ struct module *owner;
-+};
-+
-+#define MPTCP_SCHED_NAME_MAX 16
-+struct mptcp_sched_ops {
-+ struct list_head list;
-+
-+ struct sock * (*get_subflow)(struct sock *meta_sk,
-+ struct sk_buff *skb,
-+ bool zero_wnd_test);
-+ struct sk_buff * (*next_segment)(struct sock *meta_sk,
-+ int *reinject,
-+ struct sock **subsk,
-+ unsigned int *limit);
-+ void (*init)(struct sock *sk);
-+
-+ char name[MPTCP_SCHED_NAME_MAX];
-+ struct module *owner;
-+};
-+
-+struct mptcp_cb {
-+ /* list of sockets in this multipath connection */
-+ struct tcp_sock *connection_list;
-+ /* list of sockets that need a call to release_cb */
-+ struct hlist_head callback_list;
-+
-+ /* High-order bits of 64-bit sequence numbers */
-+ u32 snd_high_order[2];
-+ u32 rcv_high_order[2];
-+
-+ u16 send_infinite_mapping:1,
-+ in_time_wait:1,
-+ list_rcvd:1, /* XXX TO REMOVE */
-+ addr_signal:1, /* Path-manager wants us to call addr_signal */
-+ dss_csum:1,
-+ server_side:1,
-+ infinite_mapping_rcv:1,
-+ infinite_mapping_snd:1,
-+ dfin_combined:1, /* Was the DFIN combined with subflow-fin? */
-+ passive_close:1,
-+ snd_hiseq_index:1, /* Index in snd_high_order of snd_nxt */
-+ rcv_hiseq_index:1; /* Index in rcv_high_order of rcv_nxt */
-+
-+ /* socket count in this connection */
-+ u8 cnt_subflows;
-+ u8 cnt_established;
-+
-+ struct mptcp_sched_ops *sched_ops;
-+
-+ struct sk_buff_head reinject_queue;
-+ /* First cache-line boundary is here minus 8 bytes. But from the
-+ * reinject-queue only the next and prev pointers are regularly
-+ * accessed. Thus, the whole data-path is on a single cache-line.
-+ */
-+
-+ u64 csum_cutoff_seq;
-+
-+ /***** Start of fields, used for connection closure */
-+ spinlock_t tw_lock;
-+ unsigned char mptw_state;
-+ u8 dfin_path_index;
-+
-+ struct list_head tw_list;
-+
-+ /***** Start of fields, used for subflow establishment and closure */
-+ atomic_t mpcb_refcnt;
-+
-+ /* Mutex needed, because otherwise mptcp_close will complain that the
-+ * socket is owned by the user.
-+ * E.g., mptcp_sub_close_wq is taking the meta-lock.
-+ */
-+ struct mutex mpcb_mutex;
-+
-+ /***** Start of fields, used for subflow establishment */
-+ struct sock *meta_sk;
-+
-+ /* Master socket, also part of the connection_list, this
-+ * socket is the one that the application sees.
-+ */
-+ struct sock *master_sk;
-+
-+ __u64 mptcp_loc_key;
-+ __u64 mptcp_rem_key;
-+ __u32 mptcp_loc_token;
-+ __u32 mptcp_rem_token;
-+
-+#define MPTCP_PM_SIZE 608
-+ u8 mptcp_pm[MPTCP_PM_SIZE] __aligned(8);
-+ struct mptcp_pm_ops *pm_ops;
-+
-+ u32 path_index_bits;
-+ /* Next pi to pick up in case a new path becomes available */
-+ u8 next_path_index;
-+
-+ /* Original snd/rcvbuf of the initial subflow.
-+ * Used for the new subflows on the server-side to allow correct
-+ * autotuning
-+ */
-+ int orig_sk_rcvbuf;
-+ int orig_sk_sndbuf;
-+ u32 orig_window_clamp;
-+
-+ /* Timer for retransmitting SYN/ACK+MP_JOIN */
-+ struct timer_list synack_timer;
-+};
-+
-+#define MPTCP_SUB_CAPABLE 0
-+#define MPTCP_SUB_LEN_CAPABLE_SYN 12
-+#define MPTCP_SUB_LEN_CAPABLE_SYN_ALIGN 12
-+#define MPTCP_SUB_LEN_CAPABLE_ACK 20
-+#define MPTCP_SUB_LEN_CAPABLE_ACK_ALIGN 20
-+
-+#define MPTCP_SUB_JOIN 1
-+#define MPTCP_SUB_LEN_JOIN_SYN 12
-+#define MPTCP_SUB_LEN_JOIN_SYN_ALIGN 12
-+#define MPTCP_SUB_LEN_JOIN_SYNACK 16
-+#define MPTCP_SUB_LEN_JOIN_SYNACK_ALIGN 16
-+#define MPTCP_SUB_LEN_JOIN_ACK 24
-+#define MPTCP_SUB_LEN_JOIN_ACK_ALIGN 24
-+
-+#define MPTCP_SUB_DSS 2
-+#define MPTCP_SUB_LEN_DSS 4
-+#define MPTCP_SUB_LEN_DSS_ALIGN 4
-+
-+/* Lengths for seq and ack are the ones without the generic MPTCP-option header,
-+ * as they are part of the DSS-option.
-+ * To get the total length, just add the different options together.
-+ */
-+#define MPTCP_SUB_LEN_SEQ 10
-+#define MPTCP_SUB_LEN_SEQ_CSUM 12
-+#define MPTCP_SUB_LEN_SEQ_ALIGN 12
-+
-+#define MPTCP_SUB_LEN_SEQ_64 14
-+#define MPTCP_SUB_LEN_SEQ_CSUM_64 16
-+#define MPTCP_SUB_LEN_SEQ_64_ALIGN 16
-+
-+#define MPTCP_SUB_LEN_ACK 4
-+#define MPTCP_SUB_LEN_ACK_ALIGN 4
-+
-+#define MPTCP_SUB_LEN_ACK_64 8
-+#define MPTCP_SUB_LEN_ACK_64_ALIGN 8
-+
-+/* This is the "default" option-length we will send out most often.
-+ * MPTCP DSS-header
-+ * 32-bit data sequence number
-+ * 32-bit data ack
-+ *
-+ * It is necessary to calculate the effective MSS we will be using when
-+ * sending data.
-+ */
-+#define MPTCP_SUB_LEN_DSM_ALIGN (MPTCP_SUB_LEN_DSS_ALIGN + \
-+ MPTCP_SUB_LEN_SEQ_ALIGN + \
-+ MPTCP_SUB_LEN_ACK_ALIGN)
-+
-+#define MPTCP_SUB_ADD_ADDR 3
-+#define MPTCP_SUB_LEN_ADD_ADDR4 8
-+#define MPTCP_SUB_LEN_ADD_ADDR6 20
-+#define MPTCP_SUB_LEN_ADD_ADDR4_ALIGN 8
-+#define MPTCP_SUB_LEN_ADD_ADDR6_ALIGN 20
-+
-+#define MPTCP_SUB_REMOVE_ADDR 4
-+#define MPTCP_SUB_LEN_REMOVE_ADDR 4
-+
-+#define MPTCP_SUB_PRIO 5
-+#define MPTCP_SUB_LEN_PRIO 3
-+#define MPTCP_SUB_LEN_PRIO_ADDR 4
-+#define MPTCP_SUB_LEN_PRIO_ALIGN 4
-+
-+#define MPTCP_SUB_FAIL 6
-+#define MPTCP_SUB_LEN_FAIL 12
-+#define MPTCP_SUB_LEN_FAIL_ALIGN 12
-+
-+#define MPTCP_SUB_FCLOSE 7
-+#define MPTCP_SUB_LEN_FCLOSE 12
-+#define MPTCP_SUB_LEN_FCLOSE_ALIGN 12
-+
-+
-+#define OPTION_MPTCP (1 << 5)
-+
-+#ifdef CONFIG_MPTCP
-+
-+/* Used for checking if the mptcp initialization has been successful */
-+extern bool mptcp_init_failed;
-+
-+/* MPTCP options */
-+#define OPTION_TYPE_SYN (1 << 0)
-+#define OPTION_TYPE_SYNACK (1 << 1)
-+#define OPTION_TYPE_ACK (1 << 2)
-+#define OPTION_MP_CAPABLE (1 << 3)
-+#define OPTION_DATA_ACK (1 << 4)
-+#define OPTION_ADD_ADDR (1 << 5)
-+#define OPTION_MP_JOIN (1 << 6)
-+#define OPTION_MP_FAIL (1 << 7)
-+#define OPTION_MP_FCLOSE (1 << 8)
-+#define OPTION_REMOVE_ADDR (1 << 9)
-+#define OPTION_MP_PRIO (1 << 10)
-+
-+/* MPTCP flags: both TX and RX */
-+#define MPTCPHDR_SEQ 0x01 /* DSS.M option is present */
-+#define MPTCPHDR_FIN 0x02 /* DSS.F option is present */
-+#define MPTCPHDR_SEQ64_INDEX 0x04 /* index of seq in mpcb->snd_high_order */
-+/* MPTCP flags: RX only */
-+#define MPTCPHDR_ACK 0x08
-+#define MPTCPHDR_SEQ64_SET	0x10 /* Did we receive a 64-bit seq number? */
-+#define MPTCPHDR_SEQ64_OFO 0x20 /* Is it not in our circular array? */
-+#define MPTCPHDR_DSS_CSUM 0x40
-+#define MPTCPHDR_JOIN 0x80
-+/* MPTCP flags: TX only */
-+#define MPTCPHDR_INF 0x08
-+
-+struct mptcp_option {
-+ __u8 kind;
-+ __u8 len;
-+#if defined(__LITTLE_ENDIAN_BITFIELD)
-+ __u8 ver:4,
-+ sub:4;
-+#elif defined(__BIG_ENDIAN_BITFIELD)
-+ __u8 sub:4,
-+ ver:4;
-+#else
-+#error "Adjust your <asm/byteorder.h> defines"
-+#endif
-+};
-+
-+struct mp_capable {
-+ __u8 kind;
-+ __u8 len;
-+#if defined(__LITTLE_ENDIAN_BITFIELD)
-+ __u8 ver:4,
-+ sub:4;
-+ __u8 h:1,
-+ rsv:5,
-+ b:1,
-+ a:1;
-+#elif defined(__BIG_ENDIAN_BITFIELD)
-+ __u8 sub:4,
-+ ver:4;
-+ __u8 a:1,
-+ b:1,
-+ rsv:5,
-+ h:1;
-+#else
-+#error "Adjust your <asm/byteorder.h> defines"
-+#endif
-+ __u64 sender_key;
-+ __u64 receiver_key;
-+} __attribute__((__packed__));
-+
-+struct mp_join {
-+ __u8 kind;
-+ __u8 len;
-+#if defined(__LITTLE_ENDIAN_BITFIELD)
-+ __u8 b:1,
-+ rsv:3,
-+ sub:4;
-+#elif defined(__BIG_ENDIAN_BITFIELD)
-+ __u8 sub:4,
-+ rsv:3,
-+ b:1;
-+#else
-+#error "Adjust your <asm/byteorder.h> defines"
-+#endif
-+ __u8 addr_id;
-+ union {
-+ struct {
-+ u32 token;
-+ u32 nonce;
-+ } syn;
-+ struct {
-+ __u64 mac;
-+ u32 nonce;
-+ } synack;
-+ struct {
-+ __u8 mac[20];
-+ } ack;
-+ } u;
-+} __attribute__((__packed__));
-+
-+struct mp_dss {
-+ __u8 kind;
-+ __u8 len;
-+#if defined(__LITTLE_ENDIAN_BITFIELD)
-+ __u16 rsv1:4,
-+ sub:4,
-+ A:1,
-+ a:1,
-+ M:1,
-+ m:1,
-+ F:1,
-+ rsv2:3;
-+#elif defined(__BIG_ENDIAN_BITFIELD)
-+ __u16 sub:4,
-+ rsv1:4,
-+ rsv2:3,
-+ F:1,
-+ m:1,
-+ M:1,
-+ a:1,
-+ A:1;
-+#else
-+#error "Adjust your <asm/byteorder.h> defines"
-+#endif
-+};
-+
-+struct mp_add_addr {
-+ __u8 kind;
-+ __u8 len;
-+#if defined(__LITTLE_ENDIAN_BITFIELD)
-+ __u8 ipver:4,
-+ sub:4;
-+#elif defined(__BIG_ENDIAN_BITFIELD)
-+ __u8 sub:4,
-+ ipver:4;
-+#else
-+#error "Adjust your <asm/byteorder.h> defines"
-+#endif
-+ __u8 addr_id;
-+ union {
-+ struct {
-+ struct in_addr addr;
-+ __be16 port;
-+ } v4;
-+ struct {
-+ struct in6_addr addr;
-+ __be16 port;
-+ } v6;
-+ } u;
-+} __attribute__((__packed__));
-+
-+struct mp_remove_addr {
-+ __u8 kind;
-+ __u8 len;
-+#if defined(__LITTLE_ENDIAN_BITFIELD)
-+ __u8 rsv:4,
-+ sub:4;
-+#elif defined(__BIG_ENDIAN_BITFIELD)
-+ __u8 sub:4,
-+ rsv:4;
-+#else
-+#error "Adjust your <asm/byteorder.h> defines"
-+#endif
-+ /* list of addr_id */
-+ __u8 addrs_id;
-+};
-+
-+struct mp_fail {
-+ __u8 kind;
-+ __u8 len;
-+#if defined(__LITTLE_ENDIAN_BITFIELD)
-+ __u16 rsv1:4,
-+ sub:4,
-+ rsv2:8;
-+#elif defined(__BIG_ENDIAN_BITFIELD)
-+ __u16 sub:4,
-+ rsv1:4,
-+ rsv2:8;
-+#else
-+#error "Adjust your <asm/byteorder.h> defines"
-+#endif
-+ __be64 data_seq;
-+} __attribute__((__packed__));
-+
-+struct mp_fclose {
-+ __u8 kind;
-+ __u8 len;
-+#if defined(__LITTLE_ENDIAN_BITFIELD)
-+ __u16 rsv1:4,
-+ sub:4,
-+ rsv2:8;
-+#elif defined(__BIG_ENDIAN_BITFIELD)
-+ __u16 sub:4,
-+ rsv1:4,
-+ rsv2:8;
-+#else
-+#error "Adjust your <asm/byteorder.h> defines"
-+#endif
-+ __u64 key;
-+} __attribute__((__packed__));
-+
-+struct mp_prio {
-+ __u8 kind;
-+ __u8 len;
-+#if defined(__LITTLE_ENDIAN_BITFIELD)
-+ __u8 b:1,
-+ rsv:3,
-+ sub:4;
-+#elif defined(__BIG_ENDIAN_BITFIELD)
-+ __u8 sub:4,
-+ rsv:3,
-+ b:1;
-+#else
-+#error "Adjust your <asm/byteorder.h> defines"
-+#endif
-+ __u8 addr_id;
-+} __attribute__((__packed__));
-+
-+static inline int mptcp_sub_len_dss(const struct mp_dss *m, const int csum)
-+{
-+ return 4 + m->A * (4 + m->a * 4) + m->M * (10 + m->m * 4 + csum * 2);
-+}
-+
-+#define MPTCP_APP 2
-+
-+extern int sysctl_mptcp_enabled;
-+extern int sysctl_mptcp_checksum;
-+extern int sysctl_mptcp_debug;
-+extern int sysctl_mptcp_syn_retries;
-+
-+extern struct workqueue_struct *mptcp_wq;
-+
-+#define mptcp_debug(fmt, args...) \
-+ do { \
-+ if (unlikely(sysctl_mptcp_debug)) \
-+ pr_err(__FILE__ ": " fmt, ##args); \
-+ } while (0)
-+
-+/* Iterates over all subflows */
-+#define mptcp_for_each_tp(mpcb, tp) \
-+ for ((tp) = (mpcb)->connection_list; (tp); (tp) = (tp)->mptcp->next)
-+
-+#define mptcp_for_each_sk(mpcb, sk) \
-+ for ((sk) = (struct sock *)(mpcb)->connection_list; \
-+ sk; \
-+ sk = (struct sock *)tcp_sk(sk)->mptcp->next)
-+
-+#define mptcp_for_each_sk_safe(__mpcb, __sk, __temp) \
-+ for (__sk = (struct sock *)(__mpcb)->connection_list, \
-+ __temp = __sk ? (struct sock *)tcp_sk(__sk)->mptcp->next : NULL; \
-+ __sk; \
-+ __sk = __temp, \
-+ __temp = __sk ? (struct sock *)tcp_sk(__sk)->mptcp->next : NULL)
-+
-+/* Iterates over all bit set to 1 in a bitset */
-+#define mptcp_for_each_bit_set(b, i) \
-+ for (i = ffs(b) - 1; i >= 0; i = ffs(b >> (i + 1) << (i + 1)) - 1)
-+
-+#define mptcp_for_each_bit_unset(b, i) \
-+ mptcp_for_each_bit_set(~b, i)
-+
-+extern struct lock_class_key meta_key;
-+extern struct lock_class_key meta_slock_key;
-+extern u32 mptcp_secret[MD5_MESSAGE_BYTES / 4];
-+
-+/* This is needed to ensure that two subsequent key/nonce-generation result in
-+ * different keys/nonces if the IPs and ports are the same.
-+ */
-+extern u32 mptcp_seed;
-+
-+#define MPTCP_HASH_SIZE 1024
-+
-+extern struct hlist_nulls_head tk_hashtable[MPTCP_HASH_SIZE];
-+
-+/* This second hashtable is needed to retrieve request socks
-+ * created as a result of a join request. While the SYN contains
-+ * the token, the final ack does not, so we need a separate hashtable
-+ * to retrieve the mpcb.
-+ */
-+extern struct hlist_nulls_head mptcp_reqsk_htb[MPTCP_HASH_SIZE];
-+extern spinlock_t mptcp_reqsk_hlock; /* hashtable protection */
-+
-+/* Lock, protecting the two hash-tables that hold the token. Namely,
-+ * mptcp_reqsk_tk_htb and tk_hashtable
-+ */
-+extern spinlock_t mptcp_tk_hashlock; /* hashtable protection */
-+
-+/* Request-sockets can be hashed in the tk_htb for collision-detection or in
-+ * the regular htb for join-connections. We need to define different NULLS
-+ * values so that we can correctly detect a request-socket that has been
-+ * recycled. See also c25eb3bfb9729.
-+ */
-+#define MPTCP_REQSK_NULLS_BASE (1U << 29)
-+
-+
-+void mptcp_data_ready(struct sock *sk);
-+void mptcp_write_space(struct sock *sk);
-+
-+void mptcp_add_meta_ofo_queue(const struct sock *meta_sk, struct sk_buff *skb,
-+ struct sock *sk);
-+void mptcp_ofo_queue(struct sock *meta_sk);
-+void mptcp_purge_ofo_queue(struct tcp_sock *meta_tp);
-+void mptcp_cleanup_rbuf(struct sock *meta_sk, int copied);
-+int mptcp_add_sock(struct sock *meta_sk, struct sock *sk, u8 loc_id, u8 rem_id,
-+ gfp_t flags);
-+void mptcp_del_sock(struct sock *sk);
-+void mptcp_update_metasocket(struct sock *sock, const struct sock *meta_sk);
-+void mptcp_reinject_data(struct sock *orig_sk, int clone_it);
-+void mptcp_update_sndbuf(const struct tcp_sock *tp);
-+void mptcp_send_fin(struct sock *meta_sk);
-+void mptcp_send_active_reset(struct sock *meta_sk, gfp_t priority);
-+bool mptcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
-+ int push_one, gfp_t gfp);
-+void tcp_parse_mptcp_options(const struct sk_buff *skb,
-+ struct mptcp_options_received *mopt);
-+void mptcp_parse_options(const uint8_t *ptr, int opsize,
-+ struct mptcp_options_received *mopt,
-+ const struct sk_buff *skb);
-+void mptcp_syn_options(const struct sock *sk, struct tcp_out_options *opts,
-+ unsigned *remaining);
-+void mptcp_synack_options(struct request_sock *req,
-+ struct tcp_out_options *opts,
-+ unsigned *remaining);
-+void mptcp_established_options(struct sock *sk, struct sk_buff *skb,
-+ struct tcp_out_options *opts, unsigned *size);
-+void mptcp_options_write(__be32 *ptr, struct tcp_sock *tp,
-+ const struct tcp_out_options *opts,
-+ struct sk_buff *skb);
-+void mptcp_close(struct sock *meta_sk, long timeout);
-+int mptcp_doit(struct sock *sk);
-+int mptcp_create_master_sk(struct sock *meta_sk, __u64 remote_key, u32 window);
-+int mptcp_check_req_fastopen(struct sock *child, struct request_sock *req);
-+int mptcp_check_req_master(struct sock *sk, struct sock *child,
-+ struct request_sock *req,
-+ struct request_sock **prev);
-+struct sock *mptcp_check_req_child(struct sock *sk, struct sock *child,
-+ struct request_sock *req,
-+ struct request_sock **prev,
-+ const struct mptcp_options_received *mopt);
-+u32 __mptcp_select_window(struct sock *sk);
-+void mptcp_select_initial_window(int __space, __u32 mss, __u32 *rcv_wnd,
-+ __u32 *window_clamp, int wscale_ok,
-+ __u8 *rcv_wscale, __u32 init_rcv_wnd,
-+ const struct sock *sk);
-+unsigned int mptcp_current_mss(struct sock *meta_sk);
-+int mptcp_select_size(const struct sock *meta_sk, bool sg);
-+void mptcp_key_sha1(u64 key, u32 *token, u64 *idsn);
-+void mptcp_hmac_sha1(u8 *key_1, u8 *key_2, u8 *rand_1, u8 *rand_2,
-+ u32 *hash_out);
-+void mptcp_clean_rtx_infinite(const struct sk_buff *skb, struct sock *sk);
-+void mptcp_fin(struct sock *meta_sk);
-+void mptcp_retransmit_timer(struct sock *meta_sk);
-+int mptcp_write_wakeup(struct sock *meta_sk);
-+void mptcp_sub_close_wq(struct work_struct *work);
-+void mptcp_sub_close(struct sock *sk, unsigned long delay);
-+struct sock *mptcp_select_ack_sock(const struct sock *meta_sk);
-+void mptcp_fallback_meta_sk(struct sock *meta_sk);
-+int mptcp_backlog_rcv(struct sock *meta_sk, struct sk_buff *skb);
-+void mptcp_ack_handler(unsigned long);
-+int mptcp_check_rtt(const struct tcp_sock *tp, int time);
-+int mptcp_check_snd_buf(const struct tcp_sock *tp);
-+int mptcp_handle_options(struct sock *sk, const struct tcphdr *th,
-+ const struct sk_buff *skb);
-+void __init mptcp_init(void);
-+int mptcp_trim_head(struct sock *sk, struct sk_buff *skb, u32 len);
-+void mptcp_destroy_sock(struct sock *sk);
-+int mptcp_rcv_synsent_state_process(struct sock *sk, struct sock **skptr,
-+ const struct sk_buff *skb,
-+ const struct mptcp_options_received *mopt);
-+unsigned int mptcp_xmit_size_goal(const struct sock *meta_sk, u32 mss_now,
-+ int large_allowed);
-+int mptcp_init_tw_sock(struct sock *sk, struct tcp_timewait_sock *tw);
-+void mptcp_twsk_destructor(struct tcp_timewait_sock *tw);
-+void mptcp_time_wait(struct sock *sk, int state, int timeo);
-+void mptcp_disconnect(struct sock *sk);
-+bool mptcp_should_expand_sndbuf(const struct sock *sk);
-+int mptcp_retransmit_skb(struct sock *meta_sk, struct sk_buff *skb);
-+void mptcp_tsq_flags(struct sock *sk);
-+void mptcp_tsq_sub_deferred(struct sock *meta_sk);
-+struct mp_join *mptcp_find_join(const struct sk_buff *skb);
-+void mptcp_hash_remove_bh(struct tcp_sock *meta_tp);
-+void mptcp_hash_remove(struct tcp_sock *meta_tp);
-+struct sock *mptcp_hash_find(const struct net *net, const u32 token);
-+int mptcp_lookup_join(struct sk_buff *skb, struct inet_timewait_sock *tw);
-+int mptcp_do_join_short(struct sk_buff *skb,
-+ const struct mptcp_options_received *mopt,
-+ struct net *net);
-+void mptcp_reqsk_destructor(struct request_sock *req);
-+void mptcp_reqsk_new_mptcp(struct request_sock *req,
-+ const struct mptcp_options_received *mopt,
-+ const struct sk_buff *skb);
-+int mptcp_check_req(struct sk_buff *skb, struct net *net);
-+void mptcp_connect_init(struct sock *sk);
-+void mptcp_sub_force_close(struct sock *sk);
-+int mptcp_sub_len_remove_addr_align(u16 bitfield);
-+void mptcp_remove_shortcuts(const struct mptcp_cb *mpcb,
-+ const struct sk_buff *skb);
-+void mptcp_init_buffer_space(struct sock *sk);
-+void mptcp_join_reqsk_init(struct mptcp_cb *mpcb, const struct request_sock *req,
-+ struct sk_buff *skb);
-+void mptcp_reqsk_init(struct request_sock *req, const struct sk_buff *skb);
-+int mptcp_conn_request(struct sock *sk, struct sk_buff *skb);
-+void mptcp_init_congestion_control(struct sock *sk);
-+
-+/* MPTCP-path-manager registration/initialization functions */
-+int mptcp_register_path_manager(struct mptcp_pm_ops *pm);
-+void mptcp_unregister_path_manager(struct mptcp_pm_ops *pm);
-+void mptcp_init_path_manager(struct mptcp_cb *mpcb);
-+void mptcp_cleanup_path_manager(struct mptcp_cb *mpcb);
-+void mptcp_fallback_default(struct mptcp_cb *mpcb);
-+void mptcp_get_default_path_manager(char *name);
-+int mptcp_set_default_path_manager(const char *name);
-+extern struct mptcp_pm_ops mptcp_pm_default;
-+
-+/* MPTCP-scheduler registration/initialization functions */
-+int mptcp_register_scheduler(struct mptcp_sched_ops *sched);
-+void mptcp_unregister_scheduler(struct mptcp_sched_ops *sched);
-+void mptcp_init_scheduler(struct mptcp_cb *mpcb);
-+void mptcp_cleanup_scheduler(struct mptcp_cb *mpcb);
-+void mptcp_get_default_scheduler(char *name);
-+int mptcp_set_default_scheduler(const char *name);
-+extern struct mptcp_sched_ops mptcp_sched_default;
-+
-+static inline void mptcp_reset_synack_timer(struct sock *meta_sk,
-+ unsigned long len)
-+{
-+ sk_reset_timer(meta_sk, &tcp_sk(meta_sk)->mpcb->synack_timer,
-+ jiffies + len);
-+}
-+
-+static inline void mptcp_delete_synack_timer(struct sock *meta_sk)
-+{
-+ sk_stop_timer(meta_sk, &tcp_sk(meta_sk)->mpcb->synack_timer);
-+}
-+
-+static inline bool is_mptcp_enabled(const struct sock *sk)
-+{
-+ if (!sysctl_mptcp_enabled || mptcp_init_failed)
-+ return false;
-+
-+ if (sysctl_mptcp_enabled == MPTCP_APP && !tcp_sk(sk)->mptcp_enabled)
-+ return false;
-+
-+ return true;
-+}
-+
-+static inline int mptcp_pi_to_flag(int pi)
-+{
-+ return 1 << (pi - 1);
-+}
-+
-+static inline
-+struct mptcp_request_sock *mptcp_rsk(const struct request_sock *req)
-+{
-+ return (struct mptcp_request_sock *)req;
-+}
-+
-+static inline
-+struct request_sock *rev_mptcp_rsk(const struct mptcp_request_sock *req)
-+{
-+ return (struct request_sock *)req;
-+}
-+
-+static inline bool mptcp_can_sendpage(struct sock *sk)
-+{
-+ struct sock *sk_it;
-+
-+ if (tcp_sk(sk)->mpcb->dss_csum)
-+ return false;
-+
-+ mptcp_for_each_sk(tcp_sk(sk)->mpcb, sk_it) {
-+ if (!(sk_it->sk_route_caps & NETIF_F_SG) ||
-+ !(sk_it->sk_route_caps & NETIF_F_ALL_CSUM))
-+ return false;
-+ }
-+
-+ return true;
-+}
-+
-+static inline void mptcp_push_pending_frames(struct sock *meta_sk)
-+{
-+	/* We check packets_out and the send-head here. TCP only checks the
-+	 * send-head. But, MPTCP also checks packets_out, as this is an
-+	 * indication that we might want to do opportunistic reinjection.
-+ */
-+ if (tcp_sk(meta_sk)->packets_out || tcp_send_head(meta_sk)) {
-+ struct tcp_sock *tp = tcp_sk(meta_sk);
-+
-+ /* We don't care about the MSS, because it will be set in
-+ * mptcp_write_xmit.
-+ */
-+ __tcp_push_pending_frames(meta_sk, 0, tp->nonagle);
-+ }
-+}
-+
-+static inline void mptcp_send_reset(struct sock *sk)
-+{
-+ tcp_sk(sk)->ops->send_active_reset(sk, GFP_ATOMIC);
-+ mptcp_sub_force_close(sk);
-+}
-+
-+static inline bool mptcp_is_data_seq(const struct sk_buff *skb)
-+{
-+ return TCP_SKB_CB(skb)->mptcp_flags & MPTCPHDR_SEQ;
-+}
-+
-+static inline bool mptcp_is_data_fin(const struct sk_buff *skb)
-+{
-+ return TCP_SKB_CB(skb)->mptcp_flags & MPTCPHDR_FIN;
-+}
-+
-+/* Is it a data-fin while in infinite mapping mode?
-+ * In infinite mode, a subflow-fin is in fact a data-fin.
-+ */
-+static inline bool mptcp_is_data_fin2(const struct sk_buff *skb,
-+ const struct tcp_sock *tp)
-+{
-+ return mptcp_is_data_fin(skb) ||
-+ (tp->mpcb->infinite_mapping_rcv && tcp_hdr(skb)->fin);
-+}
-+
-+static inline u8 mptcp_get_64_bit(u64 data_seq, struct mptcp_cb *mpcb)
-+{
-+ u64 data_seq_high = (u32)(data_seq >> 32);
-+
-+ if (mpcb->rcv_high_order[0] == data_seq_high)
-+ return 0;
-+ else if (mpcb->rcv_high_order[1] == data_seq_high)
-+ return MPTCPHDR_SEQ64_INDEX;
-+ else
-+ return MPTCPHDR_SEQ64_OFO;
-+}
-+
-+/* Sets the data_seq and returns pointer to the in-skb field of the data_seq.
-+ * If the packet has a 64-bit dseq, the pointer points to the last 32 bits.
-+ */
-+static inline __u32 *mptcp_skb_set_data_seq(const struct sk_buff *skb,
-+ u32 *data_seq,
-+ struct mptcp_cb *mpcb)
-+{
-+ __u32 *ptr = (__u32 *)(skb_transport_header(skb) + TCP_SKB_CB(skb)->dss_off);
-+
-+ if (TCP_SKB_CB(skb)->mptcp_flags & MPTCPHDR_SEQ64_SET) {
-+ u64 data_seq64 = get_unaligned_be64(ptr);
-+
-+ if (mpcb)
-+ TCP_SKB_CB(skb)->mptcp_flags |= mptcp_get_64_bit(data_seq64, mpcb);
-+
-+ *data_seq = (u32)data_seq64;
-+ ptr++;
-+ } else {
-+ *data_seq = get_unaligned_be32(ptr);
-+ }
-+
-+ return ptr;
-+}
-+
-+static inline struct sock *mptcp_meta_sk(const struct sock *sk)
-+{
-+ return tcp_sk(sk)->meta_sk;
-+}
-+
-+static inline struct tcp_sock *mptcp_meta_tp(const struct tcp_sock *tp)
-+{
-+ return tcp_sk(tp->meta_sk);
-+}
-+
-+static inline int is_meta_tp(const struct tcp_sock *tp)
-+{
-+ return tp->mpcb && mptcp_meta_tp(tp) == tp;
-+}
-+
-+static inline int is_meta_sk(const struct sock *sk)
-+{
-+ return sk->sk_type == SOCK_STREAM && sk->sk_protocol == IPPROTO_TCP &&
-+ mptcp(tcp_sk(sk)) && mptcp_meta_sk(sk) == sk;
-+}
-+
-+static inline int is_master_tp(const struct tcp_sock *tp)
-+{
-+ return !mptcp(tp) || (!tp->mptcp->slave_sk && !is_meta_tp(tp));
-+}
-+
-+static inline void mptcp_hash_request_remove(struct request_sock *req)
-+{
-+ int in_softirq = 0;
-+
-+ if (hlist_nulls_unhashed(&mptcp_rsk(req)->hash_entry))
-+ return;
-+
-+ if (in_softirq()) {
-+ spin_lock(&mptcp_reqsk_hlock);
-+ in_softirq = 1;
-+ } else {
-+ spin_lock_bh(&mptcp_reqsk_hlock);
-+ }
-+
-+ hlist_nulls_del_init_rcu(&mptcp_rsk(req)->hash_entry);
-+
-+ if (in_softirq)
-+ spin_unlock(&mptcp_reqsk_hlock);
-+ else
-+ spin_unlock_bh(&mptcp_reqsk_hlock);
-+}
-+
-+static inline void mptcp_init_mp_opt(struct mptcp_options_received *mopt)
-+{
-+ mopt->saw_mpc = 0;
-+ mopt->dss_csum = 0;
-+ mopt->drop_me = 0;
-+
-+ mopt->is_mp_join = 0;
-+ mopt->join_ack = 0;
-+
-+ mopt->saw_low_prio = 0;
-+ mopt->low_prio = 0;
-+
-+ mopt->saw_add_addr = 0;
-+ mopt->more_add_addr = 0;
-+
-+ mopt->saw_rem_addr = 0;
-+ mopt->more_rem_addr = 0;
-+
-+ mopt->mp_fail = 0;
-+ mopt->mp_fclose = 0;
-+}
-+
-+static inline void mptcp_reset_mopt(struct tcp_sock *tp)
-+{
-+ struct mptcp_options_received *mopt = &tp->mptcp->rx_opt;
-+
-+ mopt->saw_low_prio = 0;
-+ mopt->saw_add_addr = 0;
-+ mopt->more_add_addr = 0;
-+ mopt->saw_rem_addr = 0;
-+ mopt->more_rem_addr = 0;
-+ mopt->join_ack = 0;
-+ mopt->mp_fail = 0;
-+ mopt->mp_fclose = 0;
-+}
-+
-+static inline __be32 mptcp_get_highorder_sndbits(const struct sk_buff *skb,
-+ const struct mptcp_cb *mpcb)
-+{
-+ return htonl(mpcb->snd_high_order[(TCP_SKB_CB(skb)->mptcp_flags &
-+ MPTCPHDR_SEQ64_INDEX) ? 1 : 0]);
-+}
-+
-+static inline u64 mptcp_get_data_seq_64(const struct mptcp_cb *mpcb, int index,
-+ u32 data_seq_32)
-+{
-+ return ((u64)mpcb->rcv_high_order[index] << 32) | data_seq_32;
-+}
-+
-+static inline u64 mptcp_get_rcv_nxt_64(const struct tcp_sock *meta_tp)
-+{
-+ struct mptcp_cb *mpcb = meta_tp->mpcb;
-+ return mptcp_get_data_seq_64(mpcb, mpcb->rcv_hiseq_index,
-+ meta_tp->rcv_nxt);
-+}
-+
-+static inline void mptcp_check_sndseq_wrap(struct tcp_sock *meta_tp, int inc)
-+{
-+ if (unlikely(meta_tp->snd_nxt > meta_tp->snd_nxt + inc)) {
-+ struct mptcp_cb *mpcb = meta_tp->mpcb;
-+ mpcb->snd_hiseq_index = mpcb->snd_hiseq_index ? 0 : 1;
-+ mpcb->snd_high_order[mpcb->snd_hiseq_index] += 2;
-+ }
-+}
-+
-+static inline void mptcp_check_rcvseq_wrap(struct tcp_sock *meta_tp,
-+ u32 old_rcv_nxt)
-+{
-+ if (unlikely(old_rcv_nxt > meta_tp->rcv_nxt)) {
-+ struct mptcp_cb *mpcb = meta_tp->mpcb;
-+ mpcb->rcv_high_order[mpcb->rcv_hiseq_index] += 2;
-+ mpcb->rcv_hiseq_index = mpcb->rcv_hiseq_index ? 0 : 1;
-+ }
-+}
-+
-+static inline int mptcp_sk_can_send(const struct sock *sk)
-+{
-+ return tcp_passive_fastopen(sk) ||
-+ ((1 << sk->sk_state) & (TCPF_ESTABLISHED | TCPF_CLOSE_WAIT) &&
-+ !tcp_sk(sk)->mptcp->pre_established);
-+}
-+
-+static inline int mptcp_sk_can_recv(const struct sock *sk)
-+{
-+ return (1 << sk->sk_state) & (TCPF_ESTABLISHED | TCPF_FIN_WAIT1 | TCPF_FIN_WAIT2);
-+}
-+
-+static inline int mptcp_sk_can_send_ack(const struct sock *sk)
-+{
-+ return !((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV |
-+ TCPF_CLOSE | TCPF_LISTEN)) &&
-+ !tcp_sk(sk)->mptcp->pre_established;
-+}
-+
-+/* Only support GSO if all subflows support it */
-+static inline bool mptcp_sk_can_gso(const struct sock *meta_sk)
-+{
-+ struct sock *sk;
-+
-+ if (tcp_sk(meta_sk)->mpcb->dss_csum)
-+ return false;
-+
-+ mptcp_for_each_sk(tcp_sk(meta_sk)->mpcb, sk) {
-+ if (!mptcp_sk_can_send(sk))
-+ continue;
-+ if (!sk_can_gso(sk))
-+ return false;
-+ }
-+ return true;
-+}
-+
-+static inline bool mptcp_can_sg(const struct sock *meta_sk)
-+{
-+ struct sock *sk;
-+
-+ if (tcp_sk(meta_sk)->mpcb->dss_csum)
-+ return false;
-+
-+ mptcp_for_each_sk(tcp_sk(meta_sk)->mpcb, sk) {
-+ if (!mptcp_sk_can_send(sk))
-+ continue;
-+ if (!(sk->sk_route_caps & NETIF_F_SG))
-+ return false;
-+ }
-+ return true;
-+}
-+
-+static inline void mptcp_set_rto(struct sock *sk)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ struct sock *sk_it;
-+ struct inet_connection_sock *micsk = inet_csk(mptcp_meta_sk(sk));
-+ __u32 max_rto = 0;
-+
-+ /* We are in recovery-phase on the MPTCP-level. Do not update the
-+ * RTO, because this would kill exponential backoff.
-+ */
-+ if (micsk->icsk_retransmits)
-+ return;
-+
-+ mptcp_for_each_sk(tp->mpcb, sk_it) {
-+ if (mptcp_sk_can_send(sk_it) &&
-+ inet_csk(sk_it)->icsk_rto > max_rto)
-+ max_rto = inet_csk(sk_it)->icsk_rto;
-+ }
-+ if (max_rto) {
-+ micsk->icsk_rto = max_rto << 1;
-+
-+ /* A successful rto-measurement - reset backoff counter */
-+ micsk->icsk_backoff = 0;
-+ }
-+}
-+
-+static inline int mptcp_sysctl_syn_retries(void)
-+{
-+ return sysctl_mptcp_syn_retries;
-+}
-+
-+static inline void mptcp_sub_close_passive(struct sock *sk)
-+{
-+ struct sock *meta_sk = mptcp_meta_sk(sk);
-+ struct tcp_sock *tp = tcp_sk(sk), *meta_tp = tcp_sk(meta_sk);
-+
-+ /* Only close, if the app did a send-shutdown (passive close), and we
-+ * received the data-ack of the data-fin.
-+ */
-+ if (tp->mpcb->passive_close && meta_tp->snd_una == meta_tp->write_seq)
-+ mptcp_sub_close(sk, 0);
-+}
-+
-+static inline bool mptcp_fallback_infinite(struct sock *sk, int flag)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+
-+ /* If data has been acknowledged on the meta-level, fully_established
-+ * will have been set before and thus we will not fall back to infinite
-+ * mapping.
-+ */
-+ if (likely(tp->mptcp->fully_established))
-+ return false;
-+
-+ if (!(flag & MPTCP_FLAG_DATA_ACKED))
-+ return false;
-+
-+ /* Don't fallback twice ;) */
-+ if (tp->mpcb->infinite_mapping_snd)
-+ return false;
-+
-+ pr_err("%s %#x will fallback - pi %d, src %pI4 dst %pI4 from %pS\n",
-+ __func__, tp->mpcb->mptcp_loc_token, tp->mptcp->path_index,
-+ &inet_sk(sk)->inet_saddr, &inet_sk(sk)->inet_daddr,
-+ __builtin_return_address(0));
-+ if (!is_master_tp(tp))
-+ return true;
-+
-+ tp->mpcb->infinite_mapping_snd = 1;
-+ tp->mpcb->infinite_mapping_rcv = 1;
-+ tp->mptcp->fully_established = 1;
-+
-+ return false;
-+}
-+
-+/* Find the first index whose bit in the bit-field == 0 */
-+static inline u8 mptcp_set_new_pathindex(struct mptcp_cb *mpcb)
-+{
-+ u8 base = mpcb->next_path_index;
-+ int i;
-+
-+ /* Start at 1, because 0 is reserved for the meta-sk */
-+ mptcp_for_each_bit_unset(mpcb->path_index_bits >> base, i) {
-+ if (i + base < 1)
-+ continue;
-+ if (i + base >= sizeof(mpcb->path_index_bits) * 8)
-+ break;
-+ i += base;
-+ mpcb->path_index_bits |= (1 << i);
-+ mpcb->next_path_index = i + 1;
-+ return i;
-+ }
-+ mptcp_for_each_bit_unset(mpcb->path_index_bits, i) {
-+ if (i >= sizeof(mpcb->path_index_bits) * 8)
-+ break;
-+ if (i < 1)
-+ continue;
-+ mpcb->path_index_bits |= (1 << i);
-+ mpcb->next_path_index = i + 1;
-+ return i;
-+ }
-+
-+ return 0;
-+}
-+
-+static inline bool mptcp_v6_is_v4_mapped(const struct sock *sk)
-+{
-+ return sk->sk_family == AF_INET6 &&
-+ ipv6_addr_type(&inet6_sk(sk)->saddr) == IPV6_ADDR_MAPPED;
-+}
-+
-+/* TCP and MPTCP mpc flag-depending functions */
-+u16 mptcp_select_window(struct sock *sk);
-+void mptcp_init_buffer_space(struct sock *sk);
-+void mptcp_tcp_set_rto(struct sock *sk);
-+
-+/* TCP and MPTCP flag-depending functions */
-+bool mptcp_prune_ofo_queue(struct sock *sk);
-+
-+#else /* CONFIG_MPTCP */
-+#define mptcp_debug(fmt, args...) \
-+ do { \
-+ } while (0)
-+
-+/* Without MPTCP, we just do one iteration
-+ * over the only socket available. This assumes that
-+ * the sk/tp arg is the socket in that case.
-+ */
-+#define mptcp_for_each_sk(mpcb, sk)
-+#define mptcp_for_each_sk_safe(__mpcb, __sk, __temp)
-+
-+static inline bool mptcp_is_data_fin(const struct sk_buff *skb)
-+{
-+ return false;
-+}
-+static inline bool mptcp_is_data_seq(const struct sk_buff *skb)
-+{
-+ return false;
-+}
-+static inline struct sock *mptcp_meta_sk(const struct sock *sk)
-+{
-+ return NULL;
-+}
-+static inline struct tcp_sock *mptcp_meta_tp(const struct tcp_sock *tp)
-+{
-+ return NULL;
-+}
-+static inline int is_meta_sk(const struct sock *sk)
-+{
-+ return 0;
-+}
-+static inline int is_master_tp(const struct tcp_sock *tp)
-+{
-+ return 0;
-+}
-+static inline void mptcp_purge_ofo_queue(struct tcp_sock *meta_tp) {}
-+static inline void mptcp_del_sock(const struct sock *sk) {}
-+static inline void mptcp_update_metasocket(struct sock *sock, const struct sock *meta_sk) {}
-+static inline void mptcp_reinject_data(struct sock *orig_sk, int clone_it) {}
-+static inline void mptcp_update_sndbuf(const struct tcp_sock *tp) {}
-+static inline void mptcp_clean_rtx_infinite(const struct sk_buff *skb,
-+ const struct sock *sk) {}
-+static inline void mptcp_sub_close(struct sock *sk, unsigned long delay) {}
-+static inline void mptcp_set_rto(const struct sock *sk) {}
-+static inline void mptcp_send_fin(const struct sock *meta_sk) {}
-+static inline void mptcp_parse_options(const uint8_t *ptr, const int opsize,
-+ const struct mptcp_options_received *mopt,
-+ const struct sk_buff *skb) {}
-+static inline void mptcp_syn_options(const struct sock *sk,
-+ struct tcp_out_options *opts,
-+ unsigned *remaining) {}
-+static inline void mptcp_synack_options(struct request_sock *req,
-+ struct tcp_out_options *opts,
-+ unsigned *remaining) {}
-+
-+static inline void mptcp_established_options(struct sock *sk,
-+ struct sk_buff *skb,
-+ struct tcp_out_options *opts,
-+ unsigned *size) {}
-+static inline void mptcp_options_write(__be32 *ptr, struct tcp_sock *tp,
-+ const struct tcp_out_options *opts,
-+ struct sk_buff *skb) {}
-+static inline void mptcp_close(struct sock *meta_sk, long timeout) {}
-+static inline int mptcp_doit(struct sock *sk)
-+{
-+ return 0;
-+}
-+static inline int mptcp_check_req_fastopen(struct sock *child,
-+ struct request_sock *req)
-+{
-+ return 1;
-+}
-+static inline int mptcp_check_req_master(const struct sock *sk,
-+ const struct sock *child,
-+ struct request_sock *req,
-+ struct request_sock **prev)
-+{
-+ return 1;
-+}
-+static inline struct sock *mptcp_check_req_child(struct sock *sk,
-+ struct sock *child,
-+ struct request_sock *req,
-+ struct request_sock **prev,
-+ const struct mptcp_options_received *mopt)
-+{
-+ return NULL;
-+}
-+static inline unsigned int mptcp_current_mss(struct sock *meta_sk)
-+{
-+ return 0;
-+}
-+static inline int mptcp_select_size(const struct sock *meta_sk, bool sg)
-+{
-+ return 0;
-+}
-+static inline void mptcp_sub_close_passive(struct sock *sk) {}
-+static inline bool mptcp_fallback_infinite(const struct sock *sk, int flag)
-+{
-+ return false;
-+}
-+static inline void mptcp_init_mp_opt(const struct mptcp_options_received *mopt) {}
-+static inline int mptcp_check_rtt(const struct tcp_sock *tp, int time)
-+{
-+ return 0;
-+}
-+static inline int mptcp_check_snd_buf(const struct tcp_sock *tp)
-+{
-+ return 0;
-+}
-+static inline int mptcp_sysctl_syn_retries(void)
-+{
-+ return 0;
-+}
-+static inline void mptcp_send_reset(const struct sock *sk) {}
-+static inline int mptcp_handle_options(struct sock *sk,
-+ const struct tcphdr *th,
-+ struct sk_buff *skb)
-+{
-+ return 0;
-+}
-+static inline void mptcp_reset_mopt(struct tcp_sock *tp) {}
-+static inline void __init mptcp_init(void) {}
-+static inline int mptcp_trim_head(struct sock *sk, struct sk_buff *skb, u32 len)
-+{
-+ return 0;
-+}
-+static inline bool mptcp_sk_can_gso(const struct sock *sk)
-+{
-+ return false;
-+}
-+static inline bool mptcp_can_sg(const struct sock *meta_sk)
-+{
-+ return false;
-+}
-+static inline unsigned int mptcp_xmit_size_goal(const struct sock *meta_sk,
-+ u32 mss_now, int large_allowed)
-+{
-+ return 0;
-+}
-+static inline void mptcp_destroy_sock(struct sock *sk) {}
-+static inline int mptcp_rcv_synsent_state_process(struct sock *sk,
-+ struct sock **skptr,
-+ struct sk_buff *skb,
-+ const struct mptcp_options_received *mopt)
-+{
-+ return 0;
-+}
-+static inline bool mptcp_can_sendpage(struct sock *sk)
-+{
-+ return false;
-+}
-+static inline int mptcp_init_tw_sock(struct sock *sk,
-+ struct tcp_timewait_sock *tw)
-+{
-+ return 0;
-+}
-+static inline void mptcp_twsk_destructor(struct tcp_timewait_sock *tw) {}
-+static inline void mptcp_disconnect(struct sock *sk) {}
-+static inline void mptcp_tsq_flags(struct sock *sk) {}
-+static inline void mptcp_tsq_sub_deferred(struct sock *meta_sk) {}
-+static inline void mptcp_hash_remove_bh(struct tcp_sock *meta_tp) {}
-+static inline void mptcp_hash_remove(struct tcp_sock *meta_tp) {}
-+static inline void mptcp_reqsk_new_mptcp(struct request_sock *req,
-+ const struct tcp_options_received *rx_opt,
-+ const struct mptcp_options_received *mopt,
-+ const struct sk_buff *skb) {}
-+static inline void mptcp_remove_shortcuts(const struct mptcp_cb *mpcb,
-+ const struct sk_buff *skb) {}
-+static inline void mptcp_delete_synack_timer(struct sock *meta_sk) {}
-+#endif /* CONFIG_MPTCP */
-+
-+#endif /* _MPTCP_H */
-diff --git a/include/net/mptcp_v4.h b/include/net/mptcp_v4.h
-new file mode 100644
-index 000000000000..93ad97c77c5a
---- /dev/null
-+++ b/include/net/mptcp_v4.h
-@@ -0,0 +1,67 @@
-+/*
-+ * MPTCP implementation
-+ *
-+ * Initial Design & Implementation:
-+ * Sébastien Barré <sebastien.barre@uclouvain.be>
-+ *
-+ * Current Maintainer & Author:
-+ * Christoph Paasch <christoph.paasch@uclouvain.be>
-+ *
-+ * Additional authors:
-+ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
-+ * Gregory Detal <gregory.detal@uclouvain.be>
-+ * Fabien Duchêne <fabien.duchene@uclouvain.be>
-+ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
-+ * Lavkesh Lahngir <lavkesh51@gmail.com>
-+ * Andreas Ripke <ripke@neclab.eu>
-+ * Vlad Dogaru <vlad.dogaru@intel.com>
-+ * Octavian Purdila <octavian.purdila@intel.com>
-+ * John Ronan <jronan@tssg.org>
-+ * Catalin Nicutar <catalin.nicutar@gmail.com>
-+ * Brandon Heller <brandonh@stanford.edu>
-+ *
-+ *
-+ * This program is free software; you can redistribute it and/or
-+ * modify it under the terms of the GNU General Public License
-+ * as published by the Free Software Foundation; either version
-+ * 2 of the License, or (at your option) any later version.
-+ */
-+
-+#ifndef MPTCP_V4_H_
-+#define MPTCP_V4_H_
-+
-+
-+#include <linux/in.h>
-+#include <linux/skbuff.h>
-+#include <net/mptcp.h>
-+#include <net/request_sock.h>
-+#include <net/sock.h>
-+
-+extern struct request_sock_ops mptcp_request_sock_ops;
-+extern const struct inet_connection_sock_af_ops mptcp_v4_specific;
-+extern struct tcp_request_sock_ops mptcp_request_sock_ipv4_ops;
-+extern struct tcp_request_sock_ops mptcp_join_request_sock_ipv4_ops;
-+
-+#ifdef CONFIG_MPTCP
-+
-+int mptcp_v4_do_rcv(struct sock *meta_sk, struct sk_buff *skb);
-+struct sock *mptcp_v4_search_req(const __be16 rport, const __be32 raddr,
-+ const __be32 laddr, const struct net *net);
-+int mptcp_init4_subsockets(struct sock *meta_sk, const struct mptcp_loc4 *loc,
-+ struct mptcp_rem4 *rem);
-+int mptcp_pm_v4_init(void);
-+void mptcp_pm_v4_undo(void);
-+u32 mptcp_v4_get_nonce(__be32 saddr, __be32 daddr, __be16 sport, __be16 dport);
-+u64 mptcp_v4_get_key(__be32 saddr, __be32 daddr, __be16 sport, __be16 dport);
-+
-+#else
-+
-+static inline int mptcp_v4_do_rcv(const struct sock *meta_sk,
-+ const struct sk_buff *skb)
-+{
-+ return 0;
-+}
-+
-+#endif /* CONFIG_MPTCP */
-+
-+#endif /* MPTCP_V4_H_ */
-diff --git a/include/net/mptcp_v6.h b/include/net/mptcp_v6.h
-new file mode 100644
-index 000000000000..49a4f30ccd4d
---- /dev/null
-+++ b/include/net/mptcp_v6.h
-@@ -0,0 +1,69 @@
-+/*
-+ * MPTCP implementation
-+ *
-+ * Initial Design & Implementation:
-+ * Sébastien Barré <sebastien.barre@uclouvain.be>
-+ *
-+ * Current Maintainer & Author:
-+ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
-+ *
-+ * Additional authors:
-+ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
-+ * Gregory Detal <gregory.detal@uclouvain.be>
-+ * Fabien Duchêne <fabien.duchene@uclouvain.be>
-+ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
-+ * Lavkesh Lahngir <lavkesh51@gmail.com>
-+ * Andreas Ripke <ripke@neclab.eu>
-+ * Vlad Dogaru <vlad.dogaru@intel.com>
-+ * Octavian Purdila <octavian.purdila@intel.com>
-+ * John Ronan <jronan@tssg.org>
-+ * Catalin Nicutar <catalin.nicutar@gmail.com>
-+ * Brandon Heller <brandonh@stanford.edu>
-+ *
-+ *
-+ * This program is free software; you can redistribute it and/or
-+ * modify it under the terms of the GNU General Public License
-+ * as published by the Free Software Foundation; either version
-+ * 2 of the License, or (at your option) any later version.
-+ */
-+
-+#ifndef _MPTCP_V6_H
-+#define _MPTCP_V6_H
-+
-+#include <linux/in6.h>
-+#include <net/if_inet6.h>
-+
-+#include <net/mptcp.h>
-+
-+
-+#ifdef CONFIG_MPTCP
-+extern const struct inet_connection_sock_af_ops mptcp_v6_mapped;
-+extern const struct inet_connection_sock_af_ops mptcp_v6_specific;
-+extern struct request_sock_ops mptcp6_request_sock_ops;
-+extern struct tcp_request_sock_ops mptcp_request_sock_ipv6_ops;
-+extern struct tcp_request_sock_ops mptcp_join_request_sock_ipv6_ops;
-+
-+int mptcp_v6_do_rcv(struct sock *meta_sk, struct sk_buff *skb);
-+struct sock *mptcp_v6_search_req(const __be16 rport, const struct in6_addr *raddr,
-+ const struct in6_addr *laddr, const struct net *net);
-+int mptcp_init6_subsockets(struct sock *meta_sk, const struct mptcp_loc6 *loc,
-+ struct mptcp_rem6 *rem);
-+int mptcp_pm_v6_init(void);
-+void mptcp_pm_v6_undo(void);
-+__u32 mptcp_v6_get_nonce(const __be32 *saddr, const __be32 *daddr,
-+ __be16 sport, __be16 dport);
-+u64 mptcp_v6_get_key(const __be32 *saddr, const __be32 *daddr,
-+ __be16 sport, __be16 dport);
-+
-+#else /* CONFIG_MPTCP */
-+
-+#define mptcp_v6_mapped ipv6_mapped
-+
-+static inline int mptcp_v6_do_rcv(struct sock *meta_sk, struct sk_buff *skb)
-+{
-+ return 0;
-+}
-+
-+#endif /* CONFIG_MPTCP */
-+
-+#endif /* _MPTCP_V6_H */
-diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
-index 361d26077196..bae95a11c531 100644
---- a/include/net/net_namespace.h
-+++ b/include/net/net_namespace.h
-@@ -16,6 +16,7 @@
- #include <net/netns/packet.h>
- #include <net/netns/ipv4.h>
- #include <net/netns/ipv6.h>
-+#include <net/netns/mptcp.h>
- #include <net/netns/ieee802154_6lowpan.h>
- #include <net/netns/sctp.h>
- #include <net/netns/dccp.h>
-@@ -92,6 +93,9 @@ struct net {
- #if IS_ENABLED(CONFIG_IPV6)
- struct netns_ipv6 ipv6;
- #endif
-+#if IS_ENABLED(CONFIG_MPTCP)
-+ struct netns_mptcp mptcp;
-+#endif
- #if IS_ENABLED(CONFIG_IEEE802154_6LOWPAN)
- struct netns_ieee802154_lowpan ieee802154_lowpan;
- #endif
-diff --git a/include/net/netns/mptcp.h b/include/net/netns/mptcp.h
-new file mode 100644
-index 000000000000..bad418b04cc8
---- /dev/null
-+++ b/include/net/netns/mptcp.h
-@@ -0,0 +1,44 @@
-+/*
-+ * MPTCP implementation - MPTCP namespace
-+ *
-+ * Initial Design & Implementation:
-+ * Sébastien Barré <sebastien.barre@uclouvain.be>
-+ *
-+ * Current Maintainer:
-+ * Christoph Paasch <christoph.paasch@uclouvain.be>
-+ *
-+ * Additional authors:
-+ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
-+ * Gregory Detal <gregory.detal@uclouvain.be>
-+ * Fabien Duchêne <fabien.duchene@uclouvain.be>
-+ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
-+ * Lavkesh Lahngir <lavkesh51@gmail.com>
-+ * Andreas Ripke <ripke@neclab.eu>
-+ * Vlad Dogaru <vlad.dogaru@intel.com>
-+ * Octavian Purdila <octavian.purdila@intel.com>
-+ * John Ronan <jronan@tssg.org>
-+ * Catalin Nicutar <catalin.nicutar@gmail.com>
-+ * Brandon Heller <brandonh@stanford.edu>
-+ *
-+ *
-+ * This program is free software; you can redistribute it and/or
-+ * modify it under the terms of the GNU General Public License
-+ * as published by the Free Software Foundation; either version
-+ * 2 of the License, or (at your option) any later version.
-+ */
-+
-+#ifndef __NETNS_MPTCP_H__
-+#define __NETNS_MPTCP_H__
-+
-+#include <linux/compiler.h>
-+
-+enum {
-+ MPTCP_PM_FULLMESH = 0,
-+ MPTCP_PM_MAX
-+};
-+
-+struct netns_mptcp {
-+ void *path_managers[MPTCP_PM_MAX];
-+};
-+
-+#endif /* __NETNS_MPTCP_H__ */
-diff --git a/include/net/request_sock.h b/include/net/request_sock.h
-index 7f830ff67f08..e79e87a8e1a6 100644
---- a/include/net/request_sock.h
-+++ b/include/net/request_sock.h
-@@ -164,7 +164,7 @@ struct request_sock_queue {
- };
-
- int reqsk_queue_alloc(struct request_sock_queue *queue,
-- unsigned int nr_table_entries);
-+ unsigned int nr_table_entries, gfp_t flags);
-
- void __reqsk_queue_destroy(struct request_sock_queue *queue);
- void reqsk_queue_destroy(struct request_sock_queue *queue);
-diff --git a/include/net/sock.h b/include/net/sock.h
-index 156350745700..0e23cae8861f 100644
---- a/include/net/sock.h
-+++ b/include/net/sock.h
-@@ -901,6 +901,16 @@ void sk_clear_memalloc(struct sock *sk);
-
- int sk_wait_data(struct sock *sk, long *timeo);
-
-+/* START - needed for MPTCP */
-+struct sock *sk_prot_alloc(struct proto *prot, gfp_t priority, int family);
-+void sock_lock_init(struct sock *sk);
-+
-+extern struct lock_class_key af_callback_keys[AF_MAX];
-+extern char *const af_family_clock_key_strings[AF_MAX+1];
-+
-+#define SK_FLAGS_TIMESTAMP ((1UL << SOCK_TIMESTAMP) | (1UL << SOCK_TIMESTAMPING_RX_SOFTWARE))
-+/* END - needed for MPTCP */
-+
- struct request_sock_ops;
- struct timewait_sock_ops;
- struct inet_hashinfo;
-diff --git a/include/net/tcp.h b/include/net/tcp.h
-index 7286db80e8b8..ff92e74cd684 100644
---- a/include/net/tcp.h
-+++ b/include/net/tcp.h
-@@ -177,6 +177,7 @@ void tcp_time_wait(struct sock *sk, int state, int timeo);
- #define TCPOPT_SACK 5 /* SACK Block */
- #define TCPOPT_TIMESTAMP 8 /* Better RTT estimations/PAWS */
- #define TCPOPT_MD5SIG 19 /* MD5 Signature (RFC2385) */
-+#define TCPOPT_MPTCP 30
- #define TCPOPT_EXP 254 /* Experimental */
- /* Magic number to be after the option value for sharing TCP
- * experimental options. See draft-ietf-tcpm-experimental-options-00.txt
-@@ -229,6 +230,27 @@ void tcp_time_wait(struct sock *sk, int state, int timeo);
- #define TFO_SERVER_WO_SOCKOPT1 0x400
- #define TFO_SERVER_WO_SOCKOPT2 0x800
-
-+/* Flags from tcp_input.c for tcp_ack */
-+#define FLAG_DATA 0x01 /* Incoming frame contained data. */
-+#define FLAG_WIN_UPDATE 0x02 /* Incoming ACK was a window update. */
-+#define FLAG_DATA_ACKED 0x04 /* This ACK acknowledged new data. */
-+#define FLAG_RETRANS_DATA_ACKED 0x08 /* "" "" some of which was retransmitted. */
-+#define FLAG_SYN_ACKED 0x10 /* This ACK acknowledged SYN. */
-+#define FLAG_DATA_SACKED 0x20 /* New SACK. */
-+#define FLAG_ECE 0x40 /* ECE in this ACK */
-+#define FLAG_SLOWPATH 0x100 /* Do not skip RFC checks for window update.*/
-+#define FLAG_ORIG_SACK_ACKED 0x200 /* Never retransmitted data are (s)acked */
-+#define FLAG_SND_UNA_ADVANCED 0x400 /* Snd_una was changed (!= FLAG_DATA_ACKED) */
-+#define FLAG_DSACKING_ACK 0x800 /* SACK blocks contained D-SACK info */
-+#define FLAG_SACK_RENEGING 0x2000 /* snd_una advanced to a sacked seq */
-+#define FLAG_UPDATE_TS_RECENT 0x4000 /* tcp_replace_ts_recent() */
-+#define MPTCP_FLAG_DATA_ACKED 0x8000
-+
-+#define FLAG_ACKED (FLAG_DATA_ACKED|FLAG_SYN_ACKED)
-+#define FLAG_NOT_DUP (FLAG_DATA|FLAG_WIN_UPDATE|FLAG_ACKED)
-+#define FLAG_CA_ALERT (FLAG_DATA_SACKED|FLAG_ECE)
-+#define FLAG_FORWARD_PROGRESS (FLAG_ACKED|FLAG_DATA_SACKED)
-+
- extern struct inet_timewait_death_row tcp_death_row;
-
- /* sysctl variables for tcp */
-@@ -344,6 +366,107 @@ extern struct proto tcp_prot;
- #define TCP_ADD_STATS_USER(net, field, val) SNMP_ADD_STATS_USER((net)->mib.tcp_statistics, field, val)
- #define TCP_ADD_STATS(net, field, val) SNMP_ADD_STATS((net)->mib.tcp_statistics, field, val)
-
-+/**** START - Exports needed for MPTCP ****/
-+extern const struct tcp_request_sock_ops tcp_request_sock_ipv4_ops;
-+extern const struct tcp_request_sock_ops tcp_request_sock_ipv6_ops;
-+
-+struct mptcp_options_received;
-+
-+void tcp_enter_quickack_mode(struct sock *sk);
-+int tcp_close_state(struct sock *sk);
-+void tcp_minshall_update(struct tcp_sock *tp, unsigned int mss_now,
-+ const struct sk_buff *skb);
-+int tcp_xmit_probe_skb(struct sock *sk, int urgent);
-+void tcp_event_new_data_sent(struct sock *sk, const struct sk_buff *skb);
-+int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
-+ gfp_t gfp_mask);
-+unsigned int tcp_mss_split_point(const struct sock *sk,
-+ const struct sk_buff *skb,
-+ unsigned int mss_now,
-+ unsigned int max_segs,
-+ int nonagle);
-+bool tcp_nagle_test(const struct tcp_sock *tp, const struct sk_buff *skb,
-+ unsigned int cur_mss, int nonagle);
-+bool tcp_snd_wnd_test(const struct tcp_sock *tp, const struct sk_buff *skb,
-+ unsigned int cur_mss);
-+unsigned int tcp_cwnd_test(const struct tcp_sock *tp, const struct sk_buff *skb);
-+int tcp_init_tso_segs(const struct sock *sk, struct sk_buff *skb,
-+ unsigned int mss_now);
-+void __pskb_trim_head(struct sk_buff *skb, int len);
-+void tcp_queue_skb(struct sock *sk, struct sk_buff *skb);
-+void tcp_init_nondata_skb(struct sk_buff *skb, u32 seq, u8 flags);
-+void tcp_reset(struct sock *sk);
-+bool tcp_may_update_window(const struct tcp_sock *tp, const u32 ack,
-+ const u32 ack_seq, const u32 nwin);
-+bool tcp_urg_mode(const struct tcp_sock *tp);
-+void tcp_ack_probe(struct sock *sk);
-+void tcp_rearm_rto(struct sock *sk);
-+int tcp_write_timeout(struct sock *sk);
-+bool retransmits_timed_out(struct sock *sk, unsigned int boundary,
-+ unsigned int timeout, bool syn_set);
-+void tcp_write_err(struct sock *sk);
-+void tcp_adjust_pcount(struct sock *sk, const struct sk_buff *skb, int decr);
-+void tcp_set_skb_tso_segs(const struct sock *sk, struct sk_buff *skb,
-+ unsigned int mss_now);
-+
-+int tcp_v4_rtx_synack(struct sock *sk, struct request_sock *req);
-+void tcp_v4_reqsk_send_ack(struct sock *sk, struct sk_buff *skb,
-+ struct request_sock *req);
-+__u32 tcp_v4_init_sequence(const struct sk_buff *skb);
-+int tcp_v4_send_synack(struct sock *sk, struct dst_entry *dst,
-+ struct flowi *fl,
-+ struct request_sock *req,
-+ u16 queue_mapping,
-+ struct tcp_fastopen_cookie *foc);
-+void tcp_v4_send_reset(struct sock *sk, struct sk_buff *skb);
-+struct ip_options_rcu *tcp_v4_save_options(struct sk_buff *skb);
-+struct sock *tcp_v4_hnd_req(struct sock *sk, struct sk_buff *skb);
-+void tcp_v4_reqsk_destructor(struct request_sock *req);
-+
-+int tcp_v6_rtx_synack(struct sock *sk, struct request_sock *req);
-+void tcp_v6_reqsk_send_ack(struct sock *sk, struct sk_buff *skb,
-+ struct request_sock *req);
-+__u32 tcp_v6_init_sequence(const struct sk_buff *skb);
-+int tcp_v6_send_synack(struct sock *sk, struct dst_entry *dst,
-+ struct flowi *fl, struct request_sock *req,
-+ u16 queue_mapping, struct tcp_fastopen_cookie *foc);
-+void tcp_v6_send_reset(struct sock *sk, struct sk_buff *skb);
-+int tcp_v6_do_rcv(struct sock *sk, struct sk_buff *skb);
-+int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len);
-+void tcp_v6_destroy_sock(struct sock *sk);
-+void inet6_sk_rx_dst_set(struct sock *sk, const struct sk_buff *skb);
-+void tcp_v6_hash(struct sock *sk);
-+struct sock *tcp_v6_hnd_req(struct sock *sk,struct sk_buff *skb);
-+struct sock *tcp_v6_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
-+ struct request_sock *req,
-+ struct dst_entry *dst);
-+void tcp_v6_reqsk_destructor(struct request_sock *req);
-+
-+unsigned int tcp_xmit_size_goal(struct sock *sk, u32 mss_now,
-+ int large_allowed);
-+u32 tcp_tso_acked(struct sock *sk, struct sk_buff *skb);
-+
-+void skb_clone_fraglist(struct sk_buff *skb);
-+void copy_skb_header(struct sk_buff *new, const struct sk_buff *old);
-+
-+void inet_twsk_free(struct inet_timewait_sock *tw);
-+int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb);
-+/* These states need RST on ABORT according to RFC793 */
-+static inline bool tcp_need_reset(int state)
-+{
-+ return (1 << state) &
-+ (TCPF_ESTABLISHED | TCPF_CLOSE_WAIT | TCPF_FIN_WAIT1 |
-+ TCPF_FIN_WAIT2 | TCPF_SYN_RECV);
-+}
-+
-+bool tcp_dma_try_early_copy(struct sock *sk, struct sk_buff *skb,
-+ int hlen);
-+int __must_check tcp_queue_rcv(struct sock *sk, struct sk_buff *skb, int hdrlen,
-+ bool *fragstolen);
-+bool tcp_try_coalesce(struct sock *sk, struct sk_buff *to,
-+ struct sk_buff *from, bool *fragstolen);
-+/**** END - Exports needed for MPTCP ****/
-+
- void tcp_tasklet_init(void);
-
- void tcp_v4_err(struct sk_buff *skb, u32);
-@@ -440,6 +563,7 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
- size_t len, int nonblock, int flags, int *addr_len);
- void tcp_parse_options(const struct sk_buff *skb,
- struct tcp_options_received *opt_rx,
-+ struct mptcp_options_received *mopt_rx,
- int estab, struct tcp_fastopen_cookie *foc);
- const u8 *tcp_parse_md5sig_option(const struct tcphdr *th);
-
-@@ -493,14 +617,8 @@ static inline u32 tcp_cookie_time(void)
-
- u32 __cookie_v4_init_sequence(const struct iphdr *iph, const struct tcphdr *th,
- u16 *mssp);
--__u32 cookie_v4_init_sequence(struct sock *sk, struct sk_buff *skb, __u16 *mss);
--#else
--static inline __u32 cookie_v4_init_sequence(struct sock *sk,
-- struct sk_buff *skb,
-- __u16 *mss)
--{
-- return 0;
--}
-+__u32 cookie_v4_init_sequence(struct sock *sk, const struct sk_buff *skb,
-+ __u16 *mss);
- #endif
-
- __u32 cookie_init_timestamp(struct request_sock *req);
-@@ -516,13 +634,6 @@ u32 __cookie_v6_init_sequence(const struct ipv6hdr *iph,
- const struct tcphdr *th, u16 *mssp);
- __u32 cookie_v6_init_sequence(struct sock *sk, const struct sk_buff *skb,
- __u16 *mss);
--#else
--static inline __u32 cookie_v6_init_sequence(struct sock *sk,
-- struct sk_buff *skb,
-- __u16 *mss)
--{
-- return 0;
--}
- #endif
- /* tcp_output.c */
-
-@@ -551,10 +662,17 @@ void tcp_send_delayed_ack(struct sock *sk);
- void tcp_send_loss_probe(struct sock *sk);
- bool tcp_schedule_loss_probe(struct sock *sk);
-
-+u16 tcp_select_window(struct sock *sk);
-+bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
-+ int push_one, gfp_t gfp);
-+
- /* tcp_input.c */
- void tcp_resume_early_retransmit(struct sock *sk);
- void tcp_rearm_rto(struct sock *sk);
- void tcp_reset(struct sock *sk);
-+void tcp_set_rto(struct sock *sk);
-+bool tcp_should_expand_sndbuf(const struct sock *sk);
-+bool tcp_prune_ofo_queue(struct sock *sk);
-
- /* tcp_timer.c */
- void tcp_init_xmit_timers(struct sock *);
-@@ -703,14 +821,27 @@ void tcp_send_window_probe(struct sock *sk);
- */
- struct tcp_skb_cb {
- union {
-- struct inet_skb_parm h4;
-+ union {
-+ struct inet_skb_parm h4;
- #if IS_ENABLED(CONFIG_IPV6)
-- struct inet6_skb_parm h6;
-+ struct inet6_skb_parm h6;
- #endif
-- } header; /* For incoming frames */
-+ } header; /* For incoming frames */
-+#ifdef CONFIG_MPTCP
-+ union { /* For MPTCP outgoing frames */
-+ __u32 path_mask; /* paths that tried to send this skb */
-+ __u32 dss[6]; /* DSS options */
-+ };
-+#endif
-+ };
- __u32 seq; /* Starting sequence number */
- __u32 end_seq; /* SEQ + FIN + SYN + datalen */
- __u32 when; /* used to compute rtt's */
-+#ifdef CONFIG_MPTCP
-+ __u8 mptcp_flags; /* flags for the MPTCP layer */
-+ __u8 dss_off; /* Number of 4-byte words until
-+ * seq-number */
-+#endif
- __u8 tcp_flags; /* TCP header flags. (tcp[13]) */
-
- __u8 sacked; /* State flags for SACK/FACK. */
-@@ -1075,7 +1206,8 @@ u32 tcp_default_init_rwnd(u32 mss);
- /* Determine a window scaling and initial window to offer. */
- void tcp_select_initial_window(int __space, __u32 mss, __u32 *rcv_wnd,
- __u32 *window_clamp, int wscale_ok,
-- __u8 *rcv_wscale, __u32 init_rcv_wnd);
-+ __u8 *rcv_wscale, __u32 init_rcv_wnd,
-+ const struct sock *sk);
-
- static inline int tcp_win_from_space(int space)
- {
-@@ -1084,15 +1216,34 @@ static inline int tcp_win_from_space(int space)
- space - (space>>sysctl_tcp_adv_win_scale);
- }
-
-+#ifdef CONFIG_MPTCP
-+extern struct static_key mptcp_static_key;
-+static inline bool mptcp(const struct tcp_sock *tp)
-+{
-+ return static_key_false(&mptcp_static_key) && tp->mpc;
-+}
-+#else
-+static inline bool mptcp(const struct tcp_sock *tp)
-+{
-+ return 0;
-+}
-+#endif
-+
- /* Note: caller must be prepared to deal with negative returns */
- static inline int tcp_space(const struct sock *sk)
- {
-+ if (mptcp(tcp_sk(sk)))
-+ sk = tcp_sk(sk)->meta_sk;
-+
- return tcp_win_from_space(sk->sk_rcvbuf -
- atomic_read(&sk->sk_rmem_alloc));
- }
-
- static inline int tcp_full_space(const struct sock *sk)
- {
-+ if (mptcp(tcp_sk(sk)))
-+ sk = tcp_sk(sk)->meta_sk;
-+
- return tcp_win_from_space(sk->sk_rcvbuf);
- }
-
-@@ -1115,6 +1266,8 @@ static inline void tcp_openreq_init(struct request_sock *req,
- ireq->wscale_ok = rx_opt->wscale_ok;
- ireq->acked = 0;
- ireq->ecn_ok = 0;
-+ ireq->mptcp_rqsk = 0;
-+ ireq->saw_mpc = 0;
- ireq->ir_rmt_port = tcp_hdr(skb)->source;
- ireq->ir_num = ntohs(tcp_hdr(skb)->dest);
- }
-@@ -1585,6 +1738,11 @@ int tcp4_proc_init(void);
- void tcp4_proc_exit(void);
- #endif
-
-+int tcp_rtx_synack(struct sock *sk, struct request_sock *req);
-+int tcp_conn_request(struct request_sock_ops *rsk_ops,
-+ const struct tcp_request_sock_ops *af_ops,
-+ struct sock *sk, struct sk_buff *skb);
-+
- /* TCP af-specific functions */
- struct tcp_sock_af_ops {
- #ifdef CONFIG_TCP_MD5SIG
-@@ -1601,7 +1759,32 @@ struct tcp_sock_af_ops {
- #endif
- };
-
-+/* TCP/MPTCP-specific functions */
-+struct tcp_sock_ops {
-+ u32 (*__select_window)(struct sock *sk);
-+ u16 (*select_window)(struct sock *sk);
-+ void (*select_initial_window)(int __space, __u32 mss, __u32 *rcv_wnd,
-+ __u32 *window_clamp, int wscale_ok,
-+ __u8 *rcv_wscale, __u32 init_rcv_wnd,
-+ const struct sock *sk);
-+ void (*init_buffer_space)(struct sock *sk);
-+ void (*set_rto)(struct sock *sk);
-+ bool (*should_expand_sndbuf)(const struct sock *sk);
-+ void (*send_fin)(struct sock *sk);
-+ bool (*write_xmit)(struct sock *sk, unsigned int mss_now, int nonagle,
-+ int push_one, gfp_t gfp);
-+ void (*send_active_reset)(struct sock *sk, gfp_t priority);
-+ int (*write_wakeup)(struct sock *sk);
-+ bool (*prune_ofo_queue)(struct sock *sk);
-+ void (*retransmit_timer)(struct sock *sk);
-+ void (*time_wait)(struct sock *sk, int state, int timeo);
-+ void (*cleanup_rbuf)(struct sock *sk, int copied);
-+ void (*init_congestion_control)(struct sock *sk);
-+};
-+extern const struct tcp_sock_ops tcp_specific;
-+
- struct tcp_request_sock_ops {
-+ u16 mss_clamp;
- #ifdef CONFIG_TCP_MD5SIG
- struct tcp_md5sig_key *(*md5_lookup) (struct sock *sk,
- struct request_sock *req);
-@@ -1611,8 +1794,39 @@ struct tcp_request_sock_ops {
- const struct request_sock *req,
- const struct sk_buff *skb);
- #endif
-+ int (*init_req)(struct request_sock *req, struct sock *sk,
-+ struct sk_buff *skb);
-+#ifdef CONFIG_SYN_COOKIES
-+ __u32 (*cookie_init_seq)(struct sock *sk, const struct sk_buff *skb,
-+ __u16 *mss);
-+#endif
-+ struct dst_entry *(*route_req)(struct sock *sk, struct flowi *fl,
-+ const struct request_sock *req,
-+ bool *strict);
-+ __u32 (*init_seq)(const struct sk_buff *skb);
-+ int (*send_synack)(struct sock *sk, struct dst_entry *dst,
-+ struct flowi *fl, struct request_sock *req,
-+ u16 queue_mapping, struct tcp_fastopen_cookie *foc);
-+ void (*queue_hash_add)(struct sock *sk, struct request_sock *req,
-+ const unsigned long timeout);
- };
-
-+#ifdef CONFIG_SYN_COOKIES
-+static inline __u32 cookie_init_sequence(const struct tcp_request_sock_ops *ops,
-+ struct sock *sk, struct sk_buff *skb,
-+ __u16 *mss)
-+{
-+ return ops->cookie_init_seq(sk, skb, mss);
-+}
-+#else
-+static inline __u32 cookie_init_sequence(const struct tcp_request_sock_ops *ops,
-+ struct sock *sk, struct sk_buff *skb,
-+ __u16 *mss)
-+{
-+ return 0;
-+}
-+#endif
-+
- int tcpv4_offload_init(void);
-
- void tcp_v4_init(void);
-diff --git a/include/uapi/linux/if.h b/include/uapi/linux/if.h
-index 9cf2394f0bcf..c2634b6ed854 100644
---- a/include/uapi/linux/if.h
-+++ b/include/uapi/linux/if.h
-@@ -109,6 +109,9 @@ enum net_device_flags {
- #define IFF_DORMANT IFF_DORMANT
- #define IFF_ECHO IFF_ECHO
-
-+#define IFF_NOMULTIPATH 0x80000 /* Disable for MPTCP */
-+#define IFF_MPBACKUP 0x100000 /* Use as backup path for MPTCP */
-+
- #define IFF_VOLATILE (IFF_LOOPBACK|IFF_POINTOPOINT|IFF_BROADCAST|IFF_ECHO|\
- IFF_MASTER|IFF_SLAVE|IFF_RUNNING|IFF_LOWER_UP|IFF_DORMANT)
-
-diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
-index 3b9718328d8b..487475681d84 100644
---- a/include/uapi/linux/tcp.h
-+++ b/include/uapi/linux/tcp.h
-@@ -112,6 +112,7 @@ enum {
- #define TCP_FASTOPEN 23 /* Enable FastOpen on listeners */
- #define TCP_TIMESTAMP 24
- #define TCP_NOTSENT_LOWAT 25 /* limit number of unsent bytes in write queue */
-+#define MPTCP_ENABLED 26
-
- struct tcp_repair_opt {
- __u32 opt_code;
-diff --git a/net/Kconfig b/net/Kconfig
-index d92afe4204d9..96b58593ad5e 100644
---- a/net/Kconfig
-+++ b/net/Kconfig
-@@ -79,6 +79,7 @@ if INET
- source "net/ipv4/Kconfig"
- source "net/ipv6/Kconfig"
- source "net/netlabel/Kconfig"
-+source "net/mptcp/Kconfig"
-
- endif # if INET
-
-diff --git a/net/Makefile b/net/Makefile
-index cbbbe6d657ca..244bac1435b1 100644
---- a/net/Makefile
-+++ b/net/Makefile
-@@ -20,6 +20,7 @@ obj-$(CONFIG_INET) += ipv4/
- obj-$(CONFIG_XFRM) += xfrm/
- obj-$(CONFIG_UNIX) += unix/
- obj-$(CONFIG_NET) += ipv6/
-+obj-$(CONFIG_MPTCP) += mptcp/
- obj-$(CONFIG_PACKET) += packet/
- obj-$(CONFIG_NET_KEY) += key/
- obj-$(CONFIG_BRIDGE) += bridge/
-diff --git a/net/core/dev.c b/net/core/dev.c
-index 367a586d0c8a..215d2757fbf6 100644
---- a/net/core/dev.c
-+++ b/net/core/dev.c
-@@ -5420,7 +5420,7 @@ int __dev_change_flags(struct net_device *dev, unsigned int flags)
-
- dev->flags = (flags & (IFF_DEBUG | IFF_NOTRAILERS | IFF_NOARP |
- IFF_DYNAMIC | IFF_MULTICAST | IFF_PORTSEL |
-- IFF_AUTOMEDIA)) |
-+ IFF_AUTOMEDIA | IFF_NOMULTIPATH | IFF_MPBACKUP)) |
- (dev->flags & (IFF_UP | IFF_VOLATILE | IFF_PROMISC |
- IFF_ALLMULTI));
-
-diff --git a/net/core/request_sock.c b/net/core/request_sock.c
-index 467f326126e0..909dfa13f499 100644
---- a/net/core/request_sock.c
-+++ b/net/core/request_sock.c
-@@ -38,7 +38,8 @@ int sysctl_max_syn_backlog = 256;
- EXPORT_SYMBOL(sysctl_max_syn_backlog);
-
- int reqsk_queue_alloc(struct request_sock_queue *queue,
-- unsigned int nr_table_entries)
-+ unsigned int nr_table_entries,
-+ gfp_t flags)
- {
- size_t lopt_size = sizeof(struct listen_sock);
- struct listen_sock *lopt;
-@@ -48,9 +49,11 @@ int reqsk_queue_alloc(struct request_sock_queue *queue,
- nr_table_entries = roundup_pow_of_two(nr_table_entries + 1);
- lopt_size += nr_table_entries * sizeof(struct request_sock *);
- if (lopt_size > PAGE_SIZE)
-- lopt = vzalloc(lopt_size);
-+ lopt = __vmalloc(lopt_size,
-+ flags | __GFP_HIGHMEM | __GFP_ZERO,
-+ PAGE_KERNEL);
- else
-- lopt = kzalloc(lopt_size, GFP_KERNEL);
-+ lopt = kzalloc(lopt_size, flags);
- if (lopt == NULL)
- return -ENOMEM;
-
-diff --git a/net/core/skbuff.c b/net/core/skbuff.c
-index c1a33033cbe2..8abc5d60fbe3 100644
---- a/net/core/skbuff.c
-+++ b/net/core/skbuff.c
-@@ -472,7 +472,7 @@ static inline void skb_drop_fraglist(struct sk_buff *skb)
- skb_drop_list(&skb_shinfo(skb)->frag_list);
- }
-
--static void skb_clone_fraglist(struct sk_buff *skb)
-+void skb_clone_fraglist(struct sk_buff *skb)
- {
- struct sk_buff *list;
-
-@@ -897,7 +897,7 @@ static void skb_headers_offset_update(struct sk_buff *skb, int off)
- skb->inner_mac_header += off;
- }
-
--static void copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
-+void copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
- {
- __copy_skb_header(new, old);
-
-diff --git a/net/core/sock.c b/net/core/sock.c
-index 026e01f70274..359295523177 100644
---- a/net/core/sock.c
-+++ b/net/core/sock.c
-@@ -136,6 +136,11 @@
-
- #include <trace/events/sock.h>
-
-+#ifdef CONFIG_MPTCP
-+#include <net/mptcp.h>
-+#include <net/inet_common.h>
-+#endif
-+
- #ifdef CONFIG_INET
- #include <net/tcp.h>
- #endif
-@@ -280,7 +285,7 @@ static const char *const af_family_slock_key_strings[AF_MAX+1] = {
- "slock-AF_IEEE802154", "slock-AF_CAIF" , "slock-AF_ALG" ,
- "slock-AF_NFC" , "slock-AF_VSOCK" ,"slock-AF_MAX"
- };
--static const char *const af_family_clock_key_strings[AF_MAX+1] = {
-+char *const af_family_clock_key_strings[AF_MAX+1] = {
- "clock-AF_UNSPEC", "clock-AF_UNIX" , "clock-AF_INET" ,
- "clock-AF_AX25" , "clock-AF_IPX" , "clock-AF_APPLETALK",
- "clock-AF_NETROM", "clock-AF_BRIDGE" , "clock-AF_ATMPVC" ,
-@@ -301,7 +306,7 @@ static const char *const af_family_clock_key_strings[AF_MAX+1] = {
- * sk_callback_lock locking rules are per-address-family,
- * so split the lock classes by using a per-AF key:
- */
--static struct lock_class_key af_callback_keys[AF_MAX];
-+struct lock_class_key af_callback_keys[AF_MAX];
-
- /* Take into consideration the size of the struct sk_buff overhead in the
- * determination of these values, since that is non-constant across
-@@ -422,8 +427,6 @@ static void sock_warn_obsolete_bsdism(const char *name)
- }
- }
-
--#define SK_FLAGS_TIMESTAMP ((1UL << SOCK_TIMESTAMP) | (1UL << SOCK_TIMESTAMPING_RX_SOFTWARE))
--
- static void sock_disable_timestamp(struct sock *sk, unsigned long flags)
- {
- if (sk->sk_flags & flags) {
-@@ -1253,8 +1256,25 @@ lenout:
- *
- * (We also register the sk_lock with the lock validator.)
- */
--static inline void sock_lock_init(struct sock *sk)
--{
-+void sock_lock_init(struct sock *sk)
-+{
-+#ifdef CONFIG_MPTCP
-+ /* Reclassify the lock-class for subflows */
-+ if (sk->sk_type == SOCK_STREAM && sk->sk_protocol == IPPROTO_TCP)
-+ if (mptcp(tcp_sk(sk)) || tcp_sk(sk)->is_master_sk) {
-+ sock_lock_init_class_and_name(sk, "slock-AF_INET-MPTCP",
-+ &meta_slock_key,
-+ "sk_lock-AF_INET-MPTCP",
-+ &meta_key);
-+
-+ /* We don't yet have the mptcp-point.
-+ * Thus we still need inet_sock_destruct
-+ */
-+ sk->sk_destruct = inet_sock_destruct;
-+ return;
-+ }
-+#endif
-+
- sock_lock_init_class_and_name(sk,
- af_family_slock_key_strings[sk->sk_family],
- af_family_slock_keys + sk->sk_family,
-@@ -1301,7 +1321,7 @@ void sk_prot_clear_portaddr_nulls(struct sock *sk, int size)
- }
- EXPORT_SYMBOL(sk_prot_clear_portaddr_nulls);
-
--static struct sock *sk_prot_alloc(struct proto *prot, gfp_t priority,
-+struct sock *sk_prot_alloc(struct proto *prot, gfp_t priority,
- int family)
- {
- struct sock *sk;
-diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
-index 4db3c2a1679c..04cb17d4b0ce 100644
---- a/net/dccp/ipv6.c
-+++ b/net/dccp/ipv6.c
-@@ -386,7 +386,7 @@ static int dccp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
- if (sk_acceptq_is_full(sk) && inet_csk_reqsk_queue_young(sk) > 1)
- goto drop;
-
-- req = inet6_reqsk_alloc(&dccp6_request_sock_ops);
-+ req = inet_reqsk_alloc(&dccp6_request_sock_ops);
- if (req == NULL)
- goto drop;
-
-diff --git a/net/ipv4/Kconfig b/net/ipv4/Kconfig
-index 05c57f0fcabe..630434db0085 100644
---- a/net/ipv4/Kconfig
-+++ b/net/ipv4/Kconfig
-@@ -556,6 +556,30 @@ config TCP_CONG_ILLINOIS
- For further details see:
- http://www.ews.uiuc.edu/~shaoliu/tcpillinois/index.html
-
-+config TCP_CONG_COUPLED
-+ tristate "MPTCP COUPLED CONGESTION CONTROL"
-+ depends on MPTCP
-+ default n
-+ ---help---
-+ MultiPath TCP Coupled Congestion Control
-+ To enable it, just put 'coupled' in tcp_congestion_control
-+
-+config TCP_CONG_OLIA
-+ tristate "MPTCP Opportunistic Linked Increase"
-+ depends on MPTCP
-+ default n
-+ ---help---
-+ MultiPath TCP Opportunistic Linked Increase Congestion Control
-+ To enable it, just put 'olia' in tcp_congestion_control
-+
-+config TCP_CONG_WVEGAS
-+ tristate "MPTCP WVEGAS CONGESTION CONTROL"
-+ depends on MPTCP
-+ default n
-+ ---help---
-+ wVegas congestion control for MPTCP
-+ To enable it, just put 'wvegas' in tcp_congestion_control
-+
- choice
- prompt "Default TCP congestion control"
- default DEFAULT_CUBIC
-@@ -584,6 +608,15 @@ choice
- config DEFAULT_WESTWOOD
- bool "Westwood" if TCP_CONG_WESTWOOD=y
-
-+ config DEFAULT_COUPLED
-+ bool "Coupled" if TCP_CONG_COUPLED=y
-+
-+ config DEFAULT_OLIA
-+ bool "Olia" if TCP_CONG_OLIA=y
-+
-+ config DEFAULT_WVEGAS
-+ bool "Wvegas" if TCP_CONG_WVEGAS=y
-+
- config DEFAULT_RENO
- bool "Reno"
-
-@@ -605,6 +638,8 @@ config DEFAULT_TCP_CONG
- default "vegas" if DEFAULT_VEGAS
- default "westwood" if DEFAULT_WESTWOOD
- default "veno" if DEFAULT_VENO
-+ default "coupled" if DEFAULT_COUPLED
-+ default "wvegas" if DEFAULT_WVEGAS
- default "reno" if DEFAULT_RENO
- default "cubic"
-
-diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
-index d156b3c5f363..4afd6d8d9028 100644
---- a/net/ipv4/af_inet.c
-+++ b/net/ipv4/af_inet.c
-@@ -104,6 +104,7 @@
- #include <net/ip_fib.h>
- #include <net/inet_connection_sock.h>
- #include <net/tcp.h>
-+#include <net/mptcp.h>
- #include <net/udp.h>
- #include <net/udplite.h>
- #include <net/ping.h>
-@@ -246,8 +247,7 @@ EXPORT_SYMBOL(inet_listen);
- * Create an inet socket.
- */
-
--static int inet_create(struct net *net, struct socket *sock, int protocol,
-- int kern)
-+int inet_create(struct net *net, struct socket *sock, int protocol, int kern)
- {
- struct sock *sk;
- struct inet_protosw *answer;
-@@ -676,6 +676,23 @@ int inet_accept(struct socket *sock, struct socket *newsock, int flags)
- lock_sock(sk2);
-
- sock_rps_record_flow(sk2);
-+
-+ if (sk2->sk_protocol == IPPROTO_TCP && mptcp(tcp_sk(sk2))) {
-+ struct sock *sk_it = sk2;
-+
-+ mptcp_for_each_sk(tcp_sk(sk2)->mpcb, sk_it)
-+ sock_rps_record_flow(sk_it);
-+
-+ if (tcp_sk(sk2)->mpcb->master_sk) {
-+ sk_it = tcp_sk(sk2)->mpcb->master_sk;
-+
-+ write_lock_bh(&sk_it->sk_callback_lock);
-+ sk_it->sk_wq = newsock->wq;
-+ sk_it->sk_socket = newsock;
-+ write_unlock_bh(&sk_it->sk_callback_lock);
-+ }
-+ }
-+
- WARN_ON(!((1 << sk2->sk_state) &
- (TCPF_ESTABLISHED | TCPF_SYN_RECV |
- TCPF_CLOSE_WAIT | TCPF_CLOSE)));
-@@ -1763,6 +1780,9 @@ static int __init inet_init(void)
-
- ip_init();
-
-+ /* We must initialize MPTCP before TCP. */
-+ mptcp_init();
-+
- tcp_v4_init();
-
- /* Setup TCP slab cache for open requests. */
-diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
-index 14d02ea905b6..7d734d8af19b 100644
---- a/net/ipv4/inet_connection_sock.c
-+++ b/net/ipv4/inet_connection_sock.c
-@@ -23,6 +23,7 @@
- #include <net/route.h>
- #include <net/tcp_states.h>
- #include <net/xfrm.h>
-+#include <net/mptcp.h>
-
- #ifdef INET_CSK_DEBUG
- const char inet_csk_timer_bug_msg[] = "inet_csk BUG: unknown timer value\n";
-@@ -465,8 +466,8 @@ no_route:
- }
- EXPORT_SYMBOL_GPL(inet_csk_route_child_sock);
-
--static inline u32 inet_synq_hash(const __be32 raddr, const __be16 rport,
-- const u32 rnd, const u32 synq_hsize)
-+u32 inet_synq_hash(const __be32 raddr, const __be16 rport, const u32 rnd,
-+ const u32 synq_hsize)
- {
- return jhash_2words((__force u32)raddr, (__force u32)rport, rnd) & (synq_hsize - 1);
- }
-@@ -647,7 +648,7 @@ void inet_csk_reqsk_queue_prune(struct sock *parent,
-
- lopt->clock_hand = i;
-
-- if (lopt->qlen)
-+ if (lopt->qlen && !is_meta_sk(parent))
- inet_csk_reset_keepalive_timer(parent, interval);
- }
- EXPORT_SYMBOL_GPL(inet_csk_reqsk_queue_prune);
-@@ -664,7 +665,9 @@ struct sock *inet_csk_clone_lock(const struct sock *sk,
- const struct request_sock *req,
- const gfp_t priority)
- {
-- struct sock *newsk = sk_clone_lock(sk, priority);
-+ struct sock *newsk;
-+
-+ newsk = sk_clone_lock(sk, priority);
-
- if (newsk != NULL) {
- struct inet_connection_sock *newicsk = inet_csk(newsk);
-@@ -743,7 +746,8 @@ int inet_csk_listen_start(struct sock *sk, const int nr_table_entries)
- {
- struct inet_sock *inet = inet_sk(sk);
- struct inet_connection_sock *icsk = inet_csk(sk);
-- int rc = reqsk_queue_alloc(&icsk->icsk_accept_queue, nr_table_entries);
-+ int rc = reqsk_queue_alloc(&icsk->icsk_accept_queue, nr_table_entries,
-+ GFP_KERNEL);
-
- if (rc != 0)
- return rc;
-@@ -801,9 +805,14 @@ void inet_csk_listen_stop(struct sock *sk)
-
- while ((req = acc_req) != NULL) {
- struct sock *child = req->sk;
-+ bool mutex_taken = false;
-
- acc_req = req->dl_next;
-
-+ if (is_meta_sk(child)) {
-+ mutex_lock(&tcp_sk(child)->mpcb->mpcb_mutex);
-+ mutex_taken = true;
-+ }
- local_bh_disable();
- bh_lock_sock(child);
- WARN_ON(sock_owned_by_user(child));
-@@ -832,6 +841,8 @@ void inet_csk_listen_stop(struct sock *sk)
-
- bh_unlock_sock(child);
- local_bh_enable();
-+ if (mutex_taken)
-+ mutex_unlock(&tcp_sk(child)->mpcb->mpcb_mutex);
- sock_put(child);
-
- sk_acceptq_removed(sk);
-diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
-index c86624b36a62..0ff3fe004d62 100644
---- a/net/ipv4/syncookies.c
-+++ b/net/ipv4/syncookies.c
-@@ -170,7 +170,8 @@ u32 __cookie_v4_init_sequence(const struct iphdr *iph, const struct tcphdr *th,
- }
- EXPORT_SYMBOL_GPL(__cookie_v4_init_sequence);
-
--__u32 cookie_v4_init_sequence(struct sock *sk, struct sk_buff *skb, __u16 *mssp)
-+__u32 cookie_v4_init_sequence(struct sock *sk, const struct sk_buff *skb,
-+ __u16 *mssp)
- {
- const struct iphdr *iph = ip_hdr(skb);
- const struct tcphdr *th = tcp_hdr(skb);
-@@ -284,7 +285,7 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb,
-
- /* check for timestamp cookie support */
- memset(&tcp_opt, 0, sizeof(tcp_opt));
-- tcp_parse_options(skb, &tcp_opt, 0, NULL);
-+ tcp_parse_options(skb, &tcp_opt, NULL, 0, NULL);
-
- if (!cookie_check_timestamp(&tcp_opt, sock_net(sk), &ecn_ok))
- goto out;
-@@ -355,10 +356,10 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb,
- /* Try to redo what tcp_v4_send_synack did. */
- req->window_clamp = tp->window_clamp ? :dst_metric(&rt->dst, RTAX_WINDOW);
-
-- tcp_select_initial_window(tcp_full_space(sk), req->mss,
-- &req->rcv_wnd, &req->window_clamp,
-- ireq->wscale_ok, &rcv_wscale,
-- dst_metric(&rt->dst, RTAX_INITRWND));
-+ tp->ops->select_initial_window(tcp_full_space(sk), req->mss,
-+ &req->rcv_wnd, &req->window_clamp,
-+ ireq->wscale_ok, &rcv_wscale,
-+ dst_metric(&rt->dst, RTAX_INITRWND), sk);
-
- ireq->rcv_wscale = rcv_wscale;
-
-diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
-index 9d2118e5fbc7..2cb89f886d45 100644
---- a/net/ipv4/tcp.c
-+++ b/net/ipv4/tcp.c
-@@ -271,6 +271,7 @@
-
- #include <net/icmp.h>
- #include <net/inet_common.h>
-+#include <net/mptcp.h>
- #include <net/tcp.h>
- #include <net/xfrm.h>
- #include <net/ip.h>
-@@ -371,6 +372,24 @@ static int retrans_to_secs(u8 retrans, int timeout, int rto_max)
- return period;
- }
-
-+const struct tcp_sock_ops tcp_specific = {
-+ .__select_window = __tcp_select_window,
-+ .select_window = tcp_select_window,
-+ .select_initial_window = tcp_select_initial_window,
-+ .init_buffer_space = tcp_init_buffer_space,
-+ .set_rto = tcp_set_rto,
-+ .should_expand_sndbuf = tcp_should_expand_sndbuf,
-+ .init_congestion_control = tcp_init_congestion_control,
-+ .send_fin = tcp_send_fin,
-+ .write_xmit = tcp_write_xmit,
-+ .send_active_reset = tcp_send_active_reset,
-+ .write_wakeup = tcp_write_wakeup,
-+ .prune_ofo_queue = tcp_prune_ofo_queue,
-+ .retransmit_timer = tcp_retransmit_timer,
-+ .time_wait = tcp_time_wait,
-+ .cleanup_rbuf = tcp_cleanup_rbuf,
-+};
-+
- /* Address-family independent initialization for a tcp_sock.
- *
- * NOTE: A lot of things set to zero explicitly by call to
-@@ -419,6 +438,8 @@ void tcp_init_sock(struct sock *sk)
- sk->sk_sndbuf = sysctl_tcp_wmem[1];
- sk->sk_rcvbuf = sysctl_tcp_rmem[1];
-
-+ tp->ops = &tcp_specific;
-+
- local_bh_disable();
- sock_update_memcg(sk);
- sk_sockets_allocated_inc(sk);
-@@ -726,6 +747,14 @@ ssize_t tcp_splice_read(struct socket *sock, loff_t *ppos,
- int ret;
-
- sock_rps_record_flow(sk);
-+
-+#ifdef CONFIG_MPTCP
-+ if (mptcp(tcp_sk(sk))) {
-+ struct sock *sk_it;
-+ mptcp_for_each_sk(tcp_sk(sk)->mpcb, sk_it)
-+ sock_rps_record_flow(sk_it);
-+ }
-+#endif
- /*
- * We can't seek on a socket input
- */
-@@ -821,8 +850,7 @@ struct sk_buff *sk_stream_alloc_skb(struct sock *sk, int size, gfp_t gfp)
- return NULL;
- }
-
--static unsigned int tcp_xmit_size_goal(struct sock *sk, u32 mss_now,
-- int large_allowed)
-+unsigned int tcp_xmit_size_goal(struct sock *sk, u32 mss_now, int large_allowed)
- {
- struct tcp_sock *tp = tcp_sk(sk);
- u32 xmit_size_goal, old_size_goal;
-@@ -872,8 +900,13 @@ static int tcp_send_mss(struct sock *sk, int *size_goal, int flags)
- {
- int mss_now;
-
-- mss_now = tcp_current_mss(sk);
-- *size_goal = tcp_xmit_size_goal(sk, mss_now, !(flags & MSG_OOB));
-+ if (mptcp(tcp_sk(sk))) {
-+ mss_now = mptcp_current_mss(sk);
-+ *size_goal = mptcp_xmit_size_goal(sk, mss_now, !(flags & MSG_OOB));
-+ } else {
-+ mss_now = tcp_current_mss(sk);
-+ *size_goal = tcp_xmit_size_goal(sk, mss_now, !(flags & MSG_OOB));
-+ }
-
- return mss_now;
- }
-@@ -892,11 +925,32 @@ static ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
- * is fully established.
- */
- if (((1 << sk->sk_state) & ~(TCPF_ESTABLISHED | TCPF_CLOSE_WAIT)) &&
-- !tcp_passive_fastopen(sk)) {
-+ !tcp_passive_fastopen(mptcp(tp) && tp->mpcb->master_sk ?
-+ tp->mpcb->master_sk : sk)) {
- if ((err = sk_stream_wait_connect(sk, &timeo)) != 0)
- goto out_err;
- }
-
-+ if (mptcp(tp)) {
-+ struct sock *sk_it = sk;
-+
-+ /* We must check this with socket-lock hold because we iterate
-+ * over the subflows.
-+ */
-+ if (!mptcp_can_sendpage(sk)) {
-+ ssize_t ret;
-+
-+ release_sock(sk);
-+ ret = sock_no_sendpage(sk->sk_socket, page, offset,
-+ size, flags);
-+ lock_sock(sk);
-+ return ret;
-+ }
-+
-+ mptcp_for_each_sk(tp->mpcb, sk_it)
-+ sock_rps_record_flow(sk_it);
-+ }
-+
- clear_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags);
-
- mss_now = tcp_send_mss(sk, &size_goal, flags);
-@@ -1001,8 +1055,9 @@ int tcp_sendpage(struct sock *sk, struct page *page, int offset,
- {
- ssize_t res;
-
-- if (!(sk->sk_route_caps & NETIF_F_SG) ||
-- !(sk->sk_route_caps & NETIF_F_ALL_CSUM))
-+ /* If MPTCP is enabled, we check it later after establishment */
-+ if (!mptcp(tcp_sk(sk)) && (!(sk->sk_route_caps & NETIF_F_SG) ||
-+ !(sk->sk_route_caps & NETIF_F_ALL_CSUM)))
- return sock_no_sendpage(sk->sk_socket, page, offset, size,
- flags);
-
-@@ -1018,6 +1073,9 @@ static inline int select_size(const struct sock *sk, bool sg)
- const struct tcp_sock *tp = tcp_sk(sk);
- int tmp = tp->mss_cache;
-
-+ if (mptcp(tp))
-+ return mptcp_select_size(sk, sg);
-+
- if (sg) {
- if (sk_can_gso(sk)) {
- /* Small frames wont use a full page:
-@@ -1100,11 +1158,18 @@ int tcp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
- * is fully established.
- */
- if (((1 << sk->sk_state) & ~(TCPF_ESTABLISHED | TCPF_CLOSE_WAIT)) &&
-- !tcp_passive_fastopen(sk)) {
-+ !tcp_passive_fastopen(mptcp(tp) && tp->mpcb->master_sk ?
-+ tp->mpcb->master_sk : sk)) {
- if ((err = sk_stream_wait_connect(sk, &timeo)) != 0)
- goto do_error;
- }
-
-+ if (mptcp(tp)) {
-+ struct sock *sk_it = sk;
-+ mptcp_for_each_sk(tp->mpcb, sk_it)
-+ sock_rps_record_flow(sk_it);
-+ }
-+
- if (unlikely(tp->repair)) {
- if (tp->repair_queue == TCP_RECV_QUEUE) {
- copied = tcp_send_rcvq(sk, msg, size);
-@@ -1132,7 +1197,10 @@ int tcp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
- if (sk->sk_err || (sk->sk_shutdown & SEND_SHUTDOWN))
- goto out_err;
-
-- sg = !!(sk->sk_route_caps & NETIF_F_SG);
-+ if (mptcp(tp))
-+ sg = mptcp_can_sg(sk);
-+ else
-+ sg = !!(sk->sk_route_caps & NETIF_F_SG);
-
- while (--iovlen >= 0) {
- size_t seglen = iov->iov_len;
-@@ -1183,8 +1251,15 @@ new_segment:
-
- /*
- * Check whether we can use HW checksum.
-+ *
-+ * If dss-csum is enabled, we do not do hw-csum.
-+ * In case of non-mptcp we check the
-+ * device-capabilities.
-+ * In case of mptcp, hw-csum's will be handled
-+ * later in mptcp_write_xmit.
- */
-- if (sk->sk_route_caps & NETIF_F_ALL_CSUM)
-+ if (((mptcp(tp) && !tp->mpcb->dss_csum) || !mptcp(tp)) &&
-+ (mptcp(tp) || sk->sk_route_caps & NETIF_F_ALL_CSUM))
- skb->ip_summed = CHECKSUM_PARTIAL;
-
- skb_entail(sk, skb);
-@@ -1422,7 +1497,7 @@ void tcp_cleanup_rbuf(struct sock *sk, int copied)
-
- /* Optimize, __tcp_select_window() is not cheap. */
- if (2*rcv_window_now <= tp->window_clamp) {
-- __u32 new_window = __tcp_select_window(sk);
-+ __u32 new_window = tp->ops->__select_window(sk);
-
- /* Send ACK now, if this read freed lots of space
- * in our buffer. Certainly, new_window is new window.
-@@ -1587,7 +1662,7 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
- /* Clean up data we have read: This will do ACK frames. */
- if (copied > 0) {
- tcp_recv_skb(sk, seq, &offset);
-- tcp_cleanup_rbuf(sk, copied);
-+ tp->ops->cleanup_rbuf(sk, copied);
- }
- return copied;
- }
-@@ -1623,6 +1698,14 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
-
- lock_sock(sk);
-
-+#ifdef CONFIG_MPTCP
-+ if (mptcp(tp)) {
-+ struct sock *sk_it;
-+ mptcp_for_each_sk(tp->mpcb, sk_it)
-+ sock_rps_record_flow(sk_it);
-+ }
-+#endif
-+
- err = -ENOTCONN;
- if (sk->sk_state == TCP_LISTEN)
- goto out;
-@@ -1761,7 +1844,7 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
- }
- }
-
-- tcp_cleanup_rbuf(sk, copied);
-+ tp->ops->cleanup_rbuf(sk, copied);
-
- if (!sysctl_tcp_low_latency && tp->ucopy.task == user_recv) {
- /* Install new reader */
-@@ -1813,7 +1896,7 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
- if (tp->rcv_wnd == 0 &&
- !skb_queue_empty(&sk->sk_async_wait_queue)) {
- tcp_service_net_dma(sk, true);
-- tcp_cleanup_rbuf(sk, copied);
-+ tp->ops->cleanup_rbuf(sk, copied);
- } else
- dma_async_issue_pending(tp->ucopy.dma_chan);
- }
-@@ -1993,7 +2076,7 @@ skip_copy:
- */
-
- /* Clean up data we have read: This will do ACK frames. */
-- tcp_cleanup_rbuf(sk, copied);
-+ tp->ops->cleanup_rbuf(sk, copied);
-
- release_sock(sk);
- return copied;
-@@ -2070,7 +2153,7 @@ static const unsigned char new_state[16] = {
- /* TCP_CLOSING */ TCP_CLOSING,
- };
-
--static int tcp_close_state(struct sock *sk)
-+int tcp_close_state(struct sock *sk)
- {
- int next = (int)new_state[sk->sk_state];
- int ns = next & TCP_STATE_MASK;
-@@ -2100,7 +2183,7 @@ void tcp_shutdown(struct sock *sk, int how)
- TCPF_SYN_RECV | TCPF_CLOSE_WAIT)) {
- /* Clear out any half completed packets. FIN if needed. */
- if (tcp_close_state(sk))
-- tcp_send_fin(sk);
-+ tcp_sk(sk)->ops->send_fin(sk);
- }
- }
- EXPORT_SYMBOL(tcp_shutdown);
-@@ -2125,6 +2208,11 @@ void tcp_close(struct sock *sk, long timeout)
- int data_was_unread = 0;
- int state;
-
-+ if (is_meta_sk(sk)) {
-+ mptcp_close(sk, timeout);
-+ return;
-+ }
-+
- lock_sock(sk);
- sk->sk_shutdown = SHUTDOWN_MASK;
-
-@@ -2167,7 +2255,7 @@ void tcp_close(struct sock *sk, long timeout)
- /* Unread data was tossed, zap the connection. */
- NET_INC_STATS_USER(sock_net(sk), LINUX_MIB_TCPABORTONCLOSE);
- tcp_set_state(sk, TCP_CLOSE);
-- tcp_send_active_reset(sk, sk->sk_allocation);
-+ tcp_sk(sk)->ops->send_active_reset(sk, sk->sk_allocation);
- } else if (sock_flag(sk, SOCK_LINGER) && !sk->sk_lingertime) {
- /* Check zero linger _after_ checking for unread data. */
- sk->sk_prot->disconnect(sk, 0);
-@@ -2247,7 +2335,7 @@ adjudge_to_death:
- struct tcp_sock *tp = tcp_sk(sk);
- if (tp->linger2 < 0) {
- tcp_set_state(sk, TCP_CLOSE);
-- tcp_send_active_reset(sk, GFP_ATOMIC);
-+ tp->ops->send_active_reset(sk, GFP_ATOMIC);
- NET_INC_STATS_BH(sock_net(sk),
- LINUX_MIB_TCPABORTONLINGER);
- } else {
-@@ -2257,7 +2345,8 @@ adjudge_to_death:
- inet_csk_reset_keepalive_timer(sk,
- tmo - TCP_TIMEWAIT_LEN);
- } else {
-- tcp_time_wait(sk, TCP_FIN_WAIT2, tmo);
-+ tcp_sk(sk)->ops->time_wait(sk, TCP_FIN_WAIT2,
-+ tmo);
- goto out;
- }
- }
-@@ -2266,7 +2355,7 @@ adjudge_to_death:
- sk_mem_reclaim(sk);
- if (tcp_check_oom(sk, 0)) {
- tcp_set_state(sk, TCP_CLOSE);
-- tcp_send_active_reset(sk, GFP_ATOMIC);
-+ tcp_sk(sk)->ops->send_active_reset(sk, GFP_ATOMIC);
- NET_INC_STATS_BH(sock_net(sk),
- LINUX_MIB_TCPABORTONMEMORY);
- }
-@@ -2291,15 +2380,6 @@ out:
- }
- EXPORT_SYMBOL(tcp_close);
-
--/* These states need RST on ABORT according to RFC793 */
--
--static inline bool tcp_need_reset(int state)
--{
-- return (1 << state) &
-- (TCPF_ESTABLISHED | TCPF_CLOSE_WAIT | TCPF_FIN_WAIT1 |
-- TCPF_FIN_WAIT2 | TCPF_SYN_RECV);
--}
--
- int tcp_disconnect(struct sock *sk, int flags)
- {
- struct inet_sock *inet = inet_sk(sk);
-@@ -2322,7 +2402,7 @@ int tcp_disconnect(struct sock *sk, int flags)
- /* The last check adjusts for discrepancy of Linux wrt. RFC
- * states
- */
-- tcp_send_active_reset(sk, gfp_any());
-+ tp->ops->send_active_reset(sk, gfp_any());
- sk->sk_err = ECONNRESET;
- } else if (old_state == TCP_SYN_SENT)
- sk->sk_err = ECONNRESET;
-@@ -2340,6 +2420,13 @@ int tcp_disconnect(struct sock *sk, int flags)
- if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK))
- inet_reset_saddr(sk);
-
-+ if (is_meta_sk(sk)) {
-+ mptcp_disconnect(sk);
-+ } else {
-+ if (tp->inside_tk_table)
-+ mptcp_hash_remove_bh(tp);
-+ }
-+
- sk->sk_shutdown = 0;
- sock_reset_flag(sk, SOCK_DONE);
- tp->srtt_us = 0;
-@@ -2632,6 +2719,12 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
- break;
-
- case TCP_DEFER_ACCEPT:
-+ /* An established MPTCP-connection (mptcp(tp) only returns true
-+ * if the socket is established) should not use DEFER on new
-+ * subflows.
-+ */
-+ if (mptcp(tp))
-+ break;
- /* Translate value in seconds to number of retransmits */
- icsk->icsk_accept_queue.rskq_defer_accept =
- secs_to_retrans(val, TCP_TIMEOUT_INIT / HZ,
-@@ -2659,7 +2752,7 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
- (TCPF_ESTABLISHED | TCPF_CLOSE_WAIT) &&
- inet_csk_ack_scheduled(sk)) {
- icsk->icsk_ack.pending |= ICSK_ACK_PUSHED;
-- tcp_cleanup_rbuf(sk, 1);
-+ tp->ops->cleanup_rbuf(sk, 1);
- if (!(val & 1))
- icsk->icsk_ack.pingpong = 1;
- }
-@@ -2699,6 +2792,18 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
- tp->notsent_lowat = val;
- sk->sk_write_space(sk);
- break;
-+#ifdef CONFIG_MPTCP
-+ case MPTCP_ENABLED:
-+ if (sk->sk_state == TCP_CLOSE || sk->sk_state == TCP_LISTEN) {
-+ if (val)
-+ tp->mptcp_enabled = 1;
-+ else
-+ tp->mptcp_enabled = 0;
-+ } else {
-+ err = -EPERM;
-+ }
-+ break;
-+#endif
- default:
- err = -ENOPROTOOPT;
- break;
-@@ -2931,6 +3036,11 @@ static int do_tcp_getsockopt(struct sock *sk, int level,
- case TCP_NOTSENT_LOWAT:
- val = tp->notsent_lowat;
- break;
-+#ifdef CONFIG_MPTCP
-+ case MPTCP_ENABLED:
-+ val = tp->mptcp_enabled;
-+ break;
-+#endif
- default:
- return -ENOPROTOOPT;
- }
-@@ -3120,8 +3230,11 @@ void tcp_done(struct sock *sk)
- if (sk->sk_state == TCP_SYN_SENT || sk->sk_state == TCP_SYN_RECV)
- TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_ATTEMPTFAILS);
-
-+ WARN_ON(sk->sk_state == TCP_CLOSE);
- tcp_set_state(sk, TCP_CLOSE);
-+
- tcp_clear_xmit_timers(sk);
-+
- if (req != NULL)
- reqsk_fastopen_remove(sk, req, false);
-
-diff --git a/net/ipv4/tcp_fastopen.c b/net/ipv4/tcp_fastopen.c
-index 9771563ab564..5c230d96c4c1 100644
---- a/net/ipv4/tcp_fastopen.c
-+++ b/net/ipv4/tcp_fastopen.c
-@@ -7,6 +7,7 @@
- #include <linux/rculist.h>
- #include <net/inetpeer.h>
- #include <net/tcp.h>
-+#include <net/mptcp.h>
-
- int sysctl_tcp_fastopen __read_mostly = TFO_CLIENT_ENABLE;
-
-@@ -133,7 +134,7 @@ static bool tcp_fastopen_create_child(struct sock *sk,
- {
- struct tcp_sock *tp;
- struct request_sock_queue *queue = &inet_csk(sk)->icsk_accept_queue;
-- struct sock *child;
-+ struct sock *child, *meta_sk;
-
- req->num_retrans = 0;
- req->num_timeout = 0;
-@@ -176,13 +177,6 @@ static bool tcp_fastopen_create_child(struct sock *sk,
- /* Add the child socket directly into the accept queue */
- inet_csk_reqsk_queue_add(sk, req, child);
-
-- /* Now finish processing the fastopen child socket. */
-- inet_csk(child)->icsk_af_ops->rebuild_header(child);
-- tcp_init_congestion_control(child);
-- tcp_mtup_init(child);
-- tcp_init_metrics(child);
-- tcp_init_buffer_space(child);
--
- /* Queue the data carried in the SYN packet. We need to first
- * bump skb's refcnt because the caller will attempt to free it.
- *
-@@ -199,8 +193,24 @@ static bool tcp_fastopen_create_child(struct sock *sk,
- tp->syn_data_acked = 1;
- }
- tcp_rsk(req)->rcv_nxt = tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq;
-+
-+ meta_sk = child;
-+ if (!mptcp_check_req_fastopen(meta_sk, req)) {
-+ child = tcp_sk(meta_sk)->mpcb->master_sk;
-+ tp = tcp_sk(child);
-+ }
-+
-+ /* Now finish processing the fastopen child socket. */
-+ inet_csk(child)->icsk_af_ops->rebuild_header(child);
-+ tp->ops->init_congestion_control(child);
-+ tcp_mtup_init(child);
-+ tcp_init_metrics(child);
-+ tp->ops->init_buffer_space(child);
-+
- sk->sk_data_ready(sk);
-- bh_unlock_sock(child);
-+ if (mptcp(tcp_sk(child)))
-+ bh_unlock_sock(child);
-+ bh_unlock_sock(meta_sk);
- sock_put(child);
- WARN_ON(req->sk == NULL);
- return true;
-diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
-index 40639c288dc2..3273bb69f387 100644
---- a/net/ipv4/tcp_input.c
-+++ b/net/ipv4/tcp_input.c
-@@ -74,6 +74,9 @@
- #include <linux/ipsec.h>
- #include <asm/unaligned.h>
- #include <net/netdma.h>
-+#include <net/mptcp.h>
-+#include <net/mptcp_v4.h>
-+#include <net/mptcp_v6.h>
-
- int sysctl_tcp_timestamps __read_mostly = 1;
- int sysctl_tcp_window_scaling __read_mostly = 1;
-@@ -99,25 +102,6 @@ int sysctl_tcp_thin_dupack __read_mostly;
- int sysctl_tcp_moderate_rcvbuf __read_mostly = 1;
- int sysctl_tcp_early_retrans __read_mostly = 3;
-
--#define FLAG_DATA 0x01 /* Incoming frame contained data. */
--#define FLAG_WIN_UPDATE 0x02 /* Incoming ACK was a window update. */
--#define FLAG_DATA_ACKED 0x04 /* This ACK acknowledged new data. */
--#define FLAG_RETRANS_DATA_ACKED 0x08 /* "" "" some of which was retransmitted. */
--#define FLAG_SYN_ACKED 0x10 /* This ACK acknowledged SYN. */
--#define FLAG_DATA_SACKED 0x20 /* New SACK. */
--#define FLAG_ECE 0x40 /* ECE in this ACK */
--#define FLAG_SLOWPATH 0x100 /* Do not skip RFC checks for window update.*/
--#define FLAG_ORIG_SACK_ACKED 0x200 /* Never retransmitted data are (s)acked */
--#define FLAG_SND_UNA_ADVANCED 0x400 /* Snd_una was changed (!= FLAG_DATA_ACKED) */
--#define FLAG_DSACKING_ACK 0x800 /* SACK blocks contained D-SACK info */
--#define FLAG_SACK_RENEGING 0x2000 /* snd_una advanced to a sacked seq */
--#define FLAG_UPDATE_TS_RECENT 0x4000 /* tcp_replace_ts_recent() */
--
--#define FLAG_ACKED (FLAG_DATA_ACKED|FLAG_SYN_ACKED)
--#define FLAG_NOT_DUP (FLAG_DATA|FLAG_WIN_UPDATE|FLAG_ACKED)
--#define FLAG_CA_ALERT (FLAG_DATA_SACKED|FLAG_ECE)
--#define FLAG_FORWARD_PROGRESS (FLAG_ACKED|FLAG_DATA_SACKED)
--
- #define TCP_REMNANT (TCP_FLAG_FIN|TCP_FLAG_URG|TCP_FLAG_SYN|TCP_FLAG_PSH)
- #define TCP_HP_BITS (~(TCP_RESERVED_BITS|TCP_FLAG_PSH))
-
-@@ -181,7 +165,7 @@ static void tcp_incr_quickack(struct sock *sk)
- icsk->icsk_ack.quick = min(quickacks, TCP_MAX_QUICKACKS);
- }
-
--static void tcp_enter_quickack_mode(struct sock *sk)
-+void tcp_enter_quickack_mode(struct sock *sk)
- {
- struct inet_connection_sock *icsk = inet_csk(sk);
- tcp_incr_quickack(sk);
-@@ -283,8 +267,12 @@ static void tcp_sndbuf_expand(struct sock *sk)
- per_mss = roundup_pow_of_two(per_mss) +
- SKB_DATA_ALIGN(sizeof(struct sk_buff));
-
-- nr_segs = max_t(u32, TCP_INIT_CWND, tp->snd_cwnd);
-- nr_segs = max_t(u32, nr_segs, tp->reordering + 1);
-+ if (mptcp(tp)) {
-+ nr_segs = mptcp_check_snd_buf(tp);
-+ } else {
-+ nr_segs = max_t(u32, TCP_INIT_CWND, tp->snd_cwnd);
-+ nr_segs = max_t(u32, nr_segs, tp->reordering + 1);
-+ }
-
- /* Fast Recovery (RFC 5681 3.2) :
- * Cubic needs 1.7 factor, rounded to 2 to include
-@@ -292,8 +280,16 @@ static void tcp_sndbuf_expand(struct sock *sk)
- */
- sndmem = 2 * nr_segs * per_mss;
-
-- if (sk->sk_sndbuf < sndmem)
-+ /* MPTCP: after this sndmem is the new contribution of the
-+ * current subflow to the aggregated sndbuf */
-+ if (sk->sk_sndbuf < sndmem) {
-+ int old_sndbuf = sk->sk_sndbuf;
- sk->sk_sndbuf = min(sndmem, sysctl_tcp_wmem[2]);
-+ /* MPTCP: ok, the subflow sndbuf has grown, reflect
-+ * this in the aggregate buffer. */
-+ if (mptcp(tp) && old_sndbuf != sk->sk_sndbuf)
-+ mptcp_update_sndbuf(tp);
-+ }
- }
-
- /* 2. Tuning advertised window (window_clamp, rcv_ssthresh)
-@@ -342,10 +338,12 @@ static int __tcp_grow_window(const struct sock *sk, const struct sk_buff *skb)
- static void tcp_grow_window(struct sock *sk, const struct sk_buff *skb)
- {
- struct tcp_sock *tp = tcp_sk(sk);
-+ struct sock *meta_sk = mptcp(tp) ? mptcp_meta_sk(sk) : sk;
-+ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
-
- /* Check #1 */
-- if (tp->rcv_ssthresh < tp->window_clamp &&
-- (int)tp->rcv_ssthresh < tcp_space(sk) &&
-+ if (meta_tp->rcv_ssthresh < meta_tp->window_clamp &&
-+ (int)meta_tp->rcv_ssthresh < tcp_space(sk) &&
- !sk_under_memory_pressure(sk)) {
- int incr;
-
-@@ -353,14 +351,14 @@ static void tcp_grow_window(struct sock *sk, const struct sk_buff *skb)
- * will fit to rcvbuf in future.
- */
- if (tcp_win_from_space(skb->truesize) <= skb->len)
-- incr = 2 * tp->advmss;
-+ incr = 2 * meta_tp->advmss;
- else
-- incr = __tcp_grow_window(sk, skb);
-+ incr = __tcp_grow_window(meta_sk, skb);
-
- if (incr) {
- incr = max_t(int, incr, 2 * skb->len);
-- tp->rcv_ssthresh = min(tp->rcv_ssthresh + incr,
-- tp->window_clamp);
-+ meta_tp->rcv_ssthresh = min(meta_tp->rcv_ssthresh + incr,
-+ meta_tp->window_clamp);
- inet_csk(sk)->icsk_ack.quick |= 1;
- }
- }
-@@ -543,7 +541,10 @@ void tcp_rcv_space_adjust(struct sock *sk)
- int copied;
-
- time = tcp_time_stamp - tp->rcvq_space.time;
-- if (time < (tp->rcv_rtt_est.rtt >> 3) || tp->rcv_rtt_est.rtt == 0)
-+ if (mptcp(tp)) {
-+ if (mptcp_check_rtt(tp, time))
-+ return;
-+ } else if (time < (tp->rcv_rtt_est.rtt >> 3) || tp->rcv_rtt_est.rtt == 0)
- return;
-
- /* Number of bytes copied to user in last RTT */
-@@ -761,7 +762,7 @@ static void tcp_update_pacing_rate(struct sock *sk)
- /* Calculate rto without backoff. This is the second half of Van Jacobson's
- * routine referred to above.
- */
--static void tcp_set_rto(struct sock *sk)
-+void tcp_set_rto(struct sock *sk)
- {
- const struct tcp_sock *tp = tcp_sk(sk);
- /* Old crap is replaced with new one. 8)
-@@ -1376,7 +1377,11 @@ static struct sk_buff *tcp_shift_skb_data(struct sock *sk, struct sk_buff *skb,
- int len;
- int in_sack;
-
-- if (!sk_can_gso(sk))
-+ /* For MPTCP we cannot shift skb-data and remove one skb from the
-+ * send-queue, because this will make us lose the DSS-option (which
-+ * is stored in TCP_SKB_CB(skb)->dss) of the skb we are removing.
-+ */
-+ if (!sk_can_gso(sk) || mptcp(tp))
- goto fallback;
-
- /* Normally R but no L won't result in plain S */
-@@ -2915,7 +2920,7 @@ static inline bool tcp_ack_update_rtt(struct sock *sk, const int flag,
- return false;
-
- tcp_rtt_estimator(sk, seq_rtt_us);
-- tcp_set_rto(sk);
-+ tp->ops->set_rto(sk);
-
- /* RFC6298: only reset backoff on valid RTT measurement. */
- inet_csk(sk)->icsk_backoff = 0;
-@@ -3000,7 +3005,7 @@ void tcp_resume_early_retransmit(struct sock *sk)
- }
-
- /* If we get here, the whole TSO packet has not been acked. */
--static u32 tcp_tso_acked(struct sock *sk, struct sk_buff *skb)
-+u32 tcp_tso_acked(struct sock *sk, struct sk_buff *skb)
- {
- struct tcp_sock *tp = tcp_sk(sk);
- u32 packets_acked;
-@@ -3095,6 +3100,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, int prior_fackets,
- */
- if (!(scb->tcp_flags & TCPHDR_SYN)) {
- flag |= FLAG_DATA_ACKED;
-+ if (mptcp(tp) && mptcp_is_data_seq(skb))
-+ flag |= MPTCP_FLAG_DATA_ACKED;
- } else {
- flag |= FLAG_SYN_ACKED;
- tp->retrans_stamp = 0;
-@@ -3189,7 +3196,7 @@ static int tcp_clean_rtx_queue(struct sock *sk, int prior_fackets,
- return flag;
- }
-
--static void tcp_ack_probe(struct sock *sk)
-+void tcp_ack_probe(struct sock *sk)
- {
- const struct tcp_sock *tp = tcp_sk(sk);
- struct inet_connection_sock *icsk = inet_csk(sk);
-@@ -3236,9 +3243,8 @@ static inline bool tcp_may_raise_cwnd(const struct sock *sk, const int flag)
- /* Check that window update is acceptable.
- * The function assumes that snd_una<=ack<=snd_next.
- */
--static inline bool tcp_may_update_window(const struct tcp_sock *tp,
-- const u32 ack, const u32 ack_seq,
-- const u32 nwin)
-+bool tcp_may_update_window(const struct tcp_sock *tp, const u32 ack,
-+ const u32 ack_seq, const u32 nwin)
- {
- return after(ack, tp->snd_una) ||
- after(ack_seq, tp->snd_wl1) ||
-@@ -3357,7 +3363,7 @@ static void tcp_process_tlp_ack(struct sock *sk, u32 ack, int flag)
- }
-
- /* This routine deals with incoming acks, but not outgoing ones. */
--static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
-+static int tcp_ack(struct sock *sk, struct sk_buff *skb, int flag)
- {
- struct inet_connection_sock *icsk = inet_csk(sk);
- struct tcp_sock *tp = tcp_sk(sk);
-@@ -3449,6 +3455,16 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
- sack_rtt_us);
- acked -= tp->packets_out;
-
-+ if (mptcp(tp)) {
-+ if (mptcp_fallback_infinite(sk, flag)) {
-+ pr_err("%s resetting flow\n", __func__);
-+ mptcp_send_reset(sk);
-+ goto invalid_ack;
-+ }
-+
-+ mptcp_clean_rtx_infinite(skb, sk);
-+ }
-+
- /* Advance cwnd if state allows */
- if (tcp_may_raise_cwnd(sk, flag))
- tcp_cong_avoid(sk, ack, acked);
-@@ -3512,8 +3528,9 @@ old_ack:
- * the fast version below fails.
- */
- void tcp_parse_options(const struct sk_buff *skb,
-- struct tcp_options_received *opt_rx, int estab,
-- struct tcp_fastopen_cookie *foc)
-+ struct tcp_options_received *opt_rx,
-+ struct mptcp_options_received *mopt,
-+ int estab, struct tcp_fastopen_cookie *foc)
- {
- const unsigned char *ptr;
- const struct tcphdr *th = tcp_hdr(skb);
-@@ -3596,6 +3613,9 @@ void tcp_parse_options(const struct sk_buff *skb,
- */
- break;
- #endif
-+ case TCPOPT_MPTCP:
-+ mptcp_parse_options(ptr - 2, opsize, mopt, skb);
-+ break;
- case TCPOPT_EXP:
- /* Fast Open option shares code 254 using a
- * 16 bits magic number. It's valid only in
-@@ -3657,8 +3677,8 @@ static bool tcp_fast_parse_options(const struct sk_buff *skb,
- if (tcp_parse_aligned_timestamp(tp, th))
- return true;
- }
--
-- tcp_parse_options(skb, &tp->rx_opt, 1, NULL);
-+ tcp_parse_options(skb, &tp->rx_opt, mptcp(tp) ? &tp->mptcp->rx_opt : NULL,
-+ 1, NULL);
- if (tp->rx_opt.saw_tstamp && tp->rx_opt.rcv_tsecr)
- tp->rx_opt.rcv_tsecr -= tp->tsoffset;
-
-@@ -3831,6 +3851,8 @@ static void tcp_fin(struct sock *sk)
- dst = __sk_dst_get(sk);
- if (!dst || !dst_metric(dst, RTAX_QUICKACK))
- inet_csk(sk)->icsk_ack.pingpong = 1;
-+ if (mptcp(tp))
-+ mptcp_sub_close_passive(sk);
- break;
-
- case TCP_CLOSE_WAIT:
-@@ -3852,9 +3874,16 @@ static void tcp_fin(struct sock *sk)
- tcp_set_state(sk, TCP_CLOSING);
- break;
- case TCP_FIN_WAIT2:
-+ if (mptcp(tp)) {
-+ /* The socket will get closed by mptcp_data_ready.
-+ * We first have to process all data-sequences.
-+ */
-+ tp->close_it = 1;
-+ break;
-+ }
- /* Received a FIN -- send ACK and enter TIME_WAIT. */
- tcp_send_ack(sk);
-- tcp_time_wait(sk, TCP_TIME_WAIT, 0);
-+ tp->ops->time_wait(sk, TCP_TIME_WAIT, 0);
- break;
- default:
- /* Only TCP_LISTEN and TCP_CLOSE are left, in these
-@@ -3876,6 +3905,10 @@ static void tcp_fin(struct sock *sk)
- if (!sock_flag(sk, SOCK_DEAD)) {
- sk->sk_state_change(sk);
-
-+ /* Don't wake up MPTCP-subflows */
-+ if (mptcp(tp))
-+ return;
-+
- /* Do not send POLL_HUP for half duplex close. */
- if (sk->sk_shutdown == SHUTDOWN_MASK ||
- sk->sk_state == TCP_CLOSE)
-@@ -4073,7 +4106,11 @@ static void tcp_ofo_queue(struct sock *sk)
- tcp_dsack_extend(sk, TCP_SKB_CB(skb)->seq, dsack);
- }
-
-- if (!after(TCP_SKB_CB(skb)->end_seq, tp->rcv_nxt)) {
-+ /* In case of MPTCP, the segment may be empty if it's a
-+ * non-data DATA_FIN. (see beginning of tcp_data_queue)
-+ */
-+ if (!after(TCP_SKB_CB(skb)->end_seq, tp->rcv_nxt) &&
-+ !(mptcp(tp) && TCP_SKB_CB(skb)->end_seq == TCP_SKB_CB(skb)->seq)) {
- SOCK_DEBUG(sk, "ofo packet was already received\n");
- __skb_unlink(skb, &tp->out_of_order_queue);
- __kfree_skb(skb);
-@@ -4091,12 +4128,14 @@ static void tcp_ofo_queue(struct sock *sk)
- }
- }
-
--static bool tcp_prune_ofo_queue(struct sock *sk);
- static int tcp_prune_queue(struct sock *sk);
-
- static int tcp_try_rmem_schedule(struct sock *sk, struct sk_buff *skb,
- unsigned int size)
- {
-+ if (mptcp(tcp_sk(sk)))
-+ sk = mptcp_meta_sk(sk);
-+
- if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf ||
- !sk_rmem_schedule(sk, skb, size)) {
-
-@@ -4104,7 +4143,7 @@ static int tcp_try_rmem_schedule(struct sock *sk, struct sk_buff *skb,
- return -1;
-
- if (!sk_rmem_schedule(sk, skb, size)) {
-- if (!tcp_prune_ofo_queue(sk))
-+ if (!tcp_sk(sk)->ops->prune_ofo_queue(sk))
- return -1;
-
- if (!sk_rmem_schedule(sk, skb, size))
-@@ -4127,15 +4166,16 @@ static int tcp_try_rmem_schedule(struct sock *sk, struct sk_buff *skb,
- * Better try to coalesce them right now to avoid future collapses.
- * Returns true if caller should free @from instead of queueing it
- */
--static bool tcp_try_coalesce(struct sock *sk,
-- struct sk_buff *to,
-- struct sk_buff *from,
-- bool *fragstolen)
-+bool tcp_try_coalesce(struct sock *sk, struct sk_buff *to, struct sk_buff *from,
-+ bool *fragstolen)
- {
- int delta;
-
- *fragstolen = false;
-
-+ if (mptcp(tcp_sk(sk)) && !is_meta_sk(sk))
-+ return false;
-+
- if (tcp_hdr(from)->fin)
- return false;
-
-@@ -4225,7 +4265,9 @@ static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
-
- /* Do skb overlap to previous one? */
- if (skb1 && before(seq, TCP_SKB_CB(skb1)->end_seq)) {
-- if (!after(end_seq, TCP_SKB_CB(skb1)->end_seq)) {
-+ /* MPTCP allows non-data data-fin to be in the ofo-queue */
-+ if (!after(end_seq, TCP_SKB_CB(skb1)->end_seq) &&
-+ !(mptcp(tp) && end_seq == seq)) {
- /* All the bits are present. Drop. */
- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPOFOMERGE);
- __kfree_skb(skb);
-@@ -4263,6 +4305,9 @@ static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
- end_seq);
- break;
- }
-+ /* MPTCP allows non-data data-fin to be in the ofo-queue */
-+ if (mptcp(tp) && TCP_SKB_CB(skb1)->seq == TCP_SKB_CB(skb1)->end_seq)
-+ continue;
- __skb_unlink(skb1, &tp->out_of_order_queue);
- tcp_dsack_extend(sk, TCP_SKB_CB(skb1)->seq,
- TCP_SKB_CB(skb1)->end_seq);
-@@ -4280,8 +4325,8 @@ end:
- }
- }
-
--static int __must_check tcp_queue_rcv(struct sock *sk, struct sk_buff *skb, int hdrlen,
-- bool *fragstolen)
-+int __must_check tcp_queue_rcv(struct sock *sk, struct sk_buff *skb, int hdrlen,
-+ bool *fragstolen)
- {
- int eaten;
- struct sk_buff *tail = skb_peek_tail(&sk->sk_receive_queue);
-@@ -4343,7 +4388,10 @@ static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)
- int eaten = -1;
- bool fragstolen = false;
-
-- if (TCP_SKB_CB(skb)->seq == TCP_SKB_CB(skb)->end_seq)
-+ /* If no data is present, but a data_fin is in the options, we still
-+ * have to call mptcp_queue_skb later on. */
-+ if (TCP_SKB_CB(skb)->seq == TCP_SKB_CB(skb)->end_seq &&
-+ !(mptcp(tp) && mptcp_is_data_fin(skb)))
- goto drop;
-
- skb_dst_drop(skb);
-@@ -4389,7 +4437,7 @@ queue_and_out:
- eaten = tcp_queue_rcv(sk, skb, 0, &fragstolen);
- }
- tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq;
-- if (skb->len)
-+ if (skb->len || mptcp_is_data_fin(skb))
- tcp_event_data_recv(sk, skb);
- if (th->fin)
- tcp_fin(sk);
-@@ -4411,7 +4459,11 @@ queue_and_out:
-
- if (eaten > 0)
- kfree_skb_partial(skb, fragstolen);
-- if (!sock_flag(sk, SOCK_DEAD))
-+ if (!sock_flag(sk, SOCK_DEAD) || mptcp(tp))
-+ /* MPTCP: we always have to call data_ready, because
-+ * we may be about to receive a data-fin, which still
-+ * must get queued.
-+ */
- sk->sk_data_ready(sk);
- return;
- }
-@@ -4463,6 +4515,8 @@ static struct sk_buff *tcp_collapse_one(struct sock *sk, struct sk_buff *skb,
- next = skb_queue_next(list, skb);
-
- __skb_unlink(skb, list);
-+ if (mptcp(tcp_sk(sk)))
-+ mptcp_remove_shortcuts(tcp_sk(sk)->mpcb, skb);
- __kfree_skb(skb);
- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPRCVCOLLAPSED);
-
-@@ -4630,7 +4684,7 @@ static void tcp_collapse_ofo_queue(struct sock *sk)
- * Purge the out-of-order queue.
- * Return true if queue was pruned.
- */
--static bool tcp_prune_ofo_queue(struct sock *sk)
-+bool tcp_prune_ofo_queue(struct sock *sk)
- {
- struct tcp_sock *tp = tcp_sk(sk);
- bool res = false;
-@@ -4686,7 +4740,7 @@ static int tcp_prune_queue(struct sock *sk)
- /* Collapsing did not help, destructive actions follow.
- * This must not ever occur. */
-
-- tcp_prune_ofo_queue(sk);
-+ tp->ops->prune_ofo_queue(sk);
-
- if (atomic_read(&sk->sk_rmem_alloc) <= sk->sk_rcvbuf)
- return 0;
-@@ -4702,7 +4756,29 @@ static int tcp_prune_queue(struct sock *sk)
- return -1;
- }
-
--static bool tcp_should_expand_sndbuf(const struct sock *sk)
-+/* RFC2861, slow part. Adjust cwnd, after it was not full during one rto.
-+ * As additional protections, we do not touch cwnd in retransmission phases,
-+ * and if application hit its sndbuf limit recently.
-+ */
-+void tcp_cwnd_application_limited(struct sock *sk)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+
-+ if (inet_csk(sk)->icsk_ca_state == TCP_CA_Open &&
-+ sk->sk_socket && !test_bit(SOCK_NOSPACE, &sk->sk_socket->flags)) {
-+ /* Limited by application or receiver window. */
-+ u32 init_win = tcp_init_cwnd(tp, __sk_dst_get(sk));
-+ u32 win_used = max(tp->snd_cwnd_used, init_win);
-+ if (win_used < tp->snd_cwnd) {
-+ tp->snd_ssthresh = tcp_current_ssthresh(sk);
-+ tp->snd_cwnd = (tp->snd_cwnd + win_used) >> 1;
-+ }
-+ tp->snd_cwnd_used = 0;
-+ }
-+ tp->snd_cwnd_stamp = tcp_time_stamp;
-+}
-+
-+bool tcp_should_expand_sndbuf(const struct sock *sk)
- {
- const struct tcp_sock *tp = tcp_sk(sk);
-
-@@ -4737,7 +4813,7 @@ static void tcp_new_space(struct sock *sk)
- {
- struct tcp_sock *tp = tcp_sk(sk);
-
-- if (tcp_should_expand_sndbuf(sk)) {
-+ if (tp->ops->should_expand_sndbuf(sk)) {
- tcp_sndbuf_expand(sk);
- tp->snd_cwnd_stamp = tcp_time_stamp;
- }
-@@ -4749,8 +4825,9 @@ static void tcp_check_space(struct sock *sk)
- {
- if (sock_flag(sk, SOCK_QUEUE_SHRUNK)) {
- sock_reset_flag(sk, SOCK_QUEUE_SHRUNK);
-- if (sk->sk_socket &&
-- test_bit(SOCK_NOSPACE, &sk->sk_socket->flags))
-+ if (mptcp(tcp_sk(sk)) ||
-+ (sk->sk_socket &&
-+ test_bit(SOCK_NOSPACE, &sk->sk_socket->flags)))
- tcp_new_space(sk);
- }
- }
-@@ -4773,7 +4850,7 @@ static void __tcp_ack_snd_check(struct sock *sk, int ofo_possible)
- /* ... and right edge of window advances far enough.
- * (tcp_recvmsg() will send ACK otherwise). Or...
- */
-- __tcp_select_window(sk) >= tp->rcv_wnd) ||
-+ tp->ops->__select_window(sk) >= tp->rcv_wnd) ||
- /* We ACK each frame or... */
- tcp_in_quickack_mode(sk) ||
- /* We have out of order data. */
-@@ -4875,6 +4952,10 @@ static void tcp_urg(struct sock *sk, struct sk_buff *skb, const struct tcphdr *t
- {
- struct tcp_sock *tp = tcp_sk(sk);
-
-+ /* MPTCP urgent data is not yet supported */
-+ if (mptcp(tp))
-+ return;
-+
- /* Check if we get a new urgent pointer - normally not. */
- if (th->urg)
- tcp_check_urg(sk, th);
-@@ -4942,8 +5023,7 @@ static inline bool tcp_checksum_complete_user(struct sock *sk,
- }
-
- #ifdef CONFIG_NET_DMA
--static bool tcp_dma_try_early_copy(struct sock *sk, struct sk_buff *skb,
-- int hlen)
-+bool tcp_dma_try_early_copy(struct sock *sk, struct sk_buff *skb, int hlen)
- {
- struct tcp_sock *tp = tcp_sk(sk);
- int chunk = skb->len - hlen;
-@@ -5052,9 +5132,15 @@ syn_challenge:
- goto discard;
- }
-
-+ /* If valid: post process the received MPTCP options. */
-+ if (mptcp(tp) && mptcp_handle_options(sk, th, skb))
-+ goto discard;
-+
- return true;
-
- discard:
-+ if (mptcp(tp))
-+ mptcp_reset_mopt(tp);
- __kfree_skb(skb);
- return false;
- }
-@@ -5106,6 +5192,10 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
-
- tp->rx_opt.saw_tstamp = 0;
-
-+ /* MPTCP: force slowpath. */
-+ if (mptcp(tp))
-+ goto slow_path;
-+
- /* pred_flags is 0xS?10 << 16 + snd_wnd
- * if header_prediction is to be made
- * 'S' will always be tp->tcp_header_len >> 2
-@@ -5205,7 +5295,7 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPHPHITSTOUSER);
- }
- if (copied_early)
-- tcp_cleanup_rbuf(sk, skb->len);
-+ tp->ops->cleanup_rbuf(sk, skb->len);
- }
- if (!eaten) {
- if (tcp_checksum_complete_user(sk, skb))
-@@ -5313,14 +5403,14 @@ void tcp_finish_connect(struct sock *sk, struct sk_buff *skb)
-
- tcp_init_metrics(sk);
-
-- tcp_init_congestion_control(sk);
-+ tp->ops->init_congestion_control(sk);
-
- /* Prevent spurious tcp_cwnd_restart() on first data
- * packet.
- */
- tp->lsndtime = tcp_time_stamp;
-
-- tcp_init_buffer_space(sk);
-+ tp->ops->init_buffer_space(sk);
-
- if (sock_flag(sk, SOCK_KEEPOPEN))
- inet_csk_reset_keepalive_timer(sk, keepalive_time_when(tp));
-@@ -5350,7 +5440,7 @@ static bool tcp_rcv_fastopen_synack(struct sock *sk, struct sk_buff *synack,
- /* Get original SYNACK MSS value if user MSS sets mss_clamp */
- tcp_clear_options(&opt);
- opt.user_mss = opt.mss_clamp = 0;
-- tcp_parse_options(synack, &opt, 0, NULL);
-+ tcp_parse_options(synack, &opt, NULL, 0, NULL);
- mss = opt.mss_clamp;
- }
-
-@@ -5365,7 +5455,11 @@ static bool tcp_rcv_fastopen_synack(struct sock *sk, struct sk_buff *synack,
-
- tcp_fastopen_cache_set(sk, mss, cookie, syn_drop);
-
-- if (data) { /* Retransmit unacked data in SYN */
-+ /* In mptcp case, we do not rely on "retransmit", but instead on
-+ * "transmit", because if fastopen data is not acked, the retransmission
-+ * becomes the first MPTCP data (see mptcp_rcv_synsent_fastopen).
-+ */
-+ if (data && !mptcp(tp)) { /* Retransmit unacked data in SYN */
- tcp_for_write_queue_from(data, sk) {
- if (data == tcp_send_head(sk) ||
- __tcp_retransmit_skb(sk, data))
-@@ -5388,8 +5482,11 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
- struct tcp_sock *tp = tcp_sk(sk);
- struct tcp_fastopen_cookie foc = { .len = -1 };
- int saved_clamp = tp->rx_opt.mss_clamp;
-+ struct mptcp_options_received mopt;
-+ mptcp_init_mp_opt(&mopt);
-
-- tcp_parse_options(skb, &tp->rx_opt, 0, &foc);
-+ tcp_parse_options(skb, &tp->rx_opt,
-+ mptcp(tp) ? &tp->mptcp->rx_opt : &mopt, 0, &foc);
- if (tp->rx_opt.saw_tstamp && tp->rx_opt.rcv_tsecr)
- tp->rx_opt.rcv_tsecr -= tp->tsoffset;
-
-@@ -5448,6 +5545,30 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
- tcp_init_wl(tp, TCP_SKB_CB(skb)->seq);
- tcp_ack(sk, skb, FLAG_SLOWPATH);
-
-+ if (tp->request_mptcp || mptcp(tp)) {
-+ int ret;
-+ ret = mptcp_rcv_synsent_state_process(sk, &sk,
-+ skb, &mopt);
-+
-+ /* May have changed if we support MPTCP */
-+ tp = tcp_sk(sk);
-+ icsk = inet_csk(sk);
-+
-+ if (ret == 1)
-+ goto reset_and_undo;
-+ if (ret == 2)
-+ goto discard;
-+ }
-+
-+ if (mptcp(tp) && !is_master_tp(tp)) {
-+ /* Timer for repeating the ACK until an answer
-+ * arrives. Used only when establishing an additional
-+ * subflow inside of an MPTCP connection.
-+ */
-+ sk_reset_timer(sk, &tp->mptcp->mptcp_ack_timer,
-+ jiffies + icsk->icsk_rto);
-+ }
-+
- /* Ok.. it's good. Set up sequence numbers and
- * move to established.
- */
-@@ -5474,6 +5595,11 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
- tp->tcp_header_len = sizeof(struct tcphdr);
- }
-
-+ if (mptcp(tp)) {
-+ tp->tcp_header_len += MPTCP_SUB_LEN_DSM_ALIGN;
-+ tp->advmss -= MPTCP_SUB_LEN_DSM_ALIGN;
-+ }
-+
- if (tcp_is_sack(tp) && sysctl_tcp_fack)
- tcp_enable_fack(tp);
-
-@@ -5494,9 +5620,12 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
- tcp_rcv_fastopen_synack(sk, skb, &foc))
- return -1;
-
-- if (sk->sk_write_pending ||
-+ /* With MPTCP we cannot send data on the third ack due to the
-+ * lack of option-space to combine with an MP_CAPABLE.
-+ */
-+ if (!mptcp(tp) && (sk->sk_write_pending ||
- icsk->icsk_accept_queue.rskq_defer_accept ||
-- icsk->icsk_ack.pingpong) {
-+ icsk->icsk_ack.pingpong)) {
- /* Save one ACK. Data will be ready after
- * several ticks, if write_pending is set.
- *
-@@ -5536,6 +5665,7 @@ discard:
- tcp_paws_reject(&tp->rx_opt, 0))
- goto discard_and_undo;
-
-+ /* TODO - check this here for MPTCP */
- if (th->syn) {
- /* We see SYN without ACK. It is attempt of
- * simultaneous connect with crossed SYNs.
-@@ -5552,6 +5682,11 @@ discard:
- tp->tcp_header_len = sizeof(struct tcphdr);
- }
-
-+ if (mptcp(tp)) {
-+ tp->tcp_header_len += MPTCP_SUB_LEN_DSM_ALIGN;
-+ tp->advmss -= MPTCP_SUB_LEN_DSM_ALIGN;
-+ }
-+
- tp->rcv_nxt = TCP_SKB_CB(skb)->seq + 1;
- tp->rcv_wup = TCP_SKB_CB(skb)->seq + 1;
-
-@@ -5610,6 +5745,7 @@ reset_and_undo:
-
- int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
- const struct tcphdr *th, unsigned int len)
-+ __releases(&sk->sk_lock.slock)
- {
- struct tcp_sock *tp = tcp_sk(sk);
- struct inet_connection_sock *icsk = inet_csk(sk);
-@@ -5661,6 +5797,16 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
-
- case TCP_SYN_SENT:
- queued = tcp_rcv_synsent_state_process(sk, skb, th, len);
-+ if (is_meta_sk(sk)) {
-+ sk = tcp_sk(sk)->mpcb->master_sk;
-+ tp = tcp_sk(sk);
-+
-+ /* Need to call it here, because it will announce new
-+ * addresses, which can only be done after the third ack
-+ * of the 3-way handshake.
-+ */
-+ mptcp_update_metasocket(sk, tp->meta_sk);
-+ }
- if (queued >= 0)
- return queued;
-
-@@ -5668,6 +5814,8 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
- tcp_urg(sk, skb, th);
- __kfree_skb(skb);
- tcp_data_snd_check(sk);
-+ if (mptcp(tp) && is_master_tp(tp))
-+ bh_unlock_sock(sk);
- return 0;
- }
-
-@@ -5706,11 +5854,11 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
- synack_stamp = tp->lsndtime;
- /* Make sure socket is routed, for correct metrics. */
- icsk->icsk_af_ops->rebuild_header(sk);
-- tcp_init_congestion_control(sk);
-+ tp->ops->init_congestion_control(sk);
-
- tcp_mtup_init(sk);
- tp->copied_seq = tp->rcv_nxt;
-- tcp_init_buffer_space(sk);
-+ tp->ops->init_buffer_space(sk);
- }
- smp_mb();
- tcp_set_state(sk, TCP_ESTABLISHED);
-@@ -5730,6 +5878,8 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
-
- if (tp->rx_opt.tstamp_ok)
- tp->advmss -= TCPOLEN_TSTAMP_ALIGNED;
-+ if (mptcp(tp))
-+ tp->advmss -= MPTCP_SUB_LEN_DSM_ALIGN;
-
- if (req) {
- /* Re-arm the timer because data may have been sent out.
-@@ -5751,6 +5901,12 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
-
- tcp_initialize_rcv_mss(sk);
- tcp_fast_path_on(tp);
-+ /* Send an ACK when establishing a new
-+ * MPTCP subflow, i.e. using an MP_JOIN
-+ * subtype.
-+ */
-+ if (mptcp(tp) && !is_master_tp(tp))
-+ tcp_send_ack(sk);
- break;
-
- case TCP_FIN_WAIT1: {
-@@ -5802,7 +5958,8 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
- tmo = tcp_fin_time(sk);
- if (tmo > TCP_TIMEWAIT_LEN) {
- inet_csk_reset_keepalive_timer(sk, tmo - TCP_TIMEWAIT_LEN);
-- } else if (th->fin || sock_owned_by_user(sk)) {
-+ } else if (th->fin || mptcp_is_data_fin(skb) ||
-+ sock_owned_by_user(sk)) {
- /* Bad case. We could lose such FIN otherwise.
- * It is not a big problem, but it looks confusing
- * and not so rare event. We still can lose it now,
-@@ -5811,7 +5968,7 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
- */
- inet_csk_reset_keepalive_timer(sk, tmo);
- } else {
-- tcp_time_wait(sk, TCP_FIN_WAIT2, tmo);
-+ tp->ops->time_wait(sk, TCP_FIN_WAIT2, tmo);
- goto discard;
- }
- break;
-@@ -5819,7 +5976,7 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
-
- case TCP_CLOSING:
- if (tp->snd_una == tp->write_seq) {
-- tcp_time_wait(sk, TCP_TIME_WAIT, 0);
-+ tp->ops->time_wait(sk, TCP_TIME_WAIT, 0);
- goto discard;
- }
- break;
-@@ -5831,6 +5988,9 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
- goto discard;
- }
- break;
-+ case TCP_CLOSE:
-+ if (tp->mp_killed)
-+ goto discard;
- }
-
- /* step 6: check the URG bit */
-@@ -5851,7 +6011,11 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
- */
- if (sk->sk_shutdown & RCV_SHUTDOWN) {
- if (TCP_SKB_CB(skb)->end_seq != TCP_SKB_CB(skb)->seq &&
-- after(TCP_SKB_CB(skb)->end_seq - th->fin, tp->rcv_nxt)) {
-+ after(TCP_SKB_CB(skb)->end_seq - th->fin, tp->rcv_nxt) &&
-+ !mptcp(tp)) {
-+ /* In case of mptcp, the reset is handled by
-+ * mptcp_rcv_state_process
-+ */
- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPABORTONDATA);
- tcp_reset(sk);
- return 1;
-@@ -5877,3 +6041,154 @@ discard:
- return 0;
- }
- EXPORT_SYMBOL(tcp_rcv_state_process);
-+
-+static inline void pr_drop_req(struct request_sock *req, __u16 port, int family)
-+{
-+ struct inet_request_sock *ireq = inet_rsk(req);
-+
-+ if (family == AF_INET)
-+ LIMIT_NETDEBUG(KERN_DEBUG pr_fmt("drop open request from %pI4/%u\n"),
-+ &ireq->ir_rmt_addr, port);
-+#if IS_ENABLED(CONFIG_IPV6)
-+ else if (family == AF_INET6)
-+ LIMIT_NETDEBUG(KERN_DEBUG pr_fmt("drop open request from %pI6/%u\n"),
-+ &ireq->ir_v6_rmt_addr, port);
-+#endif
-+}
-+
-+int tcp_conn_request(struct request_sock_ops *rsk_ops,
-+ const struct tcp_request_sock_ops *af_ops,
-+ struct sock *sk, struct sk_buff *skb)
-+{
-+ struct tcp_options_received tmp_opt;
-+ struct request_sock *req;
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ struct dst_entry *dst = NULL;
-+ __u32 isn = TCP_SKB_CB(skb)->when;
-+ bool want_cookie = false, fastopen;
-+ struct flowi fl;
-+ struct tcp_fastopen_cookie foc = { .len = -1 };
-+ int err;
-+
-+
-+ /* TW buckets are converted to open requests without
-+ * limitations, they conserve resources and peer is
-+ * evidently real one.
-+ */
-+ if ((sysctl_tcp_syncookies == 2 ||
-+ inet_csk_reqsk_queue_is_full(sk)) && !isn) {
-+ want_cookie = tcp_syn_flood_action(sk, skb, rsk_ops->slab_name);
-+ if (!want_cookie)
-+ goto drop;
-+ }
-+
-+
-+ /* Accept backlog is full. If we have already queued enough
-+ * of warm entries in syn queue, drop request. It is better than
-+ * clogging syn queue with openreqs with exponentially increasing
-+ * timeout.
-+ */
-+ if (sk_acceptq_is_full(sk) && inet_csk_reqsk_queue_young(sk) > 1) {
-+ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENOVERFLOWS);
-+ goto drop;
-+ }
-+
-+ req = inet_reqsk_alloc(rsk_ops);
-+ if (!req)
-+ goto drop;
-+
-+ tcp_rsk(req)->af_specific = af_ops;
-+
-+ tcp_clear_options(&tmp_opt);
-+ tmp_opt.mss_clamp = af_ops->mss_clamp;
-+ tmp_opt.user_mss = tp->rx_opt.user_mss;
-+ tcp_parse_options(skb, &tmp_opt, NULL, 0, want_cookie ? NULL : &foc);
-+
-+ if (want_cookie && !tmp_opt.saw_tstamp)
-+ tcp_clear_options(&tmp_opt);
-+
-+ tmp_opt.tstamp_ok = tmp_opt.saw_tstamp;
-+ tcp_openreq_init(req, &tmp_opt, skb);
-+
-+ if (af_ops->init_req(req, sk, skb))
-+ goto drop_and_free;
-+
-+ if (security_inet_conn_request(sk, skb, req))
-+ goto drop_and_free;
-+
-+ if (!want_cookie || tmp_opt.tstamp_ok)
-+ TCP_ECN_create_request(req, skb, sock_net(sk));
-+
-+ if (want_cookie) {
-+ isn = cookie_init_sequence(af_ops, sk, skb, &req->mss);
-+ req->cookie_ts = tmp_opt.tstamp_ok;
-+ } else if (!isn) {
-+ /* VJ's idea. We save last timestamp seen
-+ * from the destination in peer table, when entering
-+ * state TIME-WAIT, and check against it before
-+ * accepting new connection request.
-+ *
-+ * If "isn" is not zero, this request hit alive
-+ * timewait bucket, so that all the necessary checks
-+ * are made in the function processing timewait state.
-+ */
-+ if (tmp_opt.saw_tstamp && tcp_death_row.sysctl_tw_recycle) {
-+ bool strict;
-+
-+ dst = af_ops->route_req(sk, &fl, req, &strict);
-+ if (dst && strict &&
-+ !tcp_peer_is_proven(req, dst, true)) {
-+ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_PAWSPASSIVEREJECTED);
-+ goto drop_and_release;
-+ }
-+ }
-+ /* Kill the following clause, if you dislike this way. */
-+ else if (!sysctl_tcp_syncookies &&
-+ (sysctl_max_syn_backlog - inet_csk_reqsk_queue_len(sk) <
-+ (sysctl_max_syn_backlog >> 2)) &&
-+ !tcp_peer_is_proven(req, dst, false)) {
-+ /* Without syncookies last quarter of
-+ * backlog is filled with destinations,
-+ * proven to be alive.
-+ * It means that we continue to communicate
-+ * to destinations, already remembered
-+ * to the moment of synflood.
-+ */
-+ pr_drop_req(req, ntohs(tcp_hdr(skb)->source),
-+ rsk_ops->family);
-+ goto drop_and_release;
-+ }
-+
-+ isn = af_ops->init_seq(skb);
-+ }
-+ if (!dst) {
-+ dst = af_ops->route_req(sk, &fl, req, NULL);
-+ if (!dst)
-+ goto drop_and_free;
-+ }
-+
-+ tcp_rsk(req)->snt_isn = isn;
-+ tcp_openreq_init_rwin(req, sk, dst);
-+ fastopen = !want_cookie &&
-+ tcp_try_fastopen(sk, skb, req, &foc, dst);
-+ err = af_ops->send_synack(sk, dst, &fl, req,
-+ skb_get_queue_mapping(skb), &foc);
-+ if (!fastopen) {
-+ if (err || want_cookie)
-+ goto drop_and_free;
-+
-+ tcp_rsk(req)->listener = NULL;
-+ af_ops->queue_hash_add(sk, req, TCP_TIMEOUT_INIT);
-+ }
-+
-+ return 0;
-+
-+drop_and_release:
-+ dst_release(dst);
-+drop_and_free:
-+ reqsk_free(req);
-+drop:
-+ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENDROPS);
-+ return 0;
-+}
-+EXPORT_SYMBOL(tcp_conn_request);
-diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
-index 77cccda1ad0c..c77017f600f1 100644
---- a/net/ipv4/tcp_ipv4.c
-+++ b/net/ipv4/tcp_ipv4.c
-@@ -67,6 +67,8 @@
- #include <net/icmp.h>
- #include <net/inet_hashtables.h>
- #include <net/tcp.h>
-+#include <net/mptcp.h>
-+#include <net/mptcp_v4.h>
- #include <net/transp_v6.h>
- #include <net/ipv6.h>
- #include <net/inet_common.h>
-@@ -99,7 +101,7 @@ static int tcp_v4_md5_hash_hdr(char *md5_hash, const struct tcp_md5sig_key *key,
- struct inet_hashinfo tcp_hashinfo;
- EXPORT_SYMBOL(tcp_hashinfo);
-
--static inline __u32 tcp_v4_init_sequence(const struct sk_buff *skb)
-+__u32 tcp_v4_init_sequence(const struct sk_buff *skb)
- {
- return secure_tcp_sequence_number(ip_hdr(skb)->daddr,
- ip_hdr(skb)->saddr,
-@@ -334,7 +336,7 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
- struct inet_sock *inet;
- const int type = icmp_hdr(icmp_skb)->type;
- const int code = icmp_hdr(icmp_skb)->code;
-- struct sock *sk;
-+ struct sock *sk, *meta_sk;
- struct sk_buff *skb;
- struct request_sock *fastopen;
- __u32 seq, snd_una;
-@@ -358,13 +360,19 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
- return;
- }
-
-- bh_lock_sock(sk);
-+ tp = tcp_sk(sk);
-+ if (mptcp(tp))
-+ meta_sk = mptcp_meta_sk(sk);
-+ else
-+ meta_sk = sk;
-+
-+ bh_lock_sock(meta_sk);
- /* If too many ICMPs get dropped on busy
- * servers this needs to be solved differently.
- * We do take care of PMTU discovery (RFC1191) special case :
- * we can receive locally generated ICMP messages while socket is held.
- */
-- if (sock_owned_by_user(sk)) {
-+ if (sock_owned_by_user(meta_sk)) {
- if (!(type == ICMP_DEST_UNREACH && code == ICMP_FRAG_NEEDED))
- NET_INC_STATS_BH(net, LINUX_MIB_LOCKDROPPEDICMPS);
- }
-@@ -377,7 +385,6 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
- }
-
- icsk = inet_csk(sk);
-- tp = tcp_sk(sk);
- seq = ntohl(th->seq);
- /* XXX (TFO) - tp->snd_una should be ISN (tcp_create_openreq_child() */
- fastopen = tp->fastopen_rsk;
-@@ -411,11 +418,13 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
- goto out;
-
- tp->mtu_info = info;
-- if (!sock_owned_by_user(sk)) {
-+ if (!sock_owned_by_user(meta_sk)) {
- tcp_v4_mtu_reduced(sk);
- } else {
- if (!test_and_set_bit(TCP_MTU_REDUCED_DEFERRED, &tp->tsq_flags))
- sock_hold(sk);
-+ if (mptcp(tp))
-+ mptcp_tsq_flags(sk);
- }
- goto out;
- }
-@@ -429,7 +438,7 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
- !icsk->icsk_backoff || fastopen)
- break;
-
-- if (sock_owned_by_user(sk))
-+ if (sock_owned_by_user(meta_sk))
- break;
-
- icsk->icsk_backoff--;
-@@ -463,7 +472,7 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
- switch (sk->sk_state) {
- struct request_sock *req, **prev;
- case TCP_LISTEN:
-- if (sock_owned_by_user(sk))
-+ if (sock_owned_by_user(meta_sk))
- goto out;
-
- req = inet_csk_search_req(sk, &prev, th->dest,
-@@ -499,7 +508,7 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
- if (fastopen && fastopen->sk == NULL)
- break;
-
-- if (!sock_owned_by_user(sk)) {
-+ if (!sock_owned_by_user(meta_sk)) {
- sk->sk_err = err;
-
- sk->sk_error_report(sk);
-@@ -528,7 +537,7 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
- */
-
- inet = inet_sk(sk);
-- if (!sock_owned_by_user(sk) && inet->recverr) {
-+ if (!sock_owned_by_user(meta_sk) && inet->recverr) {
- sk->sk_err = err;
- sk->sk_error_report(sk);
- } else { /* Only an error on timeout */
-@@ -536,7 +545,7 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
- }
-
- out:
-- bh_unlock_sock(sk);
-+ bh_unlock_sock(meta_sk);
- sock_put(sk);
- }
-
-@@ -578,7 +587,7 @@ EXPORT_SYMBOL(tcp_v4_send_check);
- * Exception: precedence violation. We do not implement it in any case.
- */
-
--static void tcp_v4_send_reset(struct sock *sk, struct sk_buff *skb)
-+void tcp_v4_send_reset(struct sock *sk, struct sk_buff *skb)
- {
- const struct tcphdr *th = tcp_hdr(skb);
- struct {
-@@ -702,10 +711,10 @@ release_sk1:
- outside socket context is ugly, certainly. What can I do?
- */
-
--static void tcp_v4_send_ack(struct sk_buff *skb, u32 seq, u32 ack,
-+static void tcp_v4_send_ack(struct sk_buff *skb, u32 seq, u32 ack, u32 data_ack,
- u32 win, u32 tsval, u32 tsecr, int oif,
- struct tcp_md5sig_key *key,
-- int reply_flags, u8 tos)
-+ int reply_flags, u8 tos, int mptcp)
- {
- const struct tcphdr *th = tcp_hdr(skb);
- struct {
-@@ -714,6 +723,10 @@ static void tcp_v4_send_ack(struct sk_buff *skb, u32 seq, u32 ack,
- #ifdef CONFIG_TCP_MD5SIG
- + (TCPOLEN_MD5SIG_ALIGNED >> 2)
- #endif
-+#ifdef CONFIG_MPTCP
-+ + ((MPTCP_SUB_LEN_DSS >> 2) +
-+ (MPTCP_SUB_LEN_ACK >> 2))
-+#endif
- ];
- } rep;
- struct ip_reply_arg arg;
-@@ -758,6 +771,21 @@ static void tcp_v4_send_ack(struct sk_buff *skb, u32 seq, u32 ack,
- ip_hdr(skb)->daddr, &rep.th);
- }
- #endif
-+#ifdef CONFIG_MPTCP
-+ if (mptcp) {
-+ int offset = (tsecr) ? 3 : 0;
-+ /* Construction of 32-bit data_ack */
-+ rep.opt[offset++] = htonl((TCPOPT_MPTCP << 24) |
-+ ((MPTCP_SUB_LEN_DSS + MPTCP_SUB_LEN_ACK) << 16) |
-+ (0x20 << 8) |
-+ (0x01));
-+ rep.opt[offset] = htonl(data_ack);
-+
-+ arg.iov[0].iov_len += MPTCP_SUB_LEN_DSS + MPTCP_SUB_LEN_ACK;
-+ rep.th.doff = arg.iov[0].iov_len / 4;
-+ }
-+#endif /* CONFIG_MPTCP */
-+
- arg.flags = reply_flags;
- arg.csum = csum_tcpudp_nofold(ip_hdr(skb)->daddr,
- ip_hdr(skb)->saddr, /* XXX */
-@@ -776,36 +804,44 @@ static void tcp_v4_timewait_ack(struct sock *sk, struct sk_buff *skb)
- {
- struct inet_timewait_sock *tw = inet_twsk(sk);
- struct tcp_timewait_sock *tcptw = tcp_twsk(sk);
-+ u32 data_ack = 0;
-+ int mptcp = 0;
-+
-+ if (tcptw->mptcp_tw && tcptw->mptcp_tw->meta_tw) {
-+ data_ack = (u32)tcptw->mptcp_tw->rcv_nxt;
-+ mptcp = 1;
-+ }
-
- tcp_v4_send_ack(skb, tcptw->tw_snd_nxt, tcptw->tw_rcv_nxt,
-+ data_ack,
- tcptw->tw_rcv_wnd >> tw->tw_rcv_wscale,
- tcp_time_stamp + tcptw->tw_ts_offset,
- tcptw->tw_ts_recent,
- tw->tw_bound_dev_if,
- tcp_twsk_md5_key(tcptw),
- tw->tw_transparent ? IP_REPLY_ARG_NOSRCCHECK : 0,
-- tw->tw_tos
-+ tw->tw_tos, mptcp
- );
-
- inet_twsk_put(tw);
- }
-
--static void tcp_v4_reqsk_send_ack(struct sock *sk, struct sk_buff *skb,
-- struct request_sock *req)
-+void tcp_v4_reqsk_send_ack(struct sock *sk, struct sk_buff *skb,
-+ struct request_sock *req)
- {
- /* sk->sk_state == TCP_LISTEN -> for regular TCP_SYN_RECV
- * sk->sk_state == TCP_SYN_RECV -> for Fast Open.
- */
- tcp_v4_send_ack(skb, (sk->sk_state == TCP_LISTEN) ?
- tcp_rsk(req)->snt_isn + 1 : tcp_sk(sk)->snd_nxt,
-- tcp_rsk(req)->rcv_nxt, req->rcv_wnd,
-+ tcp_rsk(req)->rcv_nxt, 0, req->rcv_wnd,
- tcp_time_stamp,
- req->ts_recent,
- 0,
- tcp_md5_do_lookup(sk, (union tcp_md5_addr *)&ip_hdr(skb)->daddr,
- AF_INET),
- inet_rsk(req)->no_srccheck ? IP_REPLY_ARG_NOSRCCHECK : 0,
-- ip_hdr(skb)->tos);
-+ ip_hdr(skb)->tos, 0);
- }
-
- /*
-@@ -813,10 +849,11 @@ static void tcp_v4_reqsk_send_ack(struct sock *sk, struct sk_buff *skb,
- * This still operates on a request_sock only, not on a big
- * socket.
- */
--static int tcp_v4_send_synack(struct sock *sk, struct dst_entry *dst,
-- struct request_sock *req,
-- u16 queue_mapping,
-- struct tcp_fastopen_cookie *foc)
-+int tcp_v4_send_synack(struct sock *sk, struct dst_entry *dst,
-+ struct flowi *fl,
-+ struct request_sock *req,
-+ u16 queue_mapping,
-+ struct tcp_fastopen_cookie *foc)
- {
- const struct inet_request_sock *ireq = inet_rsk(req);
- struct flowi4 fl4;
-@@ -844,21 +881,10 @@ static int tcp_v4_send_synack(struct sock *sk, struct dst_entry *dst,
- return err;
- }
-
--static int tcp_v4_rtx_synack(struct sock *sk, struct request_sock *req)
--{
-- int res = tcp_v4_send_synack(sk, NULL, req, 0, NULL);
--
-- if (!res) {
-- TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_RETRANSSEGS);
-- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPSYNRETRANS);
-- }
-- return res;
--}
--
- /*
- * IPv4 request_sock destructor.
- */
--static void tcp_v4_reqsk_destructor(struct request_sock *req)
-+void tcp_v4_reqsk_destructor(struct request_sock *req)
- {
- kfree(inet_rsk(req)->opt);
- }
-@@ -896,7 +922,7 @@ EXPORT_SYMBOL(tcp_syn_flood_action);
- /*
- * Save and compile IPv4 options into the request_sock if needed.
- */
--static struct ip_options_rcu *tcp_v4_save_options(struct sk_buff *skb)
-+struct ip_options_rcu *tcp_v4_save_options(struct sk_buff *skb)
- {
- const struct ip_options *opt = &(IPCB(skb)->opt);
- struct ip_options_rcu *dopt = NULL;
-@@ -1237,161 +1263,71 @@ static bool tcp_v4_inbound_md5_hash(struct sock *sk, const struct sk_buff *skb)
-
- #endif
-
-+static int tcp_v4_init_req(struct request_sock *req, struct sock *sk,
-+ struct sk_buff *skb)
-+{
-+ struct inet_request_sock *ireq = inet_rsk(req);
-+
-+ ireq->ir_loc_addr = ip_hdr(skb)->daddr;
-+ ireq->ir_rmt_addr = ip_hdr(skb)->saddr;
-+ ireq->no_srccheck = inet_sk(sk)->transparent;
-+ ireq->opt = tcp_v4_save_options(skb);
-+ ireq->ir_mark = inet_request_mark(sk, skb);
-+
-+ return 0;
-+}
-+
-+static struct dst_entry *tcp_v4_route_req(struct sock *sk, struct flowi *fl,
-+ const struct request_sock *req,
-+ bool *strict)
-+{
-+ struct dst_entry *dst = inet_csk_route_req(sk, &fl->u.ip4, req);
-+
-+ if (strict) {
-+ if (fl->u.ip4.daddr == inet_rsk(req)->ir_rmt_addr)
-+ *strict = true;
-+ else
-+ *strict = false;
-+ }
-+
-+ return dst;
-+}
-+
- struct request_sock_ops tcp_request_sock_ops __read_mostly = {
- .family = PF_INET,
- .obj_size = sizeof(struct tcp_request_sock),
-- .rtx_syn_ack = tcp_v4_rtx_synack,
-+ .rtx_syn_ack = tcp_rtx_synack,
- .send_ack = tcp_v4_reqsk_send_ack,
- .destructor = tcp_v4_reqsk_destructor,
- .send_reset = tcp_v4_send_reset,
- .syn_ack_timeout = tcp_syn_ack_timeout,
- };
-
-+const struct tcp_request_sock_ops tcp_request_sock_ipv4_ops = {
-+ .mss_clamp = TCP_MSS_DEFAULT,
- #ifdef CONFIG_TCP_MD5SIG
--static const struct tcp_request_sock_ops tcp_request_sock_ipv4_ops = {
- .md5_lookup = tcp_v4_reqsk_md5_lookup,
- .calc_md5_hash = tcp_v4_md5_hash_skb,
--};
- #endif
-+ .init_req = tcp_v4_init_req,
-+#ifdef CONFIG_SYN_COOKIES
-+ .cookie_init_seq = cookie_v4_init_sequence,
-+#endif
-+ .route_req = tcp_v4_route_req,
-+ .init_seq = tcp_v4_init_sequence,
-+ .send_synack = tcp_v4_send_synack,
-+ .queue_hash_add = inet_csk_reqsk_queue_hash_add,
-+};
-
- int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
- {
-- struct tcp_options_received tmp_opt;
-- struct request_sock *req;
-- struct inet_request_sock *ireq;
-- struct tcp_sock *tp = tcp_sk(sk);
-- struct dst_entry *dst = NULL;
-- __be32 saddr = ip_hdr(skb)->saddr;
-- __be32 daddr = ip_hdr(skb)->daddr;
-- __u32 isn = TCP_SKB_CB(skb)->when;
-- bool want_cookie = false, fastopen;
-- struct flowi4 fl4;
-- struct tcp_fastopen_cookie foc = { .len = -1 };
-- int err;
--
- /* Never answer to SYNs send to broadcast or multicast */
- if (skb_rtable(skb)->rt_flags & (RTCF_BROADCAST | RTCF_MULTICAST))
- goto drop;
-
-- /* TW buckets are converted to open requests without
-- * limitations, they conserve resources and peer is
-- * evidently real one.
-- */
-- if ((sysctl_tcp_syncookies == 2 ||
-- inet_csk_reqsk_queue_is_full(sk)) && !isn) {
-- want_cookie = tcp_syn_flood_action(sk, skb, "TCP");
-- if (!want_cookie)
-- goto drop;
-- }
--
-- /* Accept backlog is full. If we have already queued enough
-- * of warm entries in syn queue, drop request. It is better than
-- * clogging syn queue with openreqs with exponentially increasing
-- * timeout.
-- */
-- if (sk_acceptq_is_full(sk) && inet_csk_reqsk_queue_young(sk) > 1) {
-- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENOVERFLOWS);
-- goto drop;
-- }
--
-- req = inet_reqsk_alloc(&tcp_request_sock_ops);
-- if (!req)
-- goto drop;
--
--#ifdef CONFIG_TCP_MD5SIG
-- tcp_rsk(req)->af_specific = &tcp_request_sock_ipv4_ops;
--#endif
--
-- tcp_clear_options(&tmp_opt);
-- tmp_opt.mss_clamp = TCP_MSS_DEFAULT;
-- tmp_opt.user_mss = tp->rx_opt.user_mss;
-- tcp_parse_options(skb, &tmp_opt, 0, want_cookie ? NULL : &foc);
--
-- if (want_cookie && !tmp_opt.saw_tstamp)
-- tcp_clear_options(&tmp_opt);
--
-- tmp_opt.tstamp_ok = tmp_opt.saw_tstamp;
-- tcp_openreq_init(req, &tmp_opt, skb);
-+ return tcp_conn_request(&tcp_request_sock_ops,
-+ &tcp_request_sock_ipv4_ops, sk, skb);
-
-- ireq = inet_rsk(req);
-- ireq->ir_loc_addr = daddr;
-- ireq->ir_rmt_addr = saddr;
-- ireq->no_srccheck = inet_sk(sk)->transparent;
-- ireq->opt = tcp_v4_save_options(skb);
-- ireq->ir_mark = inet_request_mark(sk, skb);
--
-- if (security_inet_conn_request(sk, skb, req))
-- goto drop_and_free;
--
-- if (!want_cookie || tmp_opt.tstamp_ok)
-- TCP_ECN_create_request(req, skb, sock_net(sk));
--
-- if (want_cookie) {
-- isn = cookie_v4_init_sequence(sk, skb, &req->mss);
-- req->cookie_ts = tmp_opt.tstamp_ok;
-- } else if (!isn) {
-- /* VJ's idea. We save last timestamp seen
-- * from the destination in peer table, when entering
-- * state TIME-WAIT, and check against it before
-- * accepting new connection request.
-- *
-- * If "isn" is not zero, this request hit alive
-- * timewait bucket, so that all the necessary checks
-- * are made in the function processing timewait state.
-- */
-- if (tmp_opt.saw_tstamp &&
-- tcp_death_row.sysctl_tw_recycle &&
-- (dst = inet_csk_route_req(sk, &fl4, req)) != NULL &&
-- fl4.daddr == saddr) {
-- if (!tcp_peer_is_proven(req, dst, true)) {
-- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_PAWSPASSIVEREJECTED);
-- goto drop_and_release;
-- }
-- }
-- /* Kill the following clause, if you dislike this way. */
-- else if (!sysctl_tcp_syncookies &&
-- (sysctl_max_syn_backlog - inet_csk_reqsk_queue_len(sk) <
-- (sysctl_max_syn_backlog >> 2)) &&
-- !tcp_peer_is_proven(req, dst, false)) {
-- /* Without syncookies last quarter of
-- * backlog is filled with destinations,
-- * proven to be alive.
-- * It means that we continue to communicate
-- * to destinations, already remembered
-- * to the moment of synflood.
-- */
-- LIMIT_NETDEBUG(KERN_DEBUG pr_fmt("drop open request from %pI4/%u\n"),
-- &saddr, ntohs(tcp_hdr(skb)->source));
-- goto drop_and_release;
-- }
--
-- isn = tcp_v4_init_sequence(skb);
-- }
-- if (!dst && (dst = inet_csk_route_req(sk, &fl4, req)) == NULL)
-- goto drop_and_free;
--
-- tcp_rsk(req)->snt_isn = isn;
-- tcp_rsk(req)->snt_synack = tcp_time_stamp;
-- tcp_openreq_init_rwin(req, sk, dst);
-- fastopen = !want_cookie &&
-- tcp_try_fastopen(sk, skb, req, &foc, dst);
-- err = tcp_v4_send_synack(sk, dst, req,
-- skb_get_queue_mapping(skb), &foc);
-- if (!fastopen) {
-- if (err || want_cookie)
-- goto drop_and_free;
--
-- tcp_rsk(req)->snt_synack = tcp_time_stamp;
-- tcp_rsk(req)->listener = NULL;
-- inet_csk_reqsk_queue_hash_add(sk, req, TCP_TIMEOUT_INIT);
-- }
--
-- return 0;
--
--drop_and_release:
-- dst_release(dst);
--drop_and_free:
-- reqsk_free(req);
- drop:
- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENDROPS);
- return 0;
-@@ -1497,7 +1433,7 @@ put_and_exit:
- }
- EXPORT_SYMBOL(tcp_v4_syn_recv_sock);
-
--static struct sock *tcp_v4_hnd_req(struct sock *sk, struct sk_buff *skb)
-+struct sock *tcp_v4_hnd_req(struct sock *sk, struct sk_buff *skb)
- {
- struct tcphdr *th = tcp_hdr(skb);
- const struct iphdr *iph = ip_hdr(skb);
-@@ -1514,8 +1450,15 @@ static struct sock *tcp_v4_hnd_req(struct sock *sk, struct sk_buff *skb)
-
- if (nsk) {
- if (nsk->sk_state != TCP_TIME_WAIT) {
-+ /* Don't lock again the meta-sk. It has been locked
-+ * before mptcp_v4_do_rcv.
-+ */
-+ if (mptcp(tcp_sk(nsk)) && !is_meta_sk(sk))
-+ bh_lock_sock(mptcp_meta_sk(nsk));
- bh_lock_sock(nsk);
-+
- return nsk;
-+
- }
- inet_twsk_put(inet_twsk(nsk));
- return NULL;
-@@ -1550,6 +1493,9 @@ int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
- goto discard;
- #endif
-
-+ if (is_meta_sk(sk))
-+ return mptcp_v4_do_rcv(sk, skb);
-+
- if (sk->sk_state == TCP_ESTABLISHED) { /* Fast path */
- struct dst_entry *dst = sk->sk_rx_dst;
-
-@@ -1681,7 +1627,7 @@ bool tcp_prequeue(struct sock *sk, struct sk_buff *skb)
- } else if (skb_queue_len(&tp->ucopy.prequeue) == 1) {
- wake_up_interruptible_sync_poll(sk_sleep(sk),
- POLLIN | POLLRDNORM | POLLRDBAND);
-- if (!inet_csk_ack_scheduled(sk))
-+ if (!inet_csk_ack_scheduled(sk) && !mptcp(tp))
- inet_csk_reset_xmit_timer(sk, ICSK_TIME_DACK,
- (3 * tcp_rto_min(sk)) / 4,
- TCP_RTO_MAX);
-@@ -1698,7 +1644,7 @@ int tcp_v4_rcv(struct sk_buff *skb)
- {
- const struct iphdr *iph;
- const struct tcphdr *th;
-- struct sock *sk;
-+ struct sock *sk, *meta_sk = NULL;
- int ret;
- struct net *net = dev_net(skb->dev);
-
-@@ -1732,18 +1678,42 @@ int tcp_v4_rcv(struct sk_buff *skb)
- TCP_SKB_CB(skb)->end_seq = (TCP_SKB_CB(skb)->seq + th->syn + th->fin +
- skb->len - th->doff * 4);
- TCP_SKB_CB(skb)->ack_seq = ntohl(th->ack_seq);
-+#ifdef CONFIG_MPTCP
-+ TCP_SKB_CB(skb)->mptcp_flags = 0;
-+ TCP_SKB_CB(skb)->dss_off = 0;
-+#endif
- TCP_SKB_CB(skb)->when = 0;
- TCP_SKB_CB(skb)->ip_dsfield = ipv4_get_dsfield(iph);
- TCP_SKB_CB(skb)->sacked = 0;
-
- sk = __inet_lookup_skb(&tcp_hashinfo, skb, th->source, th->dest);
-- if (!sk)
-- goto no_tcp_socket;
-
- process:
-- if (sk->sk_state == TCP_TIME_WAIT)
-+ if (sk && sk->sk_state == TCP_TIME_WAIT)
- goto do_time_wait;
-
-+#ifdef CONFIG_MPTCP
-+ if (!sk && th->syn && !th->ack) {
-+ int ret = mptcp_lookup_join(skb, NULL);
-+
-+ if (ret < 0) {
-+ tcp_v4_send_reset(NULL, skb);
-+ goto discard_it;
-+ } else if (ret > 0) {
-+ return 0;
-+ }
-+ }
-+
-+ /* Is there a pending request sock for this segment ? */
-+ if ((!sk || sk->sk_state == TCP_LISTEN) && mptcp_check_req(skb, net)) {
-+ if (sk)
-+ sock_put(sk);
-+ return 0;
-+ }
-+#endif
-+ if (!sk)
-+ goto no_tcp_socket;
-+
- if (unlikely(iph->ttl < inet_sk(sk)->min_ttl)) {
- NET_INC_STATS_BH(net, LINUX_MIB_TCPMINTTLDROP);
- goto discard_and_relse;
-@@ -1759,11 +1729,21 @@ process:
- sk_mark_napi_id(sk, skb);
- skb->dev = NULL;
-
-- bh_lock_sock_nested(sk);
-+ if (mptcp(tcp_sk(sk))) {
-+ meta_sk = mptcp_meta_sk(sk);
-+
-+ bh_lock_sock_nested(meta_sk);
-+ if (sock_owned_by_user(meta_sk))
-+ skb->sk = sk;
-+ } else {
-+ meta_sk = sk;
-+ bh_lock_sock_nested(sk);
-+ }
-+
- ret = 0;
-- if (!sock_owned_by_user(sk)) {
-+ if (!sock_owned_by_user(meta_sk)) {
- #ifdef CONFIG_NET_DMA
-- struct tcp_sock *tp = tcp_sk(sk);
-+ struct tcp_sock *tp = tcp_sk(meta_sk);
- if (!tp->ucopy.dma_chan && tp->ucopy.pinned_list)
- tp->ucopy.dma_chan = net_dma_find_channel();
- if (tp->ucopy.dma_chan)
-@@ -1771,16 +1751,16 @@ process:
- else
- #endif
- {
-- if (!tcp_prequeue(sk, skb))
-+ if (!tcp_prequeue(meta_sk, skb))
- ret = tcp_v4_do_rcv(sk, skb);
- }
-- } else if (unlikely(sk_add_backlog(sk, skb,
-- sk->sk_rcvbuf + sk->sk_sndbuf))) {
-- bh_unlock_sock(sk);
-+ } else if (unlikely(sk_add_backlog(meta_sk, skb,
-+ meta_sk->sk_rcvbuf + meta_sk->sk_sndbuf))) {
-+ bh_unlock_sock(meta_sk);
- NET_INC_STATS_BH(net, LINUX_MIB_TCPBACKLOGDROP);
- goto discard_and_relse;
- }
-- bh_unlock_sock(sk);
-+ bh_unlock_sock(meta_sk);
-
- sock_put(sk);
-
-@@ -1835,6 +1815,18 @@ do_time_wait:
- sk = sk2;
- goto process;
- }
-+#ifdef CONFIG_MPTCP
-+ if (th->syn && !th->ack) {
-+ int ret = mptcp_lookup_join(skb, inet_twsk(sk));
-+
-+ if (ret < 0) {
-+ tcp_v4_send_reset(NULL, skb);
-+ goto discard_it;
-+ } else if (ret > 0) {
-+ return 0;
-+ }
-+ }
-+#endif
- /* Fall through to ACK */
- }
- case TCP_TW_ACK:
-@@ -1900,7 +1892,12 @@ static int tcp_v4_init_sock(struct sock *sk)
-
- tcp_init_sock(sk);
-
-- icsk->icsk_af_ops = &ipv4_specific;
-+#ifdef CONFIG_MPTCP
-+ if (is_mptcp_enabled(sk))
-+ icsk->icsk_af_ops = &mptcp_v4_specific;
-+ else
-+#endif
-+ icsk->icsk_af_ops = &ipv4_specific;
-
- #ifdef CONFIG_TCP_MD5SIG
- tcp_sk(sk)->af_specific = &tcp_sock_ipv4_specific;
-@@ -1917,6 +1914,11 @@ void tcp_v4_destroy_sock(struct sock *sk)
-
- tcp_cleanup_congestion_control(sk);
-
-+ if (mptcp(tp))
-+ mptcp_destroy_sock(sk);
-+ if (tp->inside_tk_table)
-+ mptcp_hash_remove(tp);
-+
- /* Cleanup up the write buffer. */
- tcp_write_queue_purge(sk);
-
-@@ -2481,6 +2483,19 @@ void tcp4_proc_exit(void)
- }
- #endif /* CONFIG_PROC_FS */
-
-+#ifdef CONFIG_MPTCP
-+static void tcp_v4_clear_sk(struct sock *sk, int size)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+
-+ /* we do not want to clear tk_table field, because of RCU lookups */
-+ sk_prot_clear_nulls(sk, offsetof(struct tcp_sock, tk_table));
-+
-+ size -= offsetof(struct tcp_sock, tk_table) + sizeof(tp->tk_table);
-+ memset((char *)&tp->tk_table + sizeof(tp->tk_table), 0, size);
-+}
-+#endif
-+
- struct proto tcp_prot = {
- .name = "TCP",
- .owner = THIS_MODULE,
-@@ -2528,6 +2543,9 @@ struct proto tcp_prot = {
- .destroy_cgroup = tcp_destroy_cgroup,
- .proto_cgroup = tcp_proto_cgroup,
- #endif
-+#ifdef CONFIG_MPTCP
-+ .clear_sk = tcp_v4_clear_sk,
-+#endif
- };
- EXPORT_SYMBOL(tcp_prot);
-
-diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
-index e68e0d4af6c9..ae6946857dff 100644
---- a/net/ipv4/tcp_minisocks.c
-+++ b/net/ipv4/tcp_minisocks.c
-@@ -18,11 +18,13 @@
- * Jorge Cwik, <jorge@laser.satlink.net>
- */
-
-+#include <linux/kconfig.h>
- #include <linux/mm.h>
- #include <linux/module.h>
- #include <linux/slab.h>
- #include <linux/sysctl.h>
- #include <linux/workqueue.h>
-+#include <net/mptcp.h>
- #include <net/tcp.h>
- #include <net/inet_common.h>
- #include <net/xfrm.h>
-@@ -95,10 +97,13 @@ tcp_timewait_state_process(struct inet_timewait_sock *tw, struct sk_buff *skb,
- struct tcp_options_received tmp_opt;
- struct tcp_timewait_sock *tcptw = tcp_twsk((struct sock *)tw);
- bool paws_reject = false;
-+ struct mptcp_options_received mopt;
-
- tmp_opt.saw_tstamp = 0;
- if (th->doff > (sizeof(*th) >> 2) && tcptw->tw_ts_recent_stamp) {
-- tcp_parse_options(skb, &tmp_opt, 0, NULL);
-+ mptcp_init_mp_opt(&mopt);
-+
-+ tcp_parse_options(skb, &tmp_opt, &mopt, 0, NULL);
-
- if (tmp_opt.saw_tstamp) {
- tmp_opt.rcv_tsecr -= tcptw->tw_ts_offset;
-@@ -106,6 +111,11 @@ tcp_timewait_state_process(struct inet_timewait_sock *tw, struct sk_buff *skb,
- tmp_opt.ts_recent_stamp = tcptw->tw_ts_recent_stamp;
- paws_reject = tcp_paws_reject(&tmp_opt, th->rst);
- }
-+
-+ if (unlikely(mopt.mp_fclose) && tcptw->mptcp_tw) {
-+ if (mopt.mptcp_key == tcptw->mptcp_tw->loc_key)
-+ goto kill_with_rst;
-+ }
- }
-
- if (tw->tw_substate == TCP_FIN_WAIT2) {
-@@ -128,6 +138,16 @@ tcp_timewait_state_process(struct inet_timewait_sock *tw, struct sk_buff *skb,
- if (!th->ack ||
- !after(TCP_SKB_CB(skb)->end_seq, tcptw->tw_rcv_nxt) ||
- TCP_SKB_CB(skb)->end_seq == TCP_SKB_CB(skb)->seq) {
-+ /* If mptcp_is_data_fin() returns true, we are sure that
-+ * mopt has been initialized - otherwise it would not
-+ * be a DATA_FIN.
-+ */
-+ if (tcptw->mptcp_tw && tcptw->mptcp_tw->meta_tw &&
-+ mptcp_is_data_fin(skb) &&
-+ TCP_SKB_CB(skb)->seq == tcptw->tw_rcv_nxt &&
-+ mopt.data_seq + 1 == (u32)tcptw->mptcp_tw->rcv_nxt)
-+ return TCP_TW_ACK;
-+
- inet_twsk_put(tw);
- return TCP_TW_SUCCESS;
- }
-@@ -290,6 +310,15 @@ void tcp_time_wait(struct sock *sk, int state, int timeo)
- tcptw->tw_ts_recent_stamp = tp->rx_opt.ts_recent_stamp;
- tcptw->tw_ts_offset = tp->tsoffset;
-
-+ if (mptcp(tp)) {
-+ if (mptcp_init_tw_sock(sk, tcptw)) {
-+ inet_twsk_free(tw);
-+ goto exit;
-+ }
-+ } else {
-+ tcptw->mptcp_tw = NULL;
-+ }
-+
- #if IS_ENABLED(CONFIG_IPV6)
- if (tw->tw_family == PF_INET6) {
- struct ipv6_pinfo *np = inet6_sk(sk);
-@@ -347,15 +376,18 @@ void tcp_time_wait(struct sock *sk, int state, int timeo)
- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPTIMEWAITOVERFLOW);
- }
-
-+exit:
- tcp_update_metrics(sk);
- tcp_done(sk);
- }
-
- void tcp_twsk_destructor(struct sock *sk)
- {
--#ifdef CONFIG_TCP_MD5SIG
- struct tcp_timewait_sock *twsk = tcp_twsk(sk);
-
-+ if (twsk->mptcp_tw)
-+ mptcp_twsk_destructor(twsk);
-+#ifdef CONFIG_TCP_MD5SIG
- if (twsk->tw_md5_key)
- kfree_rcu(twsk->tw_md5_key, rcu);
- #endif
-@@ -382,13 +414,14 @@ void tcp_openreq_init_rwin(struct request_sock *req,
- req->window_clamp = tcp_full_space(sk);
-
- /* tcp_full_space because it is guaranteed to be the first packet */
-- tcp_select_initial_window(tcp_full_space(sk),
-- mss - (ireq->tstamp_ok ? TCPOLEN_TSTAMP_ALIGNED : 0),
-+ tp->ops->select_initial_window(tcp_full_space(sk),
-+ mss - (ireq->tstamp_ok ? TCPOLEN_TSTAMP_ALIGNED : 0) -
-+ (ireq->saw_mpc ? MPTCP_SUB_LEN_DSM_ALIGN : 0),
- &req->rcv_wnd,
- &req->window_clamp,
- ireq->wscale_ok,
- &rcv_wscale,
-- dst_metric(dst, RTAX_INITRWND));
-+ dst_metric(dst, RTAX_INITRWND), sk);
- ireq->rcv_wscale = rcv_wscale;
- }
- EXPORT_SYMBOL(tcp_openreq_init_rwin);
-@@ -499,6 +532,8 @@ struct sock *tcp_create_openreq_child(struct sock *sk, struct request_sock *req,
- newtp->rx_opt.ts_recent_stamp = 0;
- newtp->tcp_header_len = sizeof(struct tcphdr);
- }
-+ if (ireq->saw_mpc)
-+ newtp->tcp_header_len += MPTCP_SUB_LEN_DSM_ALIGN;
- newtp->tsoffset = 0;
- #ifdef CONFIG_TCP_MD5SIG
- newtp->md5sig_info = NULL; /*XXX*/
-@@ -535,16 +570,20 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
- bool fastopen)
- {
- struct tcp_options_received tmp_opt;
-+ struct mptcp_options_received mopt;
- struct sock *child;
- const struct tcphdr *th = tcp_hdr(skb);
- __be32 flg = tcp_flag_word(th) & (TCP_FLAG_RST|TCP_FLAG_SYN|TCP_FLAG_ACK);
- bool paws_reject = false;
-
-- BUG_ON(fastopen == (sk->sk_state == TCP_LISTEN));
-+ BUG_ON(!mptcp(tcp_sk(sk)) && fastopen == (sk->sk_state == TCP_LISTEN));
-
- tmp_opt.saw_tstamp = 0;
-+
-+ mptcp_init_mp_opt(&mopt);
-+
- if (th->doff > (sizeof(struct tcphdr)>>2)) {
-- tcp_parse_options(skb, &tmp_opt, 0, NULL);
-+ tcp_parse_options(skb, &tmp_opt, &mopt, 0, NULL);
-
- if (tmp_opt.saw_tstamp) {
- tmp_opt.ts_recent = req->ts_recent;
-@@ -583,7 +622,14 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
- *
- * Reset timer after retransmitting SYNACK, similar to
- * the idea of fast retransmit in recovery.
-+ *
-+ * Fall back to TCP if MP_CAPABLE is not set.
- */
-+
-+ if (inet_rsk(req)->saw_mpc && !mopt.saw_mpc)
-+ inet_rsk(req)->saw_mpc = false;
-+
-+
- if (!inet_rtx_syn_ack(sk, req))
- req->expires = min(TCP_TIMEOUT_INIT << req->num_timeout,
- TCP_RTO_MAX) + jiffies;
-@@ -718,9 +764,21 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
- * socket is created, wait for troubles.
- */
- child = inet_csk(sk)->icsk_af_ops->syn_recv_sock(sk, skb, req, NULL);
-+
- if (child == NULL)
- goto listen_overflow;
-
-+ if (!is_meta_sk(sk)) {
-+ int ret = mptcp_check_req_master(sk, child, req, prev);
-+ if (ret < 0)
-+ goto listen_overflow;
-+
-+ /* MPTCP-supported */
-+ if (!ret)
-+ return tcp_sk(child)->mpcb->master_sk;
-+ } else {
-+ return mptcp_check_req_child(sk, child, req, prev, &mopt);
-+ }
- inet_csk_reqsk_queue_unlink(sk, req, prev);
- inet_csk_reqsk_queue_removed(sk, req);
-
-@@ -746,7 +804,17 @@ embryonic_reset:
- tcp_reset(sk);
- }
- if (!fastopen) {
-- inet_csk_reqsk_queue_drop(sk, req, prev);
-+ if (is_meta_sk(sk)) {
-+ /* We want to avoid stoping the keepalive-timer and so
-+ * avoid ending up in inet_csk_reqsk_queue_removed ...
-+ */
-+ inet_csk_reqsk_queue_unlink(sk, req, prev);
-+ if (reqsk_queue_removed(&inet_csk(sk)->icsk_accept_queue, req) == 0)
-+ mptcp_delete_synack_timer(sk);
-+ reqsk_free(req);
-+ } else {
-+ inet_csk_reqsk_queue_drop(sk, req, prev);
-+ }
- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_EMBRYONICRSTS);
- }
- return NULL;
-@@ -770,8 +838,9 @@ int tcp_child_process(struct sock *parent, struct sock *child,
- {
- int ret = 0;
- int state = child->sk_state;
-+ struct sock *meta_sk = mptcp(tcp_sk(child)) ? mptcp_meta_sk(child) : child;
-
-- if (!sock_owned_by_user(child)) {
-+ if (!sock_owned_by_user(meta_sk)) {
- ret = tcp_rcv_state_process(child, skb, tcp_hdr(skb),
- skb->len);
- /* Wakeup parent, send SIGIO */
-@@ -782,10 +851,14 @@ int tcp_child_process(struct sock *parent, struct sock *child,
- * in main socket hash table and lock on listening
- * socket does not protect us more.
- */
-- __sk_add_backlog(child, skb);
-+ if (mptcp(tcp_sk(child)))
-+ skb->sk = child;
-+ __sk_add_backlog(meta_sk, skb);
- }
-
-- bh_unlock_sock(child);
-+ if (mptcp(tcp_sk(child)))
-+ bh_unlock_sock(child);
-+ bh_unlock_sock(meta_sk);
- sock_put(child);
- return ret;
- }
-diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
-index 179b51e6bda3..efd31b6c5784 100644
---- a/net/ipv4/tcp_output.c
-+++ b/net/ipv4/tcp_output.c
-@@ -36,6 +36,12 @@
-
- #define pr_fmt(fmt) "TCP: " fmt
-
-+#include <net/mptcp.h>
-+#include <net/mptcp_v4.h>
-+#if IS_ENABLED(CONFIG_IPV6)
-+#include <net/mptcp_v6.h>
-+#endif
-+#include <net/ipv6.h>
- #include <net/tcp.h>
-
- #include <linux/compiler.h>
-@@ -68,11 +74,8 @@ int sysctl_tcp_slow_start_after_idle __read_mostly = 1;
- unsigned int sysctl_tcp_notsent_lowat __read_mostly = UINT_MAX;
- EXPORT_SYMBOL(sysctl_tcp_notsent_lowat);
-
--static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
-- int push_one, gfp_t gfp);
--
- /* Account for new data that has been sent to the network. */
--static void tcp_event_new_data_sent(struct sock *sk, const struct sk_buff *skb)
-+void tcp_event_new_data_sent(struct sock *sk, const struct sk_buff *skb)
- {
- struct inet_connection_sock *icsk = inet_csk(sk);
- struct tcp_sock *tp = tcp_sk(sk);
-@@ -214,7 +217,7 @@ u32 tcp_default_init_rwnd(u32 mss)
- void tcp_select_initial_window(int __space, __u32 mss,
- __u32 *rcv_wnd, __u32 *window_clamp,
- int wscale_ok, __u8 *rcv_wscale,
-- __u32 init_rcv_wnd)
-+ __u32 init_rcv_wnd, const struct sock *sk)
- {
- unsigned int space = (__space < 0 ? 0 : __space);
-
-@@ -269,12 +272,16 @@ EXPORT_SYMBOL(tcp_select_initial_window);
- * value can be stuffed directly into th->window for an outgoing
- * frame.
- */
--static u16 tcp_select_window(struct sock *sk)
-+u16 tcp_select_window(struct sock *sk)
- {
- struct tcp_sock *tp = tcp_sk(sk);
- u32 old_win = tp->rcv_wnd;
-- u32 cur_win = tcp_receive_window(tp);
-- u32 new_win = __tcp_select_window(sk);
-+ /* The window must never shrink at the meta-level. At the subflow we
-+ * have to allow this. Otherwise we may announce a window too large
-+ * for the current meta-level sk_rcvbuf.
-+ */
-+ u32 cur_win = tcp_receive_window(mptcp(tp) ? tcp_sk(mptcp_meta_sk(sk)) : tp);
-+ u32 new_win = tp->ops->__select_window(sk);
-
- /* Never shrink the offered window */
- if (new_win < cur_win) {
-@@ -290,6 +297,7 @@ static u16 tcp_select_window(struct sock *sk)
- LINUX_MIB_TCPWANTZEROWINDOWADV);
- new_win = ALIGN(cur_win, 1 << tp->rx_opt.rcv_wscale);
- }
-+
- tp->rcv_wnd = new_win;
- tp->rcv_wup = tp->rcv_nxt;
-
-@@ -374,7 +382,7 @@ static inline void TCP_ECN_send(struct sock *sk, struct sk_buff *skb,
- /* Constructs common control bits of non-data skb. If SYN/FIN is present,
- * auto increment end seqno.
- */
--static void tcp_init_nondata_skb(struct sk_buff *skb, u32 seq, u8 flags)
-+void tcp_init_nondata_skb(struct sk_buff *skb, u32 seq, u8 flags)
- {
- struct skb_shared_info *shinfo = skb_shinfo(skb);
-
-@@ -394,7 +402,7 @@ static void tcp_init_nondata_skb(struct sk_buff *skb, u32 seq, u8 flags)
- TCP_SKB_CB(skb)->end_seq = seq;
- }
-
--static inline bool tcp_urg_mode(const struct tcp_sock *tp)
-+bool tcp_urg_mode(const struct tcp_sock *tp)
- {
- return tp->snd_una != tp->snd_up;
- }
-@@ -404,17 +412,7 @@ static inline bool tcp_urg_mode(const struct tcp_sock *tp)
- #define OPTION_MD5 (1 << 2)
- #define OPTION_WSCALE (1 << 3)
- #define OPTION_FAST_OPEN_COOKIE (1 << 8)
--
--struct tcp_out_options {
-- u16 options; /* bit field of OPTION_* */
-- u16 mss; /* 0 to disable */
-- u8 ws; /* window scale, 0 to disable */
-- u8 num_sack_blocks; /* number of SACK blocks to include */
-- u8 hash_size; /* bytes in hash_location */
-- __u8 *hash_location; /* temporary pointer, overloaded */
-- __u32 tsval, tsecr; /* need to include OPTION_TS */
-- struct tcp_fastopen_cookie *fastopen_cookie; /* Fast open cookie */
--};
-+/* Before adding here - take a look at OPTION_MPTCP in include/net/mptcp.h */
-
- /* Write previously computed TCP options to the packet.
- *
-@@ -430,7 +428,7 @@ struct tcp_out_options {
- * (but it may well be that other scenarios fail similarly).
- */
- static void tcp_options_write(__be32 *ptr, struct tcp_sock *tp,
-- struct tcp_out_options *opts)
-+ struct tcp_out_options *opts, struct sk_buff *skb)
- {
- u16 options = opts->options; /* mungable copy */
-
-@@ -513,6 +511,9 @@ static void tcp_options_write(__be32 *ptr, struct tcp_sock *tp,
- }
- ptr += (foc->len + 3) >> 2;
- }
-+
-+ if (unlikely(OPTION_MPTCP & opts->options))
-+ mptcp_options_write(ptr, tp, opts, skb);
- }
-
- /* Compute TCP options for SYN packets. This is not the final
-@@ -564,6 +565,8 @@ static unsigned int tcp_syn_options(struct sock *sk, struct sk_buff *skb,
- if (unlikely(!(OPTION_TS & opts->options)))
- remaining -= TCPOLEN_SACKPERM_ALIGNED;
- }
-+ if (tp->request_mptcp || mptcp(tp))
-+ mptcp_syn_options(sk, opts, &remaining);
-
- if (fastopen && fastopen->cookie.len >= 0) {
- u32 need = TCPOLEN_EXP_FASTOPEN_BASE + fastopen->cookie.len;
-@@ -637,6 +640,9 @@ static unsigned int tcp_synack_options(struct sock *sk,
- }
- }
-
-+ if (ireq->saw_mpc)
-+ mptcp_synack_options(req, opts, &remaining);
-+
- return MAX_TCP_OPTION_SPACE - remaining;
- }
-
-@@ -670,16 +676,22 @@ static unsigned int tcp_established_options(struct sock *sk, struct sk_buff *skb
- opts->tsecr = tp->rx_opt.ts_recent;
- size += TCPOLEN_TSTAMP_ALIGNED;
- }
-+ if (mptcp(tp))
-+ mptcp_established_options(sk, skb, opts, &size);
-
- eff_sacks = tp->rx_opt.num_sacks + tp->rx_opt.dsack;
- if (unlikely(eff_sacks)) {
-- const unsigned int remaining = MAX_TCP_OPTION_SPACE - size;
-- opts->num_sack_blocks =
-- min_t(unsigned int, eff_sacks,
-- (remaining - TCPOLEN_SACK_BASE_ALIGNED) /
-- TCPOLEN_SACK_PERBLOCK);
-- size += TCPOLEN_SACK_BASE_ALIGNED +
-- opts->num_sack_blocks * TCPOLEN_SACK_PERBLOCK;
-+ const unsigned remaining = MAX_TCP_OPTION_SPACE - size;
-+ if (remaining < TCPOLEN_SACK_BASE_ALIGNED)
-+ opts->num_sack_blocks = 0;
-+ else
-+ opts->num_sack_blocks =
-+ min_t(unsigned int, eff_sacks,
-+ (remaining - TCPOLEN_SACK_BASE_ALIGNED) /
-+ TCPOLEN_SACK_PERBLOCK);
-+ if (opts->num_sack_blocks)
-+ size += TCPOLEN_SACK_BASE_ALIGNED +
-+ opts->num_sack_blocks * TCPOLEN_SACK_PERBLOCK;
- }
-
- return size;
-@@ -711,8 +723,8 @@ static void tcp_tsq_handler(struct sock *sk)
- if ((1 << sk->sk_state) &
- (TCPF_ESTABLISHED | TCPF_FIN_WAIT1 | TCPF_CLOSING |
- TCPF_CLOSE_WAIT | TCPF_LAST_ACK))
-- tcp_write_xmit(sk, tcp_current_mss(sk), tcp_sk(sk)->nonagle,
-- 0, GFP_ATOMIC);
-+ tcp_sk(sk)->ops->write_xmit(sk, tcp_current_mss(sk),
-+ tcp_sk(sk)->nonagle, 0, GFP_ATOMIC);
- }
- /*
- * One tasklet per cpu tries to send more skbs.
-@@ -727,7 +739,7 @@ static void tcp_tasklet_func(unsigned long data)
- unsigned long flags;
- struct list_head *q, *n;
- struct tcp_sock *tp;
-- struct sock *sk;
-+ struct sock *sk, *meta_sk;
-
- local_irq_save(flags);
- list_splice_init(&tsq->head, &list);
-@@ -738,15 +750,25 @@ static void tcp_tasklet_func(unsigned long data)
- list_del(&tp->tsq_node);
-
- sk = (struct sock *)tp;
-- bh_lock_sock(sk);
-+ meta_sk = mptcp(tp) ? mptcp_meta_sk(sk) : sk;
-+ bh_lock_sock(meta_sk);
-
-- if (!sock_owned_by_user(sk)) {
-+ if (!sock_owned_by_user(meta_sk)) {
- tcp_tsq_handler(sk);
-+ if (mptcp(tp))
-+ tcp_tsq_handler(meta_sk);
- } else {
-+ if (mptcp(tp) && sk->sk_state == TCP_CLOSE)
-+ goto exit;
-+
- /* defer the work to tcp_release_cb() */
- set_bit(TCP_TSQ_DEFERRED, &tp->tsq_flags);
-+
-+ if (mptcp(tp))
-+ mptcp_tsq_flags(sk);
- }
-- bh_unlock_sock(sk);
-+exit:
-+ bh_unlock_sock(meta_sk);
-
- clear_bit(TSQ_QUEUED, &tp->tsq_flags);
- sk_free(sk);
-@@ -756,7 +778,10 @@ static void tcp_tasklet_func(unsigned long data)
- #define TCP_DEFERRED_ALL ((1UL << TCP_TSQ_DEFERRED) | \
- (1UL << TCP_WRITE_TIMER_DEFERRED) | \
- (1UL << TCP_DELACK_TIMER_DEFERRED) | \
-- (1UL << TCP_MTU_REDUCED_DEFERRED))
-+ (1UL << TCP_MTU_REDUCED_DEFERRED) | \
-+ (1UL << MPTCP_PATH_MANAGER) | \
-+ (1UL << MPTCP_SUB_DEFERRED))
-+
- /**
- * tcp_release_cb - tcp release_sock() callback
- * @sk: socket
-@@ -803,6 +828,13 @@ void tcp_release_cb(struct sock *sk)
- sk->sk_prot->mtu_reduced(sk);
- __sock_put(sk);
- }
-+ if (flags & (1UL << MPTCP_PATH_MANAGER)) {
-+ if (tcp_sk(sk)->mpcb->pm_ops->release_sock)
-+ tcp_sk(sk)->mpcb->pm_ops->release_sock(sk);
-+ __sock_put(sk);
-+ }
-+ if (flags & (1UL << MPTCP_SUB_DEFERRED))
-+ mptcp_tsq_sub_deferred(sk);
- }
- EXPORT_SYMBOL(tcp_release_cb);
-
-@@ -862,8 +894,8 @@ void tcp_wfree(struct sk_buff *skb)
- * We are working here with either a clone of the original
- * SKB, or a fresh unique copy made by the retransmit engine.
- */
--static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
-- gfp_t gfp_mask)
-+int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
-+ gfp_t gfp_mask)
- {
- const struct inet_connection_sock *icsk = inet_csk(sk);
- struct inet_sock *inet;
-@@ -933,7 +965,7 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
- */
- th->window = htons(min(tp->rcv_wnd, 65535U));
- } else {
-- th->window = htons(tcp_select_window(sk));
-+ th->window = htons(tp->ops->select_window(sk));
- }
- th->check = 0;
- th->urg_ptr = 0;
-@@ -949,7 +981,7 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
- }
- }
-
-- tcp_options_write((__be32 *)(th + 1), tp, &opts);
-+ tcp_options_write((__be32 *)(th + 1), tp, &opts, skb);
- if (likely((tcb->tcp_flags & TCPHDR_SYN) == 0))
- TCP_ECN_send(sk, skb, tcp_header_size);
-
-@@ -988,7 +1020,7 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
- * NOTE: probe0 timer is not checked, do not forget tcp_push_pending_frames,
- * otherwise socket can stall.
- */
--static void tcp_queue_skb(struct sock *sk, struct sk_buff *skb)
-+void tcp_queue_skb(struct sock *sk, struct sk_buff *skb)
- {
- struct tcp_sock *tp = tcp_sk(sk);
-
-@@ -1001,15 +1033,16 @@ static void tcp_queue_skb(struct sock *sk, struct sk_buff *skb)
- }
-
- /* Initialize TSO segments for a packet. */
--static void tcp_set_skb_tso_segs(const struct sock *sk, struct sk_buff *skb,
-- unsigned int mss_now)
-+void tcp_set_skb_tso_segs(const struct sock *sk, struct sk_buff *skb,
-+ unsigned int mss_now)
- {
- struct skb_shared_info *shinfo = skb_shinfo(skb);
-
- /* Make sure we own this skb before messing gso_size/gso_segs */
- WARN_ON_ONCE(skb_cloned(skb));
-
-- if (skb->len <= mss_now || skb->ip_summed == CHECKSUM_NONE) {
-+ if (skb->len <= mss_now || (is_meta_sk(sk) && !mptcp_sk_can_gso(sk)) ||
-+ (!is_meta_sk(sk) && !sk_can_gso(sk)) || skb->ip_summed == CHECKSUM_NONE) {
- /* Avoid the costly divide in the normal
- * non-TSO case.
- */
-@@ -1041,7 +1074,7 @@ static void tcp_adjust_fackets_out(struct sock *sk, const struct sk_buff *skb,
- /* Pcount in the middle of the write queue got changed, we need to do various
- * tweaks to fix counters
- */
--static void tcp_adjust_pcount(struct sock *sk, const struct sk_buff *skb, int decr)
-+void tcp_adjust_pcount(struct sock *sk, const struct sk_buff *skb, int decr)
- {
- struct tcp_sock *tp = tcp_sk(sk);
-
-@@ -1164,7 +1197,7 @@ int tcp_fragment(struct sock *sk, struct sk_buff *skb, u32 len,
- * eventually). The difference is that pulled data not copied, but
- * immediately discarded.
- */
--static void __pskb_trim_head(struct sk_buff *skb, int len)
-+void __pskb_trim_head(struct sk_buff *skb, int len)
- {
- struct skb_shared_info *shinfo;
- int i, k, eat;
-@@ -1205,6 +1238,9 @@ static void __pskb_trim_head(struct sk_buff *skb, int len)
- /* Remove acked data from a packet in the transmit queue. */
- int tcp_trim_head(struct sock *sk, struct sk_buff *skb, u32 len)
- {
-+ if (mptcp(tcp_sk(sk)) && !is_meta_sk(sk) && mptcp_is_data_seq(skb))
-+ return mptcp_trim_head(sk, skb, len);
-+
- if (skb_unclone(skb, GFP_ATOMIC))
- return -ENOMEM;
-
-@@ -1222,6 +1258,15 @@ int tcp_trim_head(struct sock *sk, struct sk_buff *skb, u32 len)
- if (tcp_skb_pcount(skb) > 1)
- tcp_set_skb_tso_segs(sk, skb, tcp_skb_mss(skb));
-
-+#ifdef CONFIG_MPTCP
-+ /* Some data got acked - we assume that the seq-number reached the dest.
-+ * Anyway, our MPTCP-option has been trimmed above - we lost it here.
-+ * Only remove the SEQ if the call does not come from a meta retransmit.
-+ */
-+ if (mptcp(tcp_sk(sk)) && !is_meta_sk(sk))
-+ TCP_SKB_CB(skb)->mptcp_flags &= ~MPTCPHDR_SEQ;
-+#endif
-+
- return 0;
- }
-
-@@ -1379,6 +1424,7 @@ unsigned int tcp_current_mss(struct sock *sk)
-
- return mss_now;
- }
-+EXPORT_SYMBOL(tcp_current_mss);
-
- /* RFC2861, slow part. Adjust cwnd, after it was not full during one rto.
- * As additional protections, we do not touch cwnd in retransmission phases,
-@@ -1446,8 +1492,8 @@ static bool tcp_minshall_check(const struct tcp_sock *tp)
- * But we can avoid doing the divide again given we already have
- * skb_pcount = skb->len / mss_now
- */
--static void tcp_minshall_update(struct tcp_sock *tp, unsigned int mss_now,
-- const struct sk_buff *skb)
-+void tcp_minshall_update(struct tcp_sock *tp, unsigned int mss_now,
-+ const struct sk_buff *skb)
- {
- if (skb->len < tcp_skb_pcount(skb) * mss_now)
- tp->snd_sml = TCP_SKB_CB(skb)->end_seq;
-@@ -1468,11 +1514,11 @@ static bool tcp_nagle_check(bool partial, const struct tcp_sock *tp,
- (!nonagle && tp->packets_out && tcp_minshall_check(tp)));
- }
- /* Returns the portion of skb which can be sent right away */
--static unsigned int tcp_mss_split_point(const struct sock *sk,
-- const struct sk_buff *skb,
-- unsigned int mss_now,
-- unsigned int max_segs,
-- int nonagle)
-+unsigned int tcp_mss_split_point(const struct sock *sk,
-+ const struct sk_buff *skb,
-+ unsigned int mss_now,
-+ unsigned int max_segs,
-+ int nonagle)
- {
- const struct tcp_sock *tp = tcp_sk(sk);
- u32 partial, needed, window, max_len;
-@@ -1502,13 +1548,14 @@ static unsigned int tcp_mss_split_point(const struct sock *sk,
- /* Can at least one segment of SKB be sent right now, according to the
- * congestion window rules? If so, return how many segments are allowed.
- */
--static inline unsigned int tcp_cwnd_test(const struct tcp_sock *tp,
-- const struct sk_buff *skb)
-+unsigned int tcp_cwnd_test(const struct tcp_sock *tp,
-+ const struct sk_buff *skb)
- {
- u32 in_flight, cwnd;
-
- /* Don't be strict about the congestion window for the final FIN. */
-- if ((TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN) &&
-+ if (skb &&
-+ (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN) &&
- tcp_skb_pcount(skb) == 1)
- return 1;
-
-@@ -1524,8 +1571,8 @@ static inline unsigned int tcp_cwnd_test(const struct tcp_sock *tp,
- * This must be invoked the first time we consider transmitting
- * SKB onto the wire.
- */
--static int tcp_init_tso_segs(const struct sock *sk, struct sk_buff *skb,
-- unsigned int mss_now)
-+int tcp_init_tso_segs(const struct sock *sk, struct sk_buff *skb,
-+ unsigned int mss_now)
- {
- int tso_segs = tcp_skb_pcount(skb);
-
-@@ -1540,8 +1587,8 @@ static int tcp_init_tso_segs(const struct sock *sk, struct sk_buff *skb,
- /* Return true if the Nagle test allows this packet to be
- * sent now.
- */
--static inline bool tcp_nagle_test(const struct tcp_sock *tp, const struct sk_buff *skb,
-- unsigned int cur_mss, int nonagle)
-+bool tcp_nagle_test(const struct tcp_sock *tp, const struct sk_buff *skb,
-+ unsigned int cur_mss, int nonagle)
- {
- /* Nagle rule does not apply to frames, which sit in the middle of the
- * write_queue (they have no chances to get new data).
-@@ -1553,7 +1600,8 @@ static inline bool tcp_nagle_test(const struct tcp_sock *tp, const struct sk_buf
- return true;
-
- /* Don't use the nagle rule for urgent data (or for the final FIN). */
-- if (tcp_urg_mode(tp) || (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN))
-+ if (tcp_urg_mode(tp) || (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN) ||
-+ mptcp_is_data_fin(skb))
- return true;
-
- if (!tcp_nagle_check(skb->len < cur_mss, tp, nonagle))
-@@ -1563,9 +1611,8 @@ static inline bool tcp_nagle_test(const struct tcp_sock *tp, const struct sk_buf
- }
-
- /* Does at least the first segment of SKB fit into the send window? */
--static bool tcp_snd_wnd_test(const struct tcp_sock *tp,
-- const struct sk_buff *skb,
-- unsigned int cur_mss)
-+bool tcp_snd_wnd_test(const struct tcp_sock *tp, const struct sk_buff *skb,
-+ unsigned int cur_mss)
- {
- u32 end_seq = TCP_SKB_CB(skb)->end_seq;
-
-@@ -1676,7 +1723,7 @@ static bool tcp_tso_should_defer(struct sock *sk, struct sk_buff *skb,
- u32 send_win, cong_win, limit, in_flight;
- int win_divisor;
-
-- if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN)
-+ if ((TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN) || mptcp_is_data_fin(skb))
- goto send_now;
-
- if (icsk->icsk_ca_state != TCP_CA_Open)
-@@ -1888,7 +1935,7 @@ static int tcp_mtu_probe(struct sock *sk)
- * Returns true, if no segments are in flight and we have queued segments,
- * but cannot send anything now because of SWS or another problem.
- */
--static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
-+bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
- int push_one, gfp_t gfp)
- {
- struct tcp_sock *tp = tcp_sk(sk);
-@@ -1900,7 +1947,11 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
-
- sent_pkts = 0;
-
-- if (!push_one) {
-+ /* pmtu not yet supported with MPTCP. Should be possible, by early
-+ * exiting the loop inside tcp_mtu_probe, making sure that only one
-+ * single DSS-mapping gets probed.
-+ */
-+ if (!push_one && !mptcp(tp)) {
- /* Do MTU probing. */
- result = tcp_mtu_probe(sk);
- if (!result) {
-@@ -2099,7 +2150,8 @@ void tcp_send_loss_probe(struct sock *sk)
- int err = -1;
-
- if (tcp_send_head(sk) != NULL) {
-- err = tcp_write_xmit(sk, mss, TCP_NAGLE_OFF, 2, GFP_ATOMIC);
-+ err = tp->ops->write_xmit(sk, mss, TCP_NAGLE_OFF, 2,
-+ GFP_ATOMIC);
- goto rearm_timer;
- }
-
-@@ -2159,8 +2211,8 @@ void __tcp_push_pending_frames(struct sock *sk, unsigned int cur_mss,
- if (unlikely(sk->sk_state == TCP_CLOSE))
- return;
-
-- if (tcp_write_xmit(sk, cur_mss, nonagle, 0,
-- sk_gfp_atomic(sk, GFP_ATOMIC)))
-+ if (tcp_sk(sk)->ops->write_xmit(sk, cur_mss, nonagle, 0,
-+ sk_gfp_atomic(sk, GFP_ATOMIC)))
- tcp_check_probe_timer(sk);
- }
-
-@@ -2173,7 +2225,8 @@ void tcp_push_one(struct sock *sk, unsigned int mss_now)
-
- BUG_ON(!skb || skb->len < mss_now);
-
-- tcp_write_xmit(sk, mss_now, TCP_NAGLE_PUSH, 1, sk->sk_allocation);
-+ tcp_sk(sk)->ops->write_xmit(sk, mss_now, TCP_NAGLE_PUSH, 1,
-+ sk->sk_allocation);
- }
-
- /* This function returns the amount that we can raise the
-@@ -2386,6 +2439,10 @@ static void tcp_retrans_try_collapse(struct sock *sk, struct sk_buff *to,
- if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_SYN)
- return;
-
-+ /* Currently not supported for MPTCP - but it should be possible */
-+ if (mptcp(tp))
-+ return;
-+
- tcp_for_write_queue_from_safe(skb, tmp, sk) {
- if (!tcp_can_collapse(sk, skb))
- break;
-@@ -2843,7 +2900,7 @@ struct sk_buff *tcp_make_synack(struct sock *sk, struct dst_entry *dst,
-
- /* RFC1323: The window in SYN & SYN/ACK segments is never scaled. */
- th->window = htons(min(req->rcv_wnd, 65535U));
-- tcp_options_write((__be32 *)(th + 1), tp, &opts);
-+ tcp_options_write((__be32 *)(th + 1), tp, &opts, skb);
- th->doff = (tcp_header_size >> 2);
- TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_OUTSEGS);
-
-@@ -2897,13 +2954,13 @@ static void tcp_connect_init(struct sock *sk)
- (tp->window_clamp > tcp_full_space(sk) || tp->window_clamp == 0))
- tp->window_clamp = tcp_full_space(sk);
-
-- tcp_select_initial_window(tcp_full_space(sk),
-- tp->advmss - (tp->rx_opt.ts_recent_stamp ? tp->tcp_header_len - sizeof(struct tcphdr) : 0),
-- &tp->rcv_wnd,
-- &tp->window_clamp,
-- sysctl_tcp_window_scaling,
-- &rcv_wscale,
-- dst_metric(dst, RTAX_INITRWND));
-+ tp->ops->select_initial_window(tcp_full_space(sk),
-+ tp->advmss - (tp->rx_opt.ts_recent_stamp ? tp->tcp_header_len - sizeof(struct tcphdr) : 0),
-+ &tp->rcv_wnd,
-+ &tp->window_clamp,
-+ sysctl_tcp_window_scaling,
-+ &rcv_wscale,
-+ dst_metric(dst, RTAX_INITRWND), sk);
-
- tp->rx_opt.rcv_wscale = rcv_wscale;
- tp->rcv_ssthresh = tp->rcv_wnd;
-@@ -2927,6 +2984,36 @@ static void tcp_connect_init(struct sock *sk)
- inet_csk(sk)->icsk_rto = TCP_TIMEOUT_INIT;
- inet_csk(sk)->icsk_retransmits = 0;
- tcp_clear_retrans(tp);
-+
-+#ifdef CONFIG_MPTCP
-+ if (sysctl_mptcp_enabled && mptcp_doit(sk)) {
-+ if (is_master_tp(tp)) {
-+ tp->request_mptcp = 1;
-+ mptcp_connect_init(sk);
-+ } else if (tp->mptcp) {
-+ struct inet_sock *inet = inet_sk(sk);
-+
-+ tp->mptcp->snt_isn = tp->write_seq;
-+ tp->mptcp->init_rcv_wnd = tp->rcv_wnd;
-+
-+ /* Set nonce for new subflows */
-+ if (sk->sk_family == AF_INET)
-+ tp->mptcp->mptcp_loc_nonce = mptcp_v4_get_nonce(
-+ inet->inet_saddr,
-+ inet->inet_daddr,
-+ inet->inet_sport,
-+ inet->inet_dport);
-+#if IS_ENABLED(CONFIG_IPV6)
-+ else
-+ tp->mptcp->mptcp_loc_nonce = mptcp_v6_get_nonce(
-+ inet6_sk(sk)->saddr.s6_addr32,
-+ sk->sk_v6_daddr.s6_addr32,
-+ inet->inet_sport,
-+ inet->inet_dport);
-+#endif
-+ }
-+ }
-+#endif
- }
-
- static void tcp_connect_queue_skb(struct sock *sk, struct sk_buff *skb)
-@@ -3176,6 +3263,7 @@ void tcp_send_ack(struct sock *sk)
- TCP_SKB_CB(buff)->when = tcp_time_stamp;
- tcp_transmit_skb(sk, buff, 0, sk_gfp_atomic(sk, GFP_ATOMIC));
- }
-+EXPORT_SYMBOL(tcp_send_ack);
-
- /* This routine sends a packet with an out of date sequence
- * number. It assumes the other end will try to ack it.
-@@ -3188,7 +3276,7 @@ void tcp_send_ack(struct sock *sk)
- * one is with SEG.SEQ=SND.UNA to deliver urgent pointer, another is
- * out-of-date with SND.UNA-1 to probe window.
- */
--static int tcp_xmit_probe_skb(struct sock *sk, int urgent)
-+int tcp_xmit_probe_skb(struct sock *sk, int urgent)
- {
- struct tcp_sock *tp = tcp_sk(sk);
- struct sk_buff *skb;
-@@ -3270,7 +3358,7 @@ void tcp_send_probe0(struct sock *sk)
- struct tcp_sock *tp = tcp_sk(sk);
- int err;
-
-- err = tcp_write_wakeup(sk);
-+ err = tp->ops->write_wakeup(sk);
-
- if (tp->packets_out || !tcp_send_head(sk)) {
- /* Cancel probe timer, if it is not required. */
-@@ -3301,3 +3389,18 @@ void tcp_send_probe0(struct sock *sk)
- TCP_RTO_MAX);
- }
- }
-+
-+int tcp_rtx_synack(struct sock *sk, struct request_sock *req)
-+{
-+ const struct tcp_request_sock_ops *af_ops = tcp_rsk(req)->af_specific;
-+ struct flowi fl;
-+ int res;
-+
-+ res = af_ops->send_synack(sk, NULL, &fl, req, 0, NULL);
-+ if (!res) {
-+ TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_RETRANSSEGS);
-+ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPSYNRETRANS);
-+ }
-+ return res;
-+}
-+EXPORT_SYMBOL(tcp_rtx_synack);
-diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
-index 286227abed10..966b873cbf3e 100644
---- a/net/ipv4/tcp_timer.c
-+++ b/net/ipv4/tcp_timer.c
-@@ -20,6 +20,7 @@
-
- #include <linux/module.h>
- #include <linux/gfp.h>
-+#include <net/mptcp.h>
- #include <net/tcp.h>
-
- int sysctl_tcp_syn_retries __read_mostly = TCP_SYN_RETRIES;
-@@ -32,7 +33,7 @@ int sysctl_tcp_retries2 __read_mostly = TCP_RETR2;
- int sysctl_tcp_orphan_retries __read_mostly;
- int sysctl_tcp_thin_linear_timeouts __read_mostly;
-
--static void tcp_write_err(struct sock *sk)
-+void tcp_write_err(struct sock *sk)
- {
- sk->sk_err = sk->sk_err_soft ? : ETIMEDOUT;
- sk->sk_error_report(sk);
-@@ -74,7 +75,7 @@ static int tcp_out_of_resources(struct sock *sk, int do_reset)
- (!tp->snd_wnd && !tp->packets_out))
- do_reset = 1;
- if (do_reset)
-- tcp_send_active_reset(sk, GFP_ATOMIC);
-+ tp->ops->send_active_reset(sk, GFP_ATOMIC);
- tcp_done(sk);
- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPABORTONMEMORY);
- return 1;
-@@ -124,10 +125,8 @@ static void tcp_mtu_probing(struct inet_connection_sock *icsk, struct sock *sk)
- * retransmissions with an initial RTO of TCP_RTO_MIN or TCP_TIMEOUT_INIT if
- * syn_set flag is set.
- */
--static bool retransmits_timed_out(struct sock *sk,
-- unsigned int boundary,
-- unsigned int timeout,
-- bool syn_set)
-+bool retransmits_timed_out(struct sock *sk, unsigned int boundary,
-+ unsigned int timeout, bool syn_set)
- {
- unsigned int linear_backoff_thresh, start_ts;
- unsigned int rto_base = syn_set ? TCP_TIMEOUT_INIT : TCP_RTO_MIN;
-@@ -153,7 +152,7 @@ static bool retransmits_timed_out(struct sock *sk,
- }
-
- /* A write timeout has occurred. Process the after effects. */
--static int tcp_write_timeout(struct sock *sk)
-+int tcp_write_timeout(struct sock *sk)
- {
- struct inet_connection_sock *icsk = inet_csk(sk);
- struct tcp_sock *tp = tcp_sk(sk);
-@@ -171,6 +170,10 @@ static int tcp_write_timeout(struct sock *sk)
- }
- retry_until = icsk->icsk_syn_retries ? : sysctl_tcp_syn_retries;
- syn_set = true;
-+ /* Stop retransmitting MP_CAPABLE options in SYN if timed out. */
-+ if (tcp_sk(sk)->request_mptcp &&
-+ icsk->icsk_retransmits >= mptcp_sysctl_syn_retries())
-+ tcp_sk(sk)->request_mptcp = 0;
- } else {
- if (retransmits_timed_out(sk, sysctl_tcp_retries1, 0, 0)) {
- /* Black hole detection */
-@@ -251,18 +254,22 @@ out:
- static void tcp_delack_timer(unsigned long data)
- {
- struct sock *sk = (struct sock *)data;
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ struct sock *meta_sk = mptcp(tp) ? mptcp_meta_sk(sk) : sk;
-
-- bh_lock_sock(sk);
-- if (!sock_owned_by_user(sk)) {
-+ bh_lock_sock(meta_sk);
-+ if (!sock_owned_by_user(meta_sk)) {
- tcp_delack_timer_handler(sk);
- } else {
- inet_csk(sk)->icsk_ack.blocked = 1;
-- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_DELAYEDACKLOCKED);
-+ NET_INC_STATS_BH(sock_net(meta_sk), LINUX_MIB_DELAYEDACKLOCKED);
- /* deleguate our work to tcp_release_cb() */
- if (!test_and_set_bit(TCP_DELACK_TIMER_DEFERRED, &tcp_sk(sk)->tsq_flags))
- sock_hold(sk);
-+ if (mptcp(tp))
-+ mptcp_tsq_flags(sk);
- }
-- bh_unlock_sock(sk);
-+ bh_unlock_sock(meta_sk);
- sock_put(sk);
- }
-
-@@ -479,6 +486,10 @@ out_reset_timer:
- __sk_dst_reset(sk);
-
- out:;
-+ if (mptcp(tp)) {
-+ mptcp_reinject_data(sk, 1);
-+ mptcp_set_rto(sk);
-+ }
- }
-
- void tcp_write_timer_handler(struct sock *sk)
-@@ -505,7 +516,7 @@ void tcp_write_timer_handler(struct sock *sk)
- break;
- case ICSK_TIME_RETRANS:
- icsk->icsk_pending = 0;
-- tcp_retransmit_timer(sk);
-+ tcp_sk(sk)->ops->retransmit_timer(sk);
- break;
- case ICSK_TIME_PROBE0:
- icsk->icsk_pending = 0;
-@@ -520,16 +531,19 @@ out:
- static void tcp_write_timer(unsigned long data)
- {
- struct sock *sk = (struct sock *)data;
-+ struct sock *meta_sk = mptcp(tcp_sk(sk)) ? mptcp_meta_sk(sk) : sk;
-
-- bh_lock_sock(sk);
-- if (!sock_owned_by_user(sk)) {
-+ bh_lock_sock(meta_sk);
-+ if (!sock_owned_by_user(meta_sk)) {
- tcp_write_timer_handler(sk);
- } else {
- /* deleguate our work to tcp_release_cb() */
- if (!test_and_set_bit(TCP_WRITE_TIMER_DEFERRED, &tcp_sk(sk)->tsq_flags))
- sock_hold(sk);
-+ if (mptcp(tcp_sk(sk)))
-+ mptcp_tsq_flags(sk);
- }
-- bh_unlock_sock(sk);
-+ bh_unlock_sock(meta_sk);
- sock_put(sk);
- }
-
-@@ -566,11 +580,12 @@ static void tcp_keepalive_timer (unsigned long data)
- struct sock *sk = (struct sock *) data;
- struct inet_connection_sock *icsk = inet_csk(sk);
- struct tcp_sock *tp = tcp_sk(sk);
-+ struct sock *meta_sk = mptcp(tp) ? mptcp_meta_sk(sk) : sk;
- u32 elapsed;
-
- /* Only process if socket is not in use. */
-- bh_lock_sock(sk);
-- if (sock_owned_by_user(sk)) {
-+ bh_lock_sock(meta_sk);
-+ if (sock_owned_by_user(meta_sk)) {
- /* Try again later. */
- inet_csk_reset_keepalive_timer (sk, HZ/20);
- goto out;
-@@ -581,16 +596,38 @@ static void tcp_keepalive_timer (unsigned long data)
- goto out;
- }
-
-+ if (tp->send_mp_fclose) {
-+ /* MUST do this before tcp_write_timeout, because retrans_stamp
-+ * may have been set to 0 in another part while we are
-+ * retransmitting MP_FASTCLOSE. Then, we would crash, because
-+ * retransmits_timed_out accesses the meta-write-queue.
-+ *
-+ * We make sure that the timestamp is != 0.
-+ */
-+ if (!tp->retrans_stamp)
-+ tp->retrans_stamp = tcp_time_stamp ? : 1;
-+
-+ if (tcp_write_timeout(sk))
-+ goto out;
-+
-+ tcp_send_ack(sk);
-+ icsk->icsk_retransmits++;
-+
-+ icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX);
-+ elapsed = icsk->icsk_rto;
-+ goto resched;
-+ }
-+
- if (sk->sk_state == TCP_FIN_WAIT2 && sock_flag(sk, SOCK_DEAD)) {
- if (tp->linger2 >= 0) {
- const int tmo = tcp_fin_time(sk) - TCP_TIMEWAIT_LEN;
-
- if (tmo > 0) {
-- tcp_time_wait(sk, TCP_FIN_WAIT2, tmo);
-+ tp->ops->time_wait(sk, TCP_FIN_WAIT2, tmo);
- goto out;
- }
- }
-- tcp_send_active_reset(sk, GFP_ATOMIC);
-+ tp->ops->send_active_reset(sk, GFP_ATOMIC);
- goto death;
- }
-
-@@ -614,11 +651,11 @@ static void tcp_keepalive_timer (unsigned long data)
- icsk->icsk_probes_out > 0) ||
- (icsk->icsk_user_timeout == 0 &&
- icsk->icsk_probes_out >= keepalive_probes(tp))) {
-- tcp_send_active_reset(sk, GFP_ATOMIC);
-+ tp->ops->send_active_reset(sk, GFP_ATOMIC);
- tcp_write_err(sk);
- goto out;
- }
-- if (tcp_write_wakeup(sk) <= 0) {
-+ if (tp->ops->write_wakeup(sk) <= 0) {
- icsk->icsk_probes_out++;
- elapsed = keepalive_intvl_when(tp);
- } else {
-@@ -642,7 +679,7 @@ death:
- tcp_done(sk);
-
- out:
-- bh_unlock_sock(sk);
-+ bh_unlock_sock(meta_sk);
- sock_put(sk);
- }
-
-diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
-index 5667b3003af9..7139c2973fd2 100644
---- a/net/ipv6/addrconf.c
-+++ b/net/ipv6/addrconf.c
-@@ -760,6 +760,7 @@ void inet6_ifa_finish_destroy(struct inet6_ifaddr *ifp)
-
- kfree_rcu(ifp, rcu);
- }
-+EXPORT_SYMBOL(inet6_ifa_finish_destroy);
-
- static void
- ipv6_link_dev_addr(struct inet6_dev *idev, struct inet6_ifaddr *ifp)
-diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
-index 7cb4392690dd..7057afbca4df 100644
---- a/net/ipv6/af_inet6.c
-+++ b/net/ipv6/af_inet6.c
-@@ -97,8 +97,7 @@ static __inline__ struct ipv6_pinfo *inet6_sk_generic(struct sock *sk)
- return (struct ipv6_pinfo *)(((u8 *)sk) + offset);
- }
-
--static int inet6_create(struct net *net, struct socket *sock, int protocol,
-- int kern)
-+int inet6_create(struct net *net, struct socket *sock, int protocol, int kern)
- {
- struct inet_sock *inet;
- struct ipv6_pinfo *np;
-diff --git a/net/ipv6/inet6_connection_sock.c b/net/ipv6/inet6_connection_sock.c
-index a245e5ddffbd..99c892b8992d 100644
---- a/net/ipv6/inet6_connection_sock.c
-+++ b/net/ipv6/inet6_connection_sock.c
-@@ -96,8 +96,8 @@ struct dst_entry *inet6_csk_route_req(struct sock *sk,
- /*
- * request_sock (formerly open request) hash tables.
- */
--static u32 inet6_synq_hash(const struct in6_addr *raddr, const __be16 rport,
-- const u32 rnd, const u32 synq_hsize)
-+u32 inet6_synq_hash(const struct in6_addr *raddr, const __be16 rport,
-+ const u32 rnd, const u32 synq_hsize)
- {
- u32 c;
-
-diff --git a/net/ipv6/ipv6_sockglue.c b/net/ipv6/ipv6_sockglue.c
-index edb58aff4ae7..ea4d9fda0927 100644
---- a/net/ipv6/ipv6_sockglue.c
-+++ b/net/ipv6/ipv6_sockglue.c
-@@ -48,6 +48,8 @@
- #include <net/addrconf.h>
- #include <net/inet_common.h>
- #include <net/tcp.h>
-+#include <net/mptcp.h>
-+#include <net/mptcp_v4.h>
- #include <net/udp.h>
- #include <net/udplite.h>
- #include <net/xfrm.h>
-@@ -196,7 +198,12 @@ static int do_ipv6_setsockopt(struct sock *sk, int level, int optname,
- sock_prot_inuse_add(net, &tcp_prot, 1);
- local_bh_enable();
- sk->sk_prot = &tcp_prot;
-- icsk->icsk_af_ops = &ipv4_specific;
-+#ifdef CONFIG_MPTCP
-+ if (is_mptcp_enabled(sk))
-+ icsk->icsk_af_ops = &mptcp_v4_specific;
-+ else
-+#endif
-+ icsk->icsk_af_ops = &ipv4_specific;
- sk->sk_socket->ops = &inet_stream_ops;
- sk->sk_family = PF_INET;
- tcp_sync_mss(sk, icsk->icsk_pmtu_cookie);
-diff --git a/net/ipv6/syncookies.c b/net/ipv6/syncookies.c
-index a822b880689b..b2b38869d795 100644
---- a/net/ipv6/syncookies.c
-+++ b/net/ipv6/syncookies.c
-@@ -181,13 +181,13 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
-
- /* check for timestamp cookie support */
- memset(&tcp_opt, 0, sizeof(tcp_opt));
-- tcp_parse_options(skb, &tcp_opt, 0, NULL);
-+ tcp_parse_options(skb, &tcp_opt, NULL, 0, NULL);
-
- if (!cookie_check_timestamp(&tcp_opt, sock_net(sk), &ecn_ok))
- goto out;
-
- ret = NULL;
-- req = inet6_reqsk_alloc(&tcp6_request_sock_ops);
-+ req = inet_reqsk_alloc(&tcp6_request_sock_ops);
- if (!req)
- goto out;
-
-@@ -255,10 +255,10 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
- }
-
- req->window_clamp = tp->window_clamp ? :dst_metric(dst, RTAX_WINDOW);
-- tcp_select_initial_window(tcp_full_space(sk), req->mss,
-- &req->rcv_wnd, &req->window_clamp,
-- ireq->wscale_ok, &rcv_wscale,
-- dst_metric(dst, RTAX_INITRWND));
-+ tp->ops->select_initial_window(tcp_full_space(sk), req->mss,
-+ &req->rcv_wnd, &req->window_clamp,
-+ ireq->wscale_ok, &rcv_wscale,
-+ dst_metric(dst, RTAX_INITRWND), sk);
-
- ireq->rcv_wscale = rcv_wscale;
-
-diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
-index 229239ad96b1..fda94d71666e 100644
---- a/net/ipv6/tcp_ipv6.c
-+++ b/net/ipv6/tcp_ipv6.c
-@@ -63,6 +63,8 @@
- #include <net/inet_common.h>
- #include <net/secure_seq.h>
- #include <net/tcp_memcontrol.h>
-+#include <net/mptcp.h>
-+#include <net/mptcp_v6.h>
- #include <net/busy_poll.h>
-
- #include <linux/proc_fs.h>
-@@ -71,12 +73,6 @@
- #include <linux/crypto.h>
- #include <linux/scatterlist.h>
-
--static void tcp_v6_send_reset(struct sock *sk, struct sk_buff *skb);
--static void tcp_v6_reqsk_send_ack(struct sock *sk, struct sk_buff *skb,
-- struct request_sock *req);
--
--static int tcp_v6_do_rcv(struct sock *sk, struct sk_buff *skb);
--
- static const struct inet_connection_sock_af_ops ipv6_mapped;
- static const struct inet_connection_sock_af_ops ipv6_specific;
- #ifdef CONFIG_TCP_MD5SIG
-@@ -90,7 +86,7 @@ static struct tcp_md5sig_key *tcp_v6_md5_do_lookup(struct sock *sk,
- }
- #endif
-
--static void inet6_sk_rx_dst_set(struct sock *sk, const struct sk_buff *skb)
-+void inet6_sk_rx_dst_set(struct sock *sk, const struct sk_buff *skb)
- {
- struct dst_entry *dst = skb_dst(skb);
- const struct rt6_info *rt = (const struct rt6_info *)dst;
-@@ -102,10 +98,11 @@ static void inet6_sk_rx_dst_set(struct sock *sk, const struct sk_buff *skb)
- inet6_sk(sk)->rx_dst_cookie = rt->rt6i_node->fn_sernum;
- }
-
--static void tcp_v6_hash(struct sock *sk)
-+void tcp_v6_hash(struct sock *sk)
- {
- if (sk->sk_state != TCP_CLOSE) {
-- if (inet_csk(sk)->icsk_af_ops == &ipv6_mapped) {
-+ if (inet_csk(sk)->icsk_af_ops == &ipv6_mapped ||
-+ inet_csk(sk)->icsk_af_ops == &mptcp_v6_mapped) {
- tcp_prot.hash(sk);
- return;
- }
-@@ -115,7 +112,7 @@ static void tcp_v6_hash(struct sock *sk)
- }
- }
-
--static __u32 tcp_v6_init_sequence(const struct sk_buff *skb)
-+__u32 tcp_v6_init_sequence(const struct sk_buff *skb)
- {
- return secure_tcpv6_sequence_number(ipv6_hdr(skb)->daddr.s6_addr32,
- ipv6_hdr(skb)->saddr.s6_addr32,
-@@ -123,7 +120,7 @@ static __u32 tcp_v6_init_sequence(const struct sk_buff *skb)
- tcp_hdr(skb)->source);
- }
-
--static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
-+int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
- int addr_len)
- {
- struct sockaddr_in6 *usin = (struct sockaddr_in6 *) uaddr;
-@@ -215,7 +212,12 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
- sin.sin_port = usin->sin6_port;
- sin.sin_addr.s_addr = usin->sin6_addr.s6_addr32[3];
-
-- icsk->icsk_af_ops = &ipv6_mapped;
-+#ifdef CONFIG_MPTCP
-+ if (is_mptcp_enabled(sk))
-+ icsk->icsk_af_ops = &mptcp_v6_mapped;
-+ else
-+#endif
-+ icsk->icsk_af_ops = &ipv6_mapped;
- sk->sk_backlog_rcv = tcp_v4_do_rcv;
- #ifdef CONFIG_TCP_MD5SIG
- tp->af_specific = &tcp_sock_ipv6_mapped_specific;
-@@ -225,7 +227,12 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
-
- if (err) {
- icsk->icsk_ext_hdr_len = exthdrlen;
-- icsk->icsk_af_ops = &ipv6_specific;
-+#ifdef CONFIG_MPTCP
-+ if (is_mptcp_enabled(sk))
-+ icsk->icsk_af_ops = &mptcp_v6_specific;
-+ else
-+#endif
-+ icsk->icsk_af_ops = &ipv6_specific;
- sk->sk_backlog_rcv = tcp_v6_do_rcv;
- #ifdef CONFIG_TCP_MD5SIG
- tp->af_specific = &tcp_sock_ipv6_specific;
-@@ -337,7 +344,7 @@ static void tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
- const struct ipv6hdr *hdr = (const struct ipv6hdr *)skb->data;
- const struct tcphdr *th = (struct tcphdr *)(skb->data+offset);
- struct ipv6_pinfo *np;
-- struct sock *sk;
-+ struct sock *sk, *meta_sk;
- int err;
- struct tcp_sock *tp;
- struct request_sock *fastopen;
-@@ -358,8 +365,14 @@ static void tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
- return;
- }
-
-- bh_lock_sock(sk);
-- if (sock_owned_by_user(sk) && type != ICMPV6_PKT_TOOBIG)
-+ tp = tcp_sk(sk);
-+ if (mptcp(tp))
-+ meta_sk = mptcp_meta_sk(sk);
-+ else
-+ meta_sk = sk;
-+
-+ bh_lock_sock(meta_sk);
-+ if (sock_owned_by_user(meta_sk) && type != ICMPV6_PKT_TOOBIG)
- NET_INC_STATS_BH(net, LINUX_MIB_LOCKDROPPEDICMPS);
-
- if (sk->sk_state == TCP_CLOSE)
-@@ -370,7 +383,6 @@ static void tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
- goto out;
- }
-
-- tp = tcp_sk(sk);
- seq = ntohl(th->seq);
- /* XXX (TFO) - tp->snd_una should be ISN (tcp_create_openreq_child() */
- fastopen = tp->fastopen_rsk;
-@@ -403,11 +415,15 @@ static void tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
- goto out;
-
- tp->mtu_info = ntohl(info);
-- if (!sock_owned_by_user(sk))
-+ if (!sock_owned_by_user(meta_sk))
- tcp_v6_mtu_reduced(sk);
-- else if (!test_and_set_bit(TCP_MTU_REDUCED_DEFERRED,
-+ else {
-+ if (!test_and_set_bit(TCP_MTU_REDUCED_DEFERRED,
- &tp->tsq_flags))
-- sock_hold(sk);
-+ sock_hold(sk);
-+ if (mptcp(tp))
-+ mptcp_tsq_flags(sk);
-+ }
- goto out;
- }
-
-@@ -417,7 +433,7 @@ static void tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
- switch (sk->sk_state) {
- struct request_sock *req, **prev;
- case TCP_LISTEN:
-- if (sock_owned_by_user(sk))
-+ if (sock_owned_by_user(meta_sk))
- goto out;
-
- req = inet6_csk_search_req(sk, &prev, th->dest, &hdr->daddr,
-@@ -447,7 +463,7 @@ static void tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
- if (fastopen && fastopen->sk == NULL)
- break;
-
-- if (!sock_owned_by_user(sk)) {
-+ if (!sock_owned_by_user(meta_sk)) {
- sk->sk_err = err;
- sk->sk_error_report(sk); /* Wake people up to see the error (see connect in sock.c) */
-
-@@ -457,26 +473,27 @@ static void tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
- goto out;
- }
-
-- if (!sock_owned_by_user(sk) && np->recverr) {
-+ if (!sock_owned_by_user(meta_sk) && np->recverr) {
- sk->sk_err = err;
- sk->sk_error_report(sk);
- } else
- sk->sk_err_soft = err;
-
- out:
-- bh_unlock_sock(sk);
-+ bh_unlock_sock(meta_sk);
- sock_put(sk);
- }
-
-
--static int tcp_v6_send_synack(struct sock *sk, struct dst_entry *dst,
-- struct flowi6 *fl6,
-- struct request_sock *req,
-- u16 queue_mapping,
-- struct tcp_fastopen_cookie *foc)
-+int tcp_v6_send_synack(struct sock *sk, struct dst_entry *dst,
-+ struct flowi *fl,
-+ struct request_sock *req,
-+ u16 queue_mapping,
-+ struct tcp_fastopen_cookie *foc)
- {
- struct inet_request_sock *ireq = inet_rsk(req);
- struct ipv6_pinfo *np = inet6_sk(sk);
-+ struct flowi6 *fl6 = &fl->u.ip6;
- struct sk_buff *skb;
- int err = -ENOMEM;
-
-@@ -497,18 +514,21 @@ static int tcp_v6_send_synack(struct sock *sk, struct dst_entry *dst,
- skb_set_queue_mapping(skb, queue_mapping);
- err = ip6_xmit(sk, skb, fl6, np->opt, np->tclass);
- err = net_xmit_eval(err);
-+ if (!tcp_rsk(req)->snt_synack && !err)
-+ tcp_rsk(req)->snt_synack = tcp_time_stamp;
- }
-
- done:
- return err;
- }
-
--static int tcp_v6_rtx_synack(struct sock *sk, struct request_sock *req)
-+int tcp_v6_rtx_synack(struct sock *sk, struct request_sock *req)
- {
-- struct flowi6 fl6;
-+ const struct tcp_request_sock_ops *af_ops = tcp_rsk(req)->af_specific;
-+ struct flowi fl;
- int res;
-
-- res = tcp_v6_send_synack(sk, NULL, &fl6, req, 0, NULL);
-+ res = af_ops->send_synack(sk, NULL, &fl, req, 0, NULL);
- if (!res) {
- TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_RETRANSSEGS);
- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPSYNRETRANS);
-@@ -516,7 +536,7 @@ static int tcp_v6_rtx_synack(struct sock *sk, struct request_sock *req)
- return res;
- }
-
--static void tcp_v6_reqsk_destructor(struct request_sock *req)
-+void tcp_v6_reqsk_destructor(struct request_sock *req)
- {
- kfree_skb(inet_rsk(req)->pktopts);
- }
-@@ -718,27 +738,74 @@ static int tcp_v6_inbound_md5_hash(struct sock *sk, const struct sk_buff *skb)
- }
- #endif
-
-+static int tcp_v6_init_req(struct request_sock *req, struct sock *sk,
-+ struct sk_buff *skb)
-+{
-+ struct inet_request_sock *ireq = inet_rsk(req);
-+ struct ipv6_pinfo *np = inet6_sk(sk);
-+
-+ ireq->ir_v6_rmt_addr = ipv6_hdr(skb)->saddr;
-+ ireq->ir_v6_loc_addr = ipv6_hdr(skb)->daddr;
-+
-+ ireq->ir_iif = sk->sk_bound_dev_if;
-+ ireq->ir_mark = inet_request_mark(sk, skb);
-+
-+ /* So that link locals have meaning */
-+ if (!sk->sk_bound_dev_if &&
-+ ipv6_addr_type(&ireq->ir_v6_rmt_addr) & IPV6_ADDR_LINKLOCAL)
-+ ireq->ir_iif = inet6_iif(skb);
-+
-+ if (!TCP_SKB_CB(skb)->when &&
-+ (ipv6_opt_accepted(sk, skb) || np->rxopt.bits.rxinfo ||
-+ np->rxopt.bits.rxoinfo || np->rxopt.bits.rxhlim ||
-+ np->rxopt.bits.rxohlim || np->repflow)) {
-+ atomic_inc(&skb->users);
-+ ireq->pktopts = skb;
-+ }
-+
-+ return 0;
-+}
-+
-+static struct dst_entry *tcp_v6_route_req(struct sock *sk, struct flowi *fl,
-+ const struct request_sock *req,
-+ bool *strict)
-+{
-+ if (strict)
-+ *strict = true;
-+ return inet6_csk_route_req(sk, &fl->u.ip6, req);
-+}
-+
- struct request_sock_ops tcp6_request_sock_ops __read_mostly = {
- .family = AF_INET6,
- .obj_size = sizeof(struct tcp6_request_sock),
-- .rtx_syn_ack = tcp_v6_rtx_synack,
-+ .rtx_syn_ack = tcp_rtx_synack,
- .send_ack = tcp_v6_reqsk_send_ack,
- .destructor = tcp_v6_reqsk_destructor,
- .send_reset = tcp_v6_send_reset,
- .syn_ack_timeout = tcp_syn_ack_timeout,
- };
-
-+const struct tcp_request_sock_ops tcp_request_sock_ipv6_ops = {
-+ .mss_clamp = IPV6_MIN_MTU - sizeof(struct tcphdr) -
-+ sizeof(struct ipv6hdr),
- #ifdef CONFIG_TCP_MD5SIG
--static const struct tcp_request_sock_ops tcp_request_sock_ipv6_ops = {
- .md5_lookup = tcp_v6_reqsk_md5_lookup,
- .calc_md5_hash = tcp_v6_md5_hash_skb,
--};
- #endif
-+ .init_req = tcp_v6_init_req,
-+#ifdef CONFIG_SYN_COOKIES
-+ .cookie_init_seq = cookie_v6_init_sequence,
-+#endif
-+ .route_req = tcp_v6_route_req,
-+ .init_seq = tcp_v6_init_sequence,
-+ .send_synack = tcp_v6_send_synack,
-+ .queue_hash_add = inet6_csk_reqsk_queue_hash_add,
-+};
-
--static void tcp_v6_send_response(struct sk_buff *skb, u32 seq, u32 ack, u32 win,
-- u32 tsval, u32 tsecr, int oif,
-- struct tcp_md5sig_key *key, int rst, u8 tclass,
-- u32 label)
-+static void tcp_v6_send_response(struct sk_buff *skb, u32 seq, u32 ack,
-+ u32 data_ack, u32 win, u32 tsval, u32 tsecr,
-+ int oif, struct tcp_md5sig_key *key, int rst,
-+ u8 tclass, u32 label, int mptcp)
- {
- const struct tcphdr *th = tcp_hdr(skb);
- struct tcphdr *t1;
-@@ -756,7 +823,10 @@ static void tcp_v6_send_response(struct sk_buff *skb, u32 seq, u32 ack, u32 win,
- if (key)
- tot_len += TCPOLEN_MD5SIG_ALIGNED;
- #endif
--
-+#ifdef CONFIG_MPTCP
-+ if (mptcp)
-+ tot_len += MPTCP_SUB_LEN_DSS + MPTCP_SUB_LEN_ACK;
-+#endif
- buff = alloc_skb(MAX_HEADER + sizeof(struct ipv6hdr) + tot_len,
- GFP_ATOMIC);
- if (buff == NULL)
-@@ -794,6 +864,17 @@ static void tcp_v6_send_response(struct sk_buff *skb, u32 seq, u32 ack, u32 win,
- tcp_v6_md5_hash_hdr((__u8 *)topt, key,
- &ipv6_hdr(skb)->saddr,
- &ipv6_hdr(skb)->daddr, t1);
-+ topt += 4;
-+ }
-+#endif
-+#ifdef CONFIG_MPTCP
-+ if (mptcp) {
-+ /* Construction of 32-bit data_ack */
-+ *topt++ = htonl((TCPOPT_MPTCP << 24) |
-+ ((MPTCP_SUB_LEN_DSS + MPTCP_SUB_LEN_ACK) << 16) |
-+ (0x20 << 8) |
-+ (0x01));
-+ *topt++ = htonl(data_ack);
- }
- #endif
-
-@@ -834,7 +915,7 @@ static void tcp_v6_send_response(struct sk_buff *skb, u32 seq, u32 ack, u32 win,
- kfree_skb(buff);
- }
-
--static void tcp_v6_send_reset(struct sock *sk, struct sk_buff *skb)
-+void tcp_v6_send_reset(struct sock *sk, struct sk_buff *skb)
- {
- const struct tcphdr *th = tcp_hdr(skb);
- u32 seq = 0, ack_seq = 0;
-@@ -891,7 +972,7 @@ static void tcp_v6_send_reset(struct sock *sk, struct sk_buff *skb)
- (th->doff << 2);
-
- oif = sk ? sk->sk_bound_dev_if : 0;
-- tcp_v6_send_response(skb, seq, ack_seq, 0, 0, 0, oif, key, 1, 0, 0);
-+ tcp_v6_send_response(skb, seq, ack_seq, 0, 0, 0, 0, oif, key, 1, 0, 0, 0);
-
- #ifdef CONFIG_TCP_MD5SIG
- release_sk1:
-@@ -902,45 +983,52 @@ release_sk1:
- #endif
- }
-
--static void tcp_v6_send_ack(struct sk_buff *skb, u32 seq, u32 ack,
-+static void tcp_v6_send_ack(struct sk_buff *skb, u32 seq, u32 ack, u32 data_ack,
- u32 win, u32 tsval, u32 tsecr, int oif,
- struct tcp_md5sig_key *key, u8 tclass,
-- u32 label)
-+ u32 label, int mptcp)
- {
-- tcp_v6_send_response(skb, seq, ack, win, tsval, tsecr, oif, key, 0, tclass,
-- label);
-+ tcp_v6_send_response(skb, seq, ack, data_ack, win, tsval, tsecr, oif,
-+ key, 0, tclass, label, mptcp);
- }
-
- static void tcp_v6_timewait_ack(struct sock *sk, struct sk_buff *skb)
- {
- struct inet_timewait_sock *tw = inet_twsk(sk);
- struct tcp_timewait_sock *tcptw = tcp_twsk(sk);
-+ u32 data_ack = 0;
-+ int mptcp = 0;
-
-+ if (tcptw->mptcp_tw && tcptw->mptcp_tw->meta_tw) {
-+ data_ack = (u32)tcptw->mptcp_tw->rcv_nxt;
-+ mptcp = 1;
-+ }
- tcp_v6_send_ack(skb, tcptw->tw_snd_nxt, tcptw->tw_rcv_nxt,
-+ data_ack,
- tcptw->tw_rcv_wnd >> tw->tw_rcv_wscale,
- tcp_time_stamp + tcptw->tw_ts_offset,
- tcptw->tw_ts_recent, tw->tw_bound_dev_if, tcp_twsk_md5_key(tcptw),
-- tw->tw_tclass, (tw->tw_flowlabel << 12));
-+ tw->tw_tclass, (tw->tw_flowlabel << 12), mptcp);
-
- inet_twsk_put(tw);
- }
-
--static void tcp_v6_reqsk_send_ack(struct sock *sk, struct sk_buff *skb,
-- struct request_sock *req)
-+void tcp_v6_reqsk_send_ack(struct sock *sk, struct sk_buff *skb,
-+ struct request_sock *req)
- {
- /* sk->sk_state == TCP_LISTEN -> for regular TCP_SYN_RECV
- * sk->sk_state == TCP_SYN_RECV -> for Fast Open.
- */
- tcp_v6_send_ack(skb, (sk->sk_state == TCP_LISTEN) ?
- tcp_rsk(req)->snt_isn + 1 : tcp_sk(sk)->snd_nxt,
-- tcp_rsk(req)->rcv_nxt,
-+ tcp_rsk(req)->rcv_nxt, 0,
- req->rcv_wnd, tcp_time_stamp, req->ts_recent, sk->sk_bound_dev_if,
- tcp_v6_md5_do_lookup(sk, &ipv6_hdr(skb)->daddr),
-- 0, 0);
-+ 0, 0, 0);
- }
-
-
--static struct sock *tcp_v6_hnd_req(struct sock *sk, struct sk_buff *skb)
-+struct sock *tcp_v6_hnd_req(struct sock *sk, struct sk_buff *skb)
- {
- struct request_sock *req, **prev;
- const struct tcphdr *th = tcp_hdr(skb);
-@@ -959,7 +1047,13 @@ static struct sock *tcp_v6_hnd_req(struct sock *sk, struct sk_buff *skb)
-
- if (nsk) {
- if (nsk->sk_state != TCP_TIME_WAIT) {
-+ /* Don't lock again the meta-sk. It has been locked
-+ * before mptcp_v6_do_rcv.
-+ */
-+ if (mptcp(tcp_sk(nsk)) && !is_meta_sk(sk))
-+ bh_lock_sock(mptcp_meta_sk(nsk));
- bh_lock_sock(nsk);
-+
- return nsk;
- }
- inet_twsk_put(inet_twsk(nsk));
-@@ -973,161 +1067,25 @@ static struct sock *tcp_v6_hnd_req(struct sock *sk, struct sk_buff *skb)
- return sk;
- }
-
--/* FIXME: this is substantially similar to the ipv4 code.
-- * Can some kind of merge be done? -- erics
-- */
--static int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
-+int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
- {
-- struct tcp_options_received tmp_opt;
-- struct request_sock *req;
-- struct inet_request_sock *ireq;
-- struct ipv6_pinfo *np = inet6_sk(sk);
-- struct tcp_sock *tp = tcp_sk(sk);
-- __u32 isn = TCP_SKB_CB(skb)->when;
-- struct dst_entry *dst = NULL;
-- struct tcp_fastopen_cookie foc = { .len = -1 };
-- bool want_cookie = false, fastopen;
-- struct flowi6 fl6;
-- int err;
--
- if (skb->protocol == htons(ETH_P_IP))
- return tcp_v4_conn_request(sk, skb);
-
- if (!ipv6_unicast_destination(skb))
- goto drop;
-
-- if ((sysctl_tcp_syncookies == 2 ||
-- inet_csk_reqsk_queue_is_full(sk)) && !isn) {
-- want_cookie = tcp_syn_flood_action(sk, skb, "TCPv6");
-- if (!want_cookie)
-- goto drop;
-- }
--
-- if (sk_acceptq_is_full(sk) && inet_csk_reqsk_queue_young(sk) > 1) {
-- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENOVERFLOWS);
-- goto drop;
-- }
--
-- req = inet6_reqsk_alloc(&tcp6_request_sock_ops);
-- if (req == NULL)
-- goto drop;
--
--#ifdef CONFIG_TCP_MD5SIG
-- tcp_rsk(req)->af_specific = &tcp_request_sock_ipv6_ops;
--#endif
--
-- tcp_clear_options(&tmp_opt);
-- tmp_opt.mss_clamp = IPV6_MIN_MTU - sizeof(struct tcphdr) - sizeof(struct ipv6hdr);
-- tmp_opt.user_mss = tp->rx_opt.user_mss;
-- tcp_parse_options(skb, &tmp_opt, 0, want_cookie ? NULL : &foc);
--
-- if (want_cookie && !tmp_opt.saw_tstamp)
-- tcp_clear_options(&tmp_opt);
-+ return tcp_conn_request(&tcp6_request_sock_ops,
-+ &tcp_request_sock_ipv6_ops, sk, skb);
-
-- tmp_opt.tstamp_ok = tmp_opt.saw_tstamp;
-- tcp_openreq_init(req, &tmp_opt, skb);
--
-- ireq = inet_rsk(req);
-- ireq->ir_v6_rmt_addr = ipv6_hdr(skb)->saddr;
-- ireq->ir_v6_loc_addr = ipv6_hdr(skb)->daddr;
-- if (!want_cookie || tmp_opt.tstamp_ok)
-- TCP_ECN_create_request(req, skb, sock_net(sk));
--
-- ireq->ir_iif = sk->sk_bound_dev_if;
-- ireq->ir_mark = inet_request_mark(sk, skb);
--
-- /* So that link locals have meaning */
-- if (!sk->sk_bound_dev_if &&
-- ipv6_addr_type(&ireq->ir_v6_rmt_addr) & IPV6_ADDR_LINKLOCAL)
-- ireq->ir_iif = inet6_iif(skb);
--
-- if (!isn) {
-- if (ipv6_opt_accepted(sk, skb) ||
-- np->rxopt.bits.rxinfo || np->rxopt.bits.rxoinfo ||
-- np->rxopt.bits.rxhlim || np->rxopt.bits.rxohlim ||
-- np->repflow) {
-- atomic_inc(&skb->users);
-- ireq->pktopts = skb;
-- }
--
-- if (want_cookie) {
-- isn = cookie_v6_init_sequence(sk, skb, &req->mss);
-- req->cookie_ts = tmp_opt.tstamp_ok;
-- goto have_isn;
-- }
--
-- /* VJ's idea. We save last timestamp seen
-- * from the destination in peer table, when entering
-- * state TIME-WAIT, and check against it before
-- * accepting new connection request.
-- *
-- * If "isn" is not zero, this request hit alive
-- * timewait bucket, so that all the necessary checks
-- * are made in the function processing timewait state.
-- */
-- if (tmp_opt.saw_tstamp &&
-- tcp_death_row.sysctl_tw_recycle &&
-- (dst = inet6_csk_route_req(sk, &fl6, req)) != NULL) {
-- if (!tcp_peer_is_proven(req, dst, true)) {
-- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_PAWSPASSIVEREJECTED);
-- goto drop_and_release;
-- }
-- }
-- /* Kill the following clause, if you dislike this way. */
-- else if (!sysctl_tcp_syncookies &&
-- (sysctl_max_syn_backlog - inet_csk_reqsk_queue_len(sk) <
-- (sysctl_max_syn_backlog >> 2)) &&
-- !tcp_peer_is_proven(req, dst, false)) {
-- /* Without syncookies last quarter of
-- * backlog is filled with destinations,
-- * proven to be alive.
-- * It means that we continue to communicate
-- * to destinations, already remembered
-- * to the moment of synflood.
-- */
-- LIMIT_NETDEBUG(KERN_DEBUG "TCP: drop open request from %pI6/%u\n",
-- &ireq->ir_v6_rmt_addr, ntohs(tcp_hdr(skb)->source));
-- goto drop_and_release;
-- }
--
-- isn = tcp_v6_init_sequence(skb);
-- }
--have_isn:
--
-- if (security_inet_conn_request(sk, skb, req))
-- goto drop_and_release;
--
-- if (!dst && (dst = inet6_csk_route_req(sk, &fl6, req)) == NULL)
-- goto drop_and_free;
--
-- tcp_rsk(req)->snt_isn = isn;
-- tcp_rsk(req)->snt_synack = tcp_time_stamp;
-- tcp_openreq_init_rwin(req, sk, dst);
-- fastopen = !want_cookie &&
-- tcp_try_fastopen(sk, skb, req, &foc, dst);
-- err = tcp_v6_send_synack(sk, dst, &fl6, req,
-- skb_get_queue_mapping(skb), &foc);
-- if (!fastopen) {
-- if (err || want_cookie)
-- goto drop_and_free;
--
-- tcp_rsk(req)->listener = NULL;
-- inet6_csk_reqsk_queue_hash_add(sk, req, TCP_TIMEOUT_INIT);
-- }
-- return 0;
--
--drop_and_release:
-- dst_release(dst);
--drop_and_free:
-- reqsk_free(req);
- drop:
- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENDROPS);
- return 0; /* don't send reset */
- }
-
--static struct sock *tcp_v6_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
-- struct request_sock *req,
-- struct dst_entry *dst)
-+struct sock *tcp_v6_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
-+ struct request_sock *req,
-+ struct dst_entry *dst)
- {
- struct inet_request_sock *ireq;
- struct ipv6_pinfo *newnp, *np = inet6_sk(sk);
-@@ -1165,7 +1123,12 @@ static struct sock *tcp_v6_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
-
- newsk->sk_v6_rcv_saddr = newnp->saddr;
-
-- inet_csk(newsk)->icsk_af_ops = &ipv6_mapped;
-+#ifdef CONFIG_MPTCP
-+ if (is_mptcp_enabled(newsk))
-+ inet_csk(newsk)->icsk_af_ops = &mptcp_v6_mapped;
-+ else
-+#endif
-+ inet_csk(newsk)->icsk_af_ops = &ipv6_mapped;
- newsk->sk_backlog_rcv = tcp_v4_do_rcv;
- #ifdef CONFIG_TCP_MD5SIG
- newtp->af_specific = &tcp_sock_ipv6_mapped_specific;
-@@ -1329,7 +1292,7 @@ out:
- * This is because we cannot sleep with the original spinlock
- * held.
- */
--static int tcp_v6_do_rcv(struct sock *sk, struct sk_buff *skb)
-+int tcp_v6_do_rcv(struct sock *sk, struct sk_buff *skb)
- {
- struct ipv6_pinfo *np = inet6_sk(sk);
- struct tcp_sock *tp;
-@@ -1351,6 +1314,9 @@ static int tcp_v6_do_rcv(struct sock *sk, struct sk_buff *skb)
- goto discard;
- #endif
-
-+ if (is_meta_sk(sk))
-+ return mptcp_v6_do_rcv(sk, skb);
-+
- if (sk_filter(sk, skb))
- goto discard;
-
-@@ -1472,7 +1438,7 @@ static int tcp_v6_rcv(struct sk_buff *skb)
- {
- const struct tcphdr *th;
- const struct ipv6hdr *hdr;
-- struct sock *sk;
-+ struct sock *sk, *meta_sk = NULL;
- int ret;
- struct net *net = dev_net(skb->dev);
-
-@@ -1503,18 +1469,43 @@ static int tcp_v6_rcv(struct sk_buff *skb)
- TCP_SKB_CB(skb)->end_seq = (TCP_SKB_CB(skb)->seq + th->syn + th->fin +
- skb->len - th->doff*4);
- TCP_SKB_CB(skb)->ack_seq = ntohl(th->ack_seq);
-+#ifdef CONFIG_MPTCP
-+ TCP_SKB_CB(skb)->mptcp_flags = 0;
-+ TCP_SKB_CB(skb)->dss_off = 0;
-+#endif
- TCP_SKB_CB(skb)->when = 0;
- TCP_SKB_CB(skb)->ip_dsfield = ipv6_get_dsfield(hdr);
- TCP_SKB_CB(skb)->sacked = 0;
-
- sk = __inet6_lookup_skb(&tcp_hashinfo, skb, th->source, th->dest);
-- if (!sk)
-- goto no_tcp_socket;
-
- process:
-- if (sk->sk_state == TCP_TIME_WAIT)
-+ if (sk && sk->sk_state == TCP_TIME_WAIT)
- goto do_time_wait;
-
-+#ifdef CONFIG_MPTCP
-+ if (!sk && th->syn && !th->ack) {
-+ int ret = mptcp_lookup_join(skb, NULL);
-+
-+ if (ret < 0) {
-+ tcp_v6_send_reset(NULL, skb);
-+ goto discard_it;
-+ } else if (ret > 0) {
-+ return 0;
-+ }
-+ }
-+
-+ /* Is there a pending request sock for this segment ? */
-+ if ((!sk || sk->sk_state == TCP_LISTEN) && mptcp_check_req(skb, net)) {
-+ if (sk)
-+ sock_put(sk);
-+ return 0;
-+ }
-+#endif
-+
-+ if (!sk)
-+ goto no_tcp_socket;
-+
- if (hdr->hop_limit < inet6_sk(sk)->min_hopcount) {
- NET_INC_STATS_BH(net, LINUX_MIB_TCPMINTTLDROP);
- goto discard_and_relse;
-@@ -1529,11 +1520,21 @@ process:
- sk_mark_napi_id(sk, skb);
- skb->dev = NULL;
-
-- bh_lock_sock_nested(sk);
-+ if (mptcp(tcp_sk(sk))) {
-+ meta_sk = mptcp_meta_sk(sk);
-+
-+ bh_lock_sock_nested(meta_sk);
-+ if (sock_owned_by_user(meta_sk))
-+ skb->sk = sk;
-+ } else {
-+ meta_sk = sk;
-+ bh_lock_sock_nested(sk);
-+ }
-+
- ret = 0;
-- if (!sock_owned_by_user(sk)) {
-+ if (!sock_owned_by_user(meta_sk)) {
- #ifdef CONFIG_NET_DMA
-- struct tcp_sock *tp = tcp_sk(sk);
-+ struct tcp_sock *tp = tcp_sk(meta_sk);
- if (!tp->ucopy.dma_chan && tp->ucopy.pinned_list)
- tp->ucopy.dma_chan = net_dma_find_channel();
- if (tp->ucopy.dma_chan)
-@@ -1541,16 +1542,17 @@ process:
- else
- #endif
- {
-- if (!tcp_prequeue(sk, skb))
-+ if (!tcp_prequeue(meta_sk, skb))
- ret = tcp_v6_do_rcv(sk, skb);
- }
-- } else if (unlikely(sk_add_backlog(sk, skb,
-- sk->sk_rcvbuf + sk->sk_sndbuf))) {
-- bh_unlock_sock(sk);
-+ } else if (unlikely(sk_add_backlog(meta_sk, skb,
-+ meta_sk->sk_rcvbuf + meta_sk->sk_sndbuf))) {
-+ bh_unlock_sock(meta_sk);
- NET_INC_STATS_BH(net, LINUX_MIB_TCPBACKLOGDROP);
- goto discard_and_relse;
- }
-- bh_unlock_sock(sk);
-+
-+ bh_unlock_sock(meta_sk);
-
- sock_put(sk);
- return ret ? -1 : 0;
-@@ -1607,6 +1609,18 @@ do_time_wait:
- sk = sk2;
- goto process;
- }
-+#ifdef CONFIG_MPTCP
-+ if (th->syn && !th->ack) {
-+ int ret = mptcp_lookup_join(skb, inet_twsk(sk));
-+
-+ if (ret < 0) {
-+ tcp_v6_send_reset(NULL, skb);
-+ goto discard_it;
-+ } else if (ret > 0) {
-+ return 0;
-+ }
-+ }
-+#endif
- /* Fall through to ACK */
- }
- case TCP_TW_ACK:
-@@ -1657,7 +1671,7 @@ static void tcp_v6_early_demux(struct sk_buff *skb)
- }
- }
-
--static struct timewait_sock_ops tcp6_timewait_sock_ops = {
-+struct timewait_sock_ops tcp6_timewait_sock_ops = {
- .twsk_obj_size = sizeof(struct tcp6_timewait_sock),
- .twsk_unique = tcp_twsk_unique,
- .twsk_destructor = tcp_twsk_destructor,
-@@ -1730,7 +1744,12 @@ static int tcp_v6_init_sock(struct sock *sk)
-
- tcp_init_sock(sk);
-
-- icsk->icsk_af_ops = &ipv6_specific;
-+#ifdef CONFIG_MPTCP
-+ if (is_mptcp_enabled(sk))
-+ icsk->icsk_af_ops = &mptcp_v6_specific;
-+ else
-+#endif
-+ icsk->icsk_af_ops = &ipv6_specific;
-
- #ifdef CONFIG_TCP_MD5SIG
- tcp_sk(sk)->af_specific = &tcp_sock_ipv6_specific;
-@@ -1739,7 +1758,7 @@ static int tcp_v6_init_sock(struct sock *sk)
- return 0;
- }
-
--static void tcp_v6_destroy_sock(struct sock *sk)
-+void tcp_v6_destroy_sock(struct sock *sk)
- {
- tcp_v4_destroy_sock(sk);
- inet6_destroy_sock(sk);
-@@ -1924,12 +1943,28 @@ void tcp6_proc_exit(struct net *net)
- static void tcp_v6_clear_sk(struct sock *sk, int size)
- {
- struct inet_sock *inet = inet_sk(sk);
-+#ifdef CONFIG_MPTCP
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ /* size_tk_table goes from the end of tk_table to the end of sk */
-+ int size_tk_table = size - offsetof(struct tcp_sock, tk_table) -
-+ sizeof(tp->tk_table);
-+#endif
-
- /* we do not want to clear pinet6 field, because of RCU lookups */
- sk_prot_clear_nulls(sk, offsetof(struct inet_sock, pinet6));
-
- size -= offsetof(struct inet_sock, pinet6) + sizeof(inet->pinet6);
-+
-+#ifdef CONFIG_MPTCP
-+ /* We zero out only from pinet6 to tk_table */
-+ size -= size_tk_table + sizeof(tp->tk_table);
-+#endif
- memset(&inet->pinet6 + 1, 0, size);
-+
-+#ifdef CONFIG_MPTCP
-+ memset((char *)&tp->tk_table + sizeof(tp->tk_table), 0, size_tk_table);
-+#endif
-+
- }
-
- struct proto tcpv6_prot = {
-diff --git a/net/mptcp/Kconfig b/net/mptcp/Kconfig
-new file mode 100644
-index 000000000000..cdfc03adabf8
---- /dev/null
-+++ b/net/mptcp/Kconfig
-@@ -0,0 +1,115 @@
-+#
-+# MPTCP configuration
-+#
-+config MPTCP
-+ bool "MPTCP protocol"
-+ depends on (IPV6=y || IPV6=n)
-+ ---help---
-+ This replaces the normal TCP stack with a Multipath TCP stack,
-+ able to use several paths at once.
-+
-+menuconfig MPTCP_PM_ADVANCED
-+ bool "MPTCP: advanced path-manager control"
-+ depends on MPTCP=y
-+ ---help---
-+ Support for selection of different path-managers. You should choose 'Y' here,
-+ because otherwise you will not actively create new MPTCP-subflows.
-+
-+if MPTCP_PM_ADVANCED
-+
-+config MPTCP_FULLMESH
-+ tristate "MPTCP Full-Mesh Path-Manager"
-+ depends on MPTCP=y
-+ ---help---
-+ This path-management module will create a full-mesh among all IP-addresses.
-+
-+config MPTCP_NDIFFPORTS
-+ tristate "MPTCP ndiff-ports"
-+ depends on MPTCP=y
-+ ---help---
-+ This path-management module will create multiple subflows between the same
-+ pair of IP-addresses, modifying the source-port. You can set the number
-+ of subflows via the mptcp_ndiffports-sysctl.
-+
-+config MPTCP_BINDER
-+ tristate "MPTCP Binder"
-+ depends on (MPTCP=y)
-+ ---help---
-+ This path-management module works like ndiffports, and adds the sysctl
-+ option to set the gateway (and/or path to) per each additional subflow
-+ via Loose Source Routing (IPv4 only).
-+
-+choice
-+ prompt "Default MPTCP Path-Manager"
-+ default DEFAULT
-+ help
-+ Select the Path-Manager of your choice
-+
-+ config DEFAULT_FULLMESH
-+ bool "Full mesh" if MPTCP_FULLMESH=y
-+
-+ config DEFAULT_NDIFFPORTS
-+ bool "ndiff-ports" if MPTCP_NDIFFPORTS=y
-+
-+ config DEFAULT_BINDER
-+ bool "binder" if MPTCP_BINDER=y
-+
-+ config DEFAULT_DUMMY
-+ bool "Default"
-+
-+endchoice
-+
-+endif
-+
-+config DEFAULT_MPTCP_PM
-+ string
-+ default "default" if DEFAULT_DUMMY
-+ default "fullmesh" if DEFAULT_FULLMESH
-+ default "ndiffports" if DEFAULT_NDIFFPORTS
-+ default "binder" if DEFAULT_BINDER
-+ default "default"
-+
-+menuconfig MPTCP_SCHED_ADVANCED
-+ bool "MPTCP: advanced scheduler control"
-+ depends on MPTCP=y
-+ ---help---
-+ Support for selection of different schedulers. You should choose 'Y' here,
-+ if you want to choose a different scheduler than the default one.
-+
-+if MPTCP_SCHED_ADVANCED
-+
-+config MPTCP_ROUNDROBIN
-+ tristate "MPTCP Round-Robin"
-+ depends on (MPTCP=y)
-+ ---help---
-+ This is a very simple round-robin scheduler. Probably has bad performance
-+ but might be interesting for researchers.
-+
-+choice
-+ prompt "Default MPTCP Scheduler"
-+ default DEFAULT
-+ help
-+ Select the Scheduler of your choice
-+
-+ config DEFAULT_SCHEDULER
-+ bool "Default"
-+ ---help---
-+ This is the default scheduler, sending first on the subflow
-+ with the lowest RTT.
-+
-+ config DEFAULT_ROUNDROBIN
-+ bool "Round-Robin" if MPTCP_ROUNDROBIN=y
-+ ---help---
-+	  This is the round-robin scheduler, sending in a round-robin
-+	  fashion.
-+
-+endchoice
-+endif
-+
-+config DEFAULT_MPTCP_SCHED
-+ string
-+ depends on (MPTCP=y)
-+ default "default" if DEFAULT_SCHEDULER
-+ default "roundrobin" if DEFAULT_ROUNDROBIN
-+ default "default"
-+
-diff --git a/net/mptcp/Makefile b/net/mptcp/Makefile
-new file mode 100644
-index 000000000000..35561a7012e3
---- /dev/null
-+++ b/net/mptcp/Makefile
-@@ -0,0 +1,20 @@
-+#
-+## Makefile for MultiPath TCP support code.
-+#
-+#
-+
-+obj-$(CONFIG_MPTCP) += mptcp.o
-+
-+mptcp-y := mptcp_ctrl.o mptcp_ipv4.o mptcp_ofo_queue.o mptcp_pm.o \
-+ mptcp_output.o mptcp_input.o mptcp_sched.o
-+
-+obj-$(CONFIG_TCP_CONG_COUPLED) += mptcp_coupled.o
-+obj-$(CONFIG_TCP_CONG_OLIA) += mptcp_olia.o
-+obj-$(CONFIG_TCP_CONG_WVEGAS) += mptcp_wvegas.o
-+obj-$(CONFIG_MPTCP_FULLMESH) += mptcp_fullmesh.o
-+obj-$(CONFIG_MPTCP_NDIFFPORTS) += mptcp_ndiffports.o
-+obj-$(CONFIG_MPTCP_BINDER) += mptcp_binder.o
-+obj-$(CONFIG_MPTCP_ROUNDROBIN) += mptcp_rr.o
-+
-+mptcp-$(subst m,y,$(CONFIG_IPV6)) += mptcp_ipv6.o
-+
-diff --git a/net/mptcp/mptcp_binder.c b/net/mptcp/mptcp_binder.c
-new file mode 100644
-index 000000000000..95d8da560715
---- /dev/null
-+++ b/net/mptcp/mptcp_binder.c
-@@ -0,0 +1,487 @@
-+#include <linux/module.h>
-+
-+#include <net/mptcp.h>
-+#include <net/mptcp_v4.h>
-+
-+#include <linux/route.h>
-+#include <linux/inet.h>
-+#include <linux/mroute.h>
-+#include <linux/spinlock_types.h>
-+#include <net/inet_ecn.h>
-+#include <net/route.h>
-+#include <net/xfrm.h>
-+#include <net/compat.h>
-+#include <linux/slab.h>
-+
-+#define MPTCP_GW_MAX_LISTS 10
-+#define MPTCP_GW_LIST_MAX_LEN 6
-+#define MPTCP_GW_SYSCTL_MAX_LEN (15 * MPTCP_GW_LIST_MAX_LEN * \
-+ MPTCP_GW_MAX_LISTS)
-+
-+struct mptcp_gw_list {
-+ struct in_addr list[MPTCP_GW_MAX_LISTS][MPTCP_GW_LIST_MAX_LEN];
-+ u8 len[MPTCP_GW_MAX_LISTS];
-+};
-+
-+struct binder_priv {
-+ /* Worker struct for subflow establishment */
-+ struct work_struct subflow_work;
-+
-+ struct mptcp_cb *mpcb;
-+
-+ /* Prevent multiple sub-sockets concurrently iterating over sockets */
-+ spinlock_t *flow_lock;
-+};
-+
-+static struct mptcp_gw_list *mptcp_gws;
-+static rwlock_t mptcp_gws_lock;
-+
-+static int mptcp_binder_ndiffports __read_mostly = 1;
-+
-+static char sysctl_mptcp_binder_gateways[MPTCP_GW_SYSCTL_MAX_LEN] __read_mostly;
-+
-+static int mptcp_get_avail_list_ipv4(struct sock *sk)
-+{
-+ int i, j, list_taken, opt_ret, opt_len;
-+ unsigned char *opt_ptr, *opt_end_ptr, opt[MAX_IPOPTLEN];
-+
-+ for (i = 0; i < MPTCP_GW_MAX_LISTS; ++i) {
-+ if (mptcp_gws->len[i] == 0)
-+ goto error;
-+
-+ mptcp_debug("mptcp_get_avail_list_ipv4: List %i\n", i);
-+ list_taken = 0;
-+
-+ /* Loop through all sub-sockets in this connection */
-+ mptcp_for_each_sk(tcp_sk(sk)->mpcb, sk) {
-+ mptcp_debug("mptcp_get_avail_list_ipv4: Next sock\n");
-+
-+ /* Reset length and options buffer, then retrieve
-+ * from socket
-+ */
-+ opt_len = MAX_IPOPTLEN;
-+ memset(opt, 0, MAX_IPOPTLEN);
-+ opt_ret = ip_getsockopt(sk, IPPROTO_IP,
-+ IP_OPTIONS, opt, &opt_len);
-+ if (opt_ret < 0) {
-+ mptcp_debug(KERN_ERR "%s: MPTCP subsocket getsockopt() IP_OPTIONS failed, error %d\n",
-+ __func__, opt_ret);
-+ goto error;
-+ }
-+
-+ /* If socket has no options, it has no stake in this list */
-+ if (opt_len <= 0)
-+ continue;
-+
-+ /* Iterate options buffer */
-+ for (opt_ptr = &opt[0]; opt_ptr < &opt[opt_len]; opt_ptr++) {
-+ if (*opt_ptr == IPOPT_LSRR) {
-+ mptcp_debug("mptcp_get_avail_list_ipv4: LSRR options found\n");
-+ goto sock_lsrr;
-+ }
-+ }
-+ continue;
-+
-+sock_lsrr:
-+ /* Pointer to the 2nd to last address */
-+ opt_end_ptr = opt_ptr+(*(opt_ptr+1))-4;
-+
-+ /* Addresses start 3 bytes after type offset */
-+ opt_ptr += 3;
-+ j = 0;
-+
-+ /* Different length lists cannot be the same */
-+ if ((opt_end_ptr-opt_ptr)/4 != mptcp_gws->len[i])
-+ continue;
-+
-+ /* Iterate if we are still inside options list
-+ * and sysctl list
-+ */
-+ while (opt_ptr < opt_end_ptr && j < mptcp_gws->len[i]) {
-+ /* If there is a different address, this list must
-+ * not be set on this socket
-+ */
-+ if (memcmp(&mptcp_gws->list[i][j], opt_ptr, 4))
-+ break;
-+
-+ /* Jump 4 bytes to next address */
-+ opt_ptr += 4;
-+ j++;
-+ }
-+
-+ /* Reached the end without a differing address, lists
-+ * are therefore identical.
-+ */
-+ if (j == mptcp_gws->len[i]) {
-+ mptcp_debug("mptcp_get_avail_list_ipv4: List already used\n");
-+ list_taken = 1;
-+ break;
-+ }
-+ }
-+
-+ /* Free list found if not taken by a socket */
-+ if (!list_taken) {
-+ mptcp_debug("mptcp_get_avail_list_ipv4: List free\n");
-+ break;
-+ }
-+ }
-+
-+ if (i >= MPTCP_GW_MAX_LISTS)
-+ goto error;
-+
-+ return i;
-+error:
-+ return -1;
-+}
-+
-+/* The list of addresses is parsed each time a new connection is opened,
-+ * to make sure it's up to date. In case of error, all the lists are
-+ * marked as unavailable and the subflow's fingerprint is set to 0.
-+ */
-+static void mptcp_v4_add_lsrr(struct sock *sk, struct in_addr addr)
-+{
-+ int i, j, ret;
-+ unsigned char opt[MAX_IPOPTLEN] = {0};
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ struct binder_priv *fmp = (struct binder_priv *)&tp->mpcb->mptcp_pm[0];
-+
-+ /* Read lock: multiple sockets can read LSRR addresses at the same
-+ * time, but writes are done in mutual exclusion.
-+ * Spin lock: must search for free list for one socket at a time, or
-+ * multiple sockets could take the same list.
-+ */
-+ read_lock(&mptcp_gws_lock);
-+ spin_lock(fmp->flow_lock);
-+
-+ i = mptcp_get_avail_list_ipv4(sk);
-+
-+ /* Execution enters here only if a free path is found.
-+ */
-+ if (i >= 0) {
-+ opt[0] = IPOPT_NOP;
-+ opt[1] = IPOPT_LSRR;
-+ opt[2] = sizeof(mptcp_gws->list[i][0].s_addr) *
-+ (mptcp_gws->len[i] + 1) + 3;
-+ opt[3] = IPOPT_MINOFF;
-+ for (j = 0; j < mptcp_gws->len[i]; ++j)
-+ memcpy(opt + 4 +
-+ (j * sizeof(mptcp_gws->list[i][0].s_addr)),
-+ &mptcp_gws->list[i][j].s_addr,
-+ sizeof(mptcp_gws->list[i][0].s_addr));
-+ /* Final destination must be part of IP_OPTIONS parameter. */
-+ memcpy(opt + 4 + (j * sizeof(addr.s_addr)), &addr.s_addr,
-+ sizeof(addr.s_addr));
-+
-+ /* setsockopt must be inside the lock, otherwise another
-+ * subflow could fail to see that we have taken a list.
-+ */
-+ ret = ip_setsockopt(sk, IPPROTO_IP, IP_OPTIONS, opt,
-+ 4 + sizeof(mptcp_gws->list[i][0].s_addr)
-+ * (mptcp_gws->len[i] + 1));
-+
-+ if (ret < 0) {
-+ mptcp_debug(KERN_ERR "%s: MPTCP subsock setsockopt() IP_OPTIONS failed, error %d\n",
-+ __func__, ret);
-+ }
-+ }
-+
-+ spin_unlock(fmp->flow_lock);
-+ read_unlock(&mptcp_gws_lock);
-+
-+ return;
-+}
-+
-+/* Parses gateways string for a list of paths to different
-+ * gateways, and stores them for use with the Loose Source Routing (LSRR)
-+ * socket option. Each list must have "," separated addresses, and the lists
-+ * themselves must be separated by "-". Returns -1 in case one or more of the
-+ * addresses is not a valid ipv4/6 address.
-+ */
-+static int mptcp_parse_gateway_ipv4(char *gateways)
-+{
-+ int i, j, k, ret;
-+ char *tmp_string = NULL;
-+ struct in_addr tmp_addr;
-+
-+ tmp_string = kzalloc(16, GFP_KERNEL);
-+ if (tmp_string == NULL)
-+ return -ENOMEM;
-+
-+ write_lock(&mptcp_gws_lock);
-+
-+ memset(mptcp_gws, 0, sizeof(struct mptcp_gw_list));
-+
-+ /* A TMP string is used since inet_pton needs a null terminated string
-+ * but we do not want to modify the sysctl for obvious reasons.
-+ * i will iterate over the SYSCTL string, j will iterate over the
-+ * temporary string where each IP is copied into, k will iterate over
-+ * the IPs in each list.
-+ */
-+ for (i = j = k = 0;
-+ i < MPTCP_GW_SYSCTL_MAX_LEN && k < MPTCP_GW_MAX_LISTS;
-+ ++i) {
-+ if (gateways[i] == '-' || gateways[i] == ',' || gateways[i] == '\0') {
-+ /* If the temp IP is empty and the current list is
-+ * empty, we are done.
-+ */
-+ if (j == 0 && mptcp_gws->len[k] == 0)
-+ break;
-+
-+ /* Terminate the temp IP string, then if it is
-+ * non-empty parse the IP and copy it.
-+ */
-+ tmp_string[j] = '\0';
-+ if (j > 0) {
-+ mptcp_debug("mptcp_parse_gateway_list tmp: %s i: %d\n", tmp_string, i);
-+
-+ ret = in4_pton(tmp_string, strlen(tmp_string),
-+ (u8 *)&tmp_addr.s_addr, '\0',
-+ NULL);
-+
-+ if (ret) {
-+ mptcp_debug("mptcp_parse_gateway_list ret: %d s_addr: %pI4\n",
-+ ret,
-+ &tmp_addr.s_addr);
-+ memcpy(&mptcp_gws->list[k][mptcp_gws->len[k]].s_addr,
-+ &tmp_addr.s_addr,
-+ sizeof(tmp_addr.s_addr));
-+ mptcp_gws->len[k]++;
-+ j = 0;
-+ tmp_string[j] = '\0';
-+ /* Since we can't impose a limit to
-+ * what the user can input, make sure
-+ * there are not too many IPs in the
-+ * SYSCTL string.
-+ */
-+ if (mptcp_gws->len[k] > MPTCP_GW_LIST_MAX_LEN) {
-+ mptcp_debug("mptcp_parse_gateway_list too many members in list %i: max %i\n",
-+ k,
-+ MPTCP_GW_LIST_MAX_LEN);
-+ goto error;
-+ }
-+ } else {
-+ goto error;
-+ }
-+ }
-+
-+ if (gateways[i] == '-' || gateways[i] == '\0')
-+ ++k;
-+ } else {
-+ tmp_string[j] = gateways[i];
-+ ++j;
-+ }
-+ }
-+
-+ /* Number of flows is number of gateway lists plus master flow */
-+ mptcp_binder_ndiffports = k+1;
-+
-+ write_unlock(&mptcp_gws_lock);
-+ kfree(tmp_string);
-+
-+ return 0;
-+
-+error:
-+ memset(mptcp_gws, 0, sizeof(struct mptcp_gw_list));
-+ memset(gateways, 0, sizeof(char) * MPTCP_GW_SYSCTL_MAX_LEN);
-+ write_unlock(&mptcp_gws_lock);
-+ kfree(tmp_string);
-+ return -1;
-+}
-+
-+/**
-+ * Create all new subflows, by doing calls to mptcp_initX_subsockets
-+ *
-+ * This function uses a goto next_subflow, to allow releasing the lock between
-+ * new subflows and giving other processes a chance to do some work on the
-+ * socket and potentially finishing the communication.
-+ **/
-+static void create_subflow_worker(struct work_struct *work)
-+{
-+ const struct binder_priv *pm_priv = container_of(work,
-+ struct binder_priv,
-+ subflow_work);
-+ struct mptcp_cb *mpcb = pm_priv->mpcb;
-+ struct sock *meta_sk = mpcb->meta_sk;
-+ int iter = 0;
-+
-+next_subflow:
-+ if (iter) {
-+ release_sock(meta_sk);
-+ mutex_unlock(&mpcb->mpcb_mutex);
-+
-+ cond_resched();
-+ }
-+ mutex_lock(&mpcb->mpcb_mutex);
-+ lock_sock_nested(meta_sk, SINGLE_DEPTH_NESTING);
-+
-+ iter++;
-+
-+ if (sock_flag(meta_sk, SOCK_DEAD))
-+ goto exit;
-+
-+ if (mpcb->master_sk &&
-+ !tcp_sk(mpcb->master_sk)->mptcp->fully_established)
-+ goto exit;
-+
-+ if (mptcp_binder_ndiffports > iter &&
-+ mptcp_binder_ndiffports > mpcb->cnt_subflows) {
-+ struct mptcp_loc4 loc;
-+ struct mptcp_rem4 rem;
-+
-+ loc.addr.s_addr = inet_sk(meta_sk)->inet_saddr;
-+ loc.loc4_id = 0;
-+ loc.low_prio = 0;
-+
-+ rem.addr.s_addr = inet_sk(meta_sk)->inet_daddr;
-+ rem.port = inet_sk(meta_sk)->inet_dport;
-+ rem.rem4_id = 0; /* Default 0 */
-+
-+ mptcp_init4_subsockets(meta_sk, &loc, &rem);
-+
-+ goto next_subflow;
-+ }
-+
-+exit:
-+ release_sock(meta_sk);
-+ mutex_unlock(&mpcb->mpcb_mutex);
-+ sock_put(meta_sk);
-+}
-+
-+static void binder_new_session(const struct sock *meta_sk)
-+{
-+ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
-+ struct binder_priv *fmp = (struct binder_priv *)&mpcb->mptcp_pm[0];
-+ static DEFINE_SPINLOCK(flow_lock);
-+
-+#if IS_ENABLED(CONFIG_IPV6)
-+ if (meta_sk->sk_family == AF_INET6 &&
-+ !mptcp_v6_is_v4_mapped(meta_sk)) {
-+ mptcp_fallback_default(mpcb);
-+ return;
-+ }
-+#endif
-+
-+ /* Initialize workqueue-struct */
-+ INIT_WORK(&fmp->subflow_work, create_subflow_worker);
-+ fmp->mpcb = mpcb;
-+
-+ fmp->flow_lock = &flow_lock;
-+}
-+
-+static void binder_create_subflows(struct sock *meta_sk)
-+{
-+ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
-+ struct binder_priv *pm_priv = (struct binder_priv *)&mpcb->mptcp_pm[0];
-+
-+ if (mpcb->infinite_mapping_snd || mpcb->infinite_mapping_rcv ||
-+ mpcb->send_infinite_mapping ||
-+ mpcb->server_side || sock_flag(meta_sk, SOCK_DEAD))
-+ return;
-+
-+ if (!work_pending(&pm_priv->subflow_work)) {
-+ sock_hold(meta_sk);
-+ queue_work(mptcp_wq, &pm_priv->subflow_work);
-+ }
-+}
-+
-+static int binder_get_local_id(sa_family_t family, union inet_addr *addr,
-+ struct net *net, bool *low_prio)
-+{
-+ return 0;
-+}
-+
-+/* Callback functions, executed when syctl mptcp.mptcp_gateways is updated.
-+ * Inspired from proc_tcp_congestion_control().
-+ */
-+static int proc_mptcp_gateways(ctl_table *ctl, int write,
-+ void __user *buffer, size_t *lenp,
-+ loff_t *ppos)
-+{
-+ int ret;
-+ ctl_table tbl = {
-+ .maxlen = MPTCP_GW_SYSCTL_MAX_LEN,
-+ };
-+
-+ if (write) {
-+ tbl.data = kzalloc(MPTCP_GW_SYSCTL_MAX_LEN, GFP_KERNEL);
-+ if (tbl.data == NULL)
-+ return -1;
-+ ret = proc_dostring(&tbl, write, buffer, lenp, ppos);
-+ if (ret == 0) {
-+ ret = mptcp_parse_gateway_ipv4(tbl.data);
-+ memcpy(ctl->data, tbl.data, MPTCP_GW_SYSCTL_MAX_LEN);
-+ }
-+ kfree(tbl.data);
-+ } else {
-+ ret = proc_dostring(ctl, write, buffer, lenp, ppos);
-+ }
-+
-+
-+ return ret;
-+}
-+
-+static struct mptcp_pm_ops binder __read_mostly = {
-+ .new_session = binder_new_session,
-+ .fully_established = binder_create_subflows,
-+ .get_local_id = binder_get_local_id,
-+ .init_subsocket_v4 = mptcp_v4_add_lsrr,
-+ .name = "binder",
-+ .owner = THIS_MODULE,
-+};
-+
-+static struct ctl_table binder_table[] = {
-+ {
-+ .procname = "mptcp_binder_gateways",
-+ .data = &sysctl_mptcp_binder_gateways,
-+ .maxlen = sizeof(char) * MPTCP_GW_SYSCTL_MAX_LEN,
-+ .mode = 0644,
-+ .proc_handler = &proc_mptcp_gateways
-+ },
-+ { }
-+};
-+
-+struct ctl_table_header *mptcp_sysctl_binder;
-+
-+/* General initialization of MPTCP_PM */
-+static int __init binder_register(void)
-+{
-+ mptcp_gws = kzalloc(sizeof(*mptcp_gws), GFP_KERNEL);
-+ if (!mptcp_gws)
-+ return -ENOMEM;
-+
-+ rwlock_init(&mptcp_gws_lock);
-+
-+ BUILD_BUG_ON(sizeof(struct binder_priv) > MPTCP_PM_SIZE);
-+
-+ mptcp_sysctl_binder = register_net_sysctl(&init_net, "net/mptcp",
-+ binder_table);
-+ if (!mptcp_sysctl_binder)
-+ goto sysctl_fail;
-+
-+ if (mptcp_register_path_manager(&binder))
-+ goto pm_failed;
-+
-+ return 0;
-+
-+pm_failed:
-+ unregister_net_sysctl_table(mptcp_sysctl_binder);
-+sysctl_fail:
-+ kfree(mptcp_gws);
-+
-+ return -1;
-+}
-+
-+static void binder_unregister(void)
-+{
-+ mptcp_unregister_path_manager(&binder);
-+ unregister_net_sysctl_table(mptcp_sysctl_binder);
-+ kfree(mptcp_gws);
-+}
-+
-+module_init(binder_register);
-+module_exit(binder_unregister);
-+
-+MODULE_AUTHOR("Luca Boccassi, Duncan Eastoe, Christoph Paasch (ndiffports)");
-+MODULE_LICENSE("GPL");
-+MODULE_DESCRIPTION("BINDER MPTCP");
-+MODULE_VERSION("0.1");
-diff --git a/net/mptcp/mptcp_coupled.c b/net/mptcp/mptcp_coupled.c
-new file mode 100644
-index 000000000000..5d761164eb85
---- /dev/null
-+++ b/net/mptcp/mptcp_coupled.c
-@@ -0,0 +1,270 @@
-+/*
-+ * MPTCP implementation - Linked Increase congestion control Algorithm (LIA)
-+ *
-+ * Initial Design & Implementation:
-+ * Sébastien Barré <sebastien.barre@uclouvain.be>
-+ *
-+ * Current Maintainer & Author:
-+ * Christoph Paasch <christoph.paasch@uclouvain.be>
-+ *
-+ * Additional authors:
-+ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
-+ * Gregory Detal <gregory.detal@uclouvain.be>
-+ * Fabien Duchêne <fabien.duchene@uclouvain.be>
-+ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
-+ * Lavkesh Lahngir <lavkesh51@gmail.com>
-+ * Andreas Ripke <ripke@neclab.eu>
-+ * Vlad Dogaru <vlad.dogaru@intel.com>
-+ * Octavian Purdila <octavian.purdila@intel.com>
-+ * John Ronan <jronan@tssg.org>
-+ * Catalin Nicutar <catalin.nicutar@gmail.com>
-+ * Brandon Heller <brandonh@stanford.edu>
-+ *
-+ *
-+ * This program is free software; you can redistribute it and/or
-+ * modify it under the terms of the GNU General Public License
-+ * as published by the Free Software Foundation; either version
-+ * 2 of the License, or (at your option) any later version.
-+ */
-+#include <net/tcp.h>
-+#include <net/mptcp.h>
-+
-+#include <linux/module.h>
-+
-+/* Scaling is done in the numerator with alpha_scale_num and in the denominator
-+ * with alpha_scale_den.
-+ *
-+ * To downscale, we just need to use alpha_scale.
-+ *
-+ * We have: alpha_scale = alpha_scale_num / (alpha_scale_den ^ 2)
-+ */
-+static int alpha_scale_den = 10;
-+static int alpha_scale_num = 32;
-+static int alpha_scale = 12;
-+
-+struct mptcp_ccc {
-+ u64 alpha;
-+ bool forced_update;
-+};
-+
-+static inline int mptcp_ccc_sk_can_send(const struct sock *sk)
-+{
-+ return mptcp_sk_can_send(sk) && tcp_sk(sk)->srtt_us;
-+}
-+
-+static inline u64 mptcp_get_alpha(const struct sock *meta_sk)
-+{
-+ return ((struct mptcp_ccc *)inet_csk_ca(meta_sk))->alpha;
-+}
-+
-+static inline void mptcp_set_alpha(const struct sock *meta_sk, u64 alpha)
-+{
-+ ((struct mptcp_ccc *)inet_csk_ca(meta_sk))->alpha = alpha;
-+}
-+
-+static inline u64 mptcp_ccc_scale(u32 val, int scale)
-+{
-+ return (u64) val << scale;
-+}
-+
-+static inline bool mptcp_get_forced(const struct sock *meta_sk)
-+{
-+ return ((struct mptcp_ccc *)inet_csk_ca(meta_sk))->forced_update;
-+}
-+
-+static inline void mptcp_set_forced(const struct sock *meta_sk, bool force)
-+{
-+ ((struct mptcp_ccc *)inet_csk_ca(meta_sk))->forced_update = force;
-+}
-+
-+static void mptcp_ccc_recalc_alpha(const struct sock *sk)
-+{
-+ const struct mptcp_cb *mpcb = tcp_sk(sk)->mpcb;
-+ const struct sock *sub_sk;
-+ int best_cwnd = 0, best_rtt = 0, can_send = 0;
-+ u64 max_numerator = 0, sum_denominator = 0, alpha = 1;
-+
-+ if (!mpcb)
-+ return;
-+
-+ /* Only one subflow left - fall back to normal reno-behavior
-+ * (set alpha to 1)
-+ */
-+ if (mpcb->cnt_established <= 1)
-+ goto exit;
-+
-+ /* Do regular alpha-calculation for multiple subflows */
-+
-+ /* Find the max numerator of the alpha-calculation */
-+ mptcp_for_each_sk(mpcb, sub_sk) {
-+ struct tcp_sock *sub_tp = tcp_sk(sub_sk);
-+ u64 tmp;
-+
-+ if (!mptcp_ccc_sk_can_send(sub_sk))
-+ continue;
-+
-+ can_send++;
-+
-+ /* We need to look for the path, that provides the max-value.
-+ * Integer-overflow is not possible here, because
-+ * tmp will be in u64.
-+ */
-+ tmp = div64_u64(mptcp_ccc_scale(sub_tp->snd_cwnd,
-+ alpha_scale_num), (u64)sub_tp->srtt_us * sub_tp->srtt_us);
-+
-+ if (tmp >= max_numerator) {
-+ max_numerator = tmp;
-+ best_cwnd = sub_tp->snd_cwnd;
-+ best_rtt = sub_tp->srtt_us;
-+ }
-+ }
-+
-+ /* No subflow is able to send - we don't care anymore */
-+ if (unlikely(!can_send))
-+ goto exit;
-+
-+ /* Calculate the denominator */
-+ mptcp_for_each_sk(mpcb, sub_sk) {
-+ struct tcp_sock *sub_tp = tcp_sk(sub_sk);
-+
-+ if (!mptcp_ccc_sk_can_send(sub_sk))
-+ continue;
-+
-+ sum_denominator += div_u64(
-+ mptcp_ccc_scale(sub_tp->snd_cwnd,
-+ alpha_scale_den) * best_rtt,
-+ sub_tp->srtt_us);
-+ }
-+ sum_denominator *= sum_denominator;
-+ if (unlikely(!sum_denominator)) {
-+ pr_err("%s: sum_denominator == 0, cnt_established:%d\n",
-+ __func__, mpcb->cnt_established);
-+ mptcp_for_each_sk(mpcb, sub_sk) {
-+ struct tcp_sock *sub_tp = tcp_sk(sub_sk);
-+ pr_err("%s: pi:%d, state:%d\n, rtt:%u, cwnd: %u",
-+ __func__, sub_tp->mptcp->path_index,
-+ sub_sk->sk_state, sub_tp->srtt_us,
-+ sub_tp->snd_cwnd);
-+ }
-+ }
-+
-+ alpha = div64_u64(mptcp_ccc_scale(best_cwnd, alpha_scale_num), sum_denominator);
-+
-+ if (unlikely(!alpha))
-+ alpha = 1;
-+
-+exit:
-+ mptcp_set_alpha(mptcp_meta_sk(sk), alpha);
-+}
-+
-+static void mptcp_ccc_init(struct sock *sk)
-+{
-+ if (mptcp(tcp_sk(sk))) {
-+ mptcp_set_forced(mptcp_meta_sk(sk), 0);
-+ mptcp_set_alpha(mptcp_meta_sk(sk), 1);
-+ }
-+ /* If we do not mptcp, behave like reno: return */
-+}
-+
-+static void mptcp_ccc_cwnd_event(struct sock *sk, enum tcp_ca_event event)
-+{
-+ if (event == CA_EVENT_LOSS)
-+ mptcp_ccc_recalc_alpha(sk);
-+}
-+
-+static void mptcp_ccc_set_state(struct sock *sk, u8 ca_state)
-+{
-+ if (!mptcp(tcp_sk(sk)))
-+ return;
-+
-+ mptcp_set_forced(mptcp_meta_sk(sk), 1);
-+}
-+
-+static void mptcp_ccc_cong_avoid(struct sock *sk, u32 ack, u32 acked)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ const struct mptcp_cb *mpcb = tp->mpcb;
-+ int snd_cwnd;
-+
-+ if (!mptcp(tp)) {
-+ tcp_reno_cong_avoid(sk, ack, acked);
-+ return;
-+ }
-+
-+ if (!tcp_is_cwnd_limited(sk))
-+ return;
-+
-+ if (tp->snd_cwnd <= tp->snd_ssthresh) {
-+ /* In "safe" area, increase. */
-+ tcp_slow_start(tp, acked);
-+ mptcp_ccc_recalc_alpha(sk);
-+ return;
-+ }
-+
-+ if (mptcp_get_forced(mptcp_meta_sk(sk))) {
-+ mptcp_ccc_recalc_alpha(sk);
-+ mptcp_set_forced(mptcp_meta_sk(sk), 0);
-+ }
-+
-+ if (mpcb->cnt_established > 1) {
-+ u64 alpha = mptcp_get_alpha(mptcp_meta_sk(sk));
-+
-+ /* This may happen, if at the initialization, the mpcb
-+ * was not yet attached to the sock, and thus
-+ * initializing alpha failed.
-+ */
-+ if (unlikely(!alpha))
-+ alpha = 1;
-+
-+ snd_cwnd = (int) div_u64 ((u64) mptcp_ccc_scale(1, alpha_scale),
-+ alpha);
-+
-+ /* snd_cwnd_cnt >= max (scale * tot_cwnd / alpha, cwnd)
-+ * Thus, we select here the max value.
-+ */
-+ if (snd_cwnd < tp->snd_cwnd)
-+ snd_cwnd = tp->snd_cwnd;
-+ } else {
-+ snd_cwnd = tp->snd_cwnd;
-+ }
-+
-+ if (tp->snd_cwnd_cnt >= snd_cwnd) {
-+ if (tp->snd_cwnd < tp->snd_cwnd_clamp) {
-+ tp->snd_cwnd++;
-+ mptcp_ccc_recalc_alpha(sk);
-+ }
-+
-+ tp->snd_cwnd_cnt = 0;
-+ } else {
-+ tp->snd_cwnd_cnt++;
-+ }
-+}
-+
-+static struct tcp_congestion_ops mptcp_ccc = {
-+ .init = mptcp_ccc_init,
-+ .ssthresh = tcp_reno_ssthresh,
-+ .cong_avoid = mptcp_ccc_cong_avoid,
-+ .cwnd_event = mptcp_ccc_cwnd_event,
-+ .set_state = mptcp_ccc_set_state,
-+ .owner = THIS_MODULE,
-+ .name = "lia",
-+};
-+
-+static int __init mptcp_ccc_register(void)
-+{
-+ BUILD_BUG_ON(sizeof(struct mptcp_ccc) > ICSK_CA_PRIV_SIZE);
-+ return tcp_register_congestion_control(&mptcp_ccc);
-+}
-+
-+static void __exit mptcp_ccc_unregister(void)
-+{
-+ tcp_unregister_congestion_control(&mptcp_ccc);
-+}
-+
-+module_init(mptcp_ccc_register);
-+module_exit(mptcp_ccc_unregister);
-+
-+MODULE_AUTHOR("Christoph Paasch, Sébastien Barré");
-+MODULE_LICENSE("GPL");
-+MODULE_DESCRIPTION("MPTCP LINKED INCREASE CONGESTION CONTROL ALGORITHM");
-+MODULE_VERSION("0.1");
-diff --git a/net/mptcp/mptcp_ctrl.c b/net/mptcp/mptcp_ctrl.c
-new file mode 100644
-index 000000000000..28dfa0479f5e
---- /dev/null
-+++ b/net/mptcp/mptcp_ctrl.c
-@@ -0,0 +1,2401 @@
-+/*
-+ * MPTCP implementation - MPTCP-control
-+ *
-+ * Initial Design & Implementation:
-+ * Sébastien Barré <sebastien.barre@uclouvain.be>
-+ *
-+ * Current Maintainer & Author:
-+ * Christoph Paasch <christoph.paasch@uclouvain.be>
-+ *
-+ * Additional authors:
-+ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
-+ * Gregory Detal <gregory.detal@uclouvain.be>
-+ * Fabien Duchêne <fabien.duchene@uclouvain.be>
-+ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
-+ * Lavkesh Lahngir <lavkesh51@gmail.com>
-+ * Andreas Ripke <ripke@neclab.eu>
-+ * Vlad Dogaru <vlad.dogaru@intel.com>
-+ * Octavian Purdila <octavian.purdila@intel.com>
-+ * John Ronan <jronan@tssg.org>
-+ * Catalin Nicutar <catalin.nicutar@gmail.com>
-+ * Brandon Heller <brandonh@stanford.edu>
-+ *
-+ *
-+ * This program is free software; you can redistribute it and/or
-+ * modify it under the terms of the GNU General Public License
-+ * as published by the Free Software Foundation; either version
-+ * 2 of the License, or (at your option) any later version.
-+ */
-+
-+#include <net/inet_common.h>
-+#include <net/inet6_hashtables.h>
-+#include <net/ipv6.h>
-+#include <net/ip6_checksum.h>
-+#include <net/mptcp.h>
-+#include <net/mptcp_v4.h>
-+#if IS_ENABLED(CONFIG_IPV6)
-+#include <net/ip6_route.h>
-+#include <net/mptcp_v6.h>
-+#endif
-+#include <net/sock.h>
-+#include <net/tcp.h>
-+#include <net/tcp_states.h>
-+#include <net/transp_v6.h>
-+#include <net/xfrm.h>
-+
-+#include <linux/cryptohash.h>
-+#include <linux/kconfig.h>
-+#include <linux/module.h>
-+#include <linux/netpoll.h>
-+#include <linux/list.h>
-+#include <linux/jhash.h>
-+#include <linux/tcp.h>
-+#include <linux/net.h>
-+#include <linux/in.h>
-+#include <linux/random.h>
-+#include <linux/inetdevice.h>
-+#include <linux/workqueue.h>
-+#include <linux/atomic.h>
-+#include <linux/sysctl.h>
-+
-+static struct kmem_cache *mptcp_sock_cache __read_mostly;
-+static struct kmem_cache *mptcp_cb_cache __read_mostly;
-+static struct kmem_cache *mptcp_tw_cache __read_mostly;
-+
-+int sysctl_mptcp_enabled __read_mostly = 1;
-+int sysctl_mptcp_checksum __read_mostly = 1;
-+int sysctl_mptcp_debug __read_mostly;
-+EXPORT_SYMBOL(sysctl_mptcp_debug);
-+int sysctl_mptcp_syn_retries __read_mostly = 3;
-+
-+bool mptcp_init_failed __read_mostly;
-+
-+struct static_key mptcp_static_key = STATIC_KEY_INIT_FALSE;
-+EXPORT_SYMBOL(mptcp_static_key);
-+
-+static int proc_mptcp_path_manager(ctl_table *ctl, int write,
-+ void __user *buffer, size_t *lenp,
-+ loff_t *ppos)
-+{
-+ char val[MPTCP_PM_NAME_MAX];
-+ ctl_table tbl = {
-+ .data = val,
-+ .maxlen = MPTCP_PM_NAME_MAX,
-+ };
-+ int ret;
-+
-+ mptcp_get_default_path_manager(val);
-+
-+ ret = proc_dostring(&tbl, write, buffer, lenp, ppos);
-+ if (write && ret == 0)
-+ ret = mptcp_set_default_path_manager(val);
-+ return ret;
-+}
-+
-+static int proc_mptcp_scheduler(ctl_table *ctl, int write,
-+ void __user *buffer, size_t *lenp,
-+ loff_t *ppos)
-+{
-+ char val[MPTCP_SCHED_NAME_MAX];
-+ ctl_table tbl = {
-+ .data = val,
-+ .maxlen = MPTCP_SCHED_NAME_MAX,
-+ };
-+ int ret;
-+
-+ mptcp_get_default_scheduler(val);
-+
-+ ret = proc_dostring(&tbl, write, buffer, lenp, ppos);
-+ if (write && ret == 0)
-+ ret = mptcp_set_default_scheduler(val);
-+ return ret;
-+}
-+
-+static struct ctl_table mptcp_table[] = {
-+ {
-+ .procname = "mptcp_enabled",
-+ .data = &sysctl_mptcp_enabled,
-+ .maxlen = sizeof(int),
-+ .mode = 0644,
-+ .proc_handler = &proc_dointvec
-+ },
-+ {
-+ .procname = "mptcp_checksum",
-+ .data = &sysctl_mptcp_checksum,
-+ .maxlen = sizeof(int),
-+ .mode = 0644,
-+ .proc_handler = &proc_dointvec
-+ },
-+ {
-+ .procname = "mptcp_debug",
-+ .data = &sysctl_mptcp_debug,
-+ .maxlen = sizeof(int),
-+ .mode = 0644,
-+ .proc_handler = &proc_dointvec
-+ },
-+ {
-+ .procname = "mptcp_syn_retries",
-+ .data = &sysctl_mptcp_syn_retries,
-+ .maxlen = sizeof(int),
-+ .mode = 0644,
-+ .proc_handler = &proc_dointvec
-+ },
-+ {
-+ .procname = "mptcp_path_manager",
-+ .mode = 0644,
-+ .maxlen = MPTCP_PM_NAME_MAX,
-+ .proc_handler = proc_mptcp_path_manager,
-+ },
-+ {
-+ .procname = "mptcp_scheduler",
-+ .mode = 0644,
-+ .maxlen = MPTCP_SCHED_NAME_MAX,
-+ .proc_handler = proc_mptcp_scheduler,
-+ },
-+ { }
-+};
-+
-+static inline u32 mptcp_hash_tk(u32 token)
-+{
-+ return token % MPTCP_HASH_SIZE;
-+}
-+
-+struct hlist_nulls_head tk_hashtable[MPTCP_HASH_SIZE];
-+EXPORT_SYMBOL(tk_hashtable);
-+
-+/* This second hashtable is needed to retrieve request socks
-+ * created as a result of a join request. While the SYN contains
-+ * the token, the final ack does not, so we need a separate hashtable
-+ * to retrieve the mpcb.
-+ */
-+struct hlist_nulls_head mptcp_reqsk_htb[MPTCP_HASH_SIZE];
-+spinlock_t mptcp_reqsk_hlock; /* hashtable protection */
-+
-+/* The following hash table is used to avoid collision of token */
-+static struct hlist_nulls_head mptcp_reqsk_tk_htb[MPTCP_HASH_SIZE];
-+spinlock_t mptcp_tk_hashlock; /* hashtable protection */
-+
-+static bool mptcp_reqsk_find_tk(const u32 token)
-+{
-+ const u32 hash = mptcp_hash_tk(token);
-+ const struct mptcp_request_sock *mtreqsk;
-+ const struct hlist_nulls_node *node;
-+
-+begin:
-+ hlist_nulls_for_each_entry_rcu(mtreqsk, node,
-+ &mptcp_reqsk_tk_htb[hash], hash_entry) {
-+ if (token == mtreqsk->mptcp_loc_token)
-+ return true;
-+ }
-+ /* A request-socket is destroyed by RCU. So, it might have been recycled
-+ * and put into another hash-table list. So, after the lookup we may
-+ * end up in a different list. So, we may need to restart.
-+ *
-+ * See also the comment in __inet_lookup_established.
-+ */
-+ if (get_nulls_value(node) != hash)
-+ goto begin;
-+ return false;
-+}
-+
-+static void mptcp_reqsk_insert_tk(struct request_sock *reqsk, const u32 token)
-+{
-+ u32 hash = mptcp_hash_tk(token);
-+
-+ hlist_nulls_add_head_rcu(&mptcp_rsk(reqsk)->hash_entry,
-+ &mptcp_reqsk_tk_htb[hash]);
-+}
-+
-+static void mptcp_reqsk_remove_tk(const struct request_sock *reqsk)
-+{
-+ rcu_read_lock();
-+ spin_lock(&mptcp_tk_hashlock);
-+ hlist_nulls_del_init_rcu(&mptcp_rsk(reqsk)->hash_entry);
-+ spin_unlock(&mptcp_tk_hashlock);
-+ rcu_read_unlock();
-+}
-+
-+void mptcp_reqsk_destructor(struct request_sock *req)
-+{
-+ if (!mptcp_rsk(req)->is_sub) {
-+ if (in_softirq()) {
-+ mptcp_reqsk_remove_tk(req);
-+ } else {
-+ rcu_read_lock_bh();
-+ spin_lock(&mptcp_tk_hashlock);
-+ hlist_nulls_del_init_rcu(&mptcp_rsk(req)->hash_entry);
-+ spin_unlock(&mptcp_tk_hashlock);
-+ rcu_read_unlock_bh();
-+ }
-+ } else {
-+ mptcp_hash_request_remove(req);
-+ }
-+}
-+
-+static void __mptcp_hash_insert(struct tcp_sock *meta_tp, const u32 token)
-+{
-+ u32 hash = mptcp_hash_tk(token);
-+ hlist_nulls_add_head_rcu(&meta_tp->tk_table, &tk_hashtable[hash]);
-+ meta_tp->inside_tk_table = 1;
-+}
-+
-+static bool mptcp_find_token(u32 token)
-+{
-+ const u32 hash = mptcp_hash_tk(token);
-+ const struct tcp_sock *meta_tp;
-+ const struct hlist_nulls_node *node;
-+
-+begin:
-+ hlist_nulls_for_each_entry_rcu(meta_tp, node, &tk_hashtable[hash], tk_table) {
-+ if (token == meta_tp->mptcp_loc_token)
-+ return true;
-+ }
-+ /* A TCP-socket is destroyed by RCU. So, it might have been recycled
-+ * and put into another hash-table list. So, after the lookup we may
-+ * end up in a different list. So, we may need to restart.
-+ *
-+ * See also the comment in __inet_lookup_established.
-+ */
-+ if (get_nulls_value(node) != hash)
-+ goto begin;
-+ return false;
-+}
-+
-+static void mptcp_set_key_reqsk(struct request_sock *req,
-+ const struct sk_buff *skb)
-+{
-+ const struct inet_request_sock *ireq = inet_rsk(req);
-+ struct mptcp_request_sock *mtreq = mptcp_rsk(req);
-+
-+ if (skb->protocol == htons(ETH_P_IP)) {
-+ mtreq->mptcp_loc_key = mptcp_v4_get_key(ip_hdr(skb)->saddr,
-+ ip_hdr(skb)->daddr,
-+ htons(ireq->ir_num),
-+ ireq->ir_rmt_port);
-+#if IS_ENABLED(CONFIG_IPV6)
-+ } else {
-+ mtreq->mptcp_loc_key = mptcp_v6_get_key(ipv6_hdr(skb)->saddr.s6_addr32,
-+ ipv6_hdr(skb)->daddr.s6_addr32,
-+ htons(ireq->ir_num),
-+ ireq->ir_rmt_port);
-+#endif
-+ }
-+
-+ mptcp_key_sha1(mtreq->mptcp_loc_key, &mtreq->mptcp_loc_token, NULL);
-+}
-+
-+/* New MPTCP-connection request, prepare a new token for the meta-socket that
-+ * will be created in mptcp_check_req_master(), and store the received token.
-+ */
-+void mptcp_reqsk_new_mptcp(struct request_sock *req,
-+ const struct mptcp_options_received *mopt,
-+ const struct sk_buff *skb)
-+{
-+ struct mptcp_request_sock *mtreq = mptcp_rsk(req);
-+
-+ inet_rsk(req)->saw_mpc = 1;
-+
-+ rcu_read_lock();
-+ spin_lock(&mptcp_tk_hashlock);
-+ do {
-+ mptcp_set_key_reqsk(req, skb);
-+ } while (mptcp_reqsk_find_tk(mtreq->mptcp_loc_token) ||
-+ mptcp_find_token(mtreq->mptcp_loc_token));
-+
-+ mptcp_reqsk_insert_tk(req, mtreq->mptcp_loc_token);
-+ spin_unlock(&mptcp_tk_hashlock);
-+ rcu_read_unlock();
-+ mtreq->mptcp_rem_key = mopt->mptcp_key;
-+}
-+
-+static void mptcp_set_key_sk(const struct sock *sk)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ const struct inet_sock *isk = inet_sk(sk);
-+
-+ if (sk->sk_family == AF_INET)
-+ tp->mptcp_loc_key = mptcp_v4_get_key(isk->inet_saddr,
-+ isk->inet_daddr,
-+ isk->inet_sport,
-+ isk->inet_dport);
-+#if IS_ENABLED(CONFIG_IPV6)
-+ else
-+ tp->mptcp_loc_key = mptcp_v6_get_key(inet6_sk(sk)->saddr.s6_addr32,
-+ sk->sk_v6_daddr.s6_addr32,
-+ isk->inet_sport,
-+ isk->inet_dport);
-+#endif
-+
-+ mptcp_key_sha1(tp->mptcp_loc_key,
-+ &tp->mptcp_loc_token, NULL);
-+}
-+
-+void mptcp_connect_init(struct sock *sk)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+
-+ rcu_read_lock_bh();
-+ spin_lock(&mptcp_tk_hashlock);
-+ do {
-+ mptcp_set_key_sk(sk);
-+ } while (mptcp_reqsk_find_tk(tp->mptcp_loc_token) ||
-+ mptcp_find_token(tp->mptcp_loc_token));
-+
-+ __mptcp_hash_insert(tp, tp->mptcp_loc_token);
-+ spin_unlock(&mptcp_tk_hashlock);
-+ rcu_read_unlock_bh();
-+}
-+
-+/**
-+ * This function increments the refcount of the mpcb struct.
-+ * It is the responsibility of the caller to decrement when releasing
-+ * the structure.
-+ */
-+struct sock *mptcp_hash_find(const struct net *net, const u32 token)
-+{
-+ const u32 hash = mptcp_hash_tk(token);
-+ const struct tcp_sock *meta_tp;
-+ struct sock *meta_sk = NULL;
-+ const struct hlist_nulls_node *node;
-+
-+ rcu_read_lock();
-+begin:
-+ hlist_nulls_for_each_entry_rcu(meta_tp, node, &tk_hashtable[hash],
-+ tk_table) {
-+ meta_sk = (struct sock *)meta_tp;
-+ if (token == meta_tp->mptcp_loc_token &&
-+ net_eq(net, sock_net(meta_sk))) {
-+ if (unlikely(!atomic_inc_not_zero(&meta_sk->sk_refcnt)))
-+ goto out;
-+ if (unlikely(token != meta_tp->mptcp_loc_token ||
-+ !net_eq(net, sock_net(meta_sk)))) {
-+ sock_gen_put(meta_sk);
-+ goto begin;
-+ }
-+ goto found;
-+ }
-+ }
-+ /* A TCP-socket is destroyed by RCU. So, it might have been recycled
-+ * and put into another hash-table list. So, after the lookup we may
-+ * end up in a different list. So, we may need to restart.
-+ *
-+ * See also the comment in __inet_lookup_established.
-+ */
-+ if (get_nulls_value(node) != hash)
-+ goto begin;
-+out:
-+ meta_sk = NULL;
-+found:
-+ rcu_read_unlock();
-+ return meta_sk;
-+}
-+
-+void mptcp_hash_remove_bh(struct tcp_sock *meta_tp)
-+{
-+ /* remove from the token hashtable */
-+ rcu_read_lock_bh();
-+ spin_lock(&mptcp_tk_hashlock);
-+ hlist_nulls_del_init_rcu(&meta_tp->tk_table);
-+ meta_tp->inside_tk_table = 0;
-+ spin_unlock(&mptcp_tk_hashlock);
-+ rcu_read_unlock_bh();
-+}
-+
-+void mptcp_hash_remove(struct tcp_sock *meta_tp)
-+{
-+ rcu_read_lock();
-+ spin_lock(&mptcp_tk_hashlock);
-+ hlist_nulls_del_init_rcu(&meta_tp->tk_table);
-+ meta_tp->inside_tk_table = 0;
-+ spin_unlock(&mptcp_tk_hashlock);
-+ rcu_read_unlock();
-+}
-+
-+struct sock *mptcp_select_ack_sock(const struct sock *meta_sk)
-+{
-+ const struct tcp_sock *meta_tp = tcp_sk(meta_sk);
-+ struct sock *sk, *rttsk = NULL, *lastsk = NULL;
-+ u32 min_time = 0, last_active = 0;
-+
-+ mptcp_for_each_sk(meta_tp->mpcb, sk) {
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ u32 elapsed;
-+
-+ if (!mptcp_sk_can_send_ack(sk) || tp->pf)
-+ continue;
-+
-+ elapsed = keepalive_time_elapsed(tp);
-+
-+ /* We take the one with the lowest RTT within a reasonable
-+ * (meta-RTO)-timeframe
-+ */
-+ if (elapsed < inet_csk(meta_sk)->icsk_rto) {
-+ if (!min_time || tp->srtt_us < min_time) {
-+ min_time = tp->srtt_us;
-+ rttsk = sk;
-+ }
-+ continue;
-+ }
-+
-+ /* Otherwise, we just take the most recent active */
-+ if (!rttsk && (!last_active || elapsed < last_active)) {
-+ last_active = elapsed;
-+ lastsk = sk;
-+ }
-+ }
-+
-+ if (rttsk)
-+ return rttsk;
-+
-+ return lastsk;
-+}
-+EXPORT_SYMBOL(mptcp_select_ack_sock);
-+
-+static void mptcp_sock_def_error_report(struct sock *sk)
-+{
-+ const struct mptcp_cb *mpcb = tcp_sk(sk)->mpcb;
-+
-+ if (!sock_flag(sk, SOCK_DEAD))
-+ mptcp_sub_close(sk, 0);
-+
-+ if (mpcb->infinite_mapping_rcv || mpcb->infinite_mapping_snd ||
-+ mpcb->send_infinite_mapping) {
-+ struct sock *meta_sk = mptcp_meta_sk(sk);
-+
-+ meta_sk->sk_err = sk->sk_err;
-+ meta_sk->sk_err_soft = sk->sk_err_soft;
-+
-+ if (!sock_flag(meta_sk, SOCK_DEAD))
-+ meta_sk->sk_error_report(meta_sk);
-+
-+ tcp_done(meta_sk);
-+ }
-+
-+ sk->sk_err = 0;
-+ return;
-+}
-+
-+static void mptcp_mpcb_put(struct mptcp_cb *mpcb)
-+{
-+ if (atomic_dec_and_test(&mpcb->mpcb_refcnt)) {
-+ mptcp_cleanup_path_manager(mpcb);
-+ mptcp_cleanup_scheduler(mpcb);
-+ kmem_cache_free(mptcp_cb_cache, mpcb);
-+ }
-+}
-+
-+static void mptcp_sock_destruct(struct sock *sk)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+
-+ inet_sock_destruct(sk);
-+
-+ if (!is_meta_sk(sk) && !tp->was_meta_sk) {
-+ BUG_ON(!hlist_unhashed(&tp->mptcp->cb_list));
-+
-+ kmem_cache_free(mptcp_sock_cache, tp->mptcp);
-+ tp->mptcp = NULL;
-+
-+ /* Taken when mpcb pointer was set */
-+ sock_put(mptcp_meta_sk(sk));
-+ mptcp_mpcb_put(tp->mpcb);
-+ } else {
-+ struct mptcp_cb *mpcb = tp->mpcb;
-+ struct mptcp_tw *mptw;
-+
-+ /* The mpcb is disappearing - we can make the final
-+ * update to the rcv_nxt of the time-wait-sock and remove
-+ * its reference to the mpcb.
-+ */
-+ spin_lock_bh(&mpcb->tw_lock);
-+ list_for_each_entry_rcu(mptw, &mpcb->tw_list, list) {
-+ list_del_rcu(&mptw->list);
-+ mptw->in_list = 0;
-+ mptcp_mpcb_put(mpcb);
-+ rcu_assign_pointer(mptw->mpcb, NULL);
-+ }
-+ spin_unlock_bh(&mpcb->tw_lock);
-+
-+ mptcp_mpcb_put(mpcb);
-+
-+ mptcp_debug("%s destroying meta-sk\n", __func__);
-+ }
-+
-+ WARN_ON(!static_key_false(&mptcp_static_key));
-+ /* Must be the last call, because is_meta_sk() above still needs the
-+ * static key
-+ */
-+ static_key_slow_dec(&mptcp_static_key);
-+}
-+
-+void mptcp_destroy_sock(struct sock *sk)
-+{
-+ if (is_meta_sk(sk)) {
-+ struct sock *sk_it, *tmpsk;
-+
-+ __skb_queue_purge(&tcp_sk(sk)->mpcb->reinject_queue);
-+ mptcp_purge_ofo_queue(tcp_sk(sk));
-+
-+ /* We have to close all remaining subflows. Normally, they
-+ * should all be about to get closed. But, if the kernel is
-+ * forcing a closure (e.g., tcp_write_err), the subflows might
-+ * not have been closed properly (as we are waiting for the
-+ * DATA_ACK of the DATA_FIN).
-+ */
-+ mptcp_for_each_sk_safe(tcp_sk(sk)->mpcb, sk_it, tmpsk) {
-+ /* tcp_close has already been called - waiting for graceful
-+ * closure, or if we are retransmitting fast-close on
-+ * the subflow. The reset (or timeout) will kill the
-+ * subflow.
-+ */
-+ if (tcp_sk(sk_it)->closing ||
-+ tcp_sk(sk_it)->send_mp_fclose)
-+ continue;
-+
-+ /* Allow the delayed work first to prevent time-wait state */
-+ if (delayed_work_pending(&tcp_sk(sk_it)->mptcp->work))
-+ continue;
-+
-+ mptcp_sub_close(sk_it, 0);
-+ }
-+
-+ mptcp_delete_synack_timer(sk);
-+ } else {
-+ mptcp_del_sock(sk);
-+ }
-+}
-+
-+static void mptcp_set_state(struct sock *sk)
-+{
-+ struct sock *meta_sk = mptcp_meta_sk(sk);
-+
-+ /* Meta is not yet established - wake up the application */
-+ if ((1 << meta_sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV) &&
-+ sk->sk_state == TCP_ESTABLISHED) {
-+ tcp_set_state(meta_sk, TCP_ESTABLISHED);
-+
-+ if (!sock_flag(meta_sk, SOCK_DEAD)) {
-+ meta_sk->sk_state_change(meta_sk);
-+ sk_wake_async(meta_sk, SOCK_WAKE_IO, POLL_OUT);
-+ }
-+ }
-+
-+ if (sk->sk_state == TCP_ESTABLISHED) {
-+ tcp_sk(sk)->mptcp->establish_increased = 1;
-+ tcp_sk(sk)->mpcb->cnt_established++;
-+ }
-+}
-+
-+void mptcp_init_congestion_control(struct sock *sk)
-+{
-+ struct inet_connection_sock *icsk = inet_csk(sk);
-+ struct inet_connection_sock *meta_icsk = inet_csk(mptcp_meta_sk(sk));
-+ const struct tcp_congestion_ops *ca = meta_icsk->icsk_ca_ops;
-+
-+ /* The application didn't set the congestion control to use;
-+ * fall back to the default one.
-+ */
-+ if (ca == &tcp_init_congestion_ops)
-+ goto use_default;
-+
-+ /* Use the same congestion control as set by the user. If the
-+ * module is not available, fall back to the default one.
-+ */
-+ if (!try_module_get(ca->owner)) {
-+ pr_warn("%s: fallback to the system default CC\n", __func__);
-+ goto use_default;
-+ }
-+
-+ icsk->icsk_ca_ops = ca;
-+ if (icsk->icsk_ca_ops->init)
-+ icsk->icsk_ca_ops->init(sk);
-+
-+ return;
-+
-+use_default:
-+ icsk->icsk_ca_ops = &tcp_init_congestion_ops;
-+ tcp_init_congestion_control(sk);
-+}
-+
-+u32 mptcp_secret[MD5_MESSAGE_BYTES / 4] ____cacheline_aligned;
-+u32 mptcp_seed = 0;
-+
-+void mptcp_key_sha1(u64 key, u32 *token, u64 *idsn)
-+{
-+ u32 workspace[SHA_WORKSPACE_WORDS];
-+ u32 mptcp_hashed_key[SHA_DIGEST_WORDS];
-+ u8 input[64];
-+ int i;
-+
-+ memset(workspace, 0, sizeof(workspace));
-+
-+ /* Initialize input with appropriate padding */
-+ memset(&input[9], 0, sizeof(input) - 10); /* -10, because the last byte
-+ * is explicitly set too
-+ */
-+ memcpy(input, &key, sizeof(key)); /* Copy key to the msg beginning */
-+ input[8] = 0x80; /* Padding: First bit after message = 1 */
-+ input[63] = 0x40; /* Padding: Length of the message = 64 bits */
-+
-+ sha_init(mptcp_hashed_key);
-+ sha_transform(mptcp_hashed_key, input, workspace);
-+
-+ for (i = 0; i < 5; i++)
-+ mptcp_hashed_key[i] = cpu_to_be32(mptcp_hashed_key[i]);
-+
-+ if (token)
-+ *token = mptcp_hashed_key[0];
-+ if (idsn)
-+ *idsn = *((u64 *)&mptcp_hashed_key[3]);
-+}
-+
-+void mptcp_hmac_sha1(u8 *key_1, u8 *key_2, u8 *rand_1, u8 *rand_2,
-+ u32 *hash_out)
-+{
-+ u32 workspace[SHA_WORKSPACE_WORDS];
-+ u8 input[128]; /* 2 512-bit blocks */
-+ int i;
-+
-+ memset(workspace, 0, sizeof(workspace));
-+
-+ /* Generate key xored with ipad */
-+ memset(input, 0x36, 64);
-+ for (i = 0; i < 8; i++)
-+ input[i] ^= key_1[i];
-+ for (i = 0; i < 8; i++)
-+ input[i + 8] ^= key_2[i];
-+
-+ memcpy(&input[64], rand_1, 4);
-+ memcpy(&input[68], rand_2, 4);
-+ input[72] = 0x80; /* Padding: First bit after message = 1 */
-+ memset(&input[73], 0, 53);
-+
-+ /* Padding: Length of the message = 512 + 64 bits */
-+ input[126] = 0x02;
-+ input[127] = 0x40;
-+
-+ sha_init(hash_out);
-+ sha_transform(hash_out, input, workspace);
-+ memset(workspace, 0, sizeof(workspace));
-+
-+ sha_transform(hash_out, &input[64], workspace);
-+ memset(workspace, 0, sizeof(workspace));
-+
-+ for (i = 0; i < 5; i++)
-+ hash_out[i] = cpu_to_be32(hash_out[i]);
-+
-+ /* Prepare second part of hmac */
-+ memset(input, 0x5C, 64);
-+ for (i = 0; i < 8; i++)
-+ input[i] ^= key_1[i];
-+ for (i = 0; i < 8; i++)
-+ input[i + 8] ^= key_2[i];
-+
-+ memcpy(&input[64], hash_out, 20);
-+ input[84] = 0x80;
-+ memset(&input[85], 0, 41);
-+
-+ /* Padding: Length of the message = 512 + 160 bits */
-+ input[126] = 0x02;
-+ input[127] = 0xA0;
-+
-+ sha_init(hash_out);
-+ sha_transform(hash_out, input, workspace);
-+ memset(workspace, 0, sizeof(workspace));
-+
-+ sha_transform(hash_out, &input[64], workspace);
-+
-+ for (i = 0; i < 5; i++)
-+ hash_out[i] = cpu_to_be32(hash_out[i]);
-+}
-+
-+static void mptcp_mpcb_inherit_sockopts(struct sock *meta_sk, struct sock *master_sk)
-+{
-+ /* Socket-options handled by sk_clone_lock while creating the meta-sk.
-+ * ======
-+ * SO_SNDBUF, SO_SNDBUFFORCE, SO_RCVBUF, SO_RCVBUFFORCE, SO_RCVLOWAT,
-+ * SO_RCVTIMEO, SO_SNDTIMEO, SO_ATTACH_FILTER, SO_DETACH_FILTER,
-+ * TCP_NODELAY, TCP_CORK
-+ *
-+ * Socket-options handled in this function here
-+ * ======
-+ * TCP_DEFER_ACCEPT
-+ * SO_KEEPALIVE
-+ *
-+ * Socket-options on the todo-list
-+ * ======
-+ * SO_BINDTODEVICE - should probably prevent creation of new subsocks
-+ * across other devices. - what about the api-draft?
-+ * SO_DEBUG
-+ * SO_REUSEADDR - probably we don't care about this
-+ * SO_DONTROUTE, SO_BROADCAST
-+ * SO_OOBINLINE
-+ * SO_LINGER
-+ * SO_TIMESTAMP* - I don't think this is of concern for a SOCK_STREAM
-+ * SO_PASSSEC - I don't think this is of concern for a SOCK_STREAM
-+ * SO_RXQ_OVFL
-+ * TCP_COOKIE_TRANSACTIONS
-+ * TCP_MAXSEG
-+ * TCP_THIN_* - Handled by sk_clone_lock, but we need to support this
-+ * in mptcp_retransmit_timer. AND we need to check what is
-+ * about the subsockets.
-+ * TCP_LINGER2
-+ * TCP_WINDOW_CLAMP
-+ * TCP_USER_TIMEOUT
-+ * TCP_MD5SIG
-+ *
-+ * Socket-options of no concern for the meta-socket (but for the subsocket)
-+ * ======
-+ * SO_PRIORITY
-+ * SO_MARK
-+ * TCP_CONGESTION
-+ * TCP_SYNCNT
-+ * TCP_QUICKACK
-+ */
-+
-+ /* DEFER_ACCEPT should not be set on the meta, as we want to accept new subflows directly */
-+ inet_csk(meta_sk)->icsk_accept_queue.rskq_defer_accept = 0;
-+
-+ /* Keepalives are handled entirely at the MPTCP-layer */
-+ if (sock_flag(meta_sk, SOCK_KEEPOPEN)) {
-+ inet_csk_reset_keepalive_timer(meta_sk,
-+ keepalive_time_when(tcp_sk(meta_sk)));
-+ sock_reset_flag(master_sk, SOCK_KEEPOPEN);
-+ inet_csk_delete_keepalive_timer(master_sk);
-+ }
-+
-+ /* Do not propagate subflow-errors up to the MPTCP-layer */
-+ inet_sk(master_sk)->recverr = 0;
-+}
-+
-+static void mptcp_sub_inherit_sockopts(const struct sock *meta_sk, struct sock *sub_sk)
-+{
-+ /* IP_TOS also goes to the subflow. */
-+ if (inet_sk(sub_sk)->tos != inet_sk(meta_sk)->tos) {
-+ inet_sk(sub_sk)->tos = inet_sk(meta_sk)->tos;
-+ sub_sk->sk_priority = meta_sk->sk_priority;
-+ sk_dst_reset(sub_sk);
-+ }
-+
-+ /* Inherit SO_REUSEADDR */
-+ sub_sk->sk_reuse = meta_sk->sk_reuse;
-+
-+ /* Inherit snd/rcv-buffer locks */
-+ sub_sk->sk_userlocks = meta_sk->sk_userlocks & ~SOCK_BINDPORT_LOCK;
-+
-+ /* Nagle/Cork is forced off on the subflows. It is handled at the meta-layer */
-+ tcp_sk(sub_sk)->nonagle = TCP_NAGLE_OFF|TCP_NAGLE_PUSH;
-+
-+ /* Keepalives are handled entirely at the MPTCP-layer */
-+ if (sock_flag(sub_sk, SOCK_KEEPOPEN)) {
-+ sock_reset_flag(sub_sk, SOCK_KEEPOPEN);
-+ inet_csk_delete_keepalive_timer(sub_sk);
-+ }
-+
-+ /* Do not propagate subflow-errors up to the MPTCP-layer */
-+ inet_sk(sub_sk)->recverr = 0;
-+}
-+
-+int mptcp_backlog_rcv(struct sock *meta_sk, struct sk_buff *skb)
-+{
-+ /* skb->sk may be NULL if we receive a packet immediately after the
-+ * SYN/ACK + MP_CAPABLE.
-+ */
-+ struct sock *sk = skb->sk ? skb->sk : meta_sk;
-+ int ret = 0;
-+
-+ skb->sk = NULL;
-+
-+ if (unlikely(!atomic_inc_not_zero(&sk->sk_refcnt))) {
-+ kfree_skb(skb);
-+ return 0;
-+ }
-+
-+ if (sk->sk_family == AF_INET)
-+ ret = tcp_v4_do_rcv(sk, skb);
-+#if IS_ENABLED(CONFIG_IPV6)
-+ else
-+ ret = tcp_v6_do_rcv(sk, skb);
-+#endif
-+
-+ sock_put(sk);
-+ return ret;
-+}
-+
-+struct lock_class_key meta_key;
-+struct lock_class_key meta_slock_key;
-+
-+static void mptcp_synack_timer_handler(unsigned long data)
-+{
-+ struct sock *meta_sk = (struct sock *) data;
-+ struct listen_sock *lopt = inet_csk(meta_sk)->icsk_accept_queue.listen_opt;
-+
-+ /* Only process if socket is not in use. */
-+ bh_lock_sock(meta_sk);
-+
-+ if (sock_owned_by_user(meta_sk)) {
-+ /* Try again later. */
-+ mptcp_reset_synack_timer(meta_sk, HZ/20);
-+ goto out;
-+ }
-+
-+ /* May happen if the queue got destructed in mptcp_close */
-+ if (!lopt)
-+ goto out;
-+
-+ inet_csk_reqsk_queue_prune(meta_sk, TCP_SYNQ_INTERVAL,
-+ TCP_TIMEOUT_INIT, TCP_RTO_MAX);
-+
-+ if (lopt->qlen)
-+ mptcp_reset_synack_timer(meta_sk, TCP_SYNQ_INTERVAL);
-+
-+out:
-+ bh_unlock_sock(meta_sk);
-+ sock_put(meta_sk);
-+}
-+
-+static const struct tcp_sock_ops mptcp_meta_specific = {
-+ .__select_window = __mptcp_select_window,
-+ .select_window = mptcp_select_window,
-+ .select_initial_window = mptcp_select_initial_window,
-+ .init_buffer_space = mptcp_init_buffer_space,
-+ .set_rto = mptcp_tcp_set_rto,
-+ .should_expand_sndbuf = mptcp_should_expand_sndbuf,
-+ .init_congestion_control = mptcp_init_congestion_control,
-+ .send_fin = mptcp_send_fin,
-+ .write_xmit = mptcp_write_xmit,
-+ .send_active_reset = mptcp_send_active_reset,
-+ .write_wakeup = mptcp_write_wakeup,
-+ .prune_ofo_queue = mptcp_prune_ofo_queue,
-+ .retransmit_timer = mptcp_retransmit_timer,
-+ .time_wait = mptcp_time_wait,
-+ .cleanup_rbuf = mptcp_cleanup_rbuf,
-+};
-+
-+static const struct tcp_sock_ops mptcp_sub_specific = {
-+ .__select_window = __mptcp_select_window,
-+ .select_window = mptcp_select_window,
-+ .select_initial_window = mptcp_select_initial_window,
-+ .init_buffer_space = mptcp_init_buffer_space,
-+ .set_rto = mptcp_tcp_set_rto,
-+ .should_expand_sndbuf = mptcp_should_expand_sndbuf,
-+ .init_congestion_control = mptcp_init_congestion_control,
-+ .send_fin = tcp_send_fin,
-+ .write_xmit = tcp_write_xmit,
-+ .send_active_reset = tcp_send_active_reset,
-+ .write_wakeup = tcp_write_wakeup,
-+ .prune_ofo_queue = tcp_prune_ofo_queue,
-+ .retransmit_timer = tcp_retransmit_timer,
-+ .time_wait = tcp_time_wait,
-+ .cleanup_rbuf = tcp_cleanup_rbuf,
-+};
-+
-+static int mptcp_alloc_mpcb(struct sock *meta_sk, __u64 remote_key, u32 window)
-+{
-+ struct mptcp_cb *mpcb;
-+ struct sock *master_sk;
-+ struct inet_connection_sock *master_icsk, *meta_icsk = inet_csk(meta_sk);
-+ struct tcp_sock *master_tp, *meta_tp = tcp_sk(meta_sk);
-+ u64 idsn;
-+
-+ dst_release(meta_sk->sk_rx_dst);
-+ meta_sk->sk_rx_dst = NULL;
-+ /* This flag is set to announce sock_lock_init to
-+ * reclassify the lock-class of the master socket.
-+ */
-+ meta_tp->is_master_sk = 1;
-+ master_sk = sk_clone_lock(meta_sk, GFP_ATOMIC | __GFP_ZERO);
-+ meta_tp->is_master_sk = 0;
-+ if (!master_sk)
-+ return -ENOBUFS;
-+
-+ master_tp = tcp_sk(master_sk);
-+ master_icsk = inet_csk(master_sk);
-+
-+ mpcb = kmem_cache_zalloc(mptcp_cb_cache, GFP_ATOMIC);
-+ if (!mpcb) {
-+ /* sk_free (and __sk_free) requires wmem_alloc to be 1.
-+ * All the rest is set to 0 thanks to __GFP_ZERO above.
-+ */
-+ atomic_set(&master_sk->sk_wmem_alloc, 1);
-+ sk_free(master_sk);
-+ return -ENOBUFS;
-+ }
-+
-+#if IS_ENABLED(CONFIG_IPV6)
-+ if (meta_icsk->icsk_af_ops == &mptcp_v6_mapped) {
-+ struct ipv6_pinfo *newnp, *np = inet6_sk(meta_sk);
-+
-+ inet_sk(master_sk)->pinet6 = &((struct tcp6_sock *)master_sk)->inet6;
-+
-+ newnp = inet6_sk(master_sk);
-+ memcpy(newnp, np, sizeof(struct ipv6_pinfo));
-+
-+ newnp->ipv6_mc_list = NULL;
-+ newnp->ipv6_ac_list = NULL;
-+ newnp->ipv6_fl_list = NULL;
-+ newnp->opt = NULL;
-+ newnp->pktoptions = NULL;
-+ (void)xchg(&newnp->rxpmtu, NULL);
-+ } else if (meta_sk->sk_family == AF_INET6) {
-+ struct ipv6_pinfo *newnp, *np = inet6_sk(meta_sk);
-+
-+ inet_sk(master_sk)->pinet6 = &((struct tcp6_sock *)master_sk)->inet6;
-+
-+ newnp = inet6_sk(master_sk);
-+ memcpy(newnp, np, sizeof(struct ipv6_pinfo));
-+
-+ newnp->hop_limit = -1;
-+ newnp->mcast_hops = IPV6_DEFAULT_MCASTHOPS;
-+ newnp->mc_loop = 1;
-+ newnp->pmtudisc = IPV6_PMTUDISC_WANT;
-+ newnp->ipv6only = sock_net(master_sk)->ipv6.sysctl.bindv6only;
-+ }
-+#endif
-+
-+ meta_tp->mptcp = NULL;
-+
-+ /* Store the keys and generate the peer's token */
-+ mpcb->mptcp_loc_key = meta_tp->mptcp_loc_key;
-+ mpcb->mptcp_loc_token = meta_tp->mptcp_loc_token;
-+
-+ /* Generate Initial data-sequence-numbers */
-+ mptcp_key_sha1(mpcb->mptcp_loc_key, NULL, &idsn);
-+ idsn = ntohll(idsn) + 1;
-+ mpcb->snd_high_order[0] = idsn >> 32;
-+ mpcb->snd_high_order[1] = mpcb->snd_high_order[0] - 1;
-+
-+ meta_tp->write_seq = (u32)idsn;
-+ meta_tp->snd_sml = meta_tp->write_seq;
-+ meta_tp->snd_una = meta_tp->write_seq;
-+ meta_tp->snd_nxt = meta_tp->write_seq;
-+ meta_tp->pushed_seq = meta_tp->write_seq;
-+ meta_tp->snd_up = meta_tp->write_seq;
-+
-+ mpcb->mptcp_rem_key = remote_key;
-+ mptcp_key_sha1(mpcb->mptcp_rem_key, &mpcb->mptcp_rem_token, &idsn);
-+ idsn = ntohll(idsn) + 1;
-+ mpcb->rcv_high_order[0] = idsn >> 32;
-+ mpcb->rcv_high_order[1] = mpcb->rcv_high_order[0] + 1;
-+ meta_tp->copied_seq = (u32) idsn;
-+ meta_tp->rcv_nxt = (u32) idsn;
-+ meta_tp->rcv_wup = (u32) idsn;
-+
-+ meta_tp->snd_wl1 = meta_tp->rcv_nxt - 1;
-+ meta_tp->snd_wnd = window;
-+ meta_tp->retrans_stamp = 0; /* Set in tcp_connect() */
-+
-+ meta_tp->packets_out = 0;
-+ meta_icsk->icsk_probes_out = 0;
-+
-+ /* Set mptcp-pointers */
-+ master_tp->mpcb = mpcb;
-+ master_tp->meta_sk = meta_sk;
-+ meta_tp->mpcb = mpcb;
-+ meta_tp->meta_sk = meta_sk;
-+ mpcb->meta_sk = meta_sk;
-+ mpcb->master_sk = master_sk;
-+
-+ meta_tp->was_meta_sk = 0;
-+
-+ /* Initialize the queues */
-+ skb_queue_head_init(&mpcb->reinject_queue);
-+ skb_queue_head_init(&master_tp->out_of_order_queue);
-+ tcp_prequeue_init(master_tp);
-+ INIT_LIST_HEAD(&master_tp->tsq_node);
-+
-+ master_tp->tsq_flags = 0;
-+
-+ mutex_init(&mpcb->mpcb_mutex);
-+
-+ /* Init the accept_queue structure, we support a queue of 32 pending
-+ * connections, it does not need to be huge, since we only store here
-+ * pending subflow creations.
-+ */
-+ if (reqsk_queue_alloc(&meta_icsk->icsk_accept_queue, 32, GFP_ATOMIC)) {
-+ inet_put_port(master_sk);
-+ kmem_cache_free(mptcp_cb_cache, mpcb);
-+ sk_free(master_sk);
-+ return -ENOMEM;
-+ }
-+
-+ /* Redefine function-pointers as the meta-sk is now fully ready */
-+ static_key_slow_inc(&mptcp_static_key);
-+ meta_tp->mpc = 1;
-+ meta_tp->ops = &mptcp_meta_specific;
-+
-+ meta_sk->sk_backlog_rcv = mptcp_backlog_rcv;
-+ meta_sk->sk_destruct = mptcp_sock_destruct;
-+
-+ /* Meta-level retransmit timer */
-+ meta_icsk->icsk_rto *= 2; /* Double the initial RTO */
-+
-+ tcp_init_xmit_timers(master_sk);
-+ /* Has been set for sending out the SYN */
-+ inet_csk_clear_xmit_timer(meta_sk, ICSK_TIME_RETRANS);
-+
-+ if (!meta_tp->inside_tk_table) {
-+ /* Adding the meta_tp in the token hashtable - coming from server-side */
-+ rcu_read_lock();
-+ spin_lock(&mptcp_tk_hashlock);
-+
-+ __mptcp_hash_insert(meta_tp, mpcb->mptcp_loc_token);
-+
-+ spin_unlock(&mptcp_tk_hashlock);
-+ rcu_read_unlock();
-+ }
-+ master_tp->inside_tk_table = 0;
-+
-+ /* Init time-wait stuff */
-+ INIT_LIST_HEAD(&mpcb->tw_list);
-+ spin_lock_init(&mpcb->tw_lock);
-+
-+ INIT_HLIST_HEAD(&mpcb->callback_list);
-+
-+ mptcp_mpcb_inherit_sockopts(meta_sk, master_sk);
-+
-+ mpcb->orig_sk_rcvbuf = meta_sk->sk_rcvbuf;
-+ mpcb->orig_sk_sndbuf = meta_sk->sk_sndbuf;
-+ mpcb->orig_window_clamp = meta_tp->window_clamp;
-+
-+ /* The meta is directly linked - set refcnt to 1 */
-+ atomic_set(&mpcb->mpcb_refcnt, 1);
-+
-+ mptcp_init_path_manager(mpcb);
-+ mptcp_init_scheduler(mpcb);
-+
-+ setup_timer(&mpcb->synack_timer, mptcp_synack_timer_handler,
-+ (unsigned long)meta_sk);
-+
-+ mptcp_debug("%s: created mpcb with token %#x\n",
-+ __func__, mpcb->mptcp_loc_token);
-+
-+ return 0;
-+}
-+
-+void mptcp_fallback_meta_sk(struct sock *meta_sk)
-+{
-+ kfree(inet_csk(meta_sk)->icsk_accept_queue.listen_opt);
-+ kmem_cache_free(mptcp_cb_cache, tcp_sk(meta_sk)->mpcb);
-+}
-+
-+int mptcp_add_sock(struct sock *meta_sk, struct sock *sk, u8 loc_id, u8 rem_id,
-+ gfp_t flags)
-+{
-+ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
-+ struct tcp_sock *tp = tcp_sk(sk);
-+
-+ tp->mptcp = kmem_cache_zalloc(mptcp_sock_cache, flags);
-+ if (!tp->mptcp)
-+ return -ENOMEM;
-+
-+ tp->mptcp->path_index = mptcp_set_new_pathindex(mpcb);
-+ /* No more space for more subflows? */
-+ if (!tp->mptcp->path_index) {
-+ kmem_cache_free(mptcp_sock_cache, tp->mptcp);
-+ return -EPERM;
-+ }
-+
-+ INIT_HLIST_NODE(&tp->mptcp->cb_list);
-+
-+ tp->mptcp->tp = tp;
-+ tp->mpcb = mpcb;
-+ tp->meta_sk = meta_sk;
-+
-+ static_key_slow_inc(&mptcp_static_key);
-+ tp->mpc = 1;
-+ tp->ops = &mptcp_sub_specific;
-+
-+ tp->mptcp->loc_id = loc_id;
-+ tp->mptcp->rem_id = rem_id;
-+ if (mpcb->sched_ops->init)
-+ mpcb->sched_ops->init(sk);
-+
-+ /* The corresponding sock_put is in mptcp_sock_destruct(). It cannot be
-+ * included in mptcp_del_sock(), because the mpcb must remain alive
-+ * until the last subsocket is completely destroyed.
-+ */
-+ sock_hold(meta_sk);
-+ atomic_inc(&mpcb->mpcb_refcnt);
-+
-+ tp->mptcp->next = mpcb->connection_list;
-+ mpcb->connection_list = tp;
-+ tp->mptcp->attached = 1;
-+
-+ mpcb->cnt_subflows++;
-+ atomic_add(atomic_read(&((struct sock *)tp)->sk_rmem_alloc),
-+ &meta_sk->sk_rmem_alloc);
-+
-+ mptcp_sub_inherit_sockopts(meta_sk, sk);
-+ INIT_DELAYED_WORK(&tp->mptcp->work, mptcp_sub_close_wq);
-+
-+ /* As we successfully allocated the mptcp_tcp_sock, we have to
-+ * change the function-pointers here (for sk_destruct to work correctly)
-+ */
-+ sk->sk_error_report = mptcp_sock_def_error_report;
-+ sk->sk_data_ready = mptcp_data_ready;
-+ sk->sk_write_space = mptcp_write_space;
-+ sk->sk_state_change = mptcp_set_state;
-+ sk->sk_destruct = mptcp_sock_destruct;
-+
-+ if (sk->sk_family == AF_INET)
-+ mptcp_debug("%s: token %#x pi %d, src_addr:%pI4:%d dst_addr:%pI4:%d, cnt_subflows now %d\n",
-+ __func__ , mpcb->mptcp_loc_token,
-+ tp->mptcp->path_index,
-+ &((struct inet_sock *)tp)->inet_saddr,
-+ ntohs(((struct inet_sock *)tp)->inet_sport),
-+ &((struct inet_sock *)tp)->inet_daddr,
-+ ntohs(((struct inet_sock *)tp)->inet_dport),
-+ mpcb->cnt_subflows);
-+#if IS_ENABLED(CONFIG_IPV6)
-+ else
-+ mptcp_debug("%s: token %#x pi %d, src_addr:%pI6:%d dst_addr:%pI6:%d, cnt_subflows now %d\n",
-+ __func__ , mpcb->mptcp_loc_token,
-+ tp->mptcp->path_index, &inet6_sk(sk)->saddr,
-+ ntohs(((struct inet_sock *)tp)->inet_sport),
-+ &sk->sk_v6_daddr,
-+ ntohs(((struct inet_sock *)tp)->inet_dport),
-+ mpcb->cnt_subflows);
-+#endif
-+
-+ return 0;
-+}
-+
-+void mptcp_del_sock(struct sock *sk)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk), *tp_prev;
-+ struct mptcp_cb *mpcb;
-+
-+ if (!tp->mptcp || !tp->mptcp->attached)
-+ return;
-+
-+ mpcb = tp->mpcb;
-+ tp_prev = mpcb->connection_list;
-+
-+ mptcp_debug("%s: Removing subsock tok %#x pi:%d state %d is_meta? %d\n",
-+ __func__, mpcb->mptcp_loc_token, tp->mptcp->path_index,
-+ sk->sk_state, is_meta_sk(sk));
-+
-+ if (tp_prev == tp) {
-+ mpcb->connection_list = tp->mptcp->next;
-+ } else {
-+ for (; tp_prev && tp_prev->mptcp->next; tp_prev = tp_prev->mptcp->next) {
-+ if (tp_prev->mptcp->next == tp) {
-+ tp_prev->mptcp->next = tp->mptcp->next;
-+ break;
-+ }
-+ }
-+ }
-+ mpcb->cnt_subflows--;
-+ if (tp->mptcp->establish_increased)
-+ mpcb->cnt_established--;
-+
-+ tp->mptcp->next = NULL;
-+ tp->mptcp->attached = 0;
-+ mpcb->path_index_bits &= ~(1 << tp->mptcp->path_index);
-+
-+ if (!skb_queue_empty(&sk->sk_write_queue))
-+ mptcp_reinject_data(sk, 0);
-+
-+ if (is_master_tp(tp))
-+ mpcb->master_sk = NULL;
-+ else if (tp->mptcp->pre_established)
-+ sk_stop_timer(sk, &tp->mptcp->mptcp_ack_timer);
-+
-+ rcu_assign_pointer(inet_sk(sk)->inet_opt, NULL);
-+}
-+
-+/* Updates the metasocket ULID/port data, based on the given sock.
-+ * The argument sock must be the sock accessible to the application.
-+ * In this function, we update the meta socket info, based on the changes
-+ * in the application socket (bind, address allocation, ...)
-+ */
-+void mptcp_update_metasocket(struct sock *sk, const struct sock *meta_sk)
-+{
-+ if (tcp_sk(sk)->mpcb->pm_ops->new_session)
-+ tcp_sk(sk)->mpcb->pm_ops->new_session(meta_sk);
-+
-+ tcp_sk(sk)->mptcp->send_mp_prio = tcp_sk(sk)->mptcp->low_prio;
-+}
-+
-+/* Clean up the receive buffer for full frames taken by the user,
-+ * then send an ACK if necessary. COPIED is the number of bytes
-+ * tcp_recvmsg has given to the user so far, it speeds up the
-+ * calculation of whether or not we must ACK for the sake of
-+ * a window update.
-+ */
-+void mptcp_cleanup_rbuf(struct sock *meta_sk, int copied)
-+{
-+ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
-+ struct sock *sk;
-+ __u32 rcv_window_now = 0;
-+
-+ if (copied > 0 && !(meta_sk->sk_shutdown & RCV_SHUTDOWN)) {
-+ rcv_window_now = tcp_receive_window(meta_tp);
-+
-+ if (2 * rcv_window_now > meta_tp->window_clamp)
-+ rcv_window_now = 0;
-+ }
-+
-+ mptcp_for_each_sk(meta_tp->mpcb, sk) {
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ const struct inet_connection_sock *icsk = inet_csk(sk);
-+
-+ if (!mptcp_sk_can_send_ack(sk))
-+ continue;
-+
-+ if (!inet_csk_ack_scheduled(sk))
-+ goto second_part;
-+ /* Delayed ACKs frequently hit locked sockets during bulk
-+ * receive.
-+ */
-+ if (icsk->icsk_ack.blocked ||
-+ /* Once-per-two-segments ACK was not sent by tcp_input.c */
-+ tp->rcv_nxt - tp->rcv_wup > icsk->icsk_ack.rcv_mss ||
-+ /* If this read emptied read buffer, we send ACK, if
-+ * connection is not bidirectional, user drained
-+ * receive buffer and there was a small segment
-+ * in queue.
-+ */
-+ (copied > 0 &&
-+ ((icsk->icsk_ack.pending & ICSK_ACK_PUSHED2) ||
-+ ((icsk->icsk_ack.pending & ICSK_ACK_PUSHED) &&
-+ !icsk->icsk_ack.pingpong)) &&
-+ !atomic_read(&meta_sk->sk_rmem_alloc))) {
-+ tcp_send_ack(sk);
-+ continue;
-+ }
-+
-+second_part:
-+ /* This here is the second part of tcp_cleanup_rbuf */
-+ if (rcv_window_now) {
-+ __u32 new_window = tp->ops->__select_window(sk);
-+
-+ /* Send ACK now, if this read freed lots of space
-+ * in our buffer. Certainly, new_window is new window.
-+ * We can advertise it now, if it is not less than
-+ * current one.
-+ * "Lots" means "at least twice" here.
-+ */
-+ if (new_window && new_window >= 2 * rcv_window_now)
-+ tcp_send_ack(sk);
-+ }
-+ }
-+}
-+
-+static int mptcp_sub_send_fin(struct sock *sk)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ struct sk_buff *skb = tcp_write_queue_tail(sk);
-+ int mss_now;
-+
-+ /* Optimization, tack on the FIN if we have a queue of
-+ * unsent frames. But be careful about outgoing SACKS
-+ * and IP options.
-+ */
-+ mss_now = tcp_current_mss(sk);
-+
-+ if (tcp_send_head(sk) != NULL) {
-+ TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_FIN;
-+ TCP_SKB_CB(skb)->end_seq++;
-+ tp->write_seq++;
-+ } else {
-+ skb = alloc_skb_fclone(MAX_TCP_HEADER, GFP_ATOMIC);
-+ if (!skb)
-+ return 1;
-+
-+ /* Reserve space for headers and prepare control bits. */
-+ skb_reserve(skb, MAX_TCP_HEADER);
-+ /* FIN eats a sequence byte, write_seq advanced by tcp_queue_skb(). */
-+ tcp_init_nondata_skb(skb, tp->write_seq,
-+ TCPHDR_ACK | TCPHDR_FIN);
-+ tcp_queue_skb(sk, skb);
-+ }
-+ __tcp_push_pending_frames(sk, mss_now, TCP_NAGLE_OFF);
-+
-+ return 0;
-+}
-+
-+void mptcp_sub_close_wq(struct work_struct *work)
-+{
-+ struct tcp_sock *tp = container_of(work, struct mptcp_tcp_sock, work.work)->tp;
-+ struct sock *sk = (struct sock *)tp;
-+ struct sock *meta_sk = mptcp_meta_sk(sk);
-+
-+ mutex_lock(&tp->mpcb->mpcb_mutex);
-+ lock_sock_nested(meta_sk, SINGLE_DEPTH_NESTING);
-+
-+ if (sock_flag(sk, SOCK_DEAD))
-+ goto exit;
-+
-+ /* We come from tcp_disconnect. We are sure that meta_sk is set */
-+ if (!mptcp(tp)) {
-+ tp->closing = 1;
-+ sock_rps_reset_flow(sk);
-+ tcp_close(sk, 0);
-+ goto exit;
-+ }
-+
-+ if (meta_sk->sk_shutdown == SHUTDOWN_MASK || sk->sk_state == TCP_CLOSE) {
-+ tp->closing = 1;
-+ sock_rps_reset_flow(sk);
-+ tcp_close(sk, 0);
-+ } else if (tcp_close_state(sk)) {
-+ sk->sk_shutdown |= SEND_SHUTDOWN;
-+ tcp_send_fin(sk);
-+ }
-+
-+exit:
-+ release_sock(meta_sk);
-+ mutex_unlock(&tp->mpcb->mpcb_mutex);
-+ sock_put(sk);
-+}
-+
-+void mptcp_sub_close(struct sock *sk, unsigned long delay)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ struct delayed_work *work = &tcp_sk(sk)->mptcp->work;
-+
-+ /* We are already closing - e.g., call from sock_def_error_report upon
-+ * tcp_disconnect in tcp_close.
-+ */
-+ if (tp->closing)
-+ return;
-+
-+ /* Work already scheduled? */
-+ if (work_pending(&work->work)) {
-+ /* Work present - who will be first? */
-+ if (jiffies + delay > work->timer.expires)
-+ return;
-+
-+ /* Try canceling - if it fails, work will be executed soon */
-+ if (!cancel_delayed_work(work))
-+ return;
-+ sock_put(sk);
-+ }
-+
-+ if (!delay) {
-+ unsigned char old_state = sk->sk_state;
-+
-+ /* If we are in user-context we can directly do the closing
-+ * procedure. No need to schedule a work-queue.
-+ */
-+ if (!in_softirq()) {
-+ if (sock_flag(sk, SOCK_DEAD))
-+ return;
-+
-+ if (!mptcp(tp)) {
-+ tp->closing = 1;
-+ sock_rps_reset_flow(sk);
-+ tcp_close(sk, 0);
-+ return;
-+ }
-+
-+ if (mptcp_meta_sk(sk)->sk_shutdown == SHUTDOWN_MASK ||
-+ sk->sk_state == TCP_CLOSE) {
-+ tp->closing = 1;
-+ sock_rps_reset_flow(sk);
-+ tcp_close(sk, 0);
-+ } else if (tcp_close_state(sk)) {
-+ sk->sk_shutdown |= SEND_SHUTDOWN;
-+ tcp_send_fin(sk);
-+ }
-+
-+ return;
-+ }
-+
-+ /* We directly send the FIN, because it may take a long time
-+ * until the work-queue gets scheduled...
-+ *
-+ * If mptcp_sub_send_fin returns 1, it failed and thus we reset
-+ * the old state so that tcp_close will finally send the fin
-+ * in user-context.
-+ */
-+ if (!sk->sk_err && old_state != TCP_CLOSE &&
-+ tcp_close_state(sk) && mptcp_sub_send_fin(sk)) {
-+ if (old_state == TCP_ESTABLISHED)
-+ TCP_INC_STATS(sock_net(sk), TCP_MIB_CURRESTAB);
-+ sk->sk_state = old_state;
-+ }
-+ }
-+
-+ sock_hold(sk);
-+ queue_delayed_work(mptcp_wq, work, delay);
-+}
-+
-+void mptcp_sub_force_close(struct sock *sk)
-+{
-+ /* The below tcp_done may have freed the socket, if it is already dead.
-+ * Thus, we are not allowed to access it afterwards. That's why
-+ * we have to store the dead-state in this local variable.
-+ */
-+ int sock_is_dead = sock_flag(sk, SOCK_DEAD);
-+
-+ tcp_sk(sk)->mp_killed = 1;
-+
-+ if (sk->sk_state != TCP_CLOSE)
-+ tcp_done(sk);
-+
-+ if (!sock_is_dead)
-+ mptcp_sub_close(sk, 0);
-+}
-+EXPORT_SYMBOL(mptcp_sub_force_close);
-+
-+/* Update the mpcb send window, based on the contributions
-+ * of each subflow
-+ */
-+void mptcp_update_sndbuf(const struct tcp_sock *tp)
-+{
-+ struct sock *meta_sk = tp->meta_sk, *sk;
-+ int new_sndbuf = 0, old_sndbuf = meta_sk->sk_sndbuf;
-+
-+ mptcp_for_each_sk(tp->mpcb, sk) {
-+ if (!mptcp_sk_can_send(sk))
-+ continue;
-+
-+ new_sndbuf += sk->sk_sndbuf;
-+
-+ if (new_sndbuf > sysctl_tcp_wmem[2] || new_sndbuf < 0) {
-+ new_sndbuf = sysctl_tcp_wmem[2];
-+ break;
-+ }
-+ }
-+ meta_sk->sk_sndbuf = max(min(new_sndbuf, sysctl_tcp_wmem[2]), meta_sk->sk_sndbuf);
-+
-+ /* The subflow's call to sk_write_space in tcp_new_space ends up in
-+ * mptcp_write_space.
-+ * It has nothing to do with waking up the application.
-+ * So, we do it here.
-+ */
-+ if (old_sndbuf != meta_sk->sk_sndbuf)
-+ meta_sk->sk_write_space(meta_sk);
-+}
-+
-+void mptcp_close(struct sock *meta_sk, long timeout)
-+{
-+ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
-+ struct sock *sk_it, *tmpsk;
-+ struct mptcp_cb *mpcb = meta_tp->mpcb;
-+ struct sk_buff *skb;
-+ int data_was_unread = 0;
-+ int state;
-+
-+ mptcp_debug("%s: Close of meta_sk with tok %#x\n",
-+ __func__, mpcb->mptcp_loc_token);
-+
-+ mutex_lock(&mpcb->mpcb_mutex);
-+ lock_sock(meta_sk);
-+
-+ if (meta_tp->inside_tk_table) {
-+ /* Detach the mpcb from the token hashtable */
-+ mptcp_hash_remove_bh(meta_tp);
-+ reqsk_queue_destroy(&inet_csk(meta_sk)->icsk_accept_queue);
-+ }
-+
-+ meta_sk->sk_shutdown = SHUTDOWN_MASK;
-+ /* We need to flush the recv. buffs. We do this only on the
-+ * descriptor close, not protocol-sourced closes, because the
-+ * reader process may not have drained the data yet!
-+ */
-+ while ((skb = __skb_dequeue(&meta_sk->sk_receive_queue)) != NULL) {
-+ u32 len = TCP_SKB_CB(skb)->end_seq - TCP_SKB_CB(skb)->seq -
-+ tcp_hdr(skb)->fin;
-+ data_was_unread += len;
-+ __kfree_skb(skb);
-+ }
-+
-+ sk_mem_reclaim(meta_sk);
-+
-+ /* If socket has been already reset (e.g. in tcp_reset()) - kill it. */
-+ if (meta_sk->sk_state == TCP_CLOSE) {
-+ mptcp_for_each_sk_safe(mpcb, sk_it, tmpsk) {
-+ if (tcp_sk(sk_it)->send_mp_fclose)
-+ continue;
-+ mptcp_sub_close(sk_it, 0);
-+ }
-+ goto adjudge_to_death;
-+ }
-+
-+ if (data_was_unread) {
-+ /* Unread data was tossed, zap the connection. */
-+ NET_INC_STATS_USER(sock_net(meta_sk), LINUX_MIB_TCPABORTONCLOSE);
-+ tcp_set_state(meta_sk, TCP_CLOSE);
-+ tcp_sk(meta_sk)->ops->send_active_reset(meta_sk,
-+ meta_sk->sk_allocation);
-+ } else if (sock_flag(meta_sk, SOCK_LINGER) && !meta_sk->sk_lingertime) {
-+ /* Check zero linger _after_ checking for unread data. */
-+ meta_sk->sk_prot->disconnect(meta_sk, 0);
-+ NET_INC_STATS_USER(sock_net(meta_sk), LINUX_MIB_TCPABORTONDATA);
-+ } else if (tcp_close_state(meta_sk)) {
-+ mptcp_send_fin(meta_sk);
-+ } else if (meta_tp->snd_una == meta_tp->write_seq) {
-+ /* The DATA_FIN has been sent and acknowledged
-+ * (e.g., by sk_shutdown). Close all the other subflows
-+ */
-+ mptcp_for_each_sk_safe(mpcb, sk_it, tmpsk) {
-+ unsigned long delay = 0;
-+ /* If we are the passive closer, don't trigger
-+ * subflow-fin until the subflow has been finned
-+ * by the peer. - thus we add a delay
-+ */
-+ if (mpcb->passive_close &&
-+ sk_it->sk_state == TCP_ESTABLISHED)
-+ delay = inet_csk(sk_it)->icsk_rto << 3;
-+
-+ mptcp_sub_close(sk_it, delay);
-+ }
-+ }
-+
-+ sk_stream_wait_close(meta_sk, timeout);
-+
-+adjudge_to_death:
-+ state = meta_sk->sk_state;
-+ sock_hold(meta_sk);
-+ sock_orphan(meta_sk);
-+
-+ /* socket will be freed after mptcp_close - we have to prevent
-+ * access from the subflows.
-+ */
-+ mptcp_for_each_sk(mpcb, sk_it) {
-+ /* Similar to sock_orphan, but we don't set it DEAD, because
-+ * the callbacks are still set and must be called.
-+ */
-+ write_lock_bh(&sk_it->sk_callback_lock);
-+ sk_set_socket(sk_it, NULL);
-+ sk_it->sk_wq = NULL;
-+ write_unlock_bh(&sk_it->sk_callback_lock);
-+ }
-+
-+ /* It is the last release_sock in its life. It will remove backlog. */
-+ release_sock(meta_sk);
-+
-+ /* Now socket is owned by kernel and we acquire BH lock
-+ * to finish close. No need to check for user refs.
-+ */
-+ local_bh_disable();
-+ bh_lock_sock(meta_sk);
-+ WARN_ON(sock_owned_by_user(meta_sk));
-+
-+ percpu_counter_inc(meta_sk->sk_prot->orphan_count);
-+
-+ /* Have we already been destroyed by a softirq or backlog? */
-+ if (state != TCP_CLOSE && meta_sk->sk_state == TCP_CLOSE)
-+ goto out;
-+
-+ /* This is a (useful) BSD violating of the RFC. There is a
-+ * problem with TCP as specified in that the other end could
-+ * keep a socket open forever with no application left this end.
-+ * We use a 3 minute timeout (about the same as BSD) then kill
-+ * our end. If they send after that then tough - BUT: long enough
-+ * that we won't make the old 4*rto = almost no time - whoops
-+ * reset mistake.
-+ *
-+ * Nope, it was not mistake. It is really desired behaviour
-+ * f.e. on http servers, when such sockets are useless, but
-+ * consume significant resources. Let's do it with special
-+ * linger2 option. --ANK
-+ */
-+
-+ if (meta_sk->sk_state == TCP_FIN_WAIT2) {
-+ if (meta_tp->linger2 < 0) {
-+ tcp_set_state(meta_sk, TCP_CLOSE);
-+ meta_tp->ops->send_active_reset(meta_sk, GFP_ATOMIC);
-+ NET_INC_STATS_BH(sock_net(meta_sk),
-+ LINUX_MIB_TCPABORTONLINGER);
-+ } else {
-+ const int tmo = tcp_fin_time(meta_sk);
-+
-+ if (tmo > TCP_TIMEWAIT_LEN) {
-+ inet_csk_reset_keepalive_timer(meta_sk,
-+ tmo - TCP_TIMEWAIT_LEN);
-+ } else {
-+ meta_tp->ops->time_wait(meta_sk, TCP_FIN_WAIT2,
-+ tmo);
-+ goto out;
-+ }
-+ }
-+ }
-+ if (meta_sk->sk_state != TCP_CLOSE) {
-+ sk_mem_reclaim(meta_sk);
-+ if (tcp_too_many_orphans(meta_sk, 0)) {
-+ if (net_ratelimit())
-+ pr_info("MPTCP: too many orphaned sockets\n");
-+ tcp_set_state(meta_sk, TCP_CLOSE);
-+ meta_tp->ops->send_active_reset(meta_sk, GFP_ATOMIC);
-+ NET_INC_STATS_BH(sock_net(meta_sk),
-+ LINUX_MIB_TCPABORTONMEMORY);
-+ }
-+ }
-+
-+ if (meta_sk->sk_state == TCP_CLOSE)
-+ inet_csk_destroy_sock(meta_sk);
-+ /* Otherwise, socket is reprieved until protocol close. */
-+
-+out:
-+ bh_unlock_sock(meta_sk);
-+ local_bh_enable();
-+ mutex_unlock(&mpcb->mpcb_mutex);
-+ sock_put(meta_sk); /* Taken by sock_hold */
-+}
-+
-+void mptcp_disconnect(struct sock *sk)
-+{
-+ struct sock *subsk, *tmpsk;
-+ struct tcp_sock *tp = tcp_sk(sk);
-+
-+ mptcp_delete_synack_timer(sk);
-+
-+ __skb_queue_purge(&tp->mpcb->reinject_queue);
-+
-+ if (tp->inside_tk_table) {
-+ mptcp_hash_remove_bh(tp);
-+ reqsk_queue_destroy(&inet_csk(tp->meta_sk)->icsk_accept_queue);
-+ }
-+
-+ local_bh_disable();
-+ mptcp_for_each_sk_safe(tp->mpcb, subsk, tmpsk) {
-+ /* The socket will get removed from the subsocket-list
-+ * and made non-mptcp by setting mpc to 0.
-+ *
-+ * This is necessary, because tcp_disconnect assumes
-+ * that the connection is completely dead afterwards.
-+ * Thus we need to do an mptcp_del_sock. Due to this call
-+ * we have to make it non-mptcp.
-+ *
-+ * We have to lock the socket, because we set mpc to 0.
-+ * An incoming packet would take the subsocket's lock
-+ * and go on into the receive-path.
-+ * This would be a race.
-+ */
-+
-+ bh_lock_sock(subsk);
-+ mptcp_del_sock(subsk);
-+ tcp_sk(subsk)->mpc = 0;
-+ tcp_sk(subsk)->ops = &tcp_specific;
-+ mptcp_sub_force_close(subsk);
-+ bh_unlock_sock(subsk);
-+ }
-+ local_bh_enable();
-+
-+ tp->was_meta_sk = 1;
-+ tp->mpc = 0;
-+ tp->ops = &tcp_specific;
-+}
-+
-+/* Returns 1 if we should enable MPTCP for that socket. */
-+int mptcp_doit(struct sock *sk)
-+{
-+ /* Do not allow MPTCP enabling if the MPTCP initialization failed */
-+ if (mptcp_init_failed)
-+ return 0;
-+
-+ if (sysctl_mptcp_enabled == MPTCP_APP && !tcp_sk(sk)->mptcp_enabled)
-+ return 0;
-+
-+ /* Socket may already be established (e.g., called from tcp_recvmsg) */
-+ if (mptcp(tcp_sk(sk)) || tcp_sk(sk)->request_mptcp)
-+ return 1;
-+
-+ /* Don't do mptcp over loopback */
-+ if (sk->sk_family == AF_INET &&
-+ (ipv4_is_loopback(inet_sk(sk)->inet_daddr) ||
-+ ipv4_is_loopback(inet_sk(sk)->inet_saddr)))
-+ return 0;
-+#if IS_ENABLED(CONFIG_IPV6)
-+ if (sk->sk_family == AF_INET6 &&
-+ (ipv6_addr_loopback(&sk->sk_v6_daddr) ||
-+ ipv6_addr_loopback(&inet6_sk(sk)->saddr)))
-+ return 0;
-+#endif
-+ if (mptcp_v6_is_v4_mapped(sk) &&
-+ ipv4_is_loopback(inet_sk(sk)->inet_saddr))
-+ return 0;
-+
-+#ifdef CONFIG_TCP_MD5SIG
-+ /* If TCP_MD5SIG is enabled, do not do MPTCP - there is no Option-Space */
-+ if (tcp_sk(sk)->af_specific->md5_lookup(sk, sk))
-+ return 0;
-+#endif
-+
-+ return 1;
-+}
-+
-+int mptcp_create_master_sk(struct sock *meta_sk, __u64 remote_key, u32 window)
-+{
-+ struct tcp_sock *master_tp;
-+ struct sock *master_sk;
-+
-+ if (mptcp_alloc_mpcb(meta_sk, remote_key, window))
-+ goto err_alloc_mpcb;
-+
-+ master_sk = tcp_sk(meta_sk)->mpcb->master_sk;
-+ master_tp = tcp_sk(master_sk);
-+
-+ if (mptcp_add_sock(meta_sk, master_sk, 0, 0, GFP_ATOMIC))
-+ goto err_add_sock;
-+
-+ if (__inet_inherit_port(meta_sk, master_sk) < 0)
-+ goto err_add_sock;
-+
-+ meta_sk->sk_prot->unhash(meta_sk);
-+
-+ if (master_sk->sk_family == AF_INET || mptcp_v6_is_v4_mapped(master_sk))
-+ __inet_hash_nolisten(master_sk, NULL);
-+#if IS_ENABLED(CONFIG_IPV6)
-+ else
-+ __inet6_hash(master_sk, NULL);
-+#endif
-+
-+ master_tp->mptcp->init_rcv_wnd = master_tp->rcv_wnd;
-+
-+ return 0;
-+
-+err_add_sock:
-+ mptcp_fallback_meta_sk(meta_sk);
-+
-+ inet_csk_prepare_forced_close(master_sk);
-+ tcp_done(master_sk);
-+ inet_csk_prepare_forced_close(meta_sk);
-+ tcp_done(meta_sk);
-+
-+err_alloc_mpcb:
-+ return -ENOBUFS;
-+}
-+
-+static int __mptcp_check_req_master(struct sock *child,
-+ struct request_sock *req)
-+{
-+ struct tcp_sock *child_tp = tcp_sk(child);
-+ struct sock *meta_sk = child;
-+ struct mptcp_cb *mpcb;
-+ struct mptcp_request_sock *mtreq;
-+
-+ /* Never contained an MP_CAPABLE */
-+ if (!inet_rsk(req)->mptcp_rqsk)
-+ return 1;
-+
-+ if (!inet_rsk(req)->saw_mpc) {
-+ /* Fallback to regular TCP, because we saw one SYN without
-+ * MP_CAPABLE. In tcp_check_req we continue the regular path.
-+ * But, the socket has been added to the reqsk_tk_htb, so we
-+ * must still remove it.
-+ */
-+ mptcp_reqsk_remove_tk(req);
-+ return 1;
-+ }
-+
-+ /* Just set these values to pass them to mptcp_alloc_mpcb */
-+ mtreq = mptcp_rsk(req);
-+ child_tp->mptcp_loc_key = mtreq->mptcp_loc_key;
-+ child_tp->mptcp_loc_token = mtreq->mptcp_loc_token;
-+
-+ if (mptcp_create_master_sk(meta_sk, mtreq->mptcp_rem_key,
-+ child_tp->snd_wnd))
-+ return -ENOBUFS;
-+
-+ child = tcp_sk(child)->mpcb->master_sk;
-+ child_tp = tcp_sk(child);
-+ mpcb = child_tp->mpcb;
-+
-+ child_tp->mptcp->snt_isn = tcp_rsk(req)->snt_isn;
-+ child_tp->mptcp->rcv_isn = tcp_rsk(req)->rcv_isn;
-+
-+ mpcb->dss_csum = mtreq->dss_csum;
-+ mpcb->server_side = 1;
-+
-+ /* Will be moved to ESTABLISHED by tcp_rcv_state_process() */
-+ mptcp_update_metasocket(child, meta_sk);
-+
-+ /* Needs to be done here additionally, because when accepting a
-+ * new connection we go through __reqsk_free and not reqsk_free.
-+ */
-+ mptcp_reqsk_remove_tk(req);
-+
-+ /* Hold when creating the meta-sk in tcp_vX_syn_recv_sock. */
-+ sock_put(meta_sk);
-+
-+ return 0;
-+}
-+
-+int mptcp_check_req_fastopen(struct sock *child, struct request_sock *req)
-+{
-+ struct sock *meta_sk = child, *master_sk;
-+ struct sk_buff *skb;
-+ u32 new_mapping;
-+ int ret;
-+
-+ ret = __mptcp_check_req_master(child, req);
-+ if (ret)
-+ return ret;
-+
-+ master_sk = tcp_sk(meta_sk)->mpcb->master_sk;
-+
-+ /* We need to rewind copied_seq as it is set to IDSN + 1 and as we have
-+ * pre-MPTCP data in the receive queue.
-+ */
-+ tcp_sk(meta_sk)->copied_seq -= tcp_sk(master_sk)->rcv_nxt -
-+ tcp_rsk(req)->rcv_isn - 1;
-+
-+ /* Map subflow sequence number to data sequence numbers. We need to map
-+ * these data to [IDSN - len - 1, IDSN).
-+ */
-+ new_mapping = tcp_sk(meta_sk)->copied_seq - tcp_rsk(req)->rcv_isn - 1;
-+
-+ /* There should be only one skb: the SYN + data. */
-+ skb_queue_walk(&meta_sk->sk_receive_queue, skb) {
-+ TCP_SKB_CB(skb)->seq += new_mapping;
-+ TCP_SKB_CB(skb)->end_seq += new_mapping;
-+ }
-+
-+ /* With fastopen we change the semantics of the relative subflow
-+ * sequence numbers to deal with middleboxes that could add/remove
-+ * multiple bytes in the SYN. We chose to start counting at rcv_nxt - 1
-+ * instead of the regular TCP ISN.
-+ */
-+ tcp_sk(master_sk)->mptcp->rcv_isn = tcp_sk(master_sk)->rcv_nxt - 1;
-+
-+ /* We need to update copied_seq of the master_sk to account for the
-+ * already moved data to the meta receive queue.
-+ */
-+ tcp_sk(master_sk)->copied_seq = tcp_sk(master_sk)->rcv_nxt;
-+
-+ /* Handled by the master_sk */
-+ tcp_sk(meta_sk)->fastopen_rsk = NULL;
-+
-+ return 0;
-+}
-+
-+int mptcp_check_req_master(struct sock *sk, struct sock *child,
-+ struct request_sock *req,
-+ struct request_sock **prev)
-+{
-+ struct sock *meta_sk = child;
-+ int ret;
-+
-+ ret = __mptcp_check_req_master(child, req);
-+ if (ret)
-+ return ret;
-+
-+ inet_csk_reqsk_queue_unlink(sk, req, prev);
-+ inet_csk_reqsk_queue_removed(sk, req);
-+ inet_csk_reqsk_queue_add(sk, req, meta_sk);
-+
-+ return 0;
-+}
-+
-+struct sock *mptcp_check_req_child(struct sock *meta_sk, struct sock *child,
-+ struct request_sock *req,
-+ struct request_sock **prev,
-+ const struct mptcp_options_received *mopt)
-+{
-+ struct tcp_sock *child_tp = tcp_sk(child);
-+ struct mptcp_request_sock *mtreq = mptcp_rsk(req);
-+ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
-+ u8 hash_mac_check[20];
-+
-+ child_tp->inside_tk_table = 0;
-+
-+ if (!mopt->join_ack)
-+ goto teardown;
-+
-+ mptcp_hmac_sha1((u8 *)&mpcb->mptcp_rem_key,
-+ (u8 *)&mpcb->mptcp_loc_key,
-+ (u8 *)&mtreq->mptcp_rem_nonce,
-+ (u8 *)&mtreq->mptcp_loc_nonce,
-+ (u32 *)hash_mac_check);
-+
-+ if (memcmp(hash_mac_check, (char *)&mopt->mptcp_recv_mac, 20))
-+ goto teardown;
-+
-+ /* Point it to the same struct socket and wq as the meta_sk */
-+ sk_set_socket(child, meta_sk->sk_socket);
-+ child->sk_wq = meta_sk->sk_wq;
-+
-+ if (mptcp_add_sock(meta_sk, child, mtreq->loc_id, mtreq->rem_id, GFP_ATOMIC)) {
-+ /* Has been inherited, but now child_tp->mptcp is NULL */
-+ child_tp->mpc = 0;
-+ child_tp->ops = &tcp_specific;
-+
-+ /* TODO when we support acking the third ack for new subflows,
-+ * we should silently discard this third ack, by returning NULL.
-+ *
-+ * Maybe, at the retransmission we will have enough memory to
-+ * fully add the socket to the meta-sk.
-+ */
-+ goto teardown;
-+ }
-+
-+ /* The child is a clone of the meta socket, we must now reset
-+ * some of the fields
-+ */
-+ child_tp->mptcp->rcv_low_prio = mtreq->rcv_low_prio;
-+
-+ /* We should allow proper increase of the snd/rcv-buffers. Thus, we
-+ * use the original values instead of the bloated up ones from the
-+ * clone.
-+ */
-+ child->sk_sndbuf = mpcb->orig_sk_sndbuf;
-+ child->sk_rcvbuf = mpcb->orig_sk_rcvbuf;
-+
-+ child_tp->mptcp->slave_sk = 1;
-+ child_tp->mptcp->snt_isn = tcp_rsk(req)->snt_isn;
-+ child_tp->mptcp->rcv_isn = tcp_rsk(req)->rcv_isn;
-+ child_tp->mptcp->init_rcv_wnd = req->rcv_wnd;
-+
-+ child_tp->tsq_flags = 0;
-+
-+ /* Subflows do not use the accept queue, as they
-+ * are attached immediately to the mpcb.
-+ */
-+ inet_csk_reqsk_queue_unlink(meta_sk, req, prev);
-+ reqsk_queue_removed(&inet_csk(meta_sk)->icsk_accept_queue, req);
-+ reqsk_free(req);
-+ return child;
-+
-+teardown:
-+ /* Drop this request - sock creation failed. */
-+ inet_csk_reqsk_queue_unlink(meta_sk, req, prev);
-+ reqsk_queue_removed(&inet_csk(meta_sk)->icsk_accept_queue, req);
-+ reqsk_free(req);
-+ inet_csk_prepare_forced_close(child);
-+ tcp_done(child);
-+ return meta_sk;
-+}
-+
-+int mptcp_init_tw_sock(struct sock *sk, struct tcp_timewait_sock *tw)
-+{
-+ struct mptcp_tw *mptw;
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ struct mptcp_cb *mpcb = tp->mpcb;
-+
-+ /* A subsocket in tw can only receive data. So, if we are in
-+ * infinite-receive, then we should not reply with a data-ack or act
-+ * upon general MPTCP-signaling. We prevent this by simply not creating
-+ * the mptcp_tw_sock.
-+ */
-+ if (mpcb->infinite_mapping_rcv) {
-+ tw->mptcp_tw = NULL;
-+ return 0;
-+ }
-+
-+ /* Alloc MPTCP-tw-sock */
-+ mptw = kmem_cache_alloc(mptcp_tw_cache, GFP_ATOMIC);
-+ if (!mptw)
-+ return -ENOBUFS;
-+
-+ atomic_inc(&mpcb->mpcb_refcnt);
-+
-+ tw->mptcp_tw = mptw;
-+ mptw->loc_key = mpcb->mptcp_loc_key;
-+ mptw->meta_tw = mpcb->in_time_wait;
-+ if (mptw->meta_tw) {
-+ mptw->rcv_nxt = mptcp_get_rcv_nxt_64(mptcp_meta_tp(tp));
-+ if (mpcb->mptw_state != TCP_TIME_WAIT)
-+ mptw->rcv_nxt++;
-+ }
-+ rcu_assign_pointer(mptw->mpcb, mpcb);
-+
-+ spin_lock(&mpcb->tw_lock);
-+ list_add_rcu(&mptw->list, &tp->mpcb->tw_list);
-+ mptw->in_list = 1;
-+ spin_unlock(&mpcb->tw_lock);
-+
-+ return 0;
-+}
-+
-+void mptcp_twsk_destructor(struct tcp_timewait_sock *tw)
-+{
-+ struct mptcp_cb *mpcb;
-+
-+ rcu_read_lock();
-+ mpcb = rcu_dereference(tw->mptcp_tw->mpcb);
-+
-+ /* If we are still holding a ref to the mpcb, we have to remove ourselves
-+ * from the list and drop the ref properly.
-+ */
-+ if (mpcb && atomic_inc_not_zero(&mpcb->mpcb_refcnt)) {
-+ spin_lock(&mpcb->tw_lock);
-+ if (tw->mptcp_tw->in_list) {
-+ list_del_rcu(&tw->mptcp_tw->list);
-+ tw->mptcp_tw->in_list = 0;
-+ }
-+ spin_unlock(&mpcb->tw_lock);
-+
-+ /* Twice, because we increased it above */
-+ mptcp_mpcb_put(mpcb);
-+ mptcp_mpcb_put(mpcb);
-+ }
-+
-+ rcu_read_unlock();
-+
-+ kmem_cache_free(mptcp_tw_cache, tw->mptcp_tw);
-+}
-+
-+/* Updates the rcv_nxt of the time-wait-socks and allows them to ack a
-+ * data-fin.
-+ */
-+void mptcp_time_wait(struct sock *sk, int state, int timeo)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ struct mptcp_tw *mptw;
-+
-+ /* Used for sockets that go into tw after the meta
-+ * (see mptcp_init_tw_sock())
-+ */
-+ tp->mpcb->in_time_wait = 1;
-+ tp->mpcb->mptw_state = state;
-+
-+ /* Update the time-wait-sock's information */
-+ rcu_read_lock_bh();
-+ list_for_each_entry_rcu(mptw, &tp->mpcb->tw_list, list) {
-+ mptw->meta_tw = 1;
-+ mptw->rcv_nxt = mptcp_get_rcv_nxt_64(tp);
-+
-+ /* We want to ack a DATA_FIN, but are still in FIN_WAIT_2 -
-+ * pretend as if the DATA_FIN has already reached us; that way
-+ * the checks in tcp_timewait_state_process will pass when the
-+ * DATA_FIN comes in.
-+ */
-+ if (state != TCP_TIME_WAIT)
-+ mptw->rcv_nxt++;
-+ }
-+ rcu_read_unlock_bh();
-+
-+ tcp_done(sk);
-+}
-+
-+void mptcp_tsq_flags(struct sock *sk)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ struct sock *meta_sk = mptcp_meta_sk(sk);
-+
-+ /* It will be handled as a regular deferred-call */
-+ if (is_meta_sk(sk))
-+ return;
-+
-+ if (hlist_unhashed(&tp->mptcp->cb_list)) {
-+ hlist_add_head(&tp->mptcp->cb_list, &tp->mpcb->callback_list);
-+ /* We need to hold it here, as the sock_hold is not assured
-+ * by the release_sock as it is done in regular TCP.
-+ *
-+ * The subsocket may get inet_csk_destroy'd while it is inside
-+ * the callback_list.
-+ */
-+ sock_hold(sk);
-+ }
-+
-+ if (!test_and_set_bit(MPTCP_SUB_DEFERRED, &tcp_sk(meta_sk)->tsq_flags))
-+ sock_hold(meta_sk);
-+}
-+
-+void mptcp_tsq_sub_deferred(struct sock *meta_sk)
-+{
-+ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
-+ struct mptcp_tcp_sock *mptcp;
-+ struct hlist_node *tmp;
-+
-+ BUG_ON(!is_meta_sk(meta_sk) && !meta_tp->was_meta_sk);
-+
-+ __sock_put(meta_sk);
-+ hlist_for_each_entry_safe(mptcp, tmp, &meta_tp->mpcb->callback_list, cb_list) {
-+ struct tcp_sock *tp = mptcp->tp;
-+ struct sock *sk = (struct sock *)tp;
-+
-+ hlist_del_init(&mptcp->cb_list);
-+ sk->sk_prot->release_cb(sk);
-+ /* Final sock_put (cf. mptcp_tsq_flags) */
-+ sock_put(sk);
-+ }
-+}
-+
-+void mptcp_join_reqsk_init(struct mptcp_cb *mpcb, const struct request_sock *req,
-+ struct sk_buff *skb)
-+{
-+ struct mptcp_request_sock *mtreq = mptcp_rsk(req);
-+ struct mptcp_options_received mopt;
-+ u8 mptcp_hash_mac[20];
-+
-+ mptcp_init_mp_opt(&mopt);
-+ tcp_parse_mptcp_options(skb, &mopt);
-+
-+ mtreq->mptcp_mpcb = mpcb;
-+ mtreq->is_sub = 1;
-+ inet_rsk(req)->mptcp_rqsk = 1;
-+
-+ mtreq->mptcp_rem_nonce = mopt.mptcp_recv_nonce;
-+
-+ mptcp_hmac_sha1((u8 *)&mpcb->mptcp_loc_key,
-+ (u8 *)&mpcb->mptcp_rem_key,
-+ (u8 *)&mtreq->mptcp_loc_nonce,
-+ (u8 *)&mtreq->mptcp_rem_nonce, (u32 *)mptcp_hash_mac);
-+ mtreq->mptcp_hash_tmac = *(u64 *)mptcp_hash_mac;
-+
-+ mtreq->rem_id = mopt.rem_id;
-+ mtreq->rcv_low_prio = mopt.low_prio;
-+ inet_rsk(req)->saw_mpc = 1;
-+}
-+
-+void mptcp_reqsk_init(struct request_sock *req, const struct sk_buff *skb)
-+{
-+ struct mptcp_options_received mopt;
-+ struct mptcp_request_sock *mreq = mptcp_rsk(req);
-+
-+ mptcp_init_mp_opt(&mopt);
-+ tcp_parse_mptcp_options(skb, &mopt);
-+
-+ mreq->is_sub = 0;
-+ inet_rsk(req)->mptcp_rqsk = 1;
-+ mreq->dss_csum = mopt.dss_csum;
-+ mreq->hash_entry.pprev = NULL;
-+
-+ mptcp_reqsk_new_mptcp(req, &mopt, skb);
-+}
-+
-+int mptcp_conn_request(struct sock *sk, struct sk_buff *skb)
-+{
-+ struct mptcp_options_received mopt;
-+ const struct tcp_sock *tp = tcp_sk(sk);
-+ __u32 isn = TCP_SKB_CB(skb)->when;
-+ bool want_cookie = false;
-+
-+ if ((sysctl_tcp_syncookies == 2 ||
-+ inet_csk_reqsk_queue_is_full(sk)) && !isn) {
-+ want_cookie = tcp_syn_flood_action(sk, skb,
-+ mptcp_request_sock_ops.slab_name);
-+ if (!want_cookie)
-+ goto drop;
-+ }
-+
-+ mptcp_init_mp_opt(&mopt);
-+ tcp_parse_mptcp_options(skb, &mopt);
-+
-+ if (mopt.is_mp_join)
-+ return mptcp_do_join_short(skb, &mopt, sock_net(sk));
-+ if (mopt.drop_me)
-+ goto drop;
-+
-+ if (sysctl_mptcp_enabled == MPTCP_APP && !tp->mptcp_enabled)
-+ mopt.saw_mpc = 0;
-+
-+ if (skb->protocol == htons(ETH_P_IP)) {
-+ if (mopt.saw_mpc && !want_cookie) {
-+ if (skb_rtable(skb)->rt_flags &
-+ (RTCF_BROADCAST | RTCF_MULTICAST))
-+ goto drop;
-+
-+ return tcp_conn_request(&mptcp_request_sock_ops,
-+ &mptcp_request_sock_ipv4_ops,
-+ sk, skb);
-+ }
-+
-+ return tcp_v4_conn_request(sk, skb);
-+#if IS_ENABLED(CONFIG_IPV6)
-+ } else {
-+ if (mopt.saw_mpc && !want_cookie) {
-+ if (!ipv6_unicast_destination(skb))
-+ goto drop;
-+
-+ return tcp_conn_request(&mptcp6_request_sock_ops,
-+ &mptcp_request_sock_ipv6_ops,
-+ sk, skb);
-+ }
-+
-+ return tcp_v6_conn_request(sk, skb);
-+#endif
-+ }
-+drop:
-+ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENDROPS);
-+ return 0;
-+}
-+
-+struct workqueue_struct *mptcp_wq;
-+EXPORT_SYMBOL(mptcp_wq);
-+
-+/* Output /proc/net/mptcp */
-+static int mptcp_pm_seq_show(struct seq_file *seq, void *v)
-+{
-+ struct tcp_sock *meta_tp;
-+ const struct net *net = seq->private;
-+ int i, n = 0;
-+
-+ seq_printf(seq, " sl loc_tok rem_tok v6 local_address remote_address st ns tx_queue rx_queue inode");
-+ seq_putc(seq, '\n');
-+
-+ for (i = 0; i < MPTCP_HASH_SIZE; i++) {
-+ struct hlist_nulls_node *node;
-+ rcu_read_lock_bh();
-+ hlist_nulls_for_each_entry_rcu(meta_tp, node,
-+ &tk_hashtable[i], tk_table) {
-+ struct mptcp_cb *mpcb = meta_tp->mpcb;
-+ struct sock *meta_sk = (struct sock *)meta_tp;
-+ struct inet_sock *isk = inet_sk(meta_sk);
-+
-+ if (!mptcp(meta_tp) || !net_eq(net, sock_net(meta_sk)))
-+ continue;
-+
-+ if (capable(CAP_NET_ADMIN)) {
-+ seq_printf(seq, "%4d: %04X %04X ", n++,
-+ mpcb->mptcp_loc_token,
-+ mpcb->mptcp_rem_token);
-+ } else {
-+ seq_printf(seq, "%4d: %04X %04X ", n++, -1, -1);
-+ }
-+ if (meta_sk->sk_family == AF_INET ||
-+ mptcp_v6_is_v4_mapped(meta_sk)) {
-+ seq_printf(seq, " 0 %08X:%04X %08X:%04X ",
-+ isk->inet_rcv_saddr,
-+ ntohs(isk->inet_sport),
-+ isk->inet_daddr,
-+ ntohs(isk->inet_dport));
-+#if IS_ENABLED(CONFIG_IPV6)
-+ } else if (meta_sk->sk_family == AF_INET6) {
-+ struct in6_addr *src = &meta_sk->sk_v6_rcv_saddr;
-+ struct in6_addr *dst = &meta_sk->sk_v6_daddr;
-+ seq_printf(seq, " 1 %08X%08X%08X%08X:%04X %08X%08X%08X%08X:%04X",
-+ src->s6_addr32[0], src->s6_addr32[1],
-+ src->s6_addr32[2], src->s6_addr32[3],
-+ ntohs(isk->inet_sport),
-+ dst->s6_addr32[0], dst->s6_addr32[1],
-+ dst->s6_addr32[2], dst->s6_addr32[3],
-+ ntohs(isk->inet_dport));
-+#endif
-+ }
-+ seq_printf(seq, " %02X %02X %08X:%08X %lu",
-+ meta_sk->sk_state, mpcb->cnt_subflows,
-+ meta_tp->write_seq - meta_tp->snd_una,
-+ max_t(int, meta_tp->rcv_nxt -
-+ meta_tp->copied_seq, 0),
-+ sock_i_ino(meta_sk));
-+ seq_putc(seq, '\n');
-+ }
-+
-+ rcu_read_unlock_bh();
-+ }
-+
-+ return 0;
-+}
-+
-+static int mptcp_pm_seq_open(struct inode *inode, struct file *file)
-+{
-+ return single_open_net(inode, file, mptcp_pm_seq_show);
-+}
-+
-+static const struct file_operations mptcp_pm_seq_fops = {
-+ .owner = THIS_MODULE,
-+ .open = mptcp_pm_seq_open,
-+ .read = seq_read,
-+ .llseek = seq_lseek,
-+ .release = single_release_net,
-+};
-+
-+static int mptcp_pm_init_net(struct net *net)
-+{
-+ if (!proc_create("mptcp", S_IRUGO, net->proc_net, &mptcp_pm_seq_fops))
-+ return -ENOMEM;
-+
-+ return 0;
-+}
-+
-+static void mptcp_pm_exit_net(struct net *net)
-+{
-+ remove_proc_entry("mptcp", net->proc_net);
-+}
-+
-+static struct pernet_operations mptcp_pm_proc_ops = {
-+ .init = mptcp_pm_init_net,
-+ .exit = mptcp_pm_exit_net,
-+};
-+
-+/* General initialization of mptcp */
-+void __init mptcp_init(void)
-+{
-+ int i;
-+ struct ctl_table_header *mptcp_sysctl;
-+
-+ mptcp_sock_cache = kmem_cache_create("mptcp_sock",
-+ sizeof(struct mptcp_tcp_sock),
-+ 0, SLAB_HWCACHE_ALIGN,
-+ NULL);
-+ if (!mptcp_sock_cache)
-+ goto mptcp_sock_cache_failed;
-+
-+ mptcp_cb_cache = kmem_cache_create("mptcp_cb", sizeof(struct mptcp_cb),
-+ 0, SLAB_DESTROY_BY_RCU|SLAB_HWCACHE_ALIGN,
-+ NULL);
-+ if (!mptcp_cb_cache)
-+ goto mptcp_cb_cache_failed;
-+
-+ mptcp_tw_cache = kmem_cache_create("mptcp_tw", sizeof(struct mptcp_tw),
-+ 0, SLAB_DESTROY_BY_RCU|SLAB_HWCACHE_ALIGN,
-+ NULL);
-+ if (!mptcp_tw_cache)
-+ goto mptcp_tw_cache_failed;
-+
-+ get_random_bytes(mptcp_secret, sizeof(mptcp_secret));
-+
-+ mptcp_wq = alloc_workqueue("mptcp_wq", WQ_UNBOUND | WQ_MEM_RECLAIM, 8);
-+ if (!mptcp_wq)
-+ goto alloc_workqueue_failed;
-+
-+ for (i = 0; i < MPTCP_HASH_SIZE; i++) {
-+ INIT_HLIST_NULLS_HEAD(&tk_hashtable[i], i);
-+ INIT_HLIST_NULLS_HEAD(&mptcp_reqsk_htb[i],
-+ i + MPTCP_REQSK_NULLS_BASE);
-+ INIT_HLIST_NULLS_HEAD(&mptcp_reqsk_tk_htb[i], i);
-+ }
-+
-+ spin_lock_init(&mptcp_reqsk_hlock);
-+ spin_lock_init(&mptcp_tk_hashlock);
-+
-+ if (register_pernet_subsys(&mptcp_pm_proc_ops))
-+ goto pernet_failed;
-+
-+#if IS_ENABLED(CONFIG_IPV6)
-+ if (mptcp_pm_v6_init())
-+ goto mptcp_pm_v6_failed;
-+#endif
-+ if (mptcp_pm_v4_init())
-+ goto mptcp_pm_v4_failed;
-+
-+ mptcp_sysctl = register_net_sysctl(&init_net, "net/mptcp", mptcp_table);
-+ if (!mptcp_sysctl)
-+ goto register_sysctl_failed;
-+
-+ if (mptcp_register_path_manager(&mptcp_pm_default))
-+ goto register_pm_failed;
-+
-+ if (mptcp_register_scheduler(&mptcp_sched_default))
-+ goto register_sched_failed;
-+
-+ pr_info("MPTCP: Stable release v0.89.0-rc\n");
-+
-+ mptcp_init_failed = false;
-+
-+ return;
-+
-+register_sched_failed:
-+ mptcp_unregister_path_manager(&mptcp_pm_default);
-+register_pm_failed:
-+ unregister_net_sysctl_table(mptcp_sysctl);
-+register_sysctl_failed:
-+ mptcp_pm_v4_undo();
-+mptcp_pm_v4_failed:
-+#if IS_ENABLED(CONFIG_IPV6)
-+ mptcp_pm_v6_undo();
-+mptcp_pm_v6_failed:
-+#endif
-+ unregister_pernet_subsys(&mptcp_pm_proc_ops);
-+pernet_failed:
-+ destroy_workqueue(mptcp_wq);
-+alloc_workqueue_failed:
-+ kmem_cache_destroy(mptcp_tw_cache);
-+mptcp_tw_cache_failed:
-+ kmem_cache_destroy(mptcp_cb_cache);
-+mptcp_cb_cache_failed:
-+ kmem_cache_destroy(mptcp_sock_cache);
-+mptcp_sock_cache_failed:
-+ mptcp_init_failed = true;
-+}
-diff --git a/net/mptcp/mptcp_fullmesh.c b/net/mptcp/mptcp_fullmesh.c
-new file mode 100644
-index 000000000000..3a54413ce25b
---- /dev/null
-+++ b/net/mptcp/mptcp_fullmesh.c
-@@ -0,0 +1,1722 @@
-+#include <linux/module.h>
-+
-+#include <net/mptcp.h>
-+#include <net/mptcp_v4.h>
-+
-+#if IS_ENABLED(CONFIG_IPV6)
-+#include <net/mptcp_v6.h>
-+#include <net/addrconf.h>
-+#endif
-+
-+enum {
-+ MPTCP_EVENT_ADD = 1,
-+ MPTCP_EVENT_DEL,
-+ MPTCP_EVENT_MOD,
-+};
-+
-+#define MPTCP_SUBFLOW_RETRY_DELAY 1000
-+
-+/* Max number of local or remote addresses we can store.
-+ * When changing, see the bitfield below in fullmesh_rem4/6.
-+ */
-+#define MPTCP_MAX_ADDR 8
-+
-+struct fullmesh_rem4 {
-+ u8 rem4_id;
-+ u8 bitfield;
-+ u8 retry_bitfield;
-+ __be16 port;
-+ struct in_addr addr;
-+};
-+
-+struct fullmesh_rem6 {
-+ u8 rem6_id;
-+ u8 bitfield;
-+ u8 retry_bitfield;
-+ __be16 port;
-+ struct in6_addr addr;
-+};
-+
-+struct mptcp_loc_addr {
-+ struct mptcp_loc4 locaddr4[MPTCP_MAX_ADDR];
-+ u8 loc4_bits;
-+ u8 next_v4_index;
-+
-+ struct mptcp_loc6 locaddr6[MPTCP_MAX_ADDR];
-+ u8 loc6_bits;
-+ u8 next_v6_index;
-+};
-+
-+struct mptcp_addr_event {
-+ struct list_head list;
-+ unsigned short family;
-+ u8 code:7,
-+ low_prio:1;
-+ union inet_addr addr;
-+};
-+
-+struct fullmesh_priv {
-+ /* Worker struct for subflow establishment */
-+ struct work_struct subflow_work;
-+ /* Delayed worker, used when the routing-tables are not yet ready. */
-+ struct delayed_work subflow_retry_work;
-+
-+ /* Remote addresses */
-+ struct fullmesh_rem4 remaddr4[MPTCP_MAX_ADDR];
-+ struct fullmesh_rem6 remaddr6[MPTCP_MAX_ADDR];
-+
-+ struct mptcp_cb *mpcb;
-+
-+ u16 remove_addrs; /* Addresses to remove */
-+ u8 announced_addrs_v4; /* IPv4 addresses we have announced */
-+ u8 announced_addrs_v6; /* IPv6 addresses we have announced */
-+
-+ u8 add_addr; /* Are we sending an add_addr? */
-+
-+ u8 rem4_bits;
-+ u8 rem6_bits;
-+};
-+
-+struct mptcp_fm_ns {
-+ struct mptcp_loc_addr __rcu *local;
-+ spinlock_t local_lock; /* Protecting the above pointer */
-+ struct list_head events;
-+ struct delayed_work address_worker;
-+
-+ struct net *net;
-+};
-+
-+static struct mptcp_pm_ops full_mesh __read_mostly;
-+
-+static void full_mesh_create_subflows(struct sock *meta_sk);
-+
-+static struct mptcp_fm_ns *fm_get_ns(const struct net *net)
-+{
-+ return (struct mptcp_fm_ns *)net->mptcp.path_managers[MPTCP_PM_FULLMESH];
-+}
-+
-+static struct fullmesh_priv *fullmesh_get_priv(const struct mptcp_cb *mpcb)
-+{
-+ return (struct fullmesh_priv *)&mpcb->mptcp_pm[0];
-+}
-+
-+/* Find the first free index in the bitfield */
-+static int __mptcp_find_free_index(u8 bitfield, u8 base)
-+{
-+ int i;
-+
-+ /* There are anyway no free bits... */
-+ if (bitfield == 0xff)
-+ goto exit;
-+
-+ i = ffs(~(bitfield >> base)) - 1;
-+ if (i < 0)
-+ goto exit;
-+
-+ /* No free bits when starting at base, try from 0 on */
-+ if (i + base >= sizeof(bitfield) * 8)
-+ return __mptcp_find_free_index(bitfield, 0);
-+
-+ return i + base;
-+exit:
-+ return -1;
-+}
-+
-+static int mptcp_find_free_index(u8 bitfield)
-+{
-+ return __mptcp_find_free_index(bitfield, 0);
-+}
-+
-+static void mptcp_addv4_raddr(struct mptcp_cb *mpcb,
-+ const struct in_addr *addr,
-+ __be16 port, u8 id)
-+{
-+ int i;
-+ struct fullmesh_rem4 *rem4;
-+ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
-+
-+ mptcp_for_each_bit_set(fmp->rem4_bits, i) {
-+ rem4 = &fmp->remaddr4[i];
-+
-+ /* Address is already in the list --- continue */
-+ if (rem4->rem4_id == id &&
-+ rem4->addr.s_addr == addr->s_addr && rem4->port == port)
-+ return;
-+
-+ /* This may be the case when the peer is behind a NAT. It is
-+ * trying to JOIN, thus sending the JOIN with a certain ID.
-+ * However, the src_addr of the IP-packet has been changed. We
-+ * update the addr in the list, because this is the address as
-+ * OUR BOX sees it.
-+ */
-+ if (rem4->rem4_id == id && rem4->addr.s_addr != addr->s_addr) {
-+ /* update the address */
-+ mptcp_debug("%s: updating old addr:%pI4 to addr %pI4 with id:%d\n",
-+ __func__, &rem4->addr.s_addr,
-+ &addr->s_addr, id);
-+ rem4->addr.s_addr = addr->s_addr;
-+ rem4->port = port;
-+ mpcb->list_rcvd = 1;
-+ return;
-+ }
-+ }
-+
-+ i = mptcp_find_free_index(fmp->rem4_bits);
-+ /* Do we already have the maximum number of local/remote addresses? */
-+ if (i < 0) {
-+ mptcp_debug("%s: At max num of remote addresses: %d --- not adding address: %pI4\n",
-+ __func__, MPTCP_MAX_ADDR, &addr->s_addr);
-+ return;
-+ }
-+
-+ rem4 = &fmp->remaddr4[i];
-+
-+ /* Address is not known yet, store it */
-+ rem4->addr.s_addr = addr->s_addr;
-+ rem4->port = port;
-+ rem4->bitfield = 0;
-+ rem4->retry_bitfield = 0;
-+ rem4->rem4_id = id;
-+ mpcb->list_rcvd = 1;
-+ fmp->rem4_bits |= (1 << i);
-+
-+ return;
-+}
-+
-+static void mptcp_addv6_raddr(struct mptcp_cb *mpcb,
-+ const struct in6_addr *addr,
-+ __be16 port, u8 id)
-+{
-+ int i;
-+ struct fullmesh_rem6 *rem6;
-+ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
-+
-+ mptcp_for_each_bit_set(fmp->rem6_bits, i) {
-+ rem6 = &fmp->remaddr6[i];
-+
-+ /* Address is already in the list --- continue */
-+ if (rem6->rem6_id == id &&
-+ ipv6_addr_equal(&rem6->addr, addr) && rem6->port == port)
-+ return;
-+
-+ /* This may be the case when the peer is behind a NAT. It is
-+ * trying to JOIN, thus sending the JOIN with a certain ID.
-+ * However, the src_addr of the IP-packet has been changed. We
-+ * update the addr in the list, because this is the address as
-+ * OUR BOX sees it.
-+ */
-+ if (rem6->rem6_id == id) {
-+ /* update the address */
-+ mptcp_debug("%s: updating old addr: %pI6 to addr %pI6 with id:%d\n",
-+ __func__, &rem6->addr, addr, id);
-+ rem6->addr = *addr;
-+ rem6->port = port;
-+ mpcb->list_rcvd = 1;
-+ return;
-+ }
-+ }
-+
-+ i = mptcp_find_free_index(fmp->rem6_bits);
-+ /* Do we already have the maximum number of local/remote addresses? */
-+ if (i < 0) {
-+ mptcp_debug("%s: At max num of remote addresses: %d --- not adding address: %pI6\n",
-+ __func__, MPTCP_MAX_ADDR, addr);
-+ return;
-+ }
-+
-+ rem6 = &fmp->remaddr6[i];
-+
-+ /* Address is not known yet, store it */
-+ rem6->addr = *addr;
-+ rem6->port = port;
-+ rem6->bitfield = 0;
-+ rem6->retry_bitfield = 0;
-+ rem6->rem6_id = id;
-+ mpcb->list_rcvd = 1;
-+ fmp->rem6_bits |= (1 << i);
-+
-+ return;
-+}
-+
-+static void mptcp_v4_rem_raddress(struct mptcp_cb *mpcb, u8 id)
-+{
-+ int i;
-+ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
-+
-+ mptcp_for_each_bit_set(fmp->rem4_bits, i) {
-+ if (fmp->remaddr4[i].rem4_id == id) {
-+ /* remove address from bitfield */
-+ fmp->rem4_bits &= ~(1 << i);
-+
-+ break;
-+ }
-+ }
-+}
-+
-+static void mptcp_v6_rem_raddress(const struct mptcp_cb *mpcb, u8 id)
-+{
-+ int i;
-+ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
-+
-+ mptcp_for_each_bit_set(fmp->rem6_bits, i) {
-+ if (fmp->remaddr6[i].rem6_id == id) {
-+ /* remove address from bitfield */
-+ fmp->rem6_bits &= ~(1 << i);
-+
-+ break;
-+ }
-+ }
-+}
-+
-+/* Sets the bitfield of the remote-address field */
-+static void mptcp_v4_set_init_addr_bit(const struct mptcp_cb *mpcb,
-+ const struct in_addr *addr, u8 index)
-+{
-+ int i;
-+ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
-+
-+ mptcp_for_each_bit_set(fmp->rem4_bits, i) {
-+ if (fmp->remaddr4[i].addr.s_addr == addr->s_addr) {
-+ fmp->remaddr4[i].bitfield |= (1 << index);
-+ return;
-+ }
-+ }
-+}
-+
-+/* Sets the bitfield of the remote-address field */
-+static void mptcp_v6_set_init_addr_bit(struct mptcp_cb *mpcb,
-+ const struct in6_addr *addr, u8 index)
-+{
-+ int i;
-+ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
-+
-+ mptcp_for_each_bit_set(fmp->rem6_bits, i) {
-+ if (ipv6_addr_equal(&fmp->remaddr6[i].addr, addr)) {
-+ fmp->remaddr6[i].bitfield |= (1 << index);
-+ return;
-+ }
-+ }
-+}
-+
-+static void mptcp_set_init_addr_bit(struct mptcp_cb *mpcb,
-+ const union inet_addr *addr,
-+ sa_family_t family, u8 id)
-+{
-+ if (family == AF_INET)
-+ mptcp_v4_set_init_addr_bit(mpcb, &addr->in, id);
-+ else
-+ mptcp_v6_set_init_addr_bit(mpcb, &addr->in6, id);
-+}
-+
-+static void retry_subflow_worker(struct work_struct *work)
-+{
-+ struct delayed_work *delayed_work = container_of(work,
-+ struct delayed_work,
-+ work);
-+ struct fullmesh_priv *fmp = container_of(delayed_work,
-+ struct fullmesh_priv,
-+ subflow_retry_work);
-+ struct mptcp_cb *mpcb = fmp->mpcb;
-+ struct sock *meta_sk = mpcb->meta_sk;
-+ struct mptcp_loc_addr *mptcp_local;
-+ struct mptcp_fm_ns *fm_ns = fm_get_ns(sock_net(meta_sk));
-+ int iter = 0, i;
-+
-+ /* We need a local (stable) copy of the address-list. Really, it is not
-+ * such a big deal if the address-list is not 100% up-to-date.
-+ */
-+ rcu_read_lock_bh();
-+ mptcp_local = rcu_dereference_bh(fm_ns->local);
-+ mptcp_local = kmemdup(mptcp_local, sizeof(*mptcp_local), GFP_ATOMIC);
-+ rcu_read_unlock_bh();
-+
-+ if (!mptcp_local)
-+ return;
-+
-+next_subflow:
-+ if (iter) {
-+ release_sock(meta_sk);
-+ mutex_unlock(&mpcb->mpcb_mutex);
-+
-+ cond_resched();
-+ }
-+ mutex_lock(&mpcb->mpcb_mutex);
-+ lock_sock_nested(meta_sk, SINGLE_DEPTH_NESTING);
-+
-+ iter++;
-+
-+ if (sock_flag(meta_sk, SOCK_DEAD))
-+ goto exit;
-+
-+ mptcp_for_each_bit_set(fmp->rem4_bits, i) {
-+ struct fullmesh_rem4 *rem = &fmp->remaddr4[i];
-+ /* Do we need to retry establishing a subflow? */
-+ if (rem->retry_bitfield) {
-+ int i = mptcp_find_free_index(~rem->retry_bitfield);
-+ struct mptcp_rem4 rem4;
-+
-+ rem->bitfield |= (1 << i);
-+ rem->retry_bitfield &= ~(1 << i);
-+
-+ rem4.addr = rem->addr;
-+ rem4.port = rem->port;
-+ rem4.rem4_id = rem->rem4_id;
-+
-+ mptcp_init4_subsockets(meta_sk, &mptcp_local->locaddr4[i], &rem4);
-+ goto next_subflow;
-+ }
-+ }
-+
-+#if IS_ENABLED(CONFIG_IPV6)
-+ mptcp_for_each_bit_set(fmp->rem6_bits, i) {
-+ struct fullmesh_rem6 *rem = &fmp->remaddr6[i];
-+
-+ /* Do we need to retry establishing a subflow? */
-+ if (rem->retry_bitfield) {
-+ int i = mptcp_find_free_index(~rem->retry_bitfield);
-+ struct mptcp_rem6 rem6;
-+
-+ rem->bitfield |= (1 << i);
-+ rem->retry_bitfield &= ~(1 << i);
-+
-+ rem6.addr = rem->addr;
-+ rem6.port = rem->port;
-+ rem6.rem6_id = rem->rem6_id;
-+
-+ mptcp_init6_subsockets(meta_sk, &mptcp_local->locaddr6[i], &rem6);
-+ goto next_subflow;
-+ }
-+ }
-+#endif
-+
-+exit:
-+ kfree(mptcp_local);
-+ release_sock(meta_sk);
-+ mutex_unlock(&mpcb->mpcb_mutex);
-+ sock_put(meta_sk);
-+}
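The retry worker above walks `retry_bitfield`, where `mptcp_find_free_index(~retry_bitfield)` resolves to the lowest set bit, i.e. the next address pair left to retry, and then moves that index from the retry set into the established set. A minimal user-space sketch of that bitfield bookkeeping (hypothetical helper names, 8-bit fields as in `fullmesh_rem4`):

```c
#include <assert.h>
#include <stdint.h>

/* Index of the lowest set bit, or -1 if none: this is what
 * mptcp_find_free_index(~retry_bitfield) effectively computes. */
int lowest_set_index(uint8_t bits)
{
	int i;

	for (i = 0; i < 8; i++)
		if (bits & (1u << i))
			return i;
	return -1;
}

/* rem->bitfield |= (1 << i): mark the pair as established/attempted. */
uint8_t with_bit_set(uint8_t bits, int i)
{
	return (uint8_t)(bits | (1u << i));
}

/* rem->retry_bitfield &= ~(1 << i): drop the pair from the retry set. */
uint8_t with_bit_cleared(uint8_t bits, int i)
{
	return (uint8_t)(bits & ~(1u << i));
}
```

The worker loops via `goto next_subflow` until `lowest_set_index()` finds no bit left, handling one pair per lock-hold.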
-+
-+/**
-+ * Create all new subflows by calling mptcp_initX_subsockets().
-+ *
-+ * This function uses a goto next_subflow to allow releasing the lock between
-+ * the creation of new subflows, giving other processes a chance to do some
-+ * work on the socket and potentially finish the communication.
-+ **/
-+static void create_subflow_worker(struct work_struct *work)
-+{
-+ struct fullmesh_priv *fmp = container_of(work, struct fullmesh_priv,
-+ subflow_work);
-+ struct mptcp_cb *mpcb = fmp->mpcb;
-+ struct sock *meta_sk = mpcb->meta_sk;
-+ struct mptcp_loc_addr *mptcp_local;
-+ const struct mptcp_fm_ns *fm_ns = fm_get_ns(sock_net(meta_sk));
-+ int iter = 0, retry = 0;
-+ int i;
-+
-+ /* We need a local (stable) copy of the address-list. Really, it is not
-+ * such a big deal if the address-list is not 100% up-to-date.
-+ */
-+ rcu_read_lock_bh();
-+ mptcp_local = rcu_dereference_bh(fm_ns->local);
-+ mptcp_local = kmemdup(mptcp_local, sizeof(*mptcp_local), GFP_ATOMIC);
-+ rcu_read_unlock_bh();
-+
-+ if (!mptcp_local)
-+ return;
-+
-+next_subflow:
-+ if (iter) {
-+ release_sock(meta_sk);
-+ mutex_unlock(&mpcb->mpcb_mutex);
-+
-+ cond_resched();
-+ }
-+ mutex_lock(&mpcb->mpcb_mutex);
-+ lock_sock_nested(meta_sk, SINGLE_DEPTH_NESTING);
-+
-+ iter++;
-+
-+ if (sock_flag(meta_sk, SOCK_DEAD))
-+ goto exit;
-+
-+ if (mpcb->master_sk &&
-+ !tcp_sk(mpcb->master_sk)->mptcp->fully_established)
-+ goto exit;
-+
-+ mptcp_for_each_bit_set(fmp->rem4_bits, i) {
-+ struct fullmesh_rem4 *rem;
-+ u8 remaining_bits;
-+
-+ rem = &fmp->remaddr4[i];
-+ remaining_bits = ~(rem->bitfield) & mptcp_local->loc4_bits;
-+
-+ /* Are there still combinations to handle? */
-+ if (remaining_bits) {
-+ int i = mptcp_find_free_index(~remaining_bits);
-+ struct mptcp_rem4 rem4;
-+
-+ rem->bitfield |= (1 << i);
-+
-+ rem4.addr = rem->addr;
-+ rem4.port = rem->port;
-+ rem4.rem4_id = rem->rem4_id;
-+
-+ /* If a route is not yet available then retry once */
-+ if (mptcp_init4_subsockets(meta_sk, &mptcp_local->locaddr4[i],
-+ &rem4) == -ENETUNREACH)
-+ retry = rem->retry_bitfield |= (1 << i);
-+ goto next_subflow;
-+ }
-+ }
-+
-+#if IS_ENABLED(CONFIG_IPV6)
-+ mptcp_for_each_bit_set(fmp->rem6_bits, i) {
-+ struct fullmesh_rem6 *rem;
-+ u8 remaining_bits;
-+
-+ rem = &fmp->remaddr6[i];
-+ remaining_bits = ~(rem->bitfield) & mptcp_local->loc6_bits;
-+
-+ /* Are there still combinations to handle? */
-+ if (remaining_bits) {
-+ int i = mptcp_find_free_index(~remaining_bits);
-+ struct mptcp_rem6 rem6;
-+
-+ rem->bitfield |= (1 << i);
-+
-+ rem6.addr = rem->addr;
-+ rem6.port = rem->port;
-+ rem6.rem6_id = rem->rem6_id;
-+
-+ /* If a route is not yet available then retry once */
-+ if (mptcp_init6_subsockets(meta_sk, &mptcp_local->locaddr6[i],
-+ &rem6) == -ENETUNREACH)
-+ retry = rem->retry_bitfield |= (1 << i);
-+ goto next_subflow;
-+ }
-+ }
-+#endif
-+
-+ if (retry && !delayed_work_pending(&fmp->subflow_retry_work)) {
-+ sock_hold(meta_sk);
-+ queue_delayed_work(mptcp_wq, &fmp->subflow_retry_work,
-+ msecs_to_jiffies(MPTCP_SUBFLOW_RETRY_DELAY));
-+ }
-+
-+exit:
-+ kfree(mptcp_local);
-+ release_sock(meta_sk);
-+ mutex_unlock(&mpcb->mpcb_mutex);
-+ sock_put(meta_sk);
-+}
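The "combinations to handle" test above is `~(rem->bitfield) & mptcp_local->loc4_bits`: a bit survives iff the local address exists and no subflow towards this remote address uses it yet. As a standalone sketch:

```c
#include <stdint.h>

/* Remaining (local, remote) combinations for one remote address:
 * set iff the local slot is populated (loc_bits) and not yet used
 * towards this peer (~bitfield). */
uint8_t remaining_combinations(uint8_t bitfield, uint8_t loc_bits)
{
	return (uint8_t)(~bitfield & loc_bits);
}
```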
-+
-+static void announce_remove_addr(u8 addr_id, struct sock *meta_sk)
-+{
-+ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
-+ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
-+ struct sock *sk = mptcp_select_ack_sock(meta_sk);
-+
-+ fmp->remove_addrs |= (1 << addr_id);
-+ mpcb->addr_signal = 1;
-+
-+ if (sk)
-+ tcp_send_ack(sk);
-+}
-+
-+static void update_addr_bitfields(struct sock *meta_sk,
-+ const struct mptcp_loc_addr *mptcp_local)
-+{
-+ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
-+ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
-+ int i;
-+
-+ /* The bits in announced_addrs_* always match with loc*_bits. So, a
-+ * simple & operation unsets the correct bits, because these go from
-+ * announced to non-announced
-+ */
-+ fmp->announced_addrs_v4 &= mptcp_local->loc4_bits;
-+
-+ mptcp_for_each_bit_set(fmp->rem4_bits, i) {
-+ fmp->remaddr4[i].bitfield &= mptcp_local->loc4_bits;
-+ fmp->remaddr4[i].retry_bitfield &= mptcp_local->loc4_bits;
-+ }
-+
-+ fmp->announced_addrs_v6 &= mptcp_local->loc6_bits;
-+
-+ mptcp_for_each_bit_set(fmp->rem6_bits, i) {
-+ fmp->remaddr6[i].bitfield &= mptcp_local->loc6_bits;
-+ fmp->remaddr6[i].retry_bitfield &= mptcp_local->loc6_bits;
-+ }
-+}
-+
-+static int mptcp_find_address(const struct mptcp_loc_addr *mptcp_local,
-+ sa_family_t family, const union inet_addr *addr)
-+{
-+ int i;
-+ u8 loc_bits;
-+ bool found = false;
-+
-+ if (family == AF_INET)
-+ loc_bits = mptcp_local->loc4_bits;
-+ else
-+ loc_bits = mptcp_local->loc6_bits;
-+
-+ mptcp_for_each_bit_set(loc_bits, i) {
-+ if (family == AF_INET &&
-+ mptcp_local->locaddr4[i].addr.s_addr == addr->in.s_addr) {
-+ found = true;
-+ break;
-+ }
-+ if (family == AF_INET6 &&
-+ ipv6_addr_equal(&mptcp_local->locaddr6[i].addr,
-+ &addr->in6)) {
-+ found = true;
-+ break;
-+ }
-+ }
-+
-+ if (!found)
-+ return -1;
-+
-+ return i;
-+}
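`mptcp_find_address()` scans only the slots whose bit is set in `loc4_bits`/`loc6_bits` and returns the slot index of the matching address, or -1. A user-space sketch of the IPv4 case (addresses reduced to plain `uint32_t` values for illustration):

```c
#include <stdint.h>

/* Return the index of addr among the populated slots, or -1.
 * Slots whose bit is clear in loc_bits are skipped, mirroring
 * mptcp_for_each_bit_set() in the original. */
int find_address_index(const uint32_t *slots, uint8_t loc_bits, uint32_t addr)
{
	int i;

	for (i = 0; i < 8; i++) {
		if (!(loc_bits & (1u << i)))
			continue;
		if (slots[i] == addr)
			return i;
	}
	return -1;
}
```

Note that a stale value in an unpopulated slot is never matched, which is why the DEL path can trust the returned id.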
-+
-+static void mptcp_address_worker(struct work_struct *work)
-+{
-+ const struct delayed_work *delayed_work = container_of(work,
-+ struct delayed_work,
-+ work);
-+ struct mptcp_fm_ns *fm_ns = container_of(delayed_work,
-+ struct mptcp_fm_ns,
-+ address_worker);
-+ struct net *net = fm_ns->net;
-+ struct mptcp_addr_event *event = NULL;
-+ struct mptcp_loc_addr *mptcp_local, *old;
-+ int i, id = -1; /* id is used in the socket-code on a delete-event */
-+ bool success; /* Used to indicate if we succeeded handling the event */
-+
-+next_event:
-+ success = false;
-+ kfree(event);
-+
-+ /* First, let's dequeue an event from our event-list */
-+ rcu_read_lock_bh();
-+ spin_lock(&fm_ns->local_lock);
-+
-+ event = list_first_entry_or_null(&fm_ns->events,
-+ struct mptcp_addr_event, list);
-+ if (!event) {
-+ spin_unlock(&fm_ns->local_lock);
-+ rcu_read_unlock_bh();
-+ return;
-+ }
-+
-+ list_del(&event->list);
-+
-+ mptcp_local = rcu_dereference_bh(fm_ns->local);
-+
-+ if (event->code == MPTCP_EVENT_DEL) {
-+ id = mptcp_find_address(mptcp_local, event->family, &event->addr);
-+
-+ /* Not in the list - so we don't care */
-+ if (id < 0) {
-+ mptcp_debug("%s could not find id\n", __func__);
-+ goto duno;
-+ }
-+
-+ old = mptcp_local;
-+ mptcp_local = kmemdup(mptcp_local, sizeof(*mptcp_local),
-+ GFP_ATOMIC);
-+ if (!mptcp_local)
-+ goto duno;
-+
-+ if (event->family == AF_INET)
-+ mptcp_local->loc4_bits &= ~(1 << id);
-+ else
-+ mptcp_local->loc6_bits &= ~(1 << id);
-+
-+ rcu_assign_pointer(fm_ns->local, mptcp_local);
-+ kfree(old);
-+ } else {
-+ int i = mptcp_find_address(mptcp_local, event->family, &event->addr);
-+ int j = i;
-+
-+ if (j < 0) {
-+ /* Not in the list, so we have to find an empty slot */
-+ if (event->family == AF_INET)
-+ i = __mptcp_find_free_index(mptcp_local->loc4_bits,
-+ mptcp_local->next_v4_index);
-+ if (event->family == AF_INET6)
-+ i = __mptcp_find_free_index(mptcp_local->loc6_bits,
-+ mptcp_local->next_v6_index);
-+
-+ if (i < 0) {
-+ mptcp_debug("%s no more space\n", __func__);
-+ goto duno;
-+ }
-+
-+ /* It might have been a MOD-event. */
-+ event->code = MPTCP_EVENT_ADD;
-+ } else {
-+ /* Let's check if anything changes */
-+ if (event->family == AF_INET &&
-+ event->low_prio == mptcp_local->locaddr4[i].low_prio)
-+ goto duno;
-+
-+ if (event->family == AF_INET6 &&
-+ event->low_prio == mptcp_local->locaddr6[i].low_prio)
-+ goto duno;
-+ }
-+
-+ old = mptcp_local;
-+ mptcp_local = kmemdup(mptcp_local, sizeof(*mptcp_local),
-+ GFP_ATOMIC);
-+ if (!mptcp_local)
-+ goto duno;
-+
-+ if (event->family == AF_INET) {
-+ mptcp_local->locaddr4[i].addr.s_addr = event->addr.in.s_addr;
-+ mptcp_local->locaddr4[i].loc4_id = i + 1;
-+ mptcp_local->locaddr4[i].low_prio = event->low_prio;
-+ } else {
-+ mptcp_local->locaddr6[i].addr = event->addr.in6;
-+ mptcp_local->locaddr6[i].loc6_id = i + MPTCP_MAX_ADDR;
-+ mptcp_local->locaddr6[i].low_prio = event->low_prio;
-+ }
-+
-+ if (j < 0) {
-+ if (event->family == AF_INET) {
-+ mptcp_local->loc4_bits |= (1 << i);
-+ mptcp_local->next_v4_index = i + 1;
-+ } else {
-+ mptcp_local->loc6_bits |= (1 << i);
-+ mptcp_local->next_v6_index = i + 1;
-+ }
-+ }
-+
-+ rcu_assign_pointer(fm_ns->local, mptcp_local);
-+ kfree(old);
-+ }
-+ success = true;
-+
-+duno:
-+ spin_unlock(&fm_ns->local_lock);
-+ rcu_read_unlock_bh();
-+
-+ if (!success)
-+ goto next_event;
-+
-+ /* Now we iterate over the MPTCP-sockets and apply the event. */
-+ for (i = 0; i < MPTCP_HASH_SIZE; i++) {
-+ const struct hlist_nulls_node *node;
-+ struct tcp_sock *meta_tp;
-+
-+ rcu_read_lock_bh();
-+ hlist_nulls_for_each_entry_rcu(meta_tp, node, &tk_hashtable[i],
-+ tk_table) {
-+ struct mptcp_cb *mpcb = meta_tp->mpcb;
-+ struct sock *meta_sk = (struct sock *)meta_tp, *sk;
-+ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
-+ bool meta_v4 = meta_sk->sk_family == AF_INET;
-+
-+ if (sock_net(meta_sk) != net)
-+ continue;
-+
-+ if (meta_v4) {
-+ /* skip IPv6 events if meta is IPv4 */
-+ if (event->family == AF_INET6)
-+ continue;
-+ }
-+ /* skip IPv4 events if IPV6_V6ONLY is set */
-+ else if (event->family == AF_INET &&
-+ inet6_sk(meta_sk)->ipv6only)
-+ continue;
-+
-+ if (unlikely(!atomic_inc_not_zero(&meta_sk->sk_refcnt)))
-+ continue;
-+
-+ bh_lock_sock(meta_sk);
-+
-+ if (!mptcp(meta_tp) || !is_meta_sk(meta_sk) ||
-+ mpcb->infinite_mapping_snd ||
-+ mpcb->infinite_mapping_rcv ||
-+ mpcb->send_infinite_mapping)
-+ goto next;
-+
-+ /* The path manager may have changed in the meantime */
-+ if (mpcb->pm_ops != &full_mesh)
-+ goto next;
-+
-+ if (sock_owned_by_user(meta_sk)) {
-+ if (!test_and_set_bit(MPTCP_PATH_MANAGER,
-+ &meta_tp->tsq_flags))
-+ sock_hold(meta_sk);
-+
-+ goto next;
-+ }
-+
-+ if (event->code == MPTCP_EVENT_ADD) {
-+ fmp->add_addr++;
-+ mpcb->addr_signal = 1;
-+
-+ sk = mptcp_select_ack_sock(meta_sk);
-+ if (sk)
-+ tcp_send_ack(sk);
-+
-+ full_mesh_create_subflows(meta_sk);
-+ }
-+
-+ if (event->code == MPTCP_EVENT_DEL) {
-+ struct sock *sk, *tmpsk;
-+ struct mptcp_loc_addr *mptcp_local;
-+ bool found = false;
-+
-+ mptcp_local = rcu_dereference_bh(fm_ns->local);
-+
-+ /* In any case, we need to update our bitfields */
-+ if (id >= 0)
-+ update_addr_bitfields(meta_sk, mptcp_local);
-+
-+ /* Look for the socket and remove it */
-+ mptcp_for_each_sk_safe(mpcb, sk, tmpsk) {
-+ if ((event->family == AF_INET6 &&
-+ (sk->sk_family == AF_INET ||
-+ mptcp_v6_is_v4_mapped(sk))) ||
-+ (event->family == AF_INET &&
-+ (sk->sk_family == AF_INET6 &&
-+ !mptcp_v6_is_v4_mapped(sk))))
-+ continue;
-+
-+ if (event->family == AF_INET &&
-+ (sk->sk_family == AF_INET ||
-+ mptcp_v6_is_v4_mapped(sk)) &&
-+ inet_sk(sk)->inet_saddr != event->addr.in.s_addr)
-+ continue;
-+
-+ if (event->family == AF_INET6 &&
-+ sk->sk_family == AF_INET6 &&
-+ !ipv6_addr_equal(&inet6_sk(sk)->saddr, &event->addr.in6))
-+ continue;
-+
-+ /* Reinject, so that pf = 1 and so we
-+ * won't select this one as the
-+ * ack-sock.
-+ */
-+ mptcp_reinject_data(sk, 0);
-+
-+ /* We announce the removal of this id */
-+ announce_remove_addr(tcp_sk(sk)->mptcp->loc_id, meta_sk);
-+
-+ mptcp_sub_force_close(sk);
-+ found = true;
-+ }
-+
-+ if (found)
-+ goto next;
-+
-+ /* The id may have been given by the event,
-+ * matching on a local address. And it may not
-+ * have matched on one of the above sockets,
-+ * because the client never created a subflow.
-+ * So, we have to finally remove it here.
-+ */
-+ if (id > 0)
-+ announce_remove_addr(id, meta_sk);
-+ }
-+
-+ if (event->code == MPTCP_EVENT_MOD) {
-+ struct sock *sk;
-+
-+ mptcp_for_each_sk(mpcb, sk) {
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ if (event->family == AF_INET &&
-+ (sk->sk_family == AF_INET ||
-+ mptcp_v6_is_v4_mapped(sk)) &&
-+ inet_sk(sk)->inet_saddr == event->addr.in.s_addr) {
-+ if (event->low_prio != tp->mptcp->low_prio) {
-+ tp->mptcp->send_mp_prio = 1;
-+ tp->mptcp->low_prio = event->low_prio;
-+
-+ tcp_send_ack(sk);
-+ }
-+ }
-+
-+ if (event->family == AF_INET6 &&
-+ sk->sk_family == AF_INET6 &&
-+ !ipv6_addr_equal(&inet6_sk(sk)->saddr, &event->addr.in6)) {
-+ if (event->low_prio != tp->mptcp->low_prio) {
-+ tp->mptcp->send_mp_prio = 1;
-+ tp->mptcp->low_prio = event->low_prio;
-+
-+ tcp_send_ack(sk);
-+ }
-+ }
-+ }
-+ }
-+next:
-+ bh_unlock_sock(meta_sk);
-+ sock_put(meta_sk);
-+ }
-+ rcu_read_unlock_bh();
-+ }
-+ goto next_event;
-+}
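On an ADD event the worker picks an empty slot with `__mptcp_find_free_index(loc4_bits, next_v4_index)` and afterwards stores `i + 1` as the next start position. A sketch under the assumption that the helper searches for a clear bit starting at a rotating position, so freed slots (and their address IDs) are not immediately reused:

```c
#include <stdint.h>

/* Assumed semantics of __mptcp_find_free_index(): find a clear bit,
 * beginning the search at `start` and wrapping around; -1 when all
 * eight slots are taken ("no more space" in the worker above). */
int find_free_index_from(uint8_t bits, int start)
{
	int i;

	for (i = 0; i < 8; i++) {
		int idx = (start + i) % 8;

		if (!(bits & (1u << idx)))
			return idx;
	}
	return -1;
}
```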
-+
-+static struct mptcp_addr_event *lookup_similar_event(const struct net *net,
-+ const struct mptcp_addr_event *event)
-+{
-+ struct mptcp_addr_event *eventq;
-+ struct mptcp_fm_ns *fm_ns = fm_get_ns(net);
-+
-+ list_for_each_entry(eventq, &fm_ns->events, list) {
-+ if (eventq->family != event->family)
-+ continue;
-+ if (event->family == AF_INET) {
-+ if (eventq->addr.in.s_addr == event->addr.in.s_addr)
-+ return eventq;
-+ } else {
-+ if (ipv6_addr_equal(&eventq->addr.in6, &event->addr.in6))
-+ return eventq;
-+ }
-+ }
-+ return NULL;
-+}
-+
-+/* We already hold the net-namespace MPTCP-lock */
-+static void add_pm_event(struct net *net, const struct mptcp_addr_event *event)
-+{
-+ struct mptcp_addr_event *eventq = lookup_similar_event(net, event);
-+ struct mptcp_fm_ns *fm_ns = fm_get_ns(net);
-+
-+ if (eventq) {
-+ switch (event->code) {
-+ case MPTCP_EVENT_DEL:
-+ mptcp_debug("%s del old_code %u\n", __func__, eventq->code);
-+ list_del(&eventq->list);
-+ kfree(eventq);
-+ break;
-+ case MPTCP_EVENT_ADD:
-+ mptcp_debug("%s add old_code %u\n", __func__, eventq->code);
-+ eventq->low_prio = event->low_prio;
-+ eventq->code = MPTCP_EVENT_ADD;
-+ return;
-+ case MPTCP_EVENT_MOD:
-+ mptcp_debug("%s mod old_code %u\n", __func__, eventq->code);
-+ eventq->low_prio = event->low_prio;
-+ eventq->code = MPTCP_EVENT_MOD;
-+ return;
-+ }
-+ }
-+
-+ /* OK, we have to add the new address to the wait queue */
-+ eventq = kmemdup(event, sizeof(struct mptcp_addr_event), GFP_ATOMIC);
-+ if (!eventq)
-+ return;
-+
-+ list_add_tail(&eventq->list, &fm_ns->events);
-+
-+ /* Create work-queue */
-+ if (!delayed_work_pending(&fm_ns->address_worker))
-+ queue_delayed_work(mptcp_wq, &fm_ns->address_worker,
-+ msecs_to_jiffies(500));
-+}
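`add_pm_event()` coalesces per-address events: when an event for the same address is already queued, an incoming DEL drops the queued entry outright, while ADD and MOD simply overwrite its code and priority. Expressed as a pure function (hypothetical enum values, not the kernel's):

```c
/* Sketch of the coalescing rules above. EV_DROP means the queued
 * event is unlinked and freed; otherwise the queued event takes the
 * returned code and the incoming event's low_prio. */
enum ev_code { EV_ADD = 0, EV_DEL = 1, EV_MOD = 2, EV_DROP = -1 };

int coalesce_queued_event(int incoming)
{
	switch (incoming) {
	case EV_DEL:
		return EV_DROP;	/* delete cancels whatever was pending */
	case EV_ADD:
		return EV_ADD;	/* pending event becomes an ADD */
	case EV_MOD:
		return EV_MOD;	/* pending event becomes a MOD */
	default:
		return EV_DROP;
	}
}
```

Only when no similar event is queued does a new entry get appended and the 500 ms delayed worker scheduled.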
-+
-+static void addr4_event_handler(const struct in_ifaddr *ifa, unsigned long event,
-+ struct net *net)
-+{
-+ const struct net_device *netdev = ifa->ifa_dev->dev;
-+ struct mptcp_fm_ns *fm_ns = fm_get_ns(net);
-+ struct mptcp_addr_event mpevent;
-+
-+ if (ifa->ifa_scope > RT_SCOPE_LINK ||
-+ ipv4_is_loopback(ifa->ifa_local))
-+ return;
-+
-+ spin_lock_bh(&fm_ns->local_lock);
-+
-+ mpevent.family = AF_INET;
-+ mpevent.addr.in.s_addr = ifa->ifa_local;
-+ mpevent.low_prio = (netdev->flags & IFF_MPBACKUP) ? 1 : 0;
-+
-+ if (event == NETDEV_DOWN || !netif_running(netdev) ||
-+ (netdev->flags & IFF_NOMULTIPATH) || !(netdev->flags & IFF_UP))
-+ mpevent.code = MPTCP_EVENT_DEL;
-+ else if (event == NETDEV_UP)
-+ mpevent.code = MPTCP_EVENT_ADD;
-+ else if (event == NETDEV_CHANGE)
-+ mpevent.code = MPTCP_EVENT_MOD;
-+
-+ mptcp_debug("%s created event for %pI4, code %u prio %u\n", __func__,
-+ &ifa->ifa_local, mpevent.code, mpevent.low_prio);
-+ add_pm_event(net, &mpevent);
-+
-+ spin_unlock_bh(&fm_ns->local_lock);
-+ return;
-+}
-+
-+/* React on IPv4-addr add/rem-events */
-+static int mptcp_pm_inetaddr_event(struct notifier_block *this,
-+ unsigned long event, void *ptr)
-+{
-+ const struct in_ifaddr *ifa = (struct in_ifaddr *)ptr;
-+ struct net *net = dev_net(ifa->ifa_dev->dev);
-+
-+ if (!(event == NETDEV_UP || event == NETDEV_DOWN ||
-+ event == NETDEV_CHANGE))
-+ return NOTIFY_DONE;
-+
-+ addr4_event_handler(ifa, event, net);
-+
-+ return NOTIFY_DONE;
-+}
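`addr4_event_handler()` (and its IPv6 twin below) classify a notifier event into an MPTCP event code: DEL wins whenever the interface is effectively unusable, otherwise UP maps to ADD and CHANGE to MOD. A sketch with stand-in constants (the real `NETDEV_*` values and `IFF_*` flags live in kernel headers):

```c
/* Hypothetical stand-ins for the kernel constants used above. */
enum { NETDEV_UP_E = 1, NETDEV_DOWN_E = 2, NETDEV_CHANGE_E = 4 };
enum { MP_ADD = 0, MP_DEL = 1, MP_MOD = 2, MP_NONE = -1 };

/* Mirrors the if/else cascade in addr4_event_handler(): a down event,
 * a stopped interface, IFF_NOMULTIPATH, or a cleared IFF_UP all force
 * a DEL, regardless of the notifier event code. */
int classify_addr_event(int event, int if_running, int if_up, int nomultipath)
{
	if (event == NETDEV_DOWN_E || !if_running || nomultipath || !if_up)
		return MP_DEL;
	if (event == NETDEV_UP_E)
		return MP_ADD;
	if (event == NETDEV_CHANGE_E)
		return MP_MOD;
	return MP_NONE;
}
```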
-+
-+static struct notifier_block mptcp_pm_inetaddr_notifier = {
-+ .notifier_call = mptcp_pm_inetaddr_event,
-+};
-+
-+#if IS_ENABLED(CONFIG_IPV6)
-+
-+/* IPV6-related address/interface watchers */
-+struct mptcp_dad_data {
-+ struct timer_list timer;
-+ struct inet6_ifaddr *ifa;
-+};
-+
-+static void dad_callback(unsigned long arg);
-+static int inet6_addr_event(struct notifier_block *this,
-+ unsigned long event, void *ptr);
-+
-+static int ipv6_is_in_dad_state(const struct inet6_ifaddr *ifa)
-+{
-+ return (ifa->flags & IFA_F_TENTATIVE) &&
-+ ifa->state == INET6_IFADDR_STATE_DAD;
-+}
-+
-+static void dad_init_timer(struct mptcp_dad_data *data,
-+ struct inet6_ifaddr *ifa)
-+{
-+ data->ifa = ifa;
-+ data->timer.data = (unsigned long)data;
-+ data->timer.function = dad_callback;
-+ if (ifa->idev->cnf.rtr_solicit_delay)
-+ data->timer.expires = jiffies + ifa->idev->cnf.rtr_solicit_delay;
-+ else
-+ data->timer.expires = jiffies + (HZ/10);
-+}
-+
-+static void dad_callback(unsigned long arg)
-+{
-+ struct mptcp_dad_data *data = (struct mptcp_dad_data *)arg;
-+
-+ if (ipv6_is_in_dad_state(data->ifa)) {
-+ dad_init_timer(data, data->ifa);
-+ add_timer(&data->timer);
-+ } else {
-+ inet6_addr_event(NULL, NETDEV_UP, data->ifa);
-+ in6_ifa_put(data->ifa);
-+ kfree(data);
-+ }
-+}
-+
-+static inline void dad_setup_timer(struct inet6_ifaddr *ifa)
-+{
-+ struct mptcp_dad_data *data;
-+
-+ data = kmalloc(sizeof(*data), GFP_ATOMIC);
-+
-+ if (!data)
-+ return;
-+
-+ init_timer(&data->timer);
-+ dad_init_timer(data, ifa);
-+ add_timer(&data->timer);
-+ in6_ifa_hold(ifa);
-+}
-+
-+static void addr6_event_handler(const struct inet6_ifaddr *ifa, unsigned long event,
-+ struct net *net)
-+{
-+ const struct net_device *netdev = ifa->idev->dev;
-+ int addr_type = ipv6_addr_type(&ifa->addr);
-+ struct mptcp_fm_ns *fm_ns = fm_get_ns(net);
-+ struct mptcp_addr_event mpevent;
-+
-+ if (ifa->scope > RT_SCOPE_LINK ||
-+ addr_type == IPV6_ADDR_ANY ||
-+ (addr_type & IPV6_ADDR_LOOPBACK) ||
-+ (addr_type & IPV6_ADDR_LINKLOCAL))
-+ return;
-+
-+ spin_lock_bh(&fm_ns->local_lock);
-+
-+ mpevent.family = AF_INET6;
-+ mpevent.addr.in6 = ifa->addr;
-+ mpevent.low_prio = (netdev->flags & IFF_MPBACKUP) ? 1 : 0;
-+
-+ if (event == NETDEV_DOWN || !netif_running(netdev) ||
-+ (netdev->flags & IFF_NOMULTIPATH) || !(netdev->flags & IFF_UP))
-+ mpevent.code = MPTCP_EVENT_DEL;
-+ else if (event == NETDEV_UP)
-+ mpevent.code = MPTCP_EVENT_ADD;
-+ else if (event == NETDEV_CHANGE)
-+ mpevent.code = MPTCP_EVENT_MOD;
-+
-+ mptcp_debug("%s created event for %pI6, code %u prio %u\n", __func__,
-+ &ifa->addr, mpevent.code, mpevent.low_prio);
-+ add_pm_event(net, &mpevent);
-+
-+ spin_unlock_bh(&fm_ns->local_lock);
-+ return;
-+}
-+
-+/* React on IPv6-addr add/rem-events */
-+static int inet6_addr_event(struct notifier_block *this, unsigned long event,
-+ void *ptr)
-+{
-+ struct inet6_ifaddr *ifa6 = (struct inet6_ifaddr *)ptr;
-+ struct net *net = dev_net(ifa6->idev->dev);
-+
-+ if (!(event == NETDEV_UP || event == NETDEV_DOWN ||
-+ event == NETDEV_CHANGE))
-+ return NOTIFY_DONE;
-+
-+ if (ipv6_is_in_dad_state(ifa6))
-+ dad_setup_timer(ifa6);
-+ else
-+ addr6_event_handler(ifa6, event, net);
-+
-+ return NOTIFY_DONE;
-+}
-+
-+static struct notifier_block inet6_addr_notifier = {
-+ .notifier_call = inet6_addr_event,
-+};
-+
-+#endif
-+
-+/* React on ifup/down-events */
-+static int netdev_event(struct notifier_block *this, unsigned long event,
-+ void *ptr)
-+{
-+ const struct net_device *dev = netdev_notifier_info_to_dev(ptr);
-+ struct in_device *in_dev;
-+#if IS_ENABLED(CONFIG_IPV6)
-+ struct inet6_dev *in6_dev;
-+#endif
-+
-+ if (!(event == NETDEV_UP || event == NETDEV_DOWN ||
-+ event == NETDEV_CHANGE))
-+ return NOTIFY_DONE;
-+
-+ rcu_read_lock();
-+ in_dev = __in_dev_get_rtnl(dev);
-+
-+ if (in_dev) {
-+ for_ifa(in_dev) {
-+ mptcp_pm_inetaddr_event(NULL, event, ifa);
-+ } endfor_ifa(in_dev);
-+ }
-+
-+#if IS_ENABLED(CONFIG_IPV6)
-+ in6_dev = __in6_dev_get(dev);
-+
-+ if (in6_dev) {
-+ struct inet6_ifaddr *ifa6;
-+ list_for_each_entry(ifa6, &in6_dev->addr_list, if_list)
-+ inet6_addr_event(NULL, event, ifa6);
-+ }
-+#endif
-+
-+ rcu_read_unlock();
-+ return NOTIFY_DONE;
-+}
-+
-+static struct notifier_block mptcp_pm_netdev_notifier = {
-+ .notifier_call = netdev_event,
-+};
-+
-+static void full_mesh_add_raddr(struct mptcp_cb *mpcb,
-+ const union inet_addr *addr,
-+ sa_family_t family, __be16 port, u8 id)
-+{
-+ if (family == AF_INET)
-+ mptcp_addv4_raddr(mpcb, &addr->in, port, id);
-+ else
-+ mptcp_addv6_raddr(mpcb, &addr->in6, port, id);
-+}
-+
-+static void full_mesh_new_session(const struct sock *meta_sk)
-+{
-+ struct mptcp_loc_addr *mptcp_local;
-+ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
-+ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
-+ const struct mptcp_fm_ns *fm_ns = fm_get_ns(sock_net(meta_sk));
-+ int i, index;
-+ union inet_addr saddr, daddr;
-+ sa_family_t family;
-+ bool meta_v4 = meta_sk->sk_family == AF_INET;
-+
-+ /* Init local variables necessary for the rest */
-+ if (meta_sk->sk_family == AF_INET || mptcp_v6_is_v4_mapped(meta_sk)) {
-+ saddr.ip = inet_sk(meta_sk)->inet_saddr;
-+ daddr.ip = inet_sk(meta_sk)->inet_daddr;
-+ family = AF_INET;
-+#if IS_ENABLED(CONFIG_IPV6)
-+ } else {
-+ saddr.in6 = inet6_sk(meta_sk)->saddr;
-+ daddr.in6 = meta_sk->sk_v6_daddr;
-+ family = AF_INET6;
-+#endif
-+ }
-+
-+ rcu_read_lock();
-+ mptcp_local = rcu_dereference(fm_ns->local);
-+
-+ index = mptcp_find_address(mptcp_local, family, &saddr);
-+ if (index < 0)
-+ goto fallback;
-+
-+ full_mesh_add_raddr(mpcb, &daddr, family, 0, 0);
-+ mptcp_set_init_addr_bit(mpcb, &daddr, family, index);
-+
-+ /* Initialize workqueue-struct */
-+ INIT_WORK(&fmp->subflow_work, create_subflow_worker);
-+ INIT_DELAYED_WORK(&fmp->subflow_retry_work, retry_subflow_worker);
-+ fmp->mpcb = mpcb;
-+
-+ if (!meta_v4 && inet6_sk(meta_sk)->ipv6only)
-+ goto skip_ipv4;
-+
-+ /* Look for the address among the local addresses */
-+ mptcp_for_each_bit_set(mptcp_local->loc4_bits, i) {
-+ __be32 ifa_address = mptcp_local->locaddr4[i].addr.s_addr;
-+
-+ /* We do not need to announce the initial subflow's address again */
-+ if (family == AF_INET && saddr.ip == ifa_address)
-+ continue;
-+
-+ fmp->add_addr++;
-+ mpcb->addr_signal = 1;
-+ }
-+
-+skip_ipv4:
-+#if IS_ENABLED(CONFIG_IPV6)
-+ /* skip IPv6 addresses if meta-socket is IPv4 */
-+ if (meta_v4)
-+ goto skip_ipv6;
-+
-+ mptcp_for_each_bit_set(mptcp_local->loc6_bits, i) {
-+ const struct in6_addr *ifa6 = &mptcp_local->locaddr6[i].addr;
-+
-+ /* We do not need to announce the initial subflow's address again */
-+ if (family == AF_INET6 && ipv6_addr_equal(&saddr.in6, ifa6))
-+ continue;
-+
-+ fmp->add_addr++;
-+ mpcb->addr_signal = 1;
-+ }
-+
-+skip_ipv6:
-+#endif
-+
-+ rcu_read_unlock();
-+
-+ if (family == AF_INET)
-+ fmp->announced_addrs_v4 |= (1 << index);
-+ else
-+ fmp->announced_addrs_v6 |= (1 << index);
-+
-+ for (i = fmp->add_addr; i && fmp->add_addr; i--)
-+ tcp_send_ack(mpcb->master_sk);
-+
-+ return;
-+
-+fallback:
-+ rcu_read_unlock();
-+ mptcp_fallback_default(mpcb);
-+ return;
-+}
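In `full_mesh_new_session()` above, `fmp->add_addr` is bumped once per local address except the one already carried by the initial subflow, and one ACK is sent per scheduled announcement. The counting logic, as a standalone sketch:

```c
#include <stdint.h>

/* Count the local addresses that still need an ADD_ADDR announcement:
 * every populated slot except the initial subflow's own index. */
int addresses_to_announce(uint8_t loc_bits, int initial_index)
{
	int i, count = 0;

	for (i = 0; i < 8; i++) {
		if (!(loc_bits & (1u << i)))
			continue;
		if (i == initial_index)
			continue;
		count++;
	}
	return count;
}
```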
-+
-+static void full_mesh_create_subflows(struct sock *meta_sk)
-+{
-+ const struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
-+ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
-+
-+ if (mpcb->infinite_mapping_snd || mpcb->infinite_mapping_rcv ||
-+ mpcb->send_infinite_mapping ||
-+ mpcb->server_side || sock_flag(meta_sk, SOCK_DEAD))
-+ return;
-+
-+ if (mpcb->master_sk &&
-+ !tcp_sk(mpcb->master_sk)->mptcp->fully_established)
-+ return;
-+
-+ if (!work_pending(&fmp->subflow_work)) {
-+ sock_hold(meta_sk);
-+ queue_work(mptcp_wq, &fmp->subflow_work);
-+ }
-+}
-+
-+/* Called upon release_sock, if the socket was owned by the user during
-+ * a path-management event.
-+ */
-+static void full_mesh_release_sock(struct sock *meta_sk)
-+{
-+ struct mptcp_loc_addr *mptcp_local;
-+ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
-+ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
-+ const struct mptcp_fm_ns *fm_ns = fm_get_ns(sock_net(meta_sk));
-+ struct sock *sk, *tmpsk;
-+ bool meta_v4 = meta_sk->sk_family == AF_INET;
-+ int i;
-+
-+ rcu_read_lock();
-+ mptcp_local = rcu_dereference(fm_ns->local);
-+
-+ if (!meta_v4 && inet6_sk(meta_sk)->ipv6only)
-+ goto skip_ipv4;
-+
-+ /* First, detect modifications or additions */
-+ mptcp_for_each_bit_set(mptcp_local->loc4_bits, i) {
-+ struct in_addr ifa = mptcp_local->locaddr4[i].addr;
-+ bool found = false;
-+
-+ mptcp_for_each_sk(mpcb, sk) {
-+ struct tcp_sock *tp = tcp_sk(sk);
-+
-+ if (sk->sk_family == AF_INET6 &&
-+ !mptcp_v6_is_v4_mapped(sk))
-+ continue;
-+
-+ if (inet_sk(sk)->inet_saddr != ifa.s_addr)
-+ continue;
-+
-+ found = true;
-+
-+ if (mptcp_local->locaddr4[i].low_prio != tp->mptcp->low_prio) {
-+ tp->mptcp->send_mp_prio = 1;
-+ tp->mptcp->low_prio = mptcp_local->locaddr4[i].low_prio;
-+
-+ tcp_send_ack(sk);
-+ }
-+ }
-+
-+ if (!found) {
-+ fmp->add_addr++;
-+ mpcb->addr_signal = 1;
-+
-+ sk = mptcp_select_ack_sock(meta_sk);
-+ if (sk)
-+ tcp_send_ack(sk);
-+ full_mesh_create_subflows(meta_sk);
-+ }
-+ }
-+
-+skip_ipv4:
-+#if IS_ENABLED(CONFIG_IPV6)
-+ /* skip IPv6 addresses if meta-socket is IPv4 */
-+ if (meta_v4)
-+ goto removal;
-+
-+ mptcp_for_each_bit_set(mptcp_local->loc6_bits, i) {
-+ struct in6_addr ifa = mptcp_local->locaddr6[i].addr;
-+ bool found = false;
-+
-+ mptcp_for_each_sk(mpcb, sk) {
-+ struct tcp_sock *tp = tcp_sk(sk);
-+
-+ if (sk->sk_family == AF_INET ||
-+ mptcp_v6_is_v4_mapped(sk))
-+ continue;
-+
-+ if (!ipv6_addr_equal(&inet6_sk(sk)->saddr, &ifa))
-+ continue;
-+
-+ found = true;
-+
-+ if (mptcp_local->locaddr6[i].low_prio != tp->mptcp->low_prio) {
-+ tp->mptcp->send_mp_prio = 1;
-+ tp->mptcp->low_prio = mptcp_local->locaddr6[i].low_prio;
-+
-+ tcp_send_ack(sk);
-+ }
-+ }
-+
-+ if (!found) {
-+ fmp->add_addr++;
-+ mpcb->addr_signal = 1;
-+
-+ sk = mptcp_select_ack_sock(meta_sk);
-+ if (sk)
-+ tcp_send_ack(sk);
-+ full_mesh_create_subflows(meta_sk);
-+ }
-+ }
-+
-+removal:
-+#endif
-+
-+ /* Now, detect address-removals */
-+ mptcp_for_each_sk_safe(mpcb, sk, tmpsk) {
-+ bool shall_remove = true;
-+
-+ if (sk->sk_family == AF_INET || mptcp_v6_is_v4_mapped(sk)) {
-+ mptcp_for_each_bit_set(mptcp_local->loc4_bits, i) {
-+ if (inet_sk(sk)->inet_saddr == mptcp_local->locaddr4[i].addr.s_addr) {
-+ shall_remove = false;
-+ break;
-+ }
-+ }
-+ } else {
-+ mptcp_for_each_bit_set(mptcp_local->loc6_bits, i) {
-+ if (ipv6_addr_equal(&inet6_sk(sk)->saddr, &mptcp_local->locaddr6[i].addr)) {
-+ shall_remove = false;
-+ break;
-+ }
-+ }
-+ }
-+
-+ if (shall_remove) {
-+ /* Reinject, so that pf = 1 and so we
-+ * won't select this one as the
-+ * ack-sock.
-+ */
-+ mptcp_reinject_data(sk, 0);
-+
-+ announce_remove_addr(tcp_sk(sk)->mptcp->loc_id,
-+ meta_sk);
-+
-+ mptcp_sub_force_close(sk);
-+ }
-+ }
-+
-+ /* Just call it optimistically. It actually cannot do any harm */
-+ update_addr_bitfields(meta_sk, mptcp_local);
-+
-+ rcu_read_unlock();
-+}
-+
-+static int full_mesh_get_local_id(sa_family_t family, union inet_addr *addr,
-+ struct net *net, bool *low_prio)
-+{
-+ struct mptcp_loc_addr *mptcp_local;
-+ const struct mptcp_fm_ns *fm_ns = fm_get_ns(net);
-+ int index, id = -1;
-+
-+ /* Handle the backup-flows */
-+ rcu_read_lock();
-+ mptcp_local = rcu_dereference(fm_ns->local);
-+
-+ index = mptcp_find_address(mptcp_local, family, addr);
-+
-+ if (index != -1) {
-+ if (family == AF_INET) {
-+ id = mptcp_local->locaddr4[index].loc4_id;
-+ *low_prio = mptcp_local->locaddr4[index].low_prio;
-+ } else {
-+ id = mptcp_local->locaddr6[index].loc6_id;
-+ *low_prio = mptcp_local->locaddr6[index].low_prio;
-+ }
-+ }
-+
-+
-+ rcu_read_unlock();
-+
-+ return id;
-+}
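The ID scheme visible in this file assigns `loc4_id = i + 1` (ID 0 is reserved for the initial subflow) and `loc6_id = i + MPTCP_MAX_ADDR`, so the IPv4 and IPv6 ID spaces never collide. A sketch of that convention (the value of `MPTCP_MAX_ADDR` is assumed here, not taken from the source):

```c
/* Assumed value; in the kernel this comes from the MPTCP headers. */
#define SKETCH_MPTCP_MAX_ADDR 8

/* IPv4 slot i gets address-ID i + 1 (ID 0 = initial subflow). */
int loc4_id_for_index(int i)
{
	return i + 1;
}

/* IPv6 slot i is offset into a disjoint ID range. */
int loc6_id_for_index(int i)
{
	return i + SKETCH_MPTCP_MAX_ADDR;
}
```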
-+
-+static void full_mesh_addr_signal(struct sock *sk, unsigned *size,
-+ struct tcp_out_options *opts,
-+ struct sk_buff *skb)
-+{
-+ const struct tcp_sock *tp = tcp_sk(sk);
-+ struct mptcp_cb *mpcb = tp->mpcb;
-+ struct sock *meta_sk = mpcb->meta_sk;
-+ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
-+ struct mptcp_loc_addr *mptcp_local;
-+ struct mptcp_fm_ns *fm_ns = fm_get_ns(sock_net(sk));
-+ int remove_addr_len;
-+ u8 unannouncedv4 = 0, unannouncedv6 = 0;
-+ bool meta_v4 = meta_sk->sk_family == AF_INET;
-+
-+ mpcb->addr_signal = 0;
-+
-+ if (likely(!fmp->add_addr))
-+ goto remove_addr;
-+
-+ rcu_read_lock();
-+ mptcp_local = rcu_dereference(fm_ns->local);
-+
-+ if (!meta_v4 && inet6_sk(meta_sk)->ipv6only)
-+ goto skip_ipv4;
-+
-+ /* IPv4 */
-+ unannouncedv4 = (~fmp->announced_addrs_v4) & mptcp_local->loc4_bits;
-+ if (unannouncedv4 &&
-+ MAX_TCP_OPTION_SPACE - *size >= MPTCP_SUB_LEN_ADD_ADDR4_ALIGN) {
-+ int ind = mptcp_find_free_index(~unannouncedv4);
-+
-+ opts->options |= OPTION_MPTCP;
-+ opts->mptcp_options |= OPTION_ADD_ADDR;
-+ opts->add_addr4.addr_id = mptcp_local->locaddr4[ind].loc4_id;
-+ opts->add_addr4.addr = mptcp_local->locaddr4[ind].addr;
-+ opts->add_addr_v4 = 1;
-+
-+ if (skb) {
-+ fmp->announced_addrs_v4 |= (1 << ind);
-+ fmp->add_addr--;
-+ }
-+ *size += MPTCP_SUB_LEN_ADD_ADDR4_ALIGN;
-+ }
-+
-+ if (meta_v4)
-+ goto skip_ipv6;
-+
-+skip_ipv4:
-+ /* IPv6 */
-+ unannouncedv6 = (~fmp->announced_addrs_v6) & mptcp_local->loc6_bits;
-+ if (unannouncedv6 &&
-+ MAX_TCP_OPTION_SPACE - *size >= MPTCP_SUB_LEN_ADD_ADDR6_ALIGN) {
-+ int ind = mptcp_find_free_index(~unannouncedv6);
-+
-+ opts->options |= OPTION_MPTCP;
-+ opts->mptcp_options |= OPTION_ADD_ADDR;
-+ opts->add_addr6.addr_id = mptcp_local->locaddr6[ind].loc6_id;
-+ opts->add_addr6.addr = mptcp_local->locaddr6[ind].addr;
-+ opts->add_addr_v6 = 1;
-+
-+ if (skb) {
-+ fmp->announced_addrs_v6 |= (1 << ind);
-+ fmp->add_addr--;
-+ }
-+ *size += MPTCP_SUB_LEN_ADD_ADDR6_ALIGN;
-+ }
-+
-+skip_ipv6:
-+ rcu_read_unlock();
-+
-+ if (!unannouncedv4 && !unannouncedv6 && skb)
-+ fmp->add_addr--;
-+
-+remove_addr:
-+ if (likely(!fmp->remove_addrs))
-+ goto exit;
-+
-+ remove_addr_len = mptcp_sub_len_remove_addr_align(fmp->remove_addrs);
-+ if (MAX_TCP_OPTION_SPACE - *size < remove_addr_len)
-+ goto exit;
-+
-+ opts->options |= OPTION_MPTCP;
-+ opts->mptcp_options |= OPTION_REMOVE_ADDR;
-+ opts->remove_addrs = fmp->remove_addrs;
-+ *size += remove_addr_len;
-+ if (skb)
-+ fmp->remove_addrs = 0;
-+
-+exit:
-+ mpcb->addr_signal = !!(fmp->add_addr || fmp->remove_addrs);
-+}
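`full_mesh_addr_signal()` only emits an ADD_ADDR or REMOVE_ADDR option when it still fits into the remaining TCP option space, i.e. `MAX_TCP_OPTION_SPACE - *size >= option_len`. A sketch of that guard (TCP options are limited to 40 bytes total):

```c
/* TCP header option space is capped at 40 bytes. */
#define SKETCH_MAX_TCP_OPTION_SPACE 40

/* 1 if an option of opt_len bytes still fits after size_used bytes
 * of options have already been claimed for this segment. */
int option_fits(unsigned int size_used, unsigned int opt_len)
{
	return SKETCH_MAX_TCP_OPTION_SPACE - size_used >= opt_len;
}
```

When the option does not fit, the announcement simply stays pending: `addr_signal` is re-armed at the end of the function as long as `add_addr` or `remove_addrs` is non-zero.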
-+
-+static void full_mesh_rem_raddr(struct mptcp_cb *mpcb, u8 rem_id)
-+{
-+ mptcp_v4_rem_raddress(mpcb, rem_id);
-+ mptcp_v6_rem_raddress(mpcb, rem_id);
-+}
-+
-+/* Output /proc/net/mptcp_fullmesh */
-+static int mptcp_fm_seq_show(struct seq_file *seq, void *v)
-+{
-+ const struct net *net = seq->private;
-+ struct mptcp_loc_addr *mptcp_local;
-+ const struct mptcp_fm_ns *fm_ns = fm_get_ns(net);
-+ int i;
-+
-+ seq_printf(seq, "Index, Address-ID, Backup, IP-address\n");
-+
-+ rcu_read_lock_bh();
-+ mptcp_local = rcu_dereference(fm_ns->local);
-+
-+ seq_printf(seq, "IPv4, next v4-index: %u\n", mptcp_local->next_v4_index);
-+
-+ mptcp_for_each_bit_set(mptcp_local->loc4_bits, i) {
-+ struct mptcp_loc4 *loc4 = &mptcp_local->locaddr4[i];
-+
-+ seq_printf(seq, "%u, %u, %u, %pI4\n", i, loc4->loc4_id,
-+ loc4->low_prio, &loc4->addr);
-+ }
-+
-+ seq_printf(seq, "IPv6, next v6-index: %u\n", mptcp_local->next_v6_index);
-+
-+ mptcp_for_each_bit_set(mptcp_local->loc6_bits, i) {
-+ struct mptcp_loc6 *loc6 = &mptcp_local->locaddr6[i];
-+
-+ seq_printf(seq, "%u, %u, %u, %pI6\n", i, loc6->loc6_id,
-+ loc6->low_prio, &loc6->addr);
-+ }
-+ rcu_read_unlock_bh();
-+
-+ return 0;
-+}
-+
-+static int mptcp_fm_seq_open(struct inode *inode, struct file *file)
-+{
-+ return single_open_net(inode, file, mptcp_fm_seq_show);
-+}
-+
-+static const struct file_operations mptcp_fm_seq_fops = {
-+ .owner = THIS_MODULE,
-+ .open = mptcp_fm_seq_open,
-+ .read = seq_read,
-+ .llseek = seq_lseek,
-+ .release = single_release_net,
-+};
-+
-+static int mptcp_fm_init_net(struct net *net)
-+{
-+ struct mptcp_loc_addr *mptcp_local;
-+ struct mptcp_fm_ns *fm_ns;
-+ int err = 0;
-+
-+ fm_ns = kzalloc(sizeof(*fm_ns), GFP_KERNEL);
-+ if (!fm_ns)
-+ return -ENOBUFS;
-+
-+ mptcp_local = kzalloc(sizeof(*mptcp_local), GFP_KERNEL);
-+ if (!mptcp_local) {
-+ err = -ENOBUFS;
-+ goto err_mptcp_local;
-+ }
-+
-+ if (!proc_create("mptcp_fullmesh", S_IRUGO, net->proc_net,
-+ &mptcp_fm_seq_fops)) {
-+ err = -ENOMEM;
-+ goto err_seq_fops;
-+ }
-+
-+ mptcp_local->next_v4_index = 1;
-+
-+ rcu_assign_pointer(fm_ns->local, mptcp_local);
-+ INIT_DELAYED_WORK(&fm_ns->address_worker, mptcp_address_worker);
-+ INIT_LIST_HEAD(&fm_ns->events);
-+ spin_lock_init(&fm_ns->local_lock);
-+ fm_ns->net = net;
-+ net->mptcp.path_managers[MPTCP_PM_FULLMESH] = fm_ns;
-+
-+ return 0;
-+err_seq_fops:
-+ kfree(mptcp_local);
-+err_mptcp_local:
-+ kfree(fm_ns);
-+ return err;
-+}
-+
-+static void mptcp_fm_exit_net(struct net *net)
-+{
-+ struct mptcp_addr_event *eventq, *tmp;
-+ struct mptcp_fm_ns *fm_ns;
-+ struct mptcp_loc_addr *mptcp_local;
-+
-+ fm_ns = fm_get_ns(net);
-+ cancel_delayed_work_sync(&fm_ns->address_worker);
-+
-+ rcu_read_lock_bh();
-+
-+ mptcp_local = rcu_dereference_bh(fm_ns->local);
-+ kfree(mptcp_local);
-+
-+ spin_lock(&fm_ns->local_lock);
-+ list_for_each_entry_safe(eventq, tmp, &fm_ns->events, list) {
-+ list_del(&eventq->list);
-+ kfree(eventq);
-+ }
-+ spin_unlock(&fm_ns->local_lock);
-+
-+ rcu_read_unlock_bh();
-+
-+ remove_proc_entry("mptcp_fullmesh", net->proc_net);
-+
-+ kfree(fm_ns);
-+}
-+
-+static struct pernet_operations full_mesh_net_ops = {
-+ .init = mptcp_fm_init_net,
-+ .exit = mptcp_fm_exit_net,
-+};
-+
-+static struct mptcp_pm_ops full_mesh __read_mostly = {
-+ .new_session = full_mesh_new_session,
-+ .release_sock = full_mesh_release_sock,
-+ .fully_established = full_mesh_create_subflows,
-+ .new_remote_address = full_mesh_create_subflows,
-+ .get_local_id = full_mesh_get_local_id,
-+ .addr_signal = full_mesh_addr_signal,
-+ .add_raddr = full_mesh_add_raddr,
-+ .rem_raddr = full_mesh_rem_raddr,
-+ .name = "fullmesh",
-+ .owner = THIS_MODULE,
-+};
-+
-+/* General initialization of MPTCP_PM */
-+static int __init full_mesh_register(void)
-+{
-+ int ret;
-+
-+ BUILD_BUG_ON(sizeof(struct fullmesh_priv) > MPTCP_PM_SIZE);
-+
-+ ret = register_pernet_subsys(&full_mesh_net_ops);
-+ if (ret)
-+ goto out;
-+
-+ ret = register_inetaddr_notifier(&mptcp_pm_inetaddr_notifier);
-+ if (ret)
-+ goto err_reg_inetaddr;
-+ ret = register_netdevice_notifier(&mptcp_pm_netdev_notifier);
-+ if (ret)
-+ goto err_reg_netdev;
-+
-+#if IS_ENABLED(CONFIG_IPV6)
-+ ret = register_inet6addr_notifier(&inet6_addr_notifier);
-+ if (ret)
-+ goto err_reg_inet6addr;
-+#endif
-+
-+ ret = mptcp_register_path_manager(&full_mesh);
-+ if (ret)
-+ goto err_reg_pm;
-+
-+out:
-+ return ret;
-+
-+
-+err_reg_pm:
-+#if IS_ENABLED(CONFIG_IPV6)
-+ unregister_inet6addr_notifier(&inet6_addr_notifier);
-+err_reg_inet6addr:
-+#endif
-+ unregister_netdevice_notifier(&mptcp_pm_netdev_notifier);
-+err_reg_netdev:
-+ unregister_inetaddr_notifier(&mptcp_pm_inetaddr_notifier);
-+err_reg_inetaddr:
-+ unregister_pernet_subsys(&full_mesh_net_ops);
-+ goto out;
-+}
-+
-+static void full_mesh_unregister(void)
-+{
-+#if IS_ENABLED(CONFIG_IPV6)
-+ unregister_inet6addr_notifier(&inet6_addr_notifier);
-+#endif
-+ unregister_netdevice_notifier(&mptcp_pm_netdev_notifier);
-+ unregister_inetaddr_notifier(&mptcp_pm_inetaddr_notifier);
-+ unregister_pernet_subsys(&full_mesh_net_ops);
-+ mptcp_unregister_path_manager(&full_mesh);
-+}
-+
-+module_init(full_mesh_register);
-+module_exit(full_mesh_unregister);
-+
-+MODULE_AUTHOR("Christoph Paasch");
-+MODULE_LICENSE("GPL");
-+MODULE_DESCRIPTION("Full-Mesh MPTCP");
-+MODULE_VERSION("0.88");
-diff --git a/net/mptcp/mptcp_input.c b/net/mptcp/mptcp_input.c
-new file mode 100644
-index 000000000000..43704ccb639e
---- /dev/null
-+++ b/net/mptcp/mptcp_input.c
-@@ -0,0 +1,2405 @@
-+/*
-+ * MPTCP implementation - Sending side
-+ *
-+ * Initial Design & Implementation:
-+ * Sébastien Barré <sebastien.barre@uclouvain.be>
-+ *
-+ * Current Maintainer & Author:
-+ * Christoph Paasch <christoph.paasch@uclouvain.be>
-+ *
-+ * Additional authors:
-+ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
-+ * Gregory Detal <gregory.detal@uclouvain.be>
-+ * Fabien Duchêne <fabien.duchene@uclouvain.be>
-+ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
-+ * Lavkesh Lahngir <lavkesh51@gmail.com>
-+ * Andreas Ripke <ripke@neclab.eu>
-+ * Vlad Dogaru <vlad.dogaru@intel.com>
-+ * Octavian Purdila <octavian.purdila@intel.com>
-+ * John Ronan <jronan@tssg.org>
-+ * Catalin Nicutar <catalin.nicutar@gmail.com>
-+ * Brandon Heller <brandonh@stanford.edu>
-+ *
-+ *
-+ * This program is free software; you can redistribute it and/or
-+ * modify it under the terms of the GNU General Public License
-+ * as published by the Free Software Foundation; either version
-+ * 2 of the License, or (at your option) any later version.
-+ */
-+
-+#include <asm/unaligned.h>
-+
-+#include <net/mptcp.h>
-+#include <net/mptcp_v4.h>
-+#include <net/mptcp_v6.h>
-+
-+#include <linux/kconfig.h>
-+
-+/* is seq1 < seq2 ? */
-+static inline bool before64(const u64 seq1, const u64 seq2)
-+{
-+ return (s64)(seq1 - seq2) < 0;
-+}
-+
-+/* is seq1 > seq2 ? */
-+#define after64(seq1, seq2) before64(seq2, seq1)
-+
-+static inline void mptcp_become_fully_estab(struct sock *sk)
-+{
-+ tcp_sk(sk)->mptcp->fully_established = 1;
-+
-+ if (is_master_tp(tcp_sk(sk)) &&
-+ tcp_sk(sk)->mpcb->pm_ops->fully_established)
-+ tcp_sk(sk)->mpcb->pm_ops->fully_established(mptcp_meta_sk(sk));
-+}
-+
-+/* Similar to tcp_tso_acked without any memory accounting */
-+static inline int mptcp_tso_acked_reinject(const struct sock *meta_sk,
-+ struct sk_buff *skb)
-+{
-+ const struct tcp_sock *meta_tp = tcp_sk(meta_sk);
-+ u32 packets_acked, len;
-+
-+ BUG_ON(!after(TCP_SKB_CB(skb)->end_seq, meta_tp->snd_una));
-+
-+ packets_acked = tcp_skb_pcount(skb);
-+
-+ if (skb_unclone(skb, GFP_ATOMIC))
-+ return 0;
-+
-+ len = meta_tp->snd_una - TCP_SKB_CB(skb)->seq;
-+ __pskb_trim_head(skb, len);
-+
-+ TCP_SKB_CB(skb)->seq += len;
-+ skb->ip_summed = CHECKSUM_PARTIAL;
-+ skb->truesize -= len;
-+
-+ /* Any change of skb->len requires recalculation of tso factor. */
-+ if (tcp_skb_pcount(skb) > 1)
-+ tcp_set_skb_tso_segs(meta_sk, skb, tcp_skb_mss(skb));
-+ packets_acked -= tcp_skb_pcount(skb);
-+
-+ if (packets_acked) {
-+ BUG_ON(tcp_skb_pcount(skb) == 0);
-+ BUG_ON(!before(TCP_SKB_CB(skb)->seq, TCP_SKB_CB(skb)->end_seq));
-+ }
-+
-+ return packets_acked;
-+}
-+
-+/**
-+ * Cleans the meta-socket retransmission queue and the reinject-queue.
-+ * @sk must be the metasocket.
-+ */
-+static void mptcp_clean_rtx_queue(struct sock *meta_sk, u32 prior_snd_una)
-+{
-+ struct sk_buff *skb, *tmp;
-+ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
-+ struct mptcp_cb *mpcb = meta_tp->mpcb;
-+ bool acked = false;
-+ u32 acked_pcount;
-+
-+ while ((skb = tcp_write_queue_head(meta_sk)) &&
-+ skb != tcp_send_head(meta_sk)) {
-+ bool fully_acked = true;
-+
-+ if (before(meta_tp->snd_una, TCP_SKB_CB(skb)->end_seq)) {
-+ if (tcp_skb_pcount(skb) == 1 ||
-+ !after(meta_tp->snd_una, TCP_SKB_CB(skb)->seq))
-+ break;
-+
-+ acked_pcount = tcp_tso_acked(meta_sk, skb);
-+ if (!acked_pcount)
-+ break;
-+
-+ fully_acked = false;
-+ } else {
-+ acked_pcount = tcp_skb_pcount(skb);
-+ }
-+
-+ acked = true;
-+ meta_tp->packets_out -= acked_pcount;
-+ meta_tp->retrans_stamp = 0;
-+
-+ if (!fully_acked)
-+ break;
-+
-+ tcp_unlink_write_queue(skb, meta_sk);
-+
-+ if (mptcp_is_data_fin(skb)) {
-+ struct sock *sk_it;
-+
-+ /* DATA_FIN has been acknowledged - now we can close
-+ * the subflows
-+ */
-+ mptcp_for_each_sk(mpcb, sk_it) {
-+ unsigned long delay = 0;
-+
-+ /* If we are the passive closer, don't trigger
-+ * subflow-fin until the subflow has been finned
-+ * by the peer - thus we add a delay.
-+ */
-+ if (mpcb->passive_close &&
-+ sk_it->sk_state == TCP_ESTABLISHED)
-+ delay = inet_csk(sk_it)->icsk_rto << 3;
-+
-+ mptcp_sub_close(sk_it, delay);
-+ }
-+ }
-+ sk_wmem_free_skb(meta_sk, skb);
-+ }
-+ /* Remove acknowledged data from the reinject queue */
-+ skb_queue_walk_safe(&mpcb->reinject_queue, skb, tmp) {
-+ if (before(meta_tp->snd_una, TCP_SKB_CB(skb)->end_seq)) {
-+ if (tcp_skb_pcount(skb) == 1 ||
-+ !after(meta_tp->snd_una, TCP_SKB_CB(skb)->seq))
-+ break;
-+
-+ mptcp_tso_acked_reinject(meta_sk, skb);
-+ break;
-+ }
-+
-+ __skb_unlink(skb, &mpcb->reinject_queue);
-+ __kfree_skb(skb);
-+ }
-+
-+ if (likely(between(meta_tp->snd_up, prior_snd_una, meta_tp->snd_una)))
-+ meta_tp->snd_up = meta_tp->snd_una;
-+
-+ if (acked) {
-+ tcp_rearm_rto(meta_sk);
-+ /* Normally this is done in tcp_try_undo_loss - but MPTCP
-+ * does not call this function.
-+ */
-+ inet_csk(meta_sk)->icsk_retransmits = 0;
-+ }
-+}
-+
-+/* Inspired by tcp_rcv_state_process */
-+static int mptcp_rcv_state_process(struct sock *meta_sk, struct sock *sk,
-+ const struct sk_buff *skb, u32 data_seq,
-+ u16 data_len)
-+{
-+ struct tcp_sock *meta_tp = tcp_sk(meta_sk), *tp = tcp_sk(sk);
-+ const struct tcphdr *th = tcp_hdr(skb);
-+
-+ /* State-machine handling if FIN has been enqueued and he has
-+ * been acked (snd_una == write_seq) - it's important that this
-+ * here is after sk_wmem_free_skb because otherwise
-+ * sk_forward_alloc is wrong upon inet_csk_destroy_sock()
-+ */
-+ switch (meta_sk->sk_state) {
-+ case TCP_FIN_WAIT1: {
-+ struct dst_entry *dst;
-+ int tmo;
-+
-+ if (meta_tp->snd_una != meta_tp->write_seq)
-+ break;
-+
-+ tcp_set_state(meta_sk, TCP_FIN_WAIT2);
-+ meta_sk->sk_shutdown |= SEND_SHUTDOWN;
-+
-+ dst = __sk_dst_get(sk);
-+ if (dst)
-+ dst_confirm(dst);
-+
-+ if (!sock_flag(meta_sk, SOCK_DEAD)) {
-+ /* Wake up lingering close() */
-+ meta_sk->sk_state_change(meta_sk);
-+ break;
-+ }
-+
-+ if (meta_tp->linger2 < 0 ||
-+ (data_len &&
-+ after(data_seq + data_len - (mptcp_is_data_fin2(skb, tp) ? 1 : 0),
-+ meta_tp->rcv_nxt))) {
-+ mptcp_send_active_reset(meta_sk, GFP_ATOMIC);
-+ tcp_done(meta_sk);
-+ NET_INC_STATS_BH(sock_net(meta_sk), LINUX_MIB_TCPABORTONDATA);
-+ return 1;
-+ }
-+
-+ tmo = tcp_fin_time(meta_sk);
-+ if (tmo > TCP_TIMEWAIT_LEN) {
-+ inet_csk_reset_keepalive_timer(meta_sk, tmo - TCP_TIMEWAIT_LEN);
-+ } else if (mptcp_is_data_fin2(skb, tp) || sock_owned_by_user(meta_sk)) {
-+ /* Bad case. We could lose such FIN otherwise.
-+ * It is not a big problem, but it looks confusing
-+ * and not so rare event. We still can lose it now,
-+ * if it spins in bh_lock_sock(), but it is really
-+ * marginal case.
-+ */
-+ inet_csk_reset_keepalive_timer(meta_sk, tmo);
-+ } else {
-+ meta_tp->ops->time_wait(meta_sk, TCP_FIN_WAIT2, tmo);
-+ }
-+ break;
-+ }
-+ case TCP_CLOSING:
-+ case TCP_LAST_ACK:
-+ if (meta_tp->snd_una == meta_tp->write_seq) {
-+ tcp_done(meta_sk);
-+ return 1;
-+ }
-+ break;
-+ }
-+
-+ /* step 7: process the segment text */
-+ switch (meta_sk->sk_state) {
-+ case TCP_FIN_WAIT1:
-+ case TCP_FIN_WAIT2:
-+ /* RFC 793 says to queue data in these states,
-+ * RFC 1122 says we MUST send a reset.
-+ * BSD 4.4 also does reset.
-+ */
-+ if (meta_sk->sk_shutdown & RCV_SHUTDOWN) {
-+ if (TCP_SKB_CB(skb)->end_seq != TCP_SKB_CB(skb)->seq &&
-+ after(TCP_SKB_CB(skb)->end_seq - th->fin, tp->rcv_nxt) &&
-+ !mptcp_is_data_fin2(skb, tp)) {
-+ NET_INC_STATS_BH(sock_net(meta_sk), LINUX_MIB_TCPABORTONDATA);
-+ mptcp_send_active_reset(meta_sk, GFP_ATOMIC);
-+ tcp_reset(meta_sk);
-+ return 1;
-+ }
-+ }
-+ break;
-+ }
-+
-+ return 0;
-+}
-+
-+/**
-+ * @return:
-+ * i) 1: Everything's fine.
-+ * ii) -1: A reset has been sent on the subflow - csum-failure
-+ * iii) 0: csum-failure but no reset sent, because it's the last subflow.
-+ * Last packet should not be destroyed by the caller because it has
-+ * been done here.
-+ */
-+static int mptcp_verif_dss_csum(struct sock *sk)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ struct sk_buff *tmp, *tmp1, *last = NULL;
-+ __wsum csum_tcp = 0; /* cumulative checksum of pld + mptcp-header */
-+ int ans = 1, overflowed = 0, offset = 0, dss_csum_added = 0;
-+ int iter = 0;
-+
-+ skb_queue_walk_safe(&sk->sk_receive_queue, tmp, tmp1) {
-+ unsigned int csum_len;
-+
-+ if (before(tp->mptcp->map_subseq + tp->mptcp->map_data_len, TCP_SKB_CB(tmp)->end_seq))
-+ /* Mapping ends in the middle of the packet -
-+ * csum only these bytes
-+ */
-+ csum_len = tp->mptcp->map_subseq + tp->mptcp->map_data_len - TCP_SKB_CB(tmp)->seq;
-+ else
-+ csum_len = tmp->len;
-+
-+ offset = 0;
-+ if (overflowed) {
-+ char first_word[4];
-+ first_word[0] = 0;
-+ first_word[1] = 0;
-+ first_word[2] = 0;
-+ first_word[3] = *(tmp->data);
-+ csum_tcp = csum_partial(first_word, 4, csum_tcp);
-+ offset = 1;
-+ csum_len--;
-+ overflowed = 0;
-+ }
-+
-+ csum_tcp = skb_checksum(tmp, offset, csum_len, csum_tcp);
-+
-+ /* Was it on an odd-length? Then we have to merge the next byte
-+ * correctly (see above)
-+ */
-+ if (csum_len != (csum_len & (~1)))
-+ overflowed = 1;
-+
-+ if (mptcp_is_data_seq(tmp) && !dss_csum_added) {
-+ __be32 data_seq = htonl((u32)(tp->mptcp->map_data_seq >> 32));
-+
-+ /* If a 64-bit dss is present, we increase the offset
-+ * by 4 bytes, as the high-order 64-bits will be added
-+ * in the final csum_partial-call.
-+ */
-+ u32 offset = skb_transport_offset(tmp) +
-+ TCP_SKB_CB(tmp)->dss_off;
-+ if (TCP_SKB_CB(tmp)->mptcp_flags & MPTCPHDR_SEQ64_SET)
-+ offset += 4;
-+
-+ csum_tcp = skb_checksum(tmp, offset,
-+ MPTCP_SUB_LEN_SEQ_CSUM,
-+ csum_tcp);
-+
-+ csum_tcp = csum_partial(&data_seq,
-+ sizeof(data_seq), csum_tcp);
-+
-+ dss_csum_added = 1; /* Just do it once */
-+ }
-+ last = tmp;
-+ iter++;
-+
-+ if (!skb_queue_is_last(&sk->sk_receive_queue, tmp) &&
-+ !before(TCP_SKB_CB(tmp1)->seq,
-+ tp->mptcp->map_subseq + tp->mptcp->map_data_len))
-+ break;
-+ }
-+
-+ /* Now, checksum must be 0 */
-+ if (unlikely(csum_fold(csum_tcp))) {
-+ pr_err("%s csum is wrong: %#x data_seq %u dss_csum_added %d overflowed %d iterations %d\n",
-+ __func__, csum_fold(csum_tcp), TCP_SKB_CB(last)->seq,
-+ dss_csum_added, overflowed, iter);
-+
-+ tp->mptcp->send_mp_fail = 1;
-+
-+ /* map_data_seq is the data-seq number of the
-+ * mapping we are currently checking
-+ */
-+ tp->mpcb->csum_cutoff_seq = tp->mptcp->map_data_seq;
-+
-+ if (tp->mpcb->cnt_subflows > 1) {
-+ mptcp_send_reset(sk);
-+ ans = -1;
-+ } else {
-+ tp->mpcb->send_infinite_mapping = 1;
-+
-+ /* Need to purge the rcv-queue as it's no more valid */
-+ while ((tmp = __skb_dequeue(&sk->sk_receive_queue)) != NULL) {
-+ tp->copied_seq = TCP_SKB_CB(tmp)->end_seq;
-+ kfree_skb(tmp);
-+ }
-+
-+ ans = 0;
-+ }
-+ }
-+
-+ return ans;
-+}
-+
-+static inline void mptcp_prepare_skb(struct sk_buff *skb,
-+ const struct sock *sk)
-+{
-+ const struct tcp_sock *tp = tcp_sk(sk);
-+ struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
-+ u32 inc = 0;
-+
-+ /* If skb is the end of this mapping (end is always at mapping-boundary
-+ * thanks to the splitting/trimming), then we need to increase
-+ * data-end-seq by 1 if this here is a data-fin.
-+ *
-+ * We need to do -1 because end_seq includes the subflow-FIN.
-+ */
-+ if (tp->mptcp->map_data_fin &&
-+ (tcb->end_seq - (tcp_hdr(skb)->fin ? 1 : 0)) ==
-+ (tp->mptcp->map_subseq + tp->mptcp->map_data_len)) {
-+ inc = 1;
-+
-+ /* We manually set the fin-flag if it is a data-fin. For easy
-+ * processing in tcp_recvmsg.
-+ */
-+ tcp_hdr(skb)->fin = 1;
-+ } else {
-+ /* We may have a subflow-fin with data but without data-fin */
-+ tcp_hdr(skb)->fin = 0;
-+ }
-+
-+ /* Adapt data-seq's to the packet itself. We kinda transform the
-+ * dss-mapping to a per-packet granularity. This is necessary to
-+ * correctly handle overlapping mappings coming from different
-+ * subflows. Otherwise it would be a complete mess.
-+ */
-+ tcb->seq = ((u32)tp->mptcp->map_data_seq) + tcb->seq - tp->mptcp->map_subseq;
-+ tcb->end_seq = tcb->seq + skb->len + inc;
-+}
-+
-+/**
-+ * @return: 1 if the segment has been eaten and can be suppressed,
-+ * otherwise 0.
-+ */
-+static inline int mptcp_direct_copy(const struct sk_buff *skb,
-+ struct sock *meta_sk)
-+{
-+ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
-+ int chunk = min_t(unsigned int, skb->len, meta_tp->ucopy.len);
-+ int eaten = 0;
-+
-+ __set_current_state(TASK_RUNNING);
-+
-+ local_bh_enable();
-+ if (!skb_copy_datagram_iovec(skb, 0, meta_tp->ucopy.iov, chunk)) {
-+ meta_tp->ucopy.len -= chunk;
-+ meta_tp->copied_seq += chunk;
-+ eaten = (chunk == skb->len);
-+ tcp_rcv_space_adjust(meta_sk);
-+ }
-+ local_bh_disable();
-+ return eaten;
-+}
-+
-+static inline void mptcp_reset_mapping(struct tcp_sock *tp)
-+{
-+ tp->mptcp->map_data_len = 0;
-+ tp->mptcp->map_data_seq = 0;
-+ tp->mptcp->map_subseq = 0;
-+ tp->mptcp->map_data_fin = 0;
-+ tp->mptcp->mapping_present = 0;
-+}
-+
-+/* The DSS-mapping received on the sk only covers the second half of the skb
-+ * (cut at seq). We trim the head from the skb.
-+ * Data will be freed upon kfree().
-+ *
-+ * Inspired by tcp_trim_head().
-+ */
-+static void mptcp_skb_trim_head(struct sk_buff *skb, struct sock *sk, u32 seq)
-+{
-+ int len = seq - TCP_SKB_CB(skb)->seq;
-+ u32 new_seq = TCP_SKB_CB(skb)->seq + len;
-+
-+ if (len < skb_headlen(skb))
-+ __skb_pull(skb, len);
-+ else
-+ __pskb_trim_head(skb, len - skb_headlen(skb));
-+
-+ TCP_SKB_CB(skb)->seq = new_seq;
-+
-+ skb->truesize -= len;
-+ atomic_sub(len, &sk->sk_rmem_alloc);
-+ sk_mem_uncharge(sk, len);
-+}
-+
-+/* The DSS-mapping received on the sk only covers the first half of the skb
-+ * (cut at seq). We create a second skb (@return), and queue it in the rcv-queue
-+ * as further packets may resolve the mapping of the second half of data.
-+ *
-+ * Inspired by tcp_fragment().
-+ */
-+static int mptcp_skb_split_tail(struct sk_buff *skb, struct sock *sk, u32 seq)
-+{
-+ struct sk_buff *buff;
-+ int nsize;
-+ int nlen, len;
-+
-+ len = seq - TCP_SKB_CB(skb)->seq;
-+ nsize = skb_headlen(skb) - len + tcp_sk(sk)->tcp_header_len;
-+ if (nsize < 0)
-+ nsize = 0;
-+
-+ /* Get a new skb... force flag on. */
-+ buff = alloc_skb(nsize, GFP_ATOMIC);
-+ if (buff == NULL)
-+ return -ENOMEM;
-+
-+ skb_reserve(buff, tcp_sk(sk)->tcp_header_len);
-+ skb_reset_transport_header(buff);
-+
-+ tcp_hdr(buff)->fin = tcp_hdr(skb)->fin;
-+ tcp_hdr(skb)->fin = 0;
-+
-+ /* We absolutly need to call skb_set_owner_r before refreshing the
-+ * truesize of buff, otherwise the moved data will account twice.
-+ */
-+ skb_set_owner_r(buff, sk);
-+ nlen = skb->len - len - nsize;
-+ buff->truesize += nlen;
-+ skb->truesize -= nlen;
-+
-+ /* Correct the sequence numbers. */
-+ TCP_SKB_CB(buff)->seq = TCP_SKB_CB(skb)->seq + len;
-+ TCP_SKB_CB(buff)->end_seq = TCP_SKB_CB(skb)->end_seq;
-+ TCP_SKB_CB(skb)->end_seq = TCP_SKB_CB(buff)->seq;
-+
-+ skb_split(skb, buff, len);
-+
-+ __skb_queue_after(&sk->sk_receive_queue, skb, buff);
-+
-+ return 0;
-+}
-+
-+/* @return: 0 everything is fine. Just continue processing
-+ * 1 subflow is broken stop everything
-+ * -1 this packet was broken - continue with the next one.
-+ */
-+static int mptcp_prevalidate_skb(struct sock *sk, struct sk_buff *skb)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+
-+ /* If we are in infinite mode, the subflow-fin is in fact a data-fin. */
-+ if (!skb->len && tcp_hdr(skb)->fin && !mptcp_is_data_fin(skb) &&
-+ !tp->mpcb->infinite_mapping_rcv) {
-+ /* Remove a pure subflow-fin from the queue and increase
-+ * copied_seq.
-+ */
-+ tp->copied_seq = TCP_SKB_CB(skb)->end_seq;
-+ __skb_unlink(skb, &sk->sk_receive_queue);
-+ __kfree_skb(skb);
-+ return -1;
-+ }
-+
-+ /* If we are not yet fully established and do not know the mapping for
-+ * this segment, this path has to fallback to infinite or be torn down.
-+ */
-+ if (!tp->mptcp->fully_established && !mptcp_is_data_seq(skb) &&
-+ !tp->mptcp->mapping_present && !tp->mpcb->infinite_mapping_rcv) {
-+ pr_err("%s %#x will fallback - pi %d from %pS, seq %u\n",
-+ __func__, tp->mpcb->mptcp_loc_token,
-+ tp->mptcp->path_index, __builtin_return_address(0),
-+ TCP_SKB_CB(skb)->seq);
-+
-+ if (!is_master_tp(tp)) {
-+ mptcp_send_reset(sk);
-+ return 1;
-+ }
-+
-+ tp->mpcb->infinite_mapping_snd = 1;
-+ tp->mpcb->infinite_mapping_rcv = 1;
-+ /* We do a seamless fallback and should not send a inf.mapping. */
-+ tp->mpcb->send_infinite_mapping = 0;
-+ tp->mptcp->fully_established = 1;
-+ }
-+
-+ /* Receiver-side becomes fully established when a whole rcv-window has
-+ * been received without the need to fallback due to the previous
-+ * condition.
-+ */
-+ if (!tp->mptcp->fully_established) {
-+ tp->mptcp->init_rcv_wnd -= skb->len;
-+ if (tp->mptcp->init_rcv_wnd < 0)
-+ mptcp_become_fully_estab(sk);
-+ }
-+
-+ return 0;
-+}
-+
-+/* @return: 0 everything is fine. Just continue processing
-+ * 1 subflow is broken stop everything
-+ * -1 this packet was broken - continue with the next one.
-+ */
-+static int mptcp_detect_mapping(struct sock *sk, struct sk_buff *skb)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk), *meta_tp = mptcp_meta_tp(tp);
-+ struct mptcp_cb *mpcb = tp->mpcb;
-+ struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
-+ u32 *ptr;
-+ u32 data_seq, sub_seq, data_len, tcp_end_seq;
-+
-+ /* If we are in infinite-mapping-mode, the subflow is guaranteed to be
-+ * in-order at the data-level. Thus data-seq-numbers can be inferred
-+ * from what is expected at the data-level.
-+ */
-+ if (mpcb->infinite_mapping_rcv) {
-+ tp->mptcp->map_data_seq = mptcp_get_rcv_nxt_64(meta_tp);
-+ tp->mptcp->map_subseq = tcb->seq;
-+ tp->mptcp->map_data_len = skb->len;
-+ tp->mptcp->map_data_fin = tcp_hdr(skb)->fin;
-+ tp->mptcp->mapping_present = 1;
-+ return 0;
-+ }
-+
-+ /* No mapping here? Exit - it is either already set or still on its way */
-+ if (!mptcp_is_data_seq(skb)) {
-+ /* Too many packets without a mapping - this subflow is broken */
-+ if (!tp->mptcp->mapping_present &&
-+ tp->rcv_nxt - tp->copied_seq > 65536) {
-+ mptcp_send_reset(sk);
-+ return 1;
-+ }
-+
-+ return 0;
-+ }
-+
-+ ptr = mptcp_skb_set_data_seq(skb, &data_seq, mpcb);
-+ ptr++;
-+ sub_seq = get_unaligned_be32(ptr) + tp->mptcp->rcv_isn;
-+ ptr++;
-+ data_len = get_unaligned_be16(ptr);
-+
-+ /* If it's an empty skb with DATA_FIN, sub_seq must get fixed.
-+ * The draft sets it to 0, but we really would like to have the
-+ * real value, to have an easy handling afterwards here in this
-+ * function.
-+ */
-+ if (mptcp_is_data_fin(skb) && skb->len == 0)
-+ sub_seq = TCP_SKB_CB(skb)->seq;
-+
-+ /* If there is already a mapping - we check if it maps with the current
-+ * one. If not - we reset.
-+ */
-+ if (tp->mptcp->mapping_present &&
-+ (data_seq != (u32)tp->mptcp->map_data_seq ||
-+ sub_seq != tp->mptcp->map_subseq ||
-+ data_len != tp->mptcp->map_data_len + tp->mptcp->map_data_fin ||
-+ mptcp_is_data_fin(skb) != tp->mptcp->map_data_fin)) {
-+ /* Mapping in packet is different from what we want */
-+ pr_err("%s Mappings do not match!\n", __func__);
-+ pr_err("%s dseq %u mdseq %u, sseq %u msseq %u dlen %u mdlen %u dfin %d mdfin %d\n",
-+ __func__, data_seq, (u32)tp->mptcp->map_data_seq,
-+ sub_seq, tp->mptcp->map_subseq, data_len,
-+ tp->mptcp->map_data_len, mptcp_is_data_fin(skb),
-+ tp->mptcp->map_data_fin);
-+ mptcp_send_reset(sk);
-+ return 1;
-+ }
-+
-+ /* If the previous check was good, the current mapping is valid and we exit. */
-+ if (tp->mptcp->mapping_present)
-+ return 0;
-+
-+ /* Mapping not yet set on this subflow - we set it here! */
-+
-+ if (!data_len) {
-+ mpcb->infinite_mapping_rcv = 1;
-+ tp->mptcp->fully_established = 1;
-+ /* We need to repeat mp_fail's until the sender felt
-+ * back to infinite-mapping - here we stop repeating it.
-+ */
-+ tp->mptcp->send_mp_fail = 0;
-+
-+ /* We have to fixup data_len - it must be the same as skb->len */
-+ data_len = skb->len + (mptcp_is_data_fin(skb) ? 1 : 0);
-+ sub_seq = tcb->seq;
-+
-+ /* TODO kill all other subflows than this one */
-+ /* data_seq and so on are set correctly */
-+
-+ /* At this point, the meta-ofo-queue has to be emptied,
-+ * as the following data is guaranteed to be in-order at
-+ * the data and subflow-level
-+ */
-+ mptcp_purge_ofo_queue(meta_tp);
-+ }
-+
-+ /* We are sending mp-fail's and thus are in fallback mode.
-+ * Ignore packets which do not announce the fallback and still
-+ * want to provide a mapping.
-+ */
-+ if (tp->mptcp->send_mp_fail) {
-+ tp->copied_seq = TCP_SKB_CB(skb)->end_seq;
-+ __skb_unlink(skb, &sk->sk_receive_queue);
-+ __kfree_skb(skb);
-+ return -1;
-+ }
-+
-+ /* FIN increased the mapping-length by 1 */
-+ if (mptcp_is_data_fin(skb))
-+ data_len--;
-+
-+ /* Subflow-sequences of packet must be
-+ * (at least partially) be part of the DSS-mapping's
-+ * subflow-sequence-space.
-+ *
-+ * Basically the mapping is not valid, if either of the
-+ * following conditions is true:
-+ *
-+ * 1. It's not a data_fin and
-+ * MPTCP-sub_seq >= TCP-end_seq
-+ *
-+ * 2. It's a data_fin and TCP-end_seq > TCP-seq and
-+ * MPTCP-sub_seq >= TCP-end_seq
-+ *
-+ * The previous two can be merged into:
-+ * TCP-end_seq > TCP-seq and MPTCP-sub_seq >= TCP-end_seq
-+ * Because if it's not a data-fin, TCP-end_seq > TCP-seq
-+ *
-+ * 3. It's a data_fin and skb->len == 0 and
-+ * MPTCP-sub_seq > TCP-end_seq
-+ *
-+ * 4. It's not a data_fin and TCP-end_seq > TCP-seq and
-+ * MPTCP-sub_seq + MPTCP-data_len <= TCP-seq
-+ *
-+ * 5. MPTCP-sub_seq is prior to what we already copied (copied_seq)
-+ */
-+
-+ /* subflow-fin is not part of the mapping - ignore it here ! */
-+ tcp_end_seq = tcb->end_seq - tcp_hdr(skb)->fin;
-+ if ((!before(sub_seq, tcb->end_seq) && after(tcp_end_seq, tcb->seq)) ||
-+ (mptcp_is_data_fin(skb) && skb->len == 0 && after(sub_seq, tcb->end_seq)) ||
-+ (!after(sub_seq + data_len, tcb->seq) && after(tcp_end_seq, tcb->seq)) ||
-+ before(sub_seq, tp->copied_seq)) {
-+ /* Subflow-sequences of packet is different from what is in the
-+ * packet's dss-mapping. The peer is misbehaving - reset
-+ */
-+ pr_err("%s Packet's mapping does not map to the DSS sub_seq %u "
-+ "end_seq %u, tcp_end_seq %u seq %u dfin %u len %u data_len %u"
-+ "copied_seq %u\n", __func__, sub_seq, tcb->end_seq, tcp_end_seq, tcb->seq, mptcp_is_data_fin(skb),
-+ skb->len, data_len, tp->copied_seq);
-+ mptcp_send_reset(sk);
-+ return 1;
-+ }
-+
-+ /* Does the DSS had 64-bit seqnum's ? */
-+ if (!(tcb->mptcp_flags & MPTCPHDR_SEQ64_SET)) {
-+ /* Wrapped around? */
-+ if (unlikely(after(data_seq, meta_tp->rcv_nxt) && data_seq < meta_tp->rcv_nxt)) {
-+ tp->mptcp->map_data_seq = mptcp_get_data_seq_64(mpcb, !mpcb->rcv_hiseq_index, data_seq);
-+ } else {
-+ /* Else, access the default high-order bits */
-+ tp->mptcp->map_data_seq = mptcp_get_data_seq_64(mpcb, mpcb->rcv_hiseq_index, data_seq);
-+ }
-+ } else {
-+ tp->mptcp->map_data_seq = mptcp_get_data_seq_64(mpcb, (tcb->mptcp_flags & MPTCPHDR_SEQ64_INDEX) ? 1 : 0, data_seq);
-+
-+ if (unlikely(tcb->mptcp_flags & MPTCPHDR_SEQ64_OFO)) {
-+ /* We make sure that the data_seq is invalid.
-+ * It will be dropped later.
-+ */
-+ tp->mptcp->map_data_seq += 0xFFFFFFFF;
-+ tp->mptcp->map_data_seq += 0xFFFFFFFF;
-+ }
-+ }
-+
-+ tp->mptcp->map_data_len = data_len;
-+ tp->mptcp->map_subseq = sub_seq;
-+ tp->mptcp->map_data_fin = mptcp_is_data_fin(skb) ? 1 : 0;
-+ tp->mptcp->mapping_present = 1;
-+
-+ return 0;
-+}
-+
-+/* Similar to tcp_sequence(...) */
-+static inline bool mptcp_sequence(const struct tcp_sock *meta_tp,
-+ u64 data_seq, u64 end_data_seq)
-+{
-+ const struct mptcp_cb *mpcb = meta_tp->mpcb;
-+ u64 rcv_wup64;
-+
-+ /* Wrap-around? */
-+ if (meta_tp->rcv_wup > meta_tp->rcv_nxt) {
-+ rcv_wup64 = ((u64)(mpcb->rcv_high_order[mpcb->rcv_hiseq_index] - 1) << 32) |
-+ meta_tp->rcv_wup;
-+ } else {
-+ rcv_wup64 = mptcp_get_data_seq_64(mpcb, mpcb->rcv_hiseq_index,
-+ meta_tp->rcv_wup);
-+ }
-+
-+ return !before64(end_data_seq, rcv_wup64) &&
-+ !after64(data_seq, mptcp_get_rcv_nxt_64(meta_tp) + tcp_receive_window(meta_tp));
-+}
-+
-+/* @return: 0 everything is fine. Just continue processing
-+ * -1 this packet was broken - continue with the next one.
-+ */
-+static int mptcp_validate_mapping(struct sock *sk, struct sk_buff *skb)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ struct sk_buff *tmp, *tmp1;
-+ u32 tcp_end_seq;
-+
-+ if (!tp->mptcp->mapping_present)
-+ return 0;
-+
-+ /* either, the new skb gave us the mapping and the first segment
-+ * in the sub-rcv-queue has to be trimmed ...
-+ */
-+ tmp = skb_peek(&sk->sk_receive_queue);
-+ if (before(TCP_SKB_CB(tmp)->seq, tp->mptcp->map_subseq) &&
-+ after(TCP_SKB_CB(tmp)->end_seq, tp->mptcp->map_subseq))
-+ mptcp_skb_trim_head(tmp, sk, tp->mptcp->map_subseq);
-+
-+ /* ... or the new skb (tail) has to be split at the end. */
-+ tcp_end_seq = TCP_SKB_CB(skb)->end_seq - (tcp_hdr(skb)->fin ? 1 : 0);
-+ if (after(tcp_end_seq, tp->mptcp->map_subseq + tp->mptcp->map_data_len)) {
-+ u32 seq = tp->mptcp->map_subseq + tp->mptcp->map_data_len;
-+ if (mptcp_skb_split_tail(skb, sk, seq)) { /* Allocation failed */
-+ /* TODO : maybe handle this here better.
-+ * We now just force meta-retransmission.
-+ */
-+ tp->copied_seq = TCP_SKB_CB(skb)->end_seq;
-+ __skb_unlink(skb, &sk->sk_receive_queue);
-+ __kfree_skb(skb);
-+ return -1;
-+ }
-+ }
-+
-+ /* Now, remove old sk_buff's from the receive-queue.
-+ * This may happen if the mapping has been lost for these segments and
-+ * the next mapping has already been received.
-+ */
-+ if (before(TCP_SKB_CB(skb_peek(&sk->sk_receive_queue))->seq, tp->mptcp->map_subseq)) {
-+ skb_queue_walk_safe(&sk->sk_receive_queue, tmp1, tmp) {
-+ if (!before(TCP_SKB_CB(tmp1)->seq, tp->mptcp->map_subseq))
-+ break;
-+
-+ tp->copied_seq = TCP_SKB_CB(tmp1)->end_seq;
-+ __skb_unlink(tmp1, &sk->sk_receive_queue);
-+
-+ /* Impossible that we could free skb here, because his
-+ * mapping is known to be valid from previous checks
-+ */
-+ __kfree_skb(tmp1);
-+ }
-+ }
-+
-+ return 0;
-+}
-+
-+/* @return: 0 everything is fine. Just continue processing
-+ * 1 subflow is broken stop everything
-+ * -1 this mapping has been put in the meta-receive-queue
-+ * -2 this mapping has been eaten by the application
-+ */
-+static int mptcp_queue_skb(struct sock *sk)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk), *meta_tp = mptcp_meta_tp(tp);
-+ struct sock *meta_sk = mptcp_meta_sk(sk);
-+ struct mptcp_cb *mpcb = tp->mpcb;
-+ struct sk_buff *tmp, *tmp1;
-+ u64 rcv_nxt64 = mptcp_get_rcv_nxt_64(meta_tp);
-+ bool data_queued = false;
-+
-+ /* Have we not yet received the full mapping? */
-+ if (!tp->mptcp->mapping_present ||
-+ before(tp->rcv_nxt, tp->mptcp->map_subseq + tp->mptcp->map_data_len))
-+ return 0;
-+
-+ /* Is this an overlapping mapping? rcv_nxt >= end_data_seq
-+ * OR
-+ * This mapping is out of window
-+ */
-+ if (!before64(rcv_nxt64, tp->mptcp->map_data_seq + tp->mptcp->map_data_len + tp->mptcp->map_data_fin) ||
-+ !mptcp_sequence(meta_tp, tp->mptcp->map_data_seq,
-+ tp->mptcp->map_data_seq + tp->mptcp->map_data_len + tp->mptcp->map_data_fin)) {
-+ skb_queue_walk_safe(&sk->sk_receive_queue, tmp1, tmp) {
-+ __skb_unlink(tmp1, &sk->sk_receive_queue);
-+ tp->copied_seq = TCP_SKB_CB(tmp1)->end_seq;
-+ __kfree_skb(tmp1);
-+
-+ if (!skb_queue_empty(&sk->sk_receive_queue) &&
-+ !before(TCP_SKB_CB(tmp)->seq,
-+ tp->mptcp->map_subseq + tp->mptcp->map_data_len))
-+ break;
-+ }
-+
-+ mptcp_reset_mapping(tp);
-+
-+ return -1;
-+ }
-+
-+ /* Record it, because we want to send our data_fin on the same path */
-+ if (tp->mptcp->map_data_fin) {
-+ mpcb->dfin_path_index = tp->mptcp->path_index;
-+ mpcb->dfin_combined = !!(sk->sk_shutdown & RCV_SHUTDOWN);
-+ }
-+
-+ /* Verify the checksum */
-+ if (mpcb->dss_csum && !mpcb->infinite_mapping_rcv) {
-+ int ret = mptcp_verif_dss_csum(sk);
-+
-+ if (ret <= 0) {
-+ mptcp_reset_mapping(tp);
-+ return 1;
-+ }
-+ }
-+
-+ if (before64(rcv_nxt64, tp->mptcp->map_data_seq)) {
-+ /* Seg's have to go to the meta-ofo-queue */
-+ skb_queue_walk_safe(&sk->sk_receive_queue, tmp1, tmp) {
-+ tp->copied_seq = TCP_SKB_CB(tmp1)->end_seq;
-+ mptcp_prepare_skb(tmp1, sk);
-+ __skb_unlink(tmp1, &sk->sk_receive_queue);
-+ /* MUST be done here, because fragstolen may be true later.
-+ * Then, kfree_skb_partial will not account the memory.
-+ */
-+ skb_orphan(tmp1);
-+
-+ if (!mpcb->in_time_wait) /* In time-wait, do not receive data */
-+ mptcp_add_meta_ofo_queue(meta_sk, tmp1, sk);
-+ else
-+ __kfree_skb(tmp1);
-+
-+ if (!skb_queue_empty(&sk->sk_receive_queue) &&
-+ !before(TCP_SKB_CB(tmp)->seq,
-+ tp->mptcp->map_subseq + tp->mptcp->map_data_len))
-+ break;
-+ }
-+ tcp_enter_quickack_mode(sk);
-+ } else {
-+ /* Ready for the meta-rcv-queue */
-+ skb_queue_walk_safe(&sk->sk_receive_queue, tmp1, tmp) {
-+ int eaten = 0;
-+ const bool copied_early = false;
-+ bool fragstolen = false;
-+ u32 old_rcv_nxt = meta_tp->rcv_nxt;
-+
-+ tp->copied_seq = TCP_SKB_CB(tmp1)->end_seq;
-+ mptcp_prepare_skb(tmp1, sk);
-+ __skb_unlink(tmp1, &sk->sk_receive_queue);
-+ /* MUST be done here, because fragstolen may be true.
-+ * Then, kfree_skb_partial will not account the memory.
-+ */
-+ skb_orphan(tmp1);
-+
-+ /* This segment has already been received */
-+ if (!after(TCP_SKB_CB(tmp1)->end_seq, meta_tp->rcv_nxt)) {
-+ __kfree_skb(tmp1);
-+ goto next;
-+ }
-+
-+#ifdef CONFIG_NET_DMA
-+ if (TCP_SKB_CB(tmp1)->seq == meta_tp->rcv_nxt &&
-+ meta_tp->ucopy.task == current &&
-+ meta_tp->copied_seq == meta_tp->rcv_nxt &&
-+ tmp1->len <= meta_tp->ucopy.len &&
-+ sock_owned_by_user(meta_sk) &&
-+ tcp_dma_try_early_copy(meta_sk, tmp1, 0)) {
-+ copied_early = true;
-+ eaten = 1;
-+ }
-+#endif
-+
-+ /* Is direct copy possible ? */
-+ if (TCP_SKB_CB(tmp1)->seq == meta_tp->rcv_nxt &&
-+ meta_tp->ucopy.task == current &&
-+ meta_tp->copied_seq == meta_tp->rcv_nxt &&
-+ meta_tp->ucopy.len && sock_owned_by_user(meta_sk) &&
-+ !copied_early)
-+ eaten = mptcp_direct_copy(tmp1, meta_sk);
-+
-+ if (mpcb->in_time_wait) /* In time-wait, do not receive data */
-+ eaten = 1;
-+
-+ if (!eaten)
-+ eaten = tcp_queue_rcv(meta_sk, tmp1, 0, &fragstolen);
-+
-+ meta_tp->rcv_nxt = TCP_SKB_CB(tmp1)->end_seq;
-+ mptcp_check_rcvseq_wrap(meta_tp, old_rcv_nxt);
-+
-+#ifdef CONFIG_NET_DMA
-+ if (copied_early)
-+ meta_tp->cleanup_rbuf(meta_sk, tmp1->len);
-+#endif
-+
-+ if (tcp_hdr(tmp1)->fin && !mpcb->in_time_wait)
-+ mptcp_fin(meta_sk);
-+
-+ /* Check if this fills a gap in the ofo queue */
-+ if (!skb_queue_empty(&meta_tp->out_of_order_queue))
-+ mptcp_ofo_queue(meta_sk);
-+
-+#ifdef CONFIG_NET_DMA
-+ if (copied_early)
-+ __skb_queue_tail(&meta_sk->sk_async_wait_queue,
-+ tmp1);
-+ else
-+#endif
-+ if (eaten)
-+ kfree_skb_partial(tmp1, fragstolen);
-+
-+ data_queued = true;
-+next:
-+ if (!skb_queue_empty(&sk->sk_receive_queue) &&
-+ !before(TCP_SKB_CB(tmp)->seq,
-+ tp->mptcp->map_subseq + tp->mptcp->map_data_len))
-+ break;
-+ }
-+ }
-+
-+ inet_csk(meta_sk)->icsk_ack.lrcvtime = tcp_time_stamp;
-+ mptcp_reset_mapping(tp);
-+
-+ return data_queued ? -1 : -2;
-+}
-+
-+void mptcp_data_ready(struct sock *sk)
-+{
-+ struct sock *meta_sk = mptcp_meta_sk(sk);
-+ struct sk_buff *skb, *tmp;
-+ int queued = 0;
-+
-+ /* restart before the check, because mptcp_fin might have changed the
-+ * state.
-+ */
-+restart:
-+ /* If the meta cannot receive data, there is no point in pushing data.
-+ * If we are in time-wait, we may still be waiting for the final FIN.
-+ * So, we should proceed with the processing.
-+ */
-+ if (!mptcp_sk_can_recv(meta_sk) && !tcp_sk(sk)->mpcb->in_time_wait) {
-+ skb_queue_purge(&sk->sk_receive_queue);
-+ tcp_sk(sk)->copied_seq = tcp_sk(sk)->rcv_nxt;
-+ goto exit;
-+ }
-+
-+ /* Iterate over all segments, detect their mapping (if we don't have
-+ * one yet), validate them and push everything one level higher.
-+ */
-+ skb_queue_walk_safe(&sk->sk_receive_queue, skb, tmp) {
-+ int ret;
-+ /* Pre-validation - e.g., early fallback */
-+ ret = mptcp_prevalidate_skb(sk, skb);
-+ if (ret < 0)
-+ goto restart;
-+ else if (ret > 0)
-+ break;
-+
-+ /* Set the current mapping */
-+ ret = mptcp_detect_mapping(sk, skb);
-+ if (ret < 0)
-+ goto restart;
-+ else if (ret > 0)
-+ break;
-+
-+ /* Validation */
-+ if (mptcp_validate_mapping(sk, skb) < 0)
-+ goto restart;
-+
-+ /* Push a level higher */
-+ ret = mptcp_queue_skb(sk);
-+ if (ret < 0) {
-+ if (ret == -1)
-+ queued = ret;
-+ goto restart;
-+ } else if (ret == 0) {
-+ continue;
-+ } else { /* ret == 1 */
-+ break;
-+ }
-+ }
-+
-+exit:
-+ if (tcp_sk(sk)->close_it) {
-+ tcp_send_ack(sk);
-+ tcp_sk(sk)->ops->time_wait(sk, TCP_TIME_WAIT, 0);
-+ }
-+
-+ if (queued == -1 && !sock_flag(meta_sk, SOCK_DEAD))
-+ meta_sk->sk_data_ready(meta_sk);
-+}
-+
-+
-+int mptcp_check_req(struct sk_buff *skb, struct net *net)
-+{
-+ const struct tcphdr *th = tcp_hdr(skb);
-+ struct sock *meta_sk = NULL;
-+
-+ /* MPTCP structures not initialized */
-+ if (mptcp_init_failed)
-+ return 0;
-+
-+ if (skb->protocol == htons(ETH_P_IP))
-+ meta_sk = mptcp_v4_search_req(th->source, ip_hdr(skb)->saddr,
-+ ip_hdr(skb)->daddr, net);
-+#if IS_ENABLED(CONFIG_IPV6)
-+ else /* IPv6 */
-+ meta_sk = mptcp_v6_search_req(th->source, &ipv6_hdr(skb)->saddr,
-+ &ipv6_hdr(skb)->daddr, net);
-+#endif /* CONFIG_IPV6 */
-+
-+ if (!meta_sk)
-+ return 0;
-+
-+ TCP_SKB_CB(skb)->mptcp_flags |= MPTCPHDR_JOIN;
-+
-+ bh_lock_sock_nested(meta_sk);
-+ if (sock_owned_by_user(meta_sk)) {
-+ skb->sk = meta_sk;
-+ if (unlikely(sk_add_backlog(meta_sk, skb,
-+ meta_sk->sk_rcvbuf + meta_sk->sk_sndbuf))) {
-+ bh_unlock_sock(meta_sk);
-+ NET_INC_STATS_BH(net, LINUX_MIB_TCPBACKLOGDROP);
-+ sock_put(meta_sk); /* Taken by mptcp_search_req */
-+ kfree_skb(skb);
-+ return 1;
-+ }
-+ } else if (skb->protocol == htons(ETH_P_IP)) {
-+ tcp_v4_do_rcv(meta_sk, skb);
-+#if IS_ENABLED(CONFIG_IPV6)
-+ } else { /* IPv6 */
-+ tcp_v6_do_rcv(meta_sk, skb);
-+#endif /* CONFIG_IPV6 */
-+ }
-+ bh_unlock_sock(meta_sk);
-+ sock_put(meta_sk); /* Taken by mptcp_vX_search_req */
-+ return 1;
-+}
-+
-+struct mp_join *mptcp_find_join(const struct sk_buff *skb)
-+{
-+ const struct tcphdr *th = tcp_hdr(skb);
-+ unsigned char *ptr;
-+ int length = (th->doff * 4) - sizeof(struct tcphdr);
-+
-+ /* Jump through the options to check whether JOIN is there */
-+ ptr = (unsigned char *)(th + 1);
-+ while (length > 0) {
-+ int opcode = *ptr++;
-+ int opsize;
-+
-+ switch (opcode) {
-+ case TCPOPT_EOL:
-+ return NULL;
-+ case TCPOPT_NOP: /* Ref: RFC 793 section 3.1 */
-+ length--;
-+ continue;
-+ default:
-+ opsize = *ptr++;
-+ if (opsize < 2) /* "silly options" */
-+ return NULL;
-+ if (opsize > length)
-+ return NULL; /* don't parse partial options */
-+ if (opcode == TCPOPT_MPTCP &&
-+ ((struct mptcp_option *)(ptr - 2))->sub == MPTCP_SUB_JOIN) {
-+ return (struct mp_join *)(ptr - 2);
-+ }
-+ ptr += opsize - 2;
-+ length -= opsize;
-+ }
-+ }
-+ return NULL;
-+}
-+
-+int mptcp_lookup_join(struct sk_buff *skb, struct inet_timewait_sock *tw)
-+{
-+ const struct mptcp_cb *mpcb;
-+ struct sock *meta_sk;
-+ u32 token;
-+ bool meta_v4;
-+ struct mp_join *join_opt = mptcp_find_join(skb);
-+ if (!join_opt)
-+ return 0;
-+
-+ /* MPTCP structures were not initialized, so return error */
-+ if (mptcp_init_failed)
-+ return -1;
-+
-+ token = join_opt->u.syn.token;
-+ meta_sk = mptcp_hash_find(dev_net(skb_dst(skb)->dev), token);
-+ if (!meta_sk) {
-+ mptcp_debug("%s:mpcb not found:%x\n", __func__, token);
-+ return -1;
-+ }
-+
-+ meta_v4 = meta_sk->sk_family == AF_INET;
-+ if (meta_v4) {
-+ if (skb->protocol == htons(ETH_P_IPV6)) {
-+ mptcp_debug("SYN+MP_JOIN with IPV6 address on pure IPV4 meta\n");
-+ sock_put(meta_sk); /* Taken by mptcp_hash_find */
-+ return -1;
-+ }
-+ } else if (skb->protocol == htons(ETH_P_IP) &&
-+ inet6_sk(meta_sk)->ipv6only) {
-+ mptcp_debug("SYN+MP_JOIN with IPV4 address on IPV6_V6ONLY meta\n");
-+ sock_put(meta_sk); /* Taken by mptcp_hash_find */
-+ return -1;
-+ }
-+
-+ mpcb = tcp_sk(meta_sk)->mpcb;
-+ if (mpcb->infinite_mapping_rcv || mpcb->send_infinite_mapping) {
-+ /* We are in fallback-mode on the reception-side -
-+ * no new subflows!
-+ */
-+ sock_put(meta_sk); /* Taken by mptcp_hash_find */
-+ return -1;
-+ }
-+
-+ /* Coming from time-wait-sock processing in tcp_v4_rcv.
-+ * We have to deschedule it before continuing, because otherwise
-+ * mptcp_v4_do_rcv will hit again on it inside tcp_v4_hnd_req.
-+ */
-+ if (tw) {
-+ inet_twsk_deschedule(tw, &tcp_death_row);
-+ inet_twsk_put(tw);
-+ }
-+
-+ TCP_SKB_CB(skb)->mptcp_flags |= MPTCPHDR_JOIN;
-+ /* OK, this is a new syn/join, let's create a new open request and
-+ * send syn+ack
-+ */
-+ bh_lock_sock_nested(meta_sk);
-+ if (sock_owned_by_user(meta_sk)) {
-+ skb->sk = meta_sk;
-+ if (unlikely(sk_add_backlog(meta_sk, skb,
-+ meta_sk->sk_rcvbuf + meta_sk->sk_sndbuf))) {
-+ bh_unlock_sock(meta_sk);
-+ NET_INC_STATS_BH(sock_net(meta_sk),
-+ LINUX_MIB_TCPBACKLOGDROP);
-+ sock_put(meta_sk); /* Taken by mptcp_hash_find */
-+ kfree_skb(skb);
-+ return 1;
-+ }
-+ } else if (skb->protocol == htons(ETH_P_IP)) {
-+ tcp_v4_do_rcv(meta_sk, skb);
-+#if IS_ENABLED(CONFIG_IPV6)
-+ } else {
-+ tcp_v6_do_rcv(meta_sk, skb);
-+#endif /* CONFIG_IPV6 */
-+ }
-+ bh_unlock_sock(meta_sk);
-+ sock_put(meta_sk); /* Taken by mptcp_hash_find */
-+ return 1;
-+}
-+
-+int mptcp_do_join_short(struct sk_buff *skb,
-+ const struct mptcp_options_received *mopt,
-+ struct net *net)
-+{
-+ struct sock *meta_sk;
-+ u32 token;
-+ bool meta_v4;
-+
-+ token = mopt->mptcp_rem_token;
-+ meta_sk = mptcp_hash_find(net, token);
-+ if (!meta_sk) {
-+ mptcp_debug("%s:mpcb not found:%x\n", __func__, token);
-+ return -1;
-+ }
-+
-+ meta_v4 = meta_sk->sk_family == AF_INET;
-+ if (meta_v4) {
-+ if (skb->protocol == htons(ETH_P_IPV6)) {
-+ mptcp_debug("SYN+MP_JOIN with IPV6 address on pure IPV4 meta\n");
-+ sock_put(meta_sk); /* Taken by mptcp_hash_find */
-+ return -1;
-+ }
-+ } else if (skb->protocol == htons(ETH_P_IP) &&
-+ inet6_sk(meta_sk)->ipv6only) {
-+ mptcp_debug("SYN+MP_JOIN with IPV4 address on IPV6_V6ONLY meta\n");
-+ sock_put(meta_sk); /* Taken by mptcp_hash_find */
-+ return -1;
-+ }
-+
-+ TCP_SKB_CB(skb)->mptcp_flags |= MPTCPHDR_JOIN;
-+
-+ /* OK, this is a new syn/join, let's create a new open request and
-+ * send syn+ack
-+ */
-+ bh_lock_sock(meta_sk);
-+
-+ /* This check is also done in mptcp_vX_do_rcv. But, there we cannot
-+ * call tcp_vX_send_reset, because we hold already two socket-locks.
-+ * (the listener and the meta from above)
-+ *
-+ * And the send-reset will try to take yet another one (ip_send_reply).
-+ * Thus, we propagate the reset up to tcp_rcv_state_process.
-+ */
-+ if (tcp_sk(meta_sk)->mpcb->infinite_mapping_rcv ||
-+ tcp_sk(meta_sk)->mpcb->send_infinite_mapping ||
-+ meta_sk->sk_state == TCP_CLOSE || !tcp_sk(meta_sk)->inside_tk_table) {
-+ bh_unlock_sock(meta_sk);
-+ sock_put(meta_sk); /* Taken by mptcp_hash_find */
-+ return -1;
-+ }
-+
-+ if (sock_owned_by_user(meta_sk)) {
-+ skb->sk = meta_sk;
-+ if (unlikely(sk_add_backlog(meta_sk, skb,
-+ meta_sk->sk_rcvbuf + meta_sk->sk_sndbuf)))
-+ NET_INC_STATS_BH(net, LINUX_MIB_TCPBACKLOGDROP);
-+ else
-+ /* Must make sure that upper layers won't free the
-+ * skb if it is added to the backlog-queue.
-+ */
-+ skb_get(skb);
-+ } else {
-+ /* mptcp_v4_do_rcv tries to free the skb - we prevent this, as
-+ * the skb will finally be freed by tcp_v4_do_rcv (where we are
-+ * coming from)
-+ */
-+ skb_get(skb);
-+ if (skb->protocol == htons(ETH_P_IP)) {
-+ tcp_v4_do_rcv(meta_sk, skb);
-+#if IS_ENABLED(CONFIG_IPV6)
-+ } else { /* IPv6 */
-+ tcp_v6_do_rcv(meta_sk, skb);
-+#endif /* CONFIG_IPV6 */
-+ }
-+ }
-+
-+ bh_unlock_sock(meta_sk);
-+ sock_put(meta_sk); /* Taken by mptcp_hash_find */
-+ return 0;
-+}
-+
-+/**
-+ * Equivalent of tcp_fin() for MPTCP
-+ * Can be called only when the FIN is validly part
-+ * of the data seqnum space. Not before when we get holes.
-+ */
-+void mptcp_fin(struct sock *meta_sk)
-+{
-+ struct sock *sk = NULL, *sk_it;
-+ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
-+ struct mptcp_cb *mpcb = meta_tp->mpcb;
-+
-+ mptcp_for_each_sk(mpcb, sk_it) {
-+ if (tcp_sk(sk_it)->mptcp->path_index == mpcb->dfin_path_index) {
-+ sk = sk_it;
-+ break;
-+ }
-+ }
-+
-+ if (!sk || sk->sk_state == TCP_CLOSE)
-+ sk = mptcp_select_ack_sock(meta_sk);
-+
-+ inet_csk_schedule_ack(sk);
-+
-+ meta_sk->sk_shutdown |= RCV_SHUTDOWN;
-+ sock_set_flag(meta_sk, SOCK_DONE);
-+
-+ switch (meta_sk->sk_state) {
-+ case TCP_SYN_RECV:
-+ case TCP_ESTABLISHED:
-+ /* Move to CLOSE_WAIT */
-+ tcp_set_state(meta_sk, TCP_CLOSE_WAIT);
-+ inet_csk(sk)->icsk_ack.pingpong = 1;
-+ break;
-+
-+ case TCP_CLOSE_WAIT:
-+ case TCP_CLOSING:
-+ /* Received a retransmission of the FIN, do
-+ * nothing.
-+ */
-+ break;
-+ case TCP_LAST_ACK:
-+ /* RFC793: Remain in the LAST-ACK state. */
-+ break;
-+
-+ case TCP_FIN_WAIT1:
-+ /* This case occurs when a simultaneous close
-+ * happens, we must ack the received FIN and
-+ * enter the CLOSING state.
-+ */
-+ tcp_send_ack(sk);
-+ tcp_set_state(meta_sk, TCP_CLOSING);
-+ break;
-+ case TCP_FIN_WAIT2:
-+ /* Received a FIN -- send ACK and enter TIME_WAIT. */
-+ tcp_send_ack(sk);
-+ meta_tp->ops->time_wait(meta_sk, TCP_TIME_WAIT, 0);
-+ break;
-+ default:
-+ /* Only TCP_LISTEN and TCP_CLOSE are left, in these
-+ * cases we should never reach this piece of code.
-+ */
-+ pr_err("%s: Impossible, meta_sk->sk_state=%d\n", __func__,
-+ meta_sk->sk_state);
-+ break;
-+ }
-+
-+ /* It _is_ possible, that we have something out-of-order _after_ FIN.
-+ * Probably, we should reset in this case. For now drop them.
-+ */
-+ mptcp_purge_ofo_queue(meta_tp);
-+ sk_mem_reclaim(meta_sk);
-+
-+ if (!sock_flag(meta_sk, SOCK_DEAD)) {
-+ meta_sk->sk_state_change(meta_sk);
-+
-+ /* Do not send POLL_HUP for half duplex close. */
-+ if (meta_sk->sk_shutdown == SHUTDOWN_MASK ||
-+ meta_sk->sk_state == TCP_CLOSE)
-+ sk_wake_async(meta_sk, SOCK_WAKE_WAITD, POLL_HUP);
-+ else
-+ sk_wake_async(meta_sk, SOCK_WAKE_WAITD, POLL_IN);
-+ }
-+
-+ return;
-+}
-+
-+static void mptcp_xmit_retransmit_queue(struct sock *meta_sk)
-+{
-+ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
-+ struct sk_buff *skb;
-+
-+ if (!meta_tp->packets_out)
-+ return;
-+
-+ tcp_for_write_queue(skb, meta_sk) {
-+ if (skb == tcp_send_head(meta_sk))
-+ break;
-+
-+ if (mptcp_retransmit_skb(meta_sk, skb))
-+ return;
-+
-+ if (skb == tcp_write_queue_head(meta_sk))
-+ inet_csk_reset_xmit_timer(meta_sk, ICSK_TIME_RETRANS,
-+ inet_csk(meta_sk)->icsk_rto,
-+ TCP_RTO_MAX);
-+ }
-+}
-+
-+/* Handle the DATA_ACK */
-+static void mptcp_data_ack(struct sock *sk, const struct sk_buff *skb)
-+{
-+ struct sock *meta_sk = mptcp_meta_sk(sk);
-+ struct tcp_sock *meta_tp = tcp_sk(meta_sk), *tp = tcp_sk(sk);
-+ struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
-+ u32 prior_snd_una = meta_tp->snd_una;
-+ int prior_packets;
-+ u32 nwin, data_ack, data_seq;
-+ u16 data_len = 0;
-+
-+ /* A valid packet came in - subflow is operational again */
-+ tp->pf = 0;
-+
-+ /* Even if there is no data-ack, we stop retransmitting.
-+ * Except if this is a SYN/ACK. Then it is just a retransmission
-+ */
-+ if (tp->mptcp->pre_established && !tcp_hdr(skb)->syn) {
-+ tp->mptcp->pre_established = 0;
-+ sk_stop_timer(sk, &tp->mptcp->mptcp_ack_timer);
-+ }
-+
-+ /* If we are in infinite mapping mode, rx_opt.data_ack has been
-+ * set by mptcp_clean_rtx_infinite.
-+ */
-+ if (!(tcb->mptcp_flags & MPTCPHDR_ACK) && !tp->mpcb->infinite_mapping_snd)
-+ goto exit;
-+
-+ data_ack = tp->mptcp->rx_opt.data_ack;
-+
-+ if (unlikely(!tp->mptcp->fully_established) &&
-+ tp->mptcp->snt_isn + 1 != TCP_SKB_CB(skb)->ack_seq)
-+ /* As soon as a subflow-data-ack (not acking syn, thus snt_isn + 1)
-+ * includes a data-ack, we are fully established
-+ */
-+ mptcp_become_fully_estab(sk);
-+
-+ /* Get the data_seq */
-+ if (mptcp_is_data_seq(skb)) {
-+ data_seq = tp->mptcp->rx_opt.data_seq;
-+ data_len = tp->mptcp->rx_opt.data_len;
-+ } else {
-+ data_seq = meta_tp->snd_wl1;
-+ }
-+
-+ /* If the ack is older than previous acks
-+ * then we can probably ignore it.
-+ */
-+ if (before(data_ack, prior_snd_una))
-+ goto exit;
-+
-+ /* If the ack includes data we haven't sent yet, discard
-+ * this segment (RFC793 Section 3.9).
-+ */
-+ if (after(data_ack, meta_tp->snd_nxt))
-+ goto exit;
-+
-+ /*** Now, update the window - inspired by tcp_ack_update_window ***/
-+ nwin = ntohs(tcp_hdr(skb)->window);
-+
-+ if (likely(!tcp_hdr(skb)->syn))
-+ nwin <<= tp->rx_opt.snd_wscale;
-+
-+ if (tcp_may_update_window(meta_tp, data_ack, data_seq, nwin)) {
-+ tcp_update_wl(meta_tp, data_seq);
-+
-+ /* Draft v09, Section 3.3.5:
-+ * [...] It should only update its local receive window values
-+ * when the largest sequence number allowed (i.e. DATA_ACK +
-+ * receive window) increases. [...]
-+ */
-+ if (meta_tp->snd_wnd != nwin &&
-+ !before(data_ack + nwin, tcp_wnd_end(meta_tp))) {
-+ meta_tp->snd_wnd = nwin;
-+
-+ if (nwin > meta_tp->max_window)
-+ meta_tp->max_window = nwin;
-+ }
-+ }
-+ /*** Done, update the window ***/
-+
-+ /* We passed data and got it acked, remove any soft error
-+ * log. Something worked...
-+ */
-+ sk->sk_err_soft = 0;
-+ inet_csk(meta_sk)->icsk_probes_out = 0;
-+ meta_tp->rcv_tstamp = tcp_time_stamp;
-+ prior_packets = meta_tp->packets_out;
-+ if (!prior_packets)
-+ goto no_queue;
-+
-+ meta_tp->snd_una = data_ack;
-+
-+ mptcp_clean_rtx_queue(meta_sk, prior_snd_una);
-+
-+ /* We are in loss-state, and something got acked, retransmit the whole
-+ * queue now!
-+ */
-+ if (inet_csk(meta_sk)->icsk_ca_state == TCP_CA_Loss &&
-+ after(data_ack, prior_snd_una)) {
-+ mptcp_xmit_retransmit_queue(meta_sk);
-+ inet_csk(meta_sk)->icsk_ca_state = TCP_CA_Open;
-+ }
-+
-+ /* Simplified version of tcp_new_space, because the snd-buffer
-+ * is handled by all the subflows.
-+ */
-+ if (sock_flag(meta_sk, SOCK_QUEUE_SHRUNK)) {
-+ sock_reset_flag(meta_sk, SOCK_QUEUE_SHRUNK);
-+ if (meta_sk->sk_socket &&
-+ test_bit(SOCK_NOSPACE, &meta_sk->sk_socket->flags))
-+ meta_sk->sk_write_space(meta_sk);
-+ }
-+
-+ if (meta_sk->sk_state != TCP_ESTABLISHED &&
-+ mptcp_rcv_state_process(meta_sk, sk, skb, data_seq, data_len))
-+ return;
-+
-+exit:
-+ mptcp_push_pending_frames(meta_sk);
-+
-+ return;
-+
-+no_queue:
-+ if (tcp_send_head(meta_sk))
-+ tcp_ack_probe(meta_sk);
-+
-+ mptcp_push_pending_frames(meta_sk);
-+
-+ return;
-+}
-+
-+void mptcp_clean_rtx_infinite(const struct sk_buff *skb, struct sock *sk)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk), *meta_tp = tcp_sk(mptcp_meta_sk(sk));
-+
-+ if (!tp->mpcb->infinite_mapping_snd)
-+ return;
-+
-+ /* The difference between both write_seq's represents the offset between
-+ * data-sequence and subflow-sequence. As we are infinite, this must
-+ * match.
-+ *
-+ * Thus, from this difference we can infer the meta snd_una.
-+ */
-+ tp->mptcp->rx_opt.data_ack = meta_tp->snd_nxt - tp->snd_nxt +
-+ tp->snd_una;
-+
-+ mptcp_data_ack(sk, skb);
-+}
-+
-+/**** static functions used by mptcp_parse_options */
-+
-+static void mptcp_send_reset_rem_id(const struct mptcp_cb *mpcb, u8 rem_id)
-+{
-+ struct sock *sk_it, *tmpsk;
-+
-+ mptcp_for_each_sk_safe(mpcb, sk_it, tmpsk) {
-+ if (tcp_sk(sk_it)->mptcp->rem_id == rem_id) {
-+ mptcp_reinject_data(sk_it, 0);
-+ sk_it->sk_err = ECONNRESET;
-+ if (tcp_need_reset(sk_it->sk_state))
-+ tcp_sk(sk_it)->ops->send_active_reset(sk_it,
-+ GFP_ATOMIC);
-+ mptcp_sub_force_close(sk_it);
-+ }
-+ }
-+}
-+
-+void mptcp_parse_options(const uint8_t *ptr, int opsize,
-+ struct mptcp_options_received *mopt,
-+ const struct sk_buff *skb)
-+{
-+ const struct mptcp_option *mp_opt = (struct mptcp_option *)ptr;
-+
-+ /* If the socket is mp-capable we would have a mopt. */
-+ if (!mopt)
-+ return;
-+
-+ switch (mp_opt->sub) {
-+ case MPTCP_SUB_CAPABLE:
-+ {
-+ const struct mp_capable *mpcapable = (struct mp_capable *)ptr;
-+
-+ if (opsize != MPTCP_SUB_LEN_CAPABLE_SYN &&
-+ opsize != MPTCP_SUB_LEN_CAPABLE_ACK) {
-+ mptcp_debug("%s: mp_capable: bad option size %d\n",
-+ __func__, opsize);
-+ break;
-+ }
-+
-+ if (!sysctl_mptcp_enabled)
-+ break;
-+
-+ /* We only support MPTCP version 0 */
-+ if (mpcapable->ver != 0)
-+ break;
-+
-+ /* MPTCP-RFC 6824:
-+ * "If receiving a message with the 'B' flag set to 1, and this
-+ * is not understood, then this SYN MUST be silently ignored;
-+ */
-+ if (mpcapable->b) {
-+ mopt->drop_me = 1;
-+ break;
-+ }
-+
-+ /* MPTCP-RFC 6824:
-+ * "An implementation that only supports this method MUST set
-+ * bit "H" to 1, and bits "C" through "G" to 0."
-+ */
-+ if (!mpcapable->h)
-+ break;
-+
-+ mopt->saw_mpc = 1;
-+ mopt->dss_csum = sysctl_mptcp_checksum || mpcapable->a;
-+
-+ if (opsize >= MPTCP_SUB_LEN_CAPABLE_SYN)
-+ mopt->mptcp_key = mpcapable->sender_key;
-+
-+ break;
-+ }
-+ case MPTCP_SUB_JOIN:
-+ {
-+ const struct mp_join *mpjoin = (struct mp_join *)ptr;
-+
-+ if (opsize != MPTCP_SUB_LEN_JOIN_SYN &&
-+ opsize != MPTCP_SUB_LEN_JOIN_SYNACK &&
-+ opsize != MPTCP_SUB_LEN_JOIN_ACK) {
-+ mptcp_debug("%s: mp_join: bad option size %d\n",
-+ __func__, opsize);
-+ break;
-+ }
-+
-+ /* saw_mpc must be set, because in tcp_check_req we assume that
-+ * it is set to support falling back to reg. TCP if a rexmitted
-+ * SYN has no MP_CAPABLE or MP_JOIN
-+ */
-+ switch (opsize) {
-+ case MPTCP_SUB_LEN_JOIN_SYN:
-+ mopt->is_mp_join = 1;
-+ mopt->saw_mpc = 1;
-+ mopt->low_prio = mpjoin->b;
-+ mopt->rem_id = mpjoin->addr_id;
-+ mopt->mptcp_rem_token = mpjoin->u.syn.token;
-+ mopt->mptcp_recv_nonce = mpjoin->u.syn.nonce;
-+ break;
-+ case MPTCP_SUB_LEN_JOIN_SYNACK:
-+ mopt->saw_mpc = 1;
-+ mopt->low_prio = mpjoin->b;
-+ mopt->rem_id = mpjoin->addr_id;
-+ mopt->mptcp_recv_tmac = mpjoin->u.synack.mac;
-+ mopt->mptcp_recv_nonce = mpjoin->u.synack.nonce;
-+ break;
-+ case MPTCP_SUB_LEN_JOIN_ACK:
-+ mopt->saw_mpc = 1;
-+ mopt->join_ack = 1;
-+ memcpy(mopt->mptcp_recv_mac, mpjoin->u.ack.mac, 20);
-+ break;
-+ }
-+ break;
-+ }
-+ case MPTCP_SUB_DSS:
-+ {
-+ const struct mp_dss *mdss = (struct mp_dss *)ptr;
-+ struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
-+
-+ /* We check opsize for the csum and non-csum case. We do this,
-+ * because the draft says that the csum SHOULD be ignored if
-+ * it has not been negotiated in the MP_CAPABLE but still is
-+ * present in the data.
-+ *
-+ * It will get ignored later in mptcp_queue_skb.
-+ */
-+ if (opsize != mptcp_sub_len_dss(mdss, 0) &&
-+ opsize != mptcp_sub_len_dss(mdss, 1)) {
-+ mptcp_debug("%s: mp_dss: bad option size %d\n",
-+ __func__, opsize);
-+ break;
-+ }
-+
-+ ptr += 4;
-+
-+ if (mdss->A) {
-+ tcb->mptcp_flags |= MPTCPHDR_ACK;
-+
-+ if (mdss->a) {
-+ mopt->data_ack = (u32) get_unaligned_be64(ptr);
-+ ptr += MPTCP_SUB_LEN_ACK_64;
-+ } else {
-+ mopt->data_ack = get_unaligned_be32(ptr);
-+ ptr += MPTCP_SUB_LEN_ACK;
-+ }
-+ }
-+
-+ tcb->dss_off = (ptr - skb_transport_header(skb));
-+
-+ if (mdss->M) {
-+ if (mdss->m) {
-+ u64 data_seq64 = get_unaligned_be64(ptr);
-+
-+ tcb->mptcp_flags |= MPTCPHDR_SEQ64_SET;
-+ mopt->data_seq = (u32) data_seq64;
-+
-+ ptr += 12; /* 64-bit dseq + subseq */
-+ } else {
-+ mopt->data_seq = get_unaligned_be32(ptr);
-+ ptr += 8; /* 32-bit dseq + subseq */
-+ }
-+ mopt->data_len = get_unaligned_be16(ptr);
-+
-+ tcb->mptcp_flags |= MPTCPHDR_SEQ;
-+
-+ /* Is a check-sum present? */
-+ if (opsize == mptcp_sub_len_dss(mdss, 1))
-+ tcb->mptcp_flags |= MPTCPHDR_DSS_CSUM;
-+
-+ /* DATA_FIN only possible with DSS-mapping */
-+ if (mdss->F)
-+ tcb->mptcp_flags |= MPTCPHDR_FIN;
-+ }
-+
-+ break;
-+ }
-+ case MPTCP_SUB_ADD_ADDR:
-+ {
-+#if IS_ENABLED(CONFIG_IPV6)
-+ const struct mp_add_addr *mpadd = (struct mp_add_addr *)ptr;
-+
-+ if ((mpadd->ipver == 4 && opsize != MPTCP_SUB_LEN_ADD_ADDR4 &&
-+ opsize != MPTCP_SUB_LEN_ADD_ADDR4 + 2) ||
-+ (mpadd->ipver == 6 && opsize != MPTCP_SUB_LEN_ADD_ADDR6 &&
-+ opsize != MPTCP_SUB_LEN_ADD_ADDR6 + 2)) {
-+#else
-+ if (opsize != MPTCP_SUB_LEN_ADD_ADDR4 &&
-+ opsize != MPTCP_SUB_LEN_ADD_ADDR4 + 2) {
-+#endif /* CONFIG_IPV6 */
-+ mptcp_debug("%s: mp_add_addr: bad option size %d\n",
-+ __func__, opsize);
-+ break;
-+ }
-+
-+ /* We have to manually parse the options if we got two of them. */
-+ if (mopt->saw_add_addr) {
-+ mopt->more_add_addr = 1;
-+ break;
-+ }
-+ mopt->saw_add_addr = 1;
-+ mopt->add_addr_ptr = ptr;
-+ break;
-+ }
-+ case MPTCP_SUB_REMOVE_ADDR:
-+ if ((opsize - MPTCP_SUB_LEN_REMOVE_ADDR) < 0) {
-+ mptcp_debug("%s: mp_remove_addr: bad option size %d\n",
-+ __func__, opsize);
-+ break;
-+ }
-+
-+ if (mopt->saw_rem_addr) {
-+ mopt->more_rem_addr = 1;
-+ break;
-+ }
-+ mopt->saw_rem_addr = 1;
-+ mopt->rem_addr_ptr = ptr;
-+ break;
-+ case MPTCP_SUB_PRIO:
-+ {
-+ const struct mp_prio *mpprio = (struct mp_prio *)ptr;
-+
-+ if (opsize != MPTCP_SUB_LEN_PRIO &&
-+ opsize != MPTCP_SUB_LEN_PRIO_ADDR) {
-+ mptcp_debug("%s: mp_prio: bad option size %d\n",
-+ __func__, opsize);
-+ break;
-+ }
-+
-+ mopt->saw_low_prio = 1;
-+ mopt->low_prio = mpprio->b;
-+
-+ if (opsize == MPTCP_SUB_LEN_PRIO_ADDR) {
-+ mopt->saw_low_prio = 2;
-+ mopt->prio_addr_id = mpprio->addr_id;
-+ }
-+ break;
-+ }
-+ case MPTCP_SUB_FAIL:
-+ if (opsize != MPTCP_SUB_LEN_FAIL) {
-+ mptcp_debug("%s: mp_fail: bad option size %d\n",
-+ __func__, opsize);
-+ break;
-+ }
-+ mopt->mp_fail = 1;
-+ break;
-+ case MPTCP_SUB_FCLOSE:
-+ if (opsize != MPTCP_SUB_LEN_FCLOSE) {
-+ mptcp_debug("%s: mp_fclose: bad option size %d\n",
-+ __func__, opsize);
-+ break;
-+ }
-+
-+ mopt->mp_fclose = 1;
-+ mopt->mptcp_key = ((struct mp_fclose *)ptr)->key;
-+
-+ break;
-+ default:
-+ mptcp_debug("%s: Received unknown subtype: %d\n",
-+ __func__, mp_opt->sub);
-+ break;
-+ }
-+}
-+
-+/** Parse only MPTCP options */
-+void tcp_parse_mptcp_options(const struct sk_buff *skb,
-+ struct mptcp_options_received *mopt)
-+{
-+ const struct tcphdr *th = tcp_hdr(skb);
-+ int length = (th->doff * 4) - sizeof(struct tcphdr);
-+ const unsigned char *ptr = (const unsigned char *)(th + 1);
-+
-+ while (length > 0) {
-+ int opcode = *ptr++;
-+ int opsize;
-+
-+ switch (opcode) {
-+ case TCPOPT_EOL:
-+ return;
-+ case TCPOPT_NOP: /* Ref: RFC 793 section 3.1 */
-+ length--;
-+ continue;
-+ default:
-+ opsize = *ptr++;
-+ if (opsize < 2) /* "silly options" */
-+ return;
-+ if (opsize > length)
-+ return; /* don't parse partial options */
-+ if (opcode == TCPOPT_MPTCP)
-+ mptcp_parse_options(ptr - 2, opsize, mopt, skb);
-+ }
-+ ptr += opsize - 2;
-+ length -= opsize;
-+ }
-+}
-+
-+int mptcp_check_rtt(const struct tcp_sock *tp, int time)
-+{
-+ struct mptcp_cb *mpcb = tp->mpcb;
-+ struct sock *sk;
-+ u32 rtt_max = 0;
-+
-+ /* In MPTCP, we take the max delay across all flows,
-+ * in order to take into account meta-reordering buffers.
-+ */
-+ mptcp_for_each_sk(mpcb, sk) {
-+ if (!mptcp_sk_can_recv(sk))
-+ continue;
-+
-+ if (rtt_max < tcp_sk(sk)->rcv_rtt_est.rtt)
-+ rtt_max = tcp_sk(sk)->rcv_rtt_est.rtt;
-+ }
-+ if (time < (rtt_max >> 3) || !rtt_max)
-+ return 1;
-+
-+ return 0;
-+}
-+
-+static void mptcp_handle_add_addr(const unsigned char *ptr, struct sock *sk)
-+{
-+ struct mp_add_addr *mpadd = (struct mp_add_addr *)ptr;
-+ struct mptcp_cb *mpcb = tcp_sk(sk)->mpcb;
-+ __be16 port = 0;
-+ union inet_addr addr;
-+ sa_family_t family;
-+
-+ if (mpadd->ipver == 4) {
-+ if (mpadd->len == MPTCP_SUB_LEN_ADD_ADDR4 + 2)
-+ port = mpadd->u.v4.port;
-+ family = AF_INET;
-+ addr.in = mpadd->u.v4.addr;
-+#if IS_ENABLED(CONFIG_IPV6)
-+ } else if (mpadd->ipver == 6) {
-+ if (mpadd->len == MPTCP_SUB_LEN_ADD_ADDR6 + 2)
-+ port = mpadd->u.v6.port;
-+ family = AF_INET6;
-+ addr.in6 = mpadd->u.v6.addr;
-+#endif /* CONFIG_IPV6 */
-+ } else {
-+ return;
-+ }
-+
-+ if (mpcb->pm_ops->add_raddr)
-+ mpcb->pm_ops->add_raddr(mpcb, &addr, family, port, mpadd->addr_id);
-+}
-+
-+static void mptcp_handle_rem_addr(const unsigned char *ptr, struct sock *sk)
-+{
-+ struct mp_remove_addr *mprem = (struct mp_remove_addr *)ptr;
-+ int i;
-+ u8 rem_id;
-+ struct mptcp_cb *mpcb = tcp_sk(sk)->mpcb;
-+
-+ for (i = 0; i <= mprem->len - MPTCP_SUB_LEN_REMOVE_ADDR; i++) {
-+ rem_id = (&mprem->addrs_id)[i];
-+
-+ if (mpcb->pm_ops->rem_raddr)
-+ mpcb->pm_ops->rem_raddr(mpcb, rem_id);
-+ mptcp_send_reset_rem_id(mpcb, rem_id);
-+ }
-+}
-+
-+static void mptcp_parse_addropt(const struct sk_buff *skb, struct sock *sk)
-+{
-+ struct tcphdr *th = tcp_hdr(skb);
-+ unsigned char *ptr;
-+ int length = (th->doff * 4) - sizeof(struct tcphdr);
-+
-+ /* Jump through the options to check whether ADD_ADDR is there */
-+ ptr = (unsigned char *)(th + 1);
-+ while (length > 0) {
-+ int opcode = *ptr++;
-+ int opsize;
-+
-+ switch (opcode) {
-+ case TCPOPT_EOL:
-+ return;
-+ case TCPOPT_NOP:
-+ length--;
-+ continue;
-+ default:
-+ opsize = *ptr++;
-+ if (opsize < 2)
-+ return;
-+ if (opsize > length)
-+ return; /* don't parse partial options */
-+ if (opcode == TCPOPT_MPTCP &&
-+ ((struct mptcp_option *)ptr)->sub == MPTCP_SUB_ADD_ADDR) {
-+#if IS_ENABLED(CONFIG_IPV6)
-+ struct mp_add_addr *mpadd = (struct mp_add_addr *)ptr;
-+ if ((mpadd->ipver == 4 && opsize != MPTCP_SUB_LEN_ADD_ADDR4 &&
-+ opsize != MPTCP_SUB_LEN_ADD_ADDR4 + 2) ||
-+ (mpadd->ipver == 6 && opsize != MPTCP_SUB_LEN_ADD_ADDR6 &&
-+ opsize != MPTCP_SUB_LEN_ADD_ADDR6 + 2))
-+#else
-+ if (opsize != MPTCP_SUB_LEN_ADD_ADDR4 &&
-+ opsize != MPTCP_SUB_LEN_ADD_ADDR4 + 2)
-+#endif /* CONFIG_IPV6 */
-+ goto cont;
-+
-+ mptcp_handle_add_addr(ptr, sk);
-+ }
-+ if (opcode == TCPOPT_MPTCP &&
-+ ((struct mptcp_option *)ptr)->sub == MPTCP_SUB_REMOVE_ADDR) {
-+ if ((opsize - MPTCP_SUB_LEN_REMOVE_ADDR) < 0)
-+ goto cont;
-+
-+ mptcp_handle_rem_addr(ptr, sk);
-+ }
-+cont:
-+ ptr += opsize - 2;
-+ length -= opsize;
-+ }
-+ }
-+ return;
-+}
-+
-+static inline int mptcp_mp_fail_rcvd(struct sock *sk, const struct tcphdr *th)
-+{
-+ struct mptcp_tcp_sock *mptcp = tcp_sk(sk)->mptcp;
-+ struct sock *meta_sk = mptcp_meta_sk(sk);
-+ struct mptcp_cb *mpcb = tcp_sk(sk)->mpcb;
-+
-+ if (unlikely(mptcp->rx_opt.mp_fail)) {
-+ mptcp->rx_opt.mp_fail = 0;
-+
-+ if (!th->rst && !mpcb->infinite_mapping_snd) {
-+ struct sock *sk_it;
-+
-+ mpcb->send_infinite_mapping = 1;
-+ /* We resend everything that has not been acknowledged */
-+ meta_sk->sk_send_head = tcp_write_queue_head(meta_sk);
-+
-+ /* We artificially restart the whole send-queue. Thus,
-+ * it is as if no packets are in flight
-+ */
-+ tcp_sk(meta_sk)->packets_out = 0;
-+
-+ /* If the snd_nxt already wrapped around, we have to
-+ * undo the wrapping, as we are restarting from snd_una
-+ * on.
-+ */
-+ if (tcp_sk(meta_sk)->snd_nxt < tcp_sk(meta_sk)->snd_una) {
-+ mpcb->snd_high_order[mpcb->snd_hiseq_index] -= 2;
-+ mpcb->snd_hiseq_index = mpcb->snd_hiseq_index ? 0 : 1;
-+ }
-+ tcp_sk(meta_sk)->snd_nxt = tcp_sk(meta_sk)->snd_una;
-+
-+ /* Trigger a sending on the meta. */
-+ mptcp_push_pending_frames(meta_sk);
-+
-+ mptcp_for_each_sk(mpcb, sk_it) {
-+ if (sk != sk_it)
-+ mptcp_sub_force_close(sk_it);
-+ }
-+ }
-+
-+ return 0;
-+ }
-+
-+ if (unlikely(mptcp->rx_opt.mp_fclose)) {
-+ struct sock *sk_it, *tmpsk;
-+
-+ mptcp->rx_opt.mp_fclose = 0;
-+ if (mptcp->rx_opt.mptcp_key != mpcb->mptcp_loc_key)
-+ return 0;
-+
-+ if (tcp_need_reset(sk->sk_state))
-+ tcp_sk(sk)->ops->send_active_reset(sk, GFP_ATOMIC);
-+
-+ mptcp_for_each_sk_safe(mpcb, sk_it, tmpsk)
-+ mptcp_sub_force_close(sk_it);
-+
-+ tcp_reset(meta_sk);
-+
-+ return 1;
-+ }
-+
-+ return 0;
-+}
-+
-+static inline void mptcp_path_array_check(struct sock *meta_sk)
-+{
-+ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
-+
-+ if (unlikely(mpcb->list_rcvd)) {
-+ mpcb->list_rcvd = 0;
-+ if (mpcb->pm_ops->new_remote_address)
-+ mpcb->pm_ops->new_remote_address(meta_sk);
-+ }
-+}
-+
-+int mptcp_handle_options(struct sock *sk, const struct tcphdr *th,
-+ const struct sk_buff *skb)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ struct mptcp_options_received *mopt = &tp->mptcp->rx_opt;
-+
-+ if (tp->mpcb->infinite_mapping_rcv || tp->mpcb->infinite_mapping_snd)
-+ return 0;
-+
-+ if (mptcp_mp_fail_rcvd(sk, th))
-+ return 1;
-+
-+ /* RFC 6824, Section 3.3:
-+ * If a checksum is not present when its use has been negotiated, the
-+ * receiver MUST close the subflow with a RST as it is considered broken.
-+ */
-+ if (mptcp_is_data_seq(skb) && tp->mpcb->dss_csum &&
-+ !(TCP_SKB_CB(skb)->mptcp_flags & MPTCPHDR_DSS_CSUM)) {
-+ if (tcp_need_reset(sk->sk_state))
-+ tp->ops->send_active_reset(sk, GFP_ATOMIC);
-+
-+ mptcp_sub_force_close(sk);
-+ return 1;
-+ }
-+
-+ /* We have to acknowledge retransmissions of the third
-+ * ack.
-+ */
-+ if (mopt->join_ack) {
-+ tcp_send_delayed_ack(sk);
-+ mopt->join_ack = 0;
-+ }
-+
-+ if (mopt->saw_add_addr || mopt->saw_rem_addr) {
-+ if (mopt->more_add_addr || mopt->more_rem_addr) {
-+ mptcp_parse_addropt(skb, sk);
-+ } else {
-+ if (mopt->saw_add_addr)
-+ mptcp_handle_add_addr(mopt->add_addr_ptr, sk);
-+ if (mopt->saw_rem_addr)
-+ mptcp_handle_rem_addr(mopt->rem_addr_ptr, sk);
-+ }
-+
-+ mopt->more_add_addr = 0;
-+ mopt->saw_add_addr = 0;
-+ mopt->more_rem_addr = 0;
-+ mopt->saw_rem_addr = 0;
-+ }
-+ if (mopt->saw_low_prio) {
-+ if (mopt->saw_low_prio == 1) {
-+ tp->mptcp->rcv_low_prio = mopt->low_prio;
-+ } else {
-+ struct sock *sk_it;
-+ mptcp_for_each_sk(tp->mpcb, sk_it) {
-+ struct mptcp_tcp_sock *mptcp = tcp_sk(sk_it)->mptcp;
-+ if (mptcp->rem_id == mopt->prio_addr_id)
-+ mptcp->rcv_low_prio = mopt->low_prio;
-+ }
-+ }
-+ mopt->saw_low_prio = 0;
-+ }
-+
-+ mptcp_data_ack(sk, skb);
-+
-+ mptcp_path_array_check(mptcp_meta_sk(sk));
-+ /* Socket may have been mp_killed by a REMOVE_ADDR */
-+ if (tp->mp_killed)
-+ return 1;
-+
-+ return 0;
-+}
-+
-+/* In case of fastopen, some data can already be in the write queue.
-+ * We need to update the sequence numbers of these segments, as they
-+ * were initially assigned TCP (subflow) sequence numbers.
-+ */
-+static void mptcp_rcv_synsent_fastopen(struct sock *meta_sk)
-+{
-+ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
-+ struct tcp_sock *master_tp = tcp_sk(meta_tp->mpcb->master_sk);
-+ struct sk_buff *skb;
-+ u32 new_mapping = meta_tp->write_seq - master_tp->snd_una;
-+
-+ /* There should only be one skb in write queue: the data not
-+ * acknowledged in the SYN+ACK. In this case, we need to map
-+ * this data to data sequence numbers.
-+ */
-+ skb_queue_walk(&meta_sk->sk_write_queue, skb) {
-+ /* If the server only acknowledges partially the data sent in
-+ * the SYN, we need to trim the acknowledged part because
-+ * we don't want to retransmit this already received data.
-+ * When we reach this point, tcp_ack() has already cleaned up
-+ * fully acked segments. However, tcp trims partially acked
-+ * segments only when retransmitting. Since MPTCP comes into
-+ * play only now, we will fake an initial transmit, and
-+ * retransmit_skb() will not be called. The following fragment
-+ * comes from __tcp_retransmit_skb().
-+ */
-+ if (before(TCP_SKB_CB(skb)->seq, master_tp->snd_una)) {
-+ BUG_ON(before(TCP_SKB_CB(skb)->end_seq,
-+ master_tp->snd_una));
-+			/* tcp_trim_head can only return ENOMEM if the skb is
-+			 * cloned, which is not the case here (see
-+			 * tcp_send_syn_data).
-+			 */
-+ BUG_ON(tcp_trim_head(meta_sk, skb, master_tp->snd_una -
-+ TCP_SKB_CB(skb)->seq));
-+ }
-+
-+ TCP_SKB_CB(skb)->seq += new_mapping;
-+ TCP_SKB_CB(skb)->end_seq += new_mapping;
-+ }
-+
-+ /* We can advance write_seq by the number of bytes unacknowledged
-+ * and that were mapped in the previous loop.
-+ */
-+ meta_tp->write_seq += master_tp->write_seq - master_tp->snd_una;
-+
-+	/* The packets from the master_sk will be entailed to it later.
-+	 * Until then, its write queue is empty, and write_seq must
-+	 * align with snd_una.
-+	 */
-+ master_tp->snd_nxt = master_tp->write_seq = master_tp->snd_una;
-+ master_tp->packets_out = 0;
-+
-+	/* Although this data has already been sent over the subsk, it has
-+	 * never been sent over the meta_sk, so we rewind the send_head so
-+	 * that tcp considers it an initial send (instead of a retransmit).
-+	 */
-+ meta_sk->sk_send_head = tcp_write_queue_head(meta_sk);
-+}
-+
-+/* The skptr is needed, because if we become MPTCP-capable, we have to switch
-+ * from meta-socket to master-socket.
-+ *
-+ * @return: 1 - we want to reset this connection
-+ * 2 - we want to discard the received syn/ack
-+ * 0 - everything is fine - continue
-+ */
-+int mptcp_rcv_synsent_state_process(struct sock *sk, struct sock **skptr,
-+ const struct sk_buff *skb,
-+ const struct mptcp_options_received *mopt)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+
-+ if (mptcp(tp)) {
-+ u8 hash_mac_check[20];
-+ struct mptcp_cb *mpcb = tp->mpcb;
-+
-+ mptcp_hmac_sha1((u8 *)&mpcb->mptcp_rem_key,
-+ (u8 *)&mpcb->mptcp_loc_key,
-+ (u8 *)&tp->mptcp->rx_opt.mptcp_recv_nonce,
-+ (u8 *)&tp->mptcp->mptcp_loc_nonce,
-+ (u32 *)hash_mac_check);
-+ if (memcmp(hash_mac_check,
-+ (char *)&tp->mptcp->rx_opt.mptcp_recv_tmac, 8)) {
-+ mptcp_sub_force_close(sk);
-+ return 1;
-+ }
-+
-+ /* Set this flag in order to postpone data sending
-+ * until the 4th ack arrives.
-+ */
-+ tp->mptcp->pre_established = 1;
-+ tp->mptcp->rcv_low_prio = tp->mptcp->rx_opt.low_prio;
-+
-+ mptcp_hmac_sha1((u8 *)&mpcb->mptcp_loc_key,
-+ (u8 *)&mpcb->mptcp_rem_key,
-+ (u8 *)&tp->mptcp->mptcp_loc_nonce,
-+ (u8 *)&tp->mptcp->rx_opt.mptcp_recv_nonce,
-+ (u32 *)&tp->mptcp->sender_mac[0]);
-+
-+ } else if (mopt->saw_mpc) {
-+ struct sock *meta_sk = sk;
-+
-+ if (mptcp_create_master_sk(sk, mopt->mptcp_key,
-+ ntohs(tcp_hdr(skb)->window)))
-+ return 2;
-+
-+ sk = tcp_sk(sk)->mpcb->master_sk;
-+ *skptr = sk;
-+ tp = tcp_sk(sk);
-+
-+ /* If fastopen was used data might be in the send queue. We
-+ * need to update their sequence number to MPTCP-level seqno.
-+ * Note that it can happen in rare cases that fastopen_req is
-+ * NULL and syn_data is 0 but fastopen indeed occurred and
-+ * data has been queued in the write queue (but not sent).
-+ * Example of such rare cases: connect is non-blocking and
-+ * TFO is configured to work without cookies.
-+ */
-+ if (!skb_queue_empty(&meta_sk->sk_write_queue))
-+ mptcp_rcv_synsent_fastopen(meta_sk);
-+
-+ /* -1, because the SYN consumed 1 byte. In case of TFO, we
-+ * start the subflow-sequence number as if the data of the SYN
-+ * is not part of any mapping.
-+ */
-+ tp->mptcp->snt_isn = tp->snd_una - 1;
-+ tp->mpcb->dss_csum = mopt->dss_csum;
-+ tp->mptcp->include_mpc = 1;
-+
-+ /* Ensure that fastopen is handled at the meta-level. */
-+ tp->fastopen_req = NULL;
-+
-+ sk_set_socket(sk, mptcp_meta_sk(sk)->sk_socket);
-+ sk->sk_wq = mptcp_meta_sk(sk)->sk_wq;
-+
-+		/* Drop the extra reference held since sk_clone_lock
-+		 * initialized the refcount to 2.
-+		 */
-+ sock_put(sk);
-+ } else {
-+ tp->request_mptcp = 0;
-+
-+ if (tp->inside_tk_table)
-+ mptcp_hash_remove(tp);
-+ }
-+
-+ if (mptcp(tp))
-+ tp->mptcp->rcv_isn = TCP_SKB_CB(skb)->seq;
-+
-+ return 0;
-+}
-+
-+bool mptcp_should_expand_sndbuf(const struct sock *sk)
-+{
-+ const struct sock *sk_it;
-+ const struct sock *meta_sk = mptcp_meta_sk(sk);
-+ const struct tcp_sock *meta_tp = tcp_sk(meta_sk);
-+ int cnt_backups = 0;
-+ int backup_available = 0;
-+
-+ /* We circumvent this check in tcp_check_space, because we want to
-+ * always call sk_write_space. So, we reproduce the check here.
-+ */
-+ if (!meta_sk->sk_socket ||
-+ !test_bit(SOCK_NOSPACE, &meta_sk->sk_socket->flags))
-+ return false;
-+
-+ /* If the user specified a specific send buffer setting, do
-+ * not modify it.
-+ */
-+ if (meta_sk->sk_userlocks & SOCK_SNDBUF_LOCK)
-+ return false;
-+
-+ /* If we are under global TCP memory pressure, do not expand. */
-+ if (sk_under_memory_pressure(meta_sk))
-+ return false;
-+
-+ /* If we are under soft global TCP memory pressure, do not expand. */
-+ if (sk_memory_allocated(meta_sk) >= sk_prot_mem_limits(meta_sk, 0))
-+ return false;
-+
-+
-+ /* For MPTCP we look for a subsocket that could send data.
-+ * If we found one, then we update the send-buffer.
-+ */
-+ mptcp_for_each_sk(meta_tp->mpcb, sk_it) {
-+ struct tcp_sock *tp_it = tcp_sk(sk_it);
-+
-+ if (!mptcp_sk_can_send(sk_it))
-+ continue;
-+
-+ /* Backup-flows have to be counted - if there is no other
-+ * subflow we take the backup-flow into account.
-+ */
-+ if (tp_it->mptcp->rcv_low_prio || tp_it->mptcp->low_prio)
-+ cnt_backups++;
-+
-+ if (tp_it->packets_out < tp_it->snd_cwnd) {
-+ if (tp_it->mptcp->rcv_low_prio || tp_it->mptcp->low_prio) {
-+ backup_available = 1;
-+ continue;
-+ }
-+ return true;
-+ }
-+ }
-+
-+ /* Backup-flow is available for sending - update send-buffer */
-+ if (meta_tp->mpcb->cnt_established == cnt_backups && backup_available)
-+ return true;
-+ return false;
-+}
-+
-+void mptcp_init_buffer_space(struct sock *sk)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ struct sock *meta_sk = mptcp_meta_sk(sk);
-+ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
-+ int space;
-+
-+ tcp_init_buffer_space(sk);
-+
-+ if (is_master_tp(tp)) {
-+ meta_tp->rcvq_space.space = meta_tp->rcv_wnd;
-+ meta_tp->rcvq_space.time = tcp_time_stamp;
-+ meta_tp->rcvq_space.seq = meta_tp->copied_seq;
-+
-+ /* If there is only one subflow, we just use regular TCP
-+ * autotuning. User-locks are handled already by
-+ * tcp_init_buffer_space
-+ */
-+ meta_tp->window_clamp = tp->window_clamp;
-+ meta_tp->rcv_ssthresh = tp->rcv_ssthresh;
-+ meta_sk->sk_rcvbuf = sk->sk_rcvbuf;
-+ meta_sk->sk_sndbuf = sk->sk_sndbuf;
-+
-+ return;
-+ }
-+
-+ if (meta_sk->sk_userlocks & SOCK_RCVBUF_LOCK)
-+ goto snd_buf;
-+
-+ /* Adding a new subflow to the rcv-buffer space. We make a simple
-+ * addition, to give some space to allow traffic on the new subflow.
-+ * Autotuning will increase it further later on.
-+ */
-+ space = min(meta_sk->sk_rcvbuf + sk->sk_rcvbuf, sysctl_tcp_rmem[2]);
-+ if (space > meta_sk->sk_rcvbuf) {
-+ meta_tp->window_clamp += tp->window_clamp;
-+ meta_tp->rcv_ssthresh += tp->rcv_ssthresh;
-+ meta_sk->sk_rcvbuf = space;
-+ }
-+
-+snd_buf:
-+ if (meta_sk->sk_userlocks & SOCK_SNDBUF_LOCK)
-+ return;
-+
-+ /* Adding a new subflow to the send-buffer space. We make a simple
-+ * addition, to give some space to allow traffic on the new subflow.
-+ * Autotuning will increase it further later on.
-+ */
-+ space = min(meta_sk->sk_sndbuf + sk->sk_sndbuf, sysctl_tcp_wmem[2]);
-+ if (space > meta_sk->sk_sndbuf) {
-+ meta_sk->sk_sndbuf = space;
-+ meta_sk->sk_write_space(meta_sk);
-+ }
-+}
-+
-+void mptcp_tcp_set_rto(struct sock *sk)
-+{
-+ tcp_set_rto(sk);
-+ mptcp_set_rto(sk);
-+}
-diff --git a/net/mptcp/mptcp_ipv4.c b/net/mptcp/mptcp_ipv4.c
-new file mode 100644
-index 000000000000..1183d1305d35
---- /dev/null
-+++ b/net/mptcp/mptcp_ipv4.c
-@@ -0,0 +1,483 @@
-+/*
-+ * MPTCP implementation - IPv4-specific functions
-+ *
-+ * Initial Design & Implementation:
-+ * Sébastien Barré <sebastien.barre@uclouvain.be>
-+ *
-+ * Current Maintainer:
-+ * Christoph Paasch <christoph.paasch@uclouvain.be>
-+ *
-+ * Additional authors:
-+ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
-+ * Gregory Detal <gregory.detal@uclouvain.be>
-+ * Fabien Duchêne <fabien.duchene@uclouvain.be>
-+ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
-+ * Lavkesh Lahngir <lavkesh51@gmail.com>
-+ * Andreas Ripke <ripke@neclab.eu>
-+ * Vlad Dogaru <vlad.dogaru@intel.com>
-+ * Octavian Purdila <octavian.purdila@intel.com>
-+ * John Ronan <jronan@tssg.org>
-+ * Catalin Nicutar <catalin.nicutar@gmail.com>
-+ * Brandon Heller <brandonh@stanford.edu>
-+ *
-+ *
-+ * This program is free software; you can redistribute it and/or
-+ * modify it under the terms of the GNU General Public License
-+ * as published by the Free Software Foundation; either version
-+ * 2 of the License, or (at your option) any later version.
-+ */
-+
-+#include <linux/export.h>
-+#include <linux/ip.h>
-+#include <linux/list.h>
-+#include <linux/skbuff.h>
-+#include <linux/spinlock.h>
-+#include <linux/tcp.h>
-+
-+#include <net/inet_common.h>
-+#include <net/inet_connection_sock.h>
-+#include <net/mptcp.h>
-+#include <net/mptcp_v4.h>
-+#include <net/request_sock.h>
-+#include <net/tcp.h>
-+
-+u32 mptcp_v4_get_nonce(__be32 saddr, __be32 daddr, __be16 sport, __be16 dport)
-+{
-+ u32 hash[MD5_DIGEST_WORDS];
-+
-+ hash[0] = (__force u32)saddr;
-+ hash[1] = (__force u32)daddr;
-+ hash[2] = ((__force u16)sport << 16) + (__force u16)dport;
-+ hash[3] = mptcp_seed++;
-+
-+ md5_transform(hash, mptcp_secret);
-+
-+ return hash[0];
-+}
-+
-+u64 mptcp_v4_get_key(__be32 saddr, __be32 daddr, __be16 sport, __be16 dport)
-+{
-+ u32 hash[MD5_DIGEST_WORDS];
-+
-+ hash[0] = (__force u32)saddr;
-+ hash[1] = (__force u32)daddr;
-+ hash[2] = ((__force u16)sport << 16) + (__force u16)dport;
-+ hash[3] = mptcp_seed++;
-+
-+ md5_transform(hash, mptcp_secret);
-+
-+ return *((u64 *)hash);
-+}
-+
-+
-+static void mptcp_v4_reqsk_destructor(struct request_sock *req)
-+{
-+ mptcp_reqsk_destructor(req);
-+
-+ tcp_v4_reqsk_destructor(req);
-+}
-+
-+static int mptcp_v4_init_req(struct request_sock *req, struct sock *sk,
-+ struct sk_buff *skb)
-+{
-+ tcp_request_sock_ipv4_ops.init_req(req, sk, skb);
-+ mptcp_reqsk_init(req, skb);
-+
-+ return 0;
-+}
-+
-+static int mptcp_v4_join_init_req(struct request_sock *req, struct sock *sk,
-+ struct sk_buff *skb)
-+{
-+ struct mptcp_request_sock *mtreq = mptcp_rsk(req);
-+ struct mptcp_cb *mpcb = tcp_sk(sk)->mpcb;
-+ union inet_addr addr;
-+ int loc_id;
-+ bool low_prio = false;
-+
-+ /* We need to do this as early as possible. Because, if we fail later
-+ * (e.g., get_local_id), then reqsk_free tries to remove the
-+ * request-socket from the htb in mptcp_hash_request_remove as pprev
-+ * may be different from NULL.
-+ */
-+ mtreq->hash_entry.pprev = NULL;
-+
-+ tcp_request_sock_ipv4_ops.init_req(req, sk, skb);
-+
-+ mtreq->mptcp_loc_nonce = mptcp_v4_get_nonce(ip_hdr(skb)->saddr,
-+ ip_hdr(skb)->daddr,
-+ tcp_hdr(skb)->source,
-+ tcp_hdr(skb)->dest);
-+ addr.ip = inet_rsk(req)->ir_loc_addr;
-+ loc_id = mpcb->pm_ops->get_local_id(AF_INET, &addr, sock_net(sk), &low_prio);
-+ if (loc_id == -1)
-+ return -1;
-+ mtreq->loc_id = loc_id;
-+ mtreq->low_prio = low_prio;
-+
-+ mptcp_join_reqsk_init(mpcb, req, skb);
-+
-+ return 0;
-+}
-+
-+/* Similar to tcp_request_sock_ops */
-+struct request_sock_ops mptcp_request_sock_ops __read_mostly = {
-+ .family = PF_INET,
-+ .obj_size = sizeof(struct mptcp_request_sock),
-+ .rtx_syn_ack = tcp_rtx_synack,
-+ .send_ack = tcp_v4_reqsk_send_ack,
-+ .destructor = mptcp_v4_reqsk_destructor,
-+ .send_reset = tcp_v4_send_reset,
-+ .syn_ack_timeout = tcp_syn_ack_timeout,
-+};
-+
-+static void mptcp_v4_reqsk_queue_hash_add(struct sock *meta_sk,
-+ struct request_sock *req,
-+ const unsigned long timeout)
-+{
-+ const u32 h1 = inet_synq_hash(inet_rsk(req)->ir_rmt_addr,
-+ inet_rsk(req)->ir_rmt_port,
-+ 0, MPTCP_HASH_SIZE);
-+ /* We cannot call inet_csk_reqsk_queue_hash_add(), because we do not
-+ * want to reset the keepalive-timer (responsible for retransmitting
-+ * SYN/ACKs). We do not retransmit SYN/ACKs+MP_JOINs, because we cannot
-+ * overload the keepalive timer. Also, it's not a big deal, because the
-+ * third ACK of the MP_JOIN-handshake is sent in a reliable manner. So,
-+ * if the third ACK gets lost, the client will handle the retransmission
-+ * anyways. If our SYN/ACK gets lost, the client will retransmit the
-+ * SYN.
-+ */
-+ struct inet_connection_sock *meta_icsk = inet_csk(meta_sk);
-+ struct listen_sock *lopt = meta_icsk->icsk_accept_queue.listen_opt;
-+ const u32 h2 = inet_synq_hash(inet_rsk(req)->ir_rmt_addr,
-+ inet_rsk(req)->ir_rmt_port,
-+ lopt->hash_rnd, lopt->nr_table_entries);
-+
-+ reqsk_queue_hash_req(&meta_icsk->icsk_accept_queue, h2, req, timeout);
-+ if (reqsk_queue_added(&meta_icsk->icsk_accept_queue) == 0)
-+ mptcp_reset_synack_timer(meta_sk, timeout);
-+
-+ rcu_read_lock();
-+ spin_lock(&mptcp_reqsk_hlock);
-+ hlist_nulls_add_head_rcu(&mptcp_rsk(req)->hash_entry, &mptcp_reqsk_htb[h1]);
-+ spin_unlock(&mptcp_reqsk_hlock);
-+ rcu_read_unlock();
-+}
-+
-+/* Similar to tcp_v4_conn_request */
-+static int mptcp_v4_join_request(struct sock *meta_sk, struct sk_buff *skb)
-+{
-+ return tcp_conn_request(&mptcp_request_sock_ops,
-+ &mptcp_join_request_sock_ipv4_ops,
-+ meta_sk, skb);
-+}
-+
-+/* We only process join requests here. (either the SYN or the final ACK) */
-+int mptcp_v4_do_rcv(struct sock *meta_sk, struct sk_buff *skb)
-+{
-+ const struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
-+ struct sock *child, *rsk = NULL;
-+ int ret;
-+
-+ if (!(TCP_SKB_CB(skb)->mptcp_flags & MPTCPHDR_JOIN)) {
-+ struct tcphdr *th = tcp_hdr(skb);
-+ const struct iphdr *iph = ip_hdr(skb);
-+ struct sock *sk;
-+
-+ sk = inet_lookup_established(sock_net(meta_sk), &tcp_hashinfo,
-+ iph->saddr, th->source, iph->daddr,
-+ th->dest, inet_iif(skb));
-+
-+ if (!sk) {
-+ kfree_skb(skb);
-+ return 0;
-+ }
-+ if (is_meta_sk(sk)) {
-+			WARN("%s Did not find a sub-sk - found the meta instead!\n", __func__);
-+ kfree_skb(skb);
-+ sock_put(sk);
-+ return 0;
-+ }
-+
-+ if (sk->sk_state == TCP_TIME_WAIT) {
-+ inet_twsk_put(inet_twsk(sk));
-+ kfree_skb(skb);
-+ return 0;
-+ }
-+
-+ ret = tcp_v4_do_rcv(sk, skb);
-+ sock_put(sk);
-+
-+ return ret;
-+ }
-+ TCP_SKB_CB(skb)->mptcp_flags = 0;
-+
-+ /* Has been removed from the tk-table. Thus, no new subflows.
-+ *
-+ * Check for close-state is necessary, because we may have been closed
-+ * without passing by mptcp_close().
-+ *
-+ * When falling back, no new subflows are allowed either.
-+ */
-+ if (meta_sk->sk_state == TCP_CLOSE || !tcp_sk(meta_sk)->inside_tk_table ||
-+ mpcb->infinite_mapping_rcv || mpcb->send_infinite_mapping)
-+ goto reset_and_discard;
-+
-+ child = tcp_v4_hnd_req(meta_sk, skb);
-+
-+ if (!child)
-+ goto discard;
-+
-+ if (child != meta_sk) {
-+ sock_rps_save_rxhash(child, skb);
-+ /* We don't call tcp_child_process here, because we hold
-+ * already the meta-sk-lock and are sure that it is not owned
-+ * by the user.
-+ */
-+ ret = tcp_rcv_state_process(child, skb, tcp_hdr(skb), skb->len);
-+ bh_unlock_sock(child);
-+ sock_put(child);
-+ if (ret) {
-+ rsk = child;
-+ goto reset_and_discard;
-+ }
-+ } else {
-+ if (tcp_hdr(skb)->syn) {
-+ mptcp_v4_join_request(meta_sk, skb);
-+ goto discard;
-+ }
-+ goto reset_and_discard;
-+ }
-+ return 0;
-+
-+reset_and_discard:
-+ if (reqsk_queue_len(&inet_csk(meta_sk)->icsk_accept_queue)) {
-+ const struct tcphdr *th = tcp_hdr(skb);
-+ const struct iphdr *iph = ip_hdr(skb);
-+ struct request_sock **prev, *req;
-+ /* If we end up here, it means we should not have matched on the
-+ * request-socket. But, because the request-sock queue is only
-+ * destroyed in mptcp_close, the socket may actually already be
-+ * in close-state (e.g., through shutdown()) while still having
-+ * pending request sockets.
-+ */
-+ req = inet_csk_search_req(meta_sk, &prev, th->source,
-+ iph->saddr, iph->daddr);
-+ if (req) {
-+ inet_csk_reqsk_queue_unlink(meta_sk, req, prev);
-+ reqsk_queue_removed(&inet_csk(meta_sk)->icsk_accept_queue,
-+ req);
-+ reqsk_free(req);
-+ }
-+ }
-+
-+ tcp_v4_send_reset(rsk, skb);
-+discard:
-+ kfree_skb(skb);
-+ return 0;
-+}
-+
-+/* After this, the ref count of the meta_sk associated with the request_sock
-+ * is incremented. Thus it is the responsibility of the caller
-+ * to call sock_put() when the reference is not needed anymore.
-+ */
-+struct sock *mptcp_v4_search_req(const __be16 rport, const __be32 raddr,
-+ const __be32 laddr, const struct net *net)
-+{
-+ const struct mptcp_request_sock *mtreq;
-+ struct sock *meta_sk = NULL;
-+ const struct hlist_nulls_node *node;
-+ const u32 hash = inet_synq_hash(raddr, rport, 0, MPTCP_HASH_SIZE);
-+
-+ rcu_read_lock();
-+begin:
-+ hlist_nulls_for_each_entry_rcu(mtreq, node, &mptcp_reqsk_htb[hash],
-+ hash_entry) {
-+ struct inet_request_sock *ireq = inet_rsk(rev_mptcp_rsk(mtreq));
-+ meta_sk = mtreq->mptcp_mpcb->meta_sk;
-+
-+ if (ireq->ir_rmt_port == rport &&
-+ ireq->ir_rmt_addr == raddr &&
-+ ireq->ir_loc_addr == laddr &&
-+ rev_mptcp_rsk(mtreq)->rsk_ops->family == AF_INET &&
-+ net_eq(net, sock_net(meta_sk)))
-+ goto found;
-+ meta_sk = NULL;
-+ }
-+ /* A request-socket is destroyed by RCU. So, it might have been recycled
-+ * and put into another hash-table list. So, after the lookup we may
-+ * end up in a different list. So, we may need to restart.
-+ *
-+ * See also the comment in __inet_lookup_established.
-+ */
-+ if (get_nulls_value(node) != hash + MPTCP_REQSK_NULLS_BASE)
-+ goto begin;
-+
-+found:
-+ if (meta_sk && unlikely(!atomic_inc_not_zero(&meta_sk->sk_refcnt)))
-+ meta_sk = NULL;
-+ rcu_read_unlock();
-+
-+ return meta_sk;
-+}
-+
-+/* Create a new IPv4 subflow.
-+ *
-+ * We are in user-context and the meta-sock lock is held.
-+ */
-+int mptcp_init4_subsockets(struct sock *meta_sk, const struct mptcp_loc4 *loc,
-+ struct mptcp_rem4 *rem)
-+{
-+ struct tcp_sock *tp;
-+ struct sock *sk;
-+ struct sockaddr_in loc_in, rem_in;
-+ struct socket sock;
-+ int ret;
-+
-+ /** First, create and prepare the new socket */
-+
-+ sock.type = meta_sk->sk_socket->type;
-+ sock.state = SS_UNCONNECTED;
-+ sock.wq = meta_sk->sk_socket->wq;
-+ sock.file = meta_sk->sk_socket->file;
-+ sock.ops = NULL;
-+
-+ ret = inet_create(sock_net(meta_sk), &sock, IPPROTO_TCP, 1);
-+ if (unlikely(ret < 0)) {
-+ mptcp_debug("%s inet_create failed ret: %d\n", __func__, ret);
-+ return ret;
-+ }
-+
-+ sk = sock.sk;
-+ tp = tcp_sk(sk);
-+
-+ /* All subsockets need the MPTCP-lock-class */
-+ lockdep_set_class_and_name(&(sk)->sk_lock.slock, &meta_slock_key, "slock-AF_INET-MPTCP");
-+ lockdep_init_map(&(sk)->sk_lock.dep_map, "sk_lock-AF_INET-MPTCP", &meta_key, 0);
-+
-+ if (mptcp_add_sock(meta_sk, sk, loc->loc4_id, rem->rem4_id, GFP_KERNEL))
-+ goto error;
-+
-+ tp->mptcp->slave_sk = 1;
-+ tp->mptcp->low_prio = loc->low_prio;
-+
-+ /* Initializing the timer for an MPTCP subflow */
-+ setup_timer(&tp->mptcp->mptcp_ack_timer, mptcp_ack_handler, (unsigned long)sk);
-+
-+ /** Then, connect the socket to the peer */
-+ loc_in.sin_family = AF_INET;
-+ rem_in.sin_family = AF_INET;
-+ loc_in.sin_port = 0;
-+ if (rem->port)
-+ rem_in.sin_port = rem->port;
-+ else
-+ rem_in.sin_port = inet_sk(meta_sk)->inet_dport;
-+ loc_in.sin_addr = loc->addr;
-+ rem_in.sin_addr = rem->addr;
-+
-+ ret = sock.ops->bind(&sock, (struct sockaddr *)&loc_in, sizeof(struct sockaddr_in));
-+ if (ret < 0) {
-+ mptcp_debug("%s: MPTCP subsocket bind() failed, error %d\n",
-+ __func__, ret);
-+ goto error;
-+ }
-+
-+ mptcp_debug("%s: token %#x pi %d src_addr:%pI4:%d dst_addr:%pI4:%d\n",
-+ __func__, tcp_sk(meta_sk)->mpcb->mptcp_loc_token,
-+ tp->mptcp->path_index, &loc_in.sin_addr,
-+ ntohs(loc_in.sin_port), &rem_in.sin_addr,
-+ ntohs(rem_in.sin_port));
-+
-+ if (tcp_sk(meta_sk)->mpcb->pm_ops->init_subsocket_v4)
-+ tcp_sk(meta_sk)->mpcb->pm_ops->init_subsocket_v4(sk, rem->addr);
-+
-+ ret = sock.ops->connect(&sock, (struct sockaddr *)&rem_in,
-+ sizeof(struct sockaddr_in), O_NONBLOCK);
-+ if (ret < 0 && ret != -EINPROGRESS) {
-+ mptcp_debug("%s: MPTCP subsocket connect() failed, error %d\n",
-+ __func__, ret);
-+ goto error;
-+ }
-+
-+ sk_set_socket(sk, meta_sk->sk_socket);
-+ sk->sk_wq = meta_sk->sk_wq;
-+
-+ return 0;
-+
-+error:
-+ /* May happen if mptcp_add_sock fails first */
-+ if (!mptcp(tp)) {
-+ tcp_close(sk, 0);
-+ } else {
-+ local_bh_disable();
-+ mptcp_sub_force_close(sk);
-+ local_bh_enable();
-+ }
-+ return ret;
-+}
-+EXPORT_SYMBOL(mptcp_init4_subsockets);
-+
-+const struct inet_connection_sock_af_ops mptcp_v4_specific = {
-+ .queue_xmit = ip_queue_xmit,
-+ .send_check = tcp_v4_send_check,
-+ .rebuild_header = inet_sk_rebuild_header,
-+ .sk_rx_dst_set = inet_sk_rx_dst_set,
-+ .conn_request = mptcp_conn_request,
-+ .syn_recv_sock = tcp_v4_syn_recv_sock,
-+ .net_header_len = sizeof(struct iphdr),
-+ .setsockopt = ip_setsockopt,
-+ .getsockopt = ip_getsockopt,
-+ .addr2sockaddr = inet_csk_addr2sockaddr,
-+ .sockaddr_len = sizeof(struct sockaddr_in),
-+ .bind_conflict = inet_csk_bind_conflict,
-+#ifdef CONFIG_COMPAT
-+ .compat_setsockopt = compat_ip_setsockopt,
-+ .compat_getsockopt = compat_ip_getsockopt,
-+#endif
-+};
-+
-+struct tcp_request_sock_ops mptcp_request_sock_ipv4_ops;
-+struct tcp_request_sock_ops mptcp_join_request_sock_ipv4_ops;
-+
-+/* General initialization of IPv4 for MPTCP */
-+int mptcp_pm_v4_init(void)
-+{
-+ int ret = 0;
-+ struct request_sock_ops *ops = &mptcp_request_sock_ops;
-+
-+ mptcp_request_sock_ipv4_ops = tcp_request_sock_ipv4_ops;
-+ mptcp_request_sock_ipv4_ops.init_req = mptcp_v4_init_req;
-+
-+ mptcp_join_request_sock_ipv4_ops = tcp_request_sock_ipv4_ops;
-+ mptcp_join_request_sock_ipv4_ops.init_req = mptcp_v4_join_init_req;
-+ mptcp_join_request_sock_ipv4_ops.queue_hash_add = mptcp_v4_reqsk_queue_hash_add;
-+
-+ ops->slab_name = kasprintf(GFP_KERNEL, "request_sock_%s", "MPTCP");
-+ if (ops->slab_name == NULL) {
-+ ret = -ENOMEM;
-+ goto out;
-+ }
-+
-+ ops->slab = kmem_cache_create(ops->slab_name, ops->obj_size, 0,
-+ SLAB_DESTROY_BY_RCU|SLAB_HWCACHE_ALIGN,
-+ NULL);
-+
-+ if (ops->slab == NULL) {
-+ ret = -ENOMEM;
-+ goto err_reqsk_create;
-+ }
-+
-+out:
-+ return ret;
-+
-+err_reqsk_create:
-+ kfree(ops->slab_name);
-+ ops->slab_name = NULL;
-+ goto out;
-+}
-+
-+void mptcp_pm_v4_undo(void)
-+{
-+ kmem_cache_destroy(mptcp_request_sock_ops.slab);
-+ kfree(mptcp_request_sock_ops.slab_name);
-+}
-diff --git a/net/mptcp/mptcp_ipv6.c b/net/mptcp/mptcp_ipv6.c
-new file mode 100644
-index 000000000000..1036973aa855
---- /dev/null
-+++ b/net/mptcp/mptcp_ipv6.c
-@@ -0,0 +1,518 @@
-+/*
-+ * MPTCP implementation - IPv6-specific functions
-+ *
-+ * Initial Design & Implementation:
-+ * Sébastien Barré <sebastien.barre@uclouvain.be>
-+ *
-+ * Current Maintainer:
-+ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
-+ *
-+ * Additional authors:
-+ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
-+ * Gregory Detal <gregory.detal@uclouvain.be>
-+ * Fabien Duchêne <fabien.duchene@uclouvain.be>
-+ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
-+ * Lavkesh Lahngir <lavkesh51@gmail.com>
-+ * Andreas Ripke <ripke@neclab.eu>
-+ * Vlad Dogaru <vlad.dogaru@intel.com>
-+ * Octavian Purdila <octavian.purdila@intel.com>
-+ * John Ronan <jronan@tssg.org>
-+ * Catalin Nicutar <catalin.nicutar@gmail.com>
-+ * Brandon Heller <brandonh@stanford.edu>
-+ *
-+ *
-+ * This program is free software; you can redistribute it and/or
-+ * modify it under the terms of the GNU General Public License
-+ * as published by the Free Software Foundation; either version
-+ * 2 of the License, or (at your option) any later version.
-+ */
-+
-+#include <linux/export.h>
-+#include <linux/in6.h>
-+#include <linux/kernel.h>
-+
-+#include <net/addrconf.h>
-+#include <net/flow.h>
-+#include <net/inet6_connection_sock.h>
-+#include <net/inet6_hashtables.h>
-+#include <net/inet_common.h>
-+#include <net/ipv6.h>
-+#include <net/ip6_checksum.h>
-+#include <net/ip6_route.h>
-+#include <net/mptcp.h>
-+#include <net/mptcp_v6.h>
-+#include <net/tcp.h>
-+#include <net/transp_v6.h>
-+
-+__u32 mptcp_v6_get_nonce(const __be32 *saddr, const __be32 *daddr,
-+ __be16 sport, __be16 dport)
-+{
-+ u32 secret[MD5_MESSAGE_BYTES / 4];
-+ u32 hash[MD5_DIGEST_WORDS];
-+ u32 i;
-+
-+ memcpy(hash, saddr, 16);
-+ for (i = 0; i < 4; i++)
-+ secret[i] = mptcp_secret[i] + (__force u32)daddr[i];
-+ secret[4] = mptcp_secret[4] +
-+ (((__force u16)sport << 16) + (__force u16)dport);
-+ secret[5] = mptcp_seed++;
-+ for (i = 6; i < MD5_MESSAGE_BYTES / 4; i++)
-+ secret[i] = mptcp_secret[i];
-+
-+ md5_transform(hash, secret);
-+
-+ return hash[0];
-+}
-+
-+u64 mptcp_v6_get_key(const __be32 *saddr, const __be32 *daddr,
-+ __be16 sport, __be16 dport)
-+{
-+ u32 secret[MD5_MESSAGE_BYTES / 4];
-+ u32 hash[MD5_DIGEST_WORDS];
-+ u32 i;
-+
-+ memcpy(hash, saddr, 16);
-+ for (i = 0; i < 4; i++)
-+ secret[i] = mptcp_secret[i] + (__force u32)daddr[i];
-+ secret[4] = mptcp_secret[4] +
-+ (((__force u16)sport << 16) + (__force u16)dport);
-+ secret[5] = mptcp_seed++;
-+ for (i = 6; i < MD5_MESSAGE_BYTES / 4; i++)
-+ secret[i] = mptcp_secret[i];
-+
-+ md5_transform(hash, secret);
-+
-+ return *((u64 *)hash);
-+}
-+
-+static void mptcp_v6_reqsk_destructor(struct request_sock *req)
-+{
-+ mptcp_reqsk_destructor(req);
-+
-+ tcp_v6_reqsk_destructor(req);
-+}
-+
-+static int mptcp_v6_init_req(struct request_sock *req, struct sock *sk,
-+ struct sk_buff *skb)
-+{
-+ tcp_request_sock_ipv6_ops.init_req(req, sk, skb);
-+ mptcp_reqsk_init(req, skb);
-+
-+ return 0;
-+}
-+
-+static int mptcp_v6_join_init_req(struct request_sock *req, struct sock *sk,
-+ struct sk_buff *skb)
-+{
-+ struct mptcp_request_sock *mtreq = mptcp_rsk(req);
-+ struct mptcp_cb *mpcb = tcp_sk(sk)->mpcb;
-+ union inet_addr addr;
-+ int loc_id;
-+ bool low_prio = false;
-+
-+ /* We need to do this as early as possible. Because, if we fail later
-+ * (e.g., get_local_id), then reqsk_free tries to remove the
-+ * request-socket from the htb in mptcp_hash_request_remove as pprev
-+ * may be different from NULL.
-+ */
-+ mtreq->hash_entry.pprev = NULL;
-+
-+ tcp_request_sock_ipv6_ops.init_req(req, sk, skb);
-+
-+ mtreq->mptcp_loc_nonce = mptcp_v6_get_nonce(ipv6_hdr(skb)->saddr.s6_addr32,
-+ ipv6_hdr(skb)->daddr.s6_addr32,
-+ tcp_hdr(skb)->source,
-+ tcp_hdr(skb)->dest);
-+ addr.in6 = inet_rsk(req)->ir_v6_loc_addr;
-+ loc_id = mpcb->pm_ops->get_local_id(AF_INET6, &addr, sock_net(sk), &low_prio);
-+ if (loc_id == -1)
-+ return -1;
-+ mtreq->loc_id = loc_id;
-+ mtreq->low_prio = low_prio;
-+
-+ mptcp_join_reqsk_init(mpcb, req, skb);
-+
-+ return 0;
-+}
-+
-+/* Similar to tcp6_request_sock_ops */
-+struct request_sock_ops mptcp6_request_sock_ops __read_mostly = {
-+ .family = AF_INET6,
-+ .obj_size = sizeof(struct mptcp_request_sock),
-+ .rtx_syn_ack = tcp_v6_rtx_synack,
-+ .send_ack = tcp_v6_reqsk_send_ack,
-+ .destructor = mptcp_v6_reqsk_destructor,
-+ .send_reset = tcp_v6_send_reset,
-+ .syn_ack_timeout = tcp_syn_ack_timeout,
-+};
-+
-+static void mptcp_v6_reqsk_queue_hash_add(struct sock *meta_sk,
-+ struct request_sock *req,
-+ const unsigned long timeout)
-+{
-+ const u32 h1 = inet6_synq_hash(&inet_rsk(req)->ir_v6_rmt_addr,
-+ inet_rsk(req)->ir_rmt_port,
-+ 0, MPTCP_HASH_SIZE);
-+ /* We cannot call inet6_csk_reqsk_queue_hash_add(), because we do not
-+ * want to reset the keepalive-timer (responsible for retransmitting
-+ * SYN/ACKs). We do not retransmit SYN/ACKs+MP_JOINs, because we cannot
-+ * overload the keepalive timer. Also, it's not a big deal, because the
-+ * third ACK of the MP_JOIN-handshake is sent in a reliable manner. So,
-+ * if the third ACK gets lost, the client will handle the retransmission
-+ * anyways. If our SYN/ACK gets lost, the client will retransmit the
-+ * SYN.
-+ */
-+ struct inet_connection_sock *meta_icsk = inet_csk(meta_sk);
-+ struct listen_sock *lopt = meta_icsk->icsk_accept_queue.listen_opt;
-+ const u32 h2 = inet6_synq_hash(&inet_rsk(req)->ir_v6_rmt_addr,
-+ inet_rsk(req)->ir_rmt_port,
-+ lopt->hash_rnd, lopt->nr_table_entries);
-+
-+ reqsk_queue_hash_req(&meta_icsk->icsk_accept_queue, h2, req, timeout);
-+ if (reqsk_queue_added(&meta_icsk->icsk_accept_queue) == 0)
-+ mptcp_reset_synack_timer(meta_sk, timeout);
-+
-+ rcu_read_lock();
-+ spin_lock(&mptcp_reqsk_hlock);
-+ hlist_nulls_add_head_rcu(&mptcp_rsk(req)->hash_entry, &mptcp_reqsk_htb[h1]);
-+ spin_unlock(&mptcp_reqsk_hlock);
-+ rcu_read_unlock();
-+}
-+
-+static int mptcp_v6_join_request(struct sock *meta_sk, struct sk_buff *skb)
-+{
-+ return tcp_conn_request(&mptcp6_request_sock_ops,
-+ &mptcp_join_request_sock_ipv6_ops,
-+ meta_sk, skb);
-+}
-+
-+int mptcp_v6_do_rcv(struct sock *meta_sk, struct sk_buff *skb)
-+{
-+ const struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
-+ struct sock *child, *rsk = NULL;
-+ int ret;
-+
-+ if (!(TCP_SKB_CB(skb)->mptcp_flags & MPTCPHDR_JOIN)) {
-+ struct tcphdr *th = tcp_hdr(skb);
-+ const struct ipv6hdr *ip6h = ipv6_hdr(skb);
-+ struct sock *sk;
-+
-+ sk = __inet6_lookup_established(sock_net(meta_sk),
-+ &tcp_hashinfo,
-+ &ip6h->saddr, th->source,
-+ &ip6h->daddr, ntohs(th->dest),
-+ inet6_iif(skb));
-+
-+ if (!sk) {
-+ kfree_skb(skb);
-+ return 0;
-+ }
-+ if (is_meta_sk(sk)) {
-+ WARN("%s Did not find a sub-sk!\n", __func__);
-+ kfree_skb(skb);
-+ sock_put(sk);
-+ return 0;
-+ }
-+
-+ if (sk->sk_state == TCP_TIME_WAIT) {
-+ inet_twsk_put(inet_twsk(sk));
-+ kfree_skb(skb);
-+ return 0;
-+ }
-+
-+ ret = tcp_v6_do_rcv(sk, skb);
-+ sock_put(sk);
-+
-+ return ret;
-+ }
-+ TCP_SKB_CB(skb)->mptcp_flags = 0;
-+
-+ /* Has been removed from the tk-table. Thus, no new subflows.
-+ *
-+ * Check for close-state is necessary, because we may have been closed
-+ * without passing by mptcp_close().
-+ *
-+ * When falling back, no new subflows are allowed either.
-+ */
-+ if (meta_sk->sk_state == TCP_CLOSE || !tcp_sk(meta_sk)->inside_tk_table ||
-+ mpcb->infinite_mapping_rcv || mpcb->send_infinite_mapping)
-+ goto reset_and_discard;
-+
-+ child = tcp_v6_hnd_req(meta_sk, skb);
-+
-+ if (!child)
-+ goto discard;
-+
-+ if (child != meta_sk) {
-+ sock_rps_save_rxhash(child, skb);
-+ /* We don't call tcp_child_process here, because we already hold
-+ * the meta-sk-lock and are sure that it is not owned by the user.
-+ */
-+ ret = tcp_rcv_state_process(child, skb, tcp_hdr(skb), skb->len);
-+ bh_unlock_sock(child);
-+ sock_put(child);
-+ if (ret) {
-+ rsk = child;
-+ goto reset_and_discard;
-+ }
-+ } else {
-+ if (tcp_hdr(skb)->syn) {
-+ mptcp_v6_join_request(meta_sk, skb);
-+ goto discard;
-+ }
-+ goto reset_and_discard;
-+ }
-+ return 0;
-+
-+reset_and_discard:
-+ if (reqsk_queue_len(&inet_csk(meta_sk)->icsk_accept_queue)) {
-+ const struct tcphdr *th = tcp_hdr(skb);
-+ struct request_sock **prev, *req;
-+ /* If we end up here, it means we should not have matched on the
-+ * request-socket. But, because the request-sock queue is only
-+ * destroyed in mptcp_close, the socket may actually already be
-+ * in close-state (e.g., through shutdown()) while still having
-+ * pending request sockets.
-+ */
-+ req = inet6_csk_search_req(meta_sk, &prev, th->source,
-+ &ipv6_hdr(skb)->saddr,
-+ &ipv6_hdr(skb)->daddr, inet6_iif(skb));
-+ if (req) {
-+ inet_csk_reqsk_queue_unlink(meta_sk, req, prev);
-+ reqsk_queue_removed(&inet_csk(meta_sk)->icsk_accept_queue,
-+ req);
-+ reqsk_free(req);
-+ }
-+ }
-+
-+ tcp_v6_send_reset(rsk, skb);
-+discard:
-+ kfree_skb(skb);
-+ return 0;
-+}
-+
-+/* After this, the ref count of the meta_sk associated with the request_sock
-+ * is incremented. Thus it is the responsibility of the caller
-+ * to call sock_put() when the reference is not needed anymore.
-+ */
-+struct sock *mptcp_v6_search_req(const __be16 rport, const struct in6_addr *raddr,
-+ const struct in6_addr *laddr, const struct net *net)
-+{
-+ const struct mptcp_request_sock *mtreq;
-+ struct sock *meta_sk = NULL;
-+ const struct hlist_nulls_node *node;
-+ const u32 hash = inet6_synq_hash(raddr, rport, 0, MPTCP_HASH_SIZE);
-+
-+ rcu_read_lock();
-+begin:
-+ hlist_nulls_for_each_entry_rcu(mtreq, node, &mptcp_reqsk_htb[hash],
-+ hash_entry) {
-+ struct inet_request_sock *treq = inet_rsk(rev_mptcp_rsk(mtreq));
-+ meta_sk = mtreq->mptcp_mpcb->meta_sk;
-+
-+ if (inet_rsk(rev_mptcp_rsk(mtreq))->ir_rmt_port == rport &&
-+ rev_mptcp_rsk(mtreq)->rsk_ops->family == AF_INET6 &&
-+ ipv6_addr_equal(&treq->ir_v6_rmt_addr, raddr) &&
-+ ipv6_addr_equal(&treq->ir_v6_loc_addr, laddr) &&
-+ net_eq(net, sock_net(meta_sk)))
-+ goto found;
-+ meta_sk = NULL;
-+ }
-+ /* A request-socket is destroyed by RCU, so it might have been recycled
-+ * and put into another hash-table list. After the lookup we may therefore
-+ * end up in a different list and need to restart.
-+ *
-+ * See also the comment in __inet_lookup_established.
-+ */
-+ if (get_nulls_value(node) != hash + MPTCP_REQSK_NULLS_BASE)
-+ goto begin;
-+
-+found:
-+ if (meta_sk && unlikely(!atomic_inc_not_zero(&meta_sk->sk_refcnt)))
-+ meta_sk = NULL;
-+ rcu_read_unlock();
-+
-+ return meta_sk;
-+}
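The nulls-marker test that ends this lookup is what lets the lockless reader detect that a recycled request-sock moved it onto another chain. A minimal userspace sketch of just that check, where `DEMO_NULLS_BASE` is an illustrative stand-in for `MPTCP_REQSK_NULLS_BASE` from the MPTCP headers:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative stand-in; the real MPTCP_REQSK_NULLS_BASE is
 * defined in the MPTCP headers. */
#define DEMO_NULLS_BASE (1U << 29)

/* Each bucket of an hlist_nulls hash table ends in a distinct
 * "nulls" marker encoding the bucket index. A lockless reader
 * whose entry was recycled onto another chain finishes its walk
 * on a marker for the wrong bucket and knows it must restart,
 * which is the get_nulls_value() check in mptcp_v6_search_req(). */
static int demo_walk_ended_in_bucket(uint32_t nulls_value, uint32_t hash)
{
	return nulls_value == hash + DEMO_NULLS_BASE;
}
```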
-+
-+/* Create a new IPv6 subflow.
-+ *
-+ * We are in user-context and the meta-sock-lock is held.
-+ */
-+int mptcp_init6_subsockets(struct sock *meta_sk, const struct mptcp_loc6 *loc,
-+ struct mptcp_rem6 *rem)
-+{
-+ struct tcp_sock *tp;
-+ struct sock *sk;
-+ struct sockaddr_in6 loc_in, rem_in;
-+ struct socket sock;
-+ int ret;
-+
-+ /** First, create and prepare the new socket */
-+
-+ sock.type = meta_sk->sk_socket->type;
-+ sock.state = SS_UNCONNECTED;
-+ sock.wq = meta_sk->sk_socket->wq;
-+ sock.file = meta_sk->sk_socket->file;
-+ sock.ops = NULL;
-+
-+ ret = inet6_create(sock_net(meta_sk), &sock, IPPROTO_TCP, 1);
-+ if (unlikely(ret < 0)) {
-+ mptcp_debug("%s inet6_create failed ret: %d\n", __func__, ret);
-+ return ret;
-+ }
-+
-+ sk = sock.sk;
-+ tp = tcp_sk(sk);
-+
-+ /* All subsockets need the MPTCP-lock-class */
-+ lockdep_set_class_and_name(&(sk)->sk_lock.slock, &meta_slock_key, "slock-AF_INET-MPTCP");
-+ lockdep_init_map(&(sk)->sk_lock.dep_map, "sk_lock-AF_INET-MPTCP", &meta_key, 0);
-+
-+ if (mptcp_add_sock(meta_sk, sk, loc->loc6_id, rem->rem6_id, GFP_KERNEL))
-+ goto error;
-+
-+ tp->mptcp->slave_sk = 1;
-+ tp->mptcp->low_prio = loc->low_prio;
-+
-+ /* Initializing the timer for an MPTCP subflow */
-+ setup_timer(&tp->mptcp->mptcp_ack_timer, mptcp_ack_handler, (unsigned long)sk);
-+
-+ /** Then, connect the socket to the peer */
-+ loc_in.sin6_family = AF_INET6;
-+ rem_in.sin6_family = AF_INET6;
-+ loc_in.sin6_port = 0;
-+ if (rem->port)
-+ rem_in.sin6_port = rem->port;
-+ else
-+ rem_in.sin6_port = inet_sk(meta_sk)->inet_dport;
-+ loc_in.sin6_addr = loc->addr;
-+ rem_in.sin6_addr = rem->addr;
-+
-+ ret = sock.ops->bind(&sock, (struct sockaddr *)&loc_in, sizeof(struct sockaddr_in6));
-+ if (ret < 0) {
-+ mptcp_debug("%s: MPTCP subsocket bind() failed, error %d\n",
-+ __func__, ret);
-+ goto error;
-+ }
-+
-+ mptcp_debug("%s: token %#x pi %d src_addr:%pI6:%d dst_addr:%pI6:%d\n",
-+ __func__, tcp_sk(meta_sk)->mpcb->mptcp_loc_token,
-+ tp->mptcp->path_index, &loc_in.sin6_addr,
-+ ntohs(loc_in.sin6_port), &rem_in.sin6_addr,
-+ ntohs(rem_in.sin6_port));
-+
-+ if (tcp_sk(meta_sk)->mpcb->pm_ops->init_subsocket_v6)
-+ tcp_sk(meta_sk)->mpcb->pm_ops->init_subsocket_v6(sk, rem->addr);
-+
-+ ret = sock.ops->connect(&sock, (struct sockaddr *)&rem_in,
-+ sizeof(struct sockaddr_in6), O_NONBLOCK);
-+ if (ret < 0 && ret != -EINPROGRESS) {
-+ mptcp_debug("%s: MPTCP subsocket connect() failed, error %d\n",
-+ __func__, ret);
-+ goto error;
-+ }
-+
-+ sk_set_socket(sk, meta_sk->sk_socket);
-+ sk->sk_wq = meta_sk->sk_wq;
-+
-+ return 0;
-+
-+error:
-+ /* May happen if mptcp_add_sock fails first */
-+ if (!mptcp(tp)) {
-+ tcp_close(sk, 0);
-+ } else {
-+ local_bh_disable();
-+ mptcp_sub_force_close(sk);
-+ local_bh_enable();
-+ }
-+ return ret;
-+}
-+EXPORT_SYMBOL(mptcp_init6_subsockets);
-+
-+const struct inet_connection_sock_af_ops mptcp_v6_specific = {
-+ .queue_xmit = inet6_csk_xmit,
-+ .send_check = tcp_v6_send_check,
-+ .rebuild_header = inet6_sk_rebuild_header,
-+ .sk_rx_dst_set = inet6_sk_rx_dst_set,
-+ .conn_request = mptcp_conn_request,
-+ .syn_recv_sock = tcp_v6_syn_recv_sock,
-+ .net_header_len = sizeof(struct ipv6hdr),
-+ .net_frag_header_len = sizeof(struct frag_hdr),
-+ .setsockopt = ipv6_setsockopt,
-+ .getsockopt = ipv6_getsockopt,
-+ .addr2sockaddr = inet6_csk_addr2sockaddr,
-+ .sockaddr_len = sizeof(struct sockaddr_in6),
-+ .bind_conflict = inet6_csk_bind_conflict,
-+#ifdef CONFIG_COMPAT
-+ .compat_setsockopt = compat_ipv6_setsockopt,
-+ .compat_getsockopt = compat_ipv6_getsockopt,
-+#endif
-+};
-+
-+const struct inet_connection_sock_af_ops mptcp_v6_mapped = {
-+ .queue_xmit = ip_queue_xmit,
-+ .send_check = tcp_v4_send_check,
-+ .rebuild_header = inet_sk_rebuild_header,
-+ .sk_rx_dst_set = inet_sk_rx_dst_set,
-+ .conn_request = mptcp_conn_request,
-+ .syn_recv_sock = tcp_v6_syn_recv_sock,
-+ .net_header_len = sizeof(struct iphdr),
-+ .setsockopt = ipv6_setsockopt,
-+ .getsockopt = ipv6_getsockopt,
-+ .addr2sockaddr = inet6_csk_addr2sockaddr,
-+ .sockaddr_len = sizeof(struct sockaddr_in6),
-+ .bind_conflict = inet6_csk_bind_conflict,
-+#ifdef CONFIG_COMPAT
-+ .compat_setsockopt = compat_ipv6_setsockopt,
-+ .compat_getsockopt = compat_ipv6_getsockopt,
-+#endif
-+};
-+
-+struct tcp_request_sock_ops mptcp_request_sock_ipv6_ops;
-+struct tcp_request_sock_ops mptcp_join_request_sock_ipv6_ops;
-+
-+int mptcp_pm_v6_init(void)
-+{
-+ int ret = 0;
-+ struct request_sock_ops *ops = &mptcp6_request_sock_ops;
-+
-+ mptcp_request_sock_ipv6_ops = tcp_request_sock_ipv6_ops;
-+ mptcp_request_sock_ipv6_ops.init_req = mptcp_v6_init_req;
-+
-+ mptcp_join_request_sock_ipv6_ops = tcp_request_sock_ipv6_ops;
-+ mptcp_join_request_sock_ipv6_ops.init_req = mptcp_v6_join_init_req;
-+ mptcp_join_request_sock_ipv6_ops.queue_hash_add = mptcp_v6_reqsk_queue_hash_add;
-+
-+ ops->slab_name = kasprintf(GFP_KERNEL, "request_sock_%s", "MPTCP6");
-+ if (ops->slab_name == NULL) {
-+ ret = -ENOMEM;
-+ goto out;
-+ }
-+
-+ ops->slab = kmem_cache_create(ops->slab_name, ops->obj_size, 0,
-+ SLAB_DESTROY_BY_RCU|SLAB_HWCACHE_ALIGN,
-+ NULL);
-+
-+ if (ops->slab == NULL) {
-+ ret = -ENOMEM;
-+ goto err_reqsk_create;
-+ }
-+
-+out:
-+ return ret;
-+
-+err_reqsk_create:
-+ kfree(ops->slab_name);
-+ ops->slab_name = NULL;
-+ goto out;
-+}
-+
-+void mptcp_pm_v6_undo(void)
-+{
-+ kmem_cache_destroy(mptcp6_request_sock_ops.slab);
-+ kfree(mptcp6_request_sock_ops.slab_name);
-+}
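mptcp_pm_v6_init() above unwinds with goto labels so each failure path frees exactly what was already allocated. A hedged userspace model of that shape, where the demo_* names are invented and malloc/strcpy stand in for kasprintf, with `fail_slab` simulating kmem_cache_create() failing:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Allocate in order, unwind in reverse through labels so each
 * failure frees only what was already set up. */
static int demo_pm_init(char **name_out, int fail_slab)
{
	char *slab_name = malloc(32);

	if (!slab_name)
		return -1;		/* -ENOMEM */
	strcpy(slab_name, "request_sock_MPTCP6");

	if (fail_slab)
		goto err_reqsk_create;	/* kmem_cache_create() failed */

	*name_out = slab_name;
	return 0;

err_reqsk_create:
	free(slab_name);
	return -1;
}
```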
-diff --git a/net/mptcp/mptcp_ndiffports.c b/net/mptcp/mptcp_ndiffports.c
-new file mode 100644
-index 000000000000..6f5087983175
---- /dev/null
-+++ b/net/mptcp/mptcp_ndiffports.c
-@@ -0,0 +1,161 @@
-+#include <linux/module.h>
-+
-+#include <net/mptcp.h>
-+#include <net/mptcp_v4.h>
-+
-+#if IS_ENABLED(CONFIG_IPV6)
-+#include <net/mptcp_v6.h>
-+#endif
-+
-+struct ndiffports_priv {
-+ /* Worker struct for subflow establishment */
-+ struct work_struct subflow_work;
-+
-+ struct mptcp_cb *mpcb;
-+};
-+
-+static int num_subflows __read_mostly = 2;
-+module_param(num_subflows, int, 0644);
-+MODULE_PARM_DESC(num_subflows, "choose the number of subflows per MPTCP connection");
-+
-+/**
-+ * Create all new subflows by calling mptcp_initX_subsockets.
-+ *
-+ * This function uses a "goto next_subflow" to allow releasing the lock
-+ * between new subflows, giving other processes a chance to do some work
-+ * on the socket and potentially finish the communication.
-+ **/
-+static void create_subflow_worker(struct work_struct *work)
-+{
-+ const struct ndiffports_priv *pm_priv = container_of(work,
-+ struct ndiffports_priv,
-+ subflow_work);
-+ struct mptcp_cb *mpcb = pm_priv->mpcb;
-+ struct sock *meta_sk = mpcb->meta_sk;
-+ int iter = 0;
-+
-+next_subflow:
-+ if (iter) {
-+ release_sock(meta_sk);
-+ mutex_unlock(&mpcb->mpcb_mutex);
-+
-+ cond_resched();
-+ }
-+ mutex_lock(&mpcb->mpcb_mutex);
-+ lock_sock_nested(meta_sk, SINGLE_DEPTH_NESTING);
-+
-+ iter++;
-+
-+ if (sock_flag(meta_sk, SOCK_DEAD))
-+ goto exit;
-+
-+ if (mpcb->master_sk &&
-+ !tcp_sk(mpcb->master_sk)->mptcp->fully_established)
-+ goto exit;
-+
-+ if (num_subflows > iter && num_subflows > mpcb->cnt_subflows) {
-+ if (meta_sk->sk_family == AF_INET ||
-+ mptcp_v6_is_v4_mapped(meta_sk)) {
-+ struct mptcp_loc4 loc;
-+ struct mptcp_rem4 rem;
-+
-+ loc.addr.s_addr = inet_sk(meta_sk)->inet_saddr;
-+ loc.loc4_id = 0;
-+ loc.low_prio = 0;
-+
-+ rem.addr.s_addr = inet_sk(meta_sk)->inet_daddr;
-+ rem.port = inet_sk(meta_sk)->inet_dport;
-+ rem.rem4_id = 0; /* Default 0 */
-+
-+ mptcp_init4_subsockets(meta_sk, &loc, &rem);
-+ } else {
-+#if IS_ENABLED(CONFIG_IPV6)
-+ struct mptcp_loc6 loc;
-+ struct mptcp_rem6 rem;
-+
-+ loc.addr = inet6_sk(meta_sk)->saddr;
-+ loc.loc6_id = 0;
-+ loc.low_prio = 0;
-+
-+ rem.addr = meta_sk->sk_v6_daddr;
-+ rem.port = inet_sk(meta_sk)->inet_dport;
-+ rem.rem6_id = 0; /* Default 0 */
-+
-+ mptcp_init6_subsockets(meta_sk, &loc, &rem);
-+#endif
-+ }
-+ goto next_subflow;
-+ }
-+
-+exit:
-+ release_sock(meta_sk);
-+ mutex_unlock(&mpcb->mpcb_mutex);
-+ sock_put(meta_sk);
-+}
-+
-+static void ndiffports_new_session(const struct sock *meta_sk)
-+{
-+ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
-+ struct ndiffports_priv *fmp = (struct ndiffports_priv *)&mpcb->mptcp_pm[0];
-+
-+ /* Initialize workqueue-struct */
-+ INIT_WORK(&fmp->subflow_work, create_subflow_worker);
-+ fmp->mpcb = mpcb;
-+}
-+
-+static void ndiffports_create_subflows(struct sock *meta_sk)
-+{
-+ const struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
-+ struct ndiffports_priv *pm_priv = (struct ndiffports_priv *)&mpcb->mptcp_pm[0];
-+
-+ if (mpcb->infinite_mapping_snd || mpcb->infinite_mapping_rcv ||
-+ mpcb->send_infinite_mapping ||
-+ mpcb->server_side || sock_flag(meta_sk, SOCK_DEAD))
-+ return;
-+
-+ if (!work_pending(&pm_priv->subflow_work)) {
-+ sock_hold(meta_sk);
-+ queue_work(mptcp_wq, &pm_priv->subflow_work);
-+ }
-+}
-+
-+static int ndiffports_get_local_id(sa_family_t family, union inet_addr *addr,
-+ struct net *net, bool *low_prio)
-+{
-+ return 0;
-+}
-+
-+static struct mptcp_pm_ops ndiffports __read_mostly = {
-+ .new_session = ndiffports_new_session,
-+ .fully_established = ndiffports_create_subflows,
-+ .get_local_id = ndiffports_get_local_id,
-+ .name = "ndiffports",
-+ .owner = THIS_MODULE,
-+};
-+
-+/* General initialization of MPTCP_PM */
-+static int __init ndiffports_register(void)
-+{
-+ BUILD_BUG_ON(sizeof(struct ndiffports_priv) > MPTCP_PM_SIZE);
-+
-+ if (mptcp_register_path_manager(&ndiffports))
-+ goto exit;
-+
-+ return 0;
-+
-+exit:
-+ return -1;
-+}
-+
-+static void ndiffports_unregister(void)
-+{
-+ mptcp_unregister_path_manager(&ndiffports);
-+}
-+
-+module_init(ndiffports_register);
-+module_exit(ndiffports_unregister);
-+
-+MODULE_AUTHOR("Christoph Paasch");
-+MODULE_LICENSE("GPL");
-+MODULE_DESCRIPTION("NDIFF-PORTS MPTCP");
-+MODULE_VERSION("0.88");
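create_subflow_worker() above drops and re-takes the socket lock on every pass so other work on the socket can interleave. A simplified userspace model of that control flow, with the mutex and socket lock collapsed into a single depth counter; the demo_* names are invented:

```c
#include <assert.h>

struct demo_mpcb {
	int cnt_subflows;
	int lock_depth;
};

/* Each pass (re)takes the lock, creates at most one subflow, and
 * releases the lock before looping via goto, mirroring the
 * next_subflow loop in the patch. */
static int demo_worker(struct demo_mpcb *m, int num_subflows)
{
	int iter = 0, created = 0;

next_subflow:
	if (iter)
		m->lock_depth--;	/* release between subflows */
	m->lock_depth++;		/* mutex_lock + lock_sock */
	iter++;

	if (num_subflows > iter && num_subflows > m->cnt_subflows) {
		m->cnt_subflows++;	/* mptcp_initX_subsockets() */
		created++;
		goto next_subflow;
	}

	m->lock_depth--;		/* final release */
	return created;
}
```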
-diff --git a/net/mptcp/mptcp_ofo_queue.c b/net/mptcp/mptcp_ofo_queue.c
-new file mode 100644
-index 000000000000..ec4e98622637
---- /dev/null
-+++ b/net/mptcp/mptcp_ofo_queue.c
-@@ -0,0 +1,295 @@
-+/*
-+ * MPTCP implementation - Fast algorithm for MPTCP meta-reordering
-+ *
-+ * Initial Design & Implementation:
-+ * Sébastien Barré <sebastien.barre@uclouvain.be>
-+ *
-+ * Current Maintainer & Author:
-+ * Christoph Paasch <christoph.paasch@uclouvain.be>
-+ *
-+ * Additional authors:
-+ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
-+ * Gregory Detal <gregory.detal@uclouvain.be>
-+ * Fabien Duchêne <fabien.duchene@uclouvain.be>
-+ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
-+ * Lavkesh Lahngir <lavkesh51@gmail.com>
-+ * Andreas Ripke <ripke@neclab.eu>
-+ * Vlad Dogaru <vlad.dogaru@intel.com>
-+ * Octavian Purdila <octavian.purdila@intel.com>
-+ * John Ronan <jronan@tssg.org>
-+ * Catalin Nicutar <catalin.nicutar@gmail.com>
-+ * Brandon Heller <brandonh@stanford.edu>
-+ *
-+ * This program is free software; you can redistribute it and/or
-+ * modify it under the terms of the GNU General Public License
-+ * as published by the Free Software Foundation; either version
-+ * 2 of the License, or (at your option) any later version.
-+ */
-+
-+#include <linux/skbuff.h>
-+#include <linux/slab.h>
-+#include <net/tcp.h>
-+#include <net/mptcp.h>
-+
-+void mptcp_remove_shortcuts(const struct mptcp_cb *mpcb,
-+ const struct sk_buff *skb)
-+{
-+ struct tcp_sock *tp;
-+
-+ mptcp_for_each_tp(mpcb, tp) {
-+ if (tp->mptcp->shortcut_ofoqueue == skb) {
-+ tp->mptcp->shortcut_ofoqueue = NULL;
-+ return;
-+ }
-+ }
-+}
-+
-+/* Does 'skb' fit after 'here' in the queue 'head'?
-+ * If yes, we queue it and return 1.
-+ */
-+static int mptcp_ofo_queue_after(struct sk_buff_head *head,
-+ struct sk_buff *skb, struct sk_buff *here,
-+ const struct tcp_sock *tp)
-+{
-+ struct sock *meta_sk = tp->meta_sk;
-+ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
-+ u32 seq = TCP_SKB_CB(skb)->seq;
-+ u32 end_seq = TCP_SKB_CB(skb)->end_seq;
-+
-+ /* We want to queue skb after here, thus we need seq >= here->end_seq */
-+ if (before(seq, TCP_SKB_CB(here)->end_seq))
-+ return 0;
-+
-+ if (seq == TCP_SKB_CB(here)->end_seq) {
-+ bool fragstolen = false;
-+
-+ if (!tcp_try_coalesce(meta_sk, here, skb, &fragstolen)) {
-+ __skb_queue_after(&meta_tp->out_of_order_queue, here, skb);
-+ return 1;
-+ } else {
-+ kfree_skb_partial(skb, fragstolen);
-+ return -1;
-+ }
-+ }
-+
-+ /* If here is the last one, we can always queue it */
-+ if (skb_queue_is_last(head, here)) {
-+ __skb_queue_after(head, here, skb);
-+ return 1;
-+ } else {
-+ struct sk_buff *skb1 = skb_queue_next(head, here);
-+ /* It's not the last one, but does it fit between 'here' and
-+ * the one after 'here'? Thus, does end_seq <= after_here->seq?
-+ */
-+ if (!after(end_seq, TCP_SKB_CB(skb1)->seq)) {
-+ __skb_queue_after(head, here, skb);
-+ return 1;
-+ }
-+ }
-+
-+ return 0;
-+}
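The checks in mptcp_ofo_queue_after() rely on the wraparound-safe 32-bit sequence comparisons provided by before()/after() in net/tcp.h. A minimal userspace sketch of that arithmetic (demo_* names are illustrative, not from the patch):

```c
#include <assert.h>
#include <stdint.h>

/* The signed difference classifies any gap smaller than 2^31
 * correctly, even across the 0xffffffff -> 0 wrap. */
static inline int demo_before(uint32_t seq1, uint32_t seq2)
{
	return (int32_t)(seq1 - seq2) < 0;
}

static inline int demo_after(uint32_t seq1, uint32_t seq2)
{
	return demo_before(seq2, seq1);
}

/* The first check above: skb may sit after 'here' only if
 * skb->seq is not before here->end_seq. */
static inline int demo_fits_after(uint32_t seq, uint32_t here_end_seq)
{
	return !demo_before(seq, here_end_seq);
}
```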
-+
-+static void try_shortcut(struct sk_buff *shortcut, struct sk_buff *skb,
-+ struct sk_buff_head *head, struct tcp_sock *tp)
-+{
-+ struct sock *meta_sk = tp->meta_sk;
-+ struct tcp_sock *tp_it, *meta_tp = tcp_sk(meta_sk);
-+ struct mptcp_cb *mpcb = meta_tp->mpcb;
-+ struct sk_buff *skb1, *best_shortcut = NULL;
-+ u32 seq = TCP_SKB_CB(skb)->seq;
-+ u32 end_seq = TCP_SKB_CB(skb)->end_seq;
-+ u32 distance = 0xffffffff;
-+
-+ /* First, check the tp's shortcut */
-+ if (!shortcut) {
-+ if (skb_queue_empty(head)) {
-+ __skb_queue_head(head, skb);
-+ goto end;
-+ }
-+ } else {
-+ /* Is the tp's shortcut a hit? If yes, we insert. */
-+ int ret = mptcp_ofo_queue_after(head, skb, shortcut, tp);
-+
-+ if (ret) {
-+ skb = (ret > 0) ? skb : NULL;
-+ goto end;
-+ }
-+ }
-+
-+ /* Check the shortcuts of the other subsockets. */
-+ mptcp_for_each_tp(mpcb, tp_it) {
-+ shortcut = tp_it->mptcp->shortcut_ofoqueue;
-+ /* Can we queue it here? If yes, do so! */
-+ if (shortcut) {
-+ int ret = mptcp_ofo_queue_after(head, skb, shortcut, tp);
-+
-+ if (ret) {
-+ skb = (ret > 0) ? skb : NULL;
-+ goto end;
-+ }
-+ }
-+
-+ /* Could not queue it, check if we are close.
-+ * We are looking for a shortcut, close enough to seq to
-+ * set skb1 prematurely and thus improve the subsequent lookup,
-+ * which tries to find a skb1 so that skb1->seq <= seq.
-+ *
-+ * So, here we only take shortcuts whose shortcut->seq > seq,
-+ * and minimize the distance between shortcut->seq and seq and
-+ * set best_shortcut to this one with the minimal distance.
-+ *
-+ * That way, the subsequent while-loop is as short as possible.
-+ */
-+ if (shortcut && after(TCP_SKB_CB(shortcut)->seq, seq)) {
-+ /* Are we closer than the current best shortcut? */
-+ if ((u32)(TCP_SKB_CB(shortcut)->seq - seq) < distance) {
-+ distance = (u32)(TCP_SKB_CB(shortcut)->seq - seq);
-+ best_shortcut = shortcut;
-+ }
-+ }
-+ }
-+
-+ if (best_shortcut)
-+ skb1 = best_shortcut;
-+ else
-+ skb1 = skb_peek_tail(head);
-+
-+ if (seq == TCP_SKB_CB(skb1)->end_seq) {
-+ bool fragstolen = false;
-+
-+ if (!tcp_try_coalesce(meta_sk, skb1, skb, &fragstolen)) {
-+ __skb_queue_after(&meta_tp->out_of_order_queue, skb1, skb);
-+ } else {
-+ kfree_skb_partial(skb, fragstolen);
-+ skb = NULL;
-+ }
-+
-+ goto end;
-+ }
-+
-+ /* Find the insertion point, starting from best_shortcut if available.
-+ *
-+ * Inspired by tcp_data_queue_ofo.
-+ */
-+ while (1) {
-+ /* skb1->seq <= seq */
-+ if (!after(TCP_SKB_CB(skb1)->seq, seq))
-+ break;
-+ if (skb_queue_is_first(head, skb1)) {
-+ skb1 = NULL;
-+ break;
-+ }
-+ skb1 = skb_queue_prev(head, skb1);
-+ }
-+
-+ /* Does skb overlap the previous one? */
-+ if (skb1 && before(seq, TCP_SKB_CB(skb1)->end_seq)) {
-+ if (!after(end_seq, TCP_SKB_CB(skb1)->end_seq)) {
-+ /* All the bits are present. */
-+ __kfree_skb(skb);
-+ skb = NULL;
-+ goto end;
-+ }
-+ if (seq == TCP_SKB_CB(skb1)->seq) {
-+ if (skb_queue_is_first(head, skb1))
-+ skb1 = NULL;
-+ else
-+ skb1 = skb_queue_prev(head, skb1);
-+ }
-+ }
-+ if (!skb1)
-+ __skb_queue_head(head, skb);
-+ else
-+ __skb_queue_after(head, skb1, skb);
-+
-+ /* And clean segments covered by new one as whole. */
-+ while (!skb_queue_is_last(head, skb)) {
-+ skb1 = skb_queue_next(head, skb);
-+
-+ if (!after(end_seq, TCP_SKB_CB(skb1)->seq))
-+ break;
-+
-+ __skb_unlink(skb1, head);
-+ mptcp_remove_shortcuts(mpcb, skb1);
-+ __kfree_skb(skb1);
-+ }
-+
-+end:
-+ if (skb) {
-+ skb_set_owner_r(skb, meta_sk);
-+ tp->mptcp->shortcut_ofoqueue = skb;
-+ }
-+
-+ return;
-+}
-+
-+/**
-+ * mptcp_add_meta_ofo_queue - queue an out-of-order skb at the meta-level
-+ * @meta_sk: the meta-socket
-+ * @skb: the out-of-order segment
-+ * @sk: the subflow that received this skb
-+ */
-+void mptcp_add_meta_ofo_queue(const struct sock *meta_sk, struct sk_buff *skb,
-+ struct sock *sk)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+
-+ try_shortcut(tp->mptcp->shortcut_ofoqueue, skb,
-+ &tcp_sk(meta_sk)->out_of_order_queue, tp);
-+}
-+
-+bool mptcp_prune_ofo_queue(struct sock *sk)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ bool res = false;
-+
-+ if (!skb_queue_empty(&tp->out_of_order_queue)) {
-+ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_OFOPRUNED);
-+ mptcp_purge_ofo_queue(tp);
-+
-+ /* No sack at the mptcp-level */
-+ sk_mem_reclaim(sk);
-+ res = true;
-+ }
-+
-+ return res;
-+}
-+
-+void mptcp_ofo_queue(struct sock *meta_sk)
-+{
-+ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
-+ struct sk_buff *skb;
-+
-+ while ((skb = skb_peek(&meta_tp->out_of_order_queue)) != NULL) {
-+ u32 old_rcv_nxt = meta_tp->rcv_nxt;
-+ if (after(TCP_SKB_CB(skb)->seq, meta_tp->rcv_nxt))
-+ break;
-+
-+ if (!after(TCP_SKB_CB(skb)->end_seq, meta_tp->rcv_nxt)) {
-+ __skb_unlink(skb, &meta_tp->out_of_order_queue);
-+ mptcp_remove_shortcuts(meta_tp->mpcb, skb);
-+ __kfree_skb(skb);
-+ continue;
-+ }
-+
-+ __skb_unlink(skb, &meta_tp->out_of_order_queue);
-+ mptcp_remove_shortcuts(meta_tp->mpcb, skb);
-+
-+ __skb_queue_tail(&meta_sk->sk_receive_queue, skb);
-+ meta_tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq;
-+ mptcp_check_rcvseq_wrap(meta_tp, old_rcv_nxt);
-+
-+ if (tcp_hdr(skb)->fin)
-+ mptcp_fin(meta_sk);
-+ }
-+}
-+
-+void mptcp_purge_ofo_queue(struct tcp_sock *meta_tp)
-+{
-+ struct sk_buff_head *head = &meta_tp->out_of_order_queue;
-+ struct sk_buff *skb, *tmp;
-+
-+ skb_queue_walk_safe(head, skb, tmp) {
-+ __skb_unlink(skb, head);
-+ mptcp_remove_shortcuts(meta_tp->mpcb, skb);
-+ kfree_skb(skb);
-+ }
-+}
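mptcp_ofo_queue() above delivers in-order segments from the sorted out-of-order queue, dropping full duplicates and stopping at the first gap. A userspace model of that loop over a plain array (demo_* names are invented):

```c
#include <assert.h>
#include <stdint.h>

struct demo_seg {
	uint32_t seq;
	uint32_t end_seq;
};

/* Walk the sorted out-of-order list: drop segments entirely below
 * rcv_nxt, deliver overlapping or adjacent ones, stop at the
 * first gap. Returns the advanced rcv_nxt. */
static uint32_t demo_ofo_deliver(const struct demo_seg *q, int n,
				 uint32_t rcv_nxt)
{
	int i;

	for (i = 0; i < n; i++) {
		if ((int32_t)(q[i].seq - rcv_nxt) > 0)
			break;			/* gap: stop delivering */
		if ((int32_t)(q[i].end_seq - rcv_nxt) <= 0)
			continue;		/* full duplicate: drop */
		rcv_nxt = q[i].end_seq;		/* deliver and advance */
	}
	return rcv_nxt;
}
```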
-diff --git a/net/mptcp/mptcp_olia.c b/net/mptcp/mptcp_olia.c
-new file mode 100644
-index 000000000000..53f5c43bb488
---- /dev/null
-+++ b/net/mptcp/mptcp_olia.c
-@@ -0,0 +1,311 @@
-+/*
-+ * MPTCP implementation - OPPORTUNISTIC LINKED INCREASES CONGESTION CONTROL:
-+ *
-+ * Algorithm design:
-+ * Ramin Khalili <ramin.khalili@epfl.ch>
-+ * Nicolas Gast <nicolas.gast@epfl.ch>
-+ * Jean-Yves Le Boudec <jean-yves.leboudec@epfl.ch>
-+ *
-+ * Implementation:
-+ * Ramin Khalili <ramin.khalili@epfl.ch>
-+ *
-+ * Ported to the official MPTCP-kernel:
-+ * Christoph Paasch <christoph.paasch@uclouvain.be>
-+ *
-+ * This program is free software; you can redistribute it and/or
-+ * modify it under the terms of the GNU General Public License
-+ * as published by the Free Software Foundation; either version
-+ * 2 of the License, or (at your option) any later version.
-+ */
-+
-+
-+#include <net/tcp.h>
-+#include <net/mptcp.h>
-+
-+#include <linux/module.h>
-+
-+static int scale = 10;
-+
-+struct mptcp_olia {
-+ u32 mptcp_loss1;
-+ u32 mptcp_loss2;
-+ u32 mptcp_loss3;
-+ int epsilon_num;
-+ u32 epsilon_den;
-+ int mptcp_snd_cwnd_cnt;
-+};
-+
-+static inline int mptcp_olia_sk_can_send(const struct sock *sk)
-+{
-+ return mptcp_sk_can_send(sk) && tcp_sk(sk)->srtt_us;
-+}
-+
-+static inline u64 mptcp_olia_scale(u64 val, int scale)
-+{
-+ return (u64) val << scale;
-+}
-+
-+/* Account for the artificial inflation of cwnd (see RFC 5681)
-+ * during the fast-retransmit phase.
-+ */
-+static u32 mptcp_get_crt_cwnd(struct sock *sk)
-+{
-+ const struct inet_connection_sock *icsk = inet_csk(sk);
-+
-+ if (icsk->icsk_ca_state == TCP_CA_Recovery)
-+ return tcp_sk(sk)->snd_ssthresh;
-+ else
-+ return tcp_sk(sk)->snd_cwnd;
-+}
-+
-+/* Return the denominator of the first term of the increase term. */
-+static u64 mptcp_get_rate(const struct mptcp_cb *mpcb, u32 path_rtt)
-+{
-+ struct sock *sk;
-+ u64 rate = 1; /* We have to avoid a zero-rate because it is used as a divisor */
-+
-+ mptcp_for_each_sk(mpcb, sk) {
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ u64 scaled_num;
-+ u32 tmp_cwnd;
-+
-+ if (!mptcp_olia_sk_can_send(sk))
-+ continue;
-+
-+ tmp_cwnd = mptcp_get_crt_cwnd(sk);
-+ scaled_num = mptcp_olia_scale(tmp_cwnd, scale) * path_rtt;
-+ rate += div_u64(scaled_num, tp->srtt_us);
-+ }
-+ rate *= rate;
-+ return rate;
-+}
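mptcp_get_rate() computes the denominator of OLIA's increase term in fixed point: the sum of each subflow's cwnd scaled by the RTT ratio, then squared. A userspace sketch of the same arithmetic, with demo_* names and example values that are not from the patch:

```c
#include <assert.h>
#include <stdint.h>

struct demo_flow {
	uint32_t cwnd;
	uint32_t srtt_us;
};

/* (sum_i cwnd_i * path_rtt / rtt_i)^2 in fixed point: each cwnd
 * is shifted left by 'scale' bits before the division. The seed
 * of 1 avoids a zero divisor, as in the patch. */
static uint64_t demo_get_rate(const struct demo_flow *f, int n,
			      uint32_t path_rtt, int scale)
{
	uint64_t rate = 1;
	int i;

	for (i = 0; i < n; i++)
		rate += ((uint64_t)f[i].cwnd << scale) * path_rtt / f[i].srtt_us;
	return rate * rate;
}
```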
-+
-+/* find the maximum cwnd, used to find set M */
-+static u32 mptcp_get_max_cwnd(const struct mptcp_cb *mpcb)
-+{
-+ struct sock *sk;
-+ u32 best_cwnd = 0;
-+
-+ mptcp_for_each_sk(mpcb, sk) {
-+ u32 tmp_cwnd;
-+
-+ if (!mptcp_olia_sk_can_send(sk))
-+ continue;
-+
-+ tmp_cwnd = mptcp_get_crt_cwnd(sk);
-+ if (tmp_cwnd > best_cwnd)
-+ best_cwnd = tmp_cwnd;
-+ }
-+ return best_cwnd;
-+}
-+
-+static void mptcp_get_epsilon(const struct mptcp_cb *mpcb)
-+{
-+ struct mptcp_olia *ca;
-+ struct tcp_sock *tp;
-+ struct sock *sk;
-+ u64 tmp_int, tmp_rtt, best_int = 0, best_rtt = 1;
-+ u32 max_cwnd = 1, best_cwnd = 1, tmp_cwnd;
-+ u8 M = 0, B_not_M = 0;
-+
-+ /* TODO - integrate this in the following loop - we just want to iterate once */
-+
-+ max_cwnd = mptcp_get_max_cwnd(mpcb);
-+
-+ /* find the best path */
-+ mptcp_for_each_sk(mpcb, sk) {
-+ tp = tcp_sk(sk);
-+ ca = inet_csk_ca(sk);
-+
-+ if (!mptcp_olia_sk_can_send(sk))
-+ continue;
-+
-+ tmp_rtt = (u64)tp->srtt_us * tp->srtt_us;
-+ /* TODO - check here and rename variables */
-+ tmp_int = max(ca->mptcp_loss3 - ca->mptcp_loss2,
-+ ca->mptcp_loss2 - ca->mptcp_loss1);
-+
-+ tmp_cwnd = mptcp_get_crt_cwnd(sk);
-+ if ((u64)tmp_int * best_rtt >= (u64)best_int * tmp_rtt) {
-+ best_rtt = tmp_rtt;
-+ best_int = tmp_int;
-+ best_cwnd = tmp_cwnd;
-+ }
-+ }
-+
-+ /* TODO - integrate this here in mptcp_get_max_cwnd and in the previous loop */
-+ /* find the size of M and B_not_M */
-+ mptcp_for_each_sk(mpcb, sk) {
-+ tp = tcp_sk(sk);
-+ ca = inet_csk_ca(sk);
-+
-+ if (!mptcp_olia_sk_can_send(sk))
-+ continue;
-+
-+ tmp_cwnd = mptcp_get_crt_cwnd(sk);
-+ if (tmp_cwnd == max_cwnd) {
-+ M++;
-+ } else {
-+ tmp_rtt = (u64)tp->srtt_us * tp->srtt_us;
-+ tmp_int = max(ca->mptcp_loss3 - ca->mptcp_loss2,
-+ ca->mptcp_loss2 - ca->mptcp_loss1);
-+
-+ if ((u64)tmp_int * best_rtt == (u64)best_int * tmp_rtt)
-+ B_not_M++;
-+ }
-+ }
-+
-+ /* check if the path is in M or B_not_M and set the value of epsilon accordingly */
-+ mptcp_for_each_sk(mpcb, sk) {
-+ tp = tcp_sk(sk);
-+ ca = inet_csk_ca(sk);
-+
-+ if (!mptcp_olia_sk_can_send(sk))
-+ continue;
-+
-+ if (B_not_M == 0) {
-+ ca->epsilon_num = 0;
-+ ca->epsilon_den = 1;
-+ } else {
-+ tmp_rtt = (u64)tp->srtt_us * tp->srtt_us;
-+ tmp_int = max(ca->mptcp_loss3 - ca->mptcp_loss2,
-+ ca->mptcp_loss2 - ca->mptcp_loss1);
-+ tmp_cwnd = mptcp_get_crt_cwnd(sk);
-+
-+ if (tmp_cwnd < max_cwnd &&
-+ (u64)tmp_int * best_rtt == (u64)best_int * tmp_rtt) {
-+ ca->epsilon_num = 1;
-+ ca->epsilon_den = mpcb->cnt_established * B_not_M;
-+ } else if (tmp_cwnd == max_cwnd) {
-+ ca->epsilon_num = -1;
-+ ca->epsilon_den = mpcb->cnt_established * M;
-+ } else {
-+ ca->epsilon_num = 0;
-+ ca->epsilon_den = 1;
-+ }
-+ }
-+ }
-+}
-+
-+/* setting the initial values */
-+static void mptcp_olia_init(struct sock *sk)
-+{
-+ const struct tcp_sock *tp = tcp_sk(sk);
-+ struct mptcp_olia *ca = inet_csk_ca(sk);
-+
-+ if (mptcp(tp)) {
-+ ca->mptcp_loss1 = tp->snd_una;
-+ ca->mptcp_loss2 = tp->snd_una;
-+ ca->mptcp_loss3 = tp->snd_una;
-+ ca->mptcp_snd_cwnd_cnt = 0;
-+ ca->epsilon_num = 0;
-+ ca->epsilon_den = 1;
-+ }
-+}
-+
-+/* updating inter-loss distance and ssthresh */
-+static void mptcp_olia_set_state(struct sock *sk, u8 new_state)
-+{
-+ if (!mptcp(tcp_sk(sk)))
-+ return;
-+
-+ if (new_state == TCP_CA_Loss ||
-+ new_state == TCP_CA_Recovery || new_state == TCP_CA_CWR) {
-+ struct mptcp_olia *ca = inet_csk_ca(sk);
-+
-+ if (ca->mptcp_loss3 != ca->mptcp_loss2 &&
-+ !inet_csk(sk)->icsk_retransmits) {
-+ ca->mptcp_loss1 = ca->mptcp_loss2;
-+ ca->mptcp_loss2 = ca->mptcp_loss3;
-+ }
-+ }
-+}
-+
-+/* main algorithm */
-+static void mptcp_olia_cong_avoid(struct sock *sk, u32 ack, u32 acked)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ struct mptcp_olia *ca = inet_csk_ca(sk);
-+ const struct mptcp_cb *mpcb = tp->mpcb;
-+
-+ u64 inc_num, inc_den, rate, cwnd_scaled;
-+
-+ if (!mptcp(tp)) {
-+ tcp_reno_cong_avoid(sk, ack, acked);
-+ return;
-+ }
-+
-+ ca->mptcp_loss3 = tp->snd_una;
-+
-+ if (!tcp_is_cwnd_limited(sk))
-+ return;
-+
-+ /* slow start if it is in the safe area */
-+ if (tp->snd_cwnd <= tp->snd_ssthresh) {
-+ tcp_slow_start(tp, acked);
-+ return;
-+ }
-+
-+ mptcp_get_epsilon(mpcb);
-+ rate = mptcp_get_rate(mpcb, tp->srtt_us);
-+ cwnd_scaled = mptcp_olia_scale(tp->snd_cwnd, scale);
-+ inc_den = ca->epsilon_den * tp->snd_cwnd * rate ? : 1;
-+
-+ /* Calculate the increase term; scaling is used to reduce the rounding effect. */
-+ if (ca->epsilon_num == -1) {
-+ if (ca->epsilon_den * cwnd_scaled * cwnd_scaled < rate) {
-+ inc_num = rate - ca->epsilon_den *
-+ cwnd_scaled * cwnd_scaled;
-+ ca->mptcp_snd_cwnd_cnt -= div64_u64(
-+ mptcp_olia_scale(inc_num, scale), inc_den);
-+ } else {
-+ inc_num = ca->epsilon_den *
-+ cwnd_scaled * cwnd_scaled - rate;
-+ ca->mptcp_snd_cwnd_cnt += div64_u64(
-+ mptcp_olia_scale(inc_num, scale), inc_den);
-+ }
-+ } else {
-+ inc_num = ca->epsilon_num * rate +
-+ ca->epsilon_den * cwnd_scaled * cwnd_scaled;
-+ ca->mptcp_snd_cwnd_cnt += div64_u64(
-+ mptcp_olia_scale(inc_num, scale), inc_den);
-+ }
-+
-+
-+ if (ca->mptcp_snd_cwnd_cnt >= (1 << scale) - 1) {
-+ if (tp->snd_cwnd < tp->snd_cwnd_clamp)
-+ tp->snd_cwnd++;
-+ ca->mptcp_snd_cwnd_cnt = 0;
-+ } else if (ca->mptcp_snd_cwnd_cnt <= 0 - (1 << scale) + 1) {
-+ tp->snd_cwnd = max((int)1, (int)tp->snd_cwnd - 1);
-+ ca->mptcp_snd_cwnd_cnt = 0;
-+ }
-+}
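The counter at the end of mptcp_olia_cong_avoid() accumulates scaled fractional adjustments and moves cwnd by one full segment only when the counter crosses +/-((1 << scale) - 1). A minimal model of that update (demo_* names are invented):

```c
#include <assert.h>
#include <stdint.h>

struct demo_cc {
	uint32_t cwnd;
	int cnt;	/* scaled credit, like mptcp_snd_cwnd_cnt */
};

/* Per-ack adjustments accumulate in a counter scaled by 2^scale;
 * cwnd changes by one segment when the counter saturates, and the
 * counter is then reset. cwnd never drops below 1 and never grows
 * past the clamp. */
static void demo_apply_delta(struct demo_cc *cc, int delta, int scale,
			     uint32_t clamp)
{
	cc->cnt += delta;
	if (cc->cnt >= (1 << scale) - 1) {
		if (cc->cwnd < clamp)
			cc->cwnd++;
		cc->cnt = 0;
	} else if (cc->cnt <= -(1 << scale) + 1) {
		if (cc->cwnd > 1)
			cc->cwnd--;
		cc->cnt = 0;
	}
}
```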
-+
-+static struct tcp_congestion_ops mptcp_olia = {
-+ .init = mptcp_olia_init,
-+ .ssthresh = tcp_reno_ssthresh,
-+ .cong_avoid = mptcp_olia_cong_avoid,
-+ .set_state = mptcp_olia_set_state,
-+ .owner = THIS_MODULE,
-+ .name = "olia",
-+};
-+
-+static int __init mptcp_olia_register(void)
-+{
-+ BUILD_BUG_ON(sizeof(struct mptcp_olia) > ICSK_CA_PRIV_SIZE);
-+ return tcp_register_congestion_control(&mptcp_olia);
-+}
-+
-+static void __exit mptcp_olia_unregister(void)
-+{
-+ tcp_unregister_congestion_control(&mptcp_olia);
-+}
-+
-+module_init(mptcp_olia_register);
-+module_exit(mptcp_olia_unregister);
-+
-+MODULE_AUTHOR("Ramin Khalili, Nicolas Gast, Jean-Yves Le Boudec");
-+MODULE_LICENSE("GPL");
-+MODULE_DESCRIPTION("MPTCP COUPLED CONGESTION CONTROL");
-+MODULE_VERSION("0.1");
-diff --git a/net/mptcp/mptcp_output.c b/net/mptcp/mptcp_output.c
-new file mode 100644
-index 000000000000..400ea254c078
---- /dev/null
-+++ b/net/mptcp/mptcp_output.c
-@@ -0,0 +1,1743 @@
-+/*
-+ * MPTCP implementation - Sending side
-+ *
-+ * Initial Design & Implementation:
-+ * Sébastien Barré <sebastien.barre@uclouvain.be>
-+ *
-+ * Current Maintainer & Author:
-+ * Christoph Paasch <christoph.paasch@uclouvain.be>
-+ *
-+ * Additional authors:
-+ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
-+ * Gregory Detal <gregory.detal@uclouvain.be>
-+ * Fabien Duchêne <fabien.duchene@uclouvain.be>
-+ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
-+ * Lavkesh Lahngir <lavkesh51@gmail.com>
-+ * Andreas Ripke <ripke@neclab.eu>
-+ * Vlad Dogaru <vlad.dogaru@intel.com>
-+ * Octavian Purdila <octavian.purdila@intel.com>
-+ * John Ronan <jronan@tssg.org>
-+ * Catalin Nicutar <catalin.nicutar@gmail.com>
-+ * Brandon Heller <brandonh@stanford.edu>
-+ *
-+ *
-+ * This program is free software; you can redistribute it and/or
-+ * modify it under the terms of the GNU General Public License
-+ * as published by the Free Software Foundation; either version
-+ * 2 of the License, or (at your option) any later version.
-+ */
-+
-+#include <linux/kconfig.h>
-+#include <linux/skbuff.h>
-+#include <linux/tcp.h>
-+
-+#include <net/mptcp.h>
-+#include <net/mptcp_v4.h>
-+#include <net/mptcp_v6.h>
-+#include <net/sock.h>
-+
-+static const int mptcp_dss_len = MPTCP_SUB_LEN_DSS_ALIGN +
-+ MPTCP_SUB_LEN_ACK_ALIGN +
-+ MPTCP_SUB_LEN_SEQ_ALIGN;
-+
-+static inline int mptcp_sub_len_remove_addr(u16 bitfield)
-+{
-+ unsigned int c;
-+ for (c = 0; bitfield; c++)
-+ bitfield &= bitfield - 1;
-+ return MPTCP_SUB_LEN_REMOVE_ADDR + c - 1;
-+}
-+
-+int mptcp_sub_len_remove_addr_align(u16 bitfield)
-+{
-+ return ALIGN(mptcp_sub_len_remove_addr(bitfield), 4);
-+}
-+EXPORT_SYMBOL(mptcp_sub_len_remove_addr_align);
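mptcp_sub_len_remove_addr() uses Kernighan's bit trick: `bitfield &= bitfield - 1` clears the lowest set bit, so the loop runs once per address id announced in the REMOVE_ADDR option. A userspace sketch, where `DEMO_SUB_LEN_REMOVE_ADDR = 4` and the power-of-two alignment macro are assumptions standing in for the kernel's constants:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative stand-ins; the real MPTCP_SUB_LEN_REMOVE_ADDR and
 * ALIGN() are defined in the kernel headers. */
#define DEMO_SUB_LEN_REMOVE_ADDR 4
#define DEMO_ALIGN(x, a) (((x) + (a) - 1) & ~((a) - 1))

/* Count the set bits (announced address ids) and derive the
 * option length, exactly as the patch does. */
static int demo_sub_len_remove_addr(uint16_t bitfield)
{
	unsigned int c;

	for (c = 0; bitfield; c++)
		bitfield &= bitfield - 1;	/* clear lowest set bit */
	return DEMO_SUB_LEN_REMOVE_ADDR + c - 1;
}

static int demo_sub_len_remove_addr_align(uint16_t bitfield)
{
	return DEMO_ALIGN(demo_sub_len_remove_addr(bitfield), 4);
}
```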
-+
-+/* Get the data-seq and end-data-seq and store them back in the
-+ * tcp_skb_cb.
-+ */
-+static int mptcp_reconstruct_mapping(struct sk_buff *skb)
-+{
-+ const struct mp_dss *mpdss = (struct mp_dss *)TCP_SKB_CB(skb)->dss;
-+ u32 *p32;
-+ u16 *p16;
-+
-+ if (!mpdss->M)
-+ return 1;
-+
-+ /* Move the pointer to the data-seq */
-+ p32 = (u32 *)mpdss;
-+ p32++;
-+ if (mpdss->A) {
-+ p32++;
-+ if (mpdss->a)
-+ p32++;
-+ }
-+
-+ TCP_SKB_CB(skb)->seq = ntohl(*p32);
-+
-+ /* Get the data_len to calculate the end_data_seq */
-+ p32++;
-+ p32++;
-+ p16 = (u16 *)p32;
-+ TCP_SKB_CB(skb)->end_seq = ntohs(*p16) + TCP_SKB_CB(skb)->seq;
-+
-+ return 0;
-+}
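mptcp_reconstruct_mapping() walks past the DSS option header and the optional data-ack (one 32-bit word, or two when the 'a' flag marks an 8-byte ack) before reading the data sequence number. The offset computation alone, as a sketch (the demo_* name is invented):

```c
#include <assert.h>

/* Offset, in 32-bit words from the start of the DSS option, of
 * the data sequence number, following the p32 pointer walk in
 * mptcp_reconstruct_mapping(). */
static int demo_dseq_word_offset(int has_ack, int ack_is_64bit)
{
	int off = 1;			/* past kind/length/flags word */

	if (has_ack)
		off += ack_is_64bit ? 2 : 1;
	return off;
}
```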
-+
-+static void mptcp_find_and_set_pathmask(const struct sock *meta_sk, struct sk_buff *skb)
-+{
-+ struct sk_buff *skb_it;
-+
-+ skb_it = tcp_write_queue_head(meta_sk);
-+
-+ tcp_for_write_queue_from(skb_it, meta_sk) {
-+ if (skb_it == tcp_send_head(meta_sk))
-+ break;
-+
-+ if (TCP_SKB_CB(skb_it)->seq == TCP_SKB_CB(skb)->seq) {
-+ TCP_SKB_CB(skb)->path_mask = TCP_SKB_CB(skb_it)->path_mask;
-+ break;
-+ }
-+ }
-+}
-+
-+/* Reinject data from one TCP subflow to the meta_sk. If sk == NULL, we are
-+ * coming from the meta-retransmit-timer
-+ */
-+static void __mptcp_reinject_data(struct sk_buff *orig_skb, struct sock *meta_sk,
-+ struct sock *sk, int clone_it)
-+{
-+ struct sk_buff *skb, *skb1;
-+ const struct tcp_sock *meta_tp = tcp_sk(meta_sk);
-+ struct mptcp_cb *mpcb = meta_tp->mpcb;
-+ u32 seq, end_seq;
-+
-+ if (clone_it) {
-+ /* pskb_copy is necessary here, because the TCP/IP-headers
-+ * will be changed when it's going to be reinjected on another
-+ * subflow.
-+ */
-+ skb = pskb_copy_for_clone(orig_skb, GFP_ATOMIC);
-+ } else {
-+ __skb_unlink(orig_skb, &sk->sk_write_queue);
-+ sock_set_flag(sk, SOCK_QUEUE_SHRUNK);
-+ sk->sk_wmem_queued -= orig_skb->truesize;
-+ sk_mem_uncharge(sk, orig_skb->truesize);
-+ skb = orig_skb;
-+ }
-+ if (unlikely(!skb))
-+ return;
-+
-+ if (sk && mptcp_reconstruct_mapping(skb)) {
-+ __kfree_skb(skb);
-+ return;
-+ }
-+
-+ skb->sk = meta_sk;
-+
-+	/* If it has already reached the destination, we don't have to reinject it */
-+ if (!after(TCP_SKB_CB(skb)->end_seq, meta_tp->snd_una)) {
-+ __kfree_skb(skb);
-+ return;
-+ }
-+
-+ /* Only reinject segments that are fully covered by the mapping */
-+ if (skb->len + (mptcp_is_data_fin(skb) ? 1 : 0) !=
-+ TCP_SKB_CB(skb)->end_seq - TCP_SKB_CB(skb)->seq) {
-+ u32 seq = TCP_SKB_CB(skb)->seq;
-+ u32 end_seq = TCP_SKB_CB(skb)->end_seq;
-+
-+ __kfree_skb(skb);
-+
-+ /* Ok, now we have to look for the full mapping in the meta
-+ * send-queue :S
-+ */
-+ tcp_for_write_queue(skb, meta_sk) {
-+ /* Not yet at the mapping? */
-+ if (before(TCP_SKB_CB(skb)->seq, seq))
-+ continue;
-+ /* We have passed by the mapping */
-+ if (after(TCP_SKB_CB(skb)->end_seq, end_seq))
-+ return;
-+
-+ __mptcp_reinject_data(skb, meta_sk, NULL, 1);
-+ }
-+ return;
-+ }
-+
-+ /* Segment goes back to the MPTCP-layer. So, we need to zero the
-+ * path_mask/dss.
-+ */
-+	memset(TCP_SKB_CB(skb)->dss, 0, mptcp_dss_len);
-+
-+ /* We need to find out the path-mask from the meta-write-queue
-+ * to properly select a subflow.
-+ */
-+ mptcp_find_and_set_pathmask(meta_sk, skb);
-+
-+ /* If it's empty, just add */
-+ if (skb_queue_empty(&mpcb->reinject_queue)) {
-+ skb_queue_head(&mpcb->reinject_queue, skb);
-+ return;
-+ }
-+
-+ /* Find place to insert skb - or even we can 'drop' it, as the
-+ * data is already covered by other skb's in the reinject-queue.
-+ *
-+ * This is inspired by code from tcp_data_queue.
-+ */
-+
-+ skb1 = skb_peek_tail(&mpcb->reinject_queue);
-+ seq = TCP_SKB_CB(skb)->seq;
-+ while (1) {
-+ if (!after(TCP_SKB_CB(skb1)->seq, seq))
-+ break;
-+ if (skb_queue_is_first(&mpcb->reinject_queue, skb1)) {
-+ skb1 = NULL;
-+ break;
-+ }
-+ skb1 = skb_queue_prev(&mpcb->reinject_queue, skb1);
-+ }
-+
-+	/* Does skb overlap the previous one? */
-+ end_seq = TCP_SKB_CB(skb)->end_seq;
-+ if (skb1 && before(seq, TCP_SKB_CB(skb1)->end_seq)) {
-+ if (!after(end_seq, TCP_SKB_CB(skb1)->end_seq)) {
-+ /* All the bits are present. Don't reinject */
-+ __kfree_skb(skb);
-+ return;
-+ }
-+ if (seq == TCP_SKB_CB(skb1)->seq) {
-+ if (skb_queue_is_first(&mpcb->reinject_queue, skb1))
-+ skb1 = NULL;
-+ else
-+ skb1 = skb_queue_prev(&mpcb->reinject_queue, skb1);
-+ }
-+ }
-+ if (!skb1)
-+ __skb_queue_head(&mpcb->reinject_queue, skb);
-+ else
-+ __skb_queue_after(&mpcb->reinject_queue, skb1, skb);
-+
-+ /* And clean segments covered by new one as whole. */
-+ while (!skb_queue_is_last(&mpcb->reinject_queue, skb)) {
-+ skb1 = skb_queue_next(&mpcb->reinject_queue, skb);
-+
-+ if (!after(end_seq, TCP_SKB_CB(skb1)->seq))
-+ break;
-+
-+ __skb_unlink(skb1, &mpcb->reinject_queue);
-+ __kfree_skb(skb1);
-+ }
-+ return;
-+}
-+
-+/* Inserts data into the reinject queue */
-+void mptcp_reinject_data(struct sock *sk, int clone_it)
-+{
-+ struct sk_buff *skb_it, *tmp;
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ struct sock *meta_sk = tp->meta_sk;
-+
-+ /* It has already been closed - there is really no point in reinjecting */
-+ if (meta_sk->sk_state == TCP_CLOSE)
-+ return;
-+
-+ skb_queue_walk_safe(&sk->sk_write_queue, skb_it, tmp) {
-+ struct tcp_skb_cb *tcb = TCP_SKB_CB(skb_it);
-+		/* Subflow SYNs and FINs are not reinjected,
-+		 * nor are empty subflow-FINs carrying a data-fin;
-+		 * those are reinjected below (without the subflow-FIN flag).
-+		 */
-+ if (tcb->tcp_flags & TCPHDR_SYN ||
-+ (tcb->tcp_flags & TCPHDR_FIN && !mptcp_is_data_fin(skb_it)) ||
-+ (tcb->tcp_flags & TCPHDR_FIN && mptcp_is_data_fin(skb_it) && !skb_it->len))
-+ continue;
-+
-+ __mptcp_reinject_data(skb_it, meta_sk, sk, clone_it);
-+ }
-+
-+ skb_it = tcp_write_queue_tail(meta_sk);
-+ /* If sk has sent the empty data-fin, we have to reinject it too. */
-+ if (skb_it && mptcp_is_data_fin(skb_it) && skb_it->len == 0 &&
-+ TCP_SKB_CB(skb_it)->path_mask & mptcp_pi_to_flag(tp->mptcp->path_index)) {
-+ __mptcp_reinject_data(skb_it, meta_sk, NULL, 1);
-+ }
-+
-+ mptcp_push_pending_frames(meta_sk);
-+
-+ tp->pf = 1;
-+}
-+EXPORT_SYMBOL(mptcp_reinject_data);
-+
-+static void mptcp_combine_dfin(const struct sk_buff *skb, const struct sock *meta_sk,
-+ struct sock *subsk)
-+{
-+ const struct tcp_sock *meta_tp = tcp_sk(meta_sk);
-+ struct mptcp_cb *mpcb = meta_tp->mpcb;
-+ struct sock *sk_it;
-+ int all_empty = 1, all_acked;
-+
-+ /* In infinite mapping we always try to combine */
-+ if (mpcb->infinite_mapping_snd && tcp_close_state(subsk)) {
-+ subsk->sk_shutdown |= SEND_SHUTDOWN;
-+ TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_FIN;
-+ return;
-+ }
-+
-+ /* Don't combine, if they didn't combine - otherwise we end up in
-+ * TIME_WAIT, even if our app is smart enough to avoid it
-+ */
-+ if (meta_sk->sk_shutdown & RCV_SHUTDOWN) {
-+ if (!mpcb->dfin_combined)
-+ return;
-+ }
-+
-+ /* If no other subflow has data to send, we can combine */
-+ mptcp_for_each_sk(mpcb, sk_it) {
-+ if (!mptcp_sk_can_send(sk_it))
-+ continue;
-+
-+ if (!tcp_write_queue_empty(sk_it))
-+ all_empty = 0;
-+ }
-+
-+ /* If all data has been DATA_ACKed, we can combine.
-+ * -1, because the data_fin consumed one byte
-+ */
-+ all_acked = (meta_tp->snd_una == (meta_tp->write_seq - 1));
-+
-+ if ((all_empty || all_acked) && tcp_close_state(subsk)) {
-+ subsk->sk_shutdown |= SEND_SHUTDOWN;
-+ TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_FIN;
-+ }
-+}
-+
-+static int mptcp_write_dss_mapping(const struct tcp_sock *tp, const struct sk_buff *skb,
-+ __be32 *ptr)
-+{
-+ const struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
-+ __be32 *start = ptr;
-+ __u16 data_len;
-+
-+ *ptr++ = htonl(tcb->seq); /* data_seq */
-+
-+ /* If it's a non-data DATA_FIN, we set subseq to 0 (draft v7) */
-+ if (mptcp_is_data_fin(skb) && skb->len == 0)
-+ *ptr++ = 0; /* subseq */
-+ else
-+ *ptr++ = htonl(tp->write_seq - tp->mptcp->snt_isn); /* subseq */
-+
-+ if (tcb->mptcp_flags & MPTCPHDR_INF)
-+ data_len = 0;
-+ else
-+ data_len = tcb->end_seq - tcb->seq;
-+
-+ if (tp->mpcb->dss_csum && data_len) {
-+ __be16 *p16 = (__be16 *)ptr;
-+ __be32 hdseq = mptcp_get_highorder_sndbits(skb, tp->mpcb);
-+ __wsum csum;
-+
-+ *ptr = htonl(((data_len) << 16) |
-+ (TCPOPT_EOL << 8) |
-+ (TCPOPT_EOL));
-+ csum = csum_partial(ptr - 2, 12, skb->csum);
-+ p16++;
-+ *p16++ = csum_fold(csum_partial(&hdseq, sizeof(hdseq), csum));
-+ } else {
-+ *ptr++ = htonl(((data_len) << 16) |
-+ (TCPOPT_NOP << 8) |
-+ (TCPOPT_NOP));
-+ }
-+
-+ return ptr - start;
-+}
-+
-+static int mptcp_write_dss_data_ack(const struct tcp_sock *tp, const struct sk_buff *skb,
-+ __be32 *ptr)
-+{
-+ struct mp_dss *mdss = (struct mp_dss *)ptr;
-+ __be32 *start = ptr;
-+
-+ mdss->kind = TCPOPT_MPTCP;
-+ mdss->sub = MPTCP_SUB_DSS;
-+ mdss->rsv1 = 0;
-+ mdss->rsv2 = 0;
-+ mdss->F = mptcp_is_data_fin(skb) ? 1 : 0;
-+ mdss->m = 0;
-+ mdss->M = mptcp_is_data_seq(skb) ? 1 : 0;
-+ mdss->a = 0;
-+ mdss->A = 1;
-+ mdss->len = mptcp_sub_len_dss(mdss, tp->mpcb->dss_csum);
-+ ptr++;
-+
-+ *ptr++ = htonl(mptcp_meta_tp(tp)->rcv_nxt);
-+
-+ return ptr - start;
-+}
-+
-+/* RFC6824 states that once a particular subflow mapping has been sent
-+ * out it must never be changed. However, packets may be split while
-+ * they are in the retransmission queue (due to SACK or ACKs) and that
-+ * arguably means that we would change the mapping (e.g. it splits it,
-+ * or sends out a subset of the initial mapping).
-+ *
-+ * Furthermore, the skb checksum is not always preserved across splits
-+ * (e.g. mptcp_fragment) which would mean that we need to recompute
-+ * the DSS checksum in this case.
-+ *
-+ * To avoid this we save the initial DSS mapping which allows us to
-+ * send the same DSS mapping even for fragmented retransmits.
-+ */
-+static void mptcp_save_dss_data_seq(const struct tcp_sock *tp, struct sk_buff *skb)
-+{
-+ struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
-+ __be32 *ptr = (__be32 *)tcb->dss;
-+
-+ tcb->mptcp_flags |= MPTCPHDR_SEQ;
-+
-+ ptr += mptcp_write_dss_data_ack(tp, skb, ptr);
-+ ptr += mptcp_write_dss_mapping(tp, skb, ptr);
-+}
-+
-+/* Write the saved DSS mapping to the header */
-+static int mptcp_write_dss_data_seq(const struct tcp_sock *tp, struct sk_buff *skb,
-+ __be32 *ptr)
-+{
-+ __be32 *start = ptr;
-+
-+ memcpy(ptr, TCP_SKB_CB(skb)->dss, mptcp_dss_len);
-+
-+ /* update the data_ack */
-+ start[1] = htonl(mptcp_meta_tp(tp)->rcv_nxt);
-+
-+ /* dss is in a union with inet_skb_parm and
-+ * the IP layer expects zeroed IPCB fields.
-+ */
-+	memset(TCP_SKB_CB(skb)->dss, 0, mptcp_dss_len);
-+
-+	return mptcp_dss_len / sizeof(*ptr);
-+}
-+
-+static bool mptcp_skb_entail(struct sock *sk, struct sk_buff *skb, int reinject)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ const struct sock *meta_sk = mptcp_meta_sk(sk);
-+ const struct mptcp_cb *mpcb = tp->mpcb;
-+ struct tcp_skb_cb *tcb;
-+ struct sk_buff *subskb = NULL;
-+
-+ if (!reinject)
-+ TCP_SKB_CB(skb)->mptcp_flags |= (mpcb->snd_hiseq_index ?
-+ MPTCPHDR_SEQ64_INDEX : 0);
-+
-+ subskb = pskb_copy_for_clone(skb, GFP_ATOMIC);
-+ if (!subskb)
-+ return false;
-+
-+	/* At the subflow-level we need to call tcp_init_tso_segs again. We
-+ * force this, by setting gso_segs to 0. It has been set to 1 prior to
-+ * the call to mptcp_skb_entail.
-+ */
-+ skb_shinfo(subskb)->gso_segs = 0;
-+
-+ TCP_SKB_CB(skb)->path_mask |= mptcp_pi_to_flag(tp->mptcp->path_index);
-+
-+ if (!(sk->sk_route_caps & NETIF_F_ALL_CSUM) &&
-+ skb->ip_summed == CHECKSUM_PARTIAL) {
-+ subskb->csum = skb->csum = skb_checksum(skb, 0, skb->len, 0);
-+ subskb->ip_summed = skb->ip_summed = CHECKSUM_NONE;
-+ }
-+
-+ tcb = TCP_SKB_CB(subskb);
-+
-+ if (tp->mpcb->send_infinite_mapping &&
-+ !tp->mpcb->infinite_mapping_snd &&
-+ !before(tcb->seq, mptcp_meta_tp(tp)->snd_nxt)) {
-+ tp->mptcp->fully_established = 1;
-+ tp->mpcb->infinite_mapping_snd = 1;
-+ tp->mptcp->infinite_cutoff_seq = tp->write_seq;
-+ tcb->mptcp_flags |= MPTCPHDR_INF;
-+ }
-+
-+ if (mptcp_is_data_fin(subskb))
-+ mptcp_combine_dfin(subskb, meta_sk, sk);
-+
-+ mptcp_save_dss_data_seq(tp, subskb);
-+
-+ tcb->seq = tp->write_seq;
-+ tcb->sacked = 0; /* reset the sacked field: from the point of view
-+ * of this subflow, we are sending a brand new
-+ * segment
-+ */
-+ /* Take into account seg len */
-+ tp->write_seq += subskb->len + ((tcb->tcp_flags & TCPHDR_FIN) ? 1 : 0);
-+ tcb->end_seq = tp->write_seq;
-+
-+ /* If it's a non-payload DATA_FIN (also no subflow-fin), the
-+ * segment is not part of the subflow but on a meta-only-level.
-+ */
-+ if (!mptcp_is_data_fin(subskb) || tcb->end_seq != tcb->seq) {
-+ tcp_add_write_queue_tail(sk, subskb);
-+ sk->sk_wmem_queued += subskb->truesize;
-+ sk_mem_charge(sk, subskb->truesize);
-+ } else {
-+ int err;
-+
-+ /* Necessary to initialize for tcp_transmit_skb. mss of 1, as
-+ * skb->len = 0 will force tso_segs to 1.
-+ */
-+ tcp_init_tso_segs(sk, subskb, 1);
-+		/* Empty data-fins are sent immediately on the subflow */
-+ TCP_SKB_CB(subskb)->when = tcp_time_stamp;
-+ err = tcp_transmit_skb(sk, subskb, 1, GFP_ATOMIC);
-+
-+ /* It has not been queued, we can free it now. */
-+ kfree_skb(subskb);
-+
-+ if (err)
-+ return false;
-+ }
-+
-+ if (!tp->mptcp->fully_established) {
-+ tp->mptcp->second_packet = 1;
-+ tp->mptcp->last_end_data_seq = TCP_SKB_CB(skb)->end_seq;
-+ }
-+
-+ return true;
-+}
-+
-+/* Fragment an skb and update the mptcp meta-data. Due to reinject, we
-+ * might need to undo some operations done by tcp_fragment.
-+ */
-+static int mptcp_fragment(struct sock *meta_sk, struct sk_buff *skb, u32 len,
-+ gfp_t gfp, int reinject)
-+{
-+ int ret, diff, old_factor;
-+ struct sk_buff *buff;
-+ u8 flags;
-+
-+ if (skb_headlen(skb) < len)
-+ diff = skb->len - len;
-+ else
-+ diff = skb->data_len;
-+ old_factor = tcp_skb_pcount(skb);
-+
-+ /* The mss_now in tcp_fragment is used to set the tso_segs of the skb.
-+ * At the MPTCP-level we do not care about the absolute value. All we
-+ * care about is that it is set to 1 for accurate packets_out
-+ * accounting.
-+ */
-+ ret = tcp_fragment(meta_sk, skb, len, UINT_MAX, gfp);
-+ if (ret)
-+ return ret;
-+
-+ buff = skb->next;
-+
-+ flags = TCP_SKB_CB(skb)->mptcp_flags;
-+ TCP_SKB_CB(skb)->mptcp_flags = flags & ~(MPTCPHDR_FIN);
-+ TCP_SKB_CB(buff)->mptcp_flags = flags;
-+ TCP_SKB_CB(buff)->path_mask = TCP_SKB_CB(skb)->path_mask;
-+
-+ /* If reinject == 1, the buff will be added to the reinject
-+ * queue, which is currently not part of memory accounting. So
-+ * undo the changes done by tcp_fragment and update the
-+ * reinject queue. Also, undo changes to the packet counters.
-+ */
-+ if (reinject == 1) {
-+ int undo = buff->truesize - diff;
-+ meta_sk->sk_wmem_queued -= undo;
-+ sk_mem_uncharge(meta_sk, undo);
-+
-+ tcp_sk(meta_sk)->mpcb->reinject_queue.qlen++;
-+ meta_sk->sk_write_queue.qlen--;
-+
-+ if (!before(tcp_sk(meta_sk)->snd_nxt, TCP_SKB_CB(buff)->end_seq)) {
-+ undo = old_factor - tcp_skb_pcount(skb) -
-+ tcp_skb_pcount(buff);
-+ if (undo)
-+ tcp_adjust_pcount(meta_sk, skb, -undo);
-+ }
-+ }
-+
-+ return 0;
-+}
-+
-+/* Inspired by tcp_write_wakeup */
-+int mptcp_write_wakeup(struct sock *meta_sk)
-+{
-+ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
-+ struct sk_buff *skb;
-+ struct sock *sk_it;
-+ int ans = 0;
-+
-+ if (meta_sk->sk_state == TCP_CLOSE)
-+ return -1;
-+
-+ skb = tcp_send_head(meta_sk);
-+ if (skb &&
-+ before(TCP_SKB_CB(skb)->seq, tcp_wnd_end(meta_tp))) {
-+ unsigned int mss;
-+ unsigned int seg_size = tcp_wnd_end(meta_tp) - TCP_SKB_CB(skb)->seq;
-+ struct sock *subsk = meta_tp->mpcb->sched_ops->get_subflow(meta_sk, skb, true);
-+ struct tcp_sock *subtp;
-+ if (!subsk)
-+ goto window_probe;
-+ subtp = tcp_sk(subsk);
-+ mss = tcp_current_mss(subsk);
-+
-+ seg_size = min(tcp_wnd_end(meta_tp) - TCP_SKB_CB(skb)->seq,
-+ tcp_wnd_end(subtp) - subtp->write_seq);
-+
-+ if (before(meta_tp->pushed_seq, TCP_SKB_CB(skb)->end_seq))
-+ meta_tp->pushed_seq = TCP_SKB_CB(skb)->end_seq;
-+
-+ /* We are probing the opening of a window
-+ * but the window size is != 0
-+		 * must have been a result of SWS avoidance (sender)
-+ */
-+ if (seg_size < TCP_SKB_CB(skb)->end_seq - TCP_SKB_CB(skb)->seq ||
-+ skb->len > mss) {
-+ seg_size = min(seg_size, mss);
-+ TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_PSH;
-+ if (mptcp_fragment(meta_sk, skb, seg_size,
-+ GFP_ATOMIC, 0))
-+ return -1;
-+ } else if (!tcp_skb_pcount(skb)) {
-+ /* see mptcp_write_xmit on why we use UINT_MAX */
-+ tcp_set_skb_tso_segs(meta_sk, skb, UINT_MAX);
-+ }
-+
-+ TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_PSH;
-+ if (!mptcp_skb_entail(subsk, skb, 0))
-+ return -1;
-+ TCP_SKB_CB(skb)->when = tcp_time_stamp;
-+
-+ mptcp_check_sndseq_wrap(meta_tp, TCP_SKB_CB(skb)->end_seq -
-+ TCP_SKB_CB(skb)->seq);
-+ tcp_event_new_data_sent(meta_sk, skb);
-+
-+ __tcp_push_pending_frames(subsk, mss, TCP_NAGLE_PUSH);
-+
-+ return 0;
-+ } else {
-+window_probe:
-+ if (between(meta_tp->snd_up, meta_tp->snd_una + 1,
-+ meta_tp->snd_una + 0xFFFF)) {
-+ mptcp_for_each_sk(meta_tp->mpcb, sk_it) {
-+ if (mptcp_sk_can_send_ack(sk_it))
-+ tcp_xmit_probe_skb(sk_it, 1);
-+ }
-+ }
-+
-+ /* At least one of the tcp_xmit_probe_skb's has to succeed */
-+ mptcp_for_each_sk(meta_tp->mpcb, sk_it) {
-+ int ret;
-+
-+ if (!mptcp_sk_can_send_ack(sk_it))
-+ continue;
-+
-+ ret = tcp_xmit_probe_skb(sk_it, 0);
-+ if (unlikely(ret > 0))
-+ ans = ret;
-+ }
-+ return ans;
-+ }
-+}
-+
-+bool mptcp_write_xmit(struct sock *meta_sk, unsigned int mss_now, int nonagle,
-+ int push_one, gfp_t gfp)
-+{
-+ struct tcp_sock *meta_tp = tcp_sk(meta_sk), *subtp;
-+ struct sock *subsk = NULL;
-+ struct mptcp_cb *mpcb = meta_tp->mpcb;
-+ struct sk_buff *skb;
-+ unsigned int sent_pkts;
-+ int reinject = 0;
-+ unsigned int sublimit;
-+
-+ sent_pkts = 0;
-+
-+ while ((skb = mpcb->sched_ops->next_segment(meta_sk, &reinject, &subsk,
-+ &sublimit))) {
-+ unsigned int limit;
-+
-+ subtp = tcp_sk(subsk);
-+ mss_now = tcp_current_mss(subsk);
-+
-+ if (reinject == 1) {
-+ if (!after(TCP_SKB_CB(skb)->end_seq, meta_tp->snd_una)) {
-+ /* Segment already reached the peer, take the next one */
-+ __skb_unlink(skb, &mpcb->reinject_queue);
-+ __kfree_skb(skb);
-+ continue;
-+ }
-+ }
-+
-+ /* If the segment was cloned (e.g. a meta retransmission),
-+ * the header must be expanded/copied so that there is no
-+ * corruption of TSO information.
-+ */
-+ if (skb_unclone(skb, GFP_ATOMIC))
-+ break;
-+
-+ if (unlikely(!tcp_snd_wnd_test(meta_tp, skb, mss_now)))
-+ break;
-+
-+ /* Force tso_segs to 1 by using UINT_MAX.
-+ * We actually don't care about the exact number of segments
-+ * emitted on the subflow. We need just to set tso_segs, because
-+ * we still need an accurate packets_out count in
-+ * tcp_event_new_data_sent.
-+ */
-+ tcp_set_skb_tso_segs(meta_sk, skb, UINT_MAX);
-+
-+		/* Check for nagle, regardless of tso_segs. If the segment is
-+ * actually larger than mss_now (TSO segment), then
-+ * tcp_nagle_check will have partial == false and always trigger
-+ * the transmission.
-+ * tcp_write_xmit has a TSO-level nagle check which is not
-+ * subject to the MPTCP-level. It is based on the properties of
-+ * the subflow, not the MPTCP-level.
-+ */
-+ if (unlikely(!tcp_nagle_test(meta_tp, skb, mss_now,
-+ (tcp_skb_is_last(meta_sk, skb) ?
-+ nonagle : TCP_NAGLE_PUSH))))
-+ break;
-+
-+ limit = mss_now;
-+ /* skb->len > mss_now is the equivalent of tso_segs > 1 in
-+ * tcp_write_xmit. Otherwise split-point would return 0.
-+ */
-+ if (skb->len > mss_now && !tcp_urg_mode(meta_tp))
-+ /* We limit the size of the skb so that it fits into the
-+ * window. Call tcp_mss_split_point to avoid duplicating
-+ * code.
-+ * We really only care about fitting the skb into the
-+ * window. That's why we use UINT_MAX. If the skb does
-+ * not fit into the cwnd_quota or the NIC's max-segs
-+ * limitation, it will be split by the subflow's
-+ * tcp_write_xmit which does the appropriate call to
-+ * tcp_mss_split_point.
-+ */
-+ limit = tcp_mss_split_point(meta_sk, skb, mss_now,
-+ UINT_MAX / mss_now,
-+ nonagle);
-+
-+ if (sublimit)
-+ limit = min(limit, sublimit);
-+
-+ if (skb->len > limit &&
-+ unlikely(mptcp_fragment(meta_sk, skb, limit, gfp, reinject)))
-+ break;
-+
-+ if (!mptcp_skb_entail(subsk, skb, reinject))
-+ break;
-+ /* Nagle is handled at the MPTCP-layer, so
-+ * always push on the subflow
-+ */
-+ __tcp_push_pending_frames(subsk, mss_now, TCP_NAGLE_PUSH);
-+ TCP_SKB_CB(skb)->when = tcp_time_stamp;
-+
-+ if (!reinject) {
-+ mptcp_check_sndseq_wrap(meta_tp,
-+ TCP_SKB_CB(skb)->end_seq -
-+ TCP_SKB_CB(skb)->seq);
-+ tcp_event_new_data_sent(meta_sk, skb);
-+ }
-+
-+ tcp_minshall_update(meta_tp, mss_now, skb);
-+ sent_pkts += tcp_skb_pcount(skb);
-+
-+ if (reinject > 0) {
-+ __skb_unlink(skb, &mpcb->reinject_queue);
-+ kfree_skb(skb);
-+ }
-+
-+ if (push_one)
-+ break;
-+ }
-+
-+ return !meta_tp->packets_out && tcp_send_head(meta_sk);
-+}
-+
-+void mptcp_write_space(struct sock *sk)
-+{
-+ mptcp_push_pending_frames(mptcp_meta_sk(sk));
-+}
-+
-+u32 __mptcp_select_window(struct sock *sk)
-+{
-+ struct inet_connection_sock *icsk = inet_csk(sk);
-+ struct tcp_sock *tp = tcp_sk(sk), *meta_tp = mptcp_meta_tp(tp);
-+ int mss, free_space, full_space, window;
-+
-+ /* MSS for the peer's data. Previous versions used mss_clamp
-+ * here. I don't know if the value based on our guesses
-+ * of peer's MSS is better for the performance. It's more correct
-+ * but may be worse for the performance because of rcv_mss
-+ * fluctuations. --SAW 1998/11/1
-+ */
-+ mss = icsk->icsk_ack.rcv_mss;
-+ free_space = tcp_space(sk);
-+ full_space = min_t(int, meta_tp->window_clamp,
-+ tcp_full_space(sk));
-+
-+ if (mss > full_space)
-+ mss = full_space;
-+
-+ if (free_space < (full_space >> 1)) {
-+ icsk->icsk_ack.quick = 0;
-+
-+ if (tcp_memory_pressure)
-+ /* TODO this has to be adapted when we support different
-+ * MSS's among the subflows.
-+ */
-+ meta_tp->rcv_ssthresh = min(meta_tp->rcv_ssthresh,
-+ 4U * meta_tp->advmss);
-+
-+ if (free_space < mss)
-+ return 0;
-+ }
-+
-+ if (free_space > meta_tp->rcv_ssthresh)
-+ free_space = meta_tp->rcv_ssthresh;
-+
-+ /* Don't do rounding if we are using window scaling, since the
-+ * scaled window will not line up with the MSS boundary anyway.
-+ */
-+ window = meta_tp->rcv_wnd;
-+ if (tp->rx_opt.rcv_wscale) {
-+ window = free_space;
-+
-+ /* Advertise enough space so that it won't get scaled away.
-+		 * Important case: prevent zero window announcement if
-+ * 1<<rcv_wscale > mss.
-+ */
-+ if (((window >> tp->rx_opt.rcv_wscale) << tp->
-+ rx_opt.rcv_wscale) != window)
-+ window = (((window >> tp->rx_opt.rcv_wscale) + 1)
-+ << tp->rx_opt.rcv_wscale);
-+ } else {
-+ /* Get the largest window that is a nice multiple of mss.
-+ * Window clamp already applied above.
-+ * If our current window offering is within 1 mss of the
-+ * free space we just keep it. This prevents the divide
-+ * and multiply from happening most of the time.
-+ * We also don't do any window rounding when the free space
-+ * is too small.
-+ */
-+ if (window <= free_space - mss || window > free_space)
-+ window = (free_space / mss) * mss;
-+ else if (mss == full_space &&
-+ free_space > window + (full_space >> 1))
-+ window = free_space;
-+ }
-+
-+ return window;
-+}
-+
-+void mptcp_syn_options(const struct sock *sk, struct tcp_out_options *opts,
-+ unsigned *remaining)
-+{
-+ const struct tcp_sock *tp = tcp_sk(sk);
-+
-+ opts->options |= OPTION_MPTCP;
-+ if (is_master_tp(tp)) {
-+ opts->mptcp_options |= OPTION_MP_CAPABLE | OPTION_TYPE_SYN;
-+ *remaining -= MPTCP_SUB_LEN_CAPABLE_SYN_ALIGN;
-+ opts->mp_capable.sender_key = tp->mptcp_loc_key;
-+ opts->dss_csum = !!sysctl_mptcp_checksum;
-+ } else {
-+ const struct mptcp_cb *mpcb = tp->mpcb;
-+
-+ opts->mptcp_options |= OPTION_MP_JOIN | OPTION_TYPE_SYN;
-+ *remaining -= MPTCP_SUB_LEN_JOIN_SYN_ALIGN;
-+ opts->mp_join_syns.token = mpcb->mptcp_rem_token;
-+ opts->mp_join_syns.low_prio = tp->mptcp->low_prio;
-+ opts->addr_id = tp->mptcp->loc_id;
-+ opts->mp_join_syns.sender_nonce = tp->mptcp->mptcp_loc_nonce;
-+ }
-+}
-+
-+void mptcp_synack_options(struct request_sock *req,
-+ struct tcp_out_options *opts, unsigned *remaining)
-+{
-+ struct mptcp_request_sock *mtreq;
-+ mtreq = mptcp_rsk(req);
-+
-+ opts->options |= OPTION_MPTCP;
-+ /* MPCB not yet set - thus it's a new MPTCP-session */
-+ if (!mtreq->is_sub) {
-+ opts->mptcp_options |= OPTION_MP_CAPABLE | OPTION_TYPE_SYNACK;
-+ opts->mp_capable.sender_key = mtreq->mptcp_loc_key;
-+ opts->dss_csum = !!sysctl_mptcp_checksum || mtreq->dss_csum;
-+ *remaining -= MPTCP_SUB_LEN_CAPABLE_SYN_ALIGN;
-+ } else {
-+ opts->mptcp_options |= OPTION_MP_JOIN | OPTION_TYPE_SYNACK;
-+ opts->mp_join_syns.sender_truncated_mac =
-+ mtreq->mptcp_hash_tmac;
-+ opts->mp_join_syns.sender_nonce = mtreq->mptcp_loc_nonce;
-+ opts->mp_join_syns.low_prio = mtreq->low_prio;
-+ opts->addr_id = mtreq->loc_id;
-+ *remaining -= MPTCP_SUB_LEN_JOIN_SYNACK_ALIGN;
-+ }
-+}
-+
-+void mptcp_established_options(struct sock *sk, struct sk_buff *skb,
-+ struct tcp_out_options *opts, unsigned *size)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ struct mptcp_cb *mpcb = tp->mpcb;
-+ const struct tcp_skb_cb *tcb = skb ? TCP_SKB_CB(skb) : NULL;
-+
-+ /* We are coming from tcp_current_mss with the meta_sk as an argument.
-+ * It does not make sense to check for the options, because when the
-+ * segment gets sent, another subflow will be chosen.
-+ */
-+ if (!skb && is_meta_sk(sk))
-+ return;
-+
-+ /* In fallback mp_fail-mode, we have to repeat it until the fallback
-+ * has been done by the sender
-+ */
-+ if (unlikely(tp->mptcp->send_mp_fail)) {
-+ opts->options |= OPTION_MPTCP;
-+ opts->mptcp_options |= OPTION_MP_FAIL;
-+ *size += MPTCP_SUB_LEN_FAIL;
-+ return;
-+ }
-+
-+ if (unlikely(tp->send_mp_fclose)) {
-+ opts->options |= OPTION_MPTCP;
-+ opts->mptcp_options |= OPTION_MP_FCLOSE;
-+ opts->mp_capable.receiver_key = mpcb->mptcp_rem_key;
-+ *size += MPTCP_SUB_LEN_FCLOSE_ALIGN;
-+ return;
-+ }
-+
-+ /* 1. If we are the sender of the infinite-mapping, we need the
-+ * MPTCPHDR_INF-flag, because a retransmission of the
-+	 * infinite-announcement still needs the mptcp-option.
-+ *
-+ * We need infinite_cutoff_seq, because retransmissions from before
-+ * the infinite-cutoff-moment still need the MPTCP-signalling to stay
-+ * consistent.
-+ *
-+ * 2. If we are the receiver of the infinite-mapping, we always skip
-+ * mptcp-options, because acknowledgments from before the
-+ * infinite-mapping point have already been sent out.
-+ *
-+ * I know, the whole infinite-mapping stuff is ugly...
-+ *
-+ * TODO: Handle wrapped data-sequence numbers
-+ * (even if it's very unlikely)
-+ */
-+ if (unlikely(mpcb->infinite_mapping_snd) &&
-+ ((mpcb->send_infinite_mapping && tcb &&
-+ mptcp_is_data_seq(skb) &&
-+ !(tcb->mptcp_flags & MPTCPHDR_INF) &&
-+ !before(tcb->seq, tp->mptcp->infinite_cutoff_seq)) ||
-+ !mpcb->send_infinite_mapping))
-+ return;
-+
-+ if (unlikely(tp->mptcp->include_mpc)) {
-+ opts->options |= OPTION_MPTCP;
-+ opts->mptcp_options |= OPTION_MP_CAPABLE |
-+ OPTION_TYPE_ACK;
-+ *size += MPTCP_SUB_LEN_CAPABLE_ACK_ALIGN;
-+ opts->mp_capable.sender_key = mpcb->mptcp_loc_key;
-+ opts->mp_capable.receiver_key = mpcb->mptcp_rem_key;
-+ opts->dss_csum = mpcb->dss_csum;
-+
-+ if (skb)
-+ tp->mptcp->include_mpc = 0;
-+ }
-+ if (unlikely(tp->mptcp->pre_established)) {
-+ opts->options |= OPTION_MPTCP;
-+ opts->mptcp_options |= OPTION_MP_JOIN | OPTION_TYPE_ACK;
-+ *size += MPTCP_SUB_LEN_JOIN_ACK_ALIGN;
-+ }
-+
-+ if (!tp->mptcp->include_mpc && !tp->mptcp->pre_established) {
-+ opts->options |= OPTION_MPTCP;
-+ opts->mptcp_options |= OPTION_DATA_ACK;
-+ /* If !skb, we come from tcp_current_mss and thus we always
-+ * assume that the DSS-option will be set for the data-packet.
-+ */
-+ if (skb && !mptcp_is_data_seq(skb)) {
-+ *size += MPTCP_SUB_LEN_ACK_ALIGN;
-+ } else {
-+			/* It doesn't matter whether the csum is included or
-+			 * not; the length will be either 10 or 12, and thus
-+			 * aligned = 12.
-+ */
-+ *size += MPTCP_SUB_LEN_ACK_ALIGN +
-+ MPTCP_SUB_LEN_SEQ_ALIGN;
-+ }
-+
-+ *size += MPTCP_SUB_LEN_DSS_ALIGN;
-+ }
-+
-+ if (unlikely(mpcb->addr_signal) && mpcb->pm_ops->addr_signal)
-+ mpcb->pm_ops->addr_signal(sk, size, opts, skb);
-+
-+ if (unlikely(tp->mptcp->send_mp_prio) &&
-+ MAX_TCP_OPTION_SPACE - *size >= MPTCP_SUB_LEN_PRIO_ALIGN) {
-+ opts->options |= OPTION_MPTCP;
-+ opts->mptcp_options |= OPTION_MP_PRIO;
-+ if (skb)
-+ tp->mptcp->send_mp_prio = 0;
-+ *size += MPTCP_SUB_LEN_PRIO_ALIGN;
-+ }
-+
-+ return;
-+}
-+
-+u16 mptcp_select_window(struct sock *sk)
-+{
-+ u16 new_win = tcp_select_window(sk);
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ struct tcp_sock *meta_tp = mptcp_meta_tp(tp);
-+
-+ meta_tp->rcv_wnd = tp->rcv_wnd;
-+ meta_tp->rcv_wup = meta_tp->rcv_nxt;
-+
-+ return new_win;
-+}
-+
-+void mptcp_options_write(__be32 *ptr, struct tcp_sock *tp,
-+ const struct tcp_out_options *opts,
-+ struct sk_buff *skb)
-+{
-+ if (unlikely(OPTION_MP_CAPABLE & opts->mptcp_options)) {
-+ struct mp_capable *mpc = (struct mp_capable *)ptr;
-+
-+ mpc->kind = TCPOPT_MPTCP;
-+
-+ if ((OPTION_TYPE_SYN & opts->mptcp_options) ||
-+ (OPTION_TYPE_SYNACK & opts->mptcp_options)) {
-+ mpc->sender_key = opts->mp_capable.sender_key;
-+ mpc->len = MPTCP_SUB_LEN_CAPABLE_SYN;
-+ ptr += MPTCP_SUB_LEN_CAPABLE_SYN_ALIGN >> 2;
-+ } else if (OPTION_TYPE_ACK & opts->mptcp_options) {
-+ mpc->sender_key = opts->mp_capable.sender_key;
-+ mpc->receiver_key = opts->mp_capable.receiver_key;
-+ mpc->len = MPTCP_SUB_LEN_CAPABLE_ACK;
-+ ptr += MPTCP_SUB_LEN_CAPABLE_ACK_ALIGN >> 2;
-+ }
-+
-+ mpc->sub = MPTCP_SUB_CAPABLE;
-+ mpc->ver = 0;
-+ mpc->a = opts->dss_csum;
-+ mpc->b = 0;
-+ mpc->rsv = 0;
-+ mpc->h = 1;
-+ }
-+
-+ if (unlikely(OPTION_MP_JOIN & opts->mptcp_options)) {
-+ struct mp_join *mpj = (struct mp_join *)ptr;
-+
-+ mpj->kind = TCPOPT_MPTCP;
-+ mpj->sub = MPTCP_SUB_JOIN;
-+ mpj->rsv = 0;
-+
-+ if (OPTION_TYPE_SYN & opts->mptcp_options) {
-+ mpj->len = MPTCP_SUB_LEN_JOIN_SYN;
-+ mpj->u.syn.token = opts->mp_join_syns.token;
-+ mpj->u.syn.nonce = opts->mp_join_syns.sender_nonce;
-+ mpj->b = opts->mp_join_syns.low_prio;
-+ mpj->addr_id = opts->addr_id;
-+ ptr += MPTCP_SUB_LEN_JOIN_SYN_ALIGN >> 2;
-+ } else if (OPTION_TYPE_SYNACK & opts->mptcp_options) {
-+ mpj->len = MPTCP_SUB_LEN_JOIN_SYNACK;
-+ mpj->u.synack.mac =
-+ opts->mp_join_syns.sender_truncated_mac;
-+ mpj->u.synack.nonce = opts->mp_join_syns.sender_nonce;
-+ mpj->b = opts->mp_join_syns.low_prio;
-+ mpj->addr_id = opts->addr_id;
-+ ptr += MPTCP_SUB_LEN_JOIN_SYNACK_ALIGN >> 2;
-+ } else if (OPTION_TYPE_ACK & opts->mptcp_options) {
-+ mpj->len = MPTCP_SUB_LEN_JOIN_ACK;
-+ mpj->addr_id = 0; /* addr_id is rsv (RFC 6824, p. 21) */
-+ memcpy(mpj->u.ack.mac, &tp->mptcp->sender_mac[0], 20);
-+ ptr += MPTCP_SUB_LEN_JOIN_ACK_ALIGN >> 2;
-+ }
-+ }
-+ if (unlikely(OPTION_ADD_ADDR & opts->mptcp_options)) {
-+ struct mp_add_addr *mpadd = (struct mp_add_addr *)ptr;
-+
-+ mpadd->kind = TCPOPT_MPTCP;
-+ if (opts->add_addr_v4) {
-+ mpadd->len = MPTCP_SUB_LEN_ADD_ADDR4;
-+ mpadd->sub = MPTCP_SUB_ADD_ADDR;
-+ mpadd->ipver = 4;
-+ mpadd->addr_id = opts->add_addr4.addr_id;
-+ mpadd->u.v4.addr = opts->add_addr4.addr;
-+ ptr += MPTCP_SUB_LEN_ADD_ADDR4_ALIGN >> 2;
-+ } else if (opts->add_addr_v6) {
-+ mpadd->len = MPTCP_SUB_LEN_ADD_ADDR6;
-+ mpadd->sub = MPTCP_SUB_ADD_ADDR;
-+ mpadd->ipver = 6;
-+ mpadd->addr_id = opts->add_addr6.addr_id;
-+ memcpy(&mpadd->u.v6.addr, &opts->add_addr6.addr,
-+ sizeof(mpadd->u.v6.addr));
-+ ptr += MPTCP_SUB_LEN_ADD_ADDR6_ALIGN >> 2;
-+ }
-+ }
-+ if (unlikely(OPTION_REMOVE_ADDR & opts->mptcp_options)) {
-+ struct mp_remove_addr *mprem = (struct mp_remove_addr *)ptr;
-+ u8 *addrs_id;
-+ int id, len, len_align;
-+
-+ len = mptcp_sub_len_remove_addr(opts->remove_addrs);
-+ len_align = mptcp_sub_len_remove_addr_align(opts->remove_addrs);
-+
-+ mprem->kind = TCPOPT_MPTCP;
-+ mprem->len = len;
-+ mprem->sub = MPTCP_SUB_REMOVE_ADDR;
-+ mprem->rsv = 0;
-+ addrs_id = &mprem->addrs_id;
-+
-+ mptcp_for_each_bit_set(opts->remove_addrs, id)
-+ *(addrs_id++) = id;
-+
-+ /* Fill the rest with NOP's */
-+ if (len_align > len) {
-+ int i;
-+ for (i = 0; i < len_align - len; i++)
-+ *(addrs_id++) = TCPOPT_NOP;
-+ }
-+
-+ ptr += len_align >> 2;
-+ }
-+ if (unlikely(OPTION_MP_FAIL & opts->mptcp_options)) {
-+ struct mp_fail *mpfail = (struct mp_fail *)ptr;
-+
-+ mpfail->kind = TCPOPT_MPTCP;
-+ mpfail->len = MPTCP_SUB_LEN_FAIL;
-+ mpfail->sub = MPTCP_SUB_FAIL;
-+ mpfail->rsv1 = 0;
-+ mpfail->rsv2 = 0;
-+ mpfail->data_seq = htonll(tp->mpcb->csum_cutoff_seq);
-+
-+ ptr += MPTCP_SUB_LEN_FAIL_ALIGN >> 2;
-+ }
-+ if (unlikely(OPTION_MP_FCLOSE & opts->mptcp_options)) {
-+ struct mp_fclose *mpfclose = (struct mp_fclose *)ptr;
-+
-+ mpfclose->kind = TCPOPT_MPTCP;
-+ mpfclose->len = MPTCP_SUB_LEN_FCLOSE;
-+ mpfclose->sub = MPTCP_SUB_FCLOSE;
-+ mpfclose->rsv1 = 0;
-+ mpfclose->rsv2 = 0;
-+ mpfclose->key = opts->mp_capable.receiver_key;
-+
-+ ptr += MPTCP_SUB_LEN_FCLOSE_ALIGN >> 2;
-+ }
-+
-+ if (OPTION_DATA_ACK & opts->mptcp_options) {
-+ if (!mptcp_is_data_seq(skb))
-+ ptr += mptcp_write_dss_data_ack(tp, skb, ptr);
-+ else
-+ ptr += mptcp_write_dss_data_seq(tp, skb, ptr);
-+ }
-+ if (unlikely(OPTION_MP_PRIO & opts->mptcp_options)) {
-+ struct mp_prio *mpprio = (struct mp_prio *)ptr;
-+
-+ mpprio->kind = TCPOPT_MPTCP;
-+ mpprio->len = MPTCP_SUB_LEN_PRIO;
-+ mpprio->sub = MPTCP_SUB_PRIO;
-+ mpprio->rsv = 0;
-+ mpprio->b = tp->mptcp->low_prio;
-+ mpprio->addr_id = TCPOPT_NOP;
-+
-+ ptr += MPTCP_SUB_LEN_PRIO_ALIGN >> 2;
-+ }
-+}
-+
-+/* Sends the DATA_FIN */
-+void mptcp_send_fin(struct sock *meta_sk)
-+{
-+ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
-+ struct sk_buff *skb = tcp_write_queue_tail(meta_sk);
-+ int mss_now;
-+
-+ if ((1 << meta_sk->sk_state) & (TCPF_CLOSE_WAIT | TCPF_LAST_ACK))
-+ meta_tp->mpcb->passive_close = 1;
-+
-+ /* Optimization, tack on the FIN if we have a queue of
-+ * unsent frames. But be careful about outgoing SACKS
-+ * and IP options.
-+ */
-+ mss_now = mptcp_current_mss(meta_sk);
-+
-+ if (tcp_send_head(meta_sk) != NULL) {
-+ TCP_SKB_CB(skb)->mptcp_flags |= MPTCPHDR_FIN;
-+ TCP_SKB_CB(skb)->end_seq++;
-+ meta_tp->write_seq++;
-+ } else {
-+ /* Socket is locked, keep trying until memory is available. */
-+ for (;;) {
-+ skb = alloc_skb_fclone(MAX_TCP_HEADER,
-+ meta_sk->sk_allocation);
-+ if (skb)
-+ break;
-+ yield();
-+ }
-+ /* Reserve space for headers and prepare control bits. */
-+ skb_reserve(skb, MAX_TCP_HEADER);
-+
-+ tcp_init_nondata_skb(skb, meta_tp->write_seq, TCPHDR_ACK);
-+ TCP_SKB_CB(skb)->end_seq++;
-+ TCP_SKB_CB(skb)->mptcp_flags |= MPTCPHDR_FIN;
-+ tcp_queue_skb(meta_sk, skb);
-+ }
-+ __tcp_push_pending_frames(meta_sk, mss_now, TCP_NAGLE_OFF);
-+}
-+
-+void mptcp_send_active_reset(struct sock *meta_sk, gfp_t priority)
-+{
-+ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
-+ struct mptcp_cb *mpcb = meta_tp->mpcb;
-+ struct sock *sk = NULL, *sk_it = NULL, *tmpsk;
-+
-+ if (!mpcb->cnt_subflows)
-+ return;
-+
-+ WARN_ON(meta_tp->send_mp_fclose);
-+
-+ /* First - select a socket */
-+ sk = mptcp_select_ack_sock(meta_sk);
-+
-+ /* May happen if no subflow is in an appropriate state */
-+ if (!sk)
-+ return;
-+
-+ /* We are in infinite mode - just send a reset */
-+ if (mpcb->infinite_mapping_snd || mpcb->infinite_mapping_rcv) {
-+ sk->sk_err = ECONNRESET;
-+ if (tcp_need_reset(sk->sk_state))
-+ tcp_send_active_reset(sk, priority);
-+ mptcp_sub_force_close(sk);
-+ return;
-+ }
-+
-+
-+ tcp_sk(sk)->send_mp_fclose = 1;
-+ /* Reset all other subflows */
-+
-+ /* tcp_done must be handled with bh disabled */
-+ if (!in_serving_softirq())
-+ local_bh_disable();
-+
-+ mptcp_for_each_sk_safe(mpcb, sk_it, tmpsk) {
-+ if (tcp_sk(sk_it)->send_mp_fclose)
-+ continue;
-+
-+ sk_it->sk_err = ECONNRESET;
-+ if (tcp_need_reset(sk_it->sk_state))
-+ tcp_send_active_reset(sk_it, GFP_ATOMIC);
-+ mptcp_sub_force_close(sk_it);
-+ }
-+
-+ if (!in_serving_softirq())
-+ local_bh_enable();
-+
-+ tcp_send_ack(sk);
-+ inet_csk_reset_keepalive_timer(sk, inet_csk(sk)->icsk_rto);
-+
-+ meta_tp->send_mp_fclose = 1;
-+}
-+
-+static void mptcp_ack_retransmit_timer(struct sock *sk)
-+{
-+ struct sk_buff *skb;
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ struct inet_connection_sock *icsk = inet_csk(sk);
-+
-+ if (inet_csk(sk)->icsk_af_ops->rebuild_header(sk))
-+ goto out; /* Routing failure or similar */
-+
-+ if (!tp->retrans_stamp)
-+ tp->retrans_stamp = tcp_time_stamp ? : 1;
-+
-+ if (tcp_write_timeout(sk)) {
-+ tp->mptcp->pre_established = 0;
-+ sk_stop_timer(sk, &tp->mptcp->mptcp_ack_timer);
-+ tp->ops->send_active_reset(sk, GFP_ATOMIC);
-+ goto out;
-+ }
-+
-+ skb = alloc_skb(MAX_TCP_HEADER, GFP_ATOMIC);
-+ if (skb == NULL) {
-+ sk_reset_timer(sk, &tp->mptcp->mptcp_ack_timer,
-+ jiffies + icsk->icsk_rto);
-+ return;
-+ }
-+
-+ /* Reserve space for headers and prepare control bits */
-+ skb_reserve(skb, MAX_TCP_HEADER);
-+ tcp_init_nondata_skb(skb, tp->snd_una, TCPHDR_ACK);
-+
-+ TCP_SKB_CB(skb)->when = tcp_time_stamp;
-+ if (tcp_transmit_skb(sk, skb, 0, GFP_ATOMIC) > 0) {
-+ /* Retransmission failed because of local congestion,
-+ * do not backoff.
-+ */
-+ if (!icsk->icsk_retransmits)
-+ icsk->icsk_retransmits = 1;
-+ sk_reset_timer(sk, &tp->mptcp->mptcp_ack_timer,
-+ jiffies + icsk->icsk_rto);
-+ return;
-+ }
-+
-+
-+ icsk->icsk_retransmits++;
-+ icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX);
-+ sk_reset_timer(sk, &tp->mptcp->mptcp_ack_timer,
-+ jiffies + icsk->icsk_rto);
-+ if (retransmits_timed_out(sk, sysctl_tcp_retries1 + 1, 0, 0))
-+ __sk_dst_reset(sk);
-+
-+out:;
-+}
-+
-+void mptcp_ack_handler(unsigned long data)
-+{
-+ struct sock *sk = (struct sock *)data;
-+ struct sock *meta_sk = mptcp_meta_sk(sk);
-+
-+ bh_lock_sock(meta_sk);
-+ if (sock_owned_by_user(meta_sk)) {
-+ /* Try again later */
-+ sk_reset_timer(sk, &tcp_sk(sk)->mptcp->mptcp_ack_timer,
-+ jiffies + (HZ / 20));
-+ goto out_unlock;
-+ }
-+
-+ if (sk->sk_state == TCP_CLOSE)
-+ goto out_unlock;
-+ if (!tcp_sk(sk)->mptcp->pre_established)
-+ goto out_unlock;
-+
-+ mptcp_ack_retransmit_timer(sk);
-+
-+ sk_mem_reclaim(sk);
-+
-+out_unlock:
-+ bh_unlock_sock(meta_sk);
-+ sock_put(sk);
-+}
-+
-+/* Similar to tcp_retransmit_skb
-+ *
-+ * The diff is that we handle the retransmission-stats (retrans_stamp) at the
-+ * meta-level.
-+ */
-+int mptcp_retransmit_skb(struct sock *meta_sk, struct sk_buff *skb)
-+{
-+ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
-+ struct sock *subsk;
-+ unsigned int limit, mss_now;
-+ int err = -1;
-+
-+ /* Do not send more than we have queued. 1/4 is reserved for possible
-+ * copying overhead: fragmentation, tunneling, mangling etc.
-+ *
-+ * This is a meta-retransmission thus we check on the meta-socket.
-+ */
-+ if (atomic_read(&meta_sk->sk_wmem_alloc) >
-+ min(meta_sk->sk_wmem_queued + (meta_sk->sk_wmem_queued >> 2), meta_sk->sk_sndbuf)) {
-+ return -EAGAIN;
-+ }
-+
-+ /* We need to make sure that the retransmitted segment can be sent on a
-+ * subflow right now. If it is too big, it needs to be fragmented.
-+ */
-+ subsk = meta_tp->mpcb->sched_ops->get_subflow(meta_sk, skb, false);
-+ if (!subsk) {
-+ /* We want to increase icsk_retransmits, thus return 0, so that
-+ * mptcp_retransmit_timer enters the desired branch.
-+ */
-+ err = 0;
-+ goto failed;
-+ }
-+ mss_now = tcp_current_mss(subsk);
-+
-+ /* If the segment was cloned (e.g. a meta retransmission), the header
-+ * must be expanded/copied so that there is no corruption of TSO
-+ * information.
-+ */
-+ if (skb_unclone(skb, GFP_ATOMIC)) {
-+ err = -ENOMEM;
-+ goto failed;
-+ }
-+
-+ /* Must have been set by mptcp_write_xmit before */
-+ BUG_ON(!tcp_skb_pcount(skb));
-+
-+ limit = mss_now;
-+ /* skb->len > mss_now is the equivalent of tso_segs > 1 in
-+ * tcp_write_xmit. Otherwise split-point would return 0.
-+ */
-+ if (skb->len > mss_now && !tcp_urg_mode(meta_tp))
-+ limit = tcp_mss_split_point(meta_sk, skb, mss_now,
-+ UINT_MAX / mss_now,
-+ TCP_NAGLE_OFF);
-+
-+ if (skb->len > limit &&
-+ unlikely(mptcp_fragment(meta_sk, skb, limit,
-+ GFP_ATOMIC, 0)))
-+ goto failed;
-+
-+ if (!mptcp_skb_entail(subsk, skb, -1))
-+ goto failed;
-+ TCP_SKB_CB(skb)->when = tcp_time_stamp;
-+
-+ /* Update global TCP statistics. */
-+ TCP_INC_STATS(sock_net(meta_sk), TCP_MIB_RETRANSSEGS);
-+
-+ /* Diff to tcp_retransmit_skb */
-+
-+ /* Save stamp of the first retransmit. */
-+ if (!meta_tp->retrans_stamp)
-+ meta_tp->retrans_stamp = TCP_SKB_CB(skb)->when;
-+
-+ __tcp_push_pending_frames(subsk, mss_now, TCP_NAGLE_PUSH);
-+
-+ return 0;
-+
-+failed:
-+ NET_INC_STATS_BH(sock_net(meta_sk), LINUX_MIB_TCPRETRANSFAIL);
-+ return err;
-+}
-+
-+/* Similar to tcp_retransmit_timer
-+ *
-+ * The diff is that we have to handle retransmissions of the FAST_CLOSE-message
-+ * and that we don't have an srtt estimation at the meta-level.
-+ */
-+void mptcp_retransmit_timer(struct sock *meta_sk)
-+{
-+ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
-+ struct mptcp_cb *mpcb = meta_tp->mpcb;
-+ struct inet_connection_sock *meta_icsk = inet_csk(meta_sk);
-+ int err;
-+
-+ /* In fallback, retransmission is handled at the subflow-level */
-+ if (!meta_tp->packets_out || mpcb->infinite_mapping_snd ||
-+ mpcb->send_infinite_mapping)
-+ return;
-+
-+ WARN_ON(tcp_write_queue_empty(meta_sk));
-+
-+ if (!meta_tp->snd_wnd && !sock_flag(meta_sk, SOCK_DEAD) &&
-+ !((1 << meta_sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV))) {
-+ /* Receiver dastardly shrinks window. Our retransmits
-+ * become zero probes, but we should not timeout this
-+ * connection. If the socket is an orphan, time it out,
-+ * we cannot allow such beasts to hang infinitely.
-+ */
-+ struct inet_sock *meta_inet = inet_sk(meta_sk);
-+ if (meta_sk->sk_family == AF_INET) {
-+ LIMIT_NETDEBUG(KERN_DEBUG "MPTCP: Peer %pI4:%u/%u unexpectedly shrunk window %u:%u (repaired)\n",
-+ &meta_inet->inet_daddr,
-+ ntohs(meta_inet->inet_dport),
-+ meta_inet->inet_num, meta_tp->snd_una,
-+ meta_tp->snd_nxt);
-+ }
-+#if IS_ENABLED(CONFIG_IPV6)
-+ else if (meta_sk->sk_family == AF_INET6) {
-+ LIMIT_NETDEBUG(KERN_DEBUG "MPTCP: Peer %pI6:%u/%u unexpectedly shrunk window %u:%u (repaired)\n",
-+ &meta_sk->sk_v6_daddr,
-+ ntohs(meta_inet->inet_dport),
-+ meta_inet->inet_num, meta_tp->snd_una,
-+ meta_tp->snd_nxt);
-+ }
-+#endif
-+ if (tcp_time_stamp - meta_tp->rcv_tstamp > TCP_RTO_MAX) {
-+ tcp_write_err(meta_sk);
-+ return;
-+ }
-+
-+ mptcp_retransmit_skb(meta_sk, tcp_write_queue_head(meta_sk));
-+ goto out_reset_timer;
-+ }
-+
-+ if (tcp_write_timeout(meta_sk))
-+ return;
-+
-+ if (meta_icsk->icsk_retransmits == 0)
-+ NET_INC_STATS_BH(sock_net(meta_sk), LINUX_MIB_TCPTIMEOUTS);
-+
-+ meta_icsk->icsk_ca_state = TCP_CA_Loss;
-+
-+ err = mptcp_retransmit_skb(meta_sk, tcp_write_queue_head(meta_sk));
-+ if (err > 0) {
-+ /* Retransmission failed because of local congestion,
-+ * do not backoff.
-+ */
-+ if (!meta_icsk->icsk_retransmits)
-+ meta_icsk->icsk_retransmits = 1;
-+ inet_csk_reset_xmit_timer(meta_sk, ICSK_TIME_RETRANS,
-+ min(meta_icsk->icsk_rto, TCP_RESOURCE_PROBE_INTERVAL),
-+ TCP_RTO_MAX);
-+ return;
-+ }
-+
-+ /* Increase the timeout each time we retransmit. Note that
-+ * we do not increase the rtt estimate. rto is initialized
-+ * from rtt, but increases here. Jacobson (SIGCOMM 88) suggests
-+ * that doubling rto each time is the least we can get away with.
-+ * In KA9Q, Karn uses this for the first few times, and then
-+ * goes to quadratic. netBSD doubles, but only goes up to *64,
-+ * and clamps at 1 to 64 sec afterwards. Note that 120 sec is
-+ * defined in the protocol as the maximum possible RTT. I guess
-+ * we'll have to use something other than TCP to talk to the
-+ * University of Mars.
-+ *
-+ * PAWS allows us longer timeouts and large windows, so once
-+ * implemented ftp to mars will work nicely. We will have to fix
-+ * the 120 second clamps though!
-+ */
-+ meta_icsk->icsk_backoff++;
-+ meta_icsk->icsk_retransmits++;
-+
-+out_reset_timer:
-+ /* If stream is thin, use linear timeouts. Since 'icsk_backoff' is
-+ * used to reset timer, set to 0. Recalculate 'icsk_rto' as this
-+ * might be increased if the stream oscillates between thin and thick,
-+ * thus the old value might already be too high compared to the value
-+ * set by 'tcp_set_rto' in tcp_input.c which resets the rto without
-+ * backoff. Limit to TCP_THIN_LINEAR_RETRIES before initiating
-+ * exponential backoff behaviour to avoid continuing to hammer
-+ * linear-timeout retransmissions into a black hole
-+ */
-+ if (meta_sk->sk_state == TCP_ESTABLISHED &&
-+ (meta_tp->thin_lto || sysctl_tcp_thin_linear_timeouts) &&
-+ tcp_stream_is_thin(meta_tp) &&
-+ meta_icsk->icsk_retransmits <= TCP_THIN_LINEAR_RETRIES) {
-+ meta_icsk->icsk_backoff = 0;
-+ /* We cannot do the same as in tcp_write_timer because the
-+ * srtt is not set here.
-+ */
-+ mptcp_set_rto(meta_sk);
-+ } else {
-+ /* Use normal (exponential) backoff */
-+ meta_icsk->icsk_rto = min(meta_icsk->icsk_rto << 1, TCP_RTO_MAX);
-+ }
-+ inet_csk_reset_xmit_timer(meta_sk, ICSK_TIME_RETRANS, meta_icsk->icsk_rto, TCP_RTO_MAX);
-+
-+ return;
-+}
-+
-+/* Modify values to an mptcp-level for the initial window of new subflows */
-+void mptcp_select_initial_window(int __space, __u32 mss, __u32 *rcv_wnd,
-+ __u32 *window_clamp, int wscale_ok,
-+ __u8 *rcv_wscale, __u32 init_rcv_wnd,
-+ const struct sock *sk)
-+{
-+ struct mptcp_cb *mpcb = tcp_sk(sk)->mpcb;
-+
-+ *window_clamp = mpcb->orig_window_clamp;
-+ __space = tcp_win_from_space(mpcb->orig_sk_rcvbuf);
-+
-+ tcp_select_initial_window(__space, mss, rcv_wnd, window_clamp,
-+ wscale_ok, rcv_wscale, init_rcv_wnd, sk);
-+}
-+
-+static inline u64 mptcp_calc_rate(const struct sock *meta_sk, unsigned int mss,
-+ unsigned int (*mss_cb)(struct sock *sk))
-+{
-+ struct sock *sk;
-+ u64 rate = 0;
-+
-+ mptcp_for_each_sk(tcp_sk(meta_sk)->mpcb, sk) {
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ int this_mss;
-+ u64 this_rate;
-+
-+ if (!mptcp_sk_can_send(sk))
-+ continue;
-+
-+ /* Do not consider subflows without an RTT estimation yet
-+ * otherwise this_rate >>> rate.
-+ */
-+ if (unlikely(!tp->srtt_us))
-+ continue;
-+
-+ this_mss = mss_cb(sk);
-+
-+ /* If this_mss is smaller than mss, it means that a segment will
-+ * be split in two (or more) when pushed on this subflow. If
-+ * you consider that mss = 1428 and this_mss = 1420 then two
-+ * segments will be generated: a 1420-byte and 8-byte segment.
-+ * The latter will introduce a large overhead as for a single
-+ * data segment 2 slots will be used in the congestion window.
-+ * Therefore reducing by ~2 the potential throughput of this
-+ * subflow. Indeed, 1428 will be sent while 2840 could have been
-+ * sent if mss == 1420 reducing the throughput by 2840 / 1428.
-+ *
-+ * The following algorithm takes into account this overhead
-+ * when computing the potential throughput that MPTCP can
-+ * achieve when generating mss-byte segments.
-+ *
-+ * The formula is the following:
-+ * \sum_{\forall sub} ratio * \frac{mss * cwnd_sub}{rtt_sub}
-+ * Where ratio is computed as follows:
-+ * \frac{mss}{\ceil{mss / mss_sub} * mss_sub}
-+ *
-+ * ratio gives the reduction factor of the theoretical
-+ * throughput a subflow can achieve if MPTCP uses a specific
-+ * MSS value.
-+ */
-+ this_rate = div64_u64((u64)mss * mss * (USEC_PER_SEC << 3) *
-+ max(tp->snd_cwnd, tp->packets_out),
-+ (u64)tp->srtt_us *
-+ DIV_ROUND_UP(mss, this_mss) * this_mss);
-+ rate += this_rate;
-+ }
-+
-+ return rate;
-+}
-+
-+static unsigned int __mptcp_current_mss(const struct sock *meta_sk,
-+ unsigned int (*mss_cb)(struct sock *sk))
-+{
-+ unsigned int mss = 0;
-+ u64 rate = 0;
-+ struct sock *sk;
-+
-+ mptcp_for_each_sk(tcp_sk(meta_sk)->mpcb, sk) {
-+ int this_mss;
-+ u64 this_rate;
-+
-+ if (!mptcp_sk_can_send(sk))
-+ continue;
-+
-+ this_mss = mss_cb(sk);
-+
-+ /* Same mss values will produce the same throughput. */
-+ if (this_mss == mss)
-+ continue;
-+
-+ /* See whether using this mss value can theoretically improve
-+ * the performance.
-+ */
-+ this_rate = mptcp_calc_rate(meta_sk, this_mss, mss_cb);
-+ if (this_rate >= rate) {
-+ mss = this_mss;
-+ rate = this_rate;
-+ }
-+ }
-+
-+ return mss;
-+}
-+
-+unsigned int mptcp_current_mss(struct sock *meta_sk)
-+{
-+ unsigned int mss = __mptcp_current_mss(meta_sk, tcp_current_mss);
-+
-+ /* If no subflow is available, we take a default-mss from the
-+ * meta-socket.
-+ */
-+ return !mss ? tcp_current_mss(meta_sk) : mss;
-+}
-+
-+static unsigned int mptcp_select_size_mss(struct sock *sk)
-+{
-+ return tcp_sk(sk)->mss_cache;
-+}
-+
-+int mptcp_select_size(const struct sock *meta_sk, bool sg)
-+{
-+ unsigned int mss = __mptcp_current_mss(meta_sk, mptcp_select_size_mss);
-+
-+ if (sg) {
-+ if (mptcp_sk_can_gso(meta_sk)) {
-+ mss = SKB_WITH_OVERHEAD(2048 - MAX_TCP_HEADER);
-+ } else {
-+ int pgbreak = SKB_MAX_HEAD(MAX_TCP_HEADER);
-+
-+ if (mss >= pgbreak &&
-+ mss <= pgbreak + (MAX_SKB_FRAGS - 1) * PAGE_SIZE)
-+ mss = pgbreak;
-+ }
-+ }
-+
-+ return !mss ? tcp_sk(meta_sk)->mss_cache : mss;
-+}
-+
-+int mptcp_check_snd_buf(const struct tcp_sock *tp)
-+{
-+ const struct sock *sk;
-+ u32 rtt_max = tp->srtt_us;
-+ u64 bw_est;
-+
-+ if (!tp->srtt_us)
-+ return tp->reordering + 1;
-+
-+ mptcp_for_each_sk(tp->mpcb, sk) {
-+ if (!mptcp_sk_can_send(sk))
-+ continue;
-+
-+ if (rtt_max < tcp_sk(sk)->srtt_us)
-+ rtt_max = tcp_sk(sk)->srtt_us;
-+ }
-+
-+ bw_est = div64_u64(((u64)tp->snd_cwnd * rtt_max) << 16,
-+ (u64)tp->srtt_us);
-+
-+ return max_t(unsigned int, (u32)(bw_est >> 16),
-+ tp->reordering + 1);
-+}
-+
-+unsigned int mptcp_xmit_size_goal(const struct sock *meta_sk, u32 mss_now,
-+ int large_allowed)
-+{
-+ struct sock *sk;
-+ u32 xmit_size_goal = 0;
-+
-+ if (large_allowed && mptcp_sk_can_gso(meta_sk)) {
-+ mptcp_for_each_sk(tcp_sk(meta_sk)->mpcb, sk) {
-+ int this_size_goal;
-+
-+ if (!mptcp_sk_can_send(sk))
-+ continue;
-+
-+ this_size_goal = tcp_xmit_size_goal(sk, mss_now, 1);
-+ if (this_size_goal > xmit_size_goal)
-+ xmit_size_goal = this_size_goal;
-+ }
-+ }
-+
-+ return max(xmit_size_goal, mss_now);
-+}
-+
-+/* Similar to tcp_trim_head - but we correctly copy the DSS-option */
-+int mptcp_trim_head(struct sock *sk, struct sk_buff *skb, u32 len)
-+{
-+ if (skb_cloned(skb)) {
-+ if (pskb_expand_head(skb, 0, 0, GFP_ATOMIC))
-+ return -ENOMEM;
-+ }
-+
-+ __pskb_trim_head(skb, len);
-+
-+ TCP_SKB_CB(skb)->seq += len;
-+ skb->ip_summed = CHECKSUM_PARTIAL;
-+
-+ skb->truesize -= len;
-+ sk->sk_wmem_queued -= len;
-+ sk_mem_uncharge(sk, len);
-+ sock_set_flag(sk, SOCK_QUEUE_SHRUNK);
-+
-+ /* Any change of skb->len requires recalculation of tso factor. */
-+ if (tcp_skb_pcount(skb) > 1)
-+ tcp_set_skb_tso_segs(sk, skb, tcp_skb_mss(skb));
-+
-+ return 0;
-+}
-diff --git a/net/mptcp/mptcp_pm.c b/net/mptcp/mptcp_pm.c
-new file mode 100644
-index 000000000000..9542f950729f
---- /dev/null
-+++ b/net/mptcp/mptcp_pm.c
-@@ -0,0 +1,169 @@
-+/*
-+ * MPTCP implementation - MPTCP-subflow-management
-+ *
-+ * Initial Design & Implementation:
-+ * Sébastien Barré <sebastien.barre@uclouvain.be>
-+ *
-+ * Current Maintainer & Author:
-+ * Christoph Paasch <christoph.paasch@uclouvain.be>
-+ *
-+ * Additional authors:
-+ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
-+ * Gregory Detal <gregory.detal@uclouvain.be>
-+ * Fabien Duchêne <fabien.duchene@uclouvain.be>
-+ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
-+ * Lavkesh Lahngir <lavkesh51@gmail.com>
-+ * Andreas Ripke <ripke@neclab.eu>
-+ * Vlad Dogaru <vlad.dogaru@intel.com>
-+ * Octavian Purdila <octavian.purdila@intel.com>
-+ * John Ronan <jronan@tssg.org>
-+ * Catalin Nicutar <catalin.nicutar@gmail.com>
-+ * Brandon Heller <brandonh@stanford.edu>
-+ *
-+ *
-+ * This program is free software; you can redistribute it and/or
-+ * modify it under the terms of the GNU General Public License
-+ * as published by the Free Software Foundation; either version
-+ * 2 of the License, or (at your option) any later version.
-+ */
-+
-+
-+#include <linux/module.h>
-+#include <net/mptcp.h>
-+
-+static DEFINE_SPINLOCK(mptcp_pm_list_lock);
-+static LIST_HEAD(mptcp_pm_list);
-+
-+static int mptcp_default_id(sa_family_t family, union inet_addr *addr,
-+ struct net *net, bool *low_prio)
-+{
-+ return 0;
-+}
-+
-+struct mptcp_pm_ops mptcp_pm_default = {
-+ .get_local_id = mptcp_default_id, /* We do not care */
-+ .name = "default",
-+ .owner = THIS_MODULE,
-+};
-+
-+static struct mptcp_pm_ops *mptcp_pm_find(const char *name)
-+{
-+ struct mptcp_pm_ops *e;
-+
-+ list_for_each_entry_rcu(e, &mptcp_pm_list, list) {
-+ if (strcmp(e->name, name) == 0)
-+ return e;
-+ }
-+
-+ return NULL;
-+}
-+
-+int mptcp_register_path_manager(struct mptcp_pm_ops *pm)
-+{
-+ int ret = 0;
-+
-+ if (!pm->get_local_id)
-+ return -EINVAL;
-+
-+ spin_lock(&mptcp_pm_list_lock);
-+ if (mptcp_pm_find(pm->name)) {
-+ pr_notice("%s already registered\n", pm->name);
-+ ret = -EEXIST;
-+ } else {
-+ list_add_tail_rcu(&pm->list, &mptcp_pm_list);
-+ pr_info("%s registered\n", pm->name);
-+ }
-+ spin_unlock(&mptcp_pm_list_lock);
-+
-+ return ret;
-+}
-+EXPORT_SYMBOL_GPL(mptcp_register_path_manager);
-+
-+void mptcp_unregister_path_manager(struct mptcp_pm_ops *pm)
-+{
-+ spin_lock(&mptcp_pm_list_lock);
-+ list_del_rcu(&pm->list);
-+ spin_unlock(&mptcp_pm_list_lock);
-+}
-+EXPORT_SYMBOL_GPL(mptcp_unregister_path_manager);
-+
-+void mptcp_get_default_path_manager(char *name)
-+{
-+ struct mptcp_pm_ops *pm;
-+
-+ BUG_ON(list_empty(&mptcp_pm_list));
-+
-+ rcu_read_lock();
-+ pm = list_entry(mptcp_pm_list.next, struct mptcp_pm_ops, list);
-+ strncpy(name, pm->name, MPTCP_PM_NAME_MAX);
-+ rcu_read_unlock();
-+}
-+
-+int mptcp_set_default_path_manager(const char *name)
-+{
-+ struct mptcp_pm_ops *pm;
-+ int ret = -ENOENT;
-+
-+ spin_lock(&mptcp_pm_list_lock);
-+ pm = mptcp_pm_find(name);
-+#ifdef CONFIG_MODULES
-+ if (!pm && capable(CAP_NET_ADMIN)) {
-+ spin_unlock(&mptcp_pm_list_lock);
-+
-+ request_module("mptcp_%s", name);
-+ spin_lock(&mptcp_pm_list_lock);
-+ pm = mptcp_pm_find(name);
-+ }
-+#endif
-+
-+ if (pm) {
-+ list_move(&pm->list, &mptcp_pm_list);
-+ ret = 0;
-+ } else {
-+ pr_info("%s is not available\n", name);
-+ }
-+ spin_unlock(&mptcp_pm_list_lock);
-+
-+ return ret;
-+}
-+
-+void mptcp_init_path_manager(struct mptcp_cb *mpcb)
-+{
-+ struct mptcp_pm_ops *pm;
-+
-+ rcu_read_lock();
-+ list_for_each_entry_rcu(pm, &mptcp_pm_list, list) {
-+ if (try_module_get(pm->owner)) {
-+ mpcb->pm_ops = pm;
-+ break;
-+ }
-+ }
-+ rcu_read_unlock();
-+}
-+
-+/* Manage refcounts on socket close. */
-+void mptcp_cleanup_path_manager(struct mptcp_cb *mpcb)
-+{
-+ module_put(mpcb->pm_ops->owner);
-+}
-+
-+/* Fallback to the default path-manager. */
-+void mptcp_fallback_default(struct mptcp_cb *mpcb)
-+{
-+ struct mptcp_pm_ops *pm;
-+
-+ mptcp_cleanup_path_manager(mpcb);
-+ pm = mptcp_pm_find("default");
-+
-+ /* Cannot fail - it's the default module */
-+ try_module_get(pm->owner);
-+ mpcb->pm_ops = pm;
-+}
-+EXPORT_SYMBOL_GPL(mptcp_fallback_default);
-+
-+/* Set default value from kernel configuration at bootup */
-+static int __init mptcp_path_manager_default(void)
-+{
-+ return mptcp_set_default_path_manager(CONFIG_DEFAULT_MPTCP_PM);
-+}
-+late_initcall(mptcp_path_manager_default);
-diff --git a/net/mptcp/mptcp_rr.c b/net/mptcp/mptcp_rr.c
-new file mode 100644
-index 000000000000..93278f684069
---- /dev/null
-+++ b/net/mptcp/mptcp_rr.c
-@@ -0,0 +1,301 @@
-+/* MPTCP Scheduler module selector. Highly inspired by tcp_cong.c */
-+
-+#include <linux/module.h>
-+#include <net/mptcp.h>
-+
-+static unsigned char num_segments __read_mostly = 1;
-+module_param(num_segments, byte, 0644);
-+MODULE_PARM_DESC(num_segments, "The number of consecutive segments that are part of a burst");
-+
-+static bool cwnd_limited __read_mostly = 1;
-+module_param(cwnd_limited, bool, 0644);
-+MODULE_PARM_DESC(cwnd_limited, "if set to 1, the scheduler tries to fill the congestion-window on all subflows");
-+
-+struct rrsched_priv {
-+ unsigned char quota;
-+};
-+
-+static struct rrsched_priv *rrsched_get_priv(const struct tcp_sock *tp)
-+{
-+ return (struct rrsched_priv *)&tp->mptcp->mptcp_sched[0];
-+}
-+
-+/* Is the sub-socket sk available to send the skb? */
-+static bool mptcp_rr_is_available(const struct sock *sk, const struct sk_buff *skb,
-+ bool zero_wnd_test, bool cwnd_test)
-+{
-+ const struct tcp_sock *tp = tcp_sk(sk);
-+ unsigned int space, in_flight;
-+
-+ /* Set of states for which we are allowed to send data */
-+ if (!mptcp_sk_can_send(sk))
-+ return false;
-+
-+ /* We do not send data on this subflow unless it is
-+ * fully established, i.e. the 4th ack has been received.
-+ */
-+ if (tp->mptcp->pre_established)
-+ return false;
-+
-+ if (tp->pf)
-+ return false;
-+
-+ if (inet_csk(sk)->icsk_ca_state == TCP_CA_Loss) {
-+ /* If SACK is disabled, and we got a loss, TCP does not exit
-+ * the loss-state until something above high_seq has been acked.
-+ * (see tcp_try_undo_recovery)
-+ *
-+ * high_seq is the snd_nxt at the moment of the RTO. As soon
-+ * as we have an RTO, we won't push data on the subflow.
-+ * Thus, snd_una can never go beyond high_seq.
-+ */
-+ if (!tcp_is_reno(tp))
-+ return false;
-+ else if (tp->snd_una != tp->high_seq)
-+ return false;
-+ }
-+
-+ if (!tp->mptcp->fully_established) {
-+ /* Make sure that we send in-order data */
-+ if (skb && tp->mptcp->second_packet &&
-+ tp->mptcp->last_end_data_seq != TCP_SKB_CB(skb)->seq)
-+ return false;
-+ }
-+
-+ if (!cwnd_test)
-+ goto zero_wnd_test;
-+
-+ in_flight = tcp_packets_in_flight(tp);
-+ /* Not even a single spot in the cwnd */
-+ if (in_flight >= tp->snd_cwnd)
-+ return false;
-+
-+ /* Now, check if what is queued in the subflow's send-queue
-+ * already fills the cwnd.
-+ */
-+ space = (tp->snd_cwnd - in_flight) * tp->mss_cache;
-+
-+ if (tp->write_seq - tp->snd_nxt > space)
-+ return false;
-+
-+zero_wnd_test:
-+ if (zero_wnd_test && !before(tp->write_seq, tcp_wnd_end(tp)))
-+ return false;
-+
-+ return true;
-+}
-+
-+/* Are we not allowed to reinject this skb on tp? */
-+static int mptcp_rr_dont_reinject_skb(const struct tcp_sock *tp, const struct sk_buff *skb)
-+{
-+ /* If the skb has already been enqueued in this sk, try to find
-+ * another one.
-+ */
-+ return skb &&
-+ /* Has the skb already been enqueued into this subsocket? */
-+ mptcp_pi_to_flag(tp->mptcp->path_index) & TCP_SKB_CB(skb)->path_mask;
-+}
-+
-+/* We just look for any subflow that is available */
-+static struct sock *rr_get_available_subflow(struct sock *meta_sk,
-+ struct sk_buff *skb,
-+ bool zero_wnd_test)
-+{
-+ const struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
-+ struct sock *sk, *bestsk = NULL, *backupsk = NULL;
-+
-+ /* Answer data_fin on same subflow!!! */
-+ if (meta_sk->sk_shutdown & RCV_SHUTDOWN &&
-+ skb && mptcp_is_data_fin(skb)) {
-+ mptcp_for_each_sk(mpcb, sk) {
-+ if (tcp_sk(sk)->mptcp->path_index == mpcb->dfin_path_index &&
-+ mptcp_rr_is_available(sk, skb, zero_wnd_test, true))
-+ return sk;
-+ }
-+ }
-+
-+ /* First, find the best subflow */
-+ mptcp_for_each_sk(mpcb, sk) {
-+ struct tcp_sock *tp = tcp_sk(sk);
-+
-+ if (!mptcp_rr_is_available(sk, skb, zero_wnd_test, true))
-+ continue;
-+
-+ if (mptcp_rr_dont_reinject_skb(tp, skb)) {
-+ backupsk = sk;
-+ continue;
-+ }
-+
-+ bestsk = sk;
-+ }
-+
-+ if (bestsk) {
-+ sk = bestsk;
-+ } else if (backupsk) {
-+ /* It has been sent on all subflows once - let's give it a
-+ * chance again by restarting its pathmask.
-+ */
-+ if (skb)
-+ TCP_SKB_CB(skb)->path_mask = 0;
-+ sk = backupsk;
-+ }
-+
-+ return sk;
-+}
-+
-+/* Returns the next segment to be sent from the mptcp meta-queue.
-+ * (chooses the reinject queue if any segment is waiting in it, otherwise,
-+ * chooses the normal write queue).
-+ * Sets *@reinject to 1 if the returned segment comes from the
-+ * reinject queue. Sets it to 0 if it is the regular send-head of the meta-sk,
-+ * and sets it to -1 if it is a meta-level retransmission to optimize the
-+ * receive-buffer.
-+ */
-+static struct sk_buff *__mptcp_rr_next_segment(const struct sock *meta_sk, int *reinject)
-+{
-+ const struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
-+ struct sk_buff *skb = NULL;
-+
-+ *reinject = 0;
-+
-+ /* If we are in fallback-mode, just take from the meta-send-queue */
-+ if (mpcb->infinite_mapping_snd || mpcb->send_infinite_mapping)
-+ return tcp_send_head(meta_sk);
-+
-+ skb = skb_peek(&mpcb->reinject_queue);
-+
-+ if (skb)
-+ *reinject = 1;
-+ else
-+ skb = tcp_send_head(meta_sk);
-+ return skb;
-+}
-+
-+static struct sk_buff *mptcp_rr_next_segment(struct sock *meta_sk,
-+ int *reinject,
-+ struct sock **subsk,
-+ unsigned int *limit)
-+{
-+ const struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
-+ struct sock *sk_it, *choose_sk = NULL;
-+ struct sk_buff *skb = __mptcp_rr_next_segment(meta_sk, reinject);
-+ unsigned char split = num_segments;
-+ unsigned char iter = 0, full_subs = 0;
-+
-+ /* As we set it, we have to reset it as well. */
-+ *limit = 0;
-+
-+ if (!skb)
-+ return NULL;
-+
-+ if (*reinject) {
-+ *subsk = rr_get_available_subflow(meta_sk, skb, false);
-+ if (!*subsk)
-+ return NULL;
-+
-+ return skb;
-+ }
-+
-+retry:
-+
-+ /* First, we look for a subflow that is currently being used */
-+ mptcp_for_each_sk(mpcb, sk_it) {
-+ struct tcp_sock *tp_it = tcp_sk(sk_it);
-+ struct rrsched_priv *rsp = rrsched_get_priv(tp_it);
-+
-+ if (!mptcp_rr_is_available(sk_it, skb, false, cwnd_limited))
-+ continue;
-+
-+ iter++;
-+
-+ /* Is this subflow currently being used? */
-+ if (rsp->quota > 0 && rsp->quota < num_segments) {
-+ split = num_segments - rsp->quota;
-+ choose_sk = sk_it;
-+ goto found;
-+ }
-+
-+ /* Or, it's totally unused */
-+ if (!rsp->quota) {
-+ split = num_segments;
-+ choose_sk = sk_it;
-+ }
-+
-+ /* Or, it must then be fully used */
-+ if (rsp->quota == num_segments)
-+ full_subs++;
-+ }
-+
-+ /* All considered subflows have a full quota, and we considered at
-+ * least one.
-+ */
-+ if (iter && iter == full_subs) {
-+ /* So, we restart this round by setting quota to 0 and retry
-+ * to find a subflow.
-+ */
-+ mptcp_for_each_sk(mpcb, sk_it) {
-+ struct tcp_sock *tp_it = tcp_sk(sk_it);
-+ struct rrsched_priv *rsp = rrsched_get_priv(tp_it);
-+
-+ if (!mptcp_rr_is_available(sk_it, skb, false, cwnd_limited))
-+ continue;
-+
-+ rsp->quota = 0;
-+ }
-+
-+ goto retry;
-+ }
-+
-+found:
-+ if (choose_sk) {
-+ unsigned int mss_now;
-+ struct tcp_sock *choose_tp = tcp_sk(choose_sk);
-+ struct rrsched_priv *rsp = rrsched_get_priv(choose_tp);
-+
-+ if (!mptcp_rr_is_available(choose_sk, skb, false, true))
-+ return NULL;
-+
-+ *subsk = choose_sk;
-+ mss_now = tcp_current_mss(*subsk);
-+ *limit = split * mss_now;
-+
-+ if (skb->len > mss_now)
-+ rsp->quota += DIV_ROUND_UP(skb->len, mss_now);
-+ else
-+ rsp->quota++;
-+
-+ return skb;
-+ }
-+
-+ return NULL;
-+}
-+
-+static struct mptcp_sched_ops mptcp_sched_rr = {
-+ .get_subflow = rr_get_available_subflow,
-+ .next_segment = mptcp_rr_next_segment,
-+ .name = "roundrobin",
-+ .owner = THIS_MODULE,
-+};
-+
-+static int __init rr_register(void)
-+{
-+ BUILD_BUG_ON(sizeof(struct rrsched_priv) > MPTCP_SCHED_SIZE);
-+
-+ if (mptcp_register_scheduler(&mptcp_sched_rr))
-+ return -1;
-+
-+ return 0;
-+}
-+
-+static void rr_unregister(void)
-+{
-+ mptcp_unregister_scheduler(&mptcp_sched_rr);
-+}
-+
-+module_init(rr_register);
-+module_exit(rr_unregister);
-+
-+MODULE_AUTHOR("Christoph Paasch");
-+MODULE_LICENSE("GPL");
-+MODULE_DESCRIPTION("ROUNDROBIN MPTCP");
-+MODULE_VERSION("0.89");
-diff --git a/net/mptcp/mptcp_sched.c b/net/mptcp/mptcp_sched.c
-new file mode 100644
-index 000000000000..6c7ff4eceac1
---- /dev/null
-+++ b/net/mptcp/mptcp_sched.c
-@@ -0,0 +1,493 @@
-+/* MPTCP Scheduler module selector. Highly inspired by tcp_cong.c */
-+
-+#include <linux/module.h>
-+#include <net/mptcp.h>
-+
-+static DEFINE_SPINLOCK(mptcp_sched_list_lock);
-+static LIST_HEAD(mptcp_sched_list);
-+
-+struct defsched_priv {
-+ u32 last_rbuf_opti;
-+};
-+
-+static struct defsched_priv *defsched_get_priv(const struct tcp_sock *tp)
-+{
-+ return (struct defsched_priv *)&tp->mptcp->mptcp_sched[0];
-+}
-+
-+/* Is the sub-socket sk available to send the skb? */
-+static bool mptcp_is_available(struct sock *sk, const struct sk_buff *skb,
-+ bool zero_wnd_test)
-+{
-+ const struct tcp_sock *tp = tcp_sk(sk);
-+ unsigned int mss_now, space, in_flight;
-+
-+ /* Set of states for which we are allowed to send data */
-+ if (!mptcp_sk_can_send(sk))
-+ return false;
-+
-+ /* We do not send data on this subflow unless it is
-+ * fully established, i.e. the 4th ack has been received.
-+ */
-+ if (tp->mptcp->pre_established)
-+ return false;
-+
-+ if (tp->pf)
-+ return false;
-+
-+ if (inet_csk(sk)->icsk_ca_state == TCP_CA_Loss) {
-+ /* If SACK is disabled, and we got a loss, TCP does not exit
-+ * the loss-state until something above high_seq has been acked.
-+ * (see tcp_try_undo_recovery)
-+ *
-+ * high_seq is the snd_nxt at the moment of the RTO. As soon
-+ * as we have an RTO, we won't push data on the subflow.
-+ * Thus, snd_una can never go beyond high_seq.
-+ */
-+ if (!tcp_is_reno(tp))
-+ return false;
-+ else if (tp->snd_una != tp->high_seq)
-+ return false;
-+ }
-+
-+ if (!tp->mptcp->fully_established) {
-+ /* Make sure that we send in-order data */
-+ if (skb && tp->mptcp->second_packet &&
-+ tp->mptcp->last_end_data_seq != TCP_SKB_CB(skb)->seq)
-+ return false;
-+ }
-+
-+ /* If TSQ is already throttling us, do not send on this subflow. When
-+ * TSQ gets cleared the subflow becomes eligible again.
-+ */
-+ if (test_bit(TSQ_THROTTLED, &tp->tsq_flags))
-+ return false;
-+
-+ in_flight = tcp_packets_in_flight(tp);
-+ /* Not even a single spot in the cwnd */
-+ if (in_flight >= tp->snd_cwnd)
-+ return false;
-+
-+ /* Now, check if what is queued in the subflow's send-queue
-+ * already fills the cwnd.
-+ */
-+ space = (tp->snd_cwnd - in_flight) * tp->mss_cache;
-+
-+ if (tp->write_seq - tp->snd_nxt > space)
-+ return false;
-+
-+ if (zero_wnd_test && !before(tp->write_seq, tcp_wnd_end(tp)))
-+ return false;
-+
-+ mss_now = tcp_current_mss(sk);
-+
-+ /* Don't send on this subflow if we bypass the allowed send-window at
-+ * the per-subflow level. Similar to tcp_snd_wnd_test, but manually
-+ * calculated end_seq (because here at this point end_seq is still at
-+ * the meta-level).
-+ */
-+ if (skb && !zero_wnd_test &&
-+ after(tp->write_seq + min(skb->len, mss_now), tcp_wnd_end(tp)))
-+ return false;
-+
-+ return true;
-+}
-+
-+/* Are we not allowed to reinject this skb on tp? */
-+static int mptcp_dont_reinject_skb(const struct tcp_sock *tp, const struct sk_buff *skb)
-+{
-+ /* If the skb has already been enqueued in this sk, try to find
-+ * another one.
-+ */
-+ return skb &&
-+ /* Has the skb already been enqueued into this subsocket? */
-+ mptcp_pi_to_flag(tp->mptcp->path_index) & TCP_SKB_CB(skb)->path_mask;
-+}
-+
-+/* This is the scheduler. This function decides on which flow to send
-+ * a given MSS. If all subflows are found to be busy, NULL is returned.
-+ * The flow is selected based on the shortest RTT.
-+ * If all paths have full congestion windows, we simply return NULL.
-+ *
-+ * Additionally, this function is aware of the backup-subflows.
-+ */
-+static struct sock *get_available_subflow(struct sock *meta_sk,
-+ struct sk_buff *skb,
-+ bool zero_wnd_test)
-+{
-+ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
-+ struct sock *sk, *bestsk = NULL, *lowpriosk = NULL, *backupsk = NULL;
-+ u32 min_time_to_peer = 0xffffffff, lowprio_min_time_to_peer = 0xffffffff;
-+ int cnt_backups = 0;
-+
-+ /* if there is only one subflow, bypass the scheduling function */
-+ if (mpcb->cnt_subflows == 1) {
-+ bestsk = (struct sock *)mpcb->connection_list;
-+ if (!mptcp_is_available(bestsk, skb, zero_wnd_test))
-+ bestsk = NULL;
-+ return bestsk;
-+ }
-+
-+ /* Answer data_fin on same subflow!!! */
-+ if (meta_sk->sk_shutdown & RCV_SHUTDOWN &&
-+ skb && mptcp_is_data_fin(skb)) {
-+ mptcp_for_each_sk(mpcb, sk) {
-+ if (tcp_sk(sk)->mptcp->path_index == mpcb->dfin_path_index &&
-+ mptcp_is_available(sk, skb, zero_wnd_test))
-+ return sk;
-+ }
-+ }
-+
-+ /* First, find the best subflow */
-+ mptcp_for_each_sk(mpcb, sk) {
-+ struct tcp_sock *tp = tcp_sk(sk);
-+
-+ if (tp->mptcp->rcv_low_prio || tp->mptcp->low_prio)
-+ cnt_backups++;
-+
-+ if ((tp->mptcp->rcv_low_prio || tp->mptcp->low_prio) &&
-+ tp->srtt_us < lowprio_min_time_to_peer) {
-+ if (!mptcp_is_available(sk, skb, zero_wnd_test))
-+ continue;
-+
-+ if (mptcp_dont_reinject_skb(tp, skb)) {
-+ backupsk = sk;
-+ continue;
-+ }
-+
-+ lowprio_min_time_to_peer = tp->srtt_us;
-+ lowpriosk = sk;
-+ } else if (!(tp->mptcp->rcv_low_prio || tp->mptcp->low_prio) &&
-+ tp->srtt_us < min_time_to_peer) {
-+ if (!mptcp_is_available(sk, skb, zero_wnd_test))
-+ continue;
-+
-+ if (mptcp_dont_reinject_skb(tp, skb)) {
-+ backupsk = sk;
-+ continue;
-+ }
-+
-+ min_time_to_peer = tp->srtt_us;
-+ bestsk = sk;
-+ }
-+ }
-+
-+ if (mpcb->cnt_established == cnt_backups && lowpriosk) {
-+ sk = lowpriosk;
-+ } else if (bestsk) {
-+ sk = bestsk;
-+ } else if (backupsk) {
-+ /* It has been sent on all subflows once - let's give it a
-+ * chance again by restarting its pathmask.
-+ */
-+ if (skb)
-+ TCP_SKB_CB(skb)->path_mask = 0;
-+ sk = backupsk;
-+ }
-+
-+ return sk;
-+}
-+
-+static struct sk_buff *mptcp_rcv_buf_optimization(struct sock *sk, int penal)
-+{
-+ struct sock *meta_sk;
-+ const struct tcp_sock *tp = tcp_sk(sk);
-+ struct tcp_sock *tp_it;
-+ struct sk_buff *skb_head;
-+ struct defsched_priv *dsp = defsched_get_priv(tp);
-+
-+ if (tp->mpcb->cnt_subflows == 1)
-+ return NULL;
-+
-+ meta_sk = mptcp_meta_sk(sk);
-+ skb_head = tcp_write_queue_head(meta_sk);
-+
-+ if (!skb_head || skb_head == tcp_send_head(meta_sk))
-+ return NULL;
-+
-+ /* If penalization is optional (coming from mptcp_next_segment()) and
-+ * we are not send-buffer-limited, we do not penalize. The retransmission
-+ * is just an optimization to fix the idle-time due to the delay before
-+ * we wake up the application.
-+ */
-+ if (!penal && sk_stream_memory_free(meta_sk))
-+ goto retrans;
-+
-+ /* Only penalize again after an RTT has elapsed */
-+ if (tcp_time_stamp - dsp->last_rbuf_opti < usecs_to_jiffies(tp->srtt_us >> 3))
-+ goto retrans;
-+
-+ /* Half the cwnd of the slow flow */
-+ mptcp_for_each_tp(tp->mpcb, tp_it) {
-+ if (tp_it != tp &&
-+ TCP_SKB_CB(skb_head)->path_mask & mptcp_pi_to_flag(tp_it->mptcp->path_index)) {
-+ if (tp->srtt_us < tp_it->srtt_us && inet_csk((struct sock *)tp_it)->icsk_ca_state == TCP_CA_Open) {
-+ tp_it->snd_cwnd = max(tp_it->snd_cwnd >> 1U, 1U);
-+ if (tp_it->snd_ssthresh != TCP_INFINITE_SSTHRESH)
-+ tp_it->snd_ssthresh = max(tp_it->snd_ssthresh >> 1U, 2U);
-+
-+ dsp->last_rbuf_opti = tcp_time_stamp;
-+ }
-+ break;
-+ }
-+ }
-+
-+retrans:
-+
-+ /* Segment not yet injected into this path? Take it!!! */
-+ if (!(TCP_SKB_CB(skb_head)->path_mask & mptcp_pi_to_flag(tp->mptcp->path_index))) {
-+ bool do_retrans = false;
-+ mptcp_for_each_tp(tp->mpcb, tp_it) {
-+ if (tp_it != tp &&
-+ TCP_SKB_CB(skb_head)->path_mask & mptcp_pi_to_flag(tp_it->mptcp->path_index)) {
-+ if (tp_it->snd_cwnd <= 4) {
-+ do_retrans = true;
-+ break;
-+ }
-+
-+ if (4 * tp->srtt_us >= tp_it->srtt_us) {
-+ do_retrans = false;
-+ break;
-+ } else {
-+ do_retrans = true;
-+ }
-+ }
-+ }
-+
-+ if (do_retrans && mptcp_is_available(sk, skb_head, false))
-+ return skb_head;
-+ }
-+ return NULL;
-+}
-+
-+/* Returns the next segment to be sent from the mptcp meta-queue.
-+ * (chooses the reinject queue if any segment is waiting in it, otherwise,
-+ * chooses the normal write queue).
-+ * Sets *@reinject to 1 if the returned segment comes from the
-+ * reinject queue. Sets it to 0 if it is the regular send-head of the meta-sk,
-+ * and sets it to -1 if it is a meta-level retransmission to optimize the
-+ * receive-buffer.
-+ */
-+static struct sk_buff *__mptcp_next_segment(struct sock *meta_sk, int *reinject)
-+{
-+ const struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
-+ struct sk_buff *skb = NULL;
-+
-+ *reinject = 0;
-+
-+ /* If we are in fallback-mode, just take from the meta-send-queue */
-+ if (mpcb->infinite_mapping_snd || mpcb->send_infinite_mapping)
-+ return tcp_send_head(meta_sk);
-+
-+ skb = skb_peek(&mpcb->reinject_queue);
-+
-+ if (skb) {
-+ *reinject = 1;
-+ } else {
-+ skb = tcp_send_head(meta_sk);
-+
-+ if (!skb && meta_sk->sk_socket &&
-+ test_bit(SOCK_NOSPACE, &meta_sk->sk_socket->flags) &&
-+ sk_stream_wspace(meta_sk) < sk_stream_min_wspace(meta_sk)) {
-+ struct sock *subsk = get_available_subflow(meta_sk, NULL,
-+ false);
-+ if (!subsk)
-+ return NULL;
-+
-+ skb = mptcp_rcv_buf_optimization(subsk, 0);
-+ if (skb)
-+ *reinject = -1;
-+ }
-+ }
-+ return skb;
-+}
-+
-+static struct sk_buff *mptcp_next_segment(struct sock *meta_sk,
-+ int *reinject,
-+ struct sock **subsk,
-+ unsigned int *limit)
-+{
-+ struct sk_buff *skb = __mptcp_next_segment(meta_sk, reinject);
-+ unsigned int mss_now;
-+ struct tcp_sock *subtp;
-+ u16 gso_max_segs;
-+ u32 max_len, max_segs, window, needed;
-+
-+ /* As we set it, we have to reset it as well. */
-+ *limit = 0;
-+
-+ if (!skb)
-+ return NULL;
-+
-+ *subsk = get_available_subflow(meta_sk, skb, false);
-+ if (!*subsk)
-+ return NULL;
-+
-+ subtp = tcp_sk(*subsk);
-+ mss_now = tcp_current_mss(*subsk);
-+
-+ if (!*reinject && unlikely(!tcp_snd_wnd_test(tcp_sk(meta_sk), skb, mss_now))) {
-+ skb = mptcp_rcv_buf_optimization(*subsk, 1);
-+ if (skb)
-+ *reinject = -1;
-+ else
-+ return NULL;
-+ }
-+
-+ /* No splitting required, as we will only send one single segment */
-+ if (skb->len <= mss_now)
-+ return skb;
-+
-+ /* The following is similar to tcp_mss_split_point, but
-+ * we do not care about Nagle, because we will anyway
-+ * use TCP_NAGLE_PUSH, which overrides this.
-+ *
-+ * So, we first limit according to the cwnd/gso-size and then according
-+ * to the subflow's window.
-+ */
-+
-+ gso_max_segs = (*subsk)->sk_gso_max_segs;
-+ if (!gso_max_segs) /* No gso supported on the subflow's NIC */
-+ gso_max_segs = 1;
-+ max_segs = min_t(unsigned int, tcp_cwnd_test(subtp, skb), gso_max_segs);
-+ if (!max_segs)
-+ return NULL;
-+
-+ max_len = mss_now * max_segs;
-+ window = tcp_wnd_end(subtp) - subtp->write_seq;
-+
-+ needed = min(skb->len, window);
-+ if (max_len <= skb->len)
-+ /* Take max_win, which is actually the cwnd/gso-size */
-+ *limit = max_len;
-+ else
-+ /* Or, take the window */
-+ *limit = needed;
-+
-+ return skb;
-+}
-+
-+static void defsched_init(struct sock *sk)
-+{
-+ struct defsched_priv *dsp = defsched_get_priv(tcp_sk(sk));
-+
-+ dsp->last_rbuf_opti = tcp_time_stamp;
-+}
-+
-+struct mptcp_sched_ops mptcp_sched_default = {
-+ .get_subflow = get_available_subflow,
-+ .next_segment = mptcp_next_segment,
-+ .init = defsched_init,
-+ .name = "default",
-+ .owner = THIS_MODULE,
-+};
-+
-+static struct mptcp_sched_ops *mptcp_sched_find(const char *name)
-+{
-+ struct mptcp_sched_ops *e;
-+
-+ list_for_each_entry_rcu(e, &mptcp_sched_list, list) {
-+ if (strcmp(e->name, name) == 0)
-+ return e;
-+ }
-+
-+ return NULL;
-+}
-+
-+int mptcp_register_scheduler(struct mptcp_sched_ops *sched)
-+{
-+ int ret = 0;
-+
-+ if (!sched->get_subflow || !sched->next_segment)
-+ return -EINVAL;
-+
-+ spin_lock(&mptcp_sched_list_lock);
-+ if (mptcp_sched_find(sched->name)) {
-+ pr_notice("%s already registered\n", sched->name);
-+ ret = -EEXIST;
-+ } else {
-+ list_add_tail_rcu(&sched->list, &mptcp_sched_list);
-+ pr_info("%s registered\n", sched->name);
-+ }
-+ spin_unlock(&mptcp_sched_list_lock);
-+
-+ return ret;
-+}
-+EXPORT_SYMBOL_GPL(mptcp_register_scheduler);
-+
-+void mptcp_unregister_scheduler(struct mptcp_sched_ops *sched)
-+{
-+ spin_lock(&mptcp_sched_list_lock);
-+ list_del_rcu(&sched->list);
-+ spin_unlock(&mptcp_sched_list_lock);
-+}
-+EXPORT_SYMBOL_GPL(mptcp_unregister_scheduler);
-+
-+void mptcp_get_default_scheduler(char *name)
-+{
-+ struct mptcp_sched_ops *sched;
-+
-+ BUG_ON(list_empty(&mptcp_sched_list));
-+
-+ rcu_read_lock();
-+ sched = list_entry(mptcp_sched_list.next, struct mptcp_sched_ops, list);
-+ strncpy(name, sched->name, MPTCP_SCHED_NAME_MAX);
-+ rcu_read_unlock();
-+}
-+
-+int mptcp_set_default_scheduler(const char *name)
-+{
-+ struct mptcp_sched_ops *sched;
-+ int ret = -ENOENT;
-+
-+ spin_lock(&mptcp_sched_list_lock);
-+ sched = mptcp_sched_find(name);
-+#ifdef CONFIG_MODULES
-+ if (!sched && capable(CAP_NET_ADMIN)) {
-+ spin_unlock(&mptcp_sched_list_lock);
-+
-+ request_module("mptcp_%s", name);
-+ spin_lock(&mptcp_sched_list_lock);
-+ sched = mptcp_sched_find(name);
-+ }
-+#endif
-+
-+ if (sched) {
-+ list_move(&sched->list, &mptcp_sched_list);
-+ ret = 0;
-+ } else {
-+ pr_info("%s is not available\n", name);
-+ }
-+ spin_unlock(&mptcp_sched_list_lock);
-+
-+ return ret;
-+}
-+
-+void mptcp_init_scheduler(struct mptcp_cb *mpcb)
-+{
-+ struct mptcp_sched_ops *sched;
-+
-+ rcu_read_lock();
-+ list_for_each_entry_rcu(sched, &mptcp_sched_list, list) {
-+ if (try_module_get(sched->owner)) {
-+ mpcb->sched_ops = sched;
-+ break;
-+ }
-+ }
-+ rcu_read_unlock();
-+}
-+
-+/* Manage refcounts on socket close. */
-+void mptcp_cleanup_scheduler(struct mptcp_cb *mpcb)
-+{
-+ module_put(mpcb->sched_ops->owner);
-+}
-+
-+/* Set default value from kernel configuration at bootup */
-+static int __init mptcp_scheduler_default(void)
-+{
-+ BUILD_BUG_ON(sizeof(struct defsched_priv) > MPTCP_SCHED_SIZE);
-+
-+ return mptcp_set_default_scheduler(CONFIG_DEFAULT_MPTCP_SCHED);
-+}
-+late_initcall(mptcp_scheduler_default);
-diff --git a/net/mptcp/mptcp_wvegas.c b/net/mptcp/mptcp_wvegas.c
-new file mode 100644
-index 000000000000..29ca1d868d17
---- /dev/null
-+++ b/net/mptcp/mptcp_wvegas.c
-@@ -0,0 +1,268 @@
-+/*
-+ * MPTCP implementation - WEIGHTED VEGAS
-+ *
-+ * Algorithm design:
-+ * Yu Cao <cyAnalyst@126.com>
-+ * Mingwei Xu <xmw@csnet1.cs.tsinghua.edu.cn>
-+ * Xiaoming Fu <fu@cs.uni-goettinggen.de>
-+ *
-+ * Implementation:
-+ * Yu Cao <cyAnalyst@126.com>
-+ * Enhuan Dong <deh13@mails.tsinghua.edu.cn>
-+ *
-+ * Ported to the official MPTCP-kernel:
-+ * Christoph Paasch <christoph.paasch@uclouvain.be>
-+ *
-+ * This program is free software; you can redistribute it and/or
-+ * modify it under the terms of the GNU General Public License
-+ * as published by the Free Software Foundation; either version
-+ * 2 of the License, or (at your option) any later version.
-+ */
-+
-+#include <linux/skbuff.h>
-+#include <net/tcp.h>
-+#include <net/mptcp.h>
-+#include <linux/module.h>
-+#include <linux/tcp.h>
-+
-+static int initial_alpha = 2;
-+static int total_alpha = 10;
-+static int gamma = 1;
-+
-+module_param(initial_alpha, int, 0644);
-+MODULE_PARM_DESC(initial_alpha, "initial alpha for all subflows");
-+module_param(total_alpha, int, 0644);
-+MODULE_PARM_DESC(total_alpha, "total alpha for all subflows");
-+module_param(gamma, int, 0644);
-+MODULE_PARM_DESC(gamma, "limit on increase (scale by 2)");
-+
-+#define MPTCP_WVEGAS_SCALE 16
-+
-+/* wVegas variables */
-+struct wvegas {
-+ u32 beg_snd_nxt; /* right edge during last RTT */
-+ u8 doing_wvegas_now;/* if true, do wvegas for this RTT */
-+
-+ u16 cnt_rtt; /* # of RTTs measured within last RTT */
-+ u32 sampled_rtt; /* cumulative RTTs measured within last RTT (in usec) */
-+ u32 base_rtt; /* the min of all wVegas RTT measurements seen (in usec) */
-+
-+ u64 instant_rate; /* cwnd / srtt_us, unit: pkts/us * 2^16 */
-+ u64 weight; /* the ratio of subflow's rate to the total rate, * 2^16 */
-+ int alpha; /* alpha for each subflow */
-+
-+ u32 queue_delay; /* queue delay */
-+};
-+
-+
-+static inline u64 mptcp_wvegas_scale(u32 val, int scale)
-+{
-+ return (u64) val << scale;
-+}
-+
-+static void wvegas_enable(const struct sock *sk)
-+{
-+ const struct tcp_sock *tp = tcp_sk(sk);
-+ struct wvegas *wvegas = inet_csk_ca(sk);
-+
-+ wvegas->doing_wvegas_now = 1;
-+
-+ wvegas->beg_snd_nxt = tp->snd_nxt;
-+
-+ wvegas->cnt_rtt = 0;
-+ wvegas->sampled_rtt = 0;
-+
-+ wvegas->instant_rate = 0;
-+ wvegas->alpha = initial_alpha;
-+ wvegas->weight = mptcp_wvegas_scale(1, MPTCP_WVEGAS_SCALE);
-+
-+ wvegas->queue_delay = 0;
-+}
-+
-+static inline void wvegas_disable(const struct sock *sk)
-+{
-+ struct wvegas *wvegas = inet_csk_ca(sk);
-+
-+ wvegas->doing_wvegas_now = 0;
-+}
-+
-+static void mptcp_wvegas_init(struct sock *sk)
-+{
-+ struct wvegas *wvegas = inet_csk_ca(sk);
-+
-+ wvegas->base_rtt = 0x7fffffff;
-+ wvegas_enable(sk);
-+}
-+
-+static inline u64 mptcp_wvegas_rate(u32 cwnd, u32 rtt_us)
-+{
-+ return div_u64(mptcp_wvegas_scale(cwnd, MPTCP_WVEGAS_SCALE), rtt_us);
-+}
-+
-+static void mptcp_wvegas_pkts_acked(struct sock *sk, u32 cnt, s32 rtt_us)
-+{
-+ struct wvegas *wvegas = inet_csk_ca(sk);
-+ u32 vrtt;
-+
-+ if (rtt_us < 0)
-+ return;
-+
-+ vrtt = rtt_us + 1;
-+
-+ if (vrtt < wvegas->base_rtt)
-+ wvegas->base_rtt = vrtt;
-+
-+ wvegas->sampled_rtt += vrtt;
-+ wvegas->cnt_rtt++;
-+}
-+
-+static void mptcp_wvegas_state(struct sock *sk, u8 ca_state)
-+{
-+ if (ca_state == TCP_CA_Open)
-+ wvegas_enable(sk);
-+ else
-+ wvegas_disable(sk);
-+}
-+
-+static void mptcp_wvegas_cwnd_event(struct sock *sk, enum tcp_ca_event event)
-+{
-+ if (event == CA_EVENT_CWND_RESTART) {
-+ mptcp_wvegas_init(sk);
-+ } else if (event == CA_EVENT_LOSS) {
-+ struct wvegas *wvegas = inet_csk_ca(sk);
-+ wvegas->instant_rate = 0;
-+ }
-+}
-+
-+static inline u32 mptcp_wvegas_ssthresh(const struct tcp_sock *tp)
-+{
-+ return min(tp->snd_ssthresh, tp->snd_cwnd - 1);
-+}
-+
-+static u64 mptcp_wvegas_weight(const struct mptcp_cb *mpcb, const struct sock *sk)
-+{
-+ u64 total_rate = 0;
-+ struct sock *sub_sk;
-+ const struct wvegas *wvegas = inet_csk_ca(sk);
-+
-+ if (!mpcb)
-+ return wvegas->weight;
-+
-+
-+ mptcp_for_each_sk(mpcb, sub_sk) {
-+ struct wvegas *sub_wvegas = inet_csk_ca(sub_sk);
-+
-+ /* sampled_rtt is initialized to 0 */
-+ if (mptcp_sk_can_send(sub_sk) && (sub_wvegas->sampled_rtt > 0))
-+ total_rate += sub_wvegas->instant_rate;
-+ }
-+
-+ if (total_rate && wvegas->instant_rate)
-+ return div64_u64(mptcp_wvegas_scale(wvegas->instant_rate, MPTCP_WVEGAS_SCALE), total_rate);
-+ else
-+ return wvegas->weight;
-+}
-+
-+static void mptcp_wvegas_cong_avoid(struct sock *sk, u32 ack, u32 acked)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ struct wvegas *wvegas = inet_csk_ca(sk);
-+
-+ if (!wvegas->doing_wvegas_now) {
-+ tcp_reno_cong_avoid(sk, ack, acked);
-+ return;
-+ }
-+
-+ if (after(ack, wvegas->beg_snd_nxt)) {
-+ wvegas->beg_snd_nxt = tp->snd_nxt;
-+
-+ if (wvegas->cnt_rtt <= 2) {
-+ tcp_reno_cong_avoid(sk, ack, acked);
-+ } else {
-+ u32 rtt, diff, q_delay;
-+ u64 target_cwnd;
-+
-+ rtt = wvegas->sampled_rtt / wvegas->cnt_rtt;
-+ target_cwnd = div_u64(((u64)tp->snd_cwnd * wvegas->base_rtt), rtt);
-+
-+ diff = div_u64((u64)tp->snd_cwnd * (rtt - wvegas->base_rtt), rtt);
-+
-+ if (diff > gamma && tp->snd_cwnd <= tp->snd_ssthresh) {
-+ tp->snd_cwnd = min(tp->snd_cwnd, (u32)target_cwnd+1);
-+ tp->snd_ssthresh = mptcp_wvegas_ssthresh(tp);
-+
-+ } else if (tp->snd_cwnd <= tp->snd_ssthresh) {
-+ tcp_slow_start(tp, acked);
-+ } else {
-+ if (diff >= wvegas->alpha) {
-+ wvegas->instant_rate = mptcp_wvegas_rate(tp->snd_cwnd, rtt);
-+ wvegas->weight = mptcp_wvegas_weight(tp->mpcb, sk);
-+ wvegas->alpha = max(2U, (u32)((wvegas->weight * total_alpha) >> MPTCP_WVEGAS_SCALE));
-+ }
-+ if (diff > wvegas->alpha) {
-+ tp->snd_cwnd--;
-+ tp->snd_ssthresh = mptcp_wvegas_ssthresh(tp);
-+ } else if (diff < wvegas->alpha) {
-+ tp->snd_cwnd++;
-+ }
-+
-+ /* Try to drain link queue if needed */
-+ q_delay = rtt - wvegas->base_rtt;
-+ if ((wvegas->queue_delay == 0) || (wvegas->queue_delay > q_delay))
-+ wvegas->queue_delay = q_delay;
-+
-+ if (q_delay >= 2 * wvegas->queue_delay) {
-+ u32 backoff_factor = div_u64(mptcp_wvegas_scale(wvegas->base_rtt, MPTCP_WVEGAS_SCALE), 2 * rtt);
-+ tp->snd_cwnd = ((u64)tp->snd_cwnd * backoff_factor) >> MPTCP_WVEGAS_SCALE;
-+ wvegas->queue_delay = 0;
-+ }
-+ }
-+
-+ if (tp->snd_cwnd < 2)
-+ tp->snd_cwnd = 2;
-+ else if (tp->snd_cwnd > tp->snd_cwnd_clamp)
-+ tp->snd_cwnd = tp->snd_cwnd_clamp;
-+
-+ tp->snd_ssthresh = tcp_current_ssthresh(sk);
-+ }
-+
-+ wvegas->cnt_rtt = 0;
-+ wvegas->sampled_rtt = 0;
-+ }
-+ /* Use normal slow start */
-+ else if (tp->snd_cwnd <= tp->snd_ssthresh)
-+ tcp_slow_start(tp, acked);
-+}
-+
-+
-+static struct tcp_congestion_ops mptcp_wvegas __read_mostly = {
-+ .init = mptcp_wvegas_init,
-+ .ssthresh = tcp_reno_ssthresh,
-+ .cong_avoid = mptcp_wvegas_cong_avoid,
-+ .pkts_acked = mptcp_wvegas_pkts_acked,
-+ .set_state = mptcp_wvegas_state,
-+ .cwnd_event = mptcp_wvegas_cwnd_event,
-+
-+ .owner = THIS_MODULE,
-+ .name = "wvegas",
-+};
-+
-+static int __init mptcp_wvegas_register(void)
-+{
-+ BUILD_BUG_ON(sizeof(struct wvegas) > ICSK_CA_PRIV_SIZE);
-+ tcp_register_congestion_control(&mptcp_wvegas);
-+ return 0;
-+}
-+
-+static void __exit mptcp_wvegas_unregister(void)
-+{
-+ tcp_unregister_congestion_control(&mptcp_wvegas);
-+}
-+
-+module_init(mptcp_wvegas_register);
-+module_exit(mptcp_wvegas_unregister);
-+
-+MODULE_AUTHOR("Yu Cao, Enhuan Dong");
-+MODULE_LICENSE("GPL");
-+MODULE_DESCRIPTION("MPTCP wVegas");
-+MODULE_VERSION("0.1");
^ permalink raw reply related [flat|nested] 26+ messages in thread
* [gentoo-commits] proj/linux-patches:3.16 commit in: /
@ 2014-10-06 11:38 Mike Pagano
0 siblings, 0 replies; 26+ messages in thread
From: Mike Pagano @ 2014-10-06 11:38 UTC (permalink / raw
To: gentoo-commits
commit: f2ea3e49d07e5b148c974633ec003ba2382f1189
Author: Mike Pagano <mpagano <AT> gentoo <DOT> org>
AuthorDate: Mon Oct 6 11:38:42 2014 +0000
Commit: Mike Pagano <mpagano <AT> gentoo <DOT> org>
CommitDate: Mon Oct 6 11:38:42 2014 +0000
URL: http://sources.gentoo.org/gitweb/?p=proj/linux-patches.git;a=commit;h=f2ea3e49
Move multipath to experimental.
---
5010_multipath-tcp-v3.16-872d7f6c6f4e.patch | 19230 ++++++++++++++++++++++++++
1 file changed, 19230 insertions(+)
diff --git a/5010_multipath-tcp-v3.16-872d7f6c6f4e.patch b/5010_multipath-tcp-v3.16-872d7f6c6f4e.patch
new file mode 100644
index 0000000..3000da3
--- /dev/null
+++ b/5010_multipath-tcp-v3.16-872d7f6c6f4e.patch
@@ -0,0 +1,19230 @@
+diff --git a/drivers/infiniband/hw/cxgb4/cm.c b/drivers/infiniband/hw/cxgb4/cm.c
+index 768a0fb67dd6..5a46d91a8df9 100644
+--- a/drivers/infiniband/hw/cxgb4/cm.c
++++ b/drivers/infiniband/hw/cxgb4/cm.c
+@@ -3432,7 +3432,7 @@ static void build_cpl_pass_accept_req(struct sk_buff *skb, int stid , u8 tos)
+ */
+ memset(&tmp_opt, 0, sizeof(tmp_opt));
+ tcp_clear_options(&tmp_opt);
+- tcp_parse_options(skb, &tmp_opt, 0, NULL);
++ tcp_parse_options(skb, &tmp_opt, NULL, 0, NULL);
+
+ req = (struct cpl_pass_accept_req *)__skb_push(skb, sizeof(*req));
+ memset(req, 0, sizeof(*req));
+diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h
+index 2faef339d8f2..d86c853ffaad 100644
+--- a/include/linux/ipv6.h
++++ b/include/linux/ipv6.h
+@@ -256,16 +256,6 @@ static inline struct ipv6_pinfo * inet6_sk(const struct sock *__sk)
+ return inet_sk(__sk)->pinet6;
+ }
+
+-static inline struct request_sock *inet6_reqsk_alloc(struct request_sock_ops *ops)
+-{
+- struct request_sock *req = reqsk_alloc(ops);
+-
+- if (req)
+- inet_rsk(req)->pktopts = NULL;
+-
+- return req;
+-}
+-
+ static inline struct raw6_sock *raw6_sk(const struct sock *sk)
+ {
+ return (struct raw6_sock *)sk;
+@@ -309,12 +299,6 @@ static inline struct ipv6_pinfo * inet6_sk(const struct sock *__sk)
+ return NULL;
+ }
+
+-static inline struct inet6_request_sock *
+- inet6_rsk(const struct request_sock *rsk)
+-{
+- return NULL;
+-}
+-
+ static inline struct raw6_sock *raw6_sk(const struct sock *sk)
+ {
+ return NULL;
+diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
+index ec89301ada41..99ea4b0e3693 100644
+--- a/include/linux/skbuff.h
++++ b/include/linux/skbuff.h
+@@ -2784,8 +2784,10 @@ static inline bool __skb_checksum_validate_needed(struct sk_buff *skb,
+ bool zero_okay,
+ __sum16 check)
+ {
+- if (skb_csum_unnecessary(skb) || (zero_okay && !check)) {
+- skb->csum_valid = 1;
++ if (skb_csum_unnecessary(skb)) {
++ return false;
++ } else if (zero_okay && !check) {
++ skb->ip_summed = CHECKSUM_UNNECESSARY;
+ return false;
+ }
+
+diff --git a/include/linux/tcp.h b/include/linux/tcp.h
+index a0513210798f..7bc2e078d6ca 100644
+--- a/include/linux/tcp.h
++++ b/include/linux/tcp.h
+@@ -53,7 +53,7 @@ static inline unsigned int tcp_optlen(const struct sk_buff *skb)
+ /* TCP Fast Open */
+ #define TCP_FASTOPEN_COOKIE_MIN 4 /* Min Fast Open Cookie size in bytes */
+ #define TCP_FASTOPEN_COOKIE_MAX 16 /* Max Fast Open Cookie size in bytes */
+-#define TCP_FASTOPEN_COOKIE_SIZE 8 /* the size employed by this impl. */
++#define TCP_FASTOPEN_COOKIE_SIZE 4 /* the size employed by this impl. */
+
+ /* TCP Fast Open Cookie as stored in memory */
+ struct tcp_fastopen_cookie {
+@@ -72,6 +72,51 @@ struct tcp_sack_block {
+ u32 end_seq;
+ };
+
++struct tcp_out_options {
++ u16 options; /* bit field of OPTION_* */
++ u8 ws; /* window scale, 0 to disable */
++ u8 num_sack_blocks;/* number of SACK blocks to include */
++ u8 hash_size; /* bytes in hash_location */
++ u16 mss; /* 0 to disable */
++ __u8 *hash_location; /* temporary pointer, overloaded */
++ __u32 tsval, tsecr; /* need to include OPTION_TS */
++ struct tcp_fastopen_cookie *fastopen_cookie; /* Fast open cookie */
++#ifdef CONFIG_MPTCP
++ u16 mptcp_options; /* bit field of MPTCP related OPTION_* */
++ u8 dss_csum:1,
++ add_addr_v4:1,
++ add_addr_v6:1; /* dss-checksum required? */
++
++ union {
++ struct {
++ __u64 sender_key; /* sender's key for mptcp */
++ __u64 receiver_key; /* receiver's key for mptcp */
++ } mp_capable;
++
++ struct {
++ __u64 sender_truncated_mac;
++ __u32 sender_nonce;
++ /* random number of the sender */
++ __u32 token; /* token for mptcp */
++ u8 low_prio:1;
++ } mp_join_syns;
++ };
++
++ struct {
++ struct in_addr addr;
++ u8 addr_id;
++ } add_addr4;
++
++ struct {
++ struct in6_addr addr;
++ u8 addr_id;
++ } add_addr6;
++
++ u16 remove_addrs; /* list of address id */
++ u8 addr_id; /* address id (mp_join or add_address) */
++#endif /* CONFIG_MPTCP */
++};
++
+ /*These are used to set the sack_ok field in struct tcp_options_received */
+ #define TCP_SACK_SEEN (1 << 0) /*1 = peer is SACK capable, */
+ #define TCP_FACK_ENABLED (1 << 1) /*1 = FACK is enabled locally*/
+@@ -95,6 +140,9 @@ struct tcp_options_received {
+ u16 mss_clamp; /* Maximal mss, negotiated at connection setup */
+ };
+
++struct mptcp_cb;
++struct mptcp_tcp_sock;
++
+ static inline void tcp_clear_options(struct tcp_options_received *rx_opt)
+ {
+ rx_opt->tstamp_ok = rx_opt->sack_ok = 0;
+@@ -111,10 +159,7 @@ struct tcp_request_sock_ops;
+
+ struct tcp_request_sock {
+ struct inet_request_sock req;
+-#ifdef CONFIG_TCP_MD5SIG
+- /* Only used by TCP MD5 Signature so far. */
+ const struct tcp_request_sock_ops *af_specific;
+-#endif
+ struct sock *listener; /* needed for TFO */
+ u32 rcv_isn;
+ u32 snt_isn;
+@@ -130,6 +175,8 @@ static inline struct tcp_request_sock *tcp_rsk(const struct request_sock *req)
+ return (struct tcp_request_sock *)req;
+ }
+
++struct tcp_md5sig_key;
++
+ struct tcp_sock {
+ /* inet_connection_sock has to be the first member of tcp_sock */
+ struct inet_connection_sock inet_conn;
+@@ -326,6 +373,37 @@ struct tcp_sock {
+ * socket. Used to retransmit SYNACKs etc.
+ */
+ struct request_sock *fastopen_rsk;
++
++ /* MPTCP/TCP-specific callbacks */
++ const struct tcp_sock_ops *ops;
++
++ struct mptcp_cb *mpcb;
++ struct sock *meta_sk;
++ /* We keep these flags even if CONFIG_MPTCP is not checked, because
++ * it allows checking MPTCP capability just by checking the mpc flag,
++ * rather than adding ifdefs everywhere.
++ */
++ u16 mpc:1, /* Other end is multipath capable */
++ inside_tk_table:1, /* Is the tcp_sock inside the token-table? */
++ send_mp_fclose:1,
++ request_mptcp:1, /* Did we send out an MP_CAPABLE?
++ * (this speeds up mptcp_doit() in tcp_recvmsg)
++ */
++ mptcp_enabled:1, /* Is MPTCP enabled from the application ? */
++ pf:1, /* Potentially Failed state: when this flag is set, we
++ * stop using the subflow
++ */
++ mp_killed:1, /* Killed with a tcp_done in mptcp? */
++ was_meta_sk:1, /* This was a meta sk (in case of reuse) */
++ is_master_sk,
++ close_it:1, /* Must close socket in mptcp_data_ready? */
++ closing:1;
++ struct mptcp_tcp_sock *mptcp;
++#ifdef CONFIG_MPTCP
++ struct hlist_nulls_node tk_table;
++ u32 mptcp_loc_token;
++ u64 mptcp_loc_key;
++#endif /* CONFIG_MPTCP */
+ };
+
+ enum tsq_flags {
+@@ -337,6 +415,8 @@ enum tsq_flags {
+ TCP_MTU_REDUCED_DEFERRED, /* tcp_v{4|6}_err() could not call
+ * tcp_v{4|6}_mtu_reduced()
+ */
++ MPTCP_PATH_MANAGER, /* MPTCP deferred creation of new subflows */
++ MPTCP_SUB_DEFERRED, /* A subflow got deferred - process them */
+ };
+
+ static inline struct tcp_sock *tcp_sk(const struct sock *sk)
+@@ -355,6 +435,7 @@ struct tcp_timewait_sock {
+ #ifdef CONFIG_TCP_MD5SIG
+ struct tcp_md5sig_key *tw_md5_key;
+ #endif
++ struct mptcp_tw *mptcp_tw;
+ };
+
+ static inline struct tcp_timewait_sock *tcp_twsk(const struct sock *sk)
+diff --git a/include/net/inet6_connection_sock.h b/include/net/inet6_connection_sock.h
+index 74af137304be..83f63033897a 100644
+--- a/include/net/inet6_connection_sock.h
++++ b/include/net/inet6_connection_sock.h
+@@ -27,6 +27,8 @@ int inet6_csk_bind_conflict(const struct sock *sk,
+
+ struct dst_entry *inet6_csk_route_req(struct sock *sk, struct flowi6 *fl6,
+ const struct request_sock *req);
++u32 inet6_synq_hash(const struct in6_addr *raddr, const __be16 rport,
++ const u32 rnd, const u32 synq_hsize);
+
+ struct request_sock *inet6_csk_search_req(const struct sock *sk,
+ struct request_sock ***prevp,
+diff --git a/include/net/inet_common.h b/include/net/inet_common.h
+index fe7994c48b75..780f229f46a8 100644
+--- a/include/net/inet_common.h
++++ b/include/net/inet_common.h
+@@ -1,6 +1,8 @@
+ #ifndef _INET_COMMON_H
+ #define _INET_COMMON_H
+
++#include <net/sock.h>
++
+ extern const struct proto_ops inet_stream_ops;
+ extern const struct proto_ops inet_dgram_ops;
+
+@@ -13,6 +15,8 @@ struct sock;
+ struct sockaddr;
+ struct socket;
+
++int inet_create(struct net *net, struct socket *sock, int protocol, int kern);
++int inet6_create(struct net *net, struct socket *sock, int protocol, int kern);
+ int inet_release(struct socket *sock);
+ int inet_stream_connect(struct socket *sock, struct sockaddr *uaddr,
+ int addr_len, int flags);
+diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
+index 7a4313887568..f62159e39839 100644
+--- a/include/net/inet_connection_sock.h
++++ b/include/net/inet_connection_sock.h
+@@ -30,6 +30,7 @@
+
+ struct inet_bind_bucket;
+ struct tcp_congestion_ops;
++struct tcp_options_received;
+
+ /*
+ * Pointers to address related TCP functions
+@@ -243,6 +244,9 @@ static inline void inet_csk_reset_xmit_timer(struct sock *sk, const int what,
+
+ struct sock *inet_csk_accept(struct sock *sk, int flags, int *err);
+
++u32 inet_synq_hash(const __be32 raddr, const __be16 rport, const u32 rnd,
++ const u32 synq_hsize);
++
+ struct request_sock *inet_csk_search_req(const struct sock *sk,
+ struct request_sock ***prevp,
+ const __be16 rport,
+diff --git a/include/net/inet_sock.h b/include/net/inet_sock.h
+index b1edf17bec01..6a32d8d6b85e 100644
+--- a/include/net/inet_sock.h
++++ b/include/net/inet_sock.h
+@@ -86,10 +86,14 @@ struct inet_request_sock {
+ wscale_ok : 1,
+ ecn_ok : 1,
+ acked : 1,
+- no_srccheck: 1;
++ no_srccheck: 1,
++ mptcp_rqsk : 1,
++ saw_mpc : 1;
+ kmemcheck_bitfield_end(flags);
+- struct ip_options_rcu *opt;
+- struct sk_buff *pktopts;
++ union {
++ struct ip_options_rcu *opt;
++ struct sk_buff *pktopts;
++ };
+ u32 ir_mark;
+ };
+
+diff --git a/include/net/mptcp.h b/include/net/mptcp.h
+new file mode 100644
+index 000000000000..712780fc39e4
+--- /dev/null
++++ b/include/net/mptcp.h
+@@ -0,0 +1,1439 @@
++/*
++ * MPTCP implementation
++ *
++ * Initial Design & Implementation:
++ * Sébastien Barré <sebastien.barre@uclouvain.be>
++ *
++ * Current Maintainer & Author:
++ * Christoph Paasch <christoph.paasch@uclouvain.be>
++ *
++ * Additional authors:
++ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
++ * Gregory Detal <gregory.detal@uclouvain.be>
++ * Fabien Duchêne <fabien.duchene@uclouvain.be>
++ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
++ * Lavkesh Lahngir <lavkesh51@gmail.com>
++ * Andreas Ripke <ripke@neclab.eu>
++ * Vlad Dogaru <vlad.dogaru@intel.com>
++ * Octavian Purdila <octavian.purdila@intel.com>
++ * John Ronan <jronan@tssg.org>
++ * Catalin Nicutar <catalin.nicutar@gmail.com>
++ * Brandon Heller <brandonh@stanford.edu>
++ *
++ *
++ * This program is free software; you can redistribute it and/or
++ * modify it under the terms of the GNU General Public License
++ * as published by the Free Software Foundation; either version
++ * 2 of the License, or (at your option) any later version.
++ */
++
++#ifndef _MPTCP_H
++#define _MPTCP_H
++
++#include <linux/inetdevice.h>
++#include <linux/ipv6.h>
++#include <linux/list.h>
++#include <linux/net.h>
++#include <linux/netpoll.h>
++#include <linux/skbuff.h>
++#include <linux/socket.h>
++#include <linux/tcp.h>
++#include <linux/kernel.h>
++
++#include <asm/byteorder.h>
++#include <asm/unaligned.h>
++#include <crypto/hash.h>
++#include <net/tcp.h>
++
++#if defined(__LITTLE_ENDIAN_BITFIELD)
++ #define ntohll(x) be64_to_cpu(x)
++ #define htonll(x) cpu_to_be64(x)
++#elif defined(__BIG_ENDIAN_BITFIELD)
++ #define ntohll(x) (x)
++ #define htonll(x) (x)
++#endif
++
++struct mptcp_loc4 {
++ u8 loc4_id;
++ u8 low_prio:1;
++ struct in_addr addr;
++};
++
++struct mptcp_rem4 {
++ u8 rem4_id;
++ __be16 port;
++ struct in_addr addr;
++};
++
++struct mptcp_loc6 {
++ u8 loc6_id;
++ u8 low_prio:1;
++ struct in6_addr addr;
++};
++
++struct mptcp_rem6 {
++ u8 rem6_id;
++ __be16 port;
++ struct in6_addr addr;
++};
++
++struct mptcp_request_sock {
++ struct tcp_request_sock req;
++ /* hlist-nulls entry to the hash-table. Depending on whether this is a
++ * new MPTCP connection or an additional subflow, the request-socket
++ * is either in the mptcp_reqsk_tk_htb or mptcp_reqsk_htb.
++ */
++ struct hlist_nulls_node hash_entry;
++
++ union {
++ struct {
++ /* Only on initial subflows */
++ u64 mptcp_loc_key;
++ u64 mptcp_rem_key;
++ u32 mptcp_loc_token;
++ };
++
++ struct {
++ /* Only on additional subflows */
++ struct mptcp_cb *mptcp_mpcb;
++ u32 mptcp_rem_nonce;
++ u32 mptcp_loc_nonce;
++ u64 mptcp_hash_tmac;
++ };
++ };
++
++ u8 loc_id;
++ u8 rem_id; /* Address-id in the MP_JOIN */
++ u8 dss_csum:1,
++ is_sub:1, /* Is this a new subflow? */
++ low_prio:1, /* Interface set to low-prio? */
++ rcv_low_prio:1;
++};
++
++struct mptcp_options_received {
++ u16 saw_mpc:1,
++ dss_csum:1,
++ drop_me:1,
++
++ is_mp_join:1,
++ join_ack:1,
++
++ saw_low_prio:2, /* 0x1 - low-prio set for this subflow
++ * 0x2 - low-prio set for another subflow
++ */
++ low_prio:1,
++
++ saw_add_addr:2, /* Saw at least one add_addr option:
++ * 0x1: IPv4 - 0x2: IPv6
++ */
++ more_add_addr:1, /* Saw one more add-addr. */
++
++ saw_rem_addr:1, /* Saw at least one rem_addr option */
++ more_rem_addr:1, /* Saw one more rem-addr. */
++
++ mp_fail:1,
++ mp_fclose:1;
++ u8 rem_id; /* Address-id in the MP_JOIN */
++ u8 prio_addr_id; /* Address-id in the MP_PRIO */
++
++ const unsigned char *add_addr_ptr; /* Pointer to add-address option */
++ const unsigned char *rem_addr_ptr; /* Pointer to rem-address option */
++
++ u32 data_ack;
++ u32 data_seq;
++ u16 data_len;
++
++ u32 mptcp_rem_token;/* Remote token */
++
++ /* Key inside the option (from mp_capable or fast_close) */
++ u64 mptcp_key;
++
++ u32 mptcp_recv_nonce;
++ u64 mptcp_recv_tmac;
++ u8 mptcp_recv_mac[20];
++};
++
++struct mptcp_tcp_sock {
++ struct tcp_sock *next; /* Next subflow socket */
++ struct hlist_node cb_list;
++ struct mptcp_options_received rx_opt;
++
++ /* Those three fields record the current mapping */
++ u64 map_data_seq;
++ u32 map_subseq;
++ u16 map_data_len;
++ u16 slave_sk:1,
++ fully_established:1,
++ establish_increased:1,
++ second_packet:1,
++ attached:1,
++ send_mp_fail:1,
++ include_mpc:1,
++ mapping_present:1,
++ map_data_fin:1,
++ low_prio:1, /* use this socket as backup */
++ rcv_low_prio:1, /* Peer sent low-prio option to us */
++ send_mp_prio:1, /* Trigger to send mp_prio on this socket */
++ pre_established:1; /* State between sending 3rd ACK and
++ * receiving the fourth ack of new subflows.
++ */
++
++ /* isn: needed to translate abs to relative subflow seqnums */
++ u32 snt_isn;
++ u32 rcv_isn;
++ u8 path_index;
++ u8 loc_id;
++ u8 rem_id;
++
++#define MPTCP_SCHED_SIZE 4
++ u8 mptcp_sched[MPTCP_SCHED_SIZE] __aligned(8);
++
++ struct sk_buff *shortcut_ofoqueue; /* Shortcut to the current modified
++ * skb in the ofo-queue.
++ */
++
++ int init_rcv_wnd;
++ u32 infinite_cutoff_seq;
++ struct delayed_work work;
++ u32 mptcp_loc_nonce;
++ struct tcp_sock *tp; /* Where is my daddy? */
++ u32 last_end_data_seq;
++
++ /* MP_JOIN subflow: timer for retransmitting the 3rd ack */
++ struct timer_list mptcp_ack_timer;
++
++ /* HMAC of the third ack */
++ char sender_mac[20];
++};
++
++struct mptcp_tw {
++ struct list_head list;
++ u64 loc_key;
++ u64 rcv_nxt;
++ struct mptcp_cb __rcu *mpcb;
++ u8 meta_tw:1,
++ in_list:1;
++};
++
++#define MPTCP_PM_NAME_MAX 16
++struct mptcp_pm_ops {
++ struct list_head list;
++
++ /* Signal the creation of a new MPTCP-session. */
++ void (*new_session)(const struct sock *meta_sk);
++ void (*release_sock)(struct sock *meta_sk);
++ void (*fully_established)(struct sock *meta_sk);
++ void (*new_remote_address)(struct sock *meta_sk);
++ int (*get_local_id)(sa_family_t family, union inet_addr *addr,
++ struct net *net, bool *low_prio);
++ void (*addr_signal)(struct sock *sk, unsigned *size,
++ struct tcp_out_options *opts, struct sk_buff *skb);
++ void (*add_raddr)(struct mptcp_cb *mpcb, const union inet_addr *addr,
++ sa_family_t family, __be16 port, u8 id);
++ void (*rem_raddr)(struct mptcp_cb *mpcb, u8 rem_id);
++ void (*init_subsocket_v4)(struct sock *sk, struct in_addr addr);
++ void (*init_subsocket_v6)(struct sock *sk, struct in6_addr addr);
++
++ char name[MPTCP_PM_NAME_MAX];
++ struct module *owner;
++};
++
++#define MPTCP_SCHED_NAME_MAX 16
++struct mptcp_sched_ops {
++ struct list_head list;
++
++ struct sock * (*get_subflow)(struct sock *meta_sk,
++ struct sk_buff *skb,
++ bool zero_wnd_test);
++ struct sk_buff * (*next_segment)(struct sock *meta_sk,
++ int *reinject,
++ struct sock **subsk,
++ unsigned int *limit);
++ void (*init)(struct sock *sk);
++
++ char name[MPTCP_SCHED_NAME_MAX];
++ struct module *owner;
++};
++
++struct mptcp_cb {
++ /* list of sockets in this multipath connection */
++ struct tcp_sock *connection_list;
++ /* list of sockets that need a call to release_cb */
++ struct hlist_head callback_list;
++
++ /* High-order bits of 64-bit sequence numbers */
++ u32 snd_high_order[2];
++ u32 rcv_high_order[2];
++
++ u16 send_infinite_mapping:1,
++ in_time_wait:1,
++ list_rcvd:1, /* XXX TO REMOVE */
++ addr_signal:1, /* Path-manager wants us to call addr_signal */
++ dss_csum:1,
++ server_side:1,
++ infinite_mapping_rcv:1,
++ infinite_mapping_snd:1,
++ dfin_combined:1, /* Was the DFIN combined with subflow-fin? */
++ passive_close:1,
++ snd_hiseq_index:1, /* Index in snd_high_order of snd_nxt */
++ rcv_hiseq_index:1; /* Index in rcv_high_order of rcv_nxt */
++
++ /* socket count in this connection */
++ u8 cnt_subflows;
++ u8 cnt_established;
++
++ struct mptcp_sched_ops *sched_ops;
++
++ struct sk_buff_head reinject_queue;
++ /* First cache-line boundary is here minus 8 bytes. But from the
++ * reinject-queue only the next and prev pointers are regularly
++ * accessed. Thus, the whole data-path is on a single cache-line.
++ */
++
++ u64 csum_cutoff_seq;
++
++ /***** Start of fields, used for connection closure */
++ spinlock_t tw_lock;
++ unsigned char mptw_state;
++ u8 dfin_path_index;
++
++ struct list_head tw_list;
++
++ /***** Start of fields, used for subflow establishment and closure */
++ atomic_t mpcb_refcnt;
++
++ /* Mutex needed, because otherwise mptcp_close will complain that the
++ * socket is owned by the user.
++ * E.g., mptcp_sub_close_wq is taking the meta-lock.
++ */
++ struct mutex mpcb_mutex;
++
++ /***** Start of fields, used for subflow establishment */
++ struct sock *meta_sk;
++
++ /* Master socket, also part of the connection_list, this
++ * socket is the one that the application sees.
++ */
++ struct sock *master_sk;
++
++ __u64 mptcp_loc_key;
++ __u64 mptcp_rem_key;
++ __u32 mptcp_loc_token;
++ __u32 mptcp_rem_token;
++
++#define MPTCP_PM_SIZE 608
++ u8 mptcp_pm[MPTCP_PM_SIZE] __aligned(8);
++ struct mptcp_pm_ops *pm_ops;
++
++ u32 path_index_bits;
++ /* Next pi to pick up in case a new path becomes available */
++ u8 next_path_index;
++
++ /* Original snd/rcvbuf of the initial subflow.
++ * Used for the new subflows on the server-side to allow correct
++ * autotuning
++ */
++ int orig_sk_rcvbuf;
++ int orig_sk_sndbuf;
++ u32 orig_window_clamp;
++
++ /* Timer for retransmitting SYN/ACK+MP_JOIN */
++ struct timer_list synack_timer;
++};
++
++#define MPTCP_SUB_CAPABLE 0
++#define MPTCP_SUB_LEN_CAPABLE_SYN 12
++#define MPTCP_SUB_LEN_CAPABLE_SYN_ALIGN 12
++#define MPTCP_SUB_LEN_CAPABLE_ACK 20
++#define MPTCP_SUB_LEN_CAPABLE_ACK_ALIGN 20
++
++#define MPTCP_SUB_JOIN 1
++#define MPTCP_SUB_LEN_JOIN_SYN 12
++#define MPTCP_SUB_LEN_JOIN_SYN_ALIGN 12
++#define MPTCP_SUB_LEN_JOIN_SYNACK 16
++#define MPTCP_SUB_LEN_JOIN_SYNACK_ALIGN 16
++#define MPTCP_SUB_LEN_JOIN_ACK 24
++#define MPTCP_SUB_LEN_JOIN_ACK_ALIGN 24
++
++#define MPTCP_SUB_DSS 2
++#define MPTCP_SUB_LEN_DSS 4
++#define MPTCP_SUB_LEN_DSS_ALIGN 4
++
++/* Lengths for seq and ack are the ones without the generic MPTCP-option header,
++ * as they are part of the DSS-option.
++ * To get the total length, just add the different options together.
++ */
++#define MPTCP_SUB_LEN_SEQ 10
++#define MPTCP_SUB_LEN_SEQ_CSUM 12
++#define MPTCP_SUB_LEN_SEQ_ALIGN 12
++
++#define MPTCP_SUB_LEN_SEQ_64 14
++#define MPTCP_SUB_LEN_SEQ_CSUM_64 16
++#define MPTCP_SUB_LEN_SEQ_64_ALIGN 16
++
++#define MPTCP_SUB_LEN_ACK 4
++#define MPTCP_SUB_LEN_ACK_ALIGN 4
++
++#define MPTCP_SUB_LEN_ACK_64 8
++#define MPTCP_SUB_LEN_ACK_64_ALIGN 8
++
++/* This is the "default" option-length we will send out most often.
++ * MPTCP DSS-header
++ * 32-bit data sequence number
++ * 32-bit data ack
++ *
++ * It is necessary to calculate the effective MSS we will be using when
++ * sending data.
++ */
++#define MPTCP_SUB_LEN_DSM_ALIGN (MPTCP_SUB_LEN_DSS_ALIGN + \
++ MPTCP_SUB_LEN_SEQ_ALIGN + \
++ MPTCP_SUB_LEN_ACK_ALIGN)
++
++#define MPTCP_SUB_ADD_ADDR 3
++#define MPTCP_SUB_LEN_ADD_ADDR4 8
++#define MPTCP_SUB_LEN_ADD_ADDR6 20
++#define MPTCP_SUB_LEN_ADD_ADDR4_ALIGN 8
++#define MPTCP_SUB_LEN_ADD_ADDR6_ALIGN 20
++
++#define MPTCP_SUB_REMOVE_ADDR 4
++#define MPTCP_SUB_LEN_REMOVE_ADDR 4
++
++#define MPTCP_SUB_PRIO 5
++#define MPTCP_SUB_LEN_PRIO 3
++#define MPTCP_SUB_LEN_PRIO_ADDR 4
++#define MPTCP_SUB_LEN_PRIO_ALIGN 4
++
++#define MPTCP_SUB_FAIL 6
++#define MPTCP_SUB_LEN_FAIL 12
++#define MPTCP_SUB_LEN_FAIL_ALIGN 12
++
++#define MPTCP_SUB_FCLOSE 7
++#define MPTCP_SUB_LEN_FCLOSE 12
++#define MPTCP_SUB_LEN_FCLOSE_ALIGN 12
++
++
++#define OPTION_MPTCP (1 << 5)
++
++#ifdef CONFIG_MPTCP
++
++/* Used for checking if the mptcp initialization has been successful */
++extern bool mptcp_init_failed;
++
++/* MPTCP options */
++#define OPTION_TYPE_SYN (1 << 0)
++#define OPTION_TYPE_SYNACK (1 << 1)
++#define OPTION_TYPE_ACK (1 << 2)
++#define OPTION_MP_CAPABLE (1 << 3)
++#define OPTION_DATA_ACK (1 << 4)
++#define OPTION_ADD_ADDR (1 << 5)
++#define OPTION_MP_JOIN (1 << 6)
++#define OPTION_MP_FAIL (1 << 7)
++#define OPTION_MP_FCLOSE (1 << 8)
++#define OPTION_REMOVE_ADDR (1 << 9)
++#define OPTION_MP_PRIO (1 << 10)
++
++/* MPTCP flags: both TX and RX */
++#define MPTCPHDR_SEQ 0x01 /* DSS.M option is present */
++#define MPTCPHDR_FIN 0x02 /* DSS.F option is present */
++#define MPTCPHDR_SEQ64_INDEX 0x04 /* index of seq in mpcb->snd_high_order */
++/* MPTCP flags: RX only */
++#define MPTCPHDR_ACK 0x08
++#define MPTCPHDR_SEQ64_SET 0x10 /* Did we receive a 64-bit seq number? */
++#define MPTCPHDR_SEQ64_OFO 0x20 /* Is it not in our circular array? */
++#define MPTCPHDR_DSS_CSUM 0x40
++#define MPTCPHDR_JOIN 0x80
++/* MPTCP flags: TX only */
++#define MPTCPHDR_INF 0x08
++
++struct mptcp_option {
++ __u8 kind;
++ __u8 len;
++#if defined(__LITTLE_ENDIAN_BITFIELD)
++ __u8 ver:4,
++ sub:4;
++#elif defined(__BIG_ENDIAN_BITFIELD)
++ __u8 sub:4,
++ ver:4;
++#else
++#error "Adjust your <asm/byteorder.h> defines"
++#endif
++};
++
++struct mp_capable {
++ __u8 kind;
++ __u8 len;
++#if defined(__LITTLE_ENDIAN_BITFIELD)
++ __u8 ver:4,
++ sub:4;
++ __u8 h:1,
++ rsv:5,
++ b:1,
++ a:1;
++#elif defined(__BIG_ENDIAN_BITFIELD)
++ __u8 sub:4,
++ ver:4;
++ __u8 a:1,
++ b:1,
++ rsv:5,
++ h:1;
++#else
++#error "Adjust your <asm/byteorder.h> defines"
++#endif
++ __u64 sender_key;
++ __u64 receiver_key;
++} __attribute__((__packed__));
++
++struct mp_join {
++ __u8 kind;
++ __u8 len;
++#if defined(__LITTLE_ENDIAN_BITFIELD)
++ __u8 b:1,
++ rsv:3,
++ sub:4;
++#elif defined(__BIG_ENDIAN_BITFIELD)
++ __u8 sub:4,
++ rsv:3,
++ b:1;
++#else
++#error "Adjust your <asm/byteorder.h> defines"
++#endif
++ __u8 addr_id;
++ union {
++ struct {
++ u32 token;
++ u32 nonce;
++ } syn;
++ struct {
++ __u64 mac;
++ u32 nonce;
++ } synack;
++ struct {
++ __u8 mac[20];
++ } ack;
++ } u;
++} __attribute__((__packed__));
++
++struct mp_dss {
++ __u8 kind;
++ __u8 len;
++#if defined(__LITTLE_ENDIAN_BITFIELD)
++ __u16 rsv1:4,
++ sub:4,
++ A:1,
++ a:1,
++ M:1,
++ m:1,
++ F:1,
++ rsv2:3;
++#elif defined(__BIG_ENDIAN_BITFIELD)
++ __u16 sub:4,
++ rsv1:4,
++ rsv2:3,
++ F:1,
++ m:1,
++ M:1,
++ a:1,
++ A:1;
++#else
++#error "Adjust your <asm/byteorder.h> defines"
++#endif
++};
++
++struct mp_add_addr {
++ __u8 kind;
++ __u8 len;
++#if defined(__LITTLE_ENDIAN_BITFIELD)
++ __u8 ipver:4,
++ sub:4;
++#elif defined(__BIG_ENDIAN_BITFIELD)
++ __u8 sub:4,
++ ipver:4;
++#else
++#error "Adjust your <asm/byteorder.h> defines"
++#endif
++ __u8 addr_id;
++ union {
++ struct {
++ struct in_addr addr;
++ __be16 port;
++ } v4;
++ struct {
++ struct in6_addr addr;
++ __be16 port;
++ } v6;
++ } u;
++} __attribute__((__packed__));
++
++struct mp_remove_addr {
++ __u8 kind;
++ __u8 len;
++#if defined(__LITTLE_ENDIAN_BITFIELD)
++ __u8 rsv:4,
++ sub:4;
++#elif defined(__BIG_ENDIAN_BITFIELD)
++ __u8 sub:4,
++ rsv:4;
++#else
++#error "Adjust your <asm/byteorder.h> defines"
++#endif
++ /* list of addr_id */
++ __u8 addrs_id;
++};
++
++struct mp_fail {
++ __u8 kind;
++ __u8 len;
++#if defined(__LITTLE_ENDIAN_BITFIELD)
++ __u16 rsv1:4,
++ sub:4,
++ rsv2:8;
++#elif defined(__BIG_ENDIAN_BITFIELD)
++ __u16 sub:4,
++ rsv1:4,
++ rsv2:8;
++#else
++#error "Adjust your <asm/byteorder.h> defines"
++#endif
++ __be64 data_seq;
++} __attribute__((__packed__));
++
++struct mp_fclose {
++ __u8 kind;
++ __u8 len;
++#if defined(__LITTLE_ENDIAN_BITFIELD)
++ __u16 rsv1:4,
++ sub:4,
++ rsv2:8;
++#elif defined(__BIG_ENDIAN_BITFIELD)
++ __u16 sub:4,
++ rsv1:4,
++ rsv2:8;
++#else
++#error "Adjust your <asm/byteorder.h> defines"
++#endif
++ __u64 key;
++} __attribute__((__packed__));
++
++struct mp_prio {
++ __u8 kind;
++ __u8 len;
++#if defined(__LITTLE_ENDIAN_BITFIELD)
++ __u8 b:1,
++ rsv:3,
++ sub:4;
++#elif defined(__BIG_ENDIAN_BITFIELD)
++ __u8 sub:4,
++ rsv:3,
++ b:1;
++#else
++#error "Adjust your <asm/byteorder.h> defines"
++#endif
++ __u8 addr_id;
++} __attribute__((__packed__));
++
++static inline int mptcp_sub_len_dss(const struct mp_dss *m, const int csum)
++{
++ return 4 + m->A * (4 + m->a * 4) + m->M * (10 + m->m * 4 + csum * 2);
++}
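The DSS length arithmetic above can be checked outside the kernel. The sketch below reimplements the same computation in plain userspace C, with ordinary `int` parameters standing in for the `mp_dss` bitfield flags (`A`/`a`: data-ack present / 64-bit, `M`/`m`: mapping present / 64-bit dseq); `dss_len` is an illustrative name, not kernel code.

```c
/* Userspace sketch of mptcp_sub_len_dss(): total DSS option length from
 * its flag bits.  4 bytes of DSS header, plus a data-ack (4 bytes, or 8
 * when the 'a' flag selects 64-bit), plus a data-sequence mapping
 * (10 bytes, +4 for a 64-bit dseq, +2 for a DSS checksum).
 * Plain ints stand in for the mp_dss bitfields. */
static int dss_len(int A, int a, int M, int m, int csum)
{
	return 4 + A * (4 + a * 4) + M * (10 + m * 4 + csum * 2);
}
```

For example, a 32-bit data-ack plus a 32-bit mapping without checksum (`A=1, M=1`, rest 0) gives 18 bytes, matching MPTCP_SUB_LEN_DSS + MPTCP_SUB_LEN_ACK + MPTCP_SUB_LEN_SEQ below.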
++
++#define MPTCP_APP 2
++
++extern int sysctl_mptcp_enabled;
++extern int sysctl_mptcp_checksum;
++extern int sysctl_mptcp_debug;
++extern int sysctl_mptcp_syn_retries;
++
++extern struct workqueue_struct *mptcp_wq;
++
++#define mptcp_debug(fmt, args...) \
++ do { \
++ if (unlikely(sysctl_mptcp_debug)) \
++ pr_err(__FILE__ ": " fmt, ##args); \
++ } while (0)
++
++/* Iterates over all subflows */
++#define mptcp_for_each_tp(mpcb, tp) \
++ for ((tp) = (mpcb)->connection_list; (tp); (tp) = (tp)->mptcp->next)
++
++#define mptcp_for_each_sk(mpcb, sk) \
++ for ((sk) = (struct sock *)(mpcb)->connection_list; \
++ sk; \
++ sk = (struct sock *)tcp_sk(sk)->mptcp->next)
++
++#define mptcp_for_each_sk_safe(__mpcb, __sk, __temp) \
++ for (__sk = (struct sock *)(__mpcb)->connection_list, \
++ __temp = __sk ? (struct sock *)tcp_sk(__sk)->mptcp->next : NULL; \
++ __sk; \
++ __sk = __temp, \
++ __temp = __sk ? (struct sock *)tcp_sk(__sk)->mptcp->next : NULL)
++
++/* Iterates over all bits set to 1 in a bitset */
++#define mptcp_for_each_bit_set(b, i) \
++ for (i = ffs(b) - 1; i >= 0; i = ffs(b >> (i + 1) << (i + 1)) - 1)
++
++#define mptcp_for_each_bit_unset(b, i) \
++ mptcp_for_each_bit_set(~b, i)
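mptcp_for_each_bit_set walks the 1-bits of a bitfield lowest-first: ffs() finds the next set bit, and shifting right then left by `i + 1` clears bits 0..i before the next ffs() call. A standalone userspace sketch of the same idea (the `collect_set_bits` helper is illustrative, not kernel code):

```c
#include <strings.h>	/* ffs() */

/* Same iteration idea as mptcp_for_each_bit_set: after visiting bit i,
 * (b >> (i + 1) << (i + 1)) zeroes bits 0..i, so ffs() returns the next
 * set bit (or 0 when none are left, ending the loop at i == -1). */
#define for_each_bit_set(b, i) \
	for ((i) = ffs(b) - 1; (i) >= 0; \
	     (i) = ffs((b) >> ((i) + 1) << ((i) + 1)) - 1)

/* Illustrative helper: collect the set-bit indices of mask into out[],
 * returning how many were found. */
static int collect_set_bits(unsigned int mask, int *out)
{
	int i, n = 0;

	for_each_bit_set(mask, i)
		out[n++] = i;
	return n;
}
```

Like the kernel macro, this assumes bit 31 is unused (a shift by 32 would be undefined); in the mpcb path-index bitfield, index 0 is reserved for the meta-sk anyway.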
++
++extern struct lock_class_key meta_key;
++extern struct lock_class_key meta_slock_key;
++extern u32 mptcp_secret[MD5_MESSAGE_BYTES / 4];
++
++/* This is needed to ensure that two subsequent key/nonce-generation result in
++ * different keys/nonces if the IPs and ports are the same.
++ */
++extern u32 mptcp_seed;
++
++#define MPTCP_HASH_SIZE 1024
++
++extern struct hlist_nulls_head tk_hashtable[MPTCP_HASH_SIZE];
++
++/* This second hashtable is needed to retrieve request socks
++ * created as a result of a join request. While the SYN contains
++ * the token, the final ack does not, so we need a separate hashtable
++ * to retrieve the mpcb.
++ */
++extern struct hlist_nulls_head mptcp_reqsk_htb[MPTCP_HASH_SIZE];
++extern spinlock_t mptcp_reqsk_hlock; /* hashtable protection */
++
++/* Lock, protecting the two hash-tables that hold the token. Namely,
++ * mptcp_reqsk_tk_htb and tk_hashtable
++ */
++extern spinlock_t mptcp_tk_hashlock; /* hashtable protection */
++
++/* Request-sockets can be hashed in the tk_htb for collision-detection or in
++ * the regular htb for join-connections. We need to define different NULLS
++ * values so that we can correctly detect a request-socket that has been
++ * recycled. See also c25eb3bfb9729.
++ */
++#define MPTCP_REQSK_NULLS_BASE (1U << 29)
++
++
++void mptcp_data_ready(struct sock *sk);
++void mptcp_write_space(struct sock *sk);
++
++void mptcp_add_meta_ofo_queue(const struct sock *meta_sk, struct sk_buff *skb,
++ struct sock *sk);
++void mptcp_ofo_queue(struct sock *meta_sk);
++void mptcp_purge_ofo_queue(struct tcp_sock *meta_tp);
++void mptcp_cleanup_rbuf(struct sock *meta_sk, int copied);
++int mptcp_add_sock(struct sock *meta_sk, struct sock *sk, u8 loc_id, u8 rem_id,
++ gfp_t flags);
++void mptcp_del_sock(struct sock *sk);
++void mptcp_update_metasocket(struct sock *sock, const struct sock *meta_sk);
++void mptcp_reinject_data(struct sock *orig_sk, int clone_it);
++void mptcp_update_sndbuf(const struct tcp_sock *tp);
++void mptcp_send_fin(struct sock *meta_sk);
++void mptcp_send_active_reset(struct sock *meta_sk, gfp_t priority);
++bool mptcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
++ int push_one, gfp_t gfp);
++void tcp_parse_mptcp_options(const struct sk_buff *skb,
++ struct mptcp_options_received *mopt);
++void mptcp_parse_options(const uint8_t *ptr, int opsize,
++ struct mptcp_options_received *mopt,
++ const struct sk_buff *skb);
++void mptcp_syn_options(const struct sock *sk, struct tcp_out_options *opts,
++ unsigned *remaining);
++void mptcp_synack_options(struct request_sock *req,
++ struct tcp_out_options *opts,
++ unsigned *remaining);
++void mptcp_established_options(struct sock *sk, struct sk_buff *skb,
++ struct tcp_out_options *opts, unsigned *size);
++void mptcp_options_write(__be32 *ptr, struct tcp_sock *tp,
++ const struct tcp_out_options *opts,
++ struct sk_buff *skb);
++void mptcp_close(struct sock *meta_sk, long timeout);
++int mptcp_doit(struct sock *sk);
++int mptcp_create_master_sk(struct sock *meta_sk, __u64 remote_key, u32 window);
++int mptcp_check_req_fastopen(struct sock *child, struct request_sock *req);
++int mptcp_check_req_master(struct sock *sk, struct sock *child,
++ struct request_sock *req,
++ struct request_sock **prev);
++struct sock *mptcp_check_req_child(struct sock *sk, struct sock *child,
++ struct request_sock *req,
++ struct request_sock **prev,
++ const struct mptcp_options_received *mopt);
++u32 __mptcp_select_window(struct sock *sk);
++void mptcp_select_initial_window(int __space, __u32 mss, __u32 *rcv_wnd,
++ __u32 *window_clamp, int wscale_ok,
++ __u8 *rcv_wscale, __u32 init_rcv_wnd,
++ const struct sock *sk);
++unsigned int mptcp_current_mss(struct sock *meta_sk);
++int mptcp_select_size(const struct sock *meta_sk, bool sg);
++void mptcp_key_sha1(u64 key, u32 *token, u64 *idsn);
++void mptcp_hmac_sha1(u8 *key_1, u8 *key_2, u8 *rand_1, u8 *rand_2,
++ u32 *hash_out);
++void mptcp_clean_rtx_infinite(const struct sk_buff *skb, struct sock *sk);
++void mptcp_fin(struct sock *meta_sk);
++void mptcp_retransmit_timer(struct sock *meta_sk);
++int mptcp_write_wakeup(struct sock *meta_sk);
++void mptcp_sub_close_wq(struct work_struct *work);
++void mptcp_sub_close(struct sock *sk, unsigned long delay);
++struct sock *mptcp_select_ack_sock(const struct sock *meta_sk);
++void mptcp_fallback_meta_sk(struct sock *meta_sk);
++int mptcp_backlog_rcv(struct sock *meta_sk, struct sk_buff *skb);
++void mptcp_ack_handler(unsigned long);
++int mptcp_check_rtt(const struct tcp_sock *tp, int time);
++int mptcp_check_snd_buf(const struct tcp_sock *tp);
++int mptcp_handle_options(struct sock *sk, const struct tcphdr *th,
++ const struct sk_buff *skb);
++void __init mptcp_init(void);
++int mptcp_trim_head(struct sock *sk, struct sk_buff *skb, u32 len);
++void mptcp_destroy_sock(struct sock *sk);
++int mptcp_rcv_synsent_state_process(struct sock *sk, struct sock **skptr,
++ const struct sk_buff *skb,
++ const struct mptcp_options_received *mopt);
++unsigned int mptcp_xmit_size_goal(const struct sock *meta_sk, u32 mss_now,
++ int large_allowed);
++int mptcp_init_tw_sock(struct sock *sk, struct tcp_timewait_sock *tw);
++void mptcp_twsk_destructor(struct tcp_timewait_sock *tw);
++void mptcp_time_wait(struct sock *sk, int state, int timeo);
++void mptcp_disconnect(struct sock *sk);
++bool mptcp_should_expand_sndbuf(const struct sock *sk);
++int mptcp_retransmit_skb(struct sock *meta_sk, struct sk_buff *skb);
++void mptcp_tsq_flags(struct sock *sk);
++void mptcp_tsq_sub_deferred(struct sock *meta_sk);
++struct mp_join *mptcp_find_join(const struct sk_buff *skb);
++void mptcp_hash_remove_bh(struct tcp_sock *meta_tp);
++void mptcp_hash_remove(struct tcp_sock *meta_tp);
++struct sock *mptcp_hash_find(const struct net *net, const u32 token);
++int mptcp_lookup_join(struct sk_buff *skb, struct inet_timewait_sock *tw);
++int mptcp_do_join_short(struct sk_buff *skb,
++ const struct mptcp_options_received *mopt,
++ struct net *net);
++void mptcp_reqsk_destructor(struct request_sock *req);
++void mptcp_reqsk_new_mptcp(struct request_sock *req,
++ const struct mptcp_options_received *mopt,
++ const struct sk_buff *skb);
++int mptcp_check_req(struct sk_buff *skb, struct net *net);
++void mptcp_connect_init(struct sock *sk);
++void mptcp_sub_force_close(struct sock *sk);
++int mptcp_sub_len_remove_addr_align(u16 bitfield);
++void mptcp_remove_shortcuts(const struct mptcp_cb *mpcb,
++ const struct sk_buff *skb);
++void mptcp_init_buffer_space(struct sock *sk);
++void mptcp_join_reqsk_init(struct mptcp_cb *mpcb, const struct request_sock *req,
++ struct sk_buff *skb);
++void mptcp_reqsk_init(struct request_sock *req, const struct sk_buff *skb);
++int mptcp_conn_request(struct sock *sk, struct sk_buff *skb);
++void mptcp_init_congestion_control(struct sock *sk);
++
++/* MPTCP-path-manager registration/initialization functions */
++int mptcp_register_path_manager(struct mptcp_pm_ops *pm);
++void mptcp_unregister_path_manager(struct mptcp_pm_ops *pm);
++void mptcp_init_path_manager(struct mptcp_cb *mpcb);
++void mptcp_cleanup_path_manager(struct mptcp_cb *mpcb);
++void mptcp_fallback_default(struct mptcp_cb *mpcb);
++void mptcp_get_default_path_manager(char *name);
++int mptcp_set_default_path_manager(const char *name);
++extern struct mptcp_pm_ops mptcp_pm_default;
++
++/* MPTCP-scheduler registration/initialization functions */
++int mptcp_register_scheduler(struct mptcp_sched_ops *sched);
++void mptcp_unregister_scheduler(struct mptcp_sched_ops *sched);
++void mptcp_init_scheduler(struct mptcp_cb *mpcb);
++void mptcp_cleanup_scheduler(struct mptcp_cb *mpcb);
++void mptcp_get_default_scheduler(char *name);
++int mptcp_set_default_scheduler(const char *name);
++extern struct mptcp_sched_ops mptcp_sched_default;
++
++static inline void mptcp_reset_synack_timer(struct sock *meta_sk,
++ unsigned long len)
++{
++ sk_reset_timer(meta_sk, &tcp_sk(meta_sk)->mpcb->synack_timer,
++ jiffies + len);
++}
++
++static inline void mptcp_delete_synack_timer(struct sock *meta_sk)
++{
++ sk_stop_timer(meta_sk, &tcp_sk(meta_sk)->mpcb->synack_timer);
++}
++
++static inline bool is_mptcp_enabled(const struct sock *sk)
++{
++ if (!sysctl_mptcp_enabled || mptcp_init_failed)
++ return false;
++
++ if (sysctl_mptcp_enabled == MPTCP_APP && !tcp_sk(sk)->mptcp_enabled)
++ return false;
++
++ return true;
++}
++
++static inline int mptcp_pi_to_flag(int pi)
++{
++ return 1 << (pi - 1);
++}
++
++static inline
++struct mptcp_request_sock *mptcp_rsk(const struct request_sock *req)
++{
++ return (struct mptcp_request_sock *)req;
++}
++
++static inline
++struct request_sock *rev_mptcp_rsk(const struct mptcp_request_sock *req)
++{
++ return (struct request_sock *)req;
++}
++
++static inline bool mptcp_can_sendpage(struct sock *sk)
++{
++ struct sock *sk_it;
++
++ if (tcp_sk(sk)->mpcb->dss_csum)
++ return false;
++
++ mptcp_for_each_sk(tcp_sk(sk)->mpcb, sk_it) {
++ if (!(sk_it->sk_route_caps & NETIF_F_SG) ||
++ !(sk_it->sk_route_caps & NETIF_F_ALL_CSUM))
++ return false;
++ }
++
++ return true;
++}
++
++static inline void mptcp_push_pending_frames(struct sock *meta_sk)
++{
++	/* We check packets_out and the send-head here. TCP only checks the
++ * send-head. But, MPTCP also checks packets_out, as this is an
++ * indication that we might want to do opportunistic reinjection.
++ */
++ if (tcp_sk(meta_sk)->packets_out || tcp_send_head(meta_sk)) {
++ struct tcp_sock *tp = tcp_sk(meta_sk);
++
++ /* We don't care about the MSS, because it will be set in
++ * mptcp_write_xmit.
++ */
++ __tcp_push_pending_frames(meta_sk, 0, tp->nonagle);
++ }
++}
++
++static inline void mptcp_send_reset(struct sock *sk)
++{
++ tcp_sk(sk)->ops->send_active_reset(sk, GFP_ATOMIC);
++ mptcp_sub_force_close(sk);
++}
++
++static inline bool mptcp_is_data_seq(const struct sk_buff *skb)
++{
++ return TCP_SKB_CB(skb)->mptcp_flags & MPTCPHDR_SEQ;
++}
++
++static inline bool mptcp_is_data_fin(const struct sk_buff *skb)
++{
++ return TCP_SKB_CB(skb)->mptcp_flags & MPTCPHDR_FIN;
++}
++
++/* Is it a data-fin while in infinite mapping mode?
++ * In infinite mode, a subflow-fin is in fact a data-fin.
++ */
++static inline bool mptcp_is_data_fin2(const struct sk_buff *skb,
++ const struct tcp_sock *tp)
++{
++ return mptcp_is_data_fin(skb) ||
++ (tp->mpcb->infinite_mapping_rcv && tcp_hdr(skb)->fin);
++}
++
++static inline u8 mptcp_get_64_bit(u64 data_seq, struct mptcp_cb *mpcb)
++{
++ u64 data_seq_high = (u32)(data_seq >> 32);
++
++ if (mpcb->rcv_high_order[0] == data_seq_high)
++ return 0;
++ else if (mpcb->rcv_high_order[1] == data_seq_high)
++ return MPTCPHDR_SEQ64_INDEX;
++ else
++ return MPTCPHDR_SEQ64_OFO;
++}
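Because DSS options may carry only 32-bit data sequence numbers, the connection tracks two candidate high-order words and mptcp_get_64_bit classifies an incoming 64-bit dseq against them. A userspace sketch of that classification, where `classify_seq64` and the two-element `high[]` array are illustrative stand-ins for the function and mpcb->rcv_high_order:

```c
#include <stdint.h>

#define SEQ64_INDEX 0x04	/* mirrors MPTCPHDR_SEQ64_INDEX */
#define SEQ64_OFO   0x20	/* mirrors MPTCPHDR_SEQ64_OFO */

/* Compare the high 32 bits of a 64-bit data sequence number against the
 * two high-order words the connection currently tracks.  A match on
 * slot 0 needs no flag, a match on slot 1 sets the index flag, and no
 * match means the sequence falls outside the circular array (OFO). */
static uint8_t classify_seq64(uint64_t data_seq, const uint32_t high[2])
{
	uint32_t hi = (uint32_t)(data_seq >> 32);

	if (high[0] == hi)
		return 0;
	if (high[1] == hi)
		return SEQ64_INDEX;
	return SEQ64_OFO;
}
```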
++
++/* Sets the data_seq and returns pointer to the in-skb field of the data_seq.
++ * If the packet has a 64-bit dseq, the pointer points to the last 32 bits.
++ */
++static inline __u32 *mptcp_skb_set_data_seq(const struct sk_buff *skb,
++ u32 *data_seq,
++ struct mptcp_cb *mpcb)
++{
++ __u32 *ptr = (__u32 *)(skb_transport_header(skb) + TCP_SKB_CB(skb)->dss_off);
++
++ if (TCP_SKB_CB(skb)->mptcp_flags & MPTCPHDR_SEQ64_SET) {
++ u64 data_seq64 = get_unaligned_be64(ptr);
++
++ if (mpcb)
++ TCP_SKB_CB(skb)->mptcp_flags |= mptcp_get_64_bit(data_seq64, mpcb);
++
++ *data_seq = (u32)data_seq64;
++ ptr++;
++ } else {
++ *data_seq = get_unaligned_be32(ptr);
++ }
++
++ return ptr;
++}
++
++static inline struct sock *mptcp_meta_sk(const struct sock *sk)
++{
++ return tcp_sk(sk)->meta_sk;
++}
++
++static inline struct tcp_sock *mptcp_meta_tp(const struct tcp_sock *tp)
++{
++ return tcp_sk(tp->meta_sk);
++}
++
++static inline int is_meta_tp(const struct tcp_sock *tp)
++{
++ return tp->mpcb && mptcp_meta_tp(tp) == tp;
++}
++
++static inline int is_meta_sk(const struct sock *sk)
++{
++ return sk->sk_type == SOCK_STREAM && sk->sk_protocol == IPPROTO_TCP &&
++ mptcp(tcp_sk(sk)) && mptcp_meta_sk(sk) == sk;
++}
++
++static inline int is_master_tp(const struct tcp_sock *tp)
++{
++ return !mptcp(tp) || (!tp->mptcp->slave_sk && !is_meta_tp(tp));
++}
++
++static inline void mptcp_hash_request_remove(struct request_sock *req)
++{
++ int in_softirq = 0;
++
++ if (hlist_nulls_unhashed(&mptcp_rsk(req)->hash_entry))
++ return;
++
++ if (in_softirq()) {
++ spin_lock(&mptcp_reqsk_hlock);
++ in_softirq = 1;
++ } else {
++ spin_lock_bh(&mptcp_reqsk_hlock);
++ }
++
++ hlist_nulls_del_init_rcu(&mptcp_rsk(req)->hash_entry);
++
++ if (in_softirq)
++ spin_unlock(&mptcp_reqsk_hlock);
++ else
++ spin_unlock_bh(&mptcp_reqsk_hlock);
++}
++
++static inline void mptcp_init_mp_opt(struct mptcp_options_received *mopt)
++{
++ mopt->saw_mpc = 0;
++ mopt->dss_csum = 0;
++ mopt->drop_me = 0;
++
++ mopt->is_mp_join = 0;
++ mopt->join_ack = 0;
++
++ mopt->saw_low_prio = 0;
++ mopt->low_prio = 0;
++
++ mopt->saw_add_addr = 0;
++ mopt->more_add_addr = 0;
++
++ mopt->saw_rem_addr = 0;
++ mopt->more_rem_addr = 0;
++
++ mopt->mp_fail = 0;
++ mopt->mp_fclose = 0;
++}
++
++static inline void mptcp_reset_mopt(struct tcp_sock *tp)
++{
++ struct mptcp_options_received *mopt = &tp->mptcp->rx_opt;
++
++ mopt->saw_low_prio = 0;
++ mopt->saw_add_addr = 0;
++ mopt->more_add_addr = 0;
++ mopt->saw_rem_addr = 0;
++ mopt->more_rem_addr = 0;
++ mopt->join_ack = 0;
++ mopt->mp_fail = 0;
++ mopt->mp_fclose = 0;
++}
++
++static inline __be32 mptcp_get_highorder_sndbits(const struct sk_buff *skb,
++ const struct mptcp_cb *mpcb)
++{
++ return htonl(mpcb->snd_high_order[(TCP_SKB_CB(skb)->mptcp_flags &
++ MPTCPHDR_SEQ64_INDEX) ? 1 : 0]);
++}
++
++static inline u64 mptcp_get_data_seq_64(const struct mptcp_cb *mpcb, int index,
++ u32 data_seq_32)
++{
++ return ((u64)mpcb->rcv_high_order[index] << 32) | data_seq_32;
++}
++
++static inline u64 mptcp_get_rcv_nxt_64(const struct tcp_sock *meta_tp)
++{
++ struct mptcp_cb *mpcb = meta_tp->mpcb;
++ return mptcp_get_data_seq_64(mpcb, mpcb->rcv_hiseq_index,
++ meta_tp->rcv_nxt);
++}
++
++static inline void mptcp_check_sndseq_wrap(struct tcp_sock *meta_tp, int inc)
++{
++ if (unlikely(meta_tp->snd_nxt > meta_tp->snd_nxt + inc)) {
++ struct mptcp_cb *mpcb = meta_tp->mpcb;
++ mpcb->snd_hiseq_index = mpcb->snd_hiseq_index ? 0 : 1;
++ mpcb->snd_high_order[mpcb->snd_hiseq_index] += 2;
++ }
++}
++
++static inline void mptcp_check_rcvseq_wrap(struct tcp_sock *meta_tp,
++ u32 old_rcv_nxt)
++{
++ if (unlikely(old_rcv_nxt > meta_tp->rcv_nxt)) {
++ struct mptcp_cb *mpcb = meta_tp->mpcb;
++ mpcb->rcv_high_order[mpcb->rcv_hiseq_index] += 2;
++ mpcb->rcv_hiseq_index = mpcb->rcv_hiseq_index ? 0 : 1;
++ }
++}
++
++static inline int mptcp_sk_can_send(const struct sock *sk)
++{
++ return tcp_passive_fastopen(sk) ||
++ ((1 << sk->sk_state) & (TCPF_ESTABLISHED | TCPF_CLOSE_WAIT) &&
++ !tcp_sk(sk)->mptcp->pre_established);
++}
++
++static inline int mptcp_sk_can_recv(const struct sock *sk)
++{
++ return (1 << sk->sk_state) & (TCPF_ESTABLISHED | TCPF_FIN_WAIT1 | TCPF_FIN_WAIT2);
++}
++
++static inline int mptcp_sk_can_send_ack(const struct sock *sk)
++{
++ return !((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV |
++ TCPF_CLOSE | TCPF_LISTEN)) &&
++ !tcp_sk(sk)->mptcp->pre_established;
++}
++
++/* Only support GSO if all subflows supports it */
++static inline bool mptcp_sk_can_gso(const struct sock *meta_sk)
++{
++ struct sock *sk;
++
++ if (tcp_sk(meta_sk)->mpcb->dss_csum)
++ return false;
++
++ mptcp_for_each_sk(tcp_sk(meta_sk)->mpcb, sk) {
++ if (!mptcp_sk_can_send(sk))
++ continue;
++ if (!sk_can_gso(sk))
++ return false;
++ }
++ return true;
++}
++
++static inline bool mptcp_can_sg(const struct sock *meta_sk)
++{
++ struct sock *sk;
++
++ if (tcp_sk(meta_sk)->mpcb->dss_csum)
++ return false;
++
++ mptcp_for_each_sk(tcp_sk(meta_sk)->mpcb, sk) {
++ if (!mptcp_sk_can_send(sk))
++ continue;
++ if (!(sk->sk_route_caps & NETIF_F_SG))
++ return false;
++ }
++ return true;
++}
++
++static inline void mptcp_set_rto(struct sock *sk)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct sock *sk_it;
++ struct inet_connection_sock *micsk = inet_csk(mptcp_meta_sk(sk));
++ __u32 max_rto = 0;
++
++ /* We are in recovery-phase on the MPTCP-level. Do not update the
++ * RTO, because this would kill exponential backoff.
++ */
++ if (micsk->icsk_retransmits)
++ return;
++
++ mptcp_for_each_sk(tp->mpcb, sk_it) {
++ if (mptcp_sk_can_send(sk_it) &&
++ inet_csk(sk_it)->icsk_rto > max_rto)
++ max_rto = inet_csk(sk_it)->icsk_rto;
++ }
++ if (max_rto) {
++ micsk->icsk_rto = max_rto << 1;
++
++ /* A successful RTO measurement - reset backoff counter */
++ micsk->icsk_backoff = 0;
++ }
++}
++
++static inline int mptcp_sysctl_syn_retries(void)
++{
++ return sysctl_mptcp_syn_retries;
++}
++
++static inline void mptcp_sub_close_passive(struct sock *sk)
++{
++ struct sock *meta_sk = mptcp_meta_sk(sk);
++ struct tcp_sock *tp = tcp_sk(sk), *meta_tp = tcp_sk(meta_sk);
++
++ /* Only close if the app did a send-shutdown (passive close) and we
++ * received the data-ack of the data-fin.
++ */
++ if (tp->mpcb->passive_close && meta_tp->snd_una == meta_tp->write_seq)
++ mptcp_sub_close(sk, 0);
++}
++
++static inline bool mptcp_fallback_infinite(struct sock *sk, int flag)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++
++ /* If data has been acknowledged on the meta-level, fully_established
++ * will have been set before and thus we will not fall back to infinite
++ * mapping.
++ */
++ if (likely(tp->mptcp->fully_established))
++ return false;
++
++ if (!(flag & MPTCP_FLAG_DATA_ACKED))
++ return false;
++
++ /* Don't fallback twice ;) */
++ if (tp->mpcb->infinite_mapping_snd)
++ return false;
++
++ pr_err("%s %#x will fallback - pi %d, src %pI4 dst %pI4 from %pS\n",
++ __func__, tp->mpcb->mptcp_loc_token, tp->mptcp->path_index,
++ &inet_sk(sk)->inet_saddr, &inet_sk(sk)->inet_daddr,
++ __builtin_return_address(0));
++ if (!is_master_tp(tp))
++ return true;
++
++ tp->mpcb->infinite_mapping_snd = 1;
++ tp->mpcb->infinite_mapping_rcv = 1;
++ tp->mptcp->fully_established = 1;
++
++ return false;
++}
++
++/* Find the first index whose bit in the bit-field == 0 */
++static inline u8 mptcp_set_new_pathindex(struct mptcp_cb *mpcb)
++{
++ u8 base = mpcb->next_path_index;
++ int i;
++
++ /* Start at 1, because 0 is reserved for the meta-sk */
++ mptcp_for_each_bit_unset(mpcb->path_index_bits >> base, i) {
++ if (i + base < 1)
++ continue;
++ if (i + base >= sizeof(mpcb->path_index_bits) * 8)
++ break;
++ i += base;
++ mpcb->path_index_bits |= (1 << i);
++ mpcb->next_path_index = i + 1;
++ return i;
++ }
++ mptcp_for_each_bit_unset(mpcb->path_index_bits, i) {
++ if (i >= sizeof(mpcb->path_index_bits) * 8)
++ break;
++ if (i < 1)
++ continue;
++ mpcb->path_index_bits |= (1 << i);
++ mpcb->next_path_index = i + 1;
++ return i;
++ }
++
++ return 0;
++}
++
++static inline bool mptcp_v6_is_v4_mapped(const struct sock *sk)
++{
++ return sk->sk_family == AF_INET6 &&
++ ipv6_addr_type(&inet6_sk(sk)->saddr) == IPV6_ADDR_MAPPED;
++}
++
++/* TCP and MPTCP mpc flag-depending functions */
++u16 mptcp_select_window(struct sock *sk);
++void mptcp_init_buffer_space(struct sock *sk);
++void mptcp_tcp_set_rto(struct sock *sk);
++
++/* TCP and MPTCP flag-depending functions */
++bool mptcp_prune_ofo_queue(struct sock *sk);
++
++#else /* CONFIG_MPTCP */
++#define mptcp_debug(fmt, args...) \
++ do { \
++ } while (0)
++
++/* Without MPTCP, we just do one iteration
++ * over the only socket available. This assumes that
++ * the sk/tp arg is the socket in that case.
++ */
++#define mptcp_for_each_sk(mpcb, sk)
++#define mptcp_for_each_sk_safe(__mpcb, __sk, __temp)
++
++static inline bool mptcp_is_data_fin(const struct sk_buff *skb)
++{
++ return false;
++}
++static inline bool mptcp_is_data_seq(const struct sk_buff *skb)
++{
++ return false;
++}
++static inline struct sock *mptcp_meta_sk(const struct sock *sk)
++{
++ return NULL;
++}
++static inline struct tcp_sock *mptcp_meta_tp(const struct tcp_sock *tp)
++{
++ return NULL;
++}
++static inline int is_meta_sk(const struct sock *sk)
++{
++ return 0;
++}
++static inline int is_master_tp(const struct tcp_sock *tp)
++{
++ return 0;
++}
++static inline void mptcp_purge_ofo_queue(struct tcp_sock *meta_tp) {}
++static inline void mptcp_del_sock(const struct sock *sk) {}
++static inline void mptcp_update_metasocket(struct sock *sock, const struct sock *meta_sk) {}
++static inline void mptcp_reinject_data(struct sock *orig_sk, int clone_it) {}
++static inline void mptcp_update_sndbuf(const struct tcp_sock *tp) {}
++static inline void mptcp_clean_rtx_infinite(const struct sk_buff *skb,
++ const struct sock *sk) {}
++static inline void mptcp_sub_close(struct sock *sk, unsigned long delay) {}
++static inline void mptcp_set_rto(const struct sock *sk) {}
++static inline void mptcp_send_fin(const struct sock *meta_sk) {}
++static inline void mptcp_parse_options(const uint8_t *ptr, const int opsize,
++ const struct mptcp_options_received *mopt,
++ const struct sk_buff *skb) {}
++static inline void mptcp_syn_options(const struct sock *sk,
++ struct tcp_out_options *opts,
++ unsigned *remaining) {}
++static inline void mptcp_synack_options(struct request_sock *req,
++ struct tcp_out_options *opts,
++ unsigned *remaining) {}
++
++static inline void mptcp_established_options(struct sock *sk,
++ struct sk_buff *skb,
++ struct tcp_out_options *opts,
++ unsigned *size) {}
++static inline void mptcp_options_write(__be32 *ptr, struct tcp_sock *tp,
++ const struct tcp_out_options *opts,
++ struct sk_buff *skb) {}
++static inline void mptcp_close(struct sock *meta_sk, long timeout) {}
++static inline int mptcp_doit(struct sock *sk)
++{
++ return 0;
++}
++static inline int mptcp_check_req_fastopen(struct sock *child,
++ struct request_sock *req)
++{
++ return 1;
++}
++static inline int mptcp_check_req_master(const struct sock *sk,
++ const struct sock *child,
++ struct request_sock *req,
++ struct request_sock **prev)
++{
++ return 1;
++}
++static inline struct sock *mptcp_check_req_child(struct sock *sk,
++ struct sock *child,
++ struct request_sock *req,
++ struct request_sock **prev,
++ const struct mptcp_options_received *mopt)
++{
++ return NULL;
++}
++static inline unsigned int mptcp_current_mss(struct sock *meta_sk)
++{
++ return 0;
++}
++static inline int mptcp_select_size(const struct sock *meta_sk, bool sg)
++{
++ return 0;
++}
++static inline void mptcp_sub_close_passive(struct sock *sk) {}
++static inline bool mptcp_fallback_infinite(const struct sock *sk, int flag)
++{
++ return false;
++}
++static inline void mptcp_init_mp_opt(const struct mptcp_options_received *mopt) {}
++static inline int mptcp_check_rtt(const struct tcp_sock *tp, int time)
++{
++ return 0;
++}
++static inline int mptcp_check_snd_buf(const struct tcp_sock *tp)
++{
++ return 0;
++}
++static inline int mptcp_sysctl_syn_retries(void)
++{
++ return 0;
++}
++static inline void mptcp_send_reset(const struct sock *sk) {}
++static inline int mptcp_handle_options(struct sock *sk,
++ const struct tcphdr *th,
++ struct sk_buff *skb)
++{
++ return 0;
++}
++static inline void mptcp_reset_mopt(struct tcp_sock *tp) {}
++static inline void __init mptcp_init(void) {}
++static inline int mptcp_trim_head(struct sock *sk, struct sk_buff *skb, u32 len)
++{
++ return 0;
++}
++static inline bool mptcp_sk_can_gso(const struct sock *sk)
++{
++ return false;
++}
++static inline bool mptcp_can_sg(const struct sock *meta_sk)
++{
++ return false;
++}
++static inline unsigned int mptcp_xmit_size_goal(const struct sock *meta_sk,
++ u32 mss_now, int large_allowed)
++{
++ return 0;
++}
++static inline void mptcp_destroy_sock(struct sock *sk) {}
++static inline int mptcp_rcv_synsent_state_process(struct sock *sk,
++ struct sock **skptr,
++ struct sk_buff *skb,
++ const struct mptcp_options_received *mopt)
++{
++ return 0;
++}
++static inline bool mptcp_can_sendpage(struct sock *sk)
++{
++ return false;
++}
++static inline int mptcp_init_tw_sock(struct sock *sk,
++ struct tcp_timewait_sock *tw)
++{
++ return 0;
++}
++static inline void mptcp_twsk_destructor(struct tcp_timewait_sock *tw) {}
++static inline void mptcp_disconnect(struct sock *sk) {}
++static inline void mptcp_tsq_flags(struct sock *sk) {}
++static inline void mptcp_tsq_sub_deferred(struct sock *meta_sk) {}
++static inline void mptcp_hash_remove_bh(struct tcp_sock *meta_tp) {}
++static inline void mptcp_hash_remove(struct tcp_sock *meta_tp) {}
++static inline void mptcp_reqsk_new_mptcp(struct request_sock *req,
++ const struct tcp_options_received *rx_opt,
++ const struct mptcp_options_received *mopt,
++ const struct sk_buff *skb) {}
++static inline void mptcp_remove_shortcuts(const struct mptcp_cb *mpcb,
++ const struct sk_buff *skb) {}
++static inline void mptcp_delete_synack_timer(struct sock *meta_sk) {}
++#endif /* CONFIG_MPTCP */
++
++#endif /* _MPTCP_H */
+diff --git a/include/net/mptcp_v4.h b/include/net/mptcp_v4.h
+new file mode 100644
+index 000000000000..93ad97c77c5a
+--- /dev/null
++++ b/include/net/mptcp_v4.h
+@@ -0,0 +1,67 @@
++/*
++ * MPTCP implementation
++ *
++ * Initial Design & Implementation:
++ * Sébastien Barré <sebastien.barre@uclouvain.be>
++ *
++ * Current Maintainer & Author:
++ * Christoph Paasch <christoph.paasch@uclouvain.be>
++ *
++ * Additional authors:
++ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
++ * Gregory Detal <gregory.detal@uclouvain.be>
++ * Fabien Duchêne <fabien.duchene@uclouvain.be>
++ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
++ * Lavkesh Lahngir <lavkesh51@gmail.com>
++ * Andreas Ripke <ripke@neclab.eu>
++ * Vlad Dogaru <vlad.dogaru@intel.com>
++ * Octavian Purdila <octavian.purdila@intel.com>
++ * John Ronan <jronan@tssg.org>
++ * Catalin Nicutar <catalin.nicutar@gmail.com>
++ * Brandon Heller <brandonh@stanford.edu>
++ *
++ *
++ * This program is free software; you can redistribute it and/or
++ * modify it under the terms of the GNU General Public License
++ * as published by the Free Software Foundation; either version
++ * 2 of the License, or (at your option) any later version.
++ */
++
++#ifndef MPTCP_V4_H_
++#define MPTCP_V4_H_
++
++
++#include <linux/in.h>
++#include <linux/skbuff.h>
++#include <net/mptcp.h>
++#include <net/request_sock.h>
++#include <net/sock.h>
++
++extern struct request_sock_ops mptcp_request_sock_ops;
++extern const struct inet_connection_sock_af_ops mptcp_v4_specific;
++extern struct tcp_request_sock_ops mptcp_request_sock_ipv4_ops;
++extern struct tcp_request_sock_ops mptcp_join_request_sock_ipv4_ops;
++
++#ifdef CONFIG_MPTCP
++
++int mptcp_v4_do_rcv(struct sock *meta_sk, struct sk_buff *skb);
++struct sock *mptcp_v4_search_req(const __be16 rport, const __be32 raddr,
++ const __be32 laddr, const struct net *net);
++int mptcp_init4_subsockets(struct sock *meta_sk, const struct mptcp_loc4 *loc,
++ struct mptcp_rem4 *rem);
++int mptcp_pm_v4_init(void);
++void mptcp_pm_v4_undo(void);
++u32 mptcp_v4_get_nonce(__be32 saddr, __be32 daddr, __be16 sport, __be16 dport);
++u64 mptcp_v4_get_key(__be32 saddr, __be32 daddr, __be16 sport, __be16 dport);
++
++#else
++
++static inline int mptcp_v4_do_rcv(const struct sock *meta_sk,
++ const struct sk_buff *skb)
++{
++ return 0;
++}
++
++#endif /* CONFIG_MPTCP */
++
++#endif /* MPTCP_V4_H_ */
+diff --git a/include/net/mptcp_v6.h b/include/net/mptcp_v6.h
+new file mode 100644
+index 000000000000..49a4f30ccd4d
+--- /dev/null
++++ b/include/net/mptcp_v6.h
+@@ -0,0 +1,69 @@
++/*
++ * MPTCP implementation
++ *
++ * Initial Design & Implementation:
++ * Sébastien Barré <sebastien.barre@uclouvain.be>
++ *
++ * Current Maintainer & Author:
++ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
++ *
++ * Additional authors:
++ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
++ * Gregory Detal <gregory.detal@uclouvain.be>
++ * Fabien Duchêne <fabien.duchene@uclouvain.be>
++ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
++ * Lavkesh Lahngir <lavkesh51@gmail.com>
++ * Andreas Ripke <ripke@neclab.eu>
++ * Vlad Dogaru <vlad.dogaru@intel.com>
++ * Octavian Purdila <octavian.purdila@intel.com>
++ * John Ronan <jronan@tssg.org>
++ * Catalin Nicutar <catalin.nicutar@gmail.com>
++ * Brandon Heller <brandonh@stanford.edu>
++ *
++ *
++ * This program is free software; you can redistribute it and/or
++ * modify it under the terms of the GNU General Public License
++ * as published by the Free Software Foundation; either version
++ * 2 of the License, or (at your option) any later version.
++ */
++
++#ifndef _MPTCP_V6_H
++#define _MPTCP_V6_H
++
++#include <linux/in6.h>
++#include <net/if_inet6.h>
++
++#include <net/mptcp.h>
++
++
++#ifdef CONFIG_MPTCP
++extern const struct inet_connection_sock_af_ops mptcp_v6_mapped;
++extern const struct inet_connection_sock_af_ops mptcp_v6_specific;
++extern struct request_sock_ops mptcp6_request_sock_ops;
++extern struct tcp_request_sock_ops mptcp_request_sock_ipv6_ops;
++extern struct tcp_request_sock_ops mptcp_join_request_sock_ipv6_ops;
++
++int mptcp_v6_do_rcv(struct sock *meta_sk, struct sk_buff *skb);
++struct sock *mptcp_v6_search_req(const __be16 rport, const struct in6_addr *raddr,
++ const struct in6_addr *laddr, const struct net *net);
++int mptcp_init6_subsockets(struct sock *meta_sk, const struct mptcp_loc6 *loc,
++ struct mptcp_rem6 *rem);
++int mptcp_pm_v6_init(void);
++void mptcp_pm_v6_undo(void);
++__u32 mptcp_v6_get_nonce(const __be32 *saddr, const __be32 *daddr,
++ __be16 sport, __be16 dport);
++u64 mptcp_v6_get_key(const __be32 *saddr, const __be32 *daddr,
++ __be16 sport, __be16 dport);
++
++#else /* CONFIG_MPTCP */
++
++#define mptcp_v6_mapped ipv6_mapped
++
++static inline int mptcp_v6_do_rcv(struct sock *meta_sk, struct sk_buff *skb)
++{
++ return 0;
++}
++
++#endif /* CONFIG_MPTCP */
++
++#endif /* _MPTCP_V6_H */
+diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
+index 361d26077196..bae95a11c531 100644
+--- a/include/net/net_namespace.h
++++ b/include/net/net_namespace.h
+@@ -16,6 +16,7 @@
+ #include <net/netns/packet.h>
+ #include <net/netns/ipv4.h>
+ #include <net/netns/ipv6.h>
++#include <net/netns/mptcp.h>
+ #include <net/netns/ieee802154_6lowpan.h>
+ #include <net/netns/sctp.h>
+ #include <net/netns/dccp.h>
+@@ -92,6 +93,9 @@ struct net {
+ #if IS_ENABLED(CONFIG_IPV6)
+ struct netns_ipv6 ipv6;
+ #endif
++#if IS_ENABLED(CONFIG_MPTCP)
++ struct netns_mptcp mptcp;
++#endif
+ #if IS_ENABLED(CONFIG_IEEE802154_6LOWPAN)
+ struct netns_ieee802154_lowpan ieee802154_lowpan;
+ #endif
+diff --git a/include/net/netns/mptcp.h b/include/net/netns/mptcp.h
+new file mode 100644
+index 000000000000..bad418b04cc8
+--- /dev/null
++++ b/include/net/netns/mptcp.h
+@@ -0,0 +1,44 @@
++/*
++ * MPTCP implementation - MPTCP namespace
++ *
++ * Initial Design & Implementation:
++ * Sébastien Barré <sebastien.barre@uclouvain.be>
++ *
++ * Current Maintainer:
++ * Christoph Paasch <christoph.paasch@uclouvain.be>
++ *
++ * Additional authors:
++ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
++ * Gregory Detal <gregory.detal@uclouvain.be>
++ * Fabien Duchêne <fabien.duchene@uclouvain.be>
++ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
++ * Lavkesh Lahngir <lavkesh51@gmail.com>
++ * Andreas Ripke <ripke@neclab.eu>
++ * Vlad Dogaru <vlad.dogaru@intel.com>
++ * Octavian Purdila <octavian.purdila@intel.com>
++ * John Ronan <jronan@tssg.org>
++ * Catalin Nicutar <catalin.nicutar@gmail.com>
++ * Brandon Heller <brandonh@stanford.edu>
++ *
++ *
++ * This program is free software; you can redistribute it and/or
++ * modify it under the terms of the GNU General Public License
++ * as published by the Free Software Foundation; either version
++ * 2 of the License, or (at your option) any later version.
++ */
++
++#ifndef __NETNS_MPTCP_H__
++#define __NETNS_MPTCP_H__
++
++#include <linux/compiler.h>
++
++enum {
++ MPTCP_PM_FULLMESH = 0,
++ MPTCP_PM_MAX
++};
++
++struct netns_mptcp {
++ void *path_managers[MPTCP_PM_MAX];
++};
++
++#endif /* __NETNS_MPTCP_H__ */
+diff --git a/include/net/request_sock.h b/include/net/request_sock.h
+index 7f830ff67f08..e79e87a8e1a6 100644
+--- a/include/net/request_sock.h
++++ b/include/net/request_sock.h
+@@ -164,7 +164,7 @@ struct request_sock_queue {
+ };
+
+ int reqsk_queue_alloc(struct request_sock_queue *queue,
+- unsigned int nr_table_entries);
++ unsigned int nr_table_entries, gfp_t flags);
+
+ void __reqsk_queue_destroy(struct request_sock_queue *queue);
+ void reqsk_queue_destroy(struct request_sock_queue *queue);
+diff --git a/include/net/sock.h b/include/net/sock.h
+index 156350745700..0e23cae8861f 100644
+--- a/include/net/sock.h
++++ b/include/net/sock.h
+@@ -901,6 +901,16 @@ void sk_clear_memalloc(struct sock *sk);
+
+ int sk_wait_data(struct sock *sk, long *timeo);
+
++/* START - needed for MPTCP */
++struct sock *sk_prot_alloc(struct proto *prot, gfp_t priority, int family);
++void sock_lock_init(struct sock *sk);
++
++extern struct lock_class_key af_callback_keys[AF_MAX];
++extern char *const af_family_clock_key_strings[AF_MAX+1];
++
++#define SK_FLAGS_TIMESTAMP ((1UL << SOCK_TIMESTAMP) | (1UL << SOCK_TIMESTAMPING_RX_SOFTWARE))
++/* END - needed for MPTCP */
++
+ struct request_sock_ops;
+ struct timewait_sock_ops;
+ struct inet_hashinfo;
+diff --git a/include/net/tcp.h b/include/net/tcp.h
+index 7286db80e8b8..ff92e74cd684 100644
+--- a/include/net/tcp.h
++++ b/include/net/tcp.h
+@@ -177,6 +177,7 @@ void tcp_time_wait(struct sock *sk, int state, int timeo);
+ #define TCPOPT_SACK 5 /* SACK Block */
+ #define TCPOPT_TIMESTAMP 8 /* Better RTT estimations/PAWS */
+ #define TCPOPT_MD5SIG 19 /* MD5 Signature (RFC2385) */
++#define TCPOPT_MPTCP 30
+ #define TCPOPT_EXP 254 /* Experimental */
+ /* Magic number to be after the option value for sharing TCP
+ * experimental options. See draft-ietf-tcpm-experimental-options-00.txt
+@@ -229,6 +230,27 @@ void tcp_time_wait(struct sock *sk, int state, int timeo);
+ #define TFO_SERVER_WO_SOCKOPT1 0x400
+ #define TFO_SERVER_WO_SOCKOPT2 0x800
+
++/* Flags from tcp_input.c for tcp_ack */
++#define FLAG_DATA 0x01 /* Incoming frame contained data. */
++#define FLAG_WIN_UPDATE 0x02 /* Incoming ACK was a window update. */
++#define FLAG_DATA_ACKED 0x04 /* This ACK acknowledged new data. */
++#define FLAG_RETRANS_DATA_ACKED 0x08 /* "" "" some of which was retransmitted. */
++#define FLAG_SYN_ACKED 0x10 /* This ACK acknowledged SYN. */
++#define FLAG_DATA_SACKED 0x20 /* New SACK. */
++#define FLAG_ECE 0x40 /* ECE in this ACK */
++#define FLAG_SLOWPATH 0x100 /* Do not skip RFC checks for window update.*/
++#define FLAG_ORIG_SACK_ACKED 0x200 /* Never retransmitted data are (s)acked */
++#define FLAG_SND_UNA_ADVANCED 0x400 /* Snd_una was changed (!= FLAG_DATA_ACKED) */
++#define FLAG_DSACKING_ACK 0x800 /* SACK blocks contained D-SACK info */
++#define FLAG_SACK_RENEGING 0x2000 /* snd_una advanced to a sacked seq */
++#define FLAG_UPDATE_TS_RECENT 0x4000 /* tcp_replace_ts_recent() */
++#define MPTCP_FLAG_DATA_ACKED 0x8000
++
++#define FLAG_ACKED (FLAG_DATA_ACKED|FLAG_SYN_ACKED)
++#define FLAG_NOT_DUP (FLAG_DATA|FLAG_WIN_UPDATE|FLAG_ACKED)
++#define FLAG_CA_ALERT (FLAG_DATA_SACKED|FLAG_ECE)
++#define FLAG_FORWARD_PROGRESS (FLAG_ACKED|FLAG_DATA_SACKED)
++
+ extern struct inet_timewait_death_row tcp_death_row;
+
+ /* sysctl variables for tcp */
+@@ -344,6 +366,107 @@ extern struct proto tcp_prot;
+ #define TCP_ADD_STATS_USER(net, field, val) SNMP_ADD_STATS_USER((net)->mib.tcp_statistics, field, val)
+ #define TCP_ADD_STATS(net, field, val) SNMP_ADD_STATS((net)->mib.tcp_statistics, field, val)
+
++/**** START - Exports needed for MPTCP ****/
++extern const struct tcp_request_sock_ops tcp_request_sock_ipv4_ops;
++extern const struct tcp_request_sock_ops tcp_request_sock_ipv6_ops;
++
++struct mptcp_options_received;
++
++void tcp_enter_quickack_mode(struct sock *sk);
++int tcp_close_state(struct sock *sk);
++void tcp_minshall_update(struct tcp_sock *tp, unsigned int mss_now,
++ const struct sk_buff *skb);
++int tcp_xmit_probe_skb(struct sock *sk, int urgent);
++void tcp_event_new_data_sent(struct sock *sk, const struct sk_buff *skb);
++int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
++ gfp_t gfp_mask);
++unsigned int tcp_mss_split_point(const struct sock *sk,
++ const struct sk_buff *skb,
++ unsigned int mss_now,
++ unsigned int max_segs,
++ int nonagle);
++bool tcp_nagle_test(const struct tcp_sock *tp, const struct sk_buff *skb,
++ unsigned int cur_mss, int nonagle);
++bool tcp_snd_wnd_test(const struct tcp_sock *tp, const struct sk_buff *skb,
++ unsigned int cur_mss);
++unsigned int tcp_cwnd_test(const struct tcp_sock *tp, const struct sk_buff *skb);
++int tcp_init_tso_segs(const struct sock *sk, struct sk_buff *skb,
++ unsigned int mss_now);
++void __pskb_trim_head(struct sk_buff *skb, int len);
++void tcp_queue_skb(struct sock *sk, struct sk_buff *skb);
++void tcp_init_nondata_skb(struct sk_buff *skb, u32 seq, u8 flags);
++void tcp_reset(struct sock *sk);
++bool tcp_may_update_window(const struct tcp_sock *tp, const u32 ack,
++ const u32 ack_seq, const u32 nwin);
++bool tcp_urg_mode(const struct tcp_sock *tp);
++void tcp_ack_probe(struct sock *sk);
++void tcp_rearm_rto(struct sock *sk);
++int tcp_write_timeout(struct sock *sk);
++bool retransmits_timed_out(struct sock *sk, unsigned int boundary,
++ unsigned int timeout, bool syn_set);
++void tcp_write_err(struct sock *sk);
++void tcp_adjust_pcount(struct sock *sk, const struct sk_buff *skb, int decr);
++void tcp_set_skb_tso_segs(const struct sock *sk, struct sk_buff *skb,
++ unsigned int mss_now);
++
++int tcp_v4_rtx_synack(struct sock *sk, struct request_sock *req);
++void tcp_v4_reqsk_send_ack(struct sock *sk, struct sk_buff *skb,
++ struct request_sock *req);
++__u32 tcp_v4_init_sequence(const struct sk_buff *skb);
++int tcp_v4_send_synack(struct sock *sk, struct dst_entry *dst,
++ struct flowi *fl,
++ struct request_sock *req,
++ u16 queue_mapping,
++ struct tcp_fastopen_cookie *foc);
++void tcp_v4_send_reset(struct sock *sk, struct sk_buff *skb);
++struct ip_options_rcu *tcp_v4_save_options(struct sk_buff *skb);
++struct sock *tcp_v4_hnd_req(struct sock *sk, struct sk_buff *skb);
++void tcp_v4_reqsk_destructor(struct request_sock *req);
++
++int tcp_v6_rtx_synack(struct sock *sk, struct request_sock *req);
++void tcp_v6_reqsk_send_ack(struct sock *sk, struct sk_buff *skb,
++ struct request_sock *req);
++__u32 tcp_v6_init_sequence(const struct sk_buff *skb);
++int tcp_v6_send_synack(struct sock *sk, struct dst_entry *dst,
++ struct flowi *fl, struct request_sock *req,
++ u16 queue_mapping, struct tcp_fastopen_cookie *foc);
++void tcp_v6_send_reset(struct sock *sk, struct sk_buff *skb);
++int tcp_v6_do_rcv(struct sock *sk, struct sk_buff *skb);
++int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len);
++void tcp_v6_destroy_sock(struct sock *sk);
++void inet6_sk_rx_dst_set(struct sock *sk, const struct sk_buff *skb);
++void tcp_v6_hash(struct sock *sk);
++struct sock *tcp_v6_hnd_req(struct sock *sk, struct sk_buff *skb);
++struct sock *tcp_v6_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
++ struct request_sock *req,
++ struct dst_entry *dst);
++void tcp_v6_reqsk_destructor(struct request_sock *req);
++
++unsigned int tcp_xmit_size_goal(struct sock *sk, u32 mss_now,
++ int large_allowed);
++u32 tcp_tso_acked(struct sock *sk, struct sk_buff *skb);
++
++void skb_clone_fraglist(struct sk_buff *skb);
++void copy_skb_header(struct sk_buff *new, const struct sk_buff *old);
++
++void inet_twsk_free(struct inet_timewait_sock *tw);
++int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb);
++/* These states need RST on ABORT according to RFC793 */
++static inline bool tcp_need_reset(int state)
++{
++ return (1 << state) &
++ (TCPF_ESTABLISHED | TCPF_CLOSE_WAIT | TCPF_FIN_WAIT1 |
++ TCPF_FIN_WAIT2 | TCPF_SYN_RECV);
++}
++
++bool tcp_dma_try_early_copy(struct sock *sk, struct sk_buff *skb,
++ int hlen);
++int __must_check tcp_queue_rcv(struct sock *sk, struct sk_buff *skb, int hdrlen,
++ bool *fragstolen);
++bool tcp_try_coalesce(struct sock *sk, struct sk_buff *to,
++ struct sk_buff *from, bool *fragstolen);
++/**** END - Exports needed for MPTCP ****/
++
+ void tcp_tasklet_init(void);
+
+ void tcp_v4_err(struct sk_buff *skb, u32);
+@@ -440,6 +563,7 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
+ size_t len, int nonblock, int flags, int *addr_len);
+ void tcp_parse_options(const struct sk_buff *skb,
+ struct tcp_options_received *opt_rx,
++ struct mptcp_options_received *mopt_rx,
+ int estab, struct tcp_fastopen_cookie *foc);
+ const u8 *tcp_parse_md5sig_option(const struct tcphdr *th);
+
+@@ -493,14 +617,8 @@ static inline u32 tcp_cookie_time(void)
+
+ u32 __cookie_v4_init_sequence(const struct iphdr *iph, const struct tcphdr *th,
+ u16 *mssp);
+-__u32 cookie_v4_init_sequence(struct sock *sk, struct sk_buff *skb, __u16 *mss);
+-#else
+-static inline __u32 cookie_v4_init_sequence(struct sock *sk,
+- struct sk_buff *skb,
+- __u16 *mss)
+-{
+- return 0;
+-}
++__u32 cookie_v4_init_sequence(struct sock *sk, const struct sk_buff *skb,
++ __u16 *mss);
+ #endif
+
+ __u32 cookie_init_timestamp(struct request_sock *req);
+@@ -516,13 +634,6 @@ u32 __cookie_v6_init_sequence(const struct ipv6hdr *iph,
+ const struct tcphdr *th, u16 *mssp);
+ __u32 cookie_v6_init_sequence(struct sock *sk, const struct sk_buff *skb,
+ __u16 *mss);
+-#else
+-static inline __u32 cookie_v6_init_sequence(struct sock *sk,
+- struct sk_buff *skb,
+- __u16 *mss)
+-{
+- return 0;
+-}
+ #endif
+ /* tcp_output.c */
+
+@@ -551,10 +662,17 @@ void tcp_send_delayed_ack(struct sock *sk);
+ void tcp_send_loss_probe(struct sock *sk);
+ bool tcp_schedule_loss_probe(struct sock *sk);
+
++u16 tcp_select_window(struct sock *sk);
++bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
++ int push_one, gfp_t gfp);
++
+ /* tcp_input.c */
+ void tcp_resume_early_retransmit(struct sock *sk);
+ void tcp_rearm_rto(struct sock *sk);
+ void tcp_reset(struct sock *sk);
++void tcp_set_rto(struct sock *sk);
++bool tcp_should_expand_sndbuf(const struct sock *sk);
++bool tcp_prune_ofo_queue(struct sock *sk);
+
+ /* tcp_timer.c */
+ void tcp_init_xmit_timers(struct sock *);
+@@ -703,14 +821,27 @@ void tcp_send_window_probe(struct sock *sk);
+ */
+ struct tcp_skb_cb {
+ union {
+- struct inet_skb_parm h4;
++ union {
++ struct inet_skb_parm h4;
+ #if IS_ENABLED(CONFIG_IPV6)
+- struct inet6_skb_parm h6;
++ struct inet6_skb_parm h6;
+ #endif
+- } header; /* For incoming frames */
++ } header; /* For incoming frames */
++#ifdef CONFIG_MPTCP
++ union { /* For MPTCP outgoing frames */
++ __u32 path_mask; /* paths that tried to send this skb */
++ __u32 dss[6]; /* DSS options */
++ };
++#endif
++ };
+ __u32 seq; /* Starting sequence number */
+ __u32 end_seq; /* SEQ + FIN + SYN + datalen */
+ __u32 when; /* used to compute rtt's */
++#ifdef CONFIG_MPTCP
++ __u8 mptcp_flags; /* flags for the MPTCP layer */
++ __u8 dss_off; /* Number of 4-byte words until
++ * seq-number */
++#endif
+ __u8 tcp_flags; /* TCP header flags. (tcp[13]) */
+
+ __u8 sacked; /* State flags for SACK/FACK. */
+@@ -1075,7 +1206,8 @@ u32 tcp_default_init_rwnd(u32 mss);
+ /* Determine a window scaling and initial window to offer. */
+ void tcp_select_initial_window(int __space, __u32 mss, __u32 *rcv_wnd,
+ __u32 *window_clamp, int wscale_ok,
+- __u8 *rcv_wscale, __u32 init_rcv_wnd);
++ __u8 *rcv_wscale, __u32 init_rcv_wnd,
++ const struct sock *sk);
+
+ static inline int tcp_win_from_space(int space)
+ {
+@@ -1084,15 +1216,34 @@ static inline int tcp_win_from_space(int space)
+ space - (space>>sysctl_tcp_adv_win_scale);
+ }
+
++#ifdef CONFIG_MPTCP
++extern struct static_key mptcp_static_key;
++static inline bool mptcp(const struct tcp_sock *tp)
++{
++ return static_key_false(&mptcp_static_key) && tp->mpc;
++}
++#else
++static inline bool mptcp(const struct tcp_sock *tp)
++{
++ return 0;
++}
++#endif
++
+ /* Note: caller must be prepared to deal with negative returns */
+ static inline int tcp_space(const struct sock *sk)
+ {
++ if (mptcp(tcp_sk(sk)))
++ sk = tcp_sk(sk)->meta_sk;
++
+ return tcp_win_from_space(sk->sk_rcvbuf -
+ atomic_read(&sk->sk_rmem_alloc));
+ }
+
+ static inline int tcp_full_space(const struct sock *sk)
+ {
++ if (mptcp(tcp_sk(sk)))
++ sk = tcp_sk(sk)->meta_sk;
++
+ return tcp_win_from_space(sk->sk_rcvbuf);
+ }
+
+@@ -1115,6 +1266,8 @@ static inline void tcp_openreq_init(struct request_sock *req,
+ ireq->wscale_ok = rx_opt->wscale_ok;
+ ireq->acked = 0;
+ ireq->ecn_ok = 0;
++ ireq->mptcp_rqsk = 0;
++ ireq->saw_mpc = 0;
+ ireq->ir_rmt_port = tcp_hdr(skb)->source;
+ ireq->ir_num = ntohs(tcp_hdr(skb)->dest);
+ }
+@@ -1585,6 +1738,11 @@ int tcp4_proc_init(void);
+ void tcp4_proc_exit(void);
+ #endif
+
++int tcp_rtx_synack(struct sock *sk, struct request_sock *req);
++int tcp_conn_request(struct request_sock_ops *rsk_ops,
++ const struct tcp_request_sock_ops *af_ops,
++ struct sock *sk, struct sk_buff *skb);
++
+ /* TCP af-specific functions */
+ struct tcp_sock_af_ops {
+ #ifdef CONFIG_TCP_MD5SIG
+@@ -1601,7 +1759,32 @@ struct tcp_sock_af_ops {
+ #endif
+ };
+
++/* TCP/MPTCP-specific functions */
++struct tcp_sock_ops {
++ u32 (*__select_window)(struct sock *sk);
++ u16 (*select_window)(struct sock *sk);
++ void (*select_initial_window)(int __space, __u32 mss, __u32 *rcv_wnd,
++ __u32 *window_clamp, int wscale_ok,
++ __u8 *rcv_wscale, __u32 init_rcv_wnd,
++ const struct sock *sk);
++ void (*init_buffer_space)(struct sock *sk);
++ void (*set_rto)(struct sock *sk);
++ bool (*should_expand_sndbuf)(const struct sock *sk);
++ void (*send_fin)(struct sock *sk);
++ bool (*write_xmit)(struct sock *sk, unsigned int mss_now, int nonagle,
++ int push_one, gfp_t gfp);
++ void (*send_active_reset)(struct sock *sk, gfp_t priority);
++ int (*write_wakeup)(struct sock *sk);
++ bool (*prune_ofo_queue)(struct sock *sk);
++ void (*retransmit_timer)(struct sock *sk);
++ void (*time_wait)(struct sock *sk, int state, int timeo);
++ void (*cleanup_rbuf)(struct sock *sk, int copied);
++ void (*init_congestion_control)(struct sock *sk);
++};
++extern const struct tcp_sock_ops tcp_specific;
++
+ struct tcp_request_sock_ops {
++ u16 mss_clamp;
+ #ifdef CONFIG_TCP_MD5SIG
+ struct tcp_md5sig_key *(*md5_lookup) (struct sock *sk,
+ struct request_sock *req);
+@@ -1611,8 +1794,39 @@ struct tcp_request_sock_ops {
+ const struct request_sock *req,
+ const struct sk_buff *skb);
+ #endif
++ int (*init_req)(struct request_sock *req, struct sock *sk,
++ struct sk_buff *skb);
++#ifdef CONFIG_SYN_COOKIES
++ __u32 (*cookie_init_seq)(struct sock *sk, const struct sk_buff *skb,
++ __u16 *mss);
++#endif
++ struct dst_entry *(*route_req)(struct sock *sk, struct flowi *fl,
++ const struct request_sock *req,
++ bool *strict);
++ __u32 (*init_seq)(const struct sk_buff *skb);
++ int (*send_synack)(struct sock *sk, struct dst_entry *dst,
++ struct flowi *fl, struct request_sock *req,
++ u16 queue_mapping, struct tcp_fastopen_cookie *foc);
++ void (*queue_hash_add)(struct sock *sk, struct request_sock *req,
++ const unsigned long timeout);
+ };
+
++#ifdef CONFIG_SYN_COOKIES
++static inline __u32 cookie_init_sequence(const struct tcp_request_sock_ops *ops,
++ struct sock *sk, struct sk_buff *skb,
++ __u16 *mss)
++{
++ return ops->cookie_init_seq(sk, skb, mss);
++}
++#else
++static inline __u32 cookie_init_sequence(const struct tcp_request_sock_ops *ops,
++ struct sock *sk, struct sk_buff *skb,
++ __u16 *mss)
++{
++ return 0;
++}
++#endif
++
+ int tcpv4_offload_init(void);
+
+ void tcp_v4_init(void);
+diff --git a/include/uapi/linux/if.h b/include/uapi/linux/if.h
+index 9cf2394f0bcf..c2634b6ed854 100644
+--- a/include/uapi/linux/if.h
++++ b/include/uapi/linux/if.h
+@@ -109,6 +109,9 @@ enum net_device_flags {
+ #define IFF_DORMANT IFF_DORMANT
+ #define IFF_ECHO IFF_ECHO
+
++#define IFF_NOMULTIPATH 0x80000 /* Disable for MPTCP */
++#define IFF_MPBACKUP 0x100000 /* Use as backup path for MPTCP */
++
+ #define IFF_VOLATILE (IFF_LOOPBACK|IFF_POINTOPOINT|IFF_BROADCAST|IFF_ECHO|\
+ IFF_MASTER|IFF_SLAVE|IFF_RUNNING|IFF_LOWER_UP|IFF_DORMANT)
+
+diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
+index 3b9718328d8b..487475681d84 100644
+--- a/include/uapi/linux/tcp.h
++++ b/include/uapi/linux/tcp.h
+@@ -112,6 +112,7 @@ enum {
+ #define TCP_FASTOPEN 23 /* Enable FastOpen on listeners */
+ #define TCP_TIMESTAMP 24
+ #define TCP_NOTSENT_LOWAT 25 /* limit number of unsent bytes in write queue */
++#define MPTCP_ENABLED 26
+
+ struct tcp_repair_opt {
+ __u32 opt_code;
+diff --git a/net/Kconfig b/net/Kconfig
+index d92afe4204d9..96b58593ad5e 100644
+--- a/net/Kconfig
++++ b/net/Kconfig
+@@ -79,6 +79,7 @@ if INET
+ source "net/ipv4/Kconfig"
+ source "net/ipv6/Kconfig"
+ source "net/netlabel/Kconfig"
++source "net/mptcp/Kconfig"
+
+ endif # if INET
+
+diff --git a/net/Makefile b/net/Makefile
+index cbbbe6d657ca..244bac1435b1 100644
+--- a/net/Makefile
++++ b/net/Makefile
+@@ -20,6 +20,7 @@ obj-$(CONFIG_INET) += ipv4/
+ obj-$(CONFIG_XFRM) += xfrm/
+ obj-$(CONFIG_UNIX) += unix/
+ obj-$(CONFIG_NET) += ipv6/
++obj-$(CONFIG_MPTCP) += mptcp/
+ obj-$(CONFIG_PACKET) += packet/
+ obj-$(CONFIG_NET_KEY) += key/
+ obj-$(CONFIG_BRIDGE) += bridge/
+diff --git a/net/core/dev.c b/net/core/dev.c
+index 367a586d0c8a..215d2757fbf6 100644
+--- a/net/core/dev.c
++++ b/net/core/dev.c
+@@ -5420,7 +5420,7 @@ int __dev_change_flags(struct net_device *dev, unsigned int flags)
+
+ dev->flags = (flags & (IFF_DEBUG | IFF_NOTRAILERS | IFF_NOARP |
+ IFF_DYNAMIC | IFF_MULTICAST | IFF_PORTSEL |
+- IFF_AUTOMEDIA)) |
++ IFF_AUTOMEDIA | IFF_NOMULTIPATH | IFF_MPBACKUP)) |
+ (dev->flags & (IFF_UP | IFF_VOLATILE | IFF_PROMISC |
+ IFF_ALLMULTI));
+
+diff --git a/net/core/request_sock.c b/net/core/request_sock.c
+index 467f326126e0..909dfa13f499 100644
+--- a/net/core/request_sock.c
++++ b/net/core/request_sock.c
+@@ -38,7 +38,8 @@ int sysctl_max_syn_backlog = 256;
+ EXPORT_SYMBOL(sysctl_max_syn_backlog);
+
+ int reqsk_queue_alloc(struct request_sock_queue *queue,
+- unsigned int nr_table_entries)
++ unsigned int nr_table_entries,
++ gfp_t flags)
+ {
+ size_t lopt_size = sizeof(struct listen_sock);
+ struct listen_sock *lopt;
+@@ -48,9 +49,11 @@ int reqsk_queue_alloc(struct request_sock_queue *queue,
+ nr_table_entries = roundup_pow_of_two(nr_table_entries + 1);
+ lopt_size += nr_table_entries * sizeof(struct request_sock *);
+ if (lopt_size > PAGE_SIZE)
+- lopt = vzalloc(lopt_size);
++ lopt = __vmalloc(lopt_size,
++ flags | __GFP_HIGHMEM | __GFP_ZERO,
++ PAGE_KERNEL);
+ else
+- lopt = kzalloc(lopt_size, GFP_KERNEL);
++ lopt = kzalloc(lopt_size, flags);
+ if (lopt == NULL)
+ return -ENOMEM;
+
+diff --git a/net/core/skbuff.c b/net/core/skbuff.c
+index c1a33033cbe2..8abc5d60fbe3 100644
+--- a/net/core/skbuff.c
++++ b/net/core/skbuff.c
+@@ -472,7 +472,7 @@ static inline void skb_drop_fraglist(struct sk_buff *skb)
+ skb_drop_list(&skb_shinfo(skb)->frag_list);
+ }
+
+-static void skb_clone_fraglist(struct sk_buff *skb)
++void skb_clone_fraglist(struct sk_buff *skb)
+ {
+ struct sk_buff *list;
+
+@@ -897,7 +897,7 @@ static void skb_headers_offset_update(struct sk_buff *skb, int off)
+ skb->inner_mac_header += off;
+ }
+
+-static void copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
++void copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
+ {
+ __copy_skb_header(new, old);
+
+diff --git a/net/core/sock.c b/net/core/sock.c
+index 026e01f70274..359295523177 100644
+--- a/net/core/sock.c
++++ b/net/core/sock.c
+@@ -136,6 +136,11 @@
+
+ #include <trace/events/sock.h>
+
++#ifdef CONFIG_MPTCP
++#include <net/mptcp.h>
++#include <net/inet_common.h>
++#endif
++
+ #ifdef CONFIG_INET
+ #include <net/tcp.h>
+ #endif
+@@ -280,7 +285,7 @@ static const char *const af_family_slock_key_strings[AF_MAX+1] = {
+ "slock-AF_IEEE802154", "slock-AF_CAIF" , "slock-AF_ALG" ,
+ "slock-AF_NFC" , "slock-AF_VSOCK" ,"slock-AF_MAX"
+ };
+-static const char *const af_family_clock_key_strings[AF_MAX+1] = {
++char *const af_family_clock_key_strings[AF_MAX+1] = {
+ "clock-AF_UNSPEC", "clock-AF_UNIX" , "clock-AF_INET" ,
+ "clock-AF_AX25" , "clock-AF_IPX" , "clock-AF_APPLETALK",
+ "clock-AF_NETROM", "clock-AF_BRIDGE" , "clock-AF_ATMPVC" ,
+@@ -301,7 +306,7 @@ static const char *const af_family_clock_key_strings[AF_MAX+1] = {
+ * sk_callback_lock locking rules are per-address-family,
+ * so split the lock classes by using a per-AF key:
+ */
+-static struct lock_class_key af_callback_keys[AF_MAX];
++struct lock_class_key af_callback_keys[AF_MAX];
+
+ /* Take into consideration the size of the struct sk_buff overhead in the
+ * determination of these values, since that is non-constant across
+@@ -422,8 +427,6 @@ static void sock_warn_obsolete_bsdism(const char *name)
+ }
+ }
+
+-#define SK_FLAGS_TIMESTAMP ((1UL << SOCK_TIMESTAMP) | (1UL << SOCK_TIMESTAMPING_RX_SOFTWARE))
+-
+ static void sock_disable_timestamp(struct sock *sk, unsigned long flags)
+ {
+ if (sk->sk_flags & flags) {
+@@ -1253,8 +1256,25 @@ lenout:
+ *
+ * (We also register the sk_lock with the lock validator.)
+ */
+-static inline void sock_lock_init(struct sock *sk)
+-{
++void sock_lock_init(struct sock *sk)
++{
++#ifdef CONFIG_MPTCP
++ /* Reclassify the lock-class for subflows */
++ if (sk->sk_type == SOCK_STREAM && sk->sk_protocol == IPPROTO_TCP)
++ if (mptcp(tcp_sk(sk)) || tcp_sk(sk)->is_master_sk) {
++ sock_lock_init_class_and_name(sk, "slock-AF_INET-MPTCP",
++ &meta_slock_key,
++ "sk_lock-AF_INET-MPTCP",
++ &meta_key);
++
++ /* We don't yet have the mptcp-point.
++ * Thus we still need inet_sock_destruct
++ */
++ sk->sk_destruct = inet_sock_destruct;
++ return;
++ }
++#endif
++
+ sock_lock_init_class_and_name(sk,
+ af_family_slock_key_strings[sk->sk_family],
+ af_family_slock_keys + sk->sk_family,
+@@ -1301,7 +1321,7 @@ void sk_prot_clear_portaddr_nulls(struct sock *sk, int size)
+ }
+ EXPORT_SYMBOL(sk_prot_clear_portaddr_nulls);
+
+-static struct sock *sk_prot_alloc(struct proto *prot, gfp_t priority,
++struct sock *sk_prot_alloc(struct proto *prot, gfp_t priority,
+ int family)
+ {
+ struct sock *sk;
+diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
+index 4db3c2a1679c..04cb17d4b0ce 100644
+--- a/net/dccp/ipv6.c
++++ b/net/dccp/ipv6.c
+@@ -386,7 +386,7 @@ static int dccp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
+ if (sk_acceptq_is_full(sk) && inet_csk_reqsk_queue_young(sk) > 1)
+ goto drop;
+
+- req = inet6_reqsk_alloc(&dccp6_request_sock_ops);
++ req = inet_reqsk_alloc(&dccp6_request_sock_ops);
+ if (req == NULL)
+ goto drop;
+
+diff --git a/net/ipv4/Kconfig b/net/ipv4/Kconfig
+index 05c57f0fcabe..630434db0085 100644
+--- a/net/ipv4/Kconfig
++++ b/net/ipv4/Kconfig
+@@ -556,6 +556,30 @@ config TCP_CONG_ILLINOIS
+ For further details see:
+ http://www.ews.uiuc.edu/~shaoliu/tcpillinois/index.html
+
++config TCP_CONG_COUPLED
++ tristate "MPTCP COUPLED CONGESTION CONTROL"
++ depends on MPTCP
++ default n
++ ---help---
++ MultiPath TCP Coupled Congestion Control
++ To enable it, just put 'coupled' in tcp_congestion_control
++
++config TCP_CONG_OLIA
++ tristate "MPTCP Opportunistic Linked Increase"
++ depends on MPTCP
++ default n
++ ---help---
++ MultiPath TCP Opportunistic Linked Increase Congestion Control
++ To enable it, just put 'olia' in tcp_congestion_control
++
++config TCP_CONG_WVEGAS
++ tristate "MPTCP WVEGAS CONGESTION CONTROL"
++ depends on MPTCP
++ default n
++ ---help---
++ wVegas congestion control for MPTCP
++ To enable it, just put 'wvegas' in tcp_congestion_control
++
+ choice
+ prompt "Default TCP congestion control"
+ default DEFAULT_CUBIC
+@@ -584,6 +608,15 @@ choice
+ config DEFAULT_WESTWOOD
+ bool "Westwood" if TCP_CONG_WESTWOOD=y
+
++ config DEFAULT_COUPLED
++ bool "Coupled" if TCP_CONG_COUPLED=y
++
++ config DEFAULT_OLIA
++ bool "Olia" if TCP_CONG_OLIA=y
++
++ config DEFAULT_WVEGAS
++ bool "Wvegas" if TCP_CONG_WVEGAS=y
++
+ config DEFAULT_RENO
+ bool "Reno"
+
+@@ -605,6 +638,8 @@ config DEFAULT_TCP_CONG
+ default "vegas" if DEFAULT_VEGAS
+ default "westwood" if DEFAULT_WESTWOOD
+ default "veno" if DEFAULT_VENO
++ default "coupled" if DEFAULT_COUPLED
++ default "wvegas" if DEFAULT_WVEGAS
+ default "reno" if DEFAULT_RENO
+ default "cubic"
+
+diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
+index d156b3c5f363..4afd6d8d9028 100644
+--- a/net/ipv4/af_inet.c
++++ b/net/ipv4/af_inet.c
+@@ -104,6 +104,7 @@
+ #include <net/ip_fib.h>
+ #include <net/inet_connection_sock.h>
+ #include <net/tcp.h>
++#include <net/mptcp.h>
+ #include <net/udp.h>
+ #include <net/udplite.h>
+ #include <net/ping.h>
+@@ -246,8 +247,7 @@ EXPORT_SYMBOL(inet_listen);
+ * Create an inet socket.
+ */
+
+-static int inet_create(struct net *net, struct socket *sock, int protocol,
+- int kern)
++int inet_create(struct net *net, struct socket *sock, int protocol, int kern)
+ {
+ struct sock *sk;
+ struct inet_protosw *answer;
+@@ -676,6 +676,23 @@ int inet_accept(struct socket *sock, struct socket *newsock, int flags)
+ lock_sock(sk2);
+
+ sock_rps_record_flow(sk2);
++
++ if (sk2->sk_protocol == IPPROTO_TCP && mptcp(tcp_sk(sk2))) {
++ struct sock *sk_it = sk2;
++
++ mptcp_for_each_sk(tcp_sk(sk2)->mpcb, sk_it)
++ sock_rps_record_flow(sk_it);
++
++ if (tcp_sk(sk2)->mpcb->master_sk) {
++ sk_it = tcp_sk(sk2)->mpcb->master_sk;
++
++ write_lock_bh(&sk_it->sk_callback_lock);
++ sk_it->sk_wq = newsock->wq;
++ sk_it->sk_socket = newsock;
++ write_unlock_bh(&sk_it->sk_callback_lock);
++ }
++ }
++
+ WARN_ON(!((1 << sk2->sk_state) &
+ (TCPF_ESTABLISHED | TCPF_SYN_RECV |
+ TCPF_CLOSE_WAIT | TCPF_CLOSE)));
+@@ -1763,6 +1780,9 @@ static int __init inet_init(void)
+
+ ip_init();
+
++ /* We must initialize MPTCP before TCP. */
++ mptcp_init();
++
+ tcp_v4_init();
+
+ /* Setup TCP slab cache for open requests. */
+diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
+index 14d02ea905b6..7d734d8af19b 100644
+--- a/net/ipv4/inet_connection_sock.c
++++ b/net/ipv4/inet_connection_sock.c
+@@ -23,6 +23,7 @@
+ #include <net/route.h>
+ #include <net/tcp_states.h>
+ #include <net/xfrm.h>
++#include <net/mptcp.h>
+
+ #ifdef INET_CSK_DEBUG
+ const char inet_csk_timer_bug_msg[] = "inet_csk BUG: unknown timer value\n";
+@@ -465,8 +466,8 @@ no_route:
+ }
+ EXPORT_SYMBOL_GPL(inet_csk_route_child_sock);
+
+-static inline u32 inet_synq_hash(const __be32 raddr, const __be16 rport,
+- const u32 rnd, const u32 synq_hsize)
++u32 inet_synq_hash(const __be32 raddr, const __be16 rport, const u32 rnd,
++ const u32 synq_hsize)
+ {
+ return jhash_2words((__force u32)raddr, (__force u32)rport, rnd) & (synq_hsize - 1);
+ }
+@@ -647,7 +648,7 @@ void inet_csk_reqsk_queue_prune(struct sock *parent,
+
+ lopt->clock_hand = i;
+
+- if (lopt->qlen)
++ if (lopt->qlen && !is_meta_sk(parent))
+ inet_csk_reset_keepalive_timer(parent, interval);
+ }
+ EXPORT_SYMBOL_GPL(inet_csk_reqsk_queue_prune);
+@@ -664,7 +665,9 @@ struct sock *inet_csk_clone_lock(const struct sock *sk,
+ const struct request_sock *req,
+ const gfp_t priority)
+ {
+- struct sock *newsk = sk_clone_lock(sk, priority);
++ struct sock *newsk;
++
++ newsk = sk_clone_lock(sk, priority);
+
+ if (newsk != NULL) {
+ struct inet_connection_sock *newicsk = inet_csk(newsk);
+@@ -743,7 +746,8 @@ int inet_csk_listen_start(struct sock *sk, const int nr_table_entries)
+ {
+ struct inet_sock *inet = inet_sk(sk);
+ struct inet_connection_sock *icsk = inet_csk(sk);
+- int rc = reqsk_queue_alloc(&icsk->icsk_accept_queue, nr_table_entries);
++ int rc = reqsk_queue_alloc(&icsk->icsk_accept_queue, nr_table_entries,
++ GFP_KERNEL);
+
+ if (rc != 0)
+ return rc;
+@@ -801,9 +805,14 @@ void inet_csk_listen_stop(struct sock *sk)
+
+ while ((req = acc_req) != NULL) {
+ struct sock *child = req->sk;
++ bool mutex_taken = false;
+
+ acc_req = req->dl_next;
+
++ if (is_meta_sk(child)) {
++ mutex_lock(&tcp_sk(child)->mpcb->mpcb_mutex);
++ mutex_taken = true;
++ }
+ local_bh_disable();
+ bh_lock_sock(child);
+ WARN_ON(sock_owned_by_user(child));
+@@ -832,6 +841,8 @@ void inet_csk_listen_stop(struct sock *sk)
+
+ bh_unlock_sock(child);
+ local_bh_enable();
++ if (mutex_taken)
++ mutex_unlock(&tcp_sk(child)->mpcb->mpcb_mutex);
+ sock_put(child);
+
+ sk_acceptq_removed(sk);
+diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
+index c86624b36a62..0ff3fe004d62 100644
+--- a/net/ipv4/syncookies.c
++++ b/net/ipv4/syncookies.c
+@@ -170,7 +170,8 @@ u32 __cookie_v4_init_sequence(const struct iphdr *iph, const struct tcphdr *th,
+ }
+ EXPORT_SYMBOL_GPL(__cookie_v4_init_sequence);
+
+-__u32 cookie_v4_init_sequence(struct sock *sk, struct sk_buff *skb, __u16 *mssp)
++__u32 cookie_v4_init_sequence(struct sock *sk, const struct sk_buff *skb,
++ __u16 *mssp)
+ {
+ const struct iphdr *iph = ip_hdr(skb);
+ const struct tcphdr *th = tcp_hdr(skb);
+@@ -284,7 +285,7 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb,
+
+ /* check for timestamp cookie support */
+ memset(&tcp_opt, 0, sizeof(tcp_opt));
+- tcp_parse_options(skb, &tcp_opt, 0, NULL);
++ tcp_parse_options(skb, &tcp_opt, NULL, 0, NULL);
+
+ if (!cookie_check_timestamp(&tcp_opt, sock_net(sk), &ecn_ok))
+ goto out;
+@@ -355,10 +356,10 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb,
+ /* Try to redo what tcp_v4_send_synack did. */
+ req->window_clamp = tp->window_clamp ? :dst_metric(&rt->dst, RTAX_WINDOW);
+
+- tcp_select_initial_window(tcp_full_space(sk), req->mss,
+- &req->rcv_wnd, &req->window_clamp,
+- ireq->wscale_ok, &rcv_wscale,
+- dst_metric(&rt->dst, RTAX_INITRWND));
++ tp->ops->select_initial_window(tcp_full_space(sk), req->mss,
++ &req->rcv_wnd, &req->window_clamp,
++ ireq->wscale_ok, &rcv_wscale,
++ dst_metric(&rt->dst, RTAX_INITRWND), sk);
+
+ ireq->rcv_wscale = rcv_wscale;
+
+diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
+index 9d2118e5fbc7..2cb89f886d45 100644
+--- a/net/ipv4/tcp.c
++++ b/net/ipv4/tcp.c
+@@ -271,6 +271,7 @@
+
+ #include <net/icmp.h>
+ #include <net/inet_common.h>
++#include <net/mptcp.h>
+ #include <net/tcp.h>
+ #include <net/xfrm.h>
+ #include <net/ip.h>
+@@ -371,6 +372,24 @@ static int retrans_to_secs(u8 retrans, int timeout, int rto_max)
+ return period;
+ }
+
++const struct tcp_sock_ops tcp_specific = {
++ .__select_window = __tcp_select_window,
++ .select_window = tcp_select_window,
++ .select_initial_window = tcp_select_initial_window,
++ .init_buffer_space = tcp_init_buffer_space,
++ .set_rto = tcp_set_rto,
++ .should_expand_sndbuf = tcp_should_expand_sndbuf,
++ .init_congestion_control = tcp_init_congestion_control,
++ .send_fin = tcp_send_fin,
++ .write_xmit = tcp_write_xmit,
++ .send_active_reset = tcp_send_active_reset,
++ .write_wakeup = tcp_write_wakeup,
++ .prune_ofo_queue = tcp_prune_ofo_queue,
++ .retransmit_timer = tcp_retransmit_timer,
++ .time_wait = tcp_time_wait,
++ .cleanup_rbuf = tcp_cleanup_rbuf,
++};
++
+ /* Address-family independent initialization for a tcp_sock.
+ *
+ * NOTE: A lot of things set to zero explicitly by call to
+@@ -419,6 +438,8 @@ void tcp_init_sock(struct sock *sk)
+ sk->sk_sndbuf = sysctl_tcp_wmem[1];
+ sk->sk_rcvbuf = sysctl_tcp_rmem[1];
+
++ tp->ops = &tcp_specific;
++
+ local_bh_disable();
+ sock_update_memcg(sk);
+ sk_sockets_allocated_inc(sk);
+@@ -726,6 +747,14 @@ ssize_t tcp_splice_read(struct socket *sock, loff_t *ppos,
+ int ret;
+
+ sock_rps_record_flow(sk);
++
++#ifdef CONFIG_MPTCP
++ if (mptcp(tcp_sk(sk))) {
++ struct sock *sk_it;
++ mptcp_for_each_sk(tcp_sk(sk)->mpcb, sk_it)
++ sock_rps_record_flow(sk_it);
++ }
++#endif
+ /*
+ * We can't seek on a socket input
+ */
+@@ -821,8 +850,7 @@ struct sk_buff *sk_stream_alloc_skb(struct sock *sk, int size, gfp_t gfp)
+ return NULL;
+ }
+
+-static unsigned int tcp_xmit_size_goal(struct sock *sk, u32 mss_now,
+- int large_allowed)
++unsigned int tcp_xmit_size_goal(struct sock *sk, u32 mss_now, int large_allowed)
+ {
+ struct tcp_sock *tp = tcp_sk(sk);
+ u32 xmit_size_goal, old_size_goal;
+@@ -872,8 +900,13 @@ static int tcp_send_mss(struct sock *sk, int *size_goal, int flags)
+ {
+ int mss_now;
+
+- mss_now = tcp_current_mss(sk);
+- *size_goal = tcp_xmit_size_goal(sk, mss_now, !(flags & MSG_OOB));
++ if (mptcp(tcp_sk(sk))) {
++ mss_now = mptcp_current_mss(sk);
++ *size_goal = mptcp_xmit_size_goal(sk, mss_now, !(flags & MSG_OOB));
++ } else {
++ mss_now = tcp_current_mss(sk);
++ *size_goal = tcp_xmit_size_goal(sk, mss_now, !(flags & MSG_OOB));
++ }
+
+ return mss_now;
+ }
+@@ -892,11 +925,32 @@ static ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
+ * is fully established.
+ */
+ if (((1 << sk->sk_state) & ~(TCPF_ESTABLISHED | TCPF_CLOSE_WAIT)) &&
+- !tcp_passive_fastopen(sk)) {
++ !tcp_passive_fastopen(mptcp(tp) && tp->mpcb->master_sk ?
++ tp->mpcb->master_sk : sk)) {
+ if ((err = sk_stream_wait_connect(sk, &timeo)) != 0)
+ goto out_err;
+ }
+
++ if (mptcp(tp)) {
++ struct sock *sk_it = sk;
++
++ /* We must check this with the socket lock held because we iterate
++ * over the subflows.
++ */
++ if (!mptcp_can_sendpage(sk)) {
++ ssize_t ret;
++
++ release_sock(sk);
++ ret = sock_no_sendpage(sk->sk_socket, page, offset,
++ size, flags);
++ lock_sock(sk);
++ return ret;
++ }
++
++ mptcp_for_each_sk(tp->mpcb, sk_it)
++ sock_rps_record_flow(sk_it);
++ }
++
+ clear_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags);
+
+ mss_now = tcp_send_mss(sk, &size_goal, flags);
+@@ -1001,8 +1055,9 @@ int tcp_sendpage(struct sock *sk, struct page *page, int offset,
+ {
+ ssize_t res;
+
+- if (!(sk->sk_route_caps & NETIF_F_SG) ||
+- !(sk->sk_route_caps & NETIF_F_ALL_CSUM))
++ /* If MPTCP is enabled, we check it later after establishment */
++ if (!mptcp(tcp_sk(sk)) && (!(sk->sk_route_caps & NETIF_F_SG) ||
++ !(sk->sk_route_caps & NETIF_F_ALL_CSUM)))
+ return sock_no_sendpage(sk->sk_socket, page, offset, size,
+ flags);
+
+@@ -1018,6 +1073,9 @@ static inline int select_size(const struct sock *sk, bool sg)
+ const struct tcp_sock *tp = tcp_sk(sk);
+ int tmp = tp->mss_cache;
+
++ if (mptcp(tp))
++ return mptcp_select_size(sk, sg);
++
+ if (sg) {
+ if (sk_can_gso(sk)) {
+ /* Small frames wont use a full page:
+@@ -1100,11 +1158,18 @@ int tcp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
+ * is fully established.
+ */
+ if (((1 << sk->sk_state) & ~(TCPF_ESTABLISHED | TCPF_CLOSE_WAIT)) &&
+- !tcp_passive_fastopen(sk)) {
++ !tcp_passive_fastopen(mptcp(tp) && tp->mpcb->master_sk ?
++ tp->mpcb->master_sk : sk)) {
+ if ((err = sk_stream_wait_connect(sk, &timeo)) != 0)
+ goto do_error;
+ }
+
++ if (mptcp(tp)) {
++ struct sock *sk_it = sk;
++ mptcp_for_each_sk(tp->mpcb, sk_it)
++ sock_rps_record_flow(sk_it);
++ }
++
+ if (unlikely(tp->repair)) {
+ if (tp->repair_queue == TCP_RECV_QUEUE) {
+ copied = tcp_send_rcvq(sk, msg, size);
+@@ -1132,7 +1197,10 @@ int tcp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
+ if (sk->sk_err || (sk->sk_shutdown & SEND_SHUTDOWN))
+ goto out_err;
+
+- sg = !!(sk->sk_route_caps & NETIF_F_SG);
++ if (mptcp(tp))
++ sg = mptcp_can_sg(sk);
++ else
++ sg = !!(sk->sk_route_caps & NETIF_F_SG);
+
+ while (--iovlen >= 0) {
+ size_t seglen = iov->iov_len;
+@@ -1183,8 +1251,15 @@ new_segment:
+
+ /*
+ * Check whether we can use HW checksum.
++ *
++ * If dss-csum is enabled, we do not do hw-csum.
++ * In case of non-mptcp we check the
++ * device-capabilities.
++ * In case of mptcp, hw-csum's will be handled
++ * later in mptcp_write_xmit.
+ */
+- if (sk->sk_route_caps & NETIF_F_ALL_CSUM)
++ if (((mptcp(tp) && !tp->mpcb->dss_csum) || !mptcp(tp)) &&
++ (mptcp(tp) || sk->sk_route_caps & NETIF_F_ALL_CSUM))
+ skb->ip_summed = CHECKSUM_PARTIAL;
+
+ skb_entail(sk, skb);
+@@ -1422,7 +1497,7 @@ void tcp_cleanup_rbuf(struct sock *sk, int copied)
+
+ /* Optimize, __tcp_select_window() is not cheap. */
+ if (2*rcv_window_now <= tp->window_clamp) {
+- __u32 new_window = __tcp_select_window(sk);
++ __u32 new_window = tp->ops->__select_window(sk);
+
+ /* Send ACK now, if this read freed lots of space
+ * in our buffer. Certainly, new_window is new window.
+@@ -1587,7 +1662,7 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
+ /* Clean up data we have read: This will do ACK frames. */
+ if (copied > 0) {
+ tcp_recv_skb(sk, seq, &offset);
+- tcp_cleanup_rbuf(sk, copied);
++ tp->ops->cleanup_rbuf(sk, copied);
+ }
+ return copied;
+ }
+@@ -1623,6 +1698,14 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
+
+ lock_sock(sk);
+
++#ifdef CONFIG_MPTCP
++ if (mptcp(tp)) {
++ struct sock *sk_it;
++ mptcp_for_each_sk(tp->mpcb, sk_it)
++ sock_rps_record_flow(sk_it);
++ }
++#endif
++
+ err = -ENOTCONN;
+ if (sk->sk_state == TCP_LISTEN)
+ goto out;
+@@ -1761,7 +1844,7 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
+ }
+ }
+
+- tcp_cleanup_rbuf(sk, copied);
++ tp->ops->cleanup_rbuf(sk, copied);
+
+ if (!sysctl_tcp_low_latency && tp->ucopy.task == user_recv) {
+ /* Install new reader */
+@@ -1813,7 +1896,7 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
+ if (tp->rcv_wnd == 0 &&
+ !skb_queue_empty(&sk->sk_async_wait_queue)) {
+ tcp_service_net_dma(sk, true);
+- tcp_cleanup_rbuf(sk, copied);
++ tp->ops->cleanup_rbuf(sk, copied);
+ } else
+ dma_async_issue_pending(tp->ucopy.dma_chan);
+ }
+@@ -1993,7 +2076,7 @@ skip_copy:
+ */
+
+ /* Clean up data we have read: This will do ACK frames. */
+- tcp_cleanup_rbuf(sk, copied);
++ tp->ops->cleanup_rbuf(sk, copied);
+
+ release_sock(sk);
+ return copied;
+@@ -2070,7 +2153,7 @@ static const unsigned char new_state[16] = {
+ /* TCP_CLOSING */ TCP_CLOSING,
+ };
+
+-static int tcp_close_state(struct sock *sk)
++int tcp_close_state(struct sock *sk)
+ {
+ int next = (int)new_state[sk->sk_state];
+ int ns = next & TCP_STATE_MASK;
+@@ -2100,7 +2183,7 @@ void tcp_shutdown(struct sock *sk, int how)
+ TCPF_SYN_RECV | TCPF_CLOSE_WAIT)) {
+ /* Clear out any half completed packets. FIN if needed. */
+ if (tcp_close_state(sk))
+- tcp_send_fin(sk);
++ tcp_sk(sk)->ops->send_fin(sk);
+ }
+ }
+ EXPORT_SYMBOL(tcp_shutdown);
+@@ -2125,6 +2208,11 @@ void tcp_close(struct sock *sk, long timeout)
+ int data_was_unread = 0;
+ int state;
+
++ if (is_meta_sk(sk)) {
++ mptcp_close(sk, timeout);
++ return;
++ }
++
+ lock_sock(sk);
+ sk->sk_shutdown = SHUTDOWN_MASK;
+
+@@ -2167,7 +2255,7 @@ void tcp_close(struct sock *sk, long timeout)
+ /* Unread data was tossed, zap the connection. */
+ NET_INC_STATS_USER(sock_net(sk), LINUX_MIB_TCPABORTONCLOSE);
+ tcp_set_state(sk, TCP_CLOSE);
+- tcp_send_active_reset(sk, sk->sk_allocation);
++ tcp_sk(sk)->ops->send_active_reset(sk, sk->sk_allocation);
+ } else if (sock_flag(sk, SOCK_LINGER) && !sk->sk_lingertime) {
+ /* Check zero linger _after_ checking for unread data. */
+ sk->sk_prot->disconnect(sk, 0);
+@@ -2247,7 +2335,7 @@ adjudge_to_death:
+ struct tcp_sock *tp = tcp_sk(sk);
+ if (tp->linger2 < 0) {
+ tcp_set_state(sk, TCP_CLOSE);
+- tcp_send_active_reset(sk, GFP_ATOMIC);
++ tp->ops->send_active_reset(sk, GFP_ATOMIC);
+ NET_INC_STATS_BH(sock_net(sk),
+ LINUX_MIB_TCPABORTONLINGER);
+ } else {
+@@ -2257,7 +2345,8 @@ adjudge_to_death:
+ inet_csk_reset_keepalive_timer(sk,
+ tmo - TCP_TIMEWAIT_LEN);
+ } else {
+- tcp_time_wait(sk, TCP_FIN_WAIT2, tmo);
++ tcp_sk(sk)->ops->time_wait(sk, TCP_FIN_WAIT2,
++ tmo);
+ goto out;
+ }
+ }
+@@ -2266,7 +2355,7 @@ adjudge_to_death:
+ sk_mem_reclaim(sk);
+ if (tcp_check_oom(sk, 0)) {
+ tcp_set_state(sk, TCP_CLOSE);
+- tcp_send_active_reset(sk, GFP_ATOMIC);
++ tcp_sk(sk)->ops->send_active_reset(sk, GFP_ATOMIC);
+ NET_INC_STATS_BH(sock_net(sk),
+ LINUX_MIB_TCPABORTONMEMORY);
+ }
+@@ -2291,15 +2380,6 @@ out:
+ }
+ EXPORT_SYMBOL(tcp_close);
+
+-/* These states need RST on ABORT according to RFC793 */
+-
+-static inline bool tcp_need_reset(int state)
+-{
+- return (1 << state) &
+- (TCPF_ESTABLISHED | TCPF_CLOSE_WAIT | TCPF_FIN_WAIT1 |
+- TCPF_FIN_WAIT2 | TCPF_SYN_RECV);
+-}
+-
+ int tcp_disconnect(struct sock *sk, int flags)
+ {
+ struct inet_sock *inet = inet_sk(sk);
+@@ -2322,7 +2402,7 @@ int tcp_disconnect(struct sock *sk, int flags)
+ /* The last check adjusts for discrepancy of Linux wrt. RFC
+ * states
+ */
+- tcp_send_active_reset(sk, gfp_any());
++ tp->ops->send_active_reset(sk, gfp_any());
+ sk->sk_err = ECONNRESET;
+ } else if (old_state == TCP_SYN_SENT)
+ sk->sk_err = ECONNRESET;
+@@ -2340,6 +2420,13 @@ int tcp_disconnect(struct sock *sk, int flags)
+ if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK))
+ inet_reset_saddr(sk);
+
++ if (is_meta_sk(sk)) {
++ mptcp_disconnect(sk);
++ } else {
++ if (tp->inside_tk_table)
++ mptcp_hash_remove_bh(tp);
++ }
++
+ sk->sk_shutdown = 0;
+ sock_reset_flag(sk, SOCK_DONE);
+ tp->srtt_us = 0;
+@@ -2632,6 +2719,12 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
+ break;
+
+ case TCP_DEFER_ACCEPT:
++ /* An established MPTCP-connection (mptcp(tp) only returns true
++ * if the socket is established) should not use DEFER on new
++ * subflows.
++ */
++ if (mptcp(tp))
++ break;
+ /* Translate value in seconds to number of retransmits */
+ icsk->icsk_accept_queue.rskq_defer_accept =
+ secs_to_retrans(val, TCP_TIMEOUT_INIT / HZ,
+@@ -2659,7 +2752,7 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
+ (TCPF_ESTABLISHED | TCPF_CLOSE_WAIT) &&
+ inet_csk_ack_scheduled(sk)) {
+ icsk->icsk_ack.pending |= ICSK_ACK_PUSHED;
+- tcp_cleanup_rbuf(sk, 1);
++ tp->ops->cleanup_rbuf(sk, 1);
+ if (!(val & 1))
+ icsk->icsk_ack.pingpong = 1;
+ }
+@@ -2699,6 +2792,18 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
+ tp->notsent_lowat = val;
+ sk->sk_write_space(sk);
+ break;
++#ifdef CONFIG_MPTCP
++ case MPTCP_ENABLED:
++ if (sk->sk_state == TCP_CLOSE || sk->sk_state == TCP_LISTEN) {
++ if (val)
++ tp->mptcp_enabled = 1;
++ else
++ tp->mptcp_enabled = 0;
++ } else {
++ err = -EPERM;
++ }
++ break;
++#endif
+ default:
+ err = -ENOPROTOOPT;
+ break;
+@@ -2931,6 +3036,11 @@ static int do_tcp_getsockopt(struct sock *sk, int level,
+ case TCP_NOTSENT_LOWAT:
+ val = tp->notsent_lowat;
+ break;
++#ifdef CONFIG_MPTCP
++ case MPTCP_ENABLED:
++ val = tp->mptcp_enabled;
++ break;
++#endif
+ default:
+ return -ENOPROTOOPT;
+ }
+@@ -3120,8 +3230,11 @@ void tcp_done(struct sock *sk)
+ if (sk->sk_state == TCP_SYN_SENT || sk->sk_state == TCP_SYN_RECV)
+ TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_ATTEMPTFAILS);
+
++ WARN_ON(sk->sk_state == TCP_CLOSE);
+ tcp_set_state(sk, TCP_CLOSE);
++
+ tcp_clear_xmit_timers(sk);
++
+ if (req != NULL)
+ reqsk_fastopen_remove(sk, req, false);
+
+diff --git a/net/ipv4/tcp_fastopen.c b/net/ipv4/tcp_fastopen.c
+index 9771563ab564..5c230d96c4c1 100644
+--- a/net/ipv4/tcp_fastopen.c
++++ b/net/ipv4/tcp_fastopen.c
+@@ -7,6 +7,7 @@
+ #include <linux/rculist.h>
+ #include <net/inetpeer.h>
+ #include <net/tcp.h>
++#include <net/mptcp.h>
+
+ int sysctl_tcp_fastopen __read_mostly = TFO_CLIENT_ENABLE;
+
+@@ -133,7 +134,7 @@ static bool tcp_fastopen_create_child(struct sock *sk,
+ {
+ struct tcp_sock *tp;
+ struct request_sock_queue *queue = &inet_csk(sk)->icsk_accept_queue;
+- struct sock *child;
++ struct sock *child, *meta_sk;
+
+ req->num_retrans = 0;
+ req->num_timeout = 0;
+@@ -176,13 +177,6 @@ static bool tcp_fastopen_create_child(struct sock *sk,
+ /* Add the child socket directly into the accept queue */
+ inet_csk_reqsk_queue_add(sk, req, child);
+
+- /* Now finish processing the fastopen child socket. */
+- inet_csk(child)->icsk_af_ops->rebuild_header(child);
+- tcp_init_congestion_control(child);
+- tcp_mtup_init(child);
+- tcp_init_metrics(child);
+- tcp_init_buffer_space(child);
+-
+ /* Queue the data carried in the SYN packet. We need to first
+ * bump skb's refcnt because the caller will attempt to free it.
+ *
+@@ -199,8 +193,24 @@ static bool tcp_fastopen_create_child(struct sock *sk,
+ tp->syn_data_acked = 1;
+ }
+ tcp_rsk(req)->rcv_nxt = tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq;
++
++ meta_sk = child;
++ if (!mptcp_check_req_fastopen(meta_sk, req)) {
++ child = tcp_sk(meta_sk)->mpcb->master_sk;
++ tp = tcp_sk(child);
++ }
++
++ /* Now finish processing the fastopen child socket. */
++ inet_csk(child)->icsk_af_ops->rebuild_header(child);
++ tp->ops->init_congestion_control(child);
++ tcp_mtup_init(child);
++ tcp_init_metrics(child);
++ tp->ops->init_buffer_space(child);
++
+ sk->sk_data_ready(sk);
+- bh_unlock_sock(child);
++ if (mptcp(tcp_sk(child)))
++ bh_unlock_sock(child);
++ bh_unlock_sock(meta_sk);
+ sock_put(child);
+ WARN_ON(req->sk == NULL);
+ return true;
+diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
+index 40639c288dc2..3273bb69f387 100644
+--- a/net/ipv4/tcp_input.c
++++ b/net/ipv4/tcp_input.c
+@@ -74,6 +74,9 @@
+ #include <linux/ipsec.h>
+ #include <asm/unaligned.h>
+ #include <net/netdma.h>
++#include <net/mptcp.h>
++#include <net/mptcp_v4.h>
++#include <net/mptcp_v6.h>
+
+ int sysctl_tcp_timestamps __read_mostly = 1;
+ int sysctl_tcp_window_scaling __read_mostly = 1;
+@@ -99,25 +102,6 @@ int sysctl_tcp_thin_dupack __read_mostly;
+ int sysctl_tcp_moderate_rcvbuf __read_mostly = 1;
+ int sysctl_tcp_early_retrans __read_mostly = 3;
+
+-#define FLAG_DATA 0x01 /* Incoming frame contained data. */
+-#define FLAG_WIN_UPDATE 0x02 /* Incoming ACK was a window update. */
+-#define FLAG_DATA_ACKED 0x04 /* This ACK acknowledged new data. */
+-#define FLAG_RETRANS_DATA_ACKED 0x08 /* "" "" some of which was retransmitted. */
+-#define FLAG_SYN_ACKED 0x10 /* This ACK acknowledged SYN. */
+-#define FLAG_DATA_SACKED 0x20 /* New SACK. */
+-#define FLAG_ECE 0x40 /* ECE in this ACK */
+-#define FLAG_SLOWPATH 0x100 /* Do not skip RFC checks for window update.*/
+-#define FLAG_ORIG_SACK_ACKED 0x200 /* Never retransmitted data are (s)acked */
+-#define FLAG_SND_UNA_ADVANCED 0x400 /* Snd_una was changed (!= FLAG_DATA_ACKED) */
+-#define FLAG_DSACKING_ACK 0x800 /* SACK blocks contained D-SACK info */
+-#define FLAG_SACK_RENEGING 0x2000 /* snd_una advanced to a sacked seq */
+-#define FLAG_UPDATE_TS_RECENT 0x4000 /* tcp_replace_ts_recent() */
+-
+-#define FLAG_ACKED (FLAG_DATA_ACKED|FLAG_SYN_ACKED)
+-#define FLAG_NOT_DUP (FLAG_DATA|FLAG_WIN_UPDATE|FLAG_ACKED)
+-#define FLAG_CA_ALERT (FLAG_DATA_SACKED|FLAG_ECE)
+-#define FLAG_FORWARD_PROGRESS (FLAG_ACKED|FLAG_DATA_SACKED)
+-
+ #define TCP_REMNANT (TCP_FLAG_FIN|TCP_FLAG_URG|TCP_FLAG_SYN|TCP_FLAG_PSH)
+ #define TCP_HP_BITS (~(TCP_RESERVED_BITS|TCP_FLAG_PSH))
+
+@@ -181,7 +165,7 @@ static void tcp_incr_quickack(struct sock *sk)
+ icsk->icsk_ack.quick = min(quickacks, TCP_MAX_QUICKACKS);
+ }
+
+-static void tcp_enter_quickack_mode(struct sock *sk)
++void tcp_enter_quickack_mode(struct sock *sk)
+ {
+ struct inet_connection_sock *icsk = inet_csk(sk);
+ tcp_incr_quickack(sk);
+@@ -283,8 +267,12 @@ static void tcp_sndbuf_expand(struct sock *sk)
+ per_mss = roundup_pow_of_two(per_mss) +
+ SKB_DATA_ALIGN(sizeof(struct sk_buff));
+
+- nr_segs = max_t(u32, TCP_INIT_CWND, tp->snd_cwnd);
+- nr_segs = max_t(u32, nr_segs, tp->reordering + 1);
++ if (mptcp(tp)) {
++ nr_segs = mptcp_check_snd_buf(tp);
++ } else {
++ nr_segs = max_t(u32, TCP_INIT_CWND, tp->snd_cwnd);
++ nr_segs = max_t(u32, nr_segs, tp->reordering + 1);
++ }
+
+ /* Fast Recovery (RFC 5681 3.2) :
+ * Cubic needs 1.7 factor, rounded to 2 to include
+@@ -292,8 +280,16 @@ static void tcp_sndbuf_expand(struct sock *sk)
+ */
+ sndmem = 2 * nr_segs * per_mss;
+
+- if (sk->sk_sndbuf < sndmem)
++ /* MPTCP: after this sndmem is the new contribution of the
++ * current subflow to the aggregated sndbuf */
++ if (sk->sk_sndbuf < sndmem) {
++ int old_sndbuf = sk->sk_sndbuf;
+ sk->sk_sndbuf = min(sndmem, sysctl_tcp_wmem[2]);
++ /* MPTCP: ok, the subflow sndbuf has grown, reflect
++ * this in the aggregate buffer.*/
++ if (mptcp(tp) && old_sndbuf != sk->sk_sndbuf)
++ mptcp_update_sndbuf(tp);
++ }
+ }
+
+ /* 2. Tuning advertised window (window_clamp, rcv_ssthresh)
+@@ -342,10 +338,12 @@ static int __tcp_grow_window(const struct sock *sk, const struct sk_buff *skb)
+ static void tcp_grow_window(struct sock *sk, const struct sk_buff *skb)
+ {
+ struct tcp_sock *tp = tcp_sk(sk);
++ struct sock *meta_sk = mptcp(tp) ? mptcp_meta_sk(sk) : sk;
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
+
+ /* Check #1 */
+- if (tp->rcv_ssthresh < tp->window_clamp &&
+- (int)tp->rcv_ssthresh < tcp_space(sk) &&
++ if (meta_tp->rcv_ssthresh < meta_tp->window_clamp &&
++ (int)meta_tp->rcv_ssthresh < tcp_space(sk) &&
+ !sk_under_memory_pressure(sk)) {
+ int incr;
+
+@@ -353,14 +351,14 @@ static void tcp_grow_window(struct sock *sk, const struct sk_buff *skb)
+ * will fit to rcvbuf in future.
+ */
+ if (tcp_win_from_space(skb->truesize) <= skb->len)
+- incr = 2 * tp->advmss;
++ incr = 2 * meta_tp->advmss;
+ else
+- incr = __tcp_grow_window(sk, skb);
++ incr = __tcp_grow_window(meta_sk, skb);
+
+ if (incr) {
+ incr = max_t(int, incr, 2 * skb->len);
+- tp->rcv_ssthresh = min(tp->rcv_ssthresh + incr,
+- tp->window_clamp);
++ meta_tp->rcv_ssthresh = min(meta_tp->rcv_ssthresh + incr,
++ meta_tp->window_clamp);
+ inet_csk(sk)->icsk_ack.quick |= 1;
+ }
+ }
+@@ -543,7 +541,10 @@ void tcp_rcv_space_adjust(struct sock *sk)
+ int copied;
+
+ time = tcp_time_stamp - tp->rcvq_space.time;
+- if (time < (tp->rcv_rtt_est.rtt >> 3) || tp->rcv_rtt_est.rtt == 0)
++ if (mptcp(tp)) {
++ if (mptcp_check_rtt(tp, time))
++ return;
++ } else if (time < (tp->rcv_rtt_est.rtt >> 3) || tp->rcv_rtt_est.rtt == 0)
+ return;
+
+ /* Number of bytes copied to user in last RTT */
+@@ -761,7 +762,7 @@ static void tcp_update_pacing_rate(struct sock *sk)
+ /* Calculate rto without backoff. This is the second half of Van Jacobson's
+ * routine referred to above.
+ */
+-static void tcp_set_rto(struct sock *sk)
++void tcp_set_rto(struct sock *sk)
+ {
+ const struct tcp_sock *tp = tcp_sk(sk);
+ /* Old crap is replaced with new one. 8)
+@@ -1376,7 +1377,11 @@ static struct sk_buff *tcp_shift_skb_data(struct sock *sk, struct sk_buff *skb,
+ int len;
+ int in_sack;
+
+- if (!sk_can_gso(sk))
++ /* For MPTCP we cannot shift skb-data and remove one skb from the
++ * send-queue, because this will make us lose the DSS-option (which
++ * is stored in TCP_SKB_CB(skb)->dss) of the skb we are removing.
++ */
++ if (!sk_can_gso(sk) || mptcp(tp))
+ goto fallback;
+
+ /* Normally R but no L won't result in plain S */
+@@ -2915,7 +2920,7 @@ static inline bool tcp_ack_update_rtt(struct sock *sk, const int flag,
+ return false;
+
+ tcp_rtt_estimator(sk, seq_rtt_us);
+- tcp_set_rto(sk);
++ tp->ops->set_rto(sk);
+
+ /* RFC6298: only reset backoff on valid RTT measurement. */
+ inet_csk(sk)->icsk_backoff = 0;
+@@ -3000,7 +3005,7 @@ void tcp_resume_early_retransmit(struct sock *sk)
+ }
+
+ /* If we get here, the whole TSO packet has not been acked. */
+-static u32 tcp_tso_acked(struct sock *sk, struct sk_buff *skb)
++u32 tcp_tso_acked(struct sock *sk, struct sk_buff *skb)
+ {
+ struct tcp_sock *tp = tcp_sk(sk);
+ u32 packets_acked;
+@@ -3095,6 +3100,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, int prior_fackets,
+ */
+ if (!(scb->tcp_flags & TCPHDR_SYN)) {
+ flag |= FLAG_DATA_ACKED;
++ if (mptcp(tp) && mptcp_is_data_seq(skb))
++ flag |= MPTCP_FLAG_DATA_ACKED;
+ } else {
+ flag |= FLAG_SYN_ACKED;
+ tp->retrans_stamp = 0;
+@@ -3189,7 +3196,7 @@ static int tcp_clean_rtx_queue(struct sock *sk, int prior_fackets,
+ return flag;
+ }
+
+-static void tcp_ack_probe(struct sock *sk)
++void tcp_ack_probe(struct sock *sk)
+ {
+ const struct tcp_sock *tp = tcp_sk(sk);
+ struct inet_connection_sock *icsk = inet_csk(sk);
+@@ -3236,9 +3243,8 @@ static inline bool tcp_may_raise_cwnd(const struct sock *sk, const int flag)
+ /* Check that window update is acceptable.
+ * The function assumes that snd_una<=ack<=snd_next.
+ */
+-static inline bool tcp_may_update_window(const struct tcp_sock *tp,
+- const u32 ack, const u32 ack_seq,
+- const u32 nwin)
++bool tcp_may_update_window(const struct tcp_sock *tp, const u32 ack,
++ const u32 ack_seq, const u32 nwin)
+ {
+ return after(ack, tp->snd_una) ||
+ after(ack_seq, tp->snd_wl1) ||
+@@ -3357,7 +3363,7 @@ static void tcp_process_tlp_ack(struct sock *sk, u32 ack, int flag)
+ }
+
+ /* This routine deals with incoming acks, but not outgoing ones. */
+-static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
++static int tcp_ack(struct sock *sk, struct sk_buff *skb, int flag)
+ {
+ struct inet_connection_sock *icsk = inet_csk(sk);
+ struct tcp_sock *tp = tcp_sk(sk);
+@@ -3449,6 +3455,16 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
+ sack_rtt_us);
+ acked -= tp->packets_out;
+
++ if (mptcp(tp)) {
++ if (mptcp_fallback_infinite(sk, flag)) {
++ pr_err("%s resetting flow\n", __func__);
++ mptcp_send_reset(sk);
++ goto invalid_ack;
++ }
++
++ mptcp_clean_rtx_infinite(skb, sk);
++ }
++
+ /* Advance cwnd if state allows */
+ if (tcp_may_raise_cwnd(sk, flag))
+ tcp_cong_avoid(sk, ack, acked);
+@@ -3512,8 +3528,9 @@ old_ack:
+ * the fast version below fails.
+ */
+ void tcp_parse_options(const struct sk_buff *skb,
+- struct tcp_options_received *opt_rx, int estab,
+- struct tcp_fastopen_cookie *foc)
++ struct tcp_options_received *opt_rx,
++ struct mptcp_options_received *mopt,
++ int estab, struct tcp_fastopen_cookie *foc)
+ {
+ const unsigned char *ptr;
+ const struct tcphdr *th = tcp_hdr(skb);
+@@ -3596,6 +3613,9 @@ void tcp_parse_options(const struct sk_buff *skb,
+ */
+ break;
+ #endif
++ case TCPOPT_MPTCP:
++ mptcp_parse_options(ptr - 2, opsize, mopt, skb);
++ break;
+ case TCPOPT_EXP:
+ /* Fast Open option shares code 254 using a
+ * 16 bits magic number. It's valid only in
+@@ -3657,8 +3677,8 @@ static bool tcp_fast_parse_options(const struct sk_buff *skb,
+ if (tcp_parse_aligned_timestamp(tp, th))
+ return true;
+ }
+-
+- tcp_parse_options(skb, &tp->rx_opt, 1, NULL);
++ tcp_parse_options(skb, &tp->rx_opt, mptcp(tp) ? &tp->mptcp->rx_opt : NULL,
++ 1, NULL);
+ if (tp->rx_opt.saw_tstamp && tp->rx_opt.rcv_tsecr)
+ tp->rx_opt.rcv_tsecr -= tp->tsoffset;
+
+@@ -3831,6 +3851,8 @@ static void tcp_fin(struct sock *sk)
+ dst = __sk_dst_get(sk);
+ if (!dst || !dst_metric(dst, RTAX_QUICKACK))
+ inet_csk(sk)->icsk_ack.pingpong = 1;
++ if (mptcp(tp))
++ mptcp_sub_close_passive(sk);
+ break;
+
+ case TCP_CLOSE_WAIT:
+@@ -3852,9 +3874,16 @@ static void tcp_fin(struct sock *sk)
+ tcp_set_state(sk, TCP_CLOSING);
+ break;
+ case TCP_FIN_WAIT2:
++ if (mptcp(tp)) {
++ /* The socket will get closed by mptcp_data_ready.
++ * We first have to process all data-sequences.
++ */
++ tp->close_it = 1;
++ break;
++ }
+ /* Received a FIN -- send ACK and enter TIME_WAIT. */
+ tcp_send_ack(sk);
+- tcp_time_wait(sk, TCP_TIME_WAIT, 0);
++ tp->ops->time_wait(sk, TCP_TIME_WAIT, 0);
+ break;
+ default:
+ /* Only TCP_LISTEN and TCP_CLOSE are left, in these
+@@ -3876,6 +3905,10 @@ static void tcp_fin(struct sock *sk)
+ if (!sock_flag(sk, SOCK_DEAD)) {
+ sk->sk_state_change(sk);
+
++ /* Don't wake up MPTCP-subflows */
++ if (mptcp(tp))
++ return;
++
+ /* Do not send POLL_HUP for half duplex close. */
+ if (sk->sk_shutdown == SHUTDOWN_MASK ||
+ sk->sk_state == TCP_CLOSE)
+@@ -4073,7 +4106,11 @@ static void tcp_ofo_queue(struct sock *sk)
+ tcp_dsack_extend(sk, TCP_SKB_CB(skb)->seq, dsack);
+ }
+
+- if (!after(TCP_SKB_CB(skb)->end_seq, tp->rcv_nxt)) {
++ /* In case of MPTCP, the segment may be empty if it's a
++ * non-data DATA_FIN. (see beginning of tcp_data_queue)
++ */
++ if (!after(TCP_SKB_CB(skb)->end_seq, tp->rcv_nxt) &&
++ !(mptcp(tp) && TCP_SKB_CB(skb)->end_seq == TCP_SKB_CB(skb)->seq)) {
+ SOCK_DEBUG(sk, "ofo packet was already received\n");
+ __skb_unlink(skb, &tp->out_of_order_queue);
+ __kfree_skb(skb);
+@@ -4091,12 +4128,14 @@ static void tcp_ofo_queue(struct sock *sk)
+ }
+ }
+
+-static bool tcp_prune_ofo_queue(struct sock *sk);
+ static int tcp_prune_queue(struct sock *sk);
+
+ static int tcp_try_rmem_schedule(struct sock *sk, struct sk_buff *skb,
+ unsigned int size)
+ {
++ if (mptcp(tcp_sk(sk)))
++ sk = mptcp_meta_sk(sk);
++
+ if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf ||
+ !sk_rmem_schedule(sk, skb, size)) {
+
+@@ -4104,7 +4143,7 @@ static int tcp_try_rmem_schedule(struct sock *sk, struct sk_buff *skb,
+ return -1;
+
+ if (!sk_rmem_schedule(sk, skb, size)) {
+- if (!tcp_prune_ofo_queue(sk))
++ if (!tcp_sk(sk)->ops->prune_ofo_queue(sk))
+ return -1;
+
+ if (!sk_rmem_schedule(sk, skb, size))
+@@ -4127,15 +4166,16 @@ static int tcp_try_rmem_schedule(struct sock *sk, struct sk_buff *skb,
+ * Better try to coalesce them right now to avoid future collapses.
+ * Returns true if caller should free @from instead of queueing it
+ */
+-static bool tcp_try_coalesce(struct sock *sk,
+- struct sk_buff *to,
+- struct sk_buff *from,
+- bool *fragstolen)
++bool tcp_try_coalesce(struct sock *sk, struct sk_buff *to, struct sk_buff *from,
++ bool *fragstolen)
+ {
+ int delta;
+
+ *fragstolen = false;
+
++ if (mptcp(tcp_sk(sk)) && !is_meta_sk(sk))
++ return false;
++
+ if (tcp_hdr(from)->fin)
+ return false;
+
+@@ -4225,7 +4265,9 @@ static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
+
+ /* Do skb overlap to previous one? */
+ if (skb1 && before(seq, TCP_SKB_CB(skb1)->end_seq)) {
+- if (!after(end_seq, TCP_SKB_CB(skb1)->end_seq)) {
++ /* MPTCP allows non-data data-fin to be in the ofo-queue */
++ if (!after(end_seq, TCP_SKB_CB(skb1)->end_seq) &&
++ !(mptcp(tp) && end_seq == seq)) {
+ /* All the bits are present. Drop. */
+ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPOFOMERGE);
+ __kfree_skb(skb);
+@@ -4263,6 +4305,9 @@ static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
+ end_seq);
+ break;
+ }
++ /* MPTCP allows non-data data-fin to be in the ofo-queue */
++ if (mptcp(tp) && TCP_SKB_CB(skb1)->seq == TCP_SKB_CB(skb1)->end_seq)
++ continue;
+ __skb_unlink(skb1, &tp->out_of_order_queue);
+ tcp_dsack_extend(sk, TCP_SKB_CB(skb1)->seq,
+ TCP_SKB_CB(skb1)->end_seq);
+@@ -4280,8 +4325,8 @@ end:
+ }
+ }
+
+-static int __must_check tcp_queue_rcv(struct sock *sk, struct sk_buff *skb, int hdrlen,
+- bool *fragstolen)
++int __must_check tcp_queue_rcv(struct sock *sk, struct sk_buff *skb, int hdrlen,
++ bool *fragstolen)
+ {
+ int eaten;
+ struct sk_buff *tail = skb_peek_tail(&sk->sk_receive_queue);
+@@ -4343,7 +4388,10 @@ static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)
+ int eaten = -1;
+ bool fragstolen = false;
+
+- if (TCP_SKB_CB(skb)->seq == TCP_SKB_CB(skb)->end_seq)
++ /* If no data is present, but a data_fin is in the options, we still
++ * have to call mptcp_queue_skb later on. */
++ if (TCP_SKB_CB(skb)->seq == TCP_SKB_CB(skb)->end_seq &&
++ !(mptcp(tp) && mptcp_is_data_fin(skb)))
+ goto drop;
+
+ skb_dst_drop(skb);
+@@ -4389,7 +4437,7 @@ queue_and_out:
+ eaten = tcp_queue_rcv(sk, skb, 0, &fragstolen);
+ }
+ tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq;
+- if (skb->len)
++ if (skb->len || mptcp_is_data_fin(skb))
+ tcp_event_data_recv(sk, skb);
+ if (th->fin)
+ tcp_fin(sk);
+@@ -4411,7 +4459,11 @@ queue_and_out:
+
+ if (eaten > 0)
+ kfree_skb_partial(skb, fragstolen);
+- if (!sock_flag(sk, SOCK_DEAD))
++ if (!sock_flag(sk, SOCK_DEAD) || mptcp(tp))
++ /* MPTCP: we always have to call data_ready, because
++ * we may be about to receive a data-fin, which still
++ * must get queued.
++ */
+ sk->sk_data_ready(sk);
+ return;
+ }
+@@ -4463,6 +4515,8 @@ static struct sk_buff *tcp_collapse_one(struct sock *sk, struct sk_buff *skb,
+ next = skb_queue_next(list, skb);
+
+ __skb_unlink(skb, list);
++ if (mptcp(tcp_sk(sk)))
++ mptcp_remove_shortcuts(tcp_sk(sk)->mpcb, skb);
+ __kfree_skb(skb);
+ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPRCVCOLLAPSED);
+
+@@ -4630,7 +4684,7 @@ static void tcp_collapse_ofo_queue(struct sock *sk)
+ * Purge the out-of-order queue.
+ * Return true if queue was pruned.
+ */
+-static bool tcp_prune_ofo_queue(struct sock *sk)
++bool tcp_prune_ofo_queue(struct sock *sk)
+ {
+ struct tcp_sock *tp = tcp_sk(sk);
+ bool res = false;
+@@ -4686,7 +4740,7 @@ static int tcp_prune_queue(struct sock *sk)
+ /* Collapsing did not help, destructive actions follow.
+ * This must not ever occur. */
+
+- tcp_prune_ofo_queue(sk);
++ tp->ops->prune_ofo_queue(sk);
+
+ if (atomic_read(&sk->sk_rmem_alloc) <= sk->sk_rcvbuf)
+ return 0;
+@@ -4702,7 +4756,29 @@ static int tcp_prune_queue(struct sock *sk)
+ return -1;
+ }
+
+-static bool tcp_should_expand_sndbuf(const struct sock *sk)
++/* RFC2861, slow part. Adjust cwnd, after it was not full during one rto.
++ * As additional protections, we do not touch cwnd in retransmission phases,
++ * and if application hit its sndbuf limit recently.
++ */
++void tcp_cwnd_application_limited(struct sock *sk)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++
++ if (inet_csk(sk)->icsk_ca_state == TCP_CA_Open &&
++ sk->sk_socket && !test_bit(SOCK_NOSPACE, &sk->sk_socket->flags)) {
++ /* Limited by application or receiver window. */
++ u32 init_win = tcp_init_cwnd(tp, __sk_dst_get(sk));
++ u32 win_used = max(tp->snd_cwnd_used, init_win);
++ if (win_used < tp->snd_cwnd) {
++ tp->snd_ssthresh = tcp_current_ssthresh(sk);
++ tp->snd_cwnd = (tp->snd_cwnd + win_used) >> 1;
++ }
++ tp->snd_cwnd_used = 0;
++ }
++ tp->snd_cwnd_stamp = tcp_time_stamp;
++}
++
++bool tcp_should_expand_sndbuf(const struct sock *sk)
+ {
+ const struct tcp_sock *tp = tcp_sk(sk);
+
+@@ -4737,7 +4813,7 @@ static void tcp_new_space(struct sock *sk)
+ {
+ struct tcp_sock *tp = tcp_sk(sk);
+
+- if (tcp_should_expand_sndbuf(sk)) {
++ if (tp->ops->should_expand_sndbuf(sk)) {
+ tcp_sndbuf_expand(sk);
+ tp->snd_cwnd_stamp = tcp_time_stamp;
+ }
+@@ -4749,8 +4825,9 @@ static void tcp_check_space(struct sock *sk)
+ {
+ if (sock_flag(sk, SOCK_QUEUE_SHRUNK)) {
+ sock_reset_flag(sk, SOCK_QUEUE_SHRUNK);
+- if (sk->sk_socket &&
+- test_bit(SOCK_NOSPACE, &sk->sk_socket->flags))
++ if (mptcp(tcp_sk(sk)) ||
++ (sk->sk_socket &&
++ test_bit(SOCK_NOSPACE, &sk->sk_socket->flags)))
+ tcp_new_space(sk);
+ }
+ }
+@@ -4773,7 +4850,7 @@ static void __tcp_ack_snd_check(struct sock *sk, int ofo_possible)
+ /* ... and right edge of window advances far enough.
+ * (tcp_recvmsg() will send ACK otherwise). Or...
+ */
+- __tcp_select_window(sk) >= tp->rcv_wnd) ||
++ tp->ops->__select_window(sk) >= tp->rcv_wnd) ||
+ /* We ACK each frame or... */
+ tcp_in_quickack_mode(sk) ||
+ /* We have out of order data. */
+@@ -4875,6 +4952,10 @@ static void tcp_urg(struct sock *sk, struct sk_buff *skb, const struct tcphdr *t
+ {
+ struct tcp_sock *tp = tcp_sk(sk);
+
++ /* MPTCP urgent data is not yet supported */
++ if (mptcp(tp))
++ return;
++
+ /* Check if we get a new urgent pointer - normally not. */
+ if (th->urg)
+ tcp_check_urg(sk, th);
+@@ -4942,8 +5023,7 @@ static inline bool tcp_checksum_complete_user(struct sock *sk,
+ }
+
+ #ifdef CONFIG_NET_DMA
+-static bool tcp_dma_try_early_copy(struct sock *sk, struct sk_buff *skb,
+- int hlen)
++bool tcp_dma_try_early_copy(struct sock *sk, struct sk_buff *skb, int hlen)
+ {
+ struct tcp_sock *tp = tcp_sk(sk);
+ int chunk = skb->len - hlen;
+@@ -5052,9 +5132,15 @@ syn_challenge:
+ goto discard;
+ }
+
++ /* If valid: post process the received MPTCP options. */
++ if (mptcp(tp) && mptcp_handle_options(sk, th, skb))
++ goto discard;
++
+ return true;
+
+ discard:
++ if (mptcp(tp))
++ mptcp_reset_mopt(tp);
+ __kfree_skb(skb);
+ return false;
+ }
+@@ -5106,6 +5192,10 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
+
+ tp->rx_opt.saw_tstamp = 0;
+
++ /* MPTCP: force slowpath. */
++ if (mptcp(tp))
++ goto slow_path;
++
+ /* pred_flags is 0xS?10 << 16 + snd_wnd
+ * if header_prediction is to be made
+ * 'S' will always be tp->tcp_header_len >> 2
+@@ -5205,7 +5295,7 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
+ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPHPHITSTOUSER);
+ }
+ if (copied_early)
+- tcp_cleanup_rbuf(sk, skb->len);
++ tp->ops->cleanup_rbuf(sk, skb->len);
+ }
+ if (!eaten) {
+ if (tcp_checksum_complete_user(sk, skb))
+@@ -5313,14 +5403,14 @@ void tcp_finish_connect(struct sock *sk, struct sk_buff *skb)
+
+ tcp_init_metrics(sk);
+
+- tcp_init_congestion_control(sk);
++ tp->ops->init_congestion_control(sk);
+
+ /* Prevent spurious tcp_cwnd_restart() on first data
+ * packet.
+ */
+ tp->lsndtime = tcp_time_stamp;
+
+- tcp_init_buffer_space(sk);
++ tp->ops->init_buffer_space(sk);
+
+ if (sock_flag(sk, SOCK_KEEPOPEN))
+ inet_csk_reset_keepalive_timer(sk, keepalive_time_when(tp));
+@@ -5350,7 +5440,7 @@ static bool tcp_rcv_fastopen_synack(struct sock *sk, struct sk_buff *synack,
+ /* Get original SYNACK MSS value if user MSS sets mss_clamp */
+ tcp_clear_options(&opt);
+ opt.user_mss = opt.mss_clamp = 0;
+- tcp_parse_options(synack, &opt, 0, NULL);
++ tcp_parse_options(synack, &opt, NULL, 0, NULL);
+ mss = opt.mss_clamp;
+ }
+
+@@ -5365,7 +5455,11 @@ static bool tcp_rcv_fastopen_synack(struct sock *sk, struct sk_buff *synack,
+
+ tcp_fastopen_cache_set(sk, mss, cookie, syn_drop);
+
+- if (data) { /* Retransmit unacked data in SYN */
++ /* In mptcp case, we do not rely on "retransmit", but instead on
++ * "transmit", because if fastopen data is not acked, the retransmission
++ * becomes the first MPTCP data (see mptcp_rcv_synsent_fastopen).
++ */
++ if (data && !mptcp(tp)) { /* Retransmit unacked data in SYN */
+ tcp_for_write_queue_from(data, sk) {
+ if (data == tcp_send_head(sk) ||
+ __tcp_retransmit_skb(sk, data))
+@@ -5388,8 +5482,11 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
+ struct tcp_sock *tp = tcp_sk(sk);
+ struct tcp_fastopen_cookie foc = { .len = -1 };
+ int saved_clamp = tp->rx_opt.mss_clamp;
++ struct mptcp_options_received mopt;
++ mptcp_init_mp_opt(&mopt);
+
+- tcp_parse_options(skb, &tp->rx_opt, 0, &foc);
++ tcp_parse_options(skb, &tp->rx_opt,
++ mptcp(tp) ? &tp->mptcp->rx_opt : &mopt, 0, &foc);
+ if (tp->rx_opt.saw_tstamp && tp->rx_opt.rcv_tsecr)
+ tp->rx_opt.rcv_tsecr -= tp->tsoffset;
+
+@@ -5448,6 +5545,30 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
+ tcp_init_wl(tp, TCP_SKB_CB(skb)->seq);
+ tcp_ack(sk, skb, FLAG_SLOWPATH);
+
++ if (tp->request_mptcp || mptcp(tp)) {
++ int ret;
++ ret = mptcp_rcv_synsent_state_process(sk, &sk,
++ skb, &mopt);
++
++ /* May have changed if we support MPTCP */
++ tp = tcp_sk(sk);
++ icsk = inet_csk(sk);
++
++ if (ret == 1)
++ goto reset_and_undo;
++ if (ret == 2)
++ goto discard;
++ }
++
++ if (mptcp(tp) && !is_master_tp(tp)) {
++ /* Timer for repeating the ACK until an answer
++ * arrives. Used only when establishing an additional
++ * subflow inside of an MPTCP connection.
++ */
++ sk_reset_timer(sk, &tp->mptcp->mptcp_ack_timer,
++ jiffies + icsk->icsk_rto);
++ }
++
+ /* Ok.. it's good. Set up sequence numbers and
+ * move to established.
+ */
+@@ -5474,6 +5595,11 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
+ tp->tcp_header_len = sizeof(struct tcphdr);
+ }
+
++ if (mptcp(tp)) {
++ tp->tcp_header_len += MPTCP_SUB_LEN_DSM_ALIGN;
++ tp->advmss -= MPTCP_SUB_LEN_DSM_ALIGN;
++ }
++
+ if (tcp_is_sack(tp) && sysctl_tcp_fack)
+ tcp_enable_fack(tp);
+
+@@ -5494,9 +5620,12 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
+ tcp_rcv_fastopen_synack(sk, skb, &foc))
+ return -1;
+
+- if (sk->sk_write_pending ||
++ /* With MPTCP we cannot send data on the third ack due to the
++ * lack of option-space to combine with an MP_CAPABLE.
++ */
++ if (!mptcp(tp) && (sk->sk_write_pending ||
+ icsk->icsk_accept_queue.rskq_defer_accept ||
+- icsk->icsk_ack.pingpong) {
++ icsk->icsk_ack.pingpong)) {
+ /* Save one ACK. Data will be ready after
+ * several ticks, if write_pending is set.
+ *
+@@ -5536,6 +5665,7 @@ discard:
+ tcp_paws_reject(&tp->rx_opt, 0))
+ goto discard_and_undo;
+
++ /* TODO - check this here for MPTCP */
+ if (th->syn) {
+ /* We see SYN without ACK. It is attempt of
+ * simultaneous connect with crossed SYNs.
+@@ -5552,6 +5682,11 @@ discard:
+ tp->tcp_header_len = sizeof(struct tcphdr);
+ }
+
++ if (mptcp(tp)) {
++ tp->tcp_header_len += MPTCP_SUB_LEN_DSM_ALIGN;
++ tp->advmss -= MPTCP_SUB_LEN_DSM_ALIGN;
++ }
++
+ tp->rcv_nxt = TCP_SKB_CB(skb)->seq + 1;
+ tp->rcv_wup = TCP_SKB_CB(skb)->seq + 1;
+
+@@ -5610,6 +5745,7 @@ reset_and_undo:
+
+ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
+ const struct tcphdr *th, unsigned int len)
++ __releases(&sk->sk_lock.slock)
+ {
+ struct tcp_sock *tp = tcp_sk(sk);
+ struct inet_connection_sock *icsk = inet_csk(sk);
+@@ -5661,6 +5797,16 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
+
+ case TCP_SYN_SENT:
+ queued = tcp_rcv_synsent_state_process(sk, skb, th, len);
++ if (is_meta_sk(sk)) {
++ sk = tcp_sk(sk)->mpcb->master_sk;
++ tp = tcp_sk(sk);
++
++ /* Need to call it here, because it will announce new
++ * addresses, which can only be done after the third ack
++ * of the 3-way handshake.
++ */
++ mptcp_update_metasocket(sk, tp->meta_sk);
++ }
+ if (queued >= 0)
+ return queued;
+
+@@ -5668,6 +5814,8 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
+ tcp_urg(sk, skb, th);
+ __kfree_skb(skb);
+ tcp_data_snd_check(sk);
++ if (mptcp(tp) && is_master_tp(tp))
++ bh_unlock_sock(sk);
+ return 0;
+ }
+
+@@ -5706,11 +5854,11 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
+ synack_stamp = tp->lsndtime;
+ /* Make sure socket is routed, for correct metrics. */
+ icsk->icsk_af_ops->rebuild_header(sk);
+- tcp_init_congestion_control(sk);
++ tp->ops->init_congestion_control(sk);
+
+ tcp_mtup_init(sk);
+ tp->copied_seq = tp->rcv_nxt;
+- tcp_init_buffer_space(sk);
++ tp->ops->init_buffer_space(sk);
+ }
+ smp_mb();
+ tcp_set_state(sk, TCP_ESTABLISHED);
+@@ -5730,6 +5878,8 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
+
+ if (tp->rx_opt.tstamp_ok)
+ tp->advmss -= TCPOLEN_TSTAMP_ALIGNED;
++ if (mptcp(tp))
++ tp->advmss -= MPTCP_SUB_LEN_DSM_ALIGN;
+
+ if (req) {
+ /* Re-arm the timer because data may have been sent out.
+@@ -5751,6 +5901,12 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
+
+ tcp_initialize_rcv_mss(sk);
+ tcp_fast_path_on(tp);
++ /* Send an ACK when establishing a new
++ * MPTCP subflow, i.e. using an MP_JOIN
++ * subtype.
++ */
++ if (mptcp(tp) && !is_master_tp(tp))
++ tcp_send_ack(sk);
+ break;
+
+ case TCP_FIN_WAIT1: {
+@@ -5802,7 +5958,8 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
+ tmo = tcp_fin_time(sk);
+ if (tmo > TCP_TIMEWAIT_LEN) {
+ inet_csk_reset_keepalive_timer(sk, tmo - TCP_TIMEWAIT_LEN);
+- } else if (th->fin || sock_owned_by_user(sk)) {
++ } else if (th->fin || mptcp_is_data_fin(skb) ||
++ sock_owned_by_user(sk)) {
+ /* Bad case. We could lose such FIN otherwise.
+ * It is not a big problem, but it looks confusing
+ * and not so rare event. We still can lose it now,
+@@ -5811,7 +5968,7 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
+ */
+ inet_csk_reset_keepalive_timer(sk, tmo);
+ } else {
+- tcp_time_wait(sk, TCP_FIN_WAIT2, tmo);
++ tp->ops->time_wait(sk, TCP_FIN_WAIT2, tmo);
+ goto discard;
+ }
+ break;
+@@ -5819,7 +5976,7 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
+
+ case TCP_CLOSING:
+ if (tp->snd_una == tp->write_seq) {
+- tcp_time_wait(sk, TCP_TIME_WAIT, 0);
++ tp->ops->time_wait(sk, TCP_TIME_WAIT, 0);
+ goto discard;
+ }
+ break;
+@@ -5831,6 +5988,9 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
+ goto discard;
+ }
+ break;
++ case TCP_CLOSE:
++ if (tp->mp_killed)
++ goto discard;
+ }
+
+ /* step 6: check the URG bit */
+@@ -5851,7 +6011,11 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
+ */
+ if (sk->sk_shutdown & RCV_SHUTDOWN) {
+ if (TCP_SKB_CB(skb)->end_seq != TCP_SKB_CB(skb)->seq &&
+- after(TCP_SKB_CB(skb)->end_seq - th->fin, tp->rcv_nxt)) {
++ after(TCP_SKB_CB(skb)->end_seq - th->fin, tp->rcv_nxt) &&
++ !mptcp(tp)) {
++ /* In case of mptcp, the reset is handled by
++ * mptcp_rcv_state_process
++ */
+ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPABORTONDATA);
+ tcp_reset(sk);
+ return 1;
+@@ -5877,3 +6041,154 @@ discard:
+ return 0;
+ }
+ EXPORT_SYMBOL(tcp_rcv_state_process);
++
++static inline void pr_drop_req(struct request_sock *req, __u16 port, int family)
++{
++ struct inet_request_sock *ireq = inet_rsk(req);
++
++ if (family == AF_INET)
++ LIMIT_NETDEBUG(KERN_DEBUG pr_fmt("drop open request from %pI4/%u\n"),
++ &ireq->ir_rmt_addr, port);
++#if IS_ENABLED(CONFIG_IPV6)
++ else if (family == AF_INET6)
++ LIMIT_NETDEBUG(KERN_DEBUG pr_fmt("drop open request from %pI6/%u\n"),
++ &ireq->ir_v6_rmt_addr, port);
++#endif
++}
++
++int tcp_conn_request(struct request_sock_ops *rsk_ops,
++ const struct tcp_request_sock_ops *af_ops,
++ struct sock *sk, struct sk_buff *skb)
++{
++ struct tcp_options_received tmp_opt;
++ struct request_sock *req;
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct dst_entry *dst = NULL;
++ __u32 isn = TCP_SKB_CB(skb)->when;
++ bool want_cookie = false, fastopen;
++ struct flowi fl;
++ struct tcp_fastopen_cookie foc = { .len = -1 };
++ int err;
++
++
++ /* TW buckets are converted to open requests without
++ * limitations, they conserve resources and peer is
++ * evidently real one.
++ */
++ if ((sysctl_tcp_syncookies == 2 ||
++ inet_csk_reqsk_queue_is_full(sk)) && !isn) {
++ want_cookie = tcp_syn_flood_action(sk, skb, rsk_ops->slab_name);
++ if (!want_cookie)
++ goto drop;
++ }
++
++
++ /* Accept backlog is full. If we have already queued enough
++ * of warm entries in syn queue, drop request. It is better than
++ * clogging syn queue with openreqs with exponentially increasing
++ * timeout.
++ */
++ if (sk_acceptq_is_full(sk) && inet_csk_reqsk_queue_young(sk) > 1) {
++ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENOVERFLOWS);
++ goto drop;
++ }
++
++ req = inet_reqsk_alloc(rsk_ops);
++ if (!req)
++ goto drop;
++
++ tcp_rsk(req)->af_specific = af_ops;
++
++ tcp_clear_options(&tmp_opt);
++ tmp_opt.mss_clamp = af_ops->mss_clamp;
++ tmp_opt.user_mss = tp->rx_opt.user_mss;
++ tcp_parse_options(skb, &tmp_opt, NULL, 0, want_cookie ? NULL : &foc);
++
++ if (want_cookie && !tmp_opt.saw_tstamp)
++ tcp_clear_options(&tmp_opt);
++
++ tmp_opt.tstamp_ok = tmp_opt.saw_tstamp;
++ tcp_openreq_init(req, &tmp_opt, skb);
++
++ if (af_ops->init_req(req, sk, skb))
++ goto drop_and_free;
++
++ if (security_inet_conn_request(sk, skb, req))
++ goto drop_and_free;
++
++ if (!want_cookie || tmp_opt.tstamp_ok)
++ TCP_ECN_create_request(req, skb, sock_net(sk));
++
++ if (want_cookie) {
++ isn = cookie_init_sequence(af_ops, sk, skb, &req->mss);
++ req->cookie_ts = tmp_opt.tstamp_ok;
++ } else if (!isn) {
++ /* VJ's idea. We save last timestamp seen
++ * from the destination in peer table, when entering
++ * state TIME-WAIT, and check against it before
++ * accepting new connection request.
++ *
++ * If "isn" is not zero, this request hit alive
++ * timewait bucket, so that all the necessary checks
++ * are made in the function processing timewait state.
++ */
++ if (tmp_opt.saw_tstamp && tcp_death_row.sysctl_tw_recycle) {
++ bool strict;
++
++ dst = af_ops->route_req(sk, &fl, req, &strict);
++ if (dst && strict &&
++ !tcp_peer_is_proven(req, dst, true)) {
++ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_PAWSPASSIVEREJECTED);
++ goto drop_and_release;
++ }
++ }
++ /* Kill the following clause, if you dislike this way. */
++ else if (!sysctl_tcp_syncookies &&
++ (sysctl_max_syn_backlog - inet_csk_reqsk_queue_len(sk) <
++ (sysctl_max_syn_backlog >> 2)) &&
++ !tcp_peer_is_proven(req, dst, false)) {
++ /* Without syncookies last quarter of
++ * backlog is filled with destinations,
++ * proven to be alive.
++ * It means that we continue to communicate
++ * to destinations, already remembered
++ * to the moment of synflood.
++ */
++ pr_drop_req(req, ntohs(tcp_hdr(skb)->source),
++ rsk_ops->family);
++ goto drop_and_release;
++ }
++
++ isn = af_ops->init_seq(skb);
++ }
++ if (!dst) {
++ dst = af_ops->route_req(sk, &fl, req, NULL);
++ if (!dst)
++ goto drop_and_free;
++ }
++
++ tcp_rsk(req)->snt_isn = isn;
++ tcp_openreq_init_rwin(req, sk, dst);
++ fastopen = !want_cookie &&
++ tcp_try_fastopen(sk, skb, req, &foc, dst);
++ err = af_ops->send_synack(sk, dst, &fl, req,
++ skb_get_queue_mapping(skb), &foc);
++ if (!fastopen) {
++ if (err || want_cookie)
++ goto drop_and_free;
++
++ tcp_rsk(req)->listener = NULL;
++ af_ops->queue_hash_add(sk, req, TCP_TIMEOUT_INIT);
++ }
++
++ return 0;
++
++drop_and_release:
++ dst_release(dst);
++drop_and_free:
++ reqsk_free(req);
++drop:
++ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENDROPS);
++ return 0;
++}
++EXPORT_SYMBOL(tcp_conn_request);
+diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
+index 77cccda1ad0c..c77017f600f1 100644
+--- a/net/ipv4/tcp_ipv4.c
++++ b/net/ipv4/tcp_ipv4.c
+@@ -67,6 +67,8 @@
+ #include <net/icmp.h>
+ #include <net/inet_hashtables.h>
+ #include <net/tcp.h>
++#include <net/mptcp.h>
++#include <net/mptcp_v4.h>
+ #include <net/transp_v6.h>
+ #include <net/ipv6.h>
+ #include <net/inet_common.h>
+@@ -99,7 +101,7 @@ static int tcp_v4_md5_hash_hdr(char *md5_hash, const struct tcp_md5sig_key *key,
+ struct inet_hashinfo tcp_hashinfo;
+ EXPORT_SYMBOL(tcp_hashinfo);
+
+-static inline __u32 tcp_v4_init_sequence(const struct sk_buff *skb)
++__u32 tcp_v4_init_sequence(const struct sk_buff *skb)
+ {
+ return secure_tcp_sequence_number(ip_hdr(skb)->daddr,
+ ip_hdr(skb)->saddr,
+@@ -334,7 +336,7 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
+ struct inet_sock *inet;
+ const int type = icmp_hdr(icmp_skb)->type;
+ const int code = icmp_hdr(icmp_skb)->code;
+- struct sock *sk;
++ struct sock *sk, *meta_sk;
+ struct sk_buff *skb;
+ struct request_sock *fastopen;
+ __u32 seq, snd_una;
+@@ -358,13 +360,19 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
+ return;
+ }
+
+- bh_lock_sock(sk);
++ tp = tcp_sk(sk);
++ if (mptcp(tp))
++ meta_sk = mptcp_meta_sk(sk);
++ else
++ meta_sk = sk;
++
++ bh_lock_sock(meta_sk);
+ /* If too many ICMPs get dropped on busy
+ * servers this needs to be solved differently.
+ * We do take care of PMTU discovery (RFC1191) special case :
+ * we can receive locally generated ICMP messages while socket is held.
+ */
+- if (sock_owned_by_user(sk)) {
++ if (sock_owned_by_user(meta_sk)) {
+ if (!(type == ICMP_DEST_UNREACH && code == ICMP_FRAG_NEEDED))
+ NET_INC_STATS_BH(net, LINUX_MIB_LOCKDROPPEDICMPS);
+ }
+@@ -377,7 +385,6 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
+ }
+
+ icsk = inet_csk(sk);
+- tp = tcp_sk(sk);
+ seq = ntohl(th->seq);
+ /* XXX (TFO) - tp->snd_una should be ISN (tcp_create_openreq_child() */
+ fastopen = tp->fastopen_rsk;
+@@ -411,11 +418,13 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
+ goto out;
+
+ tp->mtu_info = info;
+- if (!sock_owned_by_user(sk)) {
++ if (!sock_owned_by_user(meta_sk)) {
+ tcp_v4_mtu_reduced(sk);
+ } else {
+ if (!test_and_set_bit(TCP_MTU_REDUCED_DEFERRED, &tp->tsq_flags))
+ sock_hold(sk);
++ if (mptcp(tp))
++ mptcp_tsq_flags(sk);
+ }
+ goto out;
+ }
+@@ -429,7 +438,7 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
+ !icsk->icsk_backoff || fastopen)
+ break;
+
+- if (sock_owned_by_user(sk))
++ if (sock_owned_by_user(meta_sk))
+ break;
+
+ icsk->icsk_backoff--;
+@@ -463,7 +472,7 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
+ switch (sk->sk_state) {
+ struct request_sock *req, **prev;
+ case TCP_LISTEN:
+- if (sock_owned_by_user(sk))
++ if (sock_owned_by_user(meta_sk))
+ goto out;
+
+ req = inet_csk_search_req(sk, &prev, th->dest,
+@@ -499,7 +508,7 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
+ if (fastopen && fastopen->sk == NULL)
+ break;
+
+- if (!sock_owned_by_user(sk)) {
++ if (!sock_owned_by_user(meta_sk)) {
+ sk->sk_err = err;
+
+ sk->sk_error_report(sk);
+@@ -528,7 +537,7 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
+ */
+
+ inet = inet_sk(sk);
+- if (!sock_owned_by_user(sk) && inet->recverr) {
++ if (!sock_owned_by_user(meta_sk) && inet->recverr) {
+ sk->sk_err = err;
+ sk->sk_error_report(sk);
+ } else { /* Only an error on timeout */
+@@ -536,7 +545,7 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
+ }
+
+ out:
+- bh_unlock_sock(sk);
++ bh_unlock_sock(meta_sk);
+ sock_put(sk);
+ }
+
+@@ -578,7 +587,7 @@ EXPORT_SYMBOL(tcp_v4_send_check);
+ * Exception: precedence violation. We do not implement it in any case.
+ */
+
+-static void tcp_v4_send_reset(struct sock *sk, struct sk_buff *skb)
++void tcp_v4_send_reset(struct sock *sk, struct sk_buff *skb)
+ {
+ const struct tcphdr *th = tcp_hdr(skb);
+ struct {
+@@ -702,10 +711,10 @@ release_sk1:
+ outside socket context is ugly, certainly. What can I do?
+ */
+
+-static void tcp_v4_send_ack(struct sk_buff *skb, u32 seq, u32 ack,
++static void tcp_v4_send_ack(struct sk_buff *skb, u32 seq, u32 ack, u32 data_ack,
+ u32 win, u32 tsval, u32 tsecr, int oif,
+ struct tcp_md5sig_key *key,
+- int reply_flags, u8 tos)
++ int reply_flags, u8 tos, int mptcp)
+ {
+ const struct tcphdr *th = tcp_hdr(skb);
+ struct {
+@@ -714,6 +723,10 @@ static void tcp_v4_send_ack(struct sk_buff *skb, u32 seq, u32 ack,
+ #ifdef CONFIG_TCP_MD5SIG
+ + (TCPOLEN_MD5SIG_ALIGNED >> 2)
+ #endif
++#ifdef CONFIG_MPTCP
++ + ((MPTCP_SUB_LEN_DSS >> 2) +
++ (MPTCP_SUB_LEN_ACK >> 2))
++#endif
+ ];
+ } rep;
+ struct ip_reply_arg arg;
+@@ -758,6 +771,21 @@ static void tcp_v4_send_ack(struct sk_buff *skb, u32 seq, u32 ack,
+ ip_hdr(skb)->daddr, &rep.th);
+ }
+ #endif
++#ifdef CONFIG_MPTCP
++ if (mptcp) {
++ int offset = (tsecr) ? 3 : 0;
++ /* Construction of 32-bit data_ack */
++ rep.opt[offset++] = htonl((TCPOPT_MPTCP << 24) |
++ ((MPTCP_SUB_LEN_DSS + MPTCP_SUB_LEN_ACK) << 16) |
++ (0x20 << 8) |
++ (0x01));
++ rep.opt[offset] = htonl(data_ack);
++
++ arg.iov[0].iov_len += MPTCP_SUB_LEN_DSS + MPTCP_SUB_LEN_ACK;
++ rep.th.doff = arg.iov[0].iov_len / 4;
++ }
++#endif /* CONFIG_MPTCP */
++
+ arg.flags = reply_flags;
+ arg.csum = csum_tcpudp_nofold(ip_hdr(skb)->daddr,
+ ip_hdr(skb)->saddr, /* XXX */
+@@ -776,36 +804,44 @@ static void tcp_v4_timewait_ack(struct sock *sk, struct sk_buff *skb)
+ {
+ struct inet_timewait_sock *tw = inet_twsk(sk);
+ struct tcp_timewait_sock *tcptw = tcp_twsk(sk);
++ u32 data_ack = 0;
++ int mptcp = 0;
++
++ if (tcptw->mptcp_tw && tcptw->mptcp_tw->meta_tw) {
++ data_ack = (u32)tcptw->mptcp_tw->rcv_nxt;
++ mptcp = 1;
++ }
+
+ tcp_v4_send_ack(skb, tcptw->tw_snd_nxt, tcptw->tw_rcv_nxt,
++ data_ack,
+ tcptw->tw_rcv_wnd >> tw->tw_rcv_wscale,
+ tcp_time_stamp + tcptw->tw_ts_offset,
+ tcptw->tw_ts_recent,
+ tw->tw_bound_dev_if,
+ tcp_twsk_md5_key(tcptw),
+ tw->tw_transparent ? IP_REPLY_ARG_NOSRCCHECK : 0,
+- tw->tw_tos
++ tw->tw_tos, mptcp
+ );
+
+ inet_twsk_put(tw);
+ }
+
+-static void tcp_v4_reqsk_send_ack(struct sock *sk, struct sk_buff *skb,
+- struct request_sock *req)
++void tcp_v4_reqsk_send_ack(struct sock *sk, struct sk_buff *skb,
++ struct request_sock *req)
+ {
+ /* sk->sk_state == TCP_LISTEN -> for regular TCP_SYN_RECV
+ * sk->sk_state == TCP_SYN_RECV -> for Fast Open.
+ */
+ tcp_v4_send_ack(skb, (sk->sk_state == TCP_LISTEN) ?
+ tcp_rsk(req)->snt_isn + 1 : tcp_sk(sk)->snd_nxt,
+- tcp_rsk(req)->rcv_nxt, req->rcv_wnd,
++ tcp_rsk(req)->rcv_nxt, 0, req->rcv_wnd,
+ tcp_time_stamp,
+ req->ts_recent,
+ 0,
+ tcp_md5_do_lookup(sk, (union tcp_md5_addr *)&ip_hdr(skb)->daddr,
+ AF_INET),
+ inet_rsk(req)->no_srccheck ? IP_REPLY_ARG_NOSRCCHECK : 0,
+- ip_hdr(skb)->tos);
++ ip_hdr(skb)->tos, 0);
+ }
+
+ /*
+@@ -813,10 +849,11 @@ static void tcp_v4_reqsk_send_ack(struct sock *sk, struct sk_buff *skb,
+ * This still operates on a request_sock only, not on a big
+ * socket.
+ */
+-static int tcp_v4_send_synack(struct sock *sk, struct dst_entry *dst,
+- struct request_sock *req,
+- u16 queue_mapping,
+- struct tcp_fastopen_cookie *foc)
++int tcp_v4_send_synack(struct sock *sk, struct dst_entry *dst,
++ struct flowi *fl,
++ struct request_sock *req,
++ u16 queue_mapping,
++ struct tcp_fastopen_cookie *foc)
+ {
+ const struct inet_request_sock *ireq = inet_rsk(req);
+ struct flowi4 fl4;
+@@ -844,21 +881,10 @@ static int tcp_v4_send_synack(struct sock *sk, struct dst_entry *dst,
+ return err;
+ }
+
+-static int tcp_v4_rtx_synack(struct sock *sk, struct request_sock *req)
+-{
+- int res = tcp_v4_send_synack(sk, NULL, req, 0, NULL);
+-
+- if (!res) {
+- TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_RETRANSSEGS);
+- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPSYNRETRANS);
+- }
+- return res;
+-}
+-
+ /*
+ * IPv4 request_sock destructor.
+ */
+-static void tcp_v4_reqsk_destructor(struct request_sock *req)
++void tcp_v4_reqsk_destructor(struct request_sock *req)
+ {
+ kfree(inet_rsk(req)->opt);
+ }
+@@ -896,7 +922,7 @@ EXPORT_SYMBOL(tcp_syn_flood_action);
+ /*
+ * Save and compile IPv4 options into the request_sock if needed.
+ */
+-static struct ip_options_rcu *tcp_v4_save_options(struct sk_buff *skb)
++struct ip_options_rcu *tcp_v4_save_options(struct sk_buff *skb)
+ {
+ const struct ip_options *opt = &(IPCB(skb)->opt);
+ struct ip_options_rcu *dopt = NULL;
+@@ -1237,161 +1263,71 @@ static bool tcp_v4_inbound_md5_hash(struct sock *sk, const struct sk_buff *skb)
+
+ #endif
+
++static int tcp_v4_init_req(struct request_sock *req, struct sock *sk,
++ struct sk_buff *skb)
++{
++ struct inet_request_sock *ireq = inet_rsk(req);
++
++ ireq->ir_loc_addr = ip_hdr(skb)->daddr;
++ ireq->ir_rmt_addr = ip_hdr(skb)->saddr;
++ ireq->no_srccheck = inet_sk(sk)->transparent;
++ ireq->opt = tcp_v4_save_options(skb);
++ ireq->ir_mark = inet_request_mark(sk, skb);
++
++ return 0;
++}
++
++static struct dst_entry *tcp_v4_route_req(struct sock *sk, struct flowi *fl,
++ const struct request_sock *req,
++ bool *strict)
++{
++ struct dst_entry *dst = inet_csk_route_req(sk, &fl->u.ip4, req);
++
++ if (strict) {
++ if (fl->u.ip4.daddr == inet_rsk(req)->ir_rmt_addr)
++ *strict = true;
++ else
++ *strict = false;
++ }
++
++ return dst;
++}
++
+ struct request_sock_ops tcp_request_sock_ops __read_mostly = {
+ .family = PF_INET,
+ .obj_size = sizeof(struct tcp_request_sock),
+- .rtx_syn_ack = tcp_v4_rtx_synack,
++ .rtx_syn_ack = tcp_rtx_synack,
+ .send_ack = tcp_v4_reqsk_send_ack,
+ .destructor = tcp_v4_reqsk_destructor,
+ .send_reset = tcp_v4_send_reset,
+ .syn_ack_timeout = tcp_syn_ack_timeout,
+ };
+
++const struct tcp_request_sock_ops tcp_request_sock_ipv4_ops = {
++ .mss_clamp = TCP_MSS_DEFAULT,
+ #ifdef CONFIG_TCP_MD5SIG
+-static const struct tcp_request_sock_ops tcp_request_sock_ipv4_ops = {
+ .md5_lookup = tcp_v4_reqsk_md5_lookup,
+ .calc_md5_hash = tcp_v4_md5_hash_skb,
+-};
+ #endif
++ .init_req = tcp_v4_init_req,
++#ifdef CONFIG_SYN_COOKIES
++ .cookie_init_seq = cookie_v4_init_sequence,
++#endif
++ .route_req = tcp_v4_route_req,
++ .init_seq = tcp_v4_init_sequence,
++ .send_synack = tcp_v4_send_synack,
++ .queue_hash_add = inet_csk_reqsk_queue_hash_add,
++};
+
+ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
+ {
+- struct tcp_options_received tmp_opt;
+- struct request_sock *req;
+- struct inet_request_sock *ireq;
+- struct tcp_sock *tp = tcp_sk(sk);
+- struct dst_entry *dst = NULL;
+- __be32 saddr = ip_hdr(skb)->saddr;
+- __be32 daddr = ip_hdr(skb)->daddr;
+- __u32 isn = TCP_SKB_CB(skb)->when;
+- bool want_cookie = false, fastopen;
+- struct flowi4 fl4;
+- struct tcp_fastopen_cookie foc = { .len = -1 };
+- int err;
+-
+ /* Never answer to SYNs send to broadcast or multicast */
+ if (skb_rtable(skb)->rt_flags & (RTCF_BROADCAST | RTCF_MULTICAST))
+ goto drop;
+
+- /* TW buckets are converted to open requests without
+- * limitations, they conserve resources and peer is
+- * evidently real one.
+- */
+- if ((sysctl_tcp_syncookies == 2 ||
+- inet_csk_reqsk_queue_is_full(sk)) && !isn) {
+- want_cookie = tcp_syn_flood_action(sk, skb, "TCP");
+- if (!want_cookie)
+- goto drop;
+- }
+-
+- /* Accept backlog is full. If we have already queued enough
+- * of warm entries in syn queue, drop request. It is better than
+- * clogging syn queue with openreqs with exponentially increasing
+- * timeout.
+- */
+- if (sk_acceptq_is_full(sk) && inet_csk_reqsk_queue_young(sk) > 1) {
+- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENOVERFLOWS);
+- goto drop;
+- }
+-
+- req = inet_reqsk_alloc(&tcp_request_sock_ops);
+- if (!req)
+- goto drop;
+-
+-#ifdef CONFIG_TCP_MD5SIG
+- tcp_rsk(req)->af_specific = &tcp_request_sock_ipv4_ops;
+-#endif
+-
+- tcp_clear_options(&tmp_opt);
+- tmp_opt.mss_clamp = TCP_MSS_DEFAULT;
+- tmp_opt.user_mss = tp->rx_opt.user_mss;
+- tcp_parse_options(skb, &tmp_opt, 0, want_cookie ? NULL : &foc);
+-
+- if (want_cookie && !tmp_opt.saw_tstamp)
+- tcp_clear_options(&tmp_opt);
+-
+- tmp_opt.tstamp_ok = tmp_opt.saw_tstamp;
+- tcp_openreq_init(req, &tmp_opt, skb);
++ return tcp_conn_request(&tcp_request_sock_ops,
++ &tcp_request_sock_ipv4_ops, sk, skb);
+
+- ireq = inet_rsk(req);
+- ireq->ir_loc_addr = daddr;
+- ireq->ir_rmt_addr = saddr;
+- ireq->no_srccheck = inet_sk(sk)->transparent;
+- ireq->opt = tcp_v4_save_options(skb);
+- ireq->ir_mark = inet_request_mark(sk, skb);
+-
+- if (security_inet_conn_request(sk, skb, req))
+- goto drop_and_free;
+-
+- if (!want_cookie || tmp_opt.tstamp_ok)
+- TCP_ECN_create_request(req, skb, sock_net(sk));
+-
+- if (want_cookie) {
+- isn = cookie_v4_init_sequence(sk, skb, &req->mss);
+- req->cookie_ts = tmp_opt.tstamp_ok;
+- } else if (!isn) {
+- /* VJ's idea. We save last timestamp seen
+- * from the destination in peer table, when entering
+- * state TIME-WAIT, and check against it before
+- * accepting new connection request.
+- *
+- * If "isn" is not zero, this request hit alive
+- * timewait bucket, so that all the necessary checks
+- * are made in the function processing timewait state.
+- */
+- if (tmp_opt.saw_tstamp &&
+- tcp_death_row.sysctl_tw_recycle &&
+- (dst = inet_csk_route_req(sk, &fl4, req)) != NULL &&
+- fl4.daddr == saddr) {
+- if (!tcp_peer_is_proven(req, dst, true)) {
+- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_PAWSPASSIVEREJECTED);
+- goto drop_and_release;
+- }
+- }
+- /* Kill the following clause, if you dislike this way. */
+- else if (!sysctl_tcp_syncookies &&
+- (sysctl_max_syn_backlog - inet_csk_reqsk_queue_len(sk) <
+- (sysctl_max_syn_backlog >> 2)) &&
+- !tcp_peer_is_proven(req, dst, false)) {
+- /* Without syncookies last quarter of
+- * backlog is filled with destinations,
+- * proven to be alive.
+- * It means that we continue to communicate
+- * to destinations, already remembered
+- * to the moment of synflood.
+- */
+- LIMIT_NETDEBUG(KERN_DEBUG pr_fmt("drop open request from %pI4/%u\n"),
+- &saddr, ntohs(tcp_hdr(skb)->source));
+- goto drop_and_release;
+- }
+-
+- isn = tcp_v4_init_sequence(skb);
+- }
+- if (!dst && (dst = inet_csk_route_req(sk, &fl4, req)) == NULL)
+- goto drop_and_free;
+-
+- tcp_rsk(req)->snt_isn = isn;
+- tcp_rsk(req)->snt_synack = tcp_time_stamp;
+- tcp_openreq_init_rwin(req, sk, dst);
+- fastopen = !want_cookie &&
+- tcp_try_fastopen(sk, skb, req, &foc, dst);
+- err = tcp_v4_send_synack(sk, dst, req,
+- skb_get_queue_mapping(skb), &foc);
+- if (!fastopen) {
+- if (err || want_cookie)
+- goto drop_and_free;
+-
+- tcp_rsk(req)->snt_synack = tcp_time_stamp;
+- tcp_rsk(req)->listener = NULL;
+- inet_csk_reqsk_queue_hash_add(sk, req, TCP_TIMEOUT_INIT);
+- }
+-
+- return 0;
+-
+-drop_and_release:
+- dst_release(dst);
+-drop_and_free:
+- reqsk_free(req);
+ drop:
+ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENDROPS);
+ return 0;
+@@ -1497,7 +1433,7 @@ put_and_exit:
+ }
+ EXPORT_SYMBOL(tcp_v4_syn_recv_sock);
+
+-static struct sock *tcp_v4_hnd_req(struct sock *sk, struct sk_buff *skb)
++struct sock *tcp_v4_hnd_req(struct sock *sk, struct sk_buff *skb)
+ {
+ struct tcphdr *th = tcp_hdr(skb);
+ const struct iphdr *iph = ip_hdr(skb);
+@@ -1514,8 +1450,15 @@ static struct sock *tcp_v4_hnd_req(struct sock *sk, struct sk_buff *skb)
+
+ if (nsk) {
+ if (nsk->sk_state != TCP_TIME_WAIT) {
++ /* Don't lock the meta-sk again. It has been locked
++ * before mptcp_v4_do_rcv.
++ */
++ if (mptcp(tcp_sk(nsk)) && !is_meta_sk(sk))
++ bh_lock_sock(mptcp_meta_sk(nsk));
+ bh_lock_sock(nsk);
++
+ return nsk;
++
+ }
+ inet_twsk_put(inet_twsk(nsk));
+ return NULL;
+@@ -1550,6 +1493,9 @@ int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
+ goto discard;
+ #endif
+
++ if (is_meta_sk(sk))
++ return mptcp_v4_do_rcv(sk, skb);
++
+ if (sk->sk_state == TCP_ESTABLISHED) { /* Fast path */
+ struct dst_entry *dst = sk->sk_rx_dst;
+
+@@ -1681,7 +1627,7 @@ bool tcp_prequeue(struct sock *sk, struct sk_buff *skb)
+ } else if (skb_queue_len(&tp->ucopy.prequeue) == 1) {
+ wake_up_interruptible_sync_poll(sk_sleep(sk),
+ POLLIN | POLLRDNORM | POLLRDBAND);
+- if (!inet_csk_ack_scheduled(sk))
++ if (!inet_csk_ack_scheduled(sk) && !mptcp(tp))
+ inet_csk_reset_xmit_timer(sk, ICSK_TIME_DACK,
+ (3 * tcp_rto_min(sk)) / 4,
+ TCP_RTO_MAX);
+@@ -1698,7 +1644,7 @@ int tcp_v4_rcv(struct sk_buff *skb)
+ {
+ const struct iphdr *iph;
+ const struct tcphdr *th;
+- struct sock *sk;
++ struct sock *sk, *meta_sk = NULL;
+ int ret;
+ struct net *net = dev_net(skb->dev);
+
+@@ -1732,18 +1678,42 @@ int tcp_v4_rcv(struct sk_buff *skb)
+ TCP_SKB_CB(skb)->end_seq = (TCP_SKB_CB(skb)->seq + th->syn + th->fin +
+ skb->len - th->doff * 4);
+ TCP_SKB_CB(skb)->ack_seq = ntohl(th->ack_seq);
++#ifdef CONFIG_MPTCP
++ TCP_SKB_CB(skb)->mptcp_flags = 0;
++ TCP_SKB_CB(skb)->dss_off = 0;
++#endif
+ TCP_SKB_CB(skb)->when = 0;
+ TCP_SKB_CB(skb)->ip_dsfield = ipv4_get_dsfield(iph);
+ TCP_SKB_CB(skb)->sacked = 0;
+
+ sk = __inet_lookup_skb(&tcp_hashinfo, skb, th->source, th->dest);
+- if (!sk)
+- goto no_tcp_socket;
+
+ process:
+- if (sk->sk_state == TCP_TIME_WAIT)
++ if (sk && sk->sk_state == TCP_TIME_WAIT)
+ goto do_time_wait;
+
++#ifdef CONFIG_MPTCP
++ if (!sk && th->syn && !th->ack) {
++ int ret = mptcp_lookup_join(skb, NULL);
++
++ if (ret < 0) {
++ tcp_v4_send_reset(NULL, skb);
++ goto discard_it;
++ } else if (ret > 0) {
++ return 0;
++ }
++ }
++
++ /* Is there a pending request sock for this segment ? */
++ if ((!sk || sk->sk_state == TCP_LISTEN) && mptcp_check_req(skb, net)) {
++ if (sk)
++ sock_put(sk);
++ return 0;
++ }
++#endif
++ if (!sk)
++ goto no_tcp_socket;
++
+ if (unlikely(iph->ttl < inet_sk(sk)->min_ttl)) {
+ NET_INC_STATS_BH(net, LINUX_MIB_TCPMINTTLDROP);
+ goto discard_and_relse;
+@@ -1759,11 +1729,21 @@ process:
+ sk_mark_napi_id(sk, skb);
+ skb->dev = NULL;
+
+- bh_lock_sock_nested(sk);
++ if (mptcp(tcp_sk(sk))) {
++ meta_sk = mptcp_meta_sk(sk);
++
++ bh_lock_sock_nested(meta_sk);
++ if (sock_owned_by_user(meta_sk))
++ skb->sk = sk;
++ } else {
++ meta_sk = sk;
++ bh_lock_sock_nested(sk);
++ }
++
+ ret = 0;
+- if (!sock_owned_by_user(sk)) {
++ if (!sock_owned_by_user(meta_sk)) {
+ #ifdef CONFIG_NET_DMA
+- struct tcp_sock *tp = tcp_sk(sk);
++ struct tcp_sock *tp = tcp_sk(meta_sk);
+ if (!tp->ucopy.dma_chan && tp->ucopy.pinned_list)
+ tp->ucopy.dma_chan = net_dma_find_channel();
+ if (tp->ucopy.dma_chan)
+@@ -1771,16 +1751,16 @@ process:
+ else
+ #endif
+ {
+- if (!tcp_prequeue(sk, skb))
++ if (!tcp_prequeue(meta_sk, skb))
+ ret = tcp_v4_do_rcv(sk, skb);
+ }
+- } else if (unlikely(sk_add_backlog(sk, skb,
+- sk->sk_rcvbuf + sk->sk_sndbuf))) {
+- bh_unlock_sock(sk);
++ } else if (unlikely(sk_add_backlog(meta_sk, skb,
++ meta_sk->sk_rcvbuf + meta_sk->sk_sndbuf))) {
++ bh_unlock_sock(meta_sk);
+ NET_INC_STATS_BH(net, LINUX_MIB_TCPBACKLOGDROP);
+ goto discard_and_relse;
+ }
+- bh_unlock_sock(sk);
++ bh_unlock_sock(meta_sk);
+
+ sock_put(sk);
+
+@@ -1835,6 +1815,18 @@ do_time_wait:
+ sk = sk2;
+ goto process;
+ }
++#ifdef CONFIG_MPTCP
++ if (th->syn && !th->ack) {
++ int ret = mptcp_lookup_join(skb, inet_twsk(sk));
++
++ if (ret < 0) {
++ tcp_v4_send_reset(NULL, skb);
++ goto discard_it;
++ } else if (ret > 0) {
++ return 0;
++ }
++ }
++#endif
+ /* Fall through to ACK */
+ }
+ case TCP_TW_ACK:
+@@ -1900,7 +1892,12 @@ static int tcp_v4_init_sock(struct sock *sk)
+
+ tcp_init_sock(sk);
+
+- icsk->icsk_af_ops = &ipv4_specific;
++#ifdef CONFIG_MPTCP
++ if (is_mptcp_enabled(sk))
++ icsk->icsk_af_ops = &mptcp_v4_specific;
++ else
++#endif
++ icsk->icsk_af_ops = &ipv4_specific;
+
+ #ifdef CONFIG_TCP_MD5SIG
+ tcp_sk(sk)->af_specific = &tcp_sock_ipv4_specific;
+@@ -1917,6 +1914,11 @@ void tcp_v4_destroy_sock(struct sock *sk)
+
+ tcp_cleanup_congestion_control(sk);
+
++ if (mptcp(tp))
++ mptcp_destroy_sock(sk);
++ if (tp->inside_tk_table)
++ mptcp_hash_remove(tp);
++
+ /* Cleanup up the write buffer. */
+ tcp_write_queue_purge(sk);
+
+@@ -2481,6 +2483,19 @@ void tcp4_proc_exit(void)
+ }
+ #endif /* CONFIG_PROC_FS */
+
++#ifdef CONFIG_MPTCP
++static void tcp_v4_clear_sk(struct sock *sk, int size)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++
++ /* we do not want to clear tk_table field, because of RCU lookups */
++ sk_prot_clear_nulls(sk, offsetof(struct tcp_sock, tk_table));
++
++ size -= offsetof(struct tcp_sock, tk_table) + sizeof(tp->tk_table);
++ memset((char *)&tp->tk_table + sizeof(tp->tk_table), 0, size);
++}
++#endif
++
+ struct proto tcp_prot = {
+ .name = "TCP",
+ .owner = THIS_MODULE,
+@@ -2528,6 +2543,9 @@ struct proto tcp_prot = {
+ .destroy_cgroup = tcp_destroy_cgroup,
+ .proto_cgroup = tcp_proto_cgroup,
+ #endif
++#ifdef CONFIG_MPTCP
++ .clear_sk = tcp_v4_clear_sk,
++#endif
+ };
+ EXPORT_SYMBOL(tcp_prot);
+
+diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
+index e68e0d4af6c9..ae6946857dff 100644
+--- a/net/ipv4/tcp_minisocks.c
++++ b/net/ipv4/tcp_minisocks.c
+@@ -18,11 +18,13 @@
+ * Jorge Cwik, <jorge@laser.satlink.net>
+ */
+
++#include <linux/kconfig.h>
+ #include <linux/mm.h>
+ #include <linux/module.h>
+ #include <linux/slab.h>
+ #include <linux/sysctl.h>
+ #include <linux/workqueue.h>
++#include <net/mptcp.h>
+ #include <net/tcp.h>
+ #include <net/inet_common.h>
+ #include <net/xfrm.h>
+@@ -95,10 +97,13 @@ tcp_timewait_state_process(struct inet_timewait_sock *tw, struct sk_buff *skb,
+ struct tcp_options_received tmp_opt;
+ struct tcp_timewait_sock *tcptw = tcp_twsk((struct sock *)tw);
+ bool paws_reject = false;
++ struct mptcp_options_received mopt;
+
+ tmp_opt.saw_tstamp = 0;
+ if (th->doff > (sizeof(*th) >> 2) && tcptw->tw_ts_recent_stamp) {
+- tcp_parse_options(skb, &tmp_opt, 0, NULL);
++ mptcp_init_mp_opt(&mopt);
++
++ tcp_parse_options(skb, &tmp_opt, &mopt, 0, NULL);
+
+ if (tmp_opt.saw_tstamp) {
+ tmp_opt.rcv_tsecr -= tcptw->tw_ts_offset;
+@@ -106,6 +111,11 @@ tcp_timewait_state_process(struct inet_timewait_sock *tw, struct sk_buff *skb,
+ tmp_opt.ts_recent_stamp = tcptw->tw_ts_recent_stamp;
+ paws_reject = tcp_paws_reject(&tmp_opt, th->rst);
+ }
++
++ if (unlikely(mopt.mp_fclose) && tcptw->mptcp_tw) {
++ if (mopt.mptcp_key == tcptw->mptcp_tw->loc_key)
++ goto kill_with_rst;
++ }
+ }
+
+ if (tw->tw_substate == TCP_FIN_WAIT2) {
+@@ -128,6 +138,16 @@ tcp_timewait_state_process(struct inet_timewait_sock *tw, struct sk_buff *skb,
+ if (!th->ack ||
+ !after(TCP_SKB_CB(skb)->end_seq, tcptw->tw_rcv_nxt) ||
+ TCP_SKB_CB(skb)->end_seq == TCP_SKB_CB(skb)->seq) {
++ /* If mptcp_is_data_fin() returns true, we are sure that
++ * mopt has been initialized - otherwise it would not
++ * be a DATA_FIN.
++ */
++ if (tcptw->mptcp_tw && tcptw->mptcp_tw->meta_tw &&
++ mptcp_is_data_fin(skb) &&
++ TCP_SKB_CB(skb)->seq == tcptw->tw_rcv_nxt &&
++ mopt.data_seq + 1 == (u32)tcptw->mptcp_tw->rcv_nxt)
++ return TCP_TW_ACK;
++
+ inet_twsk_put(tw);
+ return TCP_TW_SUCCESS;
+ }
+@@ -290,6 +310,15 @@ void tcp_time_wait(struct sock *sk, int state, int timeo)
+ tcptw->tw_ts_recent_stamp = tp->rx_opt.ts_recent_stamp;
+ tcptw->tw_ts_offset = tp->tsoffset;
+
++ if (mptcp(tp)) {
++ if (mptcp_init_tw_sock(sk, tcptw)) {
++ inet_twsk_free(tw);
++ goto exit;
++ }
++ } else {
++ tcptw->mptcp_tw = NULL;
++ }
++
+ #if IS_ENABLED(CONFIG_IPV6)
+ if (tw->tw_family == PF_INET6) {
+ struct ipv6_pinfo *np = inet6_sk(sk);
+@@ -347,15 +376,18 @@ void tcp_time_wait(struct sock *sk, int state, int timeo)
+ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPTIMEWAITOVERFLOW);
+ }
+
++exit:
+ tcp_update_metrics(sk);
+ tcp_done(sk);
+ }
+
+ void tcp_twsk_destructor(struct sock *sk)
+ {
+-#ifdef CONFIG_TCP_MD5SIG
+ struct tcp_timewait_sock *twsk = tcp_twsk(sk);
+
++ if (twsk->mptcp_tw)
++ mptcp_twsk_destructor(twsk);
++#ifdef CONFIG_TCP_MD5SIG
+ if (twsk->tw_md5_key)
+ kfree_rcu(twsk->tw_md5_key, rcu);
+ #endif
+@@ -382,13 +414,14 @@ void tcp_openreq_init_rwin(struct request_sock *req,
+ req->window_clamp = tcp_full_space(sk);
+
+ /* tcp_full_space because it is guaranteed to be the first packet */
+- tcp_select_initial_window(tcp_full_space(sk),
+- mss - (ireq->tstamp_ok ? TCPOLEN_TSTAMP_ALIGNED : 0),
++ tp->ops->select_initial_window(tcp_full_space(sk),
++ mss - (ireq->tstamp_ok ? TCPOLEN_TSTAMP_ALIGNED : 0) -
++ (ireq->saw_mpc ? MPTCP_SUB_LEN_DSM_ALIGN : 0),
+ &req->rcv_wnd,
+ &req->window_clamp,
+ ireq->wscale_ok,
+ &rcv_wscale,
+- dst_metric(dst, RTAX_INITRWND));
++ dst_metric(dst, RTAX_INITRWND), sk);
+ ireq->rcv_wscale = rcv_wscale;
+ }
+ EXPORT_SYMBOL(tcp_openreq_init_rwin);
+@@ -499,6 +532,8 @@ struct sock *tcp_create_openreq_child(struct sock *sk, struct request_sock *req,
+ newtp->rx_opt.ts_recent_stamp = 0;
+ newtp->tcp_header_len = sizeof(struct tcphdr);
+ }
++ if (ireq->saw_mpc)
++ newtp->tcp_header_len += MPTCP_SUB_LEN_DSM_ALIGN;
+ newtp->tsoffset = 0;
+ #ifdef CONFIG_TCP_MD5SIG
+ newtp->md5sig_info = NULL; /*XXX*/
+@@ -535,16 +570,20 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
+ bool fastopen)
+ {
+ struct tcp_options_received tmp_opt;
++ struct mptcp_options_received mopt;
+ struct sock *child;
+ const struct tcphdr *th = tcp_hdr(skb);
+ __be32 flg = tcp_flag_word(th) & (TCP_FLAG_RST|TCP_FLAG_SYN|TCP_FLAG_ACK);
+ bool paws_reject = false;
+
+- BUG_ON(fastopen == (sk->sk_state == TCP_LISTEN));
++ BUG_ON(!mptcp(tcp_sk(sk)) && fastopen == (sk->sk_state == TCP_LISTEN));
+
+ tmp_opt.saw_tstamp = 0;
++
++ mptcp_init_mp_opt(&mopt);
++
+ if (th->doff > (sizeof(struct tcphdr)>>2)) {
+- tcp_parse_options(skb, &tmp_opt, 0, NULL);
++ tcp_parse_options(skb, &tmp_opt, &mopt, 0, NULL);
+
+ if (tmp_opt.saw_tstamp) {
+ tmp_opt.ts_recent = req->ts_recent;
+@@ -583,7 +622,14 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
+ *
+ * Reset timer after retransmitting SYNACK, similar to
+ * the idea of fast retransmit in recovery.
++ *
++ * Fall back to TCP if MP_CAPABLE is not set.
+ */
++
++ if (inet_rsk(req)->saw_mpc && !mopt.saw_mpc)
++ inet_rsk(req)->saw_mpc = false;
++
++
+ if (!inet_rtx_syn_ack(sk, req))
+ req->expires = min(TCP_TIMEOUT_INIT << req->num_timeout,
+ TCP_RTO_MAX) + jiffies;
+@@ -718,9 +764,21 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
+ * socket is created, wait for troubles.
+ */
+ child = inet_csk(sk)->icsk_af_ops->syn_recv_sock(sk, skb, req, NULL);
++
+ if (child == NULL)
+ goto listen_overflow;
+
++ if (!is_meta_sk(sk)) {
++ int ret = mptcp_check_req_master(sk, child, req, prev);
++ if (ret < 0)
++ goto listen_overflow;
++
++ /* MPTCP-supported */
++ if (!ret)
++ return tcp_sk(child)->mpcb->master_sk;
++ } else {
++ return mptcp_check_req_child(sk, child, req, prev, &mopt);
++ }
+ inet_csk_reqsk_queue_unlink(sk, req, prev);
+ inet_csk_reqsk_queue_removed(sk, req);
+
+@@ -746,7 +804,17 @@ embryonic_reset:
+ tcp_reset(sk);
+ }
+ if (!fastopen) {
+- inet_csk_reqsk_queue_drop(sk, req, prev);
++ if (is_meta_sk(sk)) {
++ /* We want to avoid stopping the keepalive-timer and so
++ * avoid ending up in inet_csk_reqsk_queue_removed ...
++ */
++ inet_csk_reqsk_queue_unlink(sk, req, prev);
++ if (reqsk_queue_removed(&inet_csk(sk)->icsk_accept_queue, req) == 0)
++ mptcp_delete_synack_timer(sk);
++ reqsk_free(req);
++ } else {
++ inet_csk_reqsk_queue_drop(sk, req, prev);
++ }
+ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_EMBRYONICRSTS);
+ }
+ return NULL;
+@@ -770,8 +838,9 @@ int tcp_child_process(struct sock *parent, struct sock *child,
+ {
+ int ret = 0;
+ int state = child->sk_state;
++ struct sock *meta_sk = mptcp(tcp_sk(child)) ? mptcp_meta_sk(child) : child;
+
+- if (!sock_owned_by_user(child)) {
++ if (!sock_owned_by_user(meta_sk)) {
+ ret = tcp_rcv_state_process(child, skb, tcp_hdr(skb),
+ skb->len);
+ /* Wakeup parent, send SIGIO */
+@@ -782,10 +851,14 @@ int tcp_child_process(struct sock *parent, struct sock *child,
+ * in main socket hash table and lock on listening
+ * socket does not protect us more.
+ */
+- __sk_add_backlog(child, skb);
++ if (mptcp(tcp_sk(child)))
++ skb->sk = child;
++ __sk_add_backlog(meta_sk, skb);
+ }
+
+- bh_unlock_sock(child);
++ if (mptcp(tcp_sk(child)))
++ bh_unlock_sock(child);
++ bh_unlock_sock(meta_sk);
+ sock_put(child);
+ return ret;
+ }
+diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
+index 179b51e6bda3..efd31b6c5784 100644
+--- a/net/ipv4/tcp_output.c
++++ b/net/ipv4/tcp_output.c
+@@ -36,6 +36,12 @@
+
+ #define pr_fmt(fmt) "TCP: " fmt
+
++#include <net/mptcp.h>
++#include <net/mptcp_v4.h>
++#if IS_ENABLED(CONFIG_IPV6)
++#include <net/mptcp_v6.h>
++#endif
++#include <net/ipv6.h>
+ #include <net/tcp.h>
+
+ #include <linux/compiler.h>
+@@ -68,11 +74,8 @@ int sysctl_tcp_slow_start_after_idle __read_mostly = 1;
+ unsigned int sysctl_tcp_notsent_lowat __read_mostly = UINT_MAX;
+ EXPORT_SYMBOL(sysctl_tcp_notsent_lowat);
+
+-static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
+- int push_one, gfp_t gfp);
+-
+ /* Account for new data that has been sent to the network. */
+-static void tcp_event_new_data_sent(struct sock *sk, const struct sk_buff *skb)
++void tcp_event_new_data_sent(struct sock *sk, const struct sk_buff *skb)
+ {
+ struct inet_connection_sock *icsk = inet_csk(sk);
+ struct tcp_sock *tp = tcp_sk(sk);
+@@ -214,7 +217,7 @@ u32 tcp_default_init_rwnd(u32 mss)
+ void tcp_select_initial_window(int __space, __u32 mss,
+ __u32 *rcv_wnd, __u32 *window_clamp,
+ int wscale_ok, __u8 *rcv_wscale,
+- __u32 init_rcv_wnd)
++ __u32 init_rcv_wnd, const struct sock *sk)
+ {
+ unsigned int space = (__space < 0 ? 0 : __space);
+
+@@ -269,12 +272,16 @@ EXPORT_SYMBOL(tcp_select_initial_window);
+ * value can be stuffed directly into th->window for an outgoing
+ * frame.
+ */
+-static u16 tcp_select_window(struct sock *sk)
++u16 tcp_select_window(struct sock *sk)
+ {
+ struct tcp_sock *tp = tcp_sk(sk);
+ u32 old_win = tp->rcv_wnd;
+- u32 cur_win = tcp_receive_window(tp);
+- u32 new_win = __tcp_select_window(sk);
++ /* The window must never shrink at the meta-level. At the subflow we
++ * have to allow this. Otherwise we may announce a window too large
++ * for the current meta-level sk_rcvbuf.
++ */
++ u32 cur_win = tcp_receive_window(mptcp(tp) ? tcp_sk(mptcp_meta_sk(sk)) : tp);
++ u32 new_win = tp->ops->__select_window(sk);
+
+ /* Never shrink the offered window */
+ if (new_win < cur_win) {
+@@ -290,6 +297,7 @@ static u16 tcp_select_window(struct sock *sk)
+ LINUX_MIB_TCPWANTZEROWINDOWADV);
+ new_win = ALIGN(cur_win, 1 << tp->rx_opt.rcv_wscale);
+ }
++
+ tp->rcv_wnd = new_win;
+ tp->rcv_wup = tp->rcv_nxt;
+
+@@ -374,7 +382,7 @@ static inline void TCP_ECN_send(struct sock *sk, struct sk_buff *skb,
+ /* Constructs common control bits of non-data skb. If SYN/FIN is present,
+ * auto increment end seqno.
+ */
+-static void tcp_init_nondata_skb(struct sk_buff *skb, u32 seq, u8 flags)
++void tcp_init_nondata_skb(struct sk_buff *skb, u32 seq, u8 flags)
+ {
+ struct skb_shared_info *shinfo = skb_shinfo(skb);
+
+@@ -394,7 +402,7 @@ static void tcp_init_nondata_skb(struct sk_buff *skb, u32 seq, u8 flags)
+ TCP_SKB_CB(skb)->end_seq = seq;
+ }
+
+-static inline bool tcp_urg_mode(const struct tcp_sock *tp)
++bool tcp_urg_mode(const struct tcp_sock *tp)
+ {
+ return tp->snd_una != tp->snd_up;
+ }
+@@ -404,17 +412,7 @@ static inline bool tcp_urg_mode(const struct tcp_sock *tp)
+ #define OPTION_MD5 (1 << 2)
+ #define OPTION_WSCALE (1 << 3)
+ #define OPTION_FAST_OPEN_COOKIE (1 << 8)
+-
+-struct tcp_out_options {
+- u16 options; /* bit field of OPTION_* */
+- u16 mss; /* 0 to disable */
+- u8 ws; /* window scale, 0 to disable */
+- u8 num_sack_blocks; /* number of SACK blocks to include */
+- u8 hash_size; /* bytes in hash_location */
+- __u8 *hash_location; /* temporary pointer, overloaded */
+- __u32 tsval, tsecr; /* need to include OPTION_TS */
+- struct tcp_fastopen_cookie *fastopen_cookie; /* Fast open cookie */
+-};
++/* Before adding here - take a look at OPTION_MPTCP in include/net/mptcp.h */
+
+ /* Write previously computed TCP options to the packet.
+ *
+@@ -430,7 +428,7 @@ struct tcp_out_options {
+ * (but it may well be that other scenarios fail similarly).
+ */
+ static void tcp_options_write(__be32 *ptr, struct tcp_sock *tp,
+- struct tcp_out_options *opts)
++ struct tcp_out_options *opts, struct sk_buff *skb)
+ {
+ u16 options = opts->options; /* mungable copy */
+
+@@ -513,6 +511,9 @@ static void tcp_options_write(__be32 *ptr, struct tcp_sock *tp,
+ }
+ ptr += (foc->len + 3) >> 2;
+ }
++
++ if (unlikely(OPTION_MPTCP & opts->options))
++ mptcp_options_write(ptr, tp, opts, skb);
+ }
+
+ /* Compute TCP options for SYN packets. This is not the final
+@@ -564,6 +565,8 @@ static unsigned int tcp_syn_options(struct sock *sk, struct sk_buff *skb,
+ if (unlikely(!(OPTION_TS & opts->options)))
+ remaining -= TCPOLEN_SACKPERM_ALIGNED;
+ }
++ if (tp->request_mptcp || mptcp(tp))
++ mptcp_syn_options(sk, opts, &remaining);
+
+ if (fastopen && fastopen->cookie.len >= 0) {
+ u32 need = TCPOLEN_EXP_FASTOPEN_BASE + fastopen->cookie.len;
+@@ -637,6 +640,9 @@ static unsigned int tcp_synack_options(struct sock *sk,
+ }
+ }
+
++ if (ireq->saw_mpc)
++ mptcp_synack_options(req, opts, &remaining);
++
+ return MAX_TCP_OPTION_SPACE - remaining;
+ }
+
+@@ -670,16 +676,22 @@ static unsigned int tcp_established_options(struct sock *sk, struct sk_buff *skb
+ opts->tsecr = tp->rx_opt.ts_recent;
+ size += TCPOLEN_TSTAMP_ALIGNED;
+ }
++ if (mptcp(tp))
++ mptcp_established_options(sk, skb, opts, &size);
+
+ eff_sacks = tp->rx_opt.num_sacks + tp->rx_opt.dsack;
+ if (unlikely(eff_sacks)) {
+- const unsigned int remaining = MAX_TCP_OPTION_SPACE - size;
+- opts->num_sack_blocks =
+- min_t(unsigned int, eff_sacks,
+- (remaining - TCPOLEN_SACK_BASE_ALIGNED) /
+- TCPOLEN_SACK_PERBLOCK);
+- size += TCPOLEN_SACK_BASE_ALIGNED +
+- opts->num_sack_blocks * TCPOLEN_SACK_PERBLOCK;
++ const unsigned remaining = MAX_TCP_OPTION_SPACE - size;
++ if (remaining < TCPOLEN_SACK_BASE_ALIGNED)
++ opts->num_sack_blocks = 0;
++ else
++ opts->num_sack_blocks =
++ min_t(unsigned int, eff_sacks,
++ (remaining - TCPOLEN_SACK_BASE_ALIGNED) /
++ TCPOLEN_SACK_PERBLOCK);
++ if (opts->num_sack_blocks)
++ size += TCPOLEN_SACK_BASE_ALIGNED +
++ opts->num_sack_blocks * TCPOLEN_SACK_PERBLOCK;
+ }
+
+ return size;
+@@ -711,8 +723,8 @@ static void tcp_tsq_handler(struct sock *sk)
+ if ((1 << sk->sk_state) &
+ (TCPF_ESTABLISHED | TCPF_FIN_WAIT1 | TCPF_CLOSING |
+ TCPF_CLOSE_WAIT | TCPF_LAST_ACK))
+- tcp_write_xmit(sk, tcp_current_mss(sk), tcp_sk(sk)->nonagle,
+- 0, GFP_ATOMIC);
++ tcp_sk(sk)->ops->write_xmit(sk, tcp_current_mss(sk),
++ tcp_sk(sk)->nonagle, 0, GFP_ATOMIC);
+ }
+ /*
+ * One tasklet per cpu tries to send more skbs.
+@@ -727,7 +739,7 @@ static void tcp_tasklet_func(unsigned long data)
+ unsigned long flags;
+ struct list_head *q, *n;
+ struct tcp_sock *tp;
+- struct sock *sk;
++ struct sock *sk, *meta_sk;
+
+ local_irq_save(flags);
+ list_splice_init(&tsq->head, &list);
+@@ -738,15 +750,25 @@ static void tcp_tasklet_func(unsigned long data)
+ list_del(&tp->tsq_node);
+
+ sk = (struct sock *)tp;
+- bh_lock_sock(sk);
++ meta_sk = mptcp(tp) ? mptcp_meta_sk(sk) : sk;
++ bh_lock_sock(meta_sk);
+
+- if (!sock_owned_by_user(sk)) {
++ if (!sock_owned_by_user(meta_sk)) {
+ tcp_tsq_handler(sk);
++ if (mptcp(tp))
++ tcp_tsq_handler(meta_sk);
+ } else {
++ if (mptcp(tp) && sk->sk_state == TCP_CLOSE)
++ goto exit;
++
+ /* defer the work to tcp_release_cb() */
+ set_bit(TCP_TSQ_DEFERRED, &tp->tsq_flags);
++
++ if (mptcp(tp))
++ mptcp_tsq_flags(sk);
+ }
+- bh_unlock_sock(sk);
++exit:
++ bh_unlock_sock(meta_sk);
+
+ clear_bit(TSQ_QUEUED, &tp->tsq_flags);
+ sk_free(sk);
+@@ -756,7 +778,10 @@ static void tcp_tasklet_func(unsigned long data)
+ #define TCP_DEFERRED_ALL ((1UL << TCP_TSQ_DEFERRED) | \
+ (1UL << TCP_WRITE_TIMER_DEFERRED) | \
+ (1UL << TCP_DELACK_TIMER_DEFERRED) | \
+- (1UL << TCP_MTU_REDUCED_DEFERRED))
++ (1UL << TCP_MTU_REDUCED_DEFERRED) | \
++ (1UL << MPTCP_PATH_MANAGER) | \
++ (1UL << MPTCP_SUB_DEFERRED))
++
+ /**
+ * tcp_release_cb - tcp release_sock() callback
+ * @sk: socket
+@@ -803,6 +828,13 @@ void tcp_release_cb(struct sock *sk)
+ sk->sk_prot->mtu_reduced(sk);
+ __sock_put(sk);
+ }
++ if (flags & (1UL << MPTCP_PATH_MANAGER)) {
++ if (tcp_sk(sk)->mpcb->pm_ops->release_sock)
++ tcp_sk(sk)->mpcb->pm_ops->release_sock(sk);
++ __sock_put(sk);
++ }
++ if (flags & (1UL << MPTCP_SUB_DEFERRED))
++ mptcp_tsq_sub_deferred(sk);
+ }
+ EXPORT_SYMBOL(tcp_release_cb);
+
+@@ -862,8 +894,8 @@ void tcp_wfree(struct sk_buff *skb)
+ * We are working here with either a clone of the original
+ * SKB, or a fresh unique copy made by the retransmit engine.
+ */
+-static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
+- gfp_t gfp_mask)
++int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
++ gfp_t gfp_mask)
+ {
+ const struct inet_connection_sock *icsk = inet_csk(sk);
+ struct inet_sock *inet;
+@@ -933,7 +965,7 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
+ */
+ th->window = htons(min(tp->rcv_wnd, 65535U));
+ } else {
+- th->window = htons(tcp_select_window(sk));
++ th->window = htons(tp->ops->select_window(sk));
+ }
+ th->check = 0;
+ th->urg_ptr = 0;
+@@ -949,7 +981,7 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
+ }
+ }
+
+- tcp_options_write((__be32 *)(th + 1), tp, &opts);
++ tcp_options_write((__be32 *)(th + 1), tp, &opts, skb);
+ if (likely((tcb->tcp_flags & TCPHDR_SYN) == 0))
+ TCP_ECN_send(sk, skb, tcp_header_size);
+
+@@ -988,7 +1020,7 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
+ * NOTE: probe0 timer is not checked, do not forget tcp_push_pending_frames,
+ * otherwise socket can stall.
+ */
+-static void tcp_queue_skb(struct sock *sk, struct sk_buff *skb)
++void tcp_queue_skb(struct sock *sk, struct sk_buff *skb)
+ {
+ struct tcp_sock *tp = tcp_sk(sk);
+
+@@ -1001,15 +1033,16 @@ static void tcp_queue_skb(struct sock *sk, struct sk_buff *skb)
+ }
+
+ /* Initialize TSO segments for a packet. */
+-static void tcp_set_skb_tso_segs(const struct sock *sk, struct sk_buff *skb,
+- unsigned int mss_now)
++void tcp_set_skb_tso_segs(const struct sock *sk, struct sk_buff *skb,
++ unsigned int mss_now)
+ {
+ struct skb_shared_info *shinfo = skb_shinfo(skb);
+
+ /* Make sure we own this skb before messing gso_size/gso_segs */
+ WARN_ON_ONCE(skb_cloned(skb));
+
+- if (skb->len <= mss_now || skb->ip_summed == CHECKSUM_NONE) {
++ if (skb->len <= mss_now || (is_meta_sk(sk) && !mptcp_sk_can_gso(sk)) ||
++ (!is_meta_sk(sk) && !sk_can_gso(sk)) || skb->ip_summed == CHECKSUM_NONE) {
+ /* Avoid the costly divide in the normal
+ * non-TSO case.
+ */
+@@ -1041,7 +1074,7 @@ static void tcp_adjust_fackets_out(struct sock *sk, const struct sk_buff *skb,
+ /* Pcount in the middle of the write queue got changed, we need to do various
+ * tweaks to fix counters
+ */
+-static void tcp_adjust_pcount(struct sock *sk, const struct sk_buff *skb, int decr)
++void tcp_adjust_pcount(struct sock *sk, const struct sk_buff *skb, int decr)
+ {
+ struct tcp_sock *tp = tcp_sk(sk);
+
+@@ -1164,7 +1197,7 @@ int tcp_fragment(struct sock *sk, struct sk_buff *skb, u32 len,
+ * eventually). The difference is that pulled data not copied, but
+ * immediately discarded.
+ */
+-static void __pskb_trim_head(struct sk_buff *skb, int len)
++void __pskb_trim_head(struct sk_buff *skb, int len)
+ {
+ struct skb_shared_info *shinfo;
+ int i, k, eat;
+@@ -1205,6 +1238,9 @@ static void __pskb_trim_head(struct sk_buff *skb, int len)
+ /* Remove acked data from a packet in the transmit queue. */
+ int tcp_trim_head(struct sock *sk, struct sk_buff *skb, u32 len)
+ {
++ if (mptcp(tcp_sk(sk)) && !is_meta_sk(sk) && mptcp_is_data_seq(skb))
++ return mptcp_trim_head(sk, skb, len);
++
+ if (skb_unclone(skb, GFP_ATOMIC))
+ return -ENOMEM;
+
+@@ -1222,6 +1258,15 @@ int tcp_trim_head(struct sock *sk, struct sk_buff *skb, u32 len)
+ if (tcp_skb_pcount(skb) > 1)
+ tcp_set_skb_tso_segs(sk, skb, tcp_skb_mss(skb));
+
++#ifdef CONFIG_MPTCP
++ /* Some data got acked - we assume that the seq-number reached the dest.
++ * Anyway, our MPTCP-option has been trimmed above - we lost it here.
++ * Only remove the SEQ if the call does not come from a meta retransmit.
++ */
++ if (mptcp(tcp_sk(sk)) && !is_meta_sk(sk))
++ TCP_SKB_CB(skb)->mptcp_flags &= ~MPTCPHDR_SEQ;
++#endif
++
+ return 0;
+ }
+
+@@ -1379,6 +1424,7 @@ unsigned int tcp_current_mss(struct sock *sk)
+
+ return mss_now;
+ }
++EXPORT_SYMBOL(tcp_current_mss);
+
+ /* RFC2861, slow part. Adjust cwnd, after it was not full during one rto.
+ * As additional protections, we do not touch cwnd in retransmission phases,
+@@ -1446,8 +1492,8 @@ static bool tcp_minshall_check(const struct tcp_sock *tp)
+ * But we can avoid doing the divide again given we already have
+ * skb_pcount = skb->len / mss_now
+ */
+-static void tcp_minshall_update(struct tcp_sock *tp, unsigned int mss_now,
+- const struct sk_buff *skb)
++void tcp_minshall_update(struct tcp_sock *tp, unsigned int mss_now,
++ const struct sk_buff *skb)
+ {
+ if (skb->len < tcp_skb_pcount(skb) * mss_now)
+ tp->snd_sml = TCP_SKB_CB(skb)->end_seq;
+@@ -1468,11 +1514,11 @@ static bool tcp_nagle_check(bool partial, const struct tcp_sock *tp,
+ (!nonagle && tp->packets_out && tcp_minshall_check(tp)));
+ }
+ /* Returns the portion of skb which can be sent right away */
+-static unsigned int tcp_mss_split_point(const struct sock *sk,
+- const struct sk_buff *skb,
+- unsigned int mss_now,
+- unsigned int max_segs,
+- int nonagle)
++unsigned int tcp_mss_split_point(const struct sock *sk,
++ const struct sk_buff *skb,
++ unsigned int mss_now,
++ unsigned int max_segs,
++ int nonagle)
+ {
+ const struct tcp_sock *tp = tcp_sk(sk);
+ u32 partial, needed, window, max_len;
+@@ -1502,13 +1548,14 @@ static unsigned int tcp_mss_split_point(const struct sock *sk,
+ /* Can at least one segment of SKB be sent right now, according to the
+ * congestion window rules? If so, return how many segments are allowed.
+ */
+-static inline unsigned int tcp_cwnd_test(const struct tcp_sock *tp,
+- const struct sk_buff *skb)
++unsigned int tcp_cwnd_test(const struct tcp_sock *tp,
++ const struct sk_buff *skb)
+ {
+ u32 in_flight, cwnd;
+
+ /* Don't be strict about the congestion window for the final FIN. */
+- if ((TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN) &&
++ if (skb &&
++ (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN) &&
+ tcp_skb_pcount(skb) == 1)
+ return 1;
+
+@@ -1524,8 +1571,8 @@ static inline unsigned int tcp_cwnd_test(const struct tcp_sock *tp,
+ * This must be invoked the first time we consider transmitting
+ * SKB onto the wire.
+ */
+-static int tcp_init_tso_segs(const struct sock *sk, struct sk_buff *skb,
+- unsigned int mss_now)
++int tcp_init_tso_segs(const struct sock *sk, struct sk_buff *skb,
++ unsigned int mss_now)
+ {
+ int tso_segs = tcp_skb_pcount(skb);
+
+@@ -1540,8 +1587,8 @@ static int tcp_init_tso_segs(const struct sock *sk, struct sk_buff *skb,
+ /* Return true if the Nagle test allows this packet to be
+ * sent now.
+ */
+-static inline bool tcp_nagle_test(const struct tcp_sock *tp, const struct sk_buff *skb,
+- unsigned int cur_mss, int nonagle)
++bool tcp_nagle_test(const struct tcp_sock *tp, const struct sk_buff *skb,
++ unsigned int cur_mss, int nonagle)
+ {
+ /* Nagle rule does not apply to frames, which sit in the middle of the
+ * write_queue (they have no chances to get new data).
+@@ -1553,7 +1600,8 @@ static inline bool tcp_nagle_test(const struct tcp_sock *tp, const struct sk_buf
+ return true;
+
+ /* Don't use the nagle rule for urgent data (or for the final FIN). */
+- if (tcp_urg_mode(tp) || (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN))
++ if (tcp_urg_mode(tp) || (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN) ||
++ mptcp_is_data_fin(skb))
+ return true;
+
+ if (!tcp_nagle_check(skb->len < cur_mss, tp, nonagle))
+@@ -1563,9 +1611,8 @@ static inline bool tcp_nagle_test(const struct tcp_sock *tp, const struct sk_buf
+ }
+
+ /* Does at least the first segment of SKB fit into the send window? */
+-static bool tcp_snd_wnd_test(const struct tcp_sock *tp,
+- const struct sk_buff *skb,
+- unsigned int cur_mss)
++bool tcp_snd_wnd_test(const struct tcp_sock *tp, const struct sk_buff *skb,
++ unsigned int cur_mss)
+ {
+ u32 end_seq = TCP_SKB_CB(skb)->end_seq;
+
+@@ -1676,7 +1723,7 @@ static bool tcp_tso_should_defer(struct sock *sk, struct sk_buff *skb,
+ u32 send_win, cong_win, limit, in_flight;
+ int win_divisor;
+
+- if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN)
++ if ((TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN) || mptcp_is_data_fin(skb))
+ goto send_now;
+
+ if (icsk->icsk_ca_state != TCP_CA_Open)
+@@ -1888,7 +1935,7 @@ static int tcp_mtu_probe(struct sock *sk)
+ * Returns true, if no segments are in flight and we have queued segments,
+ * but cannot send anything now because of SWS or another problem.
+ */
+-static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
++bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
+ int push_one, gfp_t gfp)
+ {
+ struct tcp_sock *tp = tcp_sk(sk);
+@@ -1900,7 +1947,11 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
+
+ sent_pkts = 0;
+
+- if (!push_one) {
++ /* pmtu not yet supported with MPTCP. Should be possible, by early
++ * exiting the loop inside tcp_mtu_probe, making sure that only one
++ * single DSS-mapping gets probed.
++ */
++ if (!push_one && !mptcp(tp)) {
+ /* Do MTU probing. */
+ result = tcp_mtu_probe(sk);
+ if (!result) {
+@@ -2099,7 +2150,8 @@ void tcp_send_loss_probe(struct sock *sk)
+ int err = -1;
+
+ if (tcp_send_head(sk) != NULL) {
+- err = tcp_write_xmit(sk, mss, TCP_NAGLE_OFF, 2, GFP_ATOMIC);
++ err = tp->ops->write_xmit(sk, mss, TCP_NAGLE_OFF, 2,
++ GFP_ATOMIC);
+ goto rearm_timer;
+ }
+
+@@ -2159,8 +2211,8 @@ void __tcp_push_pending_frames(struct sock *sk, unsigned int cur_mss,
+ if (unlikely(sk->sk_state == TCP_CLOSE))
+ return;
+
+- if (tcp_write_xmit(sk, cur_mss, nonagle, 0,
+- sk_gfp_atomic(sk, GFP_ATOMIC)))
++ if (tcp_sk(sk)->ops->write_xmit(sk, cur_mss, nonagle, 0,
++ sk_gfp_atomic(sk, GFP_ATOMIC)))
+ tcp_check_probe_timer(sk);
+ }
+
+@@ -2173,7 +2225,8 @@ void tcp_push_one(struct sock *sk, unsigned int mss_now)
+
+ BUG_ON(!skb || skb->len < mss_now);
+
+- tcp_write_xmit(sk, mss_now, TCP_NAGLE_PUSH, 1, sk->sk_allocation);
++ tcp_sk(sk)->ops->write_xmit(sk, mss_now, TCP_NAGLE_PUSH, 1,
++ sk->sk_allocation);
+ }
+
+ /* This function returns the amount that we can raise the
+@@ -2386,6 +2439,10 @@ static void tcp_retrans_try_collapse(struct sock *sk, struct sk_buff *to,
+ if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_SYN)
+ return;
+
++ /* Currently not supported for MPTCP - but it should be possible */
++ if (mptcp(tp))
++ return;
++
+ tcp_for_write_queue_from_safe(skb, tmp, sk) {
+ if (!tcp_can_collapse(sk, skb))
+ break;
+@@ -2843,7 +2900,7 @@ struct sk_buff *tcp_make_synack(struct sock *sk, struct dst_entry *dst,
+
+ /* RFC1323: The window in SYN & SYN/ACK segments is never scaled. */
+ th->window = htons(min(req->rcv_wnd, 65535U));
+- tcp_options_write((__be32 *)(th + 1), tp, &opts);
++ tcp_options_write((__be32 *)(th + 1), tp, &opts, skb);
+ th->doff = (tcp_header_size >> 2);
+ TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_OUTSEGS);
+
+@@ -2897,13 +2954,13 @@ static void tcp_connect_init(struct sock *sk)
+ (tp->window_clamp > tcp_full_space(sk) || tp->window_clamp == 0))
+ tp->window_clamp = tcp_full_space(sk);
+
+- tcp_select_initial_window(tcp_full_space(sk),
+- tp->advmss - (tp->rx_opt.ts_recent_stamp ? tp->tcp_header_len - sizeof(struct tcphdr) : 0),
+- &tp->rcv_wnd,
+- &tp->window_clamp,
+- sysctl_tcp_window_scaling,
+- &rcv_wscale,
+- dst_metric(dst, RTAX_INITRWND));
++ tp->ops->select_initial_window(tcp_full_space(sk),
++ tp->advmss - (tp->rx_opt.ts_recent_stamp ? tp->tcp_header_len - sizeof(struct tcphdr) : 0),
++ &tp->rcv_wnd,
++ &tp->window_clamp,
++ sysctl_tcp_window_scaling,
++ &rcv_wscale,
++ dst_metric(dst, RTAX_INITRWND), sk);
+
+ tp->rx_opt.rcv_wscale = rcv_wscale;
+ tp->rcv_ssthresh = tp->rcv_wnd;
+@@ -2927,6 +2984,36 @@ static void tcp_connect_init(struct sock *sk)
+ inet_csk(sk)->icsk_rto = TCP_TIMEOUT_INIT;
+ inet_csk(sk)->icsk_retransmits = 0;
+ tcp_clear_retrans(tp);
++
++#ifdef CONFIG_MPTCP
++ if (sysctl_mptcp_enabled && mptcp_doit(sk)) {
++ if (is_master_tp(tp)) {
++ tp->request_mptcp = 1;
++ mptcp_connect_init(sk);
++ } else if (tp->mptcp) {
++ struct inet_sock *inet = inet_sk(sk);
++
++ tp->mptcp->snt_isn = tp->write_seq;
++ tp->mptcp->init_rcv_wnd = tp->rcv_wnd;
++
++ /* Set nonce for new subflows */
++ if (sk->sk_family == AF_INET)
++ tp->mptcp->mptcp_loc_nonce = mptcp_v4_get_nonce(
++ inet->inet_saddr,
++ inet->inet_daddr,
++ inet->inet_sport,
++ inet->inet_dport);
++#if IS_ENABLED(CONFIG_IPV6)
++ else
++ tp->mptcp->mptcp_loc_nonce = mptcp_v6_get_nonce(
++ inet6_sk(sk)->saddr.s6_addr32,
++ sk->sk_v6_daddr.s6_addr32,
++ inet->inet_sport,
++ inet->inet_dport);
++#endif
++ }
++ }
++#endif
+ }
+
+ static void tcp_connect_queue_skb(struct sock *sk, struct sk_buff *skb)
+@@ -3176,6 +3263,7 @@ void tcp_send_ack(struct sock *sk)
+ TCP_SKB_CB(buff)->when = tcp_time_stamp;
+ tcp_transmit_skb(sk, buff, 0, sk_gfp_atomic(sk, GFP_ATOMIC));
+ }
++EXPORT_SYMBOL(tcp_send_ack);
+
+ /* This routine sends a packet with an out of date sequence
+ * number. It assumes the other end will try to ack it.
+@@ -3188,7 +3276,7 @@ void tcp_send_ack(struct sock *sk)
+ * one is with SEG.SEQ=SND.UNA to deliver urgent pointer, another is
+ * out-of-date with SND.UNA-1 to probe window.
+ */
+-static int tcp_xmit_probe_skb(struct sock *sk, int urgent)
++int tcp_xmit_probe_skb(struct sock *sk, int urgent)
+ {
+ struct tcp_sock *tp = tcp_sk(sk);
+ struct sk_buff *skb;
+@@ -3270,7 +3358,7 @@ void tcp_send_probe0(struct sock *sk)
+ struct tcp_sock *tp = tcp_sk(sk);
+ int err;
+
+- err = tcp_write_wakeup(sk);
++ err = tp->ops->write_wakeup(sk);
+
+ if (tp->packets_out || !tcp_send_head(sk)) {
+ /* Cancel probe timer, if it is not required. */
+@@ -3301,3 +3389,18 @@ void tcp_send_probe0(struct sock *sk)
+ TCP_RTO_MAX);
+ }
+ }
++
++int tcp_rtx_synack(struct sock *sk, struct request_sock *req)
++{
++ const struct tcp_request_sock_ops *af_ops = tcp_rsk(req)->af_specific;
++ struct flowi fl;
++ int res;
++
++ res = af_ops->send_synack(sk, NULL, &fl, req, 0, NULL);
++ if (!res) {
++ TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_RETRANSSEGS);
++ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPSYNRETRANS);
++ }
++ return res;
++}
++EXPORT_SYMBOL(tcp_rtx_synack);
+diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
+index 286227abed10..966b873cbf3e 100644
+--- a/net/ipv4/tcp_timer.c
++++ b/net/ipv4/tcp_timer.c
+@@ -20,6 +20,7 @@
+
+ #include <linux/module.h>
+ #include <linux/gfp.h>
++#include <net/mptcp.h>
+ #include <net/tcp.h>
+
+ int sysctl_tcp_syn_retries __read_mostly = TCP_SYN_RETRIES;
+@@ -32,7 +33,7 @@ int sysctl_tcp_retries2 __read_mostly = TCP_RETR2;
+ int sysctl_tcp_orphan_retries __read_mostly;
+ int sysctl_tcp_thin_linear_timeouts __read_mostly;
+
+-static void tcp_write_err(struct sock *sk)
++void tcp_write_err(struct sock *sk)
+ {
+ sk->sk_err = sk->sk_err_soft ? : ETIMEDOUT;
+ sk->sk_error_report(sk);
+@@ -74,7 +75,7 @@ static int tcp_out_of_resources(struct sock *sk, int do_reset)
+ (!tp->snd_wnd && !tp->packets_out))
+ do_reset = 1;
+ if (do_reset)
+- tcp_send_active_reset(sk, GFP_ATOMIC);
++ tp->ops->send_active_reset(sk, GFP_ATOMIC);
+ tcp_done(sk);
+ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPABORTONMEMORY);
+ return 1;
+@@ -124,10 +125,8 @@ static void tcp_mtu_probing(struct inet_connection_sock *icsk, struct sock *sk)
+ * retransmissions with an initial RTO of TCP_RTO_MIN or TCP_TIMEOUT_INIT if
+ * syn_set flag is set.
+ */
+-static bool retransmits_timed_out(struct sock *sk,
+- unsigned int boundary,
+- unsigned int timeout,
+- bool syn_set)
++bool retransmits_timed_out(struct sock *sk, unsigned int boundary,
++ unsigned int timeout, bool syn_set)
+ {
+ unsigned int linear_backoff_thresh, start_ts;
+ unsigned int rto_base = syn_set ? TCP_TIMEOUT_INIT : TCP_RTO_MIN;
+@@ -153,7 +152,7 @@ static bool retransmits_timed_out(struct sock *sk,
+ }
+
+ /* A write timeout has occurred. Process the after effects. */
+-static int tcp_write_timeout(struct sock *sk)
++int tcp_write_timeout(struct sock *sk)
+ {
+ struct inet_connection_sock *icsk = inet_csk(sk);
+ struct tcp_sock *tp = tcp_sk(sk);
+@@ -171,6 +170,10 @@ static int tcp_write_timeout(struct sock *sk)
+ }
+ retry_until = icsk->icsk_syn_retries ? : sysctl_tcp_syn_retries;
+ syn_set = true;
++ /* Stop retransmitting MP_CAPABLE options in SYN if timed out. */
++ if (tcp_sk(sk)->request_mptcp &&
++ icsk->icsk_retransmits >= mptcp_sysctl_syn_retries())
++ tcp_sk(sk)->request_mptcp = 0;
+ } else {
+ if (retransmits_timed_out(sk, sysctl_tcp_retries1, 0, 0)) {
+ /* Black hole detection */
+@@ -251,18 +254,22 @@ out:
+ static void tcp_delack_timer(unsigned long data)
+ {
+ struct sock *sk = (struct sock *)data;
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct sock *meta_sk = mptcp(tp) ? mptcp_meta_sk(sk) : sk;
+
+- bh_lock_sock(sk);
+- if (!sock_owned_by_user(sk)) {
++ bh_lock_sock(meta_sk);
++ if (!sock_owned_by_user(meta_sk)) {
+ tcp_delack_timer_handler(sk);
+ } else {
+ inet_csk(sk)->icsk_ack.blocked = 1;
+- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_DELAYEDACKLOCKED);
++ NET_INC_STATS_BH(sock_net(meta_sk), LINUX_MIB_DELAYEDACKLOCKED);
+ /* deleguate our work to tcp_release_cb() */
+ if (!test_and_set_bit(TCP_DELACK_TIMER_DEFERRED, &tcp_sk(sk)->tsq_flags))
+ sock_hold(sk);
++ if (mptcp(tp))
++ mptcp_tsq_flags(sk);
+ }
+- bh_unlock_sock(sk);
++ bh_unlock_sock(meta_sk);
+ sock_put(sk);
+ }
+
+@@ -479,6 +486,10 @@ out_reset_timer:
+ __sk_dst_reset(sk);
+
+ out:;
++ if (mptcp(tp)) {
++ mptcp_reinject_data(sk, 1);
++ mptcp_set_rto(sk);
++ }
+ }
+
+ void tcp_write_timer_handler(struct sock *sk)
+@@ -505,7 +516,7 @@ void tcp_write_timer_handler(struct sock *sk)
+ break;
+ case ICSK_TIME_RETRANS:
+ icsk->icsk_pending = 0;
+- tcp_retransmit_timer(sk);
++ tcp_sk(sk)->ops->retransmit_timer(sk);
+ break;
+ case ICSK_TIME_PROBE0:
+ icsk->icsk_pending = 0;
+@@ -520,16 +531,19 @@ out:
+ static void tcp_write_timer(unsigned long data)
+ {
+ struct sock *sk = (struct sock *)data;
++ struct sock *meta_sk = mptcp(tcp_sk(sk)) ? mptcp_meta_sk(sk) : sk;
+
+- bh_lock_sock(sk);
+- if (!sock_owned_by_user(sk)) {
++ bh_lock_sock(meta_sk);
++ if (!sock_owned_by_user(meta_sk)) {
+ tcp_write_timer_handler(sk);
+ } else {
+ /* deleguate our work to tcp_release_cb() */
+ if (!test_and_set_bit(TCP_WRITE_TIMER_DEFERRED, &tcp_sk(sk)->tsq_flags))
+ sock_hold(sk);
++ if (mptcp(tcp_sk(sk)))
++ mptcp_tsq_flags(sk);
+ }
+- bh_unlock_sock(sk);
++ bh_unlock_sock(meta_sk);
+ sock_put(sk);
+ }
+
+@@ -566,11 +580,12 @@ static void tcp_keepalive_timer (unsigned long data)
+ struct sock *sk = (struct sock *) data;
+ struct inet_connection_sock *icsk = inet_csk(sk);
+ struct tcp_sock *tp = tcp_sk(sk);
++ struct sock *meta_sk = mptcp(tp) ? mptcp_meta_sk(sk) : sk;
+ u32 elapsed;
+
+ /* Only process if socket is not in use. */
+- bh_lock_sock(sk);
+- if (sock_owned_by_user(sk)) {
++ bh_lock_sock(meta_sk);
++ if (sock_owned_by_user(meta_sk)) {
+ /* Try again later. */
+ inet_csk_reset_keepalive_timer (sk, HZ/20);
+ goto out;
+@@ -581,16 +596,38 @@ static void tcp_keepalive_timer (unsigned long data)
+ goto out;
+ }
+
++ if (tp->send_mp_fclose) {
++ /* MUST do this before tcp_write_timeout, because retrans_stamp
++ * may have been set to 0 in another part while we are
++ * retransmitting MP_FASTCLOSE. Then, we would crash, because
++ * retransmits_timed_out accesses the meta-write-queue.
++ *
++ * We make sure that the timestamp is != 0.
++ */
++ if (!tp->retrans_stamp)
++ tp->retrans_stamp = tcp_time_stamp ? : 1;
++
++ if (tcp_write_timeout(sk))
++ goto out;
++
++ tcp_send_ack(sk);
++ icsk->icsk_retransmits++;
++
++ icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX);
++ elapsed = icsk->icsk_rto;
++ goto resched;
++ }
++
+ if (sk->sk_state == TCP_FIN_WAIT2 && sock_flag(sk, SOCK_DEAD)) {
+ if (tp->linger2 >= 0) {
+ const int tmo = tcp_fin_time(sk) - TCP_TIMEWAIT_LEN;
+
+ if (tmo > 0) {
+- tcp_time_wait(sk, TCP_FIN_WAIT2, tmo);
++ tp->ops->time_wait(sk, TCP_FIN_WAIT2, tmo);
+ goto out;
+ }
+ }
+- tcp_send_active_reset(sk, GFP_ATOMIC);
++ tp->ops->send_active_reset(sk, GFP_ATOMIC);
+ goto death;
+ }
+
+@@ -614,11 +651,11 @@ static void tcp_keepalive_timer (unsigned long data)
+ icsk->icsk_probes_out > 0) ||
+ (icsk->icsk_user_timeout == 0 &&
+ icsk->icsk_probes_out >= keepalive_probes(tp))) {
+- tcp_send_active_reset(sk, GFP_ATOMIC);
++ tp->ops->send_active_reset(sk, GFP_ATOMIC);
+ tcp_write_err(sk);
+ goto out;
+ }
+- if (tcp_write_wakeup(sk) <= 0) {
++ if (tp->ops->write_wakeup(sk) <= 0) {
+ icsk->icsk_probes_out++;
+ elapsed = keepalive_intvl_when(tp);
+ } else {
+@@ -642,7 +679,7 @@ death:
+ tcp_done(sk);
+
+ out:
+- bh_unlock_sock(sk);
++ bh_unlock_sock(meta_sk);
+ sock_put(sk);
+ }
+
+diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
+index 5667b3003af9..7139c2973fd2 100644
+--- a/net/ipv6/addrconf.c
++++ b/net/ipv6/addrconf.c
+@@ -760,6 +760,7 @@ void inet6_ifa_finish_destroy(struct inet6_ifaddr *ifp)
+
+ kfree_rcu(ifp, rcu);
+ }
++EXPORT_SYMBOL(inet6_ifa_finish_destroy);
+
+ static void
+ ipv6_link_dev_addr(struct inet6_dev *idev, struct inet6_ifaddr *ifp)
+diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
+index 7cb4392690dd..7057afbca4df 100644
+--- a/net/ipv6/af_inet6.c
++++ b/net/ipv6/af_inet6.c
+@@ -97,8 +97,7 @@ static __inline__ struct ipv6_pinfo *inet6_sk_generic(struct sock *sk)
+ return (struct ipv6_pinfo *)(((u8 *)sk) + offset);
+ }
+
+-static int inet6_create(struct net *net, struct socket *sock, int protocol,
+- int kern)
++int inet6_create(struct net *net, struct socket *sock, int protocol, int kern)
+ {
+ struct inet_sock *inet;
+ struct ipv6_pinfo *np;
+diff --git a/net/ipv6/inet6_connection_sock.c b/net/ipv6/inet6_connection_sock.c
+index a245e5ddffbd..99c892b8992d 100644
+--- a/net/ipv6/inet6_connection_sock.c
++++ b/net/ipv6/inet6_connection_sock.c
+@@ -96,8 +96,8 @@ struct dst_entry *inet6_csk_route_req(struct sock *sk,
+ /*
+ * request_sock (formerly open request) hash tables.
+ */
+-static u32 inet6_synq_hash(const struct in6_addr *raddr, const __be16 rport,
+- const u32 rnd, const u32 synq_hsize)
++u32 inet6_synq_hash(const struct in6_addr *raddr, const __be16 rport,
++ const u32 rnd, const u32 synq_hsize)
+ {
+ u32 c;
+
+diff --git a/net/ipv6/ipv6_sockglue.c b/net/ipv6/ipv6_sockglue.c
+index edb58aff4ae7..ea4d9fda0927 100644
+--- a/net/ipv6/ipv6_sockglue.c
++++ b/net/ipv6/ipv6_sockglue.c
+@@ -48,6 +48,8 @@
+ #include <net/addrconf.h>
+ #include <net/inet_common.h>
+ #include <net/tcp.h>
++#include <net/mptcp.h>
++#include <net/mptcp_v4.h>
+ #include <net/udp.h>
+ #include <net/udplite.h>
+ #include <net/xfrm.h>
+@@ -196,7 +198,12 @@ static int do_ipv6_setsockopt(struct sock *sk, int level, int optname,
+ sock_prot_inuse_add(net, &tcp_prot, 1);
+ local_bh_enable();
+ sk->sk_prot = &tcp_prot;
+- icsk->icsk_af_ops = &ipv4_specific;
++#ifdef CONFIG_MPTCP
++ if (is_mptcp_enabled(sk))
++ icsk->icsk_af_ops = &mptcp_v4_specific;
++ else
++#endif
++ icsk->icsk_af_ops = &ipv4_specific;
+ sk->sk_socket->ops = &inet_stream_ops;
+ sk->sk_family = PF_INET;
+ tcp_sync_mss(sk, icsk->icsk_pmtu_cookie);
+diff --git a/net/ipv6/syncookies.c b/net/ipv6/syncookies.c
+index a822b880689b..b2b38869d795 100644
+--- a/net/ipv6/syncookies.c
++++ b/net/ipv6/syncookies.c
+@@ -181,13 +181,13 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
+
+ /* check for timestamp cookie support */
+ memset(&tcp_opt, 0, sizeof(tcp_opt));
+- tcp_parse_options(skb, &tcp_opt, 0, NULL);
++ tcp_parse_options(skb, &tcp_opt, NULL, 0, NULL);
+
+ if (!cookie_check_timestamp(&tcp_opt, sock_net(sk), &ecn_ok))
+ goto out;
+
+ ret = NULL;
+- req = inet6_reqsk_alloc(&tcp6_request_sock_ops);
++ req = inet_reqsk_alloc(&tcp6_request_sock_ops);
+ if (!req)
+ goto out;
+
+@@ -255,10 +255,10 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
+ }
+
+ req->window_clamp = tp->window_clamp ? :dst_metric(dst, RTAX_WINDOW);
+- tcp_select_initial_window(tcp_full_space(sk), req->mss,
+- &req->rcv_wnd, &req->window_clamp,
+- ireq->wscale_ok, &rcv_wscale,
+- dst_metric(dst, RTAX_INITRWND));
++ tp->ops->select_initial_window(tcp_full_space(sk), req->mss,
++ &req->rcv_wnd, &req->window_clamp,
++ ireq->wscale_ok, &rcv_wscale,
++ dst_metric(dst, RTAX_INITRWND), sk);
+
+ ireq->rcv_wscale = rcv_wscale;
+
+diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
+index 229239ad96b1..fda94d71666e 100644
+--- a/net/ipv6/tcp_ipv6.c
++++ b/net/ipv6/tcp_ipv6.c
+@@ -63,6 +63,8 @@
+ #include <net/inet_common.h>
+ #include <net/secure_seq.h>
+ #include <net/tcp_memcontrol.h>
++#include <net/mptcp.h>
++#include <net/mptcp_v6.h>
+ #include <net/busy_poll.h>
+
+ #include <linux/proc_fs.h>
+@@ -71,12 +73,6 @@
+ #include <linux/crypto.h>
+ #include <linux/scatterlist.h>
+
+-static void tcp_v6_send_reset(struct sock *sk, struct sk_buff *skb);
+-static void tcp_v6_reqsk_send_ack(struct sock *sk, struct sk_buff *skb,
+- struct request_sock *req);
+-
+-static int tcp_v6_do_rcv(struct sock *sk, struct sk_buff *skb);
+-
+ static const struct inet_connection_sock_af_ops ipv6_mapped;
+ static const struct inet_connection_sock_af_ops ipv6_specific;
+ #ifdef CONFIG_TCP_MD5SIG
+@@ -90,7 +86,7 @@ static struct tcp_md5sig_key *tcp_v6_md5_do_lookup(struct sock *sk,
+ }
+ #endif
+
+-static void inet6_sk_rx_dst_set(struct sock *sk, const struct sk_buff *skb)
++void inet6_sk_rx_dst_set(struct sock *sk, const struct sk_buff *skb)
+ {
+ struct dst_entry *dst = skb_dst(skb);
+ const struct rt6_info *rt = (const struct rt6_info *)dst;
+@@ -102,10 +98,11 @@ static void inet6_sk_rx_dst_set(struct sock *sk, const struct sk_buff *skb)
+ inet6_sk(sk)->rx_dst_cookie = rt->rt6i_node->fn_sernum;
+ }
+
+-static void tcp_v6_hash(struct sock *sk)
++void tcp_v6_hash(struct sock *sk)
+ {
+ if (sk->sk_state != TCP_CLOSE) {
+- if (inet_csk(sk)->icsk_af_ops == &ipv6_mapped) {
++ if (inet_csk(sk)->icsk_af_ops == &ipv6_mapped ||
++ inet_csk(sk)->icsk_af_ops == &mptcp_v6_mapped) {
+ tcp_prot.hash(sk);
+ return;
+ }
+@@ -115,7 +112,7 @@ static void tcp_v6_hash(struct sock *sk)
+ }
+ }
+
+-static __u32 tcp_v6_init_sequence(const struct sk_buff *skb)
++__u32 tcp_v6_init_sequence(const struct sk_buff *skb)
+ {
+ return secure_tcpv6_sequence_number(ipv6_hdr(skb)->daddr.s6_addr32,
+ ipv6_hdr(skb)->saddr.s6_addr32,
+@@ -123,7 +120,7 @@ static __u32 tcp_v6_init_sequence(const struct sk_buff *skb)
+ tcp_hdr(skb)->source);
+ }
+
+-static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
++int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
+ int addr_len)
+ {
+ struct sockaddr_in6 *usin = (struct sockaddr_in6 *) uaddr;
+@@ -215,7 +212,12 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
+ sin.sin_port = usin->sin6_port;
+ sin.sin_addr.s_addr = usin->sin6_addr.s6_addr32[3];
+
+- icsk->icsk_af_ops = &ipv6_mapped;
++#ifdef CONFIG_MPTCP
++ if (is_mptcp_enabled(sk))
++ icsk->icsk_af_ops = &mptcp_v6_mapped;
++ else
++#endif
++ icsk->icsk_af_ops = &ipv6_mapped;
+ sk->sk_backlog_rcv = tcp_v4_do_rcv;
+ #ifdef CONFIG_TCP_MD5SIG
+ tp->af_specific = &tcp_sock_ipv6_mapped_specific;
+@@ -225,7 +227,12 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
+
+ if (err) {
+ icsk->icsk_ext_hdr_len = exthdrlen;
+- icsk->icsk_af_ops = &ipv6_specific;
++#ifdef CONFIG_MPTCP
++ if (is_mptcp_enabled(sk))
++ icsk->icsk_af_ops = &mptcp_v6_specific;
++ else
++#endif
++ icsk->icsk_af_ops = &ipv6_specific;
+ sk->sk_backlog_rcv = tcp_v6_do_rcv;
+ #ifdef CONFIG_TCP_MD5SIG
+ tp->af_specific = &tcp_sock_ipv6_specific;
+@@ -337,7 +344,7 @@ static void tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
+ const struct ipv6hdr *hdr = (const struct ipv6hdr *)skb->data;
+ const struct tcphdr *th = (struct tcphdr *)(skb->data+offset);
+ struct ipv6_pinfo *np;
+- struct sock *sk;
++ struct sock *sk, *meta_sk;
+ int err;
+ struct tcp_sock *tp;
+ struct request_sock *fastopen;
+@@ -358,8 +365,14 @@ static void tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
+ return;
+ }
+
+- bh_lock_sock(sk);
+- if (sock_owned_by_user(sk) && type != ICMPV6_PKT_TOOBIG)
++ tp = tcp_sk(sk);
++ if (mptcp(tp))
++ meta_sk = mptcp_meta_sk(sk);
++ else
++ meta_sk = sk;
++
++ bh_lock_sock(meta_sk);
++ if (sock_owned_by_user(meta_sk) && type != ICMPV6_PKT_TOOBIG)
+ NET_INC_STATS_BH(net, LINUX_MIB_LOCKDROPPEDICMPS);
+
+ if (sk->sk_state == TCP_CLOSE)
+@@ -370,7 +383,6 @@ static void tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
+ goto out;
+ }
+
+- tp = tcp_sk(sk);
+ seq = ntohl(th->seq);
+ /* XXX (TFO) - tp->snd_una should be ISN (tcp_create_openreq_child() */
+ fastopen = tp->fastopen_rsk;
+@@ -403,11 +415,15 @@ static void tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
+ goto out;
+
+ tp->mtu_info = ntohl(info);
+- if (!sock_owned_by_user(sk))
++ if (!sock_owned_by_user(meta_sk))
+ tcp_v6_mtu_reduced(sk);
+- else if (!test_and_set_bit(TCP_MTU_REDUCED_DEFERRED,
++ else {
++ if (!test_and_set_bit(TCP_MTU_REDUCED_DEFERRED,
+ &tp->tsq_flags))
+- sock_hold(sk);
++ sock_hold(sk);
++ if (mptcp(tp))
++ mptcp_tsq_flags(sk);
++ }
+ goto out;
+ }
+
+@@ -417,7 +433,7 @@ static void tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
+ switch (sk->sk_state) {
+ struct request_sock *req, **prev;
+ case TCP_LISTEN:
+- if (sock_owned_by_user(sk))
++ if (sock_owned_by_user(meta_sk))
+ goto out;
+
+ req = inet6_csk_search_req(sk, &prev, th->dest, &hdr->daddr,
+@@ -447,7 +463,7 @@ static void tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
+ if (fastopen && fastopen->sk == NULL)
+ break;
+
+- if (!sock_owned_by_user(sk)) {
++ if (!sock_owned_by_user(meta_sk)) {
+ sk->sk_err = err;
+ sk->sk_error_report(sk); /* Wake people up to see the error (see connect in sock.c) */
+
+@@ -457,26 +473,27 @@ static void tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
+ goto out;
+ }
+
+- if (!sock_owned_by_user(sk) && np->recverr) {
++ if (!sock_owned_by_user(meta_sk) && np->recverr) {
+ sk->sk_err = err;
+ sk->sk_error_report(sk);
+ } else
+ sk->sk_err_soft = err;
+
+ out:
+- bh_unlock_sock(sk);
++ bh_unlock_sock(meta_sk);
+ sock_put(sk);
+ }
+
+
+-static int tcp_v6_send_synack(struct sock *sk, struct dst_entry *dst,
+- struct flowi6 *fl6,
+- struct request_sock *req,
+- u16 queue_mapping,
+- struct tcp_fastopen_cookie *foc)
++int tcp_v6_send_synack(struct sock *sk, struct dst_entry *dst,
++ struct flowi *fl,
++ struct request_sock *req,
++ u16 queue_mapping,
++ struct tcp_fastopen_cookie *foc)
+ {
+ struct inet_request_sock *ireq = inet_rsk(req);
+ struct ipv6_pinfo *np = inet6_sk(sk);
++ struct flowi6 *fl6 = &fl->u.ip6;
+ struct sk_buff *skb;
+ int err = -ENOMEM;
+
+@@ -497,18 +514,21 @@ static int tcp_v6_send_synack(struct sock *sk, struct dst_entry *dst,
+ skb_set_queue_mapping(skb, queue_mapping);
+ err = ip6_xmit(sk, skb, fl6, np->opt, np->tclass);
+ err = net_xmit_eval(err);
++ if (!tcp_rsk(req)->snt_synack && !err)
++ tcp_rsk(req)->snt_synack = tcp_time_stamp;
+ }
+
+ done:
+ return err;
+ }
+
+-static int tcp_v6_rtx_synack(struct sock *sk, struct request_sock *req)
++int tcp_v6_rtx_synack(struct sock *sk, struct request_sock *req)
+ {
+- struct flowi6 fl6;
++ const struct tcp_request_sock_ops *af_ops = tcp_rsk(req)->af_specific;
++ struct flowi fl;
+ int res;
+
+- res = tcp_v6_send_synack(sk, NULL, &fl6, req, 0, NULL);
++ res = af_ops->send_synack(sk, NULL, &fl, req, 0, NULL);
+ if (!res) {
+ TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_RETRANSSEGS);
+ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPSYNRETRANS);
+@@ -516,7 +536,7 @@ static int tcp_v6_rtx_synack(struct sock *sk, struct request_sock *req)
+ return res;
+ }
+
+-static void tcp_v6_reqsk_destructor(struct request_sock *req)
++void tcp_v6_reqsk_destructor(struct request_sock *req)
+ {
+ kfree_skb(inet_rsk(req)->pktopts);
+ }
+@@ -718,27 +738,74 @@ static int tcp_v6_inbound_md5_hash(struct sock *sk, const struct sk_buff *skb)
+ }
+ #endif
+
++static int tcp_v6_init_req(struct request_sock *req, struct sock *sk,
++ struct sk_buff *skb)
++{
++ struct inet_request_sock *ireq = inet_rsk(req);
++ struct ipv6_pinfo *np = inet6_sk(sk);
++
++ ireq->ir_v6_rmt_addr = ipv6_hdr(skb)->saddr;
++ ireq->ir_v6_loc_addr = ipv6_hdr(skb)->daddr;
++
++ ireq->ir_iif = sk->sk_bound_dev_if;
++ ireq->ir_mark = inet_request_mark(sk, skb);
++
++ /* So that link locals have meaning */
++ if (!sk->sk_bound_dev_if &&
++ ipv6_addr_type(&ireq->ir_v6_rmt_addr) & IPV6_ADDR_LINKLOCAL)
++ ireq->ir_iif = inet6_iif(skb);
++
++ if (!TCP_SKB_CB(skb)->when &&
++ (ipv6_opt_accepted(sk, skb) || np->rxopt.bits.rxinfo ||
++ np->rxopt.bits.rxoinfo || np->rxopt.bits.rxhlim ||
++ np->rxopt.bits.rxohlim || np->repflow)) {
++ atomic_inc(&skb->users);
++ ireq->pktopts = skb;
++ }
++
++ return 0;
++}
++
++static struct dst_entry *tcp_v6_route_req(struct sock *sk, struct flowi *fl,
++ const struct request_sock *req,
++ bool *strict)
++{
++ if (strict)
++ *strict = true;
++ return inet6_csk_route_req(sk, &fl->u.ip6, req);
++}
++
+ struct request_sock_ops tcp6_request_sock_ops __read_mostly = {
+ .family = AF_INET6,
+ .obj_size = sizeof(struct tcp6_request_sock),
+- .rtx_syn_ack = tcp_v6_rtx_synack,
++ .rtx_syn_ack = tcp_rtx_synack,
+ .send_ack = tcp_v6_reqsk_send_ack,
+ .destructor = tcp_v6_reqsk_destructor,
+ .send_reset = tcp_v6_send_reset,
+ .syn_ack_timeout = tcp_syn_ack_timeout,
+ };
+
++const struct tcp_request_sock_ops tcp_request_sock_ipv6_ops = {
++ .mss_clamp = IPV6_MIN_MTU - sizeof(struct tcphdr) -
++ sizeof(struct ipv6hdr),
+ #ifdef CONFIG_TCP_MD5SIG
+-static const struct tcp_request_sock_ops tcp_request_sock_ipv6_ops = {
+ .md5_lookup = tcp_v6_reqsk_md5_lookup,
+ .calc_md5_hash = tcp_v6_md5_hash_skb,
+-};
+ #endif
++ .init_req = tcp_v6_init_req,
++#ifdef CONFIG_SYN_COOKIES
++ .cookie_init_seq = cookie_v6_init_sequence,
++#endif
++ .route_req = tcp_v6_route_req,
++ .init_seq = tcp_v6_init_sequence,
++ .send_synack = tcp_v6_send_synack,
++ .queue_hash_add = inet6_csk_reqsk_queue_hash_add,
++};
+
+-static void tcp_v6_send_response(struct sk_buff *skb, u32 seq, u32 ack, u32 win,
+- u32 tsval, u32 tsecr, int oif,
+- struct tcp_md5sig_key *key, int rst, u8 tclass,
+- u32 label)
++static void tcp_v6_send_response(struct sk_buff *skb, u32 seq, u32 ack,
++ u32 data_ack, u32 win, u32 tsval, u32 tsecr,
++ int oif, struct tcp_md5sig_key *key, int rst,
++ u8 tclass, u32 label, int mptcp)
+ {
+ const struct tcphdr *th = tcp_hdr(skb);
+ struct tcphdr *t1;
+@@ -756,7 +823,10 @@ static void tcp_v6_send_response(struct sk_buff *skb, u32 seq, u32 ack, u32 win,
+ if (key)
+ tot_len += TCPOLEN_MD5SIG_ALIGNED;
+ #endif
+-
++#ifdef CONFIG_MPTCP
++ if (mptcp)
++ tot_len += MPTCP_SUB_LEN_DSS + MPTCP_SUB_LEN_ACK;
++#endif
+ buff = alloc_skb(MAX_HEADER + sizeof(struct ipv6hdr) + tot_len,
+ GFP_ATOMIC);
+ if (buff == NULL)
+@@ -794,6 +864,17 @@ static void tcp_v6_send_response(struct sk_buff *skb, u32 seq, u32 ack, u32 win,
+ tcp_v6_md5_hash_hdr((__u8 *)topt, key,
+ &ipv6_hdr(skb)->saddr,
+ &ipv6_hdr(skb)->daddr, t1);
++ topt += 4;
++ }
++#endif
++#ifdef CONFIG_MPTCP
++ if (mptcp) {
++ /* Construction of 32-bit data_ack */
++ *topt++ = htonl((TCPOPT_MPTCP << 24) |
++ ((MPTCP_SUB_LEN_DSS + MPTCP_SUB_LEN_ACK) << 16) |
++ (0x20 << 8) |
++ (0x01));
++ *topt++ = htonl(data_ack);
+ }
+ #endif
+
+@@ -834,7 +915,7 @@ static void tcp_v6_send_response(struct sk_buff *skb, u32 seq, u32 ack, u32 win,
+ kfree_skb(buff);
+ }
+
+-static void tcp_v6_send_reset(struct sock *sk, struct sk_buff *skb)
++void tcp_v6_send_reset(struct sock *sk, struct sk_buff *skb)
+ {
+ const struct tcphdr *th = tcp_hdr(skb);
+ u32 seq = 0, ack_seq = 0;
+@@ -891,7 +972,7 @@ static void tcp_v6_send_reset(struct sock *sk, struct sk_buff *skb)
+ (th->doff << 2);
+
+ oif = sk ? sk->sk_bound_dev_if : 0;
+- tcp_v6_send_response(skb, seq, ack_seq, 0, 0, 0, oif, key, 1, 0, 0);
++ tcp_v6_send_response(skb, seq, ack_seq, 0, 0, 0, 0, oif, key, 1, 0, 0, 0);
+
+ #ifdef CONFIG_TCP_MD5SIG
+ release_sk1:
+@@ -902,45 +983,52 @@ release_sk1:
+ #endif
+ }
+
+-static void tcp_v6_send_ack(struct sk_buff *skb, u32 seq, u32 ack,
++static void tcp_v6_send_ack(struct sk_buff *skb, u32 seq, u32 ack, u32 data_ack,
+ u32 win, u32 tsval, u32 tsecr, int oif,
+ struct tcp_md5sig_key *key, u8 tclass,
+- u32 label)
++ u32 label, int mptcp)
+ {
+- tcp_v6_send_response(skb, seq, ack, win, tsval, tsecr, oif, key, 0, tclass,
+- label);
++ tcp_v6_send_response(skb, seq, ack, data_ack, win, tsval, tsecr, oif,
++ key, 0, tclass, label, mptcp);
+ }
+
+ static void tcp_v6_timewait_ack(struct sock *sk, struct sk_buff *skb)
+ {
+ struct inet_timewait_sock *tw = inet_twsk(sk);
+ struct tcp_timewait_sock *tcptw = tcp_twsk(sk);
++ u32 data_ack = 0;
++ int mptcp = 0;
+
++ if (tcptw->mptcp_tw && tcptw->mptcp_tw->meta_tw) {
++ data_ack = (u32)tcptw->mptcp_tw->rcv_nxt;
++ mptcp = 1;
++ }
+ tcp_v6_send_ack(skb, tcptw->tw_snd_nxt, tcptw->tw_rcv_nxt,
++ data_ack,
+ tcptw->tw_rcv_wnd >> tw->tw_rcv_wscale,
+ tcp_time_stamp + tcptw->tw_ts_offset,
+ tcptw->tw_ts_recent, tw->tw_bound_dev_if, tcp_twsk_md5_key(tcptw),
+- tw->tw_tclass, (tw->tw_flowlabel << 12));
++ tw->tw_tclass, (tw->tw_flowlabel << 12), mptcp);
+
+ inet_twsk_put(tw);
+ }
+
+-static void tcp_v6_reqsk_send_ack(struct sock *sk, struct sk_buff *skb,
+- struct request_sock *req)
++void tcp_v6_reqsk_send_ack(struct sock *sk, struct sk_buff *skb,
++ struct request_sock *req)
+ {
+ /* sk->sk_state == TCP_LISTEN -> for regular TCP_SYN_RECV
+ * sk->sk_state == TCP_SYN_RECV -> for Fast Open.
+ */
+ tcp_v6_send_ack(skb, (sk->sk_state == TCP_LISTEN) ?
+ tcp_rsk(req)->snt_isn + 1 : tcp_sk(sk)->snd_nxt,
+- tcp_rsk(req)->rcv_nxt,
++ tcp_rsk(req)->rcv_nxt, 0,
+ req->rcv_wnd, tcp_time_stamp, req->ts_recent, sk->sk_bound_dev_if,
+ tcp_v6_md5_do_lookup(sk, &ipv6_hdr(skb)->daddr),
+- 0, 0);
++ 0, 0, 0);
+ }
+
+
+-static struct sock *tcp_v6_hnd_req(struct sock *sk, struct sk_buff *skb)
++struct sock *tcp_v6_hnd_req(struct sock *sk, struct sk_buff *skb)
+ {
+ struct request_sock *req, **prev;
+ const struct tcphdr *th = tcp_hdr(skb);
+@@ -959,7 +1047,13 @@ static struct sock *tcp_v6_hnd_req(struct sock *sk, struct sk_buff *skb)
+
+ if (nsk) {
+ if (nsk->sk_state != TCP_TIME_WAIT) {
++ /* Don't lock again the meta-sk. It has been locked
++ * before mptcp_v6_do_rcv.
++ */
++ if (mptcp(tcp_sk(nsk)) && !is_meta_sk(sk))
++ bh_lock_sock(mptcp_meta_sk(nsk));
+ bh_lock_sock(nsk);
++
+ return nsk;
+ }
+ inet_twsk_put(inet_twsk(nsk));
+@@ -973,161 +1067,25 @@ static struct sock *tcp_v6_hnd_req(struct sock *sk, struct sk_buff *skb)
+ return sk;
+ }
+
+-/* FIXME: this is substantially similar to the ipv4 code.
+- * Can some kind of merge be done? -- erics
+- */
+-static int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
++int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
+ {
+- struct tcp_options_received tmp_opt;
+- struct request_sock *req;
+- struct inet_request_sock *ireq;
+- struct ipv6_pinfo *np = inet6_sk(sk);
+- struct tcp_sock *tp = tcp_sk(sk);
+- __u32 isn = TCP_SKB_CB(skb)->when;
+- struct dst_entry *dst = NULL;
+- struct tcp_fastopen_cookie foc = { .len = -1 };
+- bool want_cookie = false, fastopen;
+- struct flowi6 fl6;
+- int err;
+-
+ if (skb->protocol == htons(ETH_P_IP))
+ return tcp_v4_conn_request(sk, skb);
+
+ if (!ipv6_unicast_destination(skb))
+ goto drop;
+
+- if ((sysctl_tcp_syncookies == 2 ||
+- inet_csk_reqsk_queue_is_full(sk)) && !isn) {
+- want_cookie = tcp_syn_flood_action(sk, skb, "TCPv6");
+- if (!want_cookie)
+- goto drop;
+- }
+-
+- if (sk_acceptq_is_full(sk) && inet_csk_reqsk_queue_young(sk) > 1) {
+- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENOVERFLOWS);
+- goto drop;
+- }
+-
+- req = inet6_reqsk_alloc(&tcp6_request_sock_ops);
+- if (req == NULL)
+- goto drop;
+-
+-#ifdef CONFIG_TCP_MD5SIG
+- tcp_rsk(req)->af_specific = &tcp_request_sock_ipv6_ops;
+-#endif
+-
+- tcp_clear_options(&tmp_opt);
+- tmp_opt.mss_clamp = IPV6_MIN_MTU - sizeof(struct tcphdr) - sizeof(struct ipv6hdr);
+- tmp_opt.user_mss = tp->rx_opt.user_mss;
+- tcp_parse_options(skb, &tmp_opt, 0, want_cookie ? NULL : &foc);
+-
+- if (want_cookie && !tmp_opt.saw_tstamp)
+- tcp_clear_options(&tmp_opt);
++ return tcp_conn_request(&tcp6_request_sock_ops,
++ &tcp_request_sock_ipv6_ops, sk, skb);
+
+- tmp_opt.tstamp_ok = tmp_opt.saw_tstamp;
+- tcp_openreq_init(req, &tmp_opt, skb);
+-
+- ireq = inet_rsk(req);
+- ireq->ir_v6_rmt_addr = ipv6_hdr(skb)->saddr;
+- ireq->ir_v6_loc_addr = ipv6_hdr(skb)->daddr;
+- if (!want_cookie || tmp_opt.tstamp_ok)
+- TCP_ECN_create_request(req, skb, sock_net(sk));
+-
+- ireq->ir_iif = sk->sk_bound_dev_if;
+- ireq->ir_mark = inet_request_mark(sk, skb);
+-
+- /* So that link locals have meaning */
+- if (!sk->sk_bound_dev_if &&
+- ipv6_addr_type(&ireq->ir_v6_rmt_addr) & IPV6_ADDR_LINKLOCAL)
+- ireq->ir_iif = inet6_iif(skb);
+-
+- if (!isn) {
+- if (ipv6_opt_accepted(sk, skb) ||
+- np->rxopt.bits.rxinfo || np->rxopt.bits.rxoinfo ||
+- np->rxopt.bits.rxhlim || np->rxopt.bits.rxohlim ||
+- np->repflow) {
+- atomic_inc(&skb->users);
+- ireq->pktopts = skb;
+- }
+-
+- if (want_cookie) {
+- isn = cookie_v6_init_sequence(sk, skb, &req->mss);
+- req->cookie_ts = tmp_opt.tstamp_ok;
+- goto have_isn;
+- }
+-
+- /* VJ's idea. We save last timestamp seen
+- * from the destination in peer table, when entering
+- * state TIME-WAIT, and check against it before
+- * accepting new connection request.
+- *
+- * If "isn" is not zero, this request hit alive
+- * timewait bucket, so that all the necessary checks
+- * are made in the function processing timewait state.
+- */
+- if (tmp_opt.saw_tstamp &&
+- tcp_death_row.sysctl_tw_recycle &&
+- (dst = inet6_csk_route_req(sk, &fl6, req)) != NULL) {
+- if (!tcp_peer_is_proven(req, dst, true)) {
+- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_PAWSPASSIVEREJECTED);
+- goto drop_and_release;
+- }
+- }
+- /* Kill the following clause, if you dislike this way. */
+- else if (!sysctl_tcp_syncookies &&
+- (sysctl_max_syn_backlog - inet_csk_reqsk_queue_len(sk) <
+- (sysctl_max_syn_backlog >> 2)) &&
+- !tcp_peer_is_proven(req, dst, false)) {
+- /* Without syncookies last quarter of
+- * backlog is filled with destinations,
+- * proven to be alive.
+- * It means that we continue to communicate
+- * to destinations, already remembered
+- * to the moment of synflood.
+- */
+- LIMIT_NETDEBUG(KERN_DEBUG "TCP: drop open request from %pI6/%u\n",
+- &ireq->ir_v6_rmt_addr, ntohs(tcp_hdr(skb)->source));
+- goto drop_and_release;
+- }
+-
+- isn = tcp_v6_init_sequence(skb);
+- }
+-have_isn:
+-
+- if (security_inet_conn_request(sk, skb, req))
+- goto drop_and_release;
+-
+- if (!dst && (dst = inet6_csk_route_req(sk, &fl6, req)) == NULL)
+- goto drop_and_free;
+-
+- tcp_rsk(req)->snt_isn = isn;
+- tcp_rsk(req)->snt_synack = tcp_time_stamp;
+- tcp_openreq_init_rwin(req, sk, dst);
+- fastopen = !want_cookie &&
+- tcp_try_fastopen(sk, skb, req, &foc, dst);
+- err = tcp_v6_send_synack(sk, dst, &fl6, req,
+- skb_get_queue_mapping(skb), &foc);
+- if (!fastopen) {
+- if (err || want_cookie)
+- goto drop_and_free;
+-
+- tcp_rsk(req)->listener = NULL;
+- inet6_csk_reqsk_queue_hash_add(sk, req, TCP_TIMEOUT_INIT);
+- }
+- return 0;
+-
+-drop_and_release:
+- dst_release(dst);
+-drop_and_free:
+- reqsk_free(req);
+ drop:
+ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENDROPS);
+ return 0; /* don't send reset */
+ }
+
+-static struct sock *tcp_v6_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
+- struct request_sock *req,
+- struct dst_entry *dst)
++struct sock *tcp_v6_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
++ struct request_sock *req,
++ struct dst_entry *dst)
+ {
+ struct inet_request_sock *ireq;
+ struct ipv6_pinfo *newnp, *np = inet6_sk(sk);
+@@ -1165,7 +1123,12 @@ static struct sock *tcp_v6_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
+
+ newsk->sk_v6_rcv_saddr = newnp->saddr;
+
+- inet_csk(newsk)->icsk_af_ops = &ipv6_mapped;
++#ifdef CONFIG_MPTCP
++ if (is_mptcp_enabled(newsk))
++ inet_csk(newsk)->icsk_af_ops = &mptcp_v6_mapped;
++ else
++#endif
++ inet_csk(newsk)->icsk_af_ops = &ipv6_mapped;
+ newsk->sk_backlog_rcv = tcp_v4_do_rcv;
+ #ifdef CONFIG_TCP_MD5SIG
+ newtp->af_specific = &tcp_sock_ipv6_mapped_specific;
+@@ -1329,7 +1292,7 @@ out:
+ * This is because we cannot sleep with the original spinlock
+ * held.
+ */
+-static int tcp_v6_do_rcv(struct sock *sk, struct sk_buff *skb)
++int tcp_v6_do_rcv(struct sock *sk, struct sk_buff *skb)
+ {
+ struct ipv6_pinfo *np = inet6_sk(sk);
+ struct tcp_sock *tp;
+@@ -1351,6 +1314,9 @@ static int tcp_v6_do_rcv(struct sock *sk, struct sk_buff *skb)
+ goto discard;
+ #endif
+
++ if (is_meta_sk(sk))
++ return mptcp_v6_do_rcv(sk, skb);
++
+ if (sk_filter(sk, skb))
+ goto discard;
+
+@@ -1472,7 +1438,7 @@ static int tcp_v6_rcv(struct sk_buff *skb)
+ {
+ const struct tcphdr *th;
+ const struct ipv6hdr *hdr;
+- struct sock *sk;
++ struct sock *sk, *meta_sk = NULL;
+ int ret;
+ struct net *net = dev_net(skb->dev);
+
+@@ -1503,18 +1469,43 @@ static int tcp_v6_rcv(struct sk_buff *skb)
+ TCP_SKB_CB(skb)->end_seq = (TCP_SKB_CB(skb)->seq + th->syn + th->fin +
+ skb->len - th->doff*4);
+ TCP_SKB_CB(skb)->ack_seq = ntohl(th->ack_seq);
++#ifdef CONFIG_MPTCP
++ TCP_SKB_CB(skb)->mptcp_flags = 0;
++ TCP_SKB_CB(skb)->dss_off = 0;
++#endif
+ TCP_SKB_CB(skb)->when = 0;
+ TCP_SKB_CB(skb)->ip_dsfield = ipv6_get_dsfield(hdr);
+ TCP_SKB_CB(skb)->sacked = 0;
+
+ sk = __inet6_lookup_skb(&tcp_hashinfo, skb, th->source, th->dest);
+- if (!sk)
+- goto no_tcp_socket;
+
+ process:
+- if (sk->sk_state == TCP_TIME_WAIT)
++ if (sk && sk->sk_state == TCP_TIME_WAIT)
+ goto do_time_wait;
+
++#ifdef CONFIG_MPTCP
++ if (!sk && th->syn && !th->ack) {
++ int ret = mptcp_lookup_join(skb, NULL);
++
++ if (ret < 0) {
++ tcp_v6_send_reset(NULL, skb);
++ goto discard_it;
++ } else if (ret > 0) {
++ return 0;
++ }
++ }
++
++ /* Is there a pending request sock for this segment ? */
++ if ((!sk || sk->sk_state == TCP_LISTEN) && mptcp_check_req(skb, net)) {
++ if (sk)
++ sock_put(sk);
++ return 0;
++ }
++#endif
++
++ if (!sk)
++ goto no_tcp_socket;
++
+ if (hdr->hop_limit < inet6_sk(sk)->min_hopcount) {
+ NET_INC_STATS_BH(net, LINUX_MIB_TCPMINTTLDROP);
+ goto discard_and_relse;
+@@ -1529,11 +1520,21 @@ process:
+ sk_mark_napi_id(sk, skb);
+ skb->dev = NULL;
+
+- bh_lock_sock_nested(sk);
++ if (mptcp(tcp_sk(sk))) {
++ meta_sk = mptcp_meta_sk(sk);
++
++ bh_lock_sock_nested(meta_sk);
++ if (sock_owned_by_user(meta_sk))
++ skb->sk = sk;
++ } else {
++ meta_sk = sk;
++ bh_lock_sock_nested(sk);
++ }
++
+ ret = 0;
+- if (!sock_owned_by_user(sk)) {
++ if (!sock_owned_by_user(meta_sk)) {
+ #ifdef CONFIG_NET_DMA
+- struct tcp_sock *tp = tcp_sk(sk);
++ struct tcp_sock *tp = tcp_sk(meta_sk);
+ if (!tp->ucopy.dma_chan && tp->ucopy.pinned_list)
+ tp->ucopy.dma_chan = net_dma_find_channel();
+ if (tp->ucopy.dma_chan)
+@@ -1541,16 +1542,17 @@ process:
+ else
+ #endif
+ {
+- if (!tcp_prequeue(sk, skb))
++ if (!tcp_prequeue(meta_sk, skb))
+ ret = tcp_v6_do_rcv(sk, skb);
+ }
+- } else if (unlikely(sk_add_backlog(sk, skb,
+- sk->sk_rcvbuf + sk->sk_sndbuf))) {
+- bh_unlock_sock(sk);
++ } else if (unlikely(sk_add_backlog(meta_sk, skb,
++ meta_sk->sk_rcvbuf + meta_sk->sk_sndbuf))) {
++ bh_unlock_sock(meta_sk);
+ NET_INC_STATS_BH(net, LINUX_MIB_TCPBACKLOGDROP);
+ goto discard_and_relse;
+ }
+- bh_unlock_sock(sk);
++
++ bh_unlock_sock(meta_sk);
+
+ sock_put(sk);
+ return ret ? -1 : 0;
+@@ -1607,6 +1609,18 @@ do_time_wait:
+ sk = sk2;
+ goto process;
+ }
++#ifdef CONFIG_MPTCP
++ if (th->syn && !th->ack) {
++ int ret = mptcp_lookup_join(skb, inet_twsk(sk));
++
++ if (ret < 0) {
++ tcp_v6_send_reset(NULL, skb);
++ goto discard_it;
++ } else if (ret > 0) {
++ return 0;
++ }
++ }
++#endif
+ /* Fall through to ACK */
+ }
+ case TCP_TW_ACK:
+@@ -1657,7 +1671,7 @@ static void tcp_v6_early_demux(struct sk_buff *skb)
+ }
+ }
+
+-static struct timewait_sock_ops tcp6_timewait_sock_ops = {
++struct timewait_sock_ops tcp6_timewait_sock_ops = {
+ .twsk_obj_size = sizeof(struct tcp6_timewait_sock),
+ .twsk_unique = tcp_twsk_unique,
+ .twsk_destructor = tcp_twsk_destructor,
+@@ -1730,7 +1744,12 @@ static int tcp_v6_init_sock(struct sock *sk)
+
+ tcp_init_sock(sk);
+
+- icsk->icsk_af_ops = &ipv6_specific;
++#ifdef CONFIG_MPTCP
++ if (is_mptcp_enabled(sk))
++ icsk->icsk_af_ops = &mptcp_v6_specific;
++ else
++#endif
++ icsk->icsk_af_ops = &ipv6_specific;
+
+ #ifdef CONFIG_TCP_MD5SIG
+ tcp_sk(sk)->af_specific = &tcp_sock_ipv6_specific;
+@@ -1739,7 +1758,7 @@ static int tcp_v6_init_sock(struct sock *sk)
+ return 0;
+ }
+
+-static void tcp_v6_destroy_sock(struct sock *sk)
++void tcp_v6_destroy_sock(struct sock *sk)
+ {
+ tcp_v4_destroy_sock(sk);
+ inet6_destroy_sock(sk);
+@@ -1924,12 +1943,28 @@ void tcp6_proc_exit(struct net *net)
+ static void tcp_v6_clear_sk(struct sock *sk, int size)
+ {
+ struct inet_sock *inet = inet_sk(sk);
++#ifdef CONFIG_MPTCP
++ struct tcp_sock *tp = tcp_sk(sk);
++ /* size_tk_table goes from the end of tk_table to the end of sk */
++ int size_tk_table = size - offsetof(struct tcp_sock, tk_table) -
++ sizeof(tp->tk_table);
++#endif
+
+ /* we do not want to clear pinet6 field, because of RCU lookups */
+ sk_prot_clear_nulls(sk, offsetof(struct inet_sock, pinet6));
+
+ size -= offsetof(struct inet_sock, pinet6) + sizeof(inet->pinet6);
++
++#ifdef CONFIG_MPTCP
++ /* We zero out only from pinet6 to tk_table */
++ size -= size_tk_table + sizeof(tp->tk_table);
++#endif
+ memset(&inet->pinet6 + 1, 0, size);
++
++#ifdef CONFIG_MPTCP
++ memset((char *)&tp->tk_table + sizeof(tp->tk_table), 0, size_tk_table);
++#endif
++
+ }
+
+ struct proto tcpv6_prot = {
+diff --git a/net/mptcp/Kconfig b/net/mptcp/Kconfig
+new file mode 100644
+index 000000000000..cdfc03adabf8
+--- /dev/null
++++ b/net/mptcp/Kconfig
+@@ -0,0 +1,115 @@
++#
++# MPTCP configuration
++#
++config MPTCP
++ bool "MPTCP protocol"
++ depends on (IPV6=y || IPV6=n)
++ ---help---
++ This replaces the normal TCP stack with a Multipath TCP stack,
++ able to use several paths at once.
++
++menuconfig MPTCP_PM_ADVANCED
++ bool "MPTCP: advanced path-manager control"
++ depends on MPTCP=y
++ ---help---
++ Support for selection of different path-managers. You should choose 'Y' here,
++ because otherwise you will not actively create new MPTCP-subflows.
++
++if MPTCP_PM_ADVANCED
++
++config MPTCP_FULLMESH
++ tristate "MPTCP Full-Mesh Path-Manager"
++ depends on MPTCP=y
++ ---help---
++ This path-management module will create a full-mesh among all IP-addresses.
++
++config MPTCP_NDIFFPORTS
++ tristate "MPTCP ndiff-ports"
++ depends on MPTCP=y
++ ---help---
++ This path-management module will create multiple subflows between the same
++ pair of IP-addresses, modifying the source-port. You can set the number
++ of subflows via the mptcp_ndiffports-sysctl.
++
++config MPTCP_BINDER
++ tristate "MPTCP Binder"
++ depends on (MPTCP=y)
++ ---help---
++ This path-management module works like ndiffports, and adds the sysctl
++ option to set the gateway (and/or path to) per each additional subflow
++ via Loose Source Routing (IPv4 only).
++
++choice
++ prompt "Default MPTCP Path-Manager"
++ default DEFAULT
++ help
++ Select the Path-Manager of your choice
++
++ config DEFAULT_FULLMESH
++ bool "Full mesh" if MPTCP_FULLMESH=y
++
++ config DEFAULT_NDIFFPORTS
++ bool "ndiff-ports" if MPTCP_NDIFFPORTS=y
++
++ config DEFAULT_BINDER
++ bool "binder" if MPTCP_BINDER=y
++
++ config DEFAULT_DUMMY
++ bool "Default"
++
++endchoice
++
++endif
++
++config DEFAULT_MPTCP_PM
++ string
++ default "default" if DEFAULT_DUMMY
++ default "fullmesh" if DEFAULT_FULLMESH
++ default "ndiffports" if DEFAULT_NDIFFPORTS
++ default "binder" if DEFAULT_BINDER
++ default "default"
++
++menuconfig MPTCP_SCHED_ADVANCED
++ bool "MPTCP: advanced scheduler control"
++ depends on MPTCP=y
++ ---help---
++ Support for selection of different schedulers. You should choose 'Y' here,
++ if you want to choose a different scheduler than the default one.
++
++if MPTCP_SCHED_ADVANCED
++
++config MPTCP_ROUNDROBIN
++ tristate "MPTCP Round-Robin"
++ depends on (MPTCP=y)
++ ---help---
++ This is a very simple round-robin scheduler. Probably has bad performance
++ but might be interesting for researchers.
++
++choice
++ prompt "Default MPTCP Scheduler"
++ default DEFAULT
++ help
++ Select the Scheduler of your choice
++
++ config DEFAULT_SCHEDULER
++ bool "Default"
++ ---help---
++ This is the default scheduler, sending first on the subflow
++ with the lowest RTT.
++
++ config DEFAULT_ROUNDROBIN
++ bool "Round-Robin" if MPTCP_ROUNDROBIN=y
++ ---help---
++ This is the round-robin scheduler, sending in a round-robin
++ fashion.
++
++endchoice
++endif
++
++config DEFAULT_MPTCP_SCHED
++ string
++ depends on (MPTCP=y)
++ default "default" if DEFAULT_SCHEDULER
++ default "roundrobin" if DEFAULT_ROUNDROBIN
++ default "default"
++
+diff --git a/net/mptcp/Makefile b/net/mptcp/Makefile
+new file mode 100644
+index 000000000000..35561a7012e3
+--- /dev/null
++++ b/net/mptcp/Makefile
+@@ -0,0 +1,20 @@
++#
++## Makefile for MultiPath TCP support code.
++#
++#
++
++obj-$(CONFIG_MPTCP) += mptcp.o
++
++mptcp-y := mptcp_ctrl.o mptcp_ipv4.o mptcp_ofo_queue.o mptcp_pm.o \
++ mptcp_output.o mptcp_input.o mptcp_sched.o
++
++obj-$(CONFIG_TCP_CONG_COUPLED) += mptcp_coupled.o
++obj-$(CONFIG_TCP_CONG_OLIA) += mptcp_olia.o
++obj-$(CONFIG_TCP_CONG_WVEGAS) += mptcp_wvegas.o
++obj-$(CONFIG_MPTCP_FULLMESH) += mptcp_fullmesh.o
++obj-$(CONFIG_MPTCP_NDIFFPORTS) += mptcp_ndiffports.o
++obj-$(CONFIG_MPTCP_BINDER) += mptcp_binder.o
++obj-$(CONFIG_MPTCP_ROUNDROBIN) += mptcp_rr.o
++
++mptcp-$(subst m,y,$(CONFIG_IPV6)) += mptcp_ipv6.o
++
+diff --git a/net/mptcp/mptcp_binder.c b/net/mptcp/mptcp_binder.c
+new file mode 100644
+index 000000000000..95d8da560715
+--- /dev/null
++++ b/net/mptcp/mptcp_binder.c
+@@ -0,0 +1,487 @@
++#include <linux/module.h>
++
++#include <net/mptcp.h>
++#include <net/mptcp_v4.h>
++
++#include <linux/route.h>
++#include <linux/inet.h>
++#include <linux/mroute.h>
++#include <linux/spinlock_types.h>
++#include <net/inet_ecn.h>
++#include <net/route.h>
++#include <net/xfrm.h>
++#include <net/compat.h>
++#include <linux/slab.h>
++
++#define MPTCP_GW_MAX_LISTS 10
++#define MPTCP_GW_LIST_MAX_LEN 6
++#define MPTCP_GW_SYSCTL_MAX_LEN (15 * MPTCP_GW_LIST_MAX_LEN * \
++ MPTCP_GW_MAX_LISTS)
++
++struct mptcp_gw_list {
++ struct in_addr list[MPTCP_GW_MAX_LISTS][MPTCP_GW_LIST_MAX_LEN];
++ u8 len[MPTCP_GW_MAX_LISTS];
++};
++
++struct binder_priv {
++ /* Worker struct for subflow establishment */
++ struct work_struct subflow_work;
++
++ struct mptcp_cb *mpcb;
++
++ /* Prevent multiple sub-sockets concurrently iterating over sockets */
++ spinlock_t *flow_lock;
++};
++
++static struct mptcp_gw_list *mptcp_gws;
++static rwlock_t mptcp_gws_lock;
++
++static int mptcp_binder_ndiffports __read_mostly = 1;
++
++static char sysctl_mptcp_binder_gateways[MPTCP_GW_SYSCTL_MAX_LEN] __read_mostly;
++
++static int mptcp_get_avail_list_ipv4(struct sock *sk)
++{
++ int i, j, list_taken, opt_ret, opt_len;
++ unsigned char *opt_ptr, *opt_end_ptr, opt[MAX_IPOPTLEN];
++
++ for (i = 0; i < MPTCP_GW_MAX_LISTS; ++i) {
++ if (mptcp_gws->len[i] == 0)
++ goto error;
++
++ mptcp_debug("mptcp_get_avail_list_ipv4: List %i\n", i);
++ list_taken = 0;
++
++ /* Loop through all sub-sockets in this connection */
++ mptcp_for_each_sk(tcp_sk(sk)->mpcb, sk) {
++ mptcp_debug("mptcp_get_avail_list_ipv4: Next sock\n");
++
++ /* Reset length and options buffer, then retrieve
++ * from socket
++ */
++ opt_len = MAX_IPOPTLEN;
++ memset(opt, 0, MAX_IPOPTLEN);
++ opt_ret = ip_getsockopt(sk, IPPROTO_IP,
++ IP_OPTIONS, opt, &opt_len);
++ if (opt_ret < 0) {
++ mptcp_debug(KERN_ERR "%s: MPTCP subsocket getsockopt() IP_OPTIONS failed, error %d\n",
++ __func__, opt_ret);
++ goto error;
++ }
++
++ /* If socket has no options, it has no stake in this list */
++ if (opt_len <= 0)
++ continue;
++
++ /* Iterate options buffer */
++ for (opt_ptr = &opt[0]; opt_ptr < &opt[opt_len]; opt_ptr++) {
++ if (*opt_ptr == IPOPT_LSRR) {
++ mptcp_debug("mptcp_get_avail_list_ipv4: LSRR options found\n");
++ goto sock_lsrr;
++ }
++ }
++ continue;
++
++sock_lsrr:
++ /* Pointer to the 2nd to last address */
++ opt_end_ptr = opt_ptr+(*(opt_ptr+1))-4;
++
++ /* Addresses start 3 bytes after type offset */
++ opt_ptr += 3;
++ j = 0;
++
++ /* Different length lists cannot be the same */
++ if ((opt_end_ptr-opt_ptr)/4 != mptcp_gws->len[i])
++ continue;
++
++ /* Iterate if we are still inside options list
++ * and sysctl list
++ */
++ while (opt_ptr < opt_end_ptr && j < mptcp_gws->len[i]) {
++ /* If there is a different address, this list must
++ * not be set on this socket
++ */
++ if (memcmp(&mptcp_gws->list[i][j], opt_ptr, 4))
++ break;
++
++ /* Jump 4 bytes to next address */
++ opt_ptr += 4;
++ j++;
++ }
++
++ /* Reached the end without a differing address, lists
++ * are therefore identical.
++ */
++ if (j == mptcp_gws->len[i]) {
++ mptcp_debug("mptcp_get_avail_list_ipv4: List already used\n");
++ list_taken = 1;
++ break;
++ }
++ }
++
++ /* Free list found if not taken by a socket */
++ if (!list_taken) {
++ mptcp_debug("mptcp_get_avail_list_ipv4: List free\n");
++ break;
++ }
++ }
++
++ if (i >= MPTCP_GW_MAX_LISTS)
++ goto error;
++
++ return i;
++error:
++ return -1;
++}
++
++/* The list of addresses is parsed each time a new connection is opened,
++ * to make sure it's up to date. In case of error, all the lists are
++ * marked as unavailable and the subflow's fingerprint is set to 0.
++ */
++static void mptcp_v4_add_lsrr(struct sock *sk, struct in_addr addr)
++{
++ int i, j, ret;
++ unsigned char opt[MAX_IPOPTLEN] = {0};
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct binder_priv *fmp = (struct binder_priv *)&tp->mpcb->mptcp_pm[0];
++
++ /* Read lock: multiple sockets can read LSRR addresses at the same
++ * time, but writes are done in mutual exclusion.
++ * Spin lock: must search for free list for one socket at a time, or
++ * multiple sockets could take the same list.
++ */
++ read_lock(&mptcp_gws_lock);
++ spin_lock(fmp->flow_lock);
++
++ i = mptcp_get_avail_list_ipv4(sk);
++
++ /* Execution enters here only if a free path is found.
++ */
++ if (i >= 0) {
++ opt[0] = IPOPT_NOP;
++ opt[1] = IPOPT_LSRR;
++ opt[2] = sizeof(mptcp_gws->list[i][0].s_addr) *
++ (mptcp_gws->len[i] + 1) + 3;
++ opt[3] = IPOPT_MINOFF;
++ for (j = 0; j < mptcp_gws->len[i]; ++j)
++ memcpy(opt + 4 +
++ (j * sizeof(mptcp_gws->list[i][0].s_addr)),
++ &mptcp_gws->list[i][j].s_addr,
++ sizeof(mptcp_gws->list[i][0].s_addr));
++ /* Final destination must be part of IP_OPTIONS parameter. */
++ memcpy(opt + 4 + (j * sizeof(addr.s_addr)), &addr.s_addr,
++ sizeof(addr.s_addr));
++
++ /* setsockopt must be inside the lock, otherwise another
++ * subflow could fail to see that we have taken a list.
++ */
++ ret = ip_setsockopt(sk, IPPROTO_IP, IP_OPTIONS, opt,
++ 4 + sizeof(mptcp_gws->list[i][0].s_addr)
++ * (mptcp_gws->len[i] + 1));
++
++ if (ret < 0) {
++ mptcp_debug(KERN_ERR "%s: MPTCP subsock setsockopt() IP_OPTIONS failed, error %d\n",
++ __func__, ret);
++ }
++ }
++
++ spin_unlock(fmp->flow_lock);
++ read_unlock(&mptcp_gws_lock);
++
++ return;
++}
++
++/* Parses gateways string for a list of paths to different
++ * gateways, and stores them for use with the Loose Source Routing (LSRR)
++ * socket option. Each list must have "," separated addresses, and the lists
++ * themselves must be separated by "-". Returns -1 in case one or more of the
++ * addresses is not a valid ipv4/6 address.
++ */
++static int mptcp_parse_gateway_ipv4(char *gateways)
++{
++ int i, j, k, ret;
++ char *tmp_string = NULL;
++ struct in_addr tmp_addr;
++
++ tmp_string = kzalloc(16, GFP_KERNEL);
++ if (tmp_string == NULL)
++ return -ENOMEM;
++
++ write_lock(&mptcp_gws_lock);
++
++ memset(mptcp_gws, 0, sizeof(struct mptcp_gw_list));
++
++ /* A TMP string is used since in4_pton needs a null-terminated string
++ * but we do not want to modify the sysctl for obvious reasons.
++ * i will iterate over the SYSCTL string, j will iterate over the
++ * temporary string where each IP is copied into, k will iterate over
++ * the IPs in each list.
++ */
++ for (i = j = k = 0;
++ i < MPTCP_GW_SYSCTL_MAX_LEN && k < MPTCP_GW_MAX_LISTS;
++ ++i) {
++ if (gateways[i] == '-' || gateways[i] == ',' || gateways[i] == '\0') {
++ /* If the temp IP is empty and the current list is
++ * empty, we are done.
++ */
++ if (j == 0 && mptcp_gws->len[k] == 0)
++ break;
++
++ /* Terminate the temp IP string, then if it is
++ * non-empty parse the IP and copy it.
++ */
++ tmp_string[j] = '\0';
++ if (j > 0) {
++ mptcp_debug("mptcp_parse_gateway_list tmp: %s i: %d\n", tmp_string, i);
++
++ ret = in4_pton(tmp_string, strlen(tmp_string),
++ (u8 *)&tmp_addr.s_addr, '\0',
++ NULL);
++
++ if (ret) {
++ mptcp_debug("mptcp_parse_gateway_list ret: %d s_addr: %pI4\n",
++ ret,
++ &tmp_addr.s_addr);
++ memcpy(&mptcp_gws->list[k][mptcp_gws->len[k]].s_addr,
++ &tmp_addr.s_addr,
++ sizeof(tmp_addr.s_addr));
++ mptcp_gws->len[k]++;
++ j = 0;
++ tmp_string[j] = '\0';
++ /* Since we can't impose a limit to
++ * what the user can input, make sure
++ * there are not too many IPs in the
++ * SYSCTL string.
++ */
++ if (mptcp_gws->len[k] > MPTCP_GW_LIST_MAX_LEN) {
++ mptcp_debug("mptcp_parse_gateway_list too many members in list %i: max %i\n",
++ k,
++ MPTCP_GW_LIST_MAX_LEN);
++ goto error;
++ }
++ } else {
++ goto error;
++ }
++ }
++
++ if (gateways[i] == '-' || gateways[i] == '\0')
++ ++k;
++ } else {
++ tmp_string[j] = gateways[i];
++ ++j;
++ }
++ }
++
++ /* Number of flows is number of gateway lists plus master flow */
++ mptcp_binder_ndiffports = k+1;
++
++ write_unlock(&mptcp_gws_lock);
++ kfree(tmp_string);
++
++ return 0;
++
++error:
++ memset(mptcp_gws, 0, sizeof(struct mptcp_gw_list));
++ memset(gateways, 0, sizeof(char) * MPTCP_GW_SYSCTL_MAX_LEN);
++ write_unlock(&mptcp_gws_lock);
++ kfree(tmp_string);
++ return -1;
++}
++
++/**
++ * Create all new subflows, by doing calls to mptcp_initX_subsockets
++ *
++ * This function uses a goto next_subflow, to allow releasing the lock between
++ * new subflows and giving other processes a chance to do some work on the
++ * socket and potentially finishing the communication.
++ **/
++static void create_subflow_worker(struct work_struct *work)
++{
++ const struct binder_priv *pm_priv = container_of(work,
++ struct binder_priv,
++ subflow_work);
++ struct mptcp_cb *mpcb = pm_priv->mpcb;
++ struct sock *meta_sk = mpcb->meta_sk;
++ int iter = 0;
++
++next_subflow:
++ if (iter) {
++ release_sock(meta_sk);
++ mutex_unlock(&mpcb->mpcb_mutex);
++
++ cond_resched();
++ }
++ mutex_lock(&mpcb->mpcb_mutex);
++ lock_sock_nested(meta_sk, SINGLE_DEPTH_NESTING);
++
++ iter++;
++
++ if (sock_flag(meta_sk, SOCK_DEAD))
++ goto exit;
++
++ if (mpcb->master_sk &&
++ !tcp_sk(mpcb->master_sk)->mptcp->fully_established)
++ goto exit;
++
++ if (mptcp_binder_ndiffports > iter &&
++ mptcp_binder_ndiffports > mpcb->cnt_subflows) {
++ struct mptcp_loc4 loc;
++ struct mptcp_rem4 rem;
++
++ loc.addr.s_addr = inet_sk(meta_sk)->inet_saddr;
++ loc.loc4_id = 0;
++ loc.low_prio = 0;
++
++ rem.addr.s_addr = inet_sk(meta_sk)->inet_daddr;
++ rem.port = inet_sk(meta_sk)->inet_dport;
++ rem.rem4_id = 0; /* Default 0 */
++
++ mptcp_init4_subsockets(meta_sk, &loc, &rem);
++
++ goto next_subflow;
++ }
++
++exit:
++ release_sock(meta_sk);
++ mutex_unlock(&mpcb->mpcb_mutex);
++ sock_put(meta_sk);
++}
++
++static void binder_new_session(const struct sock *meta_sk)
++{
++ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct binder_priv *fmp = (struct binder_priv *)&mpcb->mptcp_pm[0];
++ static DEFINE_SPINLOCK(flow_lock);
++
++#if IS_ENABLED(CONFIG_IPV6)
++ if (meta_sk->sk_family == AF_INET6 &&
++ !mptcp_v6_is_v4_mapped(meta_sk)) {
++ mptcp_fallback_default(mpcb);
++ return;
++ }
++#endif
++
++ /* Initialize workqueue-struct */
++ INIT_WORK(&fmp->subflow_work, create_subflow_worker);
++ fmp->mpcb = mpcb;
++
++ fmp->flow_lock = &flow_lock;
++}
++
++static void binder_create_subflows(struct sock *meta_sk)
++{
++ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct binder_priv *pm_priv = (struct binder_priv *)&mpcb->mptcp_pm[0];
++
++ if (mpcb->infinite_mapping_snd || mpcb->infinite_mapping_rcv ||
++ mpcb->send_infinite_mapping ||
++ mpcb->server_side || sock_flag(meta_sk, SOCK_DEAD))
++ return;
++
++ if (!work_pending(&pm_priv->subflow_work)) {
++ sock_hold(meta_sk);
++ queue_work(mptcp_wq, &pm_priv->subflow_work);
++ }
++}
++
++static int binder_get_local_id(sa_family_t family, union inet_addr *addr,
++ struct net *net, bool *low_prio)
++{
++ return 0;
++}
++
++/* Callback function, executed when the sysctl mptcp.mptcp_gateways is updated.
++ * Inspired from proc_tcp_congestion_control().
++ */
++static int proc_mptcp_gateways(ctl_table *ctl, int write,
++ void __user *buffer, size_t *lenp,
++ loff_t *ppos)
++{
++ int ret;
++ ctl_table tbl = {
++ .maxlen = MPTCP_GW_SYSCTL_MAX_LEN,
++ };
++
++ if (write) {
++ tbl.data = kzalloc(MPTCP_GW_SYSCTL_MAX_LEN, GFP_KERNEL);
++ if (tbl.data == NULL)
++ return -ENOMEM;
++ ret = proc_dostring(&tbl, write, buffer, lenp, ppos);
++ if (ret == 0) {
++ ret = mptcp_parse_gateway_ipv4(tbl.data);
++ memcpy(ctl->data, tbl.data, MPTCP_GW_SYSCTL_MAX_LEN);
++ }
++ kfree(tbl.data);
++ } else {
++ ret = proc_dostring(ctl, write, buffer, lenp, ppos);
++ }
++
++
++ return ret;
++}
++
++static struct mptcp_pm_ops binder __read_mostly = {
++ .new_session = binder_new_session,
++ .fully_established = binder_create_subflows,
++ .get_local_id = binder_get_local_id,
++ .init_subsocket_v4 = mptcp_v4_add_lsrr,
++ .name = "binder",
++ .owner = THIS_MODULE,
++};
++
++static struct ctl_table binder_table[] = {
++ {
++ .procname = "mptcp_binder_gateways",
++ .data = &sysctl_mptcp_binder_gateways,
++ .maxlen = sizeof(char) * MPTCP_GW_SYSCTL_MAX_LEN,
++ .mode = 0644,
++ .proc_handler = &proc_mptcp_gateways
++ },
++ { }
++};
++
++struct ctl_table_header *mptcp_sysctl_binder;
++
++/* General initialization of MPTCP_PM */
++static int __init binder_register(void)
++{
++ mptcp_gws = kzalloc(sizeof(*mptcp_gws), GFP_KERNEL);
++ if (!mptcp_gws)
++ return -ENOMEM;
++
++ rwlock_init(&mptcp_gws_lock);
++
++ BUILD_BUG_ON(sizeof(struct binder_priv) > MPTCP_PM_SIZE);
++
++ mptcp_sysctl_binder = register_net_sysctl(&init_net, "net/mptcp",
++ binder_table);
++ if (!mptcp_sysctl_binder)
++ goto sysctl_fail;
++
++ if (mptcp_register_path_manager(&binder))
++ goto pm_failed;
++
++ return 0;
++
++pm_failed:
++ unregister_net_sysctl_table(mptcp_sysctl_binder);
++sysctl_fail:
++ kfree(mptcp_gws);
++
++ return -1;
++}
++
++static void binder_unregister(void)
++{
++ mptcp_unregister_path_manager(&binder);
++ unregister_net_sysctl_table(mptcp_sysctl_binder);
++ kfree(mptcp_gws);
++}
++
++module_init(binder_register);
++module_exit(binder_unregister);
++
++MODULE_AUTHOR("Luca Boccassi, Duncan Eastoe, Christoph Paasch (ndiffports)");
++MODULE_LICENSE("GPL");
++MODULE_DESCRIPTION("BINDER MPTCP");
++MODULE_VERSION("0.1");
+diff --git a/net/mptcp/mptcp_coupled.c b/net/mptcp/mptcp_coupled.c
+new file mode 100644
+index 000000000000..5d761164eb85
+--- /dev/null
++++ b/net/mptcp/mptcp_coupled.c
+@@ -0,0 +1,270 @@
++/*
++ * MPTCP implementation - Linked Increase congestion control Algorithm (LIA)
++ *
++ * Initial Design & Implementation:
++ * Sébastien Barré <sebastien.barre@uclouvain.be>
++ *
++ * Current Maintainer & Author:
++ * Christoph Paasch <christoph.paasch@uclouvain.be>
++ *
++ * Additional authors:
++ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
++ * Gregory Detal <gregory.detal@uclouvain.be>
++ * Fabien Duchêne <fabien.duchene@uclouvain.be>
++ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
++ * Lavkesh Lahngir <lavkesh51@gmail.com>
++ * Andreas Ripke <ripke@neclab.eu>
++ * Vlad Dogaru <vlad.dogaru@intel.com>
++ * Octavian Purdila <octavian.purdila@intel.com>
++ * John Ronan <jronan@tssg.org>
++ * Catalin Nicutar <catalin.nicutar@gmail.com>
++ * Brandon Heller <brandonh@stanford.edu>
++ *
++ *
++ * This program is free software; you can redistribute it and/or
++ * modify it under the terms of the GNU General Public License
++ * as published by the Free Software Foundation; either version
++ * 2 of the License, or (at your option) any later version.
++ */
++#include <net/tcp.h>
++#include <net/mptcp.h>
++
++#include <linux/module.h>
++
++/* Scaling is done in the numerator with alpha_scale_num and in the denominator
++ * with alpha_scale_den.
++ *
++ * To downscale, we just need to use alpha_scale.
++ *
++ * We have: alpha_scale = alpha_scale_num / (alpha_scale_den ^ 2)
++ */
++static int alpha_scale_den = 10;
++static int alpha_scale_num = 32;
++static int alpha_scale = 12;
++
++struct mptcp_ccc {
++ u64 alpha;
++ bool forced_update;
++};
++
++static inline int mptcp_ccc_sk_can_send(const struct sock *sk)
++{
++ return mptcp_sk_can_send(sk) && tcp_sk(sk)->srtt_us;
++}
++
++static inline u64 mptcp_get_alpha(const struct sock *meta_sk)
++{
++ return ((struct mptcp_ccc *)inet_csk_ca(meta_sk))->alpha;
++}
++
++static inline void mptcp_set_alpha(const struct sock *meta_sk, u64 alpha)
++{
++ ((struct mptcp_ccc *)inet_csk_ca(meta_sk))->alpha = alpha;
++}
++
++static inline u64 mptcp_ccc_scale(u32 val, int scale)
++{
++ return (u64) val << scale;
++}
++
++static inline bool mptcp_get_forced(const struct sock *meta_sk)
++{
++ return ((struct mptcp_ccc *)inet_csk_ca(meta_sk))->forced_update;
++}
++
++static inline void mptcp_set_forced(const struct sock *meta_sk, bool force)
++{
++ ((struct mptcp_ccc *)inet_csk_ca(meta_sk))->forced_update = force;
++}
++
++static void mptcp_ccc_recalc_alpha(const struct sock *sk)
++{
++ const struct mptcp_cb *mpcb = tcp_sk(sk)->mpcb;
++ const struct sock *sub_sk;
++ int best_cwnd = 0, best_rtt = 0, can_send = 0;
++ u64 max_numerator = 0, sum_denominator = 0, alpha = 1;
++
++ if (!mpcb)
++ return;
++
++ /* Only one subflow left - fall back to normal reno-behavior
++ * (set alpha to 1)
++ */
++ if (mpcb->cnt_established <= 1)
++ goto exit;
++
++ /* Do regular alpha-calculation for multiple subflows */
++
++ /* Find the max numerator of the alpha-calculation */
++ mptcp_for_each_sk(mpcb, sub_sk) {
++ struct tcp_sock *sub_tp = tcp_sk(sub_sk);
++ u64 tmp;
++
++ if (!mptcp_ccc_sk_can_send(sub_sk))
++ continue;
++
++ can_send++;
++
++ /* We need to look for the path that provides the max value.
++ * Integer overflow is not possible here, because
++ * tmp is a u64.
++ */
++ tmp = div64_u64(mptcp_ccc_scale(sub_tp->snd_cwnd,
++ alpha_scale_num), (u64)sub_tp->srtt_us * sub_tp->srtt_us);
++
++ if (tmp >= max_numerator) {
++ max_numerator = tmp;
++ best_cwnd = sub_tp->snd_cwnd;
++ best_rtt = sub_tp->srtt_us;
++ }
++ }
++
++ /* No subflow is able to send - we don't care anymore */
++ if (unlikely(!can_send))
++ goto exit;
++
++ /* Calculate the denominator */
++ mptcp_for_each_sk(mpcb, sub_sk) {
++ struct tcp_sock *sub_tp = tcp_sk(sub_sk);
++
++ if (!mptcp_ccc_sk_can_send(sub_sk))
++ continue;
++
++ sum_denominator += div_u64(
++ mptcp_ccc_scale(sub_tp->snd_cwnd,
++ alpha_scale_den) * best_rtt,
++ sub_tp->srtt_us);
++ }
++ sum_denominator *= sum_denominator;
++ if (unlikely(!sum_denominator)) {
++ pr_err("%s: sum_denominator == 0, cnt_established:%d\n",
++ __func__, mpcb->cnt_established);
++ mptcp_for_each_sk(mpcb, sub_sk) {
++ struct tcp_sock *sub_tp = tcp_sk(sub_sk);
++ pr_err("%s: pi:%d, state:%d, rtt:%u, cwnd: %u\n",
++ __func__, sub_tp->mptcp->path_index,
++ sub_sk->sk_state, sub_tp->srtt_us,
++ sub_tp->snd_cwnd);
++ }
++ }
++
++ alpha = div64_u64(mptcp_ccc_scale(best_cwnd, alpha_scale_num), sum_denominator);
++
++ if (unlikely(!alpha))
++ alpha = 1;
++
++exit:
++ mptcp_set_alpha(mptcp_meta_sk(sk), alpha);
++}
++
++static void mptcp_ccc_init(struct sock *sk)
++{
++ if (mptcp(tcp_sk(sk))) {
++ mptcp_set_forced(mptcp_meta_sk(sk), 0);
++ mptcp_set_alpha(mptcp_meta_sk(sk), 1);
++ }
++ /* If we are not using mptcp, behave like reno: return */
++}
++
++static void mptcp_ccc_cwnd_event(struct sock *sk, enum tcp_ca_event event)
++{
++ if (event == CA_EVENT_LOSS)
++ mptcp_ccc_recalc_alpha(sk);
++}
++
++static void mptcp_ccc_set_state(struct sock *sk, u8 ca_state)
++{
++ if (!mptcp(tcp_sk(sk)))
++ return;
++
++ mptcp_set_forced(mptcp_meta_sk(sk), 1);
++}
++
++static void mptcp_ccc_cong_avoid(struct sock *sk, u32 ack, u32 acked)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ const struct mptcp_cb *mpcb = tp->mpcb;
++ int snd_cwnd;
++
++ if (!mptcp(tp)) {
++ tcp_reno_cong_avoid(sk, ack, acked);
++ return;
++ }
++
++ if (!tcp_is_cwnd_limited(sk))
++ return;
++
++ if (tp->snd_cwnd <= tp->snd_ssthresh) {
++ /* In "safe" area, increase. */
++ tcp_slow_start(tp, acked);
++ mptcp_ccc_recalc_alpha(sk);
++ return;
++ }
++
++ if (mptcp_get_forced(mptcp_meta_sk(sk))) {
++ mptcp_ccc_recalc_alpha(sk);
++ mptcp_set_forced(mptcp_meta_sk(sk), 0);
++ }
++
++ if (mpcb->cnt_established > 1) {
++ u64 alpha = mptcp_get_alpha(mptcp_meta_sk(sk));
++
++ /* This may happen if, at initialization, the mpcb
++ * was not yet attached to the sock, and thus
++ * initializing alpha failed.
++ */
++ if (unlikely(!alpha))
++ alpha = 1;
++
++ snd_cwnd = (int)div_u64((u64)mptcp_ccc_scale(1, alpha_scale),
++ alpha);
++
++ /* snd_cwnd_cnt >= max (scale * tot_cwnd / alpha, cwnd)
++ * Thus, we select here the max value.
++ */
++ if (snd_cwnd < tp->snd_cwnd)
++ snd_cwnd = tp->snd_cwnd;
++ } else {
++ snd_cwnd = tp->snd_cwnd;
++ }
++
++ if (tp->snd_cwnd_cnt >= snd_cwnd) {
++ if (tp->snd_cwnd < tp->snd_cwnd_clamp) {
++ tp->snd_cwnd++;
++ mptcp_ccc_recalc_alpha(sk);
++ }
++
++ tp->snd_cwnd_cnt = 0;
++ } else {
++ tp->snd_cwnd_cnt++;
++ }
++}
++
++static struct tcp_congestion_ops mptcp_ccc = {
++ .init = mptcp_ccc_init,
++ .ssthresh = tcp_reno_ssthresh,
++ .cong_avoid = mptcp_ccc_cong_avoid,
++ .cwnd_event = mptcp_ccc_cwnd_event,
++ .set_state = mptcp_ccc_set_state,
++ .owner = THIS_MODULE,
++ .name = "lia",
++};
++
++static int __init mptcp_ccc_register(void)
++{
++ BUILD_BUG_ON(sizeof(struct mptcp_ccc) > ICSK_CA_PRIV_SIZE);
++ return tcp_register_congestion_control(&mptcp_ccc);
++}
++
++static void __exit mptcp_ccc_unregister(void)
++{
++ tcp_unregister_congestion_control(&mptcp_ccc);
++}
++
++module_init(mptcp_ccc_register);
++module_exit(mptcp_ccc_unregister);
++
++MODULE_AUTHOR("Christoph Paasch, Sébastien Barré");
++MODULE_LICENSE("GPL");
++MODULE_DESCRIPTION("MPTCP LINKED INCREASE CONGESTION CONTROL ALGORITHM");
++MODULE_VERSION("0.1");
+diff --git a/net/mptcp/mptcp_ctrl.c b/net/mptcp/mptcp_ctrl.c
+new file mode 100644
+index 000000000000..28dfa0479f5e
+--- /dev/null
++++ b/net/mptcp/mptcp_ctrl.c
+@@ -0,0 +1,2401 @@
++/*
++ * MPTCP implementation - MPTCP-control
++ *
++ * Initial Design & Implementation:
++ * Sébastien Barré <sebastien.barre@uclouvain.be>
++ *
++ * Current Maintainer & Author:
++ * Christoph Paasch <christoph.paasch@uclouvain.be>
++ *
++ * Additional authors:
++ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
++ * Gregory Detal <gregory.detal@uclouvain.be>
++ * Fabien Duchêne <fabien.duchene@uclouvain.be>
++ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
++ * Lavkesh Lahngir <lavkesh51@gmail.com>
++ * Andreas Ripke <ripke@neclab.eu>
++ * Vlad Dogaru <vlad.dogaru@intel.com>
++ * Octavian Purdila <octavian.purdila@intel.com>
++ * John Ronan <jronan@tssg.org>
++ * Catalin Nicutar <catalin.nicutar@gmail.com>
++ * Brandon Heller <brandonh@stanford.edu>
++ *
++ *
++ * This program is free software; you can redistribute it and/or
++ * modify it under the terms of the GNU General Public License
++ * as published by the Free Software Foundation; either version
++ * 2 of the License, or (at your option) any later version.
++ */
++
++#include <net/inet_common.h>
++#include <net/inet6_hashtables.h>
++#include <net/ipv6.h>
++#include <net/ip6_checksum.h>
++#include <net/mptcp.h>
++#include <net/mptcp_v4.h>
++#if IS_ENABLED(CONFIG_IPV6)
++#include <net/ip6_route.h>
++#include <net/mptcp_v6.h>
++#endif
++#include <net/sock.h>
++#include <net/tcp.h>
++#include <net/tcp_states.h>
++#include <net/transp_v6.h>
++#include <net/xfrm.h>
++
++#include <linux/cryptohash.h>
++#include <linux/kconfig.h>
++#include <linux/module.h>
++#include <linux/netpoll.h>
++#include <linux/list.h>
++#include <linux/jhash.h>
++#include <linux/tcp.h>
++#include <linux/net.h>
++#include <linux/in.h>
++#include <linux/random.h>
++#include <linux/inetdevice.h>
++#include <linux/workqueue.h>
++#include <linux/atomic.h>
++#include <linux/sysctl.h>
++
++static struct kmem_cache *mptcp_sock_cache __read_mostly;
++static struct kmem_cache *mptcp_cb_cache __read_mostly;
++static struct kmem_cache *mptcp_tw_cache __read_mostly;
++
++int sysctl_mptcp_enabled __read_mostly = 1;
++int sysctl_mptcp_checksum __read_mostly = 1;
++int sysctl_mptcp_debug __read_mostly;
++EXPORT_SYMBOL(sysctl_mptcp_debug);
++int sysctl_mptcp_syn_retries __read_mostly = 3;
++
++bool mptcp_init_failed __read_mostly;
++
++struct static_key mptcp_static_key = STATIC_KEY_INIT_FALSE;
++EXPORT_SYMBOL(mptcp_static_key);
++
++static int proc_mptcp_path_manager(ctl_table *ctl, int write,
++ void __user *buffer, size_t *lenp,
++ loff_t *ppos)
++{
++ char val[MPTCP_PM_NAME_MAX];
++ ctl_table tbl = {
++ .data = val,
++ .maxlen = MPTCP_PM_NAME_MAX,
++ };
++ int ret;
++
++ mptcp_get_default_path_manager(val);
++
++ ret = proc_dostring(&tbl, write, buffer, lenp, ppos);
++ if (write && ret == 0)
++ ret = mptcp_set_default_path_manager(val);
++ return ret;
++}
++
++static int proc_mptcp_scheduler(ctl_table *ctl, int write,
++ void __user *buffer, size_t *lenp,
++ loff_t *ppos)
++{
++ char val[MPTCP_SCHED_NAME_MAX];
++ ctl_table tbl = {
++ .data = val,
++ .maxlen = MPTCP_SCHED_NAME_MAX,
++ };
++ int ret;
++
++ mptcp_get_default_scheduler(val);
++
++ ret = proc_dostring(&tbl, write, buffer, lenp, ppos);
++ if (write && ret == 0)
++ ret = mptcp_set_default_scheduler(val);
++ return ret;
++}
++
++static struct ctl_table mptcp_table[] = {
++ {
++ .procname = "mptcp_enabled",
++ .data = &sysctl_mptcp_enabled,
++ .maxlen = sizeof(int),
++ .mode = 0644,
++ .proc_handler = &proc_dointvec
++ },
++ {
++ .procname = "mptcp_checksum",
++ .data = &sysctl_mptcp_checksum,
++ .maxlen = sizeof(int),
++ .mode = 0644,
++ .proc_handler = &proc_dointvec
++ },
++ {
++ .procname = "mptcp_debug",
++ .data = &sysctl_mptcp_debug,
++ .maxlen = sizeof(int),
++ .mode = 0644,
++ .proc_handler = &proc_dointvec
++ },
++ {
++ .procname = "mptcp_syn_retries",
++ .data = &sysctl_mptcp_syn_retries,
++ .maxlen = sizeof(int),
++ .mode = 0644,
++ .proc_handler = &proc_dointvec
++ },
++ {
++ .procname = "mptcp_path_manager",
++ .mode = 0644,
++ .maxlen = MPTCP_PM_NAME_MAX,
++ .proc_handler = proc_mptcp_path_manager,
++ },
++ {
++ .procname = "mptcp_scheduler",
++ .mode = 0644,
++ .maxlen = MPTCP_SCHED_NAME_MAX,
++ .proc_handler = proc_mptcp_scheduler,
++ },
++ { }
++};
++
++static inline u32 mptcp_hash_tk(u32 token)
++{
++ return token % MPTCP_HASH_SIZE;
++}
++
++struct hlist_nulls_head tk_hashtable[MPTCP_HASH_SIZE];
++EXPORT_SYMBOL(tk_hashtable);
++
++/* This second hashtable is needed to retrieve request socks
++ * created as a result of a join request. While the SYN contains
++ * the token, the final ack does not, so we need a separate hashtable
++ * to retrieve the mpcb.
++ */
++struct hlist_nulls_head mptcp_reqsk_htb[MPTCP_HASH_SIZE];
++spinlock_t mptcp_reqsk_hlock; /* hashtable protection */
++
++/* The following hash table is used to avoid collision of token */
++static struct hlist_nulls_head mptcp_reqsk_tk_htb[MPTCP_HASH_SIZE];
++spinlock_t mptcp_tk_hashlock; /* hashtable protection */
++
++static bool mptcp_reqsk_find_tk(const u32 token)
++{
++ const u32 hash = mptcp_hash_tk(token);
++ const struct mptcp_request_sock *mtreqsk;
++ const struct hlist_nulls_node *node;
++
++begin:
++ hlist_nulls_for_each_entry_rcu(mtreqsk, node,
++ &mptcp_reqsk_tk_htb[hash], hash_entry) {
++ if (token == mtreqsk->mptcp_loc_token)
++ return true;
++ }
++ /* A request-socket is destroyed by RCU. So, it might have been recycled
++ * and put into another hash-table list. So, after the lookup we may
++ * end up in a different list. So, we may need to restart.
++ *
++ * See also the comment in __inet_lookup_established.
++ */
++ if (get_nulls_value(node) != hash)
++ goto begin;
++ return false;
++}
++
++static void mptcp_reqsk_insert_tk(struct request_sock *reqsk, const u32 token)
++{
++ u32 hash = mptcp_hash_tk(token);
++
++ hlist_nulls_add_head_rcu(&mptcp_rsk(reqsk)->hash_entry,
++ &mptcp_reqsk_tk_htb[hash]);
++}
++
++static void mptcp_reqsk_remove_tk(const struct request_sock *reqsk)
++{
++ rcu_read_lock();
++ spin_lock(&mptcp_tk_hashlock);
++ hlist_nulls_del_init_rcu(&mptcp_rsk(reqsk)->hash_entry);
++ spin_unlock(&mptcp_tk_hashlock);
++ rcu_read_unlock();
++}
++
++void mptcp_reqsk_destructor(struct request_sock *req)
++{
++ if (!mptcp_rsk(req)->is_sub) {
++ if (in_softirq()) {
++ mptcp_reqsk_remove_tk(req);
++ } else {
++ rcu_read_lock_bh();
++ spin_lock(&mptcp_tk_hashlock);
++ hlist_nulls_del_init_rcu(&mptcp_rsk(req)->hash_entry);
++ spin_unlock(&mptcp_tk_hashlock);
++ rcu_read_unlock_bh();
++ }
++ } else {
++ mptcp_hash_request_remove(req);
++ }
++}
++
++static void __mptcp_hash_insert(struct tcp_sock *meta_tp, const u32 token)
++{
++ u32 hash = mptcp_hash_tk(token);
++ hlist_nulls_add_head_rcu(&meta_tp->tk_table, &tk_hashtable[hash]);
++ meta_tp->inside_tk_table = 1;
++}
++
++static bool mptcp_find_token(u32 token)
++{
++ const u32 hash = mptcp_hash_tk(token);
++ const struct tcp_sock *meta_tp;
++ const struct hlist_nulls_node *node;
++
++begin:
++ hlist_nulls_for_each_entry_rcu(meta_tp, node, &tk_hashtable[hash], tk_table) {
++ if (token == meta_tp->mptcp_loc_token)
++ return true;
++ }
++ /* A TCP-socket is destroyed by RCU. So, it might have been recycled
++ * and put into another hash-table list. So, after the lookup we may
++ * end up in a different list. So, we may need to restart.
++ *
++ * See also the comment in __inet_lookup_established.
++ */
++ if (get_nulls_value(node) != hash)
++ goto begin;
++ return false;
++}
++
++static void mptcp_set_key_reqsk(struct request_sock *req,
++ const struct sk_buff *skb)
++{
++ const struct inet_request_sock *ireq = inet_rsk(req);
++ struct mptcp_request_sock *mtreq = mptcp_rsk(req);
++
++ if (skb->protocol == htons(ETH_P_IP)) {
++ mtreq->mptcp_loc_key = mptcp_v4_get_key(ip_hdr(skb)->saddr,
++ ip_hdr(skb)->daddr,
++ htons(ireq->ir_num),
++ ireq->ir_rmt_port);
++#if IS_ENABLED(CONFIG_IPV6)
++ } else {
++ mtreq->mptcp_loc_key = mptcp_v6_get_key(ipv6_hdr(skb)->saddr.s6_addr32,
++ ipv6_hdr(skb)->daddr.s6_addr32,
++ htons(ireq->ir_num),
++ ireq->ir_rmt_port);
++#endif
++ }
++
++ mptcp_key_sha1(mtreq->mptcp_loc_key, &mtreq->mptcp_loc_token, NULL);
++}
++
++/* New MPTCP-connection request, prepare a new token for the meta-socket that
++ * will be created in mptcp_check_req_master(), and store the received token.
++ */
++void mptcp_reqsk_new_mptcp(struct request_sock *req,
++ const struct mptcp_options_received *mopt,
++ const struct sk_buff *skb)
++{
++ struct mptcp_request_sock *mtreq = mptcp_rsk(req);
++
++ inet_rsk(req)->saw_mpc = 1;
++
++ rcu_read_lock();
++ spin_lock(&mptcp_tk_hashlock);
++ do {
++ mptcp_set_key_reqsk(req, skb);
++ } while (mptcp_reqsk_find_tk(mtreq->mptcp_loc_token) ||
++ mptcp_find_token(mtreq->mptcp_loc_token));
++
++ mptcp_reqsk_insert_tk(req, mtreq->mptcp_loc_token);
++ spin_unlock(&mptcp_tk_hashlock);
++ rcu_read_unlock();
++ mtreq->mptcp_rem_key = mopt->mptcp_key;
++}
++
++static void mptcp_set_key_sk(const struct sock *sk)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ const struct inet_sock *isk = inet_sk(sk);
++
++ if (sk->sk_family == AF_INET)
++ tp->mptcp_loc_key = mptcp_v4_get_key(isk->inet_saddr,
++ isk->inet_daddr,
++ isk->inet_sport,
++ isk->inet_dport);
++#if IS_ENABLED(CONFIG_IPV6)
++ else
++ tp->mptcp_loc_key = mptcp_v6_get_key(inet6_sk(sk)->saddr.s6_addr32,
++ sk->sk_v6_daddr.s6_addr32,
++ isk->inet_sport,
++ isk->inet_dport);
++#endif
++
++ mptcp_key_sha1(tp->mptcp_loc_key,
++ &tp->mptcp_loc_token, NULL);
++}
++
++void mptcp_connect_init(struct sock *sk)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++
++ rcu_read_lock_bh();
++ spin_lock(&mptcp_tk_hashlock);
++ do {
++ mptcp_set_key_sk(sk);
++ } while (mptcp_reqsk_find_tk(tp->mptcp_loc_token) ||
++ mptcp_find_token(tp->mptcp_loc_token));
++
++ __mptcp_hash_insert(tp, tp->mptcp_loc_token);
++ spin_unlock(&mptcp_tk_hashlock);
++ rcu_read_unlock_bh();
++}
++
++/**
++ * This function increments the refcount of the mpcb struct.
++ * It is the responsibility of the caller to decrement when releasing
++ * the structure.
++ */
++struct sock *mptcp_hash_find(const struct net *net, const u32 token)
++{
++ const u32 hash = mptcp_hash_tk(token);
++ const struct tcp_sock *meta_tp;
++ struct sock *meta_sk = NULL;
++ const struct hlist_nulls_node *node;
++
++ rcu_read_lock();
++begin:
++ hlist_nulls_for_each_entry_rcu(meta_tp, node, &tk_hashtable[hash],
++ tk_table) {
++ meta_sk = (struct sock *)meta_tp;
++ if (token == meta_tp->mptcp_loc_token &&
++ net_eq(net, sock_net(meta_sk))) {
++ if (unlikely(!atomic_inc_not_zero(&meta_sk->sk_refcnt)))
++ goto out;
++ if (unlikely(token != meta_tp->mptcp_loc_token ||
++ !net_eq(net, sock_net(meta_sk)))) {
++ sock_gen_put(meta_sk);
++ goto begin;
++ }
++ goto found;
++ }
++ }
++ /* A TCP-socket is destroyed by RCU. So, it might have been recycled
++ * and put into another hash-table list. So, after the lookup we may
++ * end up in a different list. So, we may need to restart.
++ *
++ * See also the comment in __inet_lookup_established.
++ */
++ if (get_nulls_value(node) != hash)
++ goto begin;
++out:
++ meta_sk = NULL;
++found:
++ rcu_read_unlock();
++ return meta_sk;
++}
++
++void mptcp_hash_remove_bh(struct tcp_sock *meta_tp)
++{
++ /* remove from the token hashtable */
++ rcu_read_lock_bh();
++ spin_lock(&mptcp_tk_hashlock);
++ hlist_nulls_del_init_rcu(&meta_tp->tk_table);
++ meta_tp->inside_tk_table = 0;
++ spin_unlock(&mptcp_tk_hashlock);
++ rcu_read_unlock_bh();
++}
++
++void mptcp_hash_remove(struct tcp_sock *meta_tp)
++{
++ rcu_read_lock();
++ spin_lock(&mptcp_tk_hashlock);
++ hlist_nulls_del_init_rcu(&meta_tp->tk_table);
++ meta_tp->inside_tk_table = 0;
++ spin_unlock(&mptcp_tk_hashlock);
++ rcu_read_unlock();
++}
++
++struct sock *mptcp_select_ack_sock(const struct sock *meta_sk)
++{
++ const struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct sock *sk, *rttsk = NULL, *lastsk = NULL;
++ u32 min_time = 0, last_active = 0;
++
++ mptcp_for_each_sk(meta_tp->mpcb, sk) {
++ struct tcp_sock *tp = tcp_sk(sk);
++ u32 elapsed;
++
++ if (!mptcp_sk_can_send_ack(sk) || tp->pf)
++ continue;
++
++ elapsed = keepalive_time_elapsed(tp);
++
++ /* We take the one with the lowest RTT within a reasonable
++ * (meta-RTO)-timeframe
++ */
++ if (elapsed < inet_csk(meta_sk)->icsk_rto) {
++ if (!min_time || tp->srtt_us < min_time) {
++ min_time = tp->srtt_us;
++ rttsk = sk;
++ }
++ continue;
++ }
++
++ /* Otherwise, we just take the most recent active */
++ if (!rttsk && (!last_active || elapsed < last_active)) {
++ last_active = elapsed;
++ lastsk = sk;
++ }
++ }
++
++ if (rttsk)
++ return rttsk;
++
++ return lastsk;
++}
++EXPORT_SYMBOL(mptcp_select_ack_sock);
++
++static void mptcp_sock_def_error_report(struct sock *sk)
++{
++ const struct mptcp_cb *mpcb = tcp_sk(sk)->mpcb;
++
++ if (!sock_flag(sk, SOCK_DEAD))
++ mptcp_sub_close(sk, 0);
++
++ if (mpcb->infinite_mapping_rcv || mpcb->infinite_mapping_snd ||
++ mpcb->send_infinite_mapping) {
++ struct sock *meta_sk = mptcp_meta_sk(sk);
++
++ meta_sk->sk_err = sk->sk_err;
++ meta_sk->sk_err_soft = sk->sk_err_soft;
++
++ if (!sock_flag(meta_sk, SOCK_DEAD))
++ meta_sk->sk_error_report(meta_sk);
++
++ tcp_done(meta_sk);
++ }
++
++ sk->sk_err = 0;
++ return;
++}
++
++static void mptcp_mpcb_put(struct mptcp_cb *mpcb)
++{
++ if (atomic_dec_and_test(&mpcb->mpcb_refcnt)) {
++ mptcp_cleanup_path_manager(mpcb);
++ mptcp_cleanup_scheduler(mpcb);
++ kmem_cache_free(mptcp_cb_cache, mpcb);
++ }
++}
++
++static void mptcp_sock_destruct(struct sock *sk)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++
++ inet_sock_destruct(sk);
++
++ if (!is_meta_sk(sk) && !tp->was_meta_sk) {
++ BUG_ON(!hlist_unhashed(&tp->mptcp->cb_list));
++
++ kmem_cache_free(mptcp_sock_cache, tp->mptcp);
++ tp->mptcp = NULL;
++
++ /* Taken when mpcb pointer was set */
++ sock_put(mptcp_meta_sk(sk));
++ mptcp_mpcb_put(tp->mpcb);
++ } else {
++ struct mptcp_cb *mpcb = tp->mpcb;
++ struct mptcp_tw *mptw;
++
++ /* The mpcb is disappearing - we can make the final
++ * update to the rcv_nxt of the time-wait-sock and remove
++ * its reference to the mpcb.
++ */
++ spin_lock_bh(&mpcb->tw_lock);
++ list_for_each_entry_rcu(mptw, &mpcb->tw_list, list) {
++ list_del_rcu(&mptw->list);
++ mptw->in_list = 0;
++ mptcp_mpcb_put(mpcb);
++ rcu_assign_pointer(mptw->mpcb, NULL);
++ }
++ spin_unlock_bh(&mpcb->tw_lock);
++
++ mptcp_mpcb_put(mpcb);
++
++ mptcp_debug("%s destroying meta-sk\n", __func__);
++ }
++
++ WARN_ON(!static_key_false(&mptcp_static_key));
++ /* Must be the last call, because is_meta_sk() above still needs the
++ * static key
++ */
++ static_key_slow_dec(&mptcp_static_key);
++}
++
++void mptcp_destroy_sock(struct sock *sk)
++{
++ if (is_meta_sk(sk)) {
++ struct sock *sk_it, *tmpsk;
++
++ __skb_queue_purge(&tcp_sk(sk)->mpcb->reinject_queue);
++ mptcp_purge_ofo_queue(tcp_sk(sk));
++
++ /* We have to close all remaining subflows. Normally, they
++ * should all be about to get closed. But, if the kernel is
++ * forcing a closure (e.g., tcp_write_err), the subflows might
++ * not have been closed properly (as we are waiting for the
++ * DATA_ACK of the DATA_FIN).
++ */
++ mptcp_for_each_sk_safe(tcp_sk(sk)->mpcb, sk_it, tmpsk) {
++ /* Already did call tcp_close - waiting for graceful
++ * closure, or if we are retransmitting fast-close on
++ * the subflow. The reset (or timeout) will kill the
++ * subflow..
++ */
++ if (tcp_sk(sk_it)->closing ||
++ tcp_sk(sk_it)->send_mp_fclose)
++ continue;
++
++ /* Allow the delayed work first to prevent time-wait state */
++ if (delayed_work_pending(&tcp_sk(sk_it)->mptcp->work))
++ continue;
++
++ mptcp_sub_close(sk_it, 0);
++ }
++
++ mptcp_delete_synack_timer(sk);
++ } else {
++ mptcp_del_sock(sk);
++ }
++}
++
++static void mptcp_set_state(struct sock *sk)
++{
++ struct sock *meta_sk = mptcp_meta_sk(sk);
++
++ /* Meta is not yet established - wake up the application */
++ if ((1 << meta_sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV) &&
++ sk->sk_state == TCP_ESTABLISHED) {
++ tcp_set_state(meta_sk, TCP_ESTABLISHED);
++
++ if (!sock_flag(meta_sk, SOCK_DEAD)) {
++ meta_sk->sk_state_change(meta_sk);
++ sk_wake_async(meta_sk, SOCK_WAKE_IO, POLL_OUT);
++ }
++ }
++
++ if (sk->sk_state == TCP_ESTABLISHED) {
++ tcp_sk(sk)->mptcp->establish_increased = 1;
++ tcp_sk(sk)->mpcb->cnt_established++;
++ }
++}
++
++void mptcp_init_congestion_control(struct sock *sk)
++{
++ struct inet_connection_sock *icsk = inet_csk(sk);
++ struct inet_connection_sock *meta_icsk = inet_csk(mptcp_meta_sk(sk));
++ const struct tcp_congestion_ops *ca = meta_icsk->icsk_ca_ops;
++
++ /* The application didn't set the congestion control to use;
++ * fall back to the default one.
++ */
++ if (ca == &tcp_init_congestion_ops)
++ goto use_default;
++
++ /* Use the same congestion control as set by the user. If the
++ * module is not available, fall back to the default one.
++ */
++ if (!try_module_get(ca->owner)) {
++ pr_warn("%s: fallback to the system default CC\n", __func__);
++ goto use_default;
++ }
++
++ icsk->icsk_ca_ops = ca;
++ if (icsk->icsk_ca_ops->init)
++ icsk->icsk_ca_ops->init(sk);
++
++ return;
++
++use_default:
++ icsk->icsk_ca_ops = &tcp_init_congestion_ops;
++ tcp_init_congestion_control(sk);
++}
++
++u32 mptcp_secret[MD5_MESSAGE_BYTES / 4] ____cacheline_aligned;
++u32 mptcp_seed = 0;
++
++void mptcp_key_sha1(u64 key, u32 *token, u64 *idsn)
++{
++ u32 workspace[SHA_WORKSPACE_WORDS];
++ u32 mptcp_hashed_key[SHA_DIGEST_WORDS];
++ u8 input[64];
++ int i;
++
++ memset(workspace, 0, sizeof(workspace));
++
++ /* Initialize input with appropriate padding */
++ memset(&input[9], 0, sizeof(input) - 10); /* -10, because the last byte
++ * is explicitly set too
++ */
++ memcpy(input, &key, sizeof(key)); /* Copy key to the msg beginning */
++ input[8] = 0x80; /* Padding: First bit after message = 1 */
++ input[63] = 0x40; /* Padding: Length of the message = 64 bits */
++
++ sha_init(mptcp_hashed_key);
++ sha_transform(mptcp_hashed_key, input, workspace);
++
++ for (i = 0; i < 5; i++)
++ mptcp_hashed_key[i] = cpu_to_be32(mptcp_hashed_key[i]);
++
++ if (token)
++ *token = mptcp_hashed_key[0];
++ if (idsn)
++ *idsn = *((u64 *)&mptcp_hashed_key[3]);
++}
++
++void mptcp_hmac_sha1(u8 *key_1, u8 *key_2, u8 *rand_1, u8 *rand_2,
++ u32 *hash_out)
++{
++ u32 workspace[SHA_WORKSPACE_WORDS];
++ u8 input[128]; /* 2 512-bit blocks */
++ int i;
++
++ memset(workspace, 0, sizeof(workspace));
++
++ /* Generate key xored with ipad */
++ memset(input, 0x36, 64);
++ for (i = 0; i < 8; i++)
++ input[i] ^= key_1[i];
++ for (i = 0; i < 8; i++)
++ input[i + 8] ^= key_2[i];
++
++ memcpy(&input[64], rand_1, 4);
++ memcpy(&input[68], rand_2, 4);
++ input[72] = 0x80; /* Padding: First bit after message = 1 */
++ memset(&input[73], 0, 53);
++
++ /* Padding: Length of the message = 512 + 64 bits */
++ input[126] = 0x02;
++ input[127] = 0x40;
++
++ sha_init(hash_out);
++ sha_transform(hash_out, input, workspace);
++ memset(workspace, 0, sizeof(workspace));
++
++ sha_transform(hash_out, &input[64], workspace);
++ memset(workspace, 0, sizeof(workspace));
++
++ for (i = 0; i < 5; i++)
++ hash_out[i] = cpu_to_be32(hash_out[i]);
++
++ /* Prepare second part of hmac */
++ memset(input, 0x5C, 64);
++ for (i = 0; i < 8; i++)
++ input[i] ^= key_1[i];
++ for (i = 0; i < 8; i++)
++ input[i + 8] ^= key_2[i];
++
++ memcpy(&input[64], hash_out, 20);
++ input[84] = 0x80;
++ memset(&input[85], 0, 41);
++
++ /* Padding: Length of the message = 512 + 160 bits */
++ input[126] = 0x02;
++ input[127] = 0xA0;
++
++ sha_init(hash_out);
++ sha_transform(hash_out, input, workspace);
++ memset(workspace, 0, sizeof(workspace));
++
++ sha_transform(hash_out, &input[64], workspace);
++
++ for (i = 0; i < 5; i++)
++ hash_out[i] = cpu_to_be32(hash_out[i]);
++}
++
++static void mptcp_mpcb_inherit_sockopts(struct sock *meta_sk, struct sock *master_sk)
++{
++ /* Socket-options handled by sk_clone_lock while creating the meta-sk.
++ * ======
++ * SO_SNDBUF, SO_SNDBUFFORCE, SO_RCVBUF, SO_RCVBUFFORCE, SO_RCVLOWAT,
++ * SO_RCVTIMEO, SO_SNDTIMEO, SO_ATTACH_FILTER, SO_DETACH_FILTER,
++ * TCP_NODELAY, TCP_CORK
++ *
++ * Socket-options handled in this function here
++ * ======
++ * TCP_DEFER_ACCEPT
++ * SO_KEEPALIVE
++ *
++ * Socket-options on the todo-list
++ * ======
++ * SO_BINDTODEVICE - should probably prevent creation of new subsocks
++ * across other devices. - what about the api-draft?
++ * SO_DEBUG
++ * SO_REUSEADDR - probably we don't care about this
++ * SO_DONTROUTE, SO_BROADCAST
++ * SO_OOBINLINE
++ * SO_LINGER
++ * SO_TIMESTAMP* - I don't think this is of concern for a SOCK_STREAM
++ * SO_PASSSEC - I don't think this is of concern for a SOCK_STREAM
++ * SO_RXQ_OVFL
++ * TCP_COOKIE_TRANSACTIONS
++ * TCP_MAXSEG
++ * TCP_THIN_* - Handled by sk_clone_lock, but we need to support this
++ * in mptcp_retransmit_timer. AND we need to check what is
++ * about the subsockets.
++ * TCP_LINGER2
++ * TCP_WINDOW_CLAMP
++ * TCP_USER_TIMEOUT
++ * TCP_MD5SIG
++ *
++ * Socket-options of no concern for the meta-socket (but for the subsocket)
++ * ======
++ * SO_PRIORITY
++ * SO_MARK
++ * TCP_CONGESTION
++ * TCP_SYNCNT
++ * TCP_QUICKACK
++ */
++
++ /* DEFER_ACCEPT should not be set on the meta, as we want to accept new subflows directly */
++ inet_csk(meta_sk)->icsk_accept_queue.rskq_defer_accept = 0;
++
++ /* Keepalives are handled entirely at the MPTCP-layer */
++ if (sock_flag(meta_sk, SOCK_KEEPOPEN)) {
++ inet_csk_reset_keepalive_timer(meta_sk,
++ keepalive_time_when(tcp_sk(meta_sk)));
++ sock_reset_flag(master_sk, SOCK_KEEPOPEN);
++ inet_csk_delete_keepalive_timer(master_sk);
++ }
++
++ /* Do not propagate subflow-errors up to the MPTCP-layer */
++ inet_sk(master_sk)->recverr = 0;
++}
++
++static void mptcp_sub_inherit_sockopts(const struct sock *meta_sk, struct sock *sub_sk)
++{
++ /* IP_TOS also goes to the subflow. */
++ if (inet_sk(sub_sk)->tos != inet_sk(meta_sk)->tos) {
++ inet_sk(sub_sk)->tos = inet_sk(meta_sk)->tos;
++ sub_sk->sk_priority = meta_sk->sk_priority;
++ sk_dst_reset(sub_sk);
++ }
++
++ /* Inherit SO_REUSEADDR */
++ sub_sk->sk_reuse = meta_sk->sk_reuse;
++
++ /* Inherit snd/rcv-buffer locks */
++ sub_sk->sk_userlocks = meta_sk->sk_userlocks & ~SOCK_BINDPORT_LOCK;
++
++ /* Nagle/Cork is forced off on the subflows. It is handled at the meta-layer */
++ tcp_sk(sub_sk)->nonagle = TCP_NAGLE_OFF|TCP_NAGLE_PUSH;
++
++ /* Keepalives are handled entirely at the MPTCP-layer */
++ if (sock_flag(sub_sk, SOCK_KEEPOPEN)) {
++ sock_reset_flag(sub_sk, SOCK_KEEPOPEN);
++ inet_csk_delete_keepalive_timer(sub_sk);
++ }
++
++ /* Do not propagate subflow-errors up to the MPTCP-layer */
++ inet_sk(sub_sk)->recverr = 0;
++}
++
++int mptcp_backlog_rcv(struct sock *meta_sk, struct sk_buff *skb)
++{
++ /* skb-sk may be NULL if we receive a packet immediately after the
++ * SYN/ACK + MP_CAPABLE.
++ */
++ struct sock *sk = skb->sk ? skb->sk : meta_sk;
++ int ret = 0;
++
++ skb->sk = NULL;
++
++ if (unlikely(!atomic_inc_not_zero(&sk->sk_refcnt))) {
++ kfree_skb(skb);
++ return 0;
++ }
++
++ if (sk->sk_family == AF_INET)
++ ret = tcp_v4_do_rcv(sk, skb);
++#if IS_ENABLED(CONFIG_IPV6)
++ else
++ ret = tcp_v6_do_rcv(sk, skb);
++#endif
++
++ sock_put(sk);
++ return ret;
++}
++
++struct lock_class_key meta_key;
++struct lock_class_key meta_slock_key;
++
++static void mptcp_synack_timer_handler(unsigned long data)
++{
++ struct sock *meta_sk = (struct sock *) data;
++ struct listen_sock *lopt = inet_csk(meta_sk)->icsk_accept_queue.listen_opt;
++
++ /* Only process if socket is not in use. */
++ bh_lock_sock(meta_sk);
++
++ if (sock_owned_by_user(meta_sk)) {
++ /* Try again later. */
++ mptcp_reset_synack_timer(meta_sk, HZ/20);
++ goto out;
++ }
++
++ /* May happen if the queue got destructed in mptcp_close */
++ if (!lopt)
++ goto out;
++
++ inet_csk_reqsk_queue_prune(meta_sk, TCP_SYNQ_INTERVAL,
++ TCP_TIMEOUT_INIT, TCP_RTO_MAX);
++
++ if (lopt->qlen)
++ mptcp_reset_synack_timer(meta_sk, TCP_SYNQ_INTERVAL);
++
++out:
++ bh_unlock_sock(meta_sk);
++ sock_put(meta_sk);
++}
++
++static const struct tcp_sock_ops mptcp_meta_specific = {
++ .__select_window = __mptcp_select_window,
++ .select_window = mptcp_select_window,
++ .select_initial_window = mptcp_select_initial_window,
++ .init_buffer_space = mptcp_init_buffer_space,
++ .set_rto = mptcp_tcp_set_rto,
++ .should_expand_sndbuf = mptcp_should_expand_sndbuf,
++ .init_congestion_control = mptcp_init_congestion_control,
++ .send_fin = mptcp_send_fin,
++ .write_xmit = mptcp_write_xmit,
++ .send_active_reset = mptcp_send_active_reset,
++ .write_wakeup = mptcp_write_wakeup,
++ .prune_ofo_queue = mptcp_prune_ofo_queue,
++ .retransmit_timer = mptcp_retransmit_timer,
++ .time_wait = mptcp_time_wait,
++ .cleanup_rbuf = mptcp_cleanup_rbuf,
++};
++
++static const struct tcp_sock_ops mptcp_sub_specific = {
++ .__select_window = __mptcp_select_window,
++ .select_window = mptcp_select_window,
++ .select_initial_window = mptcp_select_initial_window,
++ .init_buffer_space = mptcp_init_buffer_space,
++ .set_rto = mptcp_tcp_set_rto,
++ .should_expand_sndbuf = mptcp_should_expand_sndbuf,
++ .init_congestion_control = mptcp_init_congestion_control,
++ .send_fin = tcp_send_fin,
++ .write_xmit = tcp_write_xmit,
++ .send_active_reset = tcp_send_active_reset,
++ .write_wakeup = tcp_write_wakeup,
++ .prune_ofo_queue = tcp_prune_ofo_queue,
++ .retransmit_timer = tcp_retransmit_timer,
++ .time_wait = tcp_time_wait,
++ .cleanup_rbuf = tcp_cleanup_rbuf,
++};
++
++static int mptcp_alloc_mpcb(struct sock *meta_sk, __u64 remote_key, u32 window)
++{
++ struct mptcp_cb *mpcb;
++ struct sock *master_sk;
++ struct inet_connection_sock *master_icsk, *meta_icsk = inet_csk(meta_sk);
++ struct tcp_sock *master_tp, *meta_tp = tcp_sk(meta_sk);
++ u64 idsn;
++
++ dst_release(meta_sk->sk_rx_dst);
++ meta_sk->sk_rx_dst = NULL;
++ /* This flag is set to tell sock_lock_init to
++ * reclassify the lock-class of the master socket.
++ */
++ meta_tp->is_master_sk = 1;
++ master_sk = sk_clone_lock(meta_sk, GFP_ATOMIC | __GFP_ZERO);
++ meta_tp->is_master_sk = 0;
++ if (!master_sk)
++ return -ENOBUFS;
++
++ master_tp = tcp_sk(master_sk);
++ master_icsk = inet_csk(master_sk);
++
++ mpcb = kmem_cache_zalloc(mptcp_cb_cache, GFP_ATOMIC);
++ if (!mpcb) {
++ /* sk_free (and __sk_free) requires wmem_alloc to be 1.
++ * All the rest is set to 0 thanks to __GFP_ZERO above.
++ */
++ atomic_set(&master_sk->sk_wmem_alloc, 1);
++ sk_free(master_sk);
++ return -ENOBUFS;
++ }
++
++#if IS_ENABLED(CONFIG_IPV6)
++ if (meta_icsk->icsk_af_ops == &mptcp_v6_mapped) {
++ struct ipv6_pinfo *newnp, *np = inet6_sk(meta_sk);
++
++ inet_sk(master_sk)->pinet6 = &((struct tcp6_sock *)master_sk)->inet6;
++
++ newnp = inet6_sk(master_sk);
++ memcpy(newnp, np, sizeof(struct ipv6_pinfo));
++
++ newnp->ipv6_mc_list = NULL;
++ newnp->ipv6_ac_list = NULL;
++ newnp->ipv6_fl_list = NULL;
++ newnp->opt = NULL;
++ newnp->pktoptions = NULL;
++ (void)xchg(&newnp->rxpmtu, NULL);
++ } else if (meta_sk->sk_family == AF_INET6) {
++ struct ipv6_pinfo *newnp, *np = inet6_sk(meta_sk);
++
++ inet_sk(master_sk)->pinet6 = &((struct tcp6_sock *)master_sk)->inet6;
++
++ newnp = inet6_sk(master_sk);
++ memcpy(newnp, np, sizeof(struct ipv6_pinfo));
++
++ newnp->hop_limit = -1;
++ newnp->mcast_hops = IPV6_DEFAULT_MCASTHOPS;
++ newnp->mc_loop = 1;
++ newnp->pmtudisc = IPV6_PMTUDISC_WANT;
++ newnp->ipv6only = sock_net(master_sk)->ipv6.sysctl.bindv6only;
++ }
++#endif
++
++ meta_tp->mptcp = NULL;
++
++ /* Store the keys and generate the peer's token */
++ mpcb->mptcp_loc_key = meta_tp->mptcp_loc_key;
++ mpcb->mptcp_loc_token = meta_tp->mptcp_loc_token;
++
++ /* Generate Initial data-sequence-numbers */
++ mptcp_key_sha1(mpcb->mptcp_loc_key, NULL, &idsn);
++ idsn = ntohll(idsn) + 1;
++ mpcb->snd_high_order[0] = idsn >> 32;
++ mpcb->snd_high_order[1] = mpcb->snd_high_order[0] - 1;
++
++ meta_tp->write_seq = (u32)idsn;
++ meta_tp->snd_sml = meta_tp->write_seq;
++ meta_tp->snd_una = meta_tp->write_seq;
++ meta_tp->snd_nxt = meta_tp->write_seq;
++ meta_tp->pushed_seq = meta_tp->write_seq;
++ meta_tp->snd_up = meta_tp->write_seq;
++
++ mpcb->mptcp_rem_key = remote_key;
++ mptcp_key_sha1(mpcb->mptcp_rem_key, &mpcb->mptcp_rem_token, &idsn);
++ idsn = ntohll(idsn) + 1;
++ mpcb->rcv_high_order[0] = idsn >> 32;
++ mpcb->rcv_high_order[1] = mpcb->rcv_high_order[0] + 1;
++ meta_tp->copied_seq = (u32) idsn;
++ meta_tp->rcv_nxt = (u32) idsn;
++ meta_tp->rcv_wup = (u32) idsn;
++
++ meta_tp->snd_wl1 = meta_tp->rcv_nxt - 1;
++ meta_tp->snd_wnd = window;
++ meta_tp->retrans_stamp = 0; /* Set in tcp_connect() */
++
++ meta_tp->packets_out = 0;
++ meta_icsk->icsk_probes_out = 0;
++
++ /* Set mptcp-pointers */
++ master_tp->mpcb = mpcb;
++ master_tp->meta_sk = meta_sk;
++ meta_tp->mpcb = mpcb;
++ meta_tp->meta_sk = meta_sk;
++ mpcb->meta_sk = meta_sk;
++ mpcb->master_sk = master_sk;
++
++ meta_tp->was_meta_sk = 0;
++
++ /* Initialize the queues */
++ skb_queue_head_init(&mpcb->reinject_queue);
++ skb_queue_head_init(&master_tp->out_of_order_queue);
++ tcp_prequeue_init(master_tp);
++ INIT_LIST_HEAD(&master_tp->tsq_node);
++
++ master_tp->tsq_flags = 0;
++
++ mutex_init(&mpcb->mpcb_mutex);
++
++ /* Init the accept_queue structure. We support a queue of 32 pending
++ * connections; it does not need to be huge, since we only store
++ * pending subflow creations here.
++ */
++ if (reqsk_queue_alloc(&meta_icsk->icsk_accept_queue, 32, GFP_ATOMIC)) {
++ inet_put_port(master_sk);
++ kmem_cache_free(mptcp_cb_cache, mpcb);
++ sk_free(master_sk);
++ return -ENOMEM;
++ }
++
++ /* Redefine function-pointers as the meta-sk is now fully ready */
++ static_key_slow_inc(&mptcp_static_key);
++ meta_tp->mpc = 1;
++ meta_tp->ops = &mptcp_meta_specific;
++
++ meta_sk->sk_backlog_rcv = mptcp_backlog_rcv;
++ meta_sk->sk_destruct = mptcp_sock_destruct;
++
++ /* Meta-level retransmit timer */
++ meta_icsk->icsk_rto *= 2; /* Double the initial RTO */
++
++ tcp_init_xmit_timers(master_sk);
++ /* Has been set for sending out the SYN */
++ inet_csk_clear_xmit_timer(meta_sk, ICSK_TIME_RETRANS);
++
++ if (!meta_tp->inside_tk_table) {
++ /* Adding the meta_tp in the token hashtable - coming from server-side */
++ rcu_read_lock();
++ spin_lock(&mptcp_tk_hashlock);
++
++ __mptcp_hash_insert(meta_tp, mpcb->mptcp_loc_token);
++
++ spin_unlock(&mptcp_tk_hashlock);
++ rcu_read_unlock();
++ }
++ master_tp->inside_tk_table = 0;
++
++ /* Init time-wait stuff */
++ INIT_LIST_HEAD(&mpcb->tw_list);
++ spin_lock_init(&mpcb->tw_lock);
++
++ INIT_HLIST_HEAD(&mpcb->callback_list);
++
++ mptcp_mpcb_inherit_sockopts(meta_sk, master_sk);
++
++ mpcb->orig_sk_rcvbuf = meta_sk->sk_rcvbuf;
++ mpcb->orig_sk_sndbuf = meta_sk->sk_sndbuf;
++ mpcb->orig_window_clamp = meta_tp->window_clamp;
++
++ /* The meta is directly linked - set refcnt to 1 */
++ atomic_set(&mpcb->mpcb_refcnt, 1);
++
++ mptcp_init_path_manager(mpcb);
++ mptcp_init_scheduler(mpcb);
++
++ setup_timer(&mpcb->synack_timer, mptcp_synack_timer_handler,
++ (unsigned long)meta_sk);
++
++ mptcp_debug("%s: created mpcb with token %#x\n",
++ __func__, mpcb->mptcp_loc_token);
++
++ return 0;
++}
++
++void mptcp_fallback_meta_sk(struct sock *meta_sk)
++{
++ kfree(inet_csk(meta_sk)->icsk_accept_queue.listen_opt);
++ kmem_cache_free(mptcp_cb_cache, tcp_sk(meta_sk)->mpcb);
++}
++
++int mptcp_add_sock(struct sock *meta_sk, struct sock *sk, u8 loc_id, u8 rem_id,
++ gfp_t flags)
++{
++ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct tcp_sock *tp = tcp_sk(sk);
++
++ tp->mptcp = kmem_cache_zalloc(mptcp_sock_cache, flags);
++ if (!tp->mptcp)
++ return -ENOMEM;
++
++ tp->mptcp->path_index = mptcp_set_new_pathindex(mpcb);
++ /* No more space for more subflows? */
++ if (!tp->mptcp->path_index) {
++ kmem_cache_free(mptcp_sock_cache, tp->mptcp);
++ return -EPERM;
++ }
++
++ INIT_HLIST_NODE(&tp->mptcp->cb_list);
++
++ tp->mptcp->tp = tp;
++ tp->mpcb = mpcb;
++ tp->meta_sk = meta_sk;
++
++ static_key_slow_inc(&mptcp_static_key);
++ tp->mpc = 1;
++ tp->ops = &mptcp_sub_specific;
++
++ tp->mptcp->loc_id = loc_id;
++ tp->mptcp->rem_id = rem_id;
++ if (mpcb->sched_ops->init)
++ mpcb->sched_ops->init(sk);
++
++ /* The corresponding sock_put is in mptcp_sock_destruct(). It cannot be
++ * included in mptcp_del_sock(), because the mpcb must remain alive
++ * until the last subsocket is completely destroyed.
++ */
++ sock_hold(meta_sk);
++ atomic_inc(&mpcb->mpcb_refcnt);
++
++ tp->mptcp->next = mpcb->connection_list;
++ mpcb->connection_list = tp;
++ tp->mptcp->attached = 1;
++
++ mpcb->cnt_subflows++;
++ atomic_add(atomic_read(&((struct sock *)tp)->sk_rmem_alloc),
++ &meta_sk->sk_rmem_alloc);
++
++ mptcp_sub_inherit_sockopts(meta_sk, sk);
++ INIT_DELAYED_WORK(&tp->mptcp->work, mptcp_sub_close_wq);
++
++ /* As we successfully allocated the mptcp_tcp_sock, we have to
++ * change the function-pointers here (for sk_destruct to work correctly)
++ */
++ sk->sk_error_report = mptcp_sock_def_error_report;
++ sk->sk_data_ready = mptcp_data_ready;
++ sk->sk_write_space = mptcp_write_space;
++ sk->sk_state_change = mptcp_set_state;
++ sk->sk_destruct = mptcp_sock_destruct;
++
++ if (sk->sk_family == AF_INET)
++ mptcp_debug("%s: token %#x pi %d, src_addr:%pI4:%d dst_addr:%pI4:%d, cnt_subflows now %d\n",
++ __func__ , mpcb->mptcp_loc_token,
++ tp->mptcp->path_index,
++ &((struct inet_sock *)tp)->inet_saddr,
++ ntohs(((struct inet_sock *)tp)->inet_sport),
++ &((struct inet_sock *)tp)->inet_daddr,
++ ntohs(((struct inet_sock *)tp)->inet_dport),
++ mpcb->cnt_subflows);
++#if IS_ENABLED(CONFIG_IPV6)
++ else
++ mptcp_debug("%s: token %#x pi %d, src_addr:%pI6:%d dst_addr:%pI6:%d, cnt_subflows now %d\n",
++ __func__ , mpcb->mptcp_loc_token,
++ tp->mptcp->path_index, &inet6_sk(sk)->saddr,
++ ntohs(((struct inet_sock *)tp)->inet_sport),
++ &sk->sk_v6_daddr,
++ ntohs(((struct inet_sock *)tp)->inet_dport),
++ mpcb->cnt_subflows);
++#endif
++
++ return 0;
++}
++
++void mptcp_del_sock(struct sock *sk)
++{
++ struct tcp_sock *tp = tcp_sk(sk), *tp_prev;
++ struct mptcp_cb *mpcb;
++
++ if (!tp->mptcp || !tp->mptcp->attached)
++ return;
++
++ mpcb = tp->mpcb;
++ tp_prev = mpcb->connection_list;
++
++ mptcp_debug("%s: Removing subsock tok %#x pi:%d state %d is_meta? %d\n",
++ __func__, mpcb->mptcp_loc_token, tp->mptcp->path_index,
++ sk->sk_state, is_meta_sk(sk));
++
++ if (tp_prev == tp) {
++ mpcb->connection_list = tp->mptcp->next;
++ } else {
++ for (; tp_prev && tp_prev->mptcp->next; tp_prev = tp_prev->mptcp->next) {
++ if (tp_prev->mptcp->next == tp) {
++ tp_prev->mptcp->next = tp->mptcp->next;
++ break;
++ }
++ }
++ }
++ mpcb->cnt_subflows--;
++ if (tp->mptcp->establish_increased)
++ mpcb->cnt_established--;
++
++ tp->mptcp->next = NULL;
++ tp->mptcp->attached = 0;
++ mpcb->path_index_bits &= ~(1 << tp->mptcp->path_index);
++
++ if (!skb_queue_empty(&sk->sk_write_queue))
++ mptcp_reinject_data(sk, 0);
++
++ if (is_master_tp(tp))
++ mpcb->master_sk = NULL;
++ else if (tp->mptcp->pre_established)
++ sk_stop_timer(sk, &tp->mptcp->mptcp_ack_timer);
++
++ rcu_assign_pointer(inet_sk(sk)->inet_opt, NULL);
++}
++
++/* Updates the metasocket ULID/port data, based on the given sock.
++ * The argument sock must be the sock accessible to the application.
++ * In this function, we update the meta socket info, based on the changes
++ * in the application socket (bind, address allocation, ...)
++ */
++void mptcp_update_metasocket(struct sock *sk, const struct sock *meta_sk)
++{
++ if (tcp_sk(sk)->mpcb->pm_ops->new_session)
++ tcp_sk(sk)->mpcb->pm_ops->new_session(meta_sk);
++
++ tcp_sk(sk)->mptcp->send_mp_prio = tcp_sk(sk)->mptcp->low_prio;
++}
++
++/* Clean up the receive buffer for full frames taken by the user,
++ * then send an ACK if necessary. COPIED is the number of bytes
++ * tcp_recvmsg has given to the user so far, it speeds up the
++ * calculation of whether or not we must ACK for the sake of
++ * a window update.
++ */
++void mptcp_cleanup_rbuf(struct sock *meta_sk, int copied)
++{
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct sock *sk;
++ __u32 rcv_window_now = 0;
++
++ if (copied > 0 && !(meta_sk->sk_shutdown & RCV_SHUTDOWN)) {
++ rcv_window_now = tcp_receive_window(meta_tp);
++
++ if (2 * rcv_window_now > meta_tp->window_clamp)
++ rcv_window_now = 0;
++ }
++
++ mptcp_for_each_sk(meta_tp->mpcb, sk) {
++ struct tcp_sock *tp = tcp_sk(sk);
++ const struct inet_connection_sock *icsk = inet_csk(sk);
++
++ if (!mptcp_sk_can_send_ack(sk))
++ continue;
++
++ if (!inet_csk_ack_scheduled(sk))
++ goto second_part;
++ /* Delayed ACKs frequently hit locked sockets during bulk
++ * receive.
++ */
++ if (icsk->icsk_ack.blocked ||
++ /* Once-per-two-segments ACK was not sent by tcp_input.c */
++ tp->rcv_nxt - tp->rcv_wup > icsk->icsk_ack.rcv_mss ||
++ /* If this read emptied read buffer, we send ACK, if
++ * connection is not bidirectional, user drained
++ * receive buffer and there was a small segment
++ * in queue.
++ */
++ (copied > 0 &&
++ ((icsk->icsk_ack.pending & ICSK_ACK_PUSHED2) ||
++ ((icsk->icsk_ack.pending & ICSK_ACK_PUSHED) &&
++ !icsk->icsk_ack.pingpong)) &&
++ !atomic_read(&meta_sk->sk_rmem_alloc))) {
++ tcp_send_ack(sk);
++ continue;
++ }
++
++second_part:
++ /* This here is the second part of tcp_cleanup_rbuf */
++ if (rcv_window_now) {
++ __u32 new_window = tp->ops->__select_window(sk);
++
++ /* Send ACK now, if this read freed lots of space
++ * in our buffer. Certainly, new_window is new window.
++ * We can advertise it now, if it is not less than
++ * current one.
++ * "Lots" means "at least twice" here.
++ */
++ if (new_window && new_window >= 2 * rcv_window_now)
++ tcp_send_ack(sk);
++ }
++ }
++}
++
++static int mptcp_sub_send_fin(struct sock *sk)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct sk_buff *skb = tcp_write_queue_tail(sk);
++ int mss_now;
++
++ /* Optimization, tack on the FIN if we have a queue of
++ * unsent frames. But be careful about outgoing SACKS
++ * and IP options.
++ */
++ mss_now = tcp_current_mss(sk);
++
++ if (tcp_send_head(sk) != NULL) {
++ TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_FIN;
++ TCP_SKB_CB(skb)->end_seq++;
++ tp->write_seq++;
++ } else {
++ skb = alloc_skb_fclone(MAX_TCP_HEADER, GFP_ATOMIC);
++ if (!skb)
++ return 1;
++
++ /* Reserve space for headers and prepare control bits. */
++ skb_reserve(skb, MAX_TCP_HEADER);
++ /* FIN eats a sequence byte, write_seq advanced by tcp_queue_skb(). */
++ tcp_init_nondata_skb(skb, tp->write_seq,
++ TCPHDR_ACK | TCPHDR_FIN);
++ tcp_queue_skb(sk, skb);
++ }
++ __tcp_push_pending_frames(sk, mss_now, TCP_NAGLE_OFF);
++
++ return 0;
++}
++
++void mptcp_sub_close_wq(struct work_struct *work)
++{
++ struct tcp_sock *tp = container_of(work, struct mptcp_tcp_sock, work.work)->tp;
++ struct sock *sk = (struct sock *)tp;
++ struct sock *meta_sk = mptcp_meta_sk(sk);
++
++ mutex_lock(&tp->mpcb->mpcb_mutex);
++ lock_sock_nested(meta_sk, SINGLE_DEPTH_NESTING);
++
++ if (sock_flag(sk, SOCK_DEAD))
++ goto exit;
++
++ /* We come from tcp_disconnect. We are sure that meta_sk is set */
++ if (!mptcp(tp)) {
++ tp->closing = 1;
++ sock_rps_reset_flow(sk);
++ tcp_close(sk, 0);
++ goto exit;
++ }
++
++ if (meta_sk->sk_shutdown == SHUTDOWN_MASK || sk->sk_state == TCP_CLOSE) {
++ tp->closing = 1;
++ sock_rps_reset_flow(sk);
++ tcp_close(sk, 0);
++ } else if (tcp_close_state(sk)) {
++ sk->sk_shutdown |= SEND_SHUTDOWN;
++ tcp_send_fin(sk);
++ }
++
++exit:
++ release_sock(meta_sk);
++ mutex_unlock(&tp->mpcb->mpcb_mutex);
++ sock_put(sk);
++}
++
++void mptcp_sub_close(struct sock *sk, unsigned long delay)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct delayed_work *work = &tcp_sk(sk)->mptcp->work;
++
++ /* We are already closing - e.g., call from sock_def_error_report upon
++ * tcp_disconnect in tcp_close.
++ */
++ if (tp->closing)
++ return;
++
++ /* Work already scheduled? */
++ if (work_pending(&work->work)) {
++ /* Work present - who will be first? */
++ if (jiffies + delay > work->timer.expires)
++ return;
++
++ /* Try canceling - if it fails, work will be executed soon */
++ if (!cancel_delayed_work(work))
++ return;
++ sock_put(sk);
++ }
++
++ if (!delay) {
++ unsigned char old_state = sk->sk_state;
++
++ /* If we are in user-context we can directly do the closing
++ * procedure. No need to schedule a work-queue.
++ */
++ if (!in_softirq()) {
++ if (sock_flag(sk, SOCK_DEAD))
++ return;
++
++ if (!mptcp(tp)) {
++ tp->closing = 1;
++ sock_rps_reset_flow(sk);
++ tcp_close(sk, 0);
++ return;
++ }
++
++ if (mptcp_meta_sk(sk)->sk_shutdown == SHUTDOWN_MASK ||
++ sk->sk_state == TCP_CLOSE) {
++ tp->closing = 1;
++ sock_rps_reset_flow(sk);
++ tcp_close(sk, 0);
++ } else if (tcp_close_state(sk)) {
++ sk->sk_shutdown |= SEND_SHUTDOWN;
++ tcp_send_fin(sk);
++ }
++
++ return;
++ }
++
++ /* We send the FIN directly, because it may take a long time
++ * until the work-queue gets scheduled...
++ *
++ * If mptcp_sub_send_fin returns 1, it failed and thus we reset
++ * the old state so that tcp_close will finally send the fin
++ * in user-context.
++ */
++ if (!sk->sk_err && old_state != TCP_CLOSE &&
++ tcp_close_state(sk) && mptcp_sub_send_fin(sk)) {
++ if (old_state == TCP_ESTABLISHED)
++ TCP_INC_STATS(sock_net(sk), TCP_MIB_CURRESTAB);
++ sk->sk_state = old_state;
++ }
++ }
++
++ sock_hold(sk);
++ queue_delayed_work(mptcp_wq, work, delay);
++}
++
++void mptcp_sub_force_close(struct sock *sk)
++{
++ /* The below tcp_done may have freed the socket, if it is already dead.
++ * Thus, we are not allowed to access it afterwards. That's why
++ * we have to store the dead-state in this local variable.
++ */
++ int sock_is_dead = sock_flag(sk, SOCK_DEAD);
++
++ tcp_sk(sk)->mp_killed = 1;
++
++ if (sk->sk_state != TCP_CLOSE)
++ tcp_done(sk);
++
++ if (!sock_is_dead)
++ mptcp_sub_close(sk, 0);
++}
++EXPORT_SYMBOL(mptcp_sub_force_close);
++
++/* Update the mpcb send window, based on the contributions
++ * of each subflow
++ */
++void mptcp_update_sndbuf(const struct tcp_sock *tp)
++{
++ struct sock *meta_sk = tp->meta_sk, *sk;
++ int new_sndbuf = 0, old_sndbuf = meta_sk->sk_sndbuf;
++
++ mptcp_for_each_sk(tp->mpcb, sk) {
++ if (!mptcp_sk_can_send(sk))
++ continue;
++
++ new_sndbuf += sk->sk_sndbuf;
++
++ if (new_sndbuf > sysctl_tcp_wmem[2] || new_sndbuf < 0) {
++ new_sndbuf = sysctl_tcp_wmem[2];
++ break;
++ }
++ }
++ meta_sk->sk_sndbuf = max(min(new_sndbuf, sysctl_tcp_wmem[2]), meta_sk->sk_sndbuf);
++
++ /* The subflow's call to sk_write_space in tcp_new_space ends up in
++ * mptcp_write_space.
++ * It has nothing to do with waking up the application.
++ * So, we do it here.
++ */
++ if (old_sndbuf != meta_sk->sk_sndbuf)
++ meta_sk->sk_write_space(meta_sk);
++}
++
++void mptcp_close(struct sock *meta_sk, long timeout)
++{
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct sock *sk_it, *tmpsk;
++ struct mptcp_cb *mpcb = meta_tp->mpcb;
++ struct sk_buff *skb;
++ int data_was_unread = 0;
++ int state;
++
++ mptcp_debug("%s: Close of meta_sk with tok %#x\n",
++ __func__, mpcb->mptcp_loc_token);
++
++ mutex_lock(&mpcb->mpcb_mutex);
++ lock_sock(meta_sk);
++
++ if (meta_tp->inside_tk_table) {
++ /* Detach the mpcb from the token hashtable */
++ mptcp_hash_remove_bh(meta_tp);
++ reqsk_queue_destroy(&inet_csk(meta_sk)->icsk_accept_queue);
++ }
++
++ meta_sk->sk_shutdown = SHUTDOWN_MASK;
++ /* We need to flush the recv. buffs. We do this only on the
++ * descriptor close, not protocol-sourced closes, because the
++ * reader process may not have drained the data yet!
++ */
++ while ((skb = __skb_dequeue(&meta_sk->sk_receive_queue)) != NULL) {
++ u32 len = TCP_SKB_CB(skb)->end_seq - TCP_SKB_CB(skb)->seq -
++ tcp_hdr(skb)->fin;
++ data_was_unread += len;
++ __kfree_skb(skb);
++ }
++
++ sk_mem_reclaim(meta_sk);
++
++ /* If socket has been already reset (e.g. in tcp_reset()) - kill it. */
++ if (meta_sk->sk_state == TCP_CLOSE) {
++ mptcp_for_each_sk_safe(mpcb, sk_it, tmpsk) {
++ if (tcp_sk(sk_it)->send_mp_fclose)
++ continue;
++ mptcp_sub_close(sk_it, 0);
++ }
++ goto adjudge_to_death;
++ }
++
++ if (data_was_unread) {
++ /* Unread data was tossed, zap the connection. */
++ NET_INC_STATS_USER(sock_net(meta_sk), LINUX_MIB_TCPABORTONCLOSE);
++ tcp_set_state(meta_sk, TCP_CLOSE);
++ tcp_sk(meta_sk)->ops->send_active_reset(meta_sk,
++ meta_sk->sk_allocation);
++ } else if (sock_flag(meta_sk, SOCK_LINGER) && !meta_sk->sk_lingertime) {
++ /* Check zero linger _after_ checking for unread data. */
++ meta_sk->sk_prot->disconnect(meta_sk, 0);
++ NET_INC_STATS_USER(sock_net(meta_sk), LINUX_MIB_TCPABORTONDATA);
++ } else if (tcp_close_state(meta_sk)) {
++ mptcp_send_fin(meta_sk);
++ } else if (meta_tp->snd_una == meta_tp->write_seq) {
++ /* The DATA_FIN has been sent and acknowledged
++ * (e.g., by sk_shutdown). Close all the other subflows
++ */
++ mptcp_for_each_sk_safe(mpcb, sk_it, tmpsk) {
++ unsigned long delay = 0;
++ /* If we are the passive closer, don't trigger
++ * subflow-fin until the subflow has been finned
++ * by the peer - thus we add a delay.
++ */
++ if (mpcb->passive_close &&
++ sk_it->sk_state == TCP_ESTABLISHED)
++ delay = inet_csk(sk_it)->icsk_rto << 3;
++
++ mptcp_sub_close(sk_it, delay);
++ }
++ }
++
++ sk_stream_wait_close(meta_sk, timeout);
++
++adjudge_to_death:
++ state = meta_sk->sk_state;
++ sock_hold(meta_sk);
++ sock_orphan(meta_sk);
++
++ /* socket will be freed after mptcp_close - we have to prevent
++ * access from the subflows.
++ */
++ mptcp_for_each_sk(mpcb, sk_it) {
++ /* Similar to sock_orphan, but we don't set it DEAD, because
++ * the callbacks are still set and must be called.
++ */
++ write_lock_bh(&sk_it->sk_callback_lock);
++ sk_set_socket(sk_it, NULL);
++ sk_it->sk_wq = NULL;
++ write_unlock_bh(&sk_it->sk_callback_lock);
++ }
++
++ /* It is the last release_sock in its life. It will remove backlog. */
++ release_sock(meta_sk);
++
++ /* Now socket is owned by kernel and we acquire BH lock
++ * to finish close. No need to check for user refs.
++ */
++ local_bh_disable();
++ bh_lock_sock(meta_sk);
++ WARN_ON(sock_owned_by_user(meta_sk));
++
++ percpu_counter_inc(meta_sk->sk_prot->orphan_count);
++
++ /* Have we already been destroyed by a softirq or backlog? */
++ if (state != TCP_CLOSE && meta_sk->sk_state == TCP_CLOSE)
++ goto out;
++
++ /* This is a (useful) BSD violating of the RFC. There is a
++ * problem with TCP as specified in that the other end could
++ * keep a socket open forever with no application left this end.
++ * We use a 3 minute timeout (about the same as BSD) then kill
++ * our end. If they send after that then tough - BUT: long enough
++ * that we won't make the old 4*rto = almost no time - whoops
++ * reset mistake.
++ *
++ * Nope, it was not mistake. It is really desired behaviour
++ * f.e. on http servers, when such sockets are useless, but
++ * consume significant resources. Let's do it with special
++ * linger2 option. --ANK
++ */
++
++ if (meta_sk->sk_state == TCP_FIN_WAIT2) {
++ if (meta_tp->linger2 < 0) {
++ tcp_set_state(meta_sk, TCP_CLOSE);
++ meta_tp->ops->send_active_reset(meta_sk, GFP_ATOMIC);
++ NET_INC_STATS_BH(sock_net(meta_sk),
++ LINUX_MIB_TCPABORTONLINGER);
++ } else {
++ const int tmo = tcp_fin_time(meta_sk);
++
++ if (tmo > TCP_TIMEWAIT_LEN) {
++ inet_csk_reset_keepalive_timer(meta_sk,
++ tmo - TCP_TIMEWAIT_LEN);
++ } else {
++ meta_tp->ops->time_wait(meta_sk, TCP_FIN_WAIT2,
++ tmo);
++ goto out;
++ }
++ }
++ }
++ if (meta_sk->sk_state != TCP_CLOSE) {
++ sk_mem_reclaim(meta_sk);
++ if (tcp_too_many_orphans(meta_sk, 0)) {
++ if (net_ratelimit())
++ pr_info("MPTCP: too many orphaned sockets\n");
++ tcp_set_state(meta_sk, TCP_CLOSE);
++ meta_tp->ops->send_active_reset(meta_sk, GFP_ATOMIC);
++ NET_INC_STATS_BH(sock_net(meta_sk),
++ LINUX_MIB_TCPABORTONMEMORY);
++ }
++ }
++
++
++ if (meta_sk->sk_state == TCP_CLOSE)
++ inet_csk_destroy_sock(meta_sk);
++ /* Otherwise, socket is reprieved until protocol close. */
++
++out:
++ bh_unlock_sock(meta_sk);
++ local_bh_enable();
++ mutex_unlock(&mpcb->mpcb_mutex);
++ sock_put(meta_sk); /* Taken by sock_hold */
++}
++
++void mptcp_disconnect(struct sock *sk)
++{
++ struct sock *subsk, *tmpsk;
++ struct tcp_sock *tp = tcp_sk(sk);
++
++ mptcp_delete_synack_timer(sk);
++
++ __skb_queue_purge(&tp->mpcb->reinject_queue);
++
++ if (tp->inside_tk_table) {
++ mptcp_hash_remove_bh(tp);
++ reqsk_queue_destroy(&inet_csk(tp->meta_sk)->icsk_accept_queue);
++ }
++
++ local_bh_disable();
++ mptcp_for_each_sk_safe(tp->mpcb, subsk, tmpsk) {
++ /* The socket will get removed from the subsocket-list
++ * and made non-mptcp by setting mpc to 0.
++ *
++ * This is necessary, because tcp_disconnect assumes
++ * that the connection is completely dead afterwards.
++ * Thus we need to do a mptcp_del_sock. Due to this call
++ * we have to make it non-mptcp.
++ *
++ * We have to lock the socket, because we set mpc to 0.
++ * An incoming packet would take the subsocket's lock
++ * and go on into the receive-path.
++ * This would be a race.
++ */
++
++ bh_lock_sock(subsk);
++ mptcp_del_sock(subsk);
++ tcp_sk(subsk)->mpc = 0;
++ tcp_sk(subsk)->ops = &tcp_specific;
++ mptcp_sub_force_close(subsk);
++ bh_unlock_sock(subsk);
++ }
++ local_bh_enable();
++
++ tp->was_meta_sk = 1;
++ tp->mpc = 0;
++ tp->ops = &tcp_specific;
++}
++
++
++/* Returns 1 if we should enable MPTCP for that socket. */
++int mptcp_doit(struct sock *sk)
++{
++ /* Do not allow MPTCP enabling if the MPTCP initialization failed */
++ if (mptcp_init_failed)
++ return 0;
++
++ if (sysctl_mptcp_enabled == MPTCP_APP && !tcp_sk(sk)->mptcp_enabled)
++ return 0;
++
++ /* Socket may already be established (e.g., called from tcp_recvmsg) */
++ if (mptcp(tcp_sk(sk)) || tcp_sk(sk)->request_mptcp)
++ return 1;
++
++ /* Don't do mptcp over loopback */
++ if (sk->sk_family == AF_INET &&
++ (ipv4_is_loopback(inet_sk(sk)->inet_daddr) ||
++ ipv4_is_loopback(inet_sk(sk)->inet_saddr)))
++ return 0;
++#if IS_ENABLED(CONFIG_IPV6)
++ if (sk->sk_family == AF_INET6 &&
++ (ipv6_addr_loopback(&sk->sk_v6_daddr) ||
++ ipv6_addr_loopback(&inet6_sk(sk)->saddr)))
++ return 0;
++#endif
++ if (mptcp_v6_is_v4_mapped(sk) &&
++ ipv4_is_loopback(inet_sk(sk)->inet_saddr))
++ return 0;
++
++#ifdef CONFIG_TCP_MD5SIG
++ /* If TCP_MD5SIG is enabled, do not do MPTCP - there is no Option-Space */
++ if (tcp_sk(sk)->af_specific->md5_lookup(sk, sk))
++ return 0;
++#endif
++
++ return 1;
++}
++
++int mptcp_create_master_sk(struct sock *meta_sk, __u64 remote_key, u32 window)
++{
++ struct tcp_sock *master_tp;
++ struct sock *master_sk;
++
++ if (mptcp_alloc_mpcb(meta_sk, remote_key, window))
++ goto err_alloc_mpcb;
++
++ master_sk = tcp_sk(meta_sk)->mpcb->master_sk;
++ master_tp = tcp_sk(master_sk);
++
++ if (mptcp_add_sock(meta_sk, master_sk, 0, 0, GFP_ATOMIC))
++ goto err_add_sock;
++
++ if (__inet_inherit_port(meta_sk, master_sk) < 0)
++ goto err_add_sock;
++
++ meta_sk->sk_prot->unhash(meta_sk);
++
++ if (master_sk->sk_family == AF_INET || mptcp_v6_is_v4_mapped(master_sk))
++ __inet_hash_nolisten(master_sk, NULL);
++#if IS_ENABLED(CONFIG_IPV6)
++ else
++ __inet6_hash(master_sk, NULL);
++#endif
++
++ master_tp->mptcp->init_rcv_wnd = master_tp->rcv_wnd;
++
++ return 0;
++
++err_add_sock:
++ mptcp_fallback_meta_sk(meta_sk);
++
++ inet_csk_prepare_forced_close(master_sk);
++ tcp_done(master_sk);
++ inet_csk_prepare_forced_close(meta_sk);
++ tcp_done(meta_sk);
++
++err_alloc_mpcb:
++ return -ENOBUFS;
++}
++
++static int __mptcp_check_req_master(struct sock *child,
++ struct request_sock *req)
++{
++ struct tcp_sock *child_tp = tcp_sk(child);
++ struct sock *meta_sk = child;
++ struct mptcp_cb *mpcb;
++ struct mptcp_request_sock *mtreq;
++
++ /* Never contained an MP_CAPABLE */
++ if (!inet_rsk(req)->mptcp_rqsk)
++ return 1;
++
++ if (!inet_rsk(req)->saw_mpc) {
++ /* Fallback to regular TCP, because we saw one SYN without
++ * MP_CAPABLE. In tcp_check_req we continue the regular path.
++ * But, the socket has been added to the reqsk_tk_htb, so we
++ * must still remove it.
++ */
++ mptcp_reqsk_remove_tk(req);
++ return 1;
++ }
++
++ /* Just set these values to pass them to mptcp_alloc_mpcb */
++ mtreq = mptcp_rsk(req);
++ child_tp->mptcp_loc_key = mtreq->mptcp_loc_key;
++ child_tp->mptcp_loc_token = mtreq->mptcp_loc_token;
++
++ if (mptcp_create_master_sk(meta_sk, mtreq->mptcp_rem_key,
++ child_tp->snd_wnd))
++ return -ENOBUFS;
++
++ child = tcp_sk(child)->mpcb->master_sk;
++ child_tp = tcp_sk(child);
++ mpcb = child_tp->mpcb;
++
++ child_tp->mptcp->snt_isn = tcp_rsk(req)->snt_isn;
++ child_tp->mptcp->rcv_isn = tcp_rsk(req)->rcv_isn;
++
++ mpcb->dss_csum = mtreq->dss_csum;
++ mpcb->server_side = 1;
++
++ /* Will be moved to ESTABLISHED by tcp_rcv_state_process() */
++ mptcp_update_metasocket(child, meta_sk);
++
++ /* Needs to be done here additionally, because when accepting a
++ * new connection we pass by __reqsk_free and not reqsk_free.
++ */
++ mptcp_reqsk_remove_tk(req);
++
++ /* Hold when creating the meta-sk in tcp_vX_syn_recv_sock. */
++ sock_put(meta_sk);
++
++ return 0;
++}
++
++int mptcp_check_req_fastopen(struct sock *child, struct request_sock *req)
++{
++ struct sock *meta_sk = child, *master_sk;
++ struct sk_buff *skb;
++ u32 new_mapping;
++ int ret;
++
++ ret = __mptcp_check_req_master(child, req);
++ if (ret)
++ return ret;
++
++ master_sk = tcp_sk(meta_sk)->mpcb->master_sk;
++
++ /* We need to rewind copied_seq as it is set to IDSN + 1 and as we have
++ * pre-MPTCP data in the receive queue.
++ */
++ tcp_sk(meta_sk)->copied_seq -= tcp_sk(master_sk)->rcv_nxt -
++ tcp_rsk(req)->rcv_isn - 1;
++
++ /* Map subflow sequence number to data sequence numbers. We need to map
++ * these data to [IDSN - len - 1, IDSN[.
++ */
++ new_mapping = tcp_sk(meta_sk)->copied_seq - tcp_rsk(req)->rcv_isn - 1;
++
++ /* There should be only one skb: the SYN + data. */
++ skb_queue_walk(&meta_sk->sk_receive_queue, skb) {
++ TCP_SKB_CB(skb)->seq += new_mapping;
++ TCP_SKB_CB(skb)->end_seq += new_mapping;
++ }
++
++ /* With fastopen we change the semantics of the relative subflow
++ * sequence numbers to deal with middleboxes that could add/remove
++ * multiple bytes in the SYN. We chose to start counting at rcv_nxt - 1
++ * instead of the regular TCP ISN.
++ */
++ tcp_sk(master_sk)->mptcp->rcv_isn = tcp_sk(master_sk)->rcv_nxt - 1;
++
++ /* We need to update copied_seq of the master_sk to account for the
++ * already moved data to the meta receive queue.
++ */
++ tcp_sk(master_sk)->copied_seq = tcp_sk(master_sk)->rcv_nxt;
++
++ /* Handled by the master_sk */
++ tcp_sk(meta_sk)->fastopen_rsk = NULL;
++
++ return 0;
++}
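The copied_seq rewind and new_mapping arithmetic above can be restated as a small userspace sketch (hypothetical helper name; plain `uint32_t` wraparound stands in for TCP sequence arithmetic):

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the fastopen remapping in mptcp_check_req_fastopen() above:
 * the SYN+data skb carries subflow sequence numbers relative to rcv_isn,
 * while the meta-socket expects data sequence numbers ending right at
 * copied_seq. All arithmetic wraps mod 2^32, like TCP sequence numbers.
 */
static uint32_t remap_seq(uint32_t skb_seq, uint32_t copied_seq,
                          uint32_t rcv_isn)
{
	uint32_t new_mapping = copied_seq - rcv_isn - 1;

	return skb_seq + new_mapping;
}
```

The first data byte after the SYN (subflow sequence rcv_isn + 1) thus maps exactly onto copied_seq, so the pre-MPTCP data lines up with [IDSN - len - 1, IDSN[.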
++
++int mptcp_check_req_master(struct sock *sk, struct sock *child,
++ struct request_sock *req,
++ struct request_sock **prev)
++{
++ struct sock *meta_sk = child;
++ int ret;
++
++ ret = __mptcp_check_req_master(child, req);
++ if (ret)
++ return ret;
++
++ inet_csk_reqsk_queue_unlink(sk, req, prev);
++ inet_csk_reqsk_queue_removed(sk, req);
++ inet_csk_reqsk_queue_add(sk, req, meta_sk);
++
++ return 0;
++}
++
++struct sock *mptcp_check_req_child(struct sock *meta_sk, struct sock *child,
++ struct request_sock *req,
++ struct request_sock **prev,
++ const struct mptcp_options_received *mopt)
++{
++ struct tcp_sock *child_tp = tcp_sk(child);
++ struct mptcp_request_sock *mtreq = mptcp_rsk(req);
++ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ u8 hash_mac_check[20];
++
++ child_tp->inside_tk_table = 0;
++
++ if (!mopt->join_ack)
++ goto teardown;
++
++ mptcp_hmac_sha1((u8 *)&mpcb->mptcp_rem_key,
++ (u8 *)&mpcb->mptcp_loc_key,
++ (u8 *)&mtreq->mptcp_rem_nonce,
++ (u8 *)&mtreq->mptcp_loc_nonce,
++ (u32 *)hash_mac_check);
++
++ if (memcmp(hash_mac_check, (char *)&mopt->mptcp_recv_mac, 20))
++ goto teardown;
++
++ /* Point it to the same struct socket and wq as the meta_sk */
++ sk_set_socket(child, meta_sk->sk_socket);
++ child->sk_wq = meta_sk->sk_wq;
++
++ if (mptcp_add_sock(meta_sk, child, mtreq->loc_id, mtreq->rem_id, GFP_ATOMIC)) {
++ /* Has been inherited, but now child_tp->mptcp is NULL */
++ child_tp->mpc = 0;
++ child_tp->ops = &tcp_specific;
++
++ /* TODO when we support acking the third ack for new subflows,
++ * we should silently discard this third ack, by returning NULL.
++ *
++ * Maybe, at the retransmission we will have enough memory to
++ * fully add the socket to the meta-sk.
++ */
++ goto teardown;
++ }
++
++ /* The child is a clone of the meta socket, we must now reset
++ * some of the fields
++ */
++ child_tp->mptcp->rcv_low_prio = mtreq->rcv_low_prio;
++
++ /* We should allow proper increase of the snd/rcv-buffers. Thus, we
++ * use the original values instead of the bloated up ones from the
++ * clone.
++ */
++ child->sk_sndbuf = mpcb->orig_sk_sndbuf;
++ child->sk_rcvbuf = mpcb->orig_sk_rcvbuf;
++
++ child_tp->mptcp->slave_sk = 1;
++ child_tp->mptcp->snt_isn = tcp_rsk(req)->snt_isn;
++ child_tp->mptcp->rcv_isn = tcp_rsk(req)->rcv_isn;
++ child_tp->mptcp->init_rcv_wnd = req->rcv_wnd;
++
++ child_tp->tsq_flags = 0;
++
++ /* Subflows do not use the accept queue, as they
++ * are attached immediately to the mpcb.
++ */
++ inet_csk_reqsk_queue_unlink(meta_sk, req, prev);
++ reqsk_queue_removed(&inet_csk(meta_sk)->icsk_accept_queue, req);
++ reqsk_free(req);
++ return child;
++
++teardown:
++ /* Drop this request - sock creation failed. */
++ inet_csk_reqsk_queue_unlink(meta_sk, req, prev);
++ reqsk_queue_removed(&inet_csk(meta_sk)->icsk_accept_queue, req);
++ reqsk_free(req);
++ inet_csk_prepare_forced_close(child);
++ tcp_done(child);
++ return meta_sk;
++}
++
++int mptcp_init_tw_sock(struct sock *sk, struct tcp_timewait_sock *tw)
++{
++ struct mptcp_tw *mptw;
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct mptcp_cb *mpcb = tp->mpcb;
++
++ /* A subsocket in tw can only receive data. So, if we are in
++ * infinite-receive, then we should not reply with a data-ack or act
++ * upon general MPTCP-signaling. We prevent this by simply not creating
++ * the mptcp_tw_sock.
++ */
++ if (mpcb->infinite_mapping_rcv) {
++ tw->mptcp_tw = NULL;
++ return 0;
++ }
++
++ /* Alloc MPTCP-tw-sock */
++ mptw = kmem_cache_alloc(mptcp_tw_cache, GFP_ATOMIC);
++ if (!mptw)
++ return -ENOBUFS;
++
++ atomic_inc(&mpcb->mpcb_refcnt);
++
++ tw->mptcp_tw = mptw;
++ mptw->loc_key = mpcb->mptcp_loc_key;
++ mptw->meta_tw = mpcb->in_time_wait;
++ if (mptw->meta_tw) {
++ mptw->rcv_nxt = mptcp_get_rcv_nxt_64(mptcp_meta_tp(tp));
++ if (mpcb->mptw_state != TCP_TIME_WAIT)
++ mptw->rcv_nxt++;
++ }
++ rcu_assign_pointer(mptw->mpcb, mpcb);
++
++ spin_lock(&mpcb->tw_lock);
++ list_add_rcu(&mptw->list, &tp->mpcb->tw_list);
++ mptw->in_list = 1;
++ spin_unlock(&mpcb->tw_lock);
++
++ return 0;
++}
++
++void mptcp_twsk_destructor(struct tcp_timewait_sock *tw)
++{
++ struct mptcp_cb *mpcb;
++
++ rcu_read_lock();
++ mpcb = rcu_dereference(tw->mptcp_tw->mpcb);
++
++ /* If we are still holding a ref to the mpcb, we have to remove ourselves
++ * from the list and drop the ref properly.
++ */
++ if (mpcb && atomic_inc_not_zero(&mpcb->mpcb_refcnt)) {
++ spin_lock(&mpcb->tw_lock);
++ if (tw->mptcp_tw->in_list) {
++ list_del_rcu(&tw->mptcp_tw->list);
++ tw->mptcp_tw->in_list = 0;
++ }
++ spin_unlock(&mpcb->tw_lock);
++
++ /* Twice, because we increased it above */
++ mptcp_mpcb_put(mpcb);
++ mptcp_mpcb_put(mpcb);
++ }
++
++ rcu_read_unlock();
++
++ kmem_cache_free(mptcp_tw_cache, tw->mptcp_tw);
++}
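The `atomic_inc_not_zero()` idiom used by mptcp_twsk_destructor() above can be sketched in userspace with C11 atomics (the kernel primitive works the same way conceptually; the demo helpers are illustrative only):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Take a new reference only while the refcount is still non-zero, so an
 * object whose final put already ran can never be resurrected.
 */
static bool get_not_zero(atomic_int *refcnt)
{
	int old = atomic_load(refcnt);

	while (old != 0) {
		if (atomic_compare_exchange_weak(refcnt, &old, old + 1))
			return true;	/* reference taken */
		/* the failed CAS reloaded old; loop and retry */
	}
	return false;			/* object is already dying */
}

/* Tiny demos: a live object yields a new ref, a dead one is refused. */
static int demo_live(void)
{
	atomic_int r = 2;

	return get_not_zero(&r) ? atomic_load(&r) : -1;
}

static int demo_dead(void)
{
	atomic_int r = 0;

	return get_not_zero(&r) ? atomic_load(&r) : -1;
}
```

This is why the destructor then does two `mptcp_mpcb_put()` calls: one drops the reference it just took, the other drops the time-wait sock's own reference.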
++
++/* Updates the rcv_nxt of the time-wait-socks and allows them to ack a
++ * data-fin.
++ */
++void mptcp_time_wait(struct sock *sk, int state, int timeo)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct mptcp_tw *mptw;
++
++ /* Used for sockets that go into tw after the meta
++ * (see mptcp_init_tw_sock())
++ */
++ tp->mpcb->in_time_wait = 1;
++ tp->mpcb->mptw_state = state;
++
++ /* Update the time-wait-sock's information */
++ rcu_read_lock_bh();
++ list_for_each_entry_rcu(mptw, &tp->mpcb->tw_list, list) {
++ mptw->meta_tw = 1;
++ mptw->rcv_nxt = mptcp_get_rcv_nxt_64(tp);
++
++ /* We want to ack a DATA_FIN, but are still in FIN_WAIT_2 -
++ * pretend as if the DATA_FIN has already reached us, so that
++ * the checks in tcp_timewait_state_process will pass when the
++ * DATA_FIN comes in.
++ */
++ if (state != TCP_TIME_WAIT)
++ mptw->rcv_nxt++;
++ }
++ rcu_read_unlock_bh();
++
++ tcp_done(sk);
++}
++
++void mptcp_tsq_flags(struct sock *sk)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct sock *meta_sk = mptcp_meta_sk(sk);
++
++ /* It will be handled as a regular deferred-call */
++ if (is_meta_sk(sk))
++ return;
++
++ if (hlist_unhashed(&tp->mptcp->cb_list)) {
++ hlist_add_head(&tp->mptcp->cb_list, &tp->mpcb->callback_list);
++ /* We need to hold it here, as the sock_hold is not assured
++ * by the release_sock as it is done in regular TCP.
++ *
++ * The subsocket may get inet_csk_destroy'd while it is inside
++ * the callback_list.
++ */
++ sock_hold(sk);
++ }
++
++ if (!test_and_set_bit(MPTCP_SUB_DEFERRED, &tcp_sk(meta_sk)->tsq_flags))
++ sock_hold(meta_sk);
++}
++
++void mptcp_tsq_sub_deferred(struct sock *meta_sk)
++{
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct mptcp_tcp_sock *mptcp;
++ struct hlist_node *tmp;
++
++ BUG_ON(!is_meta_sk(meta_sk) && !meta_tp->was_meta_sk);
++
++ __sock_put(meta_sk);
++ hlist_for_each_entry_safe(mptcp, tmp, &meta_tp->mpcb->callback_list, cb_list) {
++ struct tcp_sock *tp = mptcp->tp;
++ struct sock *sk = (struct sock *)tp;
++
++ hlist_del_init(&mptcp->cb_list);
++ sk->sk_prot->release_cb(sk);
++ /* Final sock_put (cfr. mptcp_tsq_flags()) */
++ sock_put(sk);
++ }
++}
++
++void mptcp_join_reqsk_init(struct mptcp_cb *mpcb, const struct request_sock *req,
++ struct sk_buff *skb)
++{
++ struct mptcp_request_sock *mtreq = mptcp_rsk(req);
++ struct mptcp_options_received mopt;
++ u8 mptcp_hash_mac[20];
++
++ mptcp_init_mp_opt(&mopt);
++ tcp_parse_mptcp_options(skb, &mopt);
++
++ mtreq = mptcp_rsk(req);
++ mtreq->mptcp_mpcb = mpcb;
++ mtreq->is_sub = 1;
++ inet_rsk(req)->mptcp_rqsk = 1;
++
++ mtreq->mptcp_rem_nonce = mopt.mptcp_recv_nonce;
++
++ mptcp_hmac_sha1((u8 *)&mpcb->mptcp_loc_key,
++ (u8 *)&mpcb->mptcp_rem_key,
++ (u8 *)&mtreq->mptcp_loc_nonce,
++ (u8 *)&mtreq->mptcp_rem_nonce, (u32 *)mptcp_hash_mac);
++ mtreq->mptcp_hash_tmac = *(u64 *)mptcp_hash_mac;
++
++ mtreq->rem_id = mopt.rem_id;
++ mtreq->rcv_low_prio = mopt.low_prio;
++ inet_rsk(req)->saw_mpc = 1;
++}
++
++void mptcp_reqsk_init(struct request_sock *req, const struct sk_buff *skb)
++{
++ struct mptcp_options_received mopt;
++ struct mptcp_request_sock *mreq = mptcp_rsk(req);
++
++ mptcp_init_mp_opt(&mopt);
++ tcp_parse_mptcp_options(skb, &mopt);
++
++ mreq->is_sub = 0;
++ inet_rsk(req)->mptcp_rqsk = 1;
++ mreq->dss_csum = mopt.dss_csum;
++ mreq->hash_entry.pprev = NULL;
++
++ mptcp_reqsk_new_mptcp(req, &mopt, skb);
++}
++
++int mptcp_conn_request(struct sock *sk, struct sk_buff *skb)
++{
++ struct mptcp_options_received mopt;
++ const struct tcp_sock *tp = tcp_sk(sk);
++ __u32 isn = TCP_SKB_CB(skb)->when;
++ bool want_cookie = false;
++
++ if ((sysctl_tcp_syncookies == 2 ||
++ inet_csk_reqsk_queue_is_full(sk)) && !isn) {
++ want_cookie = tcp_syn_flood_action(sk, skb,
++ mptcp_request_sock_ops.slab_name);
++ if (!want_cookie)
++ goto drop;
++ }
++
++ mptcp_init_mp_opt(&mopt);
++ tcp_parse_mptcp_options(skb, &mopt);
++
++ if (mopt.is_mp_join)
++ return mptcp_do_join_short(skb, &mopt, sock_net(sk));
++ if (mopt.drop_me)
++ goto drop;
++
++ if (sysctl_mptcp_enabled == MPTCP_APP && !tp->mptcp_enabled)
++ mopt.saw_mpc = 0;
++
++ if (skb->protocol == htons(ETH_P_IP)) {
++ if (mopt.saw_mpc && !want_cookie) {
++ if (skb_rtable(skb)->rt_flags &
++ (RTCF_BROADCAST | RTCF_MULTICAST))
++ goto drop;
++
++ return tcp_conn_request(&mptcp_request_sock_ops,
++ &mptcp_request_sock_ipv4_ops,
++ sk, skb);
++ }
++
++ return tcp_v4_conn_request(sk, skb);
++#if IS_ENABLED(CONFIG_IPV6)
++ } else {
++ if (mopt.saw_mpc && !want_cookie) {
++ if (!ipv6_unicast_destination(skb))
++ goto drop;
++
++ return tcp_conn_request(&mptcp6_request_sock_ops,
++ &mptcp_request_sock_ipv6_ops,
++ sk, skb);
++ }
++
++ return tcp_v6_conn_request(sk, skb);
++#endif
++ }
++drop:
++ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENDROPS);
++ return 0;
++}
++
++struct workqueue_struct *mptcp_wq;
++EXPORT_SYMBOL(mptcp_wq);
++
++/* Output /proc/net/mptcp */
++static int mptcp_pm_seq_show(struct seq_file *seq, void *v)
++{
++ struct tcp_sock *meta_tp;
++ const struct net *net = seq->private;
++ int i, n = 0;
++
++ seq_printf(seq, " sl loc_tok rem_tok v6 local_address remote_address st ns tx_queue rx_queue inode");
++ seq_putc(seq, '\n');
++
++ for (i = 0; i < MPTCP_HASH_SIZE; i++) {
++ struct hlist_nulls_node *node;
++ rcu_read_lock_bh();
++ hlist_nulls_for_each_entry_rcu(meta_tp, node,
++ &tk_hashtable[i], tk_table) {
++ struct mptcp_cb *mpcb = meta_tp->mpcb;
++ struct sock *meta_sk = (struct sock *)meta_tp;
++ struct inet_sock *isk = inet_sk(meta_sk);
++
++ if (!mptcp(meta_tp) || !net_eq(net, sock_net(meta_sk)))
++ continue;
++
++ if (capable(CAP_NET_ADMIN)) {
++ seq_printf(seq, "%4d: %04X %04X ", n++,
++ mpcb->mptcp_loc_token,
++ mpcb->mptcp_rem_token);
++ } else {
++ seq_printf(seq, "%4d: %04X %04X ", n++, -1, -1);
++ }
++ if (meta_sk->sk_family == AF_INET ||
++ mptcp_v6_is_v4_mapped(meta_sk)) {
++ seq_printf(seq, " 0 %08X:%04X %08X:%04X ",
++ isk->inet_rcv_saddr,
++ ntohs(isk->inet_sport),
++ isk->inet_daddr,
++ ntohs(isk->inet_dport));
++#if IS_ENABLED(CONFIG_IPV6)
++ } else if (meta_sk->sk_family == AF_INET6) {
++ struct in6_addr *src = &meta_sk->sk_v6_rcv_saddr;
++ struct in6_addr *dst = &meta_sk->sk_v6_daddr;
++ seq_printf(seq, " 1 %08X%08X%08X%08X:%04X %08X%08X%08X%08X:%04X",
++ src->s6_addr32[0], src->s6_addr32[1],
++ src->s6_addr32[2], src->s6_addr32[3],
++ ntohs(isk->inet_sport),
++ dst->s6_addr32[0], dst->s6_addr32[1],
++ dst->s6_addr32[2], dst->s6_addr32[3],
++ ntohs(isk->inet_dport));
++#endif
++ }
++ seq_printf(seq, " %02X %02X %08X:%08X %lu",
++ meta_sk->sk_state, mpcb->cnt_subflows,
++ meta_tp->write_seq - meta_tp->snd_una,
++ max_t(int, meta_tp->rcv_nxt -
++ meta_tp->copied_seq, 0),
++ sock_i_ino(meta_sk));
++ seq_putc(seq, '\n');
++ }
++
++ rcu_read_unlock_bh();
++ }
++
++ return 0;
++}
++
++static int mptcp_pm_seq_open(struct inode *inode, struct file *file)
++{
++ return single_open_net(inode, file, mptcp_pm_seq_show);
++}
++
++static const struct file_operations mptcp_pm_seq_fops = {
++ .owner = THIS_MODULE,
++ .open = mptcp_pm_seq_open,
++ .read = seq_read,
++ .llseek = seq_lseek,
++ .release = single_release_net,
++};
++
++static int mptcp_pm_init_net(struct net *net)
++{
++ if (!proc_create("mptcp", S_IRUGO, net->proc_net, &mptcp_pm_seq_fops))
++ return -ENOMEM;
++
++ return 0;
++}
++
++static void mptcp_pm_exit_net(struct net *net)
++{
++ remove_proc_entry("mptcp", net->proc_net);
++}
++
++static struct pernet_operations mptcp_pm_proc_ops = {
++ .init = mptcp_pm_init_net,
++ .exit = mptcp_pm_exit_net,
++};
++
++/* General initialization of mptcp */
++void __init mptcp_init(void)
++{
++ int i;
++ struct ctl_table_header *mptcp_sysctl;
++
++ mptcp_sock_cache = kmem_cache_create("mptcp_sock",
++ sizeof(struct mptcp_tcp_sock),
++ 0, SLAB_HWCACHE_ALIGN,
++ NULL);
++ if (!mptcp_sock_cache)
++ goto mptcp_sock_cache_failed;
++
++ mptcp_cb_cache = kmem_cache_create("mptcp_cb", sizeof(struct mptcp_cb),
++ 0, SLAB_DESTROY_BY_RCU|SLAB_HWCACHE_ALIGN,
++ NULL);
++ if (!mptcp_cb_cache)
++ goto mptcp_cb_cache_failed;
++
++ mptcp_tw_cache = kmem_cache_create("mptcp_tw", sizeof(struct mptcp_tw),
++ 0, SLAB_DESTROY_BY_RCU|SLAB_HWCACHE_ALIGN,
++ NULL);
++ if (!mptcp_tw_cache)
++ goto mptcp_tw_cache_failed;
++
++ get_random_bytes(mptcp_secret, sizeof(mptcp_secret));
++
++ mptcp_wq = alloc_workqueue("mptcp_wq", WQ_UNBOUND | WQ_MEM_RECLAIM, 8);
++ if (!mptcp_wq)
++ goto alloc_workqueue_failed;
++
++ for (i = 0; i < MPTCP_HASH_SIZE; i++) {
++ INIT_HLIST_NULLS_HEAD(&tk_hashtable[i], i);
++ INIT_HLIST_NULLS_HEAD(&mptcp_reqsk_htb[i],
++ i + MPTCP_REQSK_NULLS_BASE);
++ INIT_HLIST_NULLS_HEAD(&mptcp_reqsk_tk_htb[i], i);
++ }
++
++ spin_lock_init(&mptcp_reqsk_hlock);
++ spin_lock_init(&mptcp_tk_hashlock);
++
++ if (register_pernet_subsys(&mptcp_pm_proc_ops))
++ goto pernet_failed;
++
++#if IS_ENABLED(CONFIG_IPV6)
++ if (mptcp_pm_v6_init())
++ goto mptcp_pm_v6_failed;
++#endif
++ if (mptcp_pm_v4_init())
++ goto mptcp_pm_v4_failed;
++
++ mptcp_sysctl = register_net_sysctl(&init_net, "net/mptcp", mptcp_table);
++ if (!mptcp_sysctl)
++ goto register_sysctl_failed;
++
++ if (mptcp_register_path_manager(&mptcp_pm_default))
++ goto register_pm_failed;
++
++ if (mptcp_register_scheduler(&mptcp_sched_default))
++ goto register_sched_failed;
++
++ pr_info("MPTCP: Stable release v0.89.0-rc");
++
++ mptcp_init_failed = false;
++
++ return;
++
++register_sched_failed:
++ mptcp_unregister_path_manager(&mptcp_pm_default);
++register_pm_failed:
++ unregister_net_sysctl_table(mptcp_sysctl);
++register_sysctl_failed:
++ mptcp_pm_v4_undo();
++mptcp_pm_v4_failed:
++#if IS_ENABLED(CONFIG_IPV6)
++ mptcp_pm_v6_undo();
++mptcp_pm_v6_failed:
++#endif
++ unregister_pernet_subsys(&mptcp_pm_proc_ops);
++pernet_failed:
++ destroy_workqueue(mptcp_wq);
++alloc_workqueue_failed:
++ kmem_cache_destroy(mptcp_tw_cache);
++mptcp_tw_cache_failed:
++ kmem_cache_destroy(mptcp_cb_cache);
++mptcp_cb_cache_failed:
++ kmem_cache_destroy(mptcp_sock_cache);
++mptcp_sock_cache_failed:
++ mptcp_init_failed = true;
++}
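The error paths of mptcp_init() above follow the standard kernel goto-unwind pattern: each successful acquisition gets a label, and a later failure jumps to the label that releases everything acquired so far, in reverse order. A minimal userspace sketch (resource indices and helper names are illustrative):

```c
#include <assert.h>

/* Track three fake resources; acquire(which, fail_at) simulates an
 * allocation that fails when which == fail_at.
 */
static int acquired[3];

static int acquire(int which, int fail_at)
{
	if (which == fail_at)
		return -1;	/* simulated allocation failure */
	acquired[which] = 1;
	return 0;
}

static void release(int which)
{
	acquired[which] = 0;
}

static int init_all(int fail_at)
{
	if (acquire(0, fail_at))
		goto fail_a;
	if (acquire(1, fail_at))
		goto fail_b;
	if (acquire(2, fail_at))
		goto fail_c;
	return 0;

fail_c:			/* unwind in reverse acquisition order */
	release(1);
fail_b:
	release(0);
fail_a:
	return -1;
}
```

A failure at any step leaves no resource held, which is exactly what the cascade of labels from `register_sched_failed` down to `mptcp_sock_cache_failed` guarantees above.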
+diff --git a/net/mptcp/mptcp_fullmesh.c b/net/mptcp/mptcp_fullmesh.c
+new file mode 100644
+index 000000000000..3a54413ce25b
+--- /dev/null
++++ b/net/mptcp/mptcp_fullmesh.c
+@@ -0,0 +1,1722 @@
++#include <linux/module.h>
++
++#include <net/mptcp.h>
++#include <net/mptcp_v4.h>
++
++#if IS_ENABLED(CONFIG_IPV6)
++#include <net/mptcp_v6.h>
++#include <net/addrconf.h>
++#endif
++
++enum {
++ MPTCP_EVENT_ADD = 1,
++ MPTCP_EVENT_DEL,
++ MPTCP_EVENT_MOD,
++};
++
++#define MPTCP_SUBFLOW_RETRY_DELAY 1000
++
++/* Max number of local or remote addresses we can store.
++ * When changing, see the bitfield below in fullmesh_rem4/6.
++ */
++#define MPTCP_MAX_ADDR 8
++
++struct fullmesh_rem4 {
++ u8 rem4_id;
++ u8 bitfield;
++ u8 retry_bitfield;
++ __be16 port;
++ struct in_addr addr;
++};
++
++struct fullmesh_rem6 {
++ u8 rem6_id;
++ u8 bitfield;
++ u8 retry_bitfield;
++ __be16 port;
++ struct in6_addr addr;
++};
++
++struct mptcp_loc_addr {
++ struct mptcp_loc4 locaddr4[MPTCP_MAX_ADDR];
++ u8 loc4_bits;
++ u8 next_v4_index;
++
++ struct mptcp_loc6 locaddr6[MPTCP_MAX_ADDR];
++ u8 loc6_bits;
++ u8 next_v6_index;
++};
++
++struct mptcp_addr_event {
++ struct list_head list;
++ unsigned short family;
++ u8 code:7,
++ low_prio:1;
++ union inet_addr addr;
++};
++
++struct fullmesh_priv {
++ /* Worker struct for subflow establishment */
++ struct work_struct subflow_work;
++ /* Delayed worker, when the routing-tables are not yet ready. */
++ struct delayed_work subflow_retry_work;
++
++ /* Remote addresses */
++ struct fullmesh_rem4 remaddr4[MPTCP_MAX_ADDR];
++ struct fullmesh_rem6 remaddr6[MPTCP_MAX_ADDR];
++
++ struct mptcp_cb *mpcb;
++
++ u16 remove_addrs; /* Addresses to remove */
++ u8 announced_addrs_v4; /* IPv4 Addresses we did announce */
++ u8 announced_addrs_v6; /* IPv6 Addresses we did announce */
++
++ u8 add_addr; /* Are we sending an add_addr? */
++
++ u8 rem4_bits;
++ u8 rem6_bits;
++};
++
++struct mptcp_fm_ns {
++ struct mptcp_loc_addr __rcu *local;
++ spinlock_t local_lock; /* Protecting the above pointer */
++ struct list_head events;
++ struct delayed_work address_worker;
++
++ struct net *net;
++};
++
++static struct mptcp_pm_ops full_mesh __read_mostly;
++
++static void full_mesh_create_subflows(struct sock *meta_sk);
++
++static struct mptcp_fm_ns *fm_get_ns(const struct net *net)
++{
++ return (struct mptcp_fm_ns *)net->mptcp.path_managers[MPTCP_PM_FULLMESH];
++}
++
++static struct fullmesh_priv *fullmesh_get_priv(const struct mptcp_cb *mpcb)
++{
++ return (struct fullmesh_priv *)&mpcb->mptcp_pm[0];
++}
++
++/* Find the first free index in the bitfield */
++static int __mptcp_find_free_index(u8 bitfield, u8 base)
++{
++ int i;
++
++ /* There are no free bits anyway... */
++ if (bitfield == 0xff)
++ goto exit;
++
++ i = ffs(~(bitfield >> base)) - 1;
++ if (i < 0)
++ goto exit;
++
++ /* No free bits when starting at base, try from 0 on */
++ if (i + base >= sizeof(bitfield) * 8)
++ return __mptcp_find_free_index(bitfield, 0);
++
++ return i + base;
++exit:
++ return -1;
++}
++
++static int mptcp_find_free_index(u8 bitfield)
++{
++ return __mptcp_find_free_index(bitfield, 0);
++}
++
++static void mptcp_addv4_raddr(struct mptcp_cb *mpcb,
++ const struct in_addr *addr,
++ __be16 port, u8 id)
++{
++ int i;
++ struct fullmesh_rem4 *rem4;
++ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
++
++ mptcp_for_each_bit_set(fmp->rem4_bits, i) {
++ rem4 = &fmp->remaddr4[i];
++
++ /* Address is already in the list --- continue */
++ if (rem4->rem4_id == id &&
++ rem4->addr.s_addr == addr->s_addr && rem4->port == port)
++ return;
++
++ /* This may be the case when the peer is behind a NAT. It is
++ * trying to JOIN, thus sending the JOIN with a certain ID.
++ * However, the src_addr of the IP-packet has been changed. We
++ * update the addr in the list, because this is the address as
++ * OUR BOX sees it.
++ */
++ if (rem4->rem4_id == id && rem4->addr.s_addr != addr->s_addr) {
++ /* update the address */
++ mptcp_debug("%s: updating old addr:%pI4 to addr %pI4 with id:%d\n",
++ __func__, &rem4->addr.s_addr,
++ &addr->s_addr, id);
++ rem4->addr.s_addr = addr->s_addr;
++ rem4->port = port;
++ mpcb->list_rcvd = 1;
++ return;
++ }
++ }
++
++ i = mptcp_find_free_index(fmp->rem4_bits);
++ /* Do we already have the maximum number of local/remote addresses? */
++ if (i < 0) {
++ mptcp_debug("%s: At max num of remote addresses: %d --- not adding address: %pI4\n",
++ __func__, MPTCP_MAX_ADDR, &addr->s_addr);
++ return;
++ }
++
++ rem4 = &fmp->remaddr4[i];
++
++ /* Address is not known yet, store it */
++ rem4->addr.s_addr = addr->s_addr;
++ rem4->port = port;
++ rem4->bitfield = 0;
++ rem4->retry_bitfield = 0;
++ rem4->rem4_id = id;
++ mpcb->list_rcvd = 1;
++ fmp->rem4_bits |= (1 << i);
++
++ return;
++}
++
++static void mptcp_addv6_raddr(struct mptcp_cb *mpcb,
++ const struct in6_addr *addr,
++ __be16 port, u8 id)
++{
++ int i;
++ struct fullmesh_rem6 *rem6;
++ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
++
++ mptcp_for_each_bit_set(fmp->rem6_bits, i) {
++ rem6 = &fmp->remaddr6[i];
++
++ /* Address is already in the list --- continue */
++ if (rem6->rem6_id == id &&
++ ipv6_addr_equal(&rem6->addr, addr) && rem6->port == port)
++ return;
++
++ /* This may be the case when the peer is behind a NAT. It is
++ * trying to JOIN, thus sending the JOIN with a certain ID.
++ * However, the src_addr of the IP-packet has been changed. We
++ * update the addr in the list, because this is the address as
++ * OUR BOX sees it.
++ */
++ if (rem6->rem6_id == id) {
++ /* update the address */
++ mptcp_debug("%s: updating old addr: %pI6 to addr %pI6 with id:%d\n",
++ __func__, &rem6->addr, addr, id);
++ rem6->addr = *addr;
++ rem6->port = port;
++ mpcb->list_rcvd = 1;
++ return;
++ }
++ }
++
++ i = mptcp_find_free_index(fmp->rem6_bits);
++ /* Do we already have the maximum number of local/remote addresses? */
++ if (i < 0) {
++ mptcp_debug("%s: At max num of remote addresses: %d --- not adding address: %pI6\n",
++ __func__, MPTCP_MAX_ADDR, addr);
++ return;
++ }
++
++ rem6 = &fmp->remaddr6[i];
++
++ /* Address is not known yet, store it */
++ rem6->addr = *addr;
++ rem6->port = port;
++ rem6->bitfield = 0;
++ rem6->retry_bitfield = 0;
++ rem6->rem6_id = id;
++ mpcb->list_rcvd = 1;
++ fmp->rem6_bits |= (1 << i);
++
++ return;
++}
++
++static void mptcp_v4_rem_raddress(struct mptcp_cb *mpcb, u8 id)
++{
++ int i;
++ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
++
++ mptcp_for_each_bit_set(fmp->rem4_bits, i) {
++ if (fmp->remaddr4[i].rem4_id == id) {
++ /* remove address from bitfield */
++ fmp->rem4_bits &= ~(1 << i);
++
++ break;
++ }
++ }
++}
++
++static void mptcp_v6_rem_raddress(const struct mptcp_cb *mpcb, u8 id)
++{
++ int i;
++ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
++
++ mptcp_for_each_bit_set(fmp->rem6_bits, i) {
++ if (fmp->remaddr6[i].rem6_id == id) {
++ /* remove address from bitfield */
++ fmp->rem6_bits &= ~(1 << i);
++
++ break;
++ }
++ }
++}
++
++/* Sets the bitfield of the remote-address field */
++static void mptcp_v4_set_init_addr_bit(const struct mptcp_cb *mpcb,
++ const struct in_addr *addr, u8 index)
++{
++ int i;
++ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
++
++ mptcp_for_each_bit_set(fmp->rem4_bits, i) {
++ if (fmp->remaddr4[i].addr.s_addr == addr->s_addr) {
++ fmp->remaddr4[i].bitfield |= (1 << index);
++ return;
++ }
++ }
++}
++
++/* Sets the bitfield of the remote-address field */
++static void mptcp_v6_set_init_addr_bit(struct mptcp_cb *mpcb,
++ const struct in6_addr *addr, u8 index)
++{
++ int i;
++ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
++
++ mptcp_for_each_bit_set(fmp->rem6_bits, i) {
++ if (ipv6_addr_equal(&fmp->remaddr6[i].addr, addr)) {
++ fmp->remaddr6[i].bitfield |= (1 << index);
++ return;
++ }
++ }
++}
++
++static void mptcp_set_init_addr_bit(struct mptcp_cb *mpcb,
++ const union inet_addr *addr,
++ sa_family_t family, u8 id)
++{
++ if (family == AF_INET)
++ mptcp_v4_set_init_addr_bit(mpcb, &addr->in, id);
++ else
++ mptcp_v6_set_init_addr_bit(mpcb, &addr->in6, id);
++}
++
++static void retry_subflow_worker(struct work_struct *work)
++{
++ struct delayed_work *delayed_work = container_of(work,
++ struct delayed_work,
++ work);
++ struct fullmesh_priv *fmp = container_of(delayed_work,
++ struct fullmesh_priv,
++ subflow_retry_work);
++ struct mptcp_cb *mpcb = fmp->mpcb;
++ struct sock *meta_sk = mpcb->meta_sk;
++ struct mptcp_loc_addr *mptcp_local;
++ struct mptcp_fm_ns *fm_ns = fm_get_ns(sock_net(meta_sk));
++ int iter = 0, i;
++
++ /* We need a local (stable) copy of the address-list. Really, it is not
++ * such a big deal if the address-list is not 100% up-to-date.
++ */
++ rcu_read_lock_bh();
++ mptcp_local = rcu_dereference_bh(fm_ns->local);
++ mptcp_local = kmemdup(mptcp_local, sizeof(*mptcp_local), GFP_ATOMIC);
++ rcu_read_unlock_bh();
++
++ if (!mptcp_local)
++ return;
++
++next_subflow:
++ if (iter) {
++ release_sock(meta_sk);
++ mutex_unlock(&mpcb->mpcb_mutex);
++
++ cond_resched();
++ }
++ mutex_lock(&mpcb->mpcb_mutex);
++ lock_sock_nested(meta_sk, SINGLE_DEPTH_NESTING);
++
++ iter++;
++
++ if (sock_flag(meta_sk, SOCK_DEAD))
++ goto exit;
++
++ mptcp_for_each_bit_set(fmp->rem4_bits, i) {
++ struct fullmesh_rem4 *rem = &fmp->remaddr4[i];
++ /* Do we need to retry establishing a subflow ? */
++ if (rem->retry_bitfield) {
++ int i = mptcp_find_free_index(~rem->retry_bitfield);
++ struct mptcp_rem4 rem4;
++
++ rem->bitfield |= (1 << i);
++ rem->retry_bitfield &= ~(1 << i);
++
++ rem4.addr = rem->addr;
++ rem4.port = rem->port;
++ rem4.rem4_id = rem->rem4_id;
++
++ mptcp_init4_subsockets(meta_sk, &mptcp_local->locaddr4[i], &rem4);
++ goto next_subflow;
++ }
++ }
++
++#if IS_ENABLED(CONFIG_IPV6)
++ mptcp_for_each_bit_set(fmp->rem6_bits, i) {
++ struct fullmesh_rem6 *rem = &fmp->remaddr6[i];
++
++ /* Do we need to retry establishing a subflow ? */
++ if (rem->retry_bitfield) {
++ int i = mptcp_find_free_index(~rem->retry_bitfield);
++ struct mptcp_rem6 rem6;
++
++ rem->bitfield |= (1 << i);
++ rem->retry_bitfield &= ~(1 << i);
++
++ rem6.addr = rem->addr;
++ rem6.port = rem->port;
++ rem6.rem6_id = rem->rem6_id;
++
++ mptcp_init6_subsockets(meta_sk, &mptcp_local->locaddr6[i], &rem6);
++ goto next_subflow;
++ }
++ }
++#endif
++
++exit:
++ kfree(mptcp_local);
++ release_sock(meta_sk);
++ mutex_unlock(&mpcb->mpcb_mutex);
++ sock_put(meta_sk);
++}
++
++/**
++ * Create all new subflows, by calling mptcp_initX_subsockets().
++ *
++ * This function uses a goto next_subflow to allow releasing the lock between
++ * new subflows and giving other processes a chance to do some work on the
++ * socket and potentially finishing the communication.
++ **/
++static void create_subflow_worker(struct work_struct *work)
++{
++ struct fullmesh_priv *fmp = container_of(work, struct fullmesh_priv,
++ subflow_work);
++ struct mptcp_cb *mpcb = fmp->mpcb;
++ struct sock *meta_sk = mpcb->meta_sk;
++ struct mptcp_loc_addr *mptcp_local;
++ const struct mptcp_fm_ns *fm_ns = fm_get_ns(sock_net(meta_sk));
++ int iter = 0, retry = 0;
++ int i;
++
++ /* We need a local (stable) copy of the address-list. Really, it is not
++ * such a big deal if the address-list is not 100% up-to-date.
++ */
++ rcu_read_lock_bh();
++ mptcp_local = rcu_dereference_bh(fm_ns->local);
++ mptcp_local = kmemdup(mptcp_local, sizeof(*mptcp_local), GFP_ATOMIC);
++ rcu_read_unlock_bh();
++
++ if (!mptcp_local)
++ return;
++
++next_subflow:
++ if (iter) {
++ release_sock(meta_sk);
++ mutex_unlock(&mpcb->mpcb_mutex);
++
++ cond_resched();
++ }
++ mutex_lock(&mpcb->mpcb_mutex);
++ lock_sock_nested(meta_sk, SINGLE_DEPTH_NESTING);
++
++ iter++;
++
++ if (sock_flag(meta_sk, SOCK_DEAD))
++ goto exit;
++
++ if (mpcb->master_sk &&
++ !tcp_sk(mpcb->master_sk)->mptcp->fully_established)
++ goto exit;
++
++ mptcp_for_each_bit_set(fmp->rem4_bits, i) {
++ struct fullmesh_rem4 *rem;
++ u8 remaining_bits;
++
++ rem = &fmp->remaddr4[i];
++ remaining_bits = ~(rem->bitfield) & mptcp_local->loc4_bits;
++
++ /* Are there still combinations to handle? */
++ if (remaining_bits) {
++ int i = mptcp_find_free_index(~remaining_bits);
++ struct mptcp_rem4 rem4;
++
++ rem->bitfield |= (1 << i);
++
++ rem4.addr = rem->addr;
++ rem4.port = rem->port;
++ rem4.rem4_id = rem->rem4_id;
++
++ /* If a route is not yet available then retry once */
++ if (mptcp_init4_subsockets(meta_sk, &mptcp_local->locaddr4[i],
++ &rem4) == -ENETUNREACH)
++ retry = rem->retry_bitfield |= (1 << i);
++ goto next_subflow;
++ }
++ }
++
++#if IS_ENABLED(CONFIG_IPV6)
++ mptcp_for_each_bit_set(fmp->rem6_bits, i) {
++ struct fullmesh_rem6 *rem;
++ u8 remaining_bits;
++
++ rem = &fmp->remaddr6[i];
++ remaining_bits = ~(rem->bitfield) & mptcp_local->loc6_bits;
++
++ /* Are there still combinations to handle? */
++ if (remaining_bits) {
++ int i = mptcp_find_free_index(~remaining_bits);
++ struct mptcp_rem6 rem6;
++
++ rem->bitfield |= (1 << i);
++
++ rem6.addr = rem->addr;
++ rem6.port = rem->port;
++ rem6.rem6_id = rem->rem6_id;
++
++ /* If a route is not yet available then retry once */
++ if (mptcp_init6_subsockets(meta_sk, &mptcp_local->locaddr6[i],
++ &rem6) == -ENETUNREACH)
++ retry = rem->retry_bitfield |= (1 << i);
++ goto next_subflow;
++ }
++ }
++#endif
++
++ if (retry && !delayed_work_pending(&fmp->subflow_retry_work)) {
++ sock_hold(meta_sk);
++ queue_delayed_work(mptcp_wq, &fmp->subflow_retry_work,
++ msecs_to_jiffies(MPTCP_SUBFLOW_RETRY_DELAY));
++ }
++
++exit:
++ kfree(mptcp_local);
++ release_sock(meta_sk);
++ mutex_unlock(&mpcb->mpcb_mutex);
++ sock_put(meta_sk);
++}
++
++static void announce_remove_addr(u8 addr_id, struct sock *meta_sk)
++{
++ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
++ struct sock *sk = mptcp_select_ack_sock(meta_sk);
++
++ fmp->remove_addrs |= (1 << addr_id);
++ mpcb->addr_signal = 1;
++
++ if (sk)
++ tcp_send_ack(sk);
++}
++
++static void update_addr_bitfields(struct sock *meta_sk,
++ const struct mptcp_loc_addr *mptcp_local)
++{
++ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
++ int i;
++
++ /* The bits in announced_addrs_* always match with loc*_bits. So, a
++ * simple & operation unsets the correct bits, because these go from
++ * announced to non-announced.
++ */
++ fmp->announced_addrs_v4 &= mptcp_local->loc4_bits;
++
++ mptcp_for_each_bit_set(fmp->rem4_bits, i) {
++ fmp->remaddr4[i].bitfield &= mptcp_local->loc4_bits;
++ fmp->remaddr4[i].retry_bitfield &= mptcp_local->loc4_bits;
++ }
++
++ fmp->announced_addrs_v6 &= mptcp_local->loc6_bits;
++
++ mptcp_for_each_bit_set(fmp->rem6_bits, i) {
++ fmp->remaddr6[i].bitfield &= mptcp_local->loc6_bits;
++ fmp->remaddr6[i].retry_bitfield &= mptcp_local->loc6_bits;
++ }
++}
++
++static int mptcp_find_address(const struct mptcp_loc_addr *mptcp_local,
++ sa_family_t family, const union inet_addr *addr)
++{
++ int i;
++ u8 loc_bits;
++ bool found = false;
++
++ if (family == AF_INET)
++ loc_bits = mptcp_local->loc4_bits;
++ else
++ loc_bits = mptcp_local->loc6_bits;
++
++ mptcp_for_each_bit_set(loc_bits, i) {
++ if (family == AF_INET &&
++ mptcp_local->locaddr4[i].addr.s_addr == addr->in.s_addr) {
++ found = true;
++ break;
++ }
++ if (family == AF_INET6 &&
++ ipv6_addr_equal(&mptcp_local->locaddr6[i].addr,
++ &addr->in6)) {
++ found = true;
++ break;
++ }
++ }
++
++ if (!found)
++ return -1;
++
++ return i;
++}
++
++static void mptcp_address_worker(struct work_struct *work)
++{
++ const struct delayed_work *delayed_work = container_of(work,
++ struct delayed_work,
++ work);
++ struct mptcp_fm_ns *fm_ns = container_of(delayed_work,
++ struct mptcp_fm_ns,
++ address_worker);
++ struct net *net = fm_ns->net;
++ struct mptcp_addr_event *event = NULL;
++ struct mptcp_loc_addr *mptcp_local, *old;
++ int i, id = -1; /* id is used in the socket-code on a delete-event */
++ bool success; /* Used to indicate if we succeeded handling the event */
++
++next_event:
++ success = false;
++ kfree(event);
++
++ /* First, let's dequeue an event from our event-list */
++ rcu_read_lock_bh();
++ spin_lock(&fm_ns->local_lock);
++
++ event = list_first_entry_or_null(&fm_ns->events,
++ struct mptcp_addr_event, list);
++ if (!event) {
++ spin_unlock(&fm_ns->local_lock);
++ rcu_read_unlock_bh();
++ return;
++ }
++
++ list_del(&event->list);
++
++ mptcp_local = rcu_dereference_bh(fm_ns->local);
++
++ if (event->code == MPTCP_EVENT_DEL) {
++ id = mptcp_find_address(mptcp_local, event->family, &event->addr);
++
++ /* Not in the list - so we don't care */
++ if (id < 0) {
++ mptcp_debug("%s could not find id\n", __func__);
++ goto duno;
++ }
++
++ old = mptcp_local;
++ mptcp_local = kmemdup(mptcp_local, sizeof(*mptcp_local),
++ GFP_ATOMIC);
++ if (!mptcp_local)
++ goto duno;
++
++ if (event->family == AF_INET)
++ mptcp_local->loc4_bits &= ~(1 << id);
++ else
++ mptcp_local->loc6_bits &= ~(1 << id);
++
++ rcu_assign_pointer(fm_ns->local, mptcp_local);
++ kfree(old);
++ } else {
++ int i = mptcp_find_address(mptcp_local, event->family, &event->addr);
++ int j = i;
++
++ if (j < 0) {
++ /* Not in the list, so we have to find an empty slot */
++ if (event->family == AF_INET)
++ i = __mptcp_find_free_index(mptcp_local->loc4_bits,
++ mptcp_local->next_v4_index);
++ if (event->family == AF_INET6)
++ i = __mptcp_find_free_index(mptcp_local->loc6_bits,
++ mptcp_local->next_v6_index);
++
++ if (i < 0) {
++ mptcp_debug("%s no more space\n", __func__);
++ goto duno;
++ }
++
++ /* It might have been a MOD-event. */
++ event->code = MPTCP_EVENT_ADD;
++ } else {
++ /* Let's check if anything changes */
++ if (event->family == AF_INET &&
++ event->low_prio == mptcp_local->locaddr4[i].low_prio)
++ goto duno;
++
++ if (event->family == AF_INET6 &&
++ event->low_prio == mptcp_local->locaddr6[i].low_prio)
++ goto duno;
++ }
++
++ old = mptcp_local;
++ mptcp_local = kmemdup(mptcp_local, sizeof(*mptcp_local),
++ GFP_ATOMIC);
++ if (!mptcp_local)
++ goto duno;
++
++ if (event->family == AF_INET) {
++ mptcp_local->locaddr4[i].addr.s_addr = event->addr.in.s_addr;
++ mptcp_local->locaddr4[i].loc4_id = i + 1;
++ mptcp_local->locaddr4[i].low_prio = event->low_prio;
++ } else {
++ mptcp_local->locaddr6[i].addr = event->addr.in6;
++ mptcp_local->locaddr6[i].loc6_id = i + MPTCP_MAX_ADDR;
++ mptcp_local->locaddr6[i].low_prio = event->low_prio;
++ }
++
++ if (j < 0) {
++ if (event->family == AF_INET) {
++ mptcp_local->loc4_bits |= (1 << i);
++ mptcp_local->next_v4_index = i + 1;
++ } else {
++ mptcp_local->loc6_bits |= (1 << i);
++ mptcp_local->next_v6_index = i + 1;
++ }
++ }
++
++ rcu_assign_pointer(fm_ns->local, mptcp_local);
++ kfree(old);
++ }
++ success = true;
++
++duno:
++ spin_unlock(&fm_ns->local_lock);
++ rcu_read_unlock_bh();
++
++ if (!success)
++ goto next_event;
++
++ /* Now we iterate over the MPTCP-sockets and apply the event. */
++ for (i = 0; i < MPTCP_HASH_SIZE; i++) {
++ const struct hlist_nulls_node *node;
++ struct tcp_sock *meta_tp;
++
++ rcu_read_lock_bh();
++ hlist_nulls_for_each_entry_rcu(meta_tp, node, &tk_hashtable[i],
++ tk_table) {
++ struct mptcp_cb *mpcb = meta_tp->mpcb;
++ struct sock *meta_sk = (struct sock *)meta_tp, *sk;
++ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
++ bool meta_v4 = meta_sk->sk_family == AF_INET;
++
++ if (sock_net(meta_sk) != net)
++ continue;
++
++ if (meta_v4) {
++ /* skip IPv6 events if meta is IPv4 */
++ if (event->family == AF_INET6)
++ continue;
++ }
++ /* skip IPv4 events if IPV6_V6ONLY is set */
++ else if (event->family == AF_INET &&
++ inet6_sk(meta_sk)->ipv6only)
++ continue;
++
++ if (unlikely(!atomic_inc_not_zero(&meta_sk->sk_refcnt)))
++ continue;
++
++ bh_lock_sock(meta_sk);
++
++ if (!mptcp(meta_tp) || !is_meta_sk(meta_sk) ||
++ mpcb->infinite_mapping_snd ||
++ mpcb->infinite_mapping_rcv ||
++ mpcb->send_infinite_mapping)
++ goto next;
++
++ /* May be that the pm has changed in-between */
++ if (mpcb->pm_ops != &full_mesh)
++ goto next;
++
++ if (sock_owned_by_user(meta_sk)) {
++ if (!test_and_set_bit(MPTCP_PATH_MANAGER,
++ &meta_tp->tsq_flags))
++ sock_hold(meta_sk);
++
++ goto next;
++ }
++
++ if (event->code == MPTCP_EVENT_ADD) {
++ fmp->add_addr++;
++ mpcb->addr_signal = 1;
++
++ sk = mptcp_select_ack_sock(meta_sk);
++ if (sk)
++ tcp_send_ack(sk);
++
++ full_mesh_create_subflows(meta_sk);
++ }
++
++ if (event->code == MPTCP_EVENT_DEL) {
++ struct sock *sk, *tmpsk;
++ struct mptcp_loc_addr *mptcp_local;
++ bool found = false;
++
++ mptcp_local = rcu_dereference_bh(fm_ns->local);
++
++ /* In any case, we need to update our bitfields */
++ if (id >= 0)
++ update_addr_bitfields(meta_sk, mptcp_local);
++
++ /* Look for the socket and remove it */
++ mptcp_for_each_sk_safe(mpcb, sk, tmpsk) {
++ if ((event->family == AF_INET6 &&
++ (sk->sk_family == AF_INET ||
++ mptcp_v6_is_v4_mapped(sk))) ||
++ (event->family == AF_INET &&
++ (sk->sk_family == AF_INET6 &&
++ !mptcp_v6_is_v4_mapped(sk))))
++ continue;
++
++ if (event->family == AF_INET &&
++ (sk->sk_family == AF_INET ||
++ mptcp_v6_is_v4_mapped(sk)) &&
++ inet_sk(sk)->inet_saddr != event->addr.in.s_addr)
++ continue;
++
++ if (event->family == AF_INET6 &&
++ sk->sk_family == AF_INET6 &&
++ !ipv6_addr_equal(&inet6_sk(sk)->saddr, &event->addr.in6))
++ continue;
++
++ /* Reinject, so that pf = 1 and so we
++ * won't select this one as the
++ * ack-sock.
++ */
++ mptcp_reinject_data(sk, 0);
++
++ /* We announce the removal of this id */
++ announce_remove_addr(tcp_sk(sk)->mptcp->loc_id, meta_sk);
++
++ mptcp_sub_force_close(sk);
++ found = true;
++ }
++
++ if (found)
++ goto next;
++
++ /* The id may have been given by the event,
++ * matching on a local address. And it may not
++ * have matched on one of the above sockets,
++ * because the client never created a subflow.
++ * So, we have to finally remove it here.
++ */
++ if (id > 0)
++ announce_remove_addr(id, meta_sk);
++ }
++
++ if (event->code == MPTCP_EVENT_MOD) {
++ struct sock *sk;
++
++ mptcp_for_each_sk(mpcb, sk) {
++ struct tcp_sock *tp = tcp_sk(sk);
++ if (event->family == AF_INET &&
++ (sk->sk_family == AF_INET ||
++ mptcp_v6_is_v4_mapped(sk)) &&
++ inet_sk(sk)->inet_saddr == event->addr.in.s_addr) {
++ if (event->low_prio != tp->mptcp->low_prio) {
++ tp->mptcp->send_mp_prio = 1;
++ tp->mptcp->low_prio = event->low_prio;
++
++ tcp_send_ack(sk);
++ }
++ }
++
++ if (event->family == AF_INET6 &&
++ sk->sk_family == AF_INET6 &&
++ !ipv6_addr_equal(&inet6_sk(sk)->saddr, &event->addr.in6)) {
++ if (event->low_prio != tp->mptcp->low_prio) {
++ tp->mptcp->send_mp_prio = 1;
++ tp->mptcp->low_prio = event->low_prio;
++
++ tcp_send_ack(sk);
++ }
++ }
++ }
++ }
++next:
++ bh_unlock_sock(meta_sk);
++ sock_put(meta_sk);
++ }
++ rcu_read_unlock_bh();
++ }
++ goto next_event;
++}
++
++static struct mptcp_addr_event *lookup_similar_event(const struct net *net,
++ const struct mptcp_addr_event *event)
++{
++ struct mptcp_addr_event *eventq;
++ struct mptcp_fm_ns *fm_ns = fm_get_ns(net);
++
++ list_for_each_entry(eventq, &fm_ns->events, list) {
++ if (eventq->family != event->family)
++ continue;
++ if (event->family == AF_INET) {
++ if (eventq->addr.in.s_addr == event->addr.in.s_addr)
++ return eventq;
++ } else {
++ if (ipv6_addr_equal(&eventq->addr.in6, &event->addr.in6))
++ return eventq;
++ }
++ }
++ return NULL;
++}
++
++/* We already hold the net-namespace MPTCP-lock */
++static void add_pm_event(struct net *net, const struct mptcp_addr_event *event)
++{
++ struct mptcp_addr_event *eventq = lookup_similar_event(net, event);
++ struct mptcp_fm_ns *fm_ns = fm_get_ns(net);
++
++ if (eventq) {
++ switch (event->code) {
++ case MPTCP_EVENT_DEL:
++ mptcp_debug("%s del old_code %u\n", __func__, eventq->code);
++ list_del(&eventq->list);
++ kfree(eventq);
++ break;
++ case MPTCP_EVENT_ADD:
++ mptcp_debug("%s add old_code %u\n", __func__, eventq->code);
++ eventq->low_prio = event->low_prio;
++ eventq->code = MPTCP_EVENT_ADD;
++ return;
++ case MPTCP_EVENT_MOD:
++ mptcp_debug("%s mod old_code %u\n", __func__, eventq->code);
++ eventq->low_prio = event->low_prio;
++ eventq->code = MPTCP_EVENT_MOD;
++ return;
++ }
++ }
++
++ /* OK, we have to add the new address to the wait queue */
++ eventq = kmemdup(event, sizeof(struct mptcp_addr_event), GFP_ATOMIC);
++ if (!eventq)
++ return;
++
++ list_add_tail(&eventq->list, &fm_ns->events);
++
++ /* Create work-queue */
++ if (!delayed_work_pending(&fm_ns->address_worker))
++ queue_delayed_work(mptcp_wq, &fm_ns->address_worker,
++ msecs_to_jiffies(500));
++}
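`add_pm_event()` keeps at most one pending entry per address: an incoming ADD or MOD is folded into the queued entry in place, while a DEL frees the queued entry and is enqueued afresh. Either way, the code that survives is always the newest one. A trivial model of that invariant, with hypothetical names:

```c
#include <assert.h>

/* Illustrative model of add_pm_event() coalescing: whatever was
 * queued for an address, the latest reported event code wins. */
enum mp_ev { MP_EV_ADD, MP_EV_MOD, MP_EV_DEL };

static enum mp_ev coalesce(enum mp_ev queued, enum mp_ev incoming)
{
	(void)queued;	/* the older code never survives coalescing */
	return incoming;
}
```

The 500 ms `queue_delayed_work()` call then lets the worker process a whole batch of such coalesced events in one pass.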
++
++static void addr4_event_handler(const struct in_ifaddr *ifa, unsigned long event,
++ struct net *net)
++{
++ const struct net_device *netdev = ifa->ifa_dev->dev;
++ struct mptcp_fm_ns *fm_ns = fm_get_ns(net);
++ struct mptcp_addr_event mpevent;
++
++ if (ifa->ifa_scope > RT_SCOPE_LINK ||
++ ipv4_is_loopback(ifa->ifa_local))
++ return;
++
++ spin_lock_bh(&fm_ns->local_lock);
++
++ mpevent.family = AF_INET;
++ mpevent.addr.in.s_addr = ifa->ifa_local;
++ mpevent.low_prio = (netdev->flags & IFF_MPBACKUP) ? 1 : 0;
++
++ if (event == NETDEV_DOWN || !netif_running(netdev) ||
++ (netdev->flags & IFF_NOMULTIPATH) || !(netdev->flags & IFF_UP))
++ mpevent.code = MPTCP_EVENT_DEL;
++ else if (event == NETDEV_UP)
++ mpevent.code = MPTCP_EVENT_ADD;
++ else if (event == NETDEV_CHANGE)
++ mpevent.code = MPTCP_EVENT_MOD;
++
++ mptcp_debug("%s created event for %pI4, code %u prio %u\n", __func__,
++ &ifa->ifa_local, mpevent.code, mpevent.low_prio);
++ add_pm_event(net, &mpevent);
++
++ spin_unlock_bh(&fm_ns->local_lock);
++ return;
++}
++
++/* React on IPv4-addr add/rem-events */
++static int mptcp_pm_inetaddr_event(struct notifier_block *this,
++ unsigned long event, void *ptr)
++{
++ const struct in_ifaddr *ifa = (struct in_ifaddr *)ptr;
++ struct net *net = dev_net(ifa->ifa_dev->dev);
++
++ if (!(event == NETDEV_UP || event == NETDEV_DOWN ||
++ event == NETDEV_CHANGE))
++ return NOTIFY_DONE;
++
++ addr4_event_handler(ifa, event, net);
++
++ return NOTIFY_DONE;
++}
++
++static struct notifier_block mptcp_pm_inetaddr_notifier = {
++ .notifier_call = mptcp_pm_inetaddr_event,
++};
++
++#if IS_ENABLED(CONFIG_IPV6)
++
++/* IPV6-related address/interface watchers */
++struct mptcp_dad_data {
++ struct timer_list timer;
++ struct inet6_ifaddr *ifa;
++};
++
++static void dad_callback(unsigned long arg);
++static int inet6_addr_event(struct notifier_block *this,
++ unsigned long event, void *ptr);
++
++static int ipv6_is_in_dad_state(const struct inet6_ifaddr *ifa)
++{
++ return (ifa->flags & IFA_F_TENTATIVE) &&
++ ifa->state == INET6_IFADDR_STATE_DAD;
++}
++
++static void dad_init_timer(struct mptcp_dad_data *data,
++ struct inet6_ifaddr *ifa)
++{
++ data->ifa = ifa;
++ data->timer.data = (unsigned long)data;
++ data->timer.function = dad_callback;
++ if (ifa->idev->cnf.rtr_solicit_delay)
++ data->timer.expires = jiffies + ifa->idev->cnf.rtr_solicit_delay;
++ else
++ data->timer.expires = jiffies + (HZ/10);
++}
++
++static void dad_callback(unsigned long arg)
++{
++ struct mptcp_dad_data *data = (struct mptcp_dad_data *)arg;
++
++ if (ipv6_is_in_dad_state(data->ifa)) {
++ dad_init_timer(data, data->ifa);
++ add_timer(&data->timer);
++ } else {
++ inet6_addr_event(NULL, NETDEV_UP, data->ifa);
++ in6_ifa_put(data->ifa);
++ kfree(data);
++ }
++}
++
++static inline void dad_setup_timer(struct inet6_ifaddr *ifa)
++{
++ struct mptcp_dad_data *data;
++
++ data = kmalloc(sizeof(*data), GFP_ATOMIC);
++
++ if (!data)
++ return;
++
++ init_timer(&data->timer);
++ dad_init_timer(data, ifa);
++ add_timer(&data->timer);
++ in6_ifa_hold(ifa);
++}
++
++static void addr6_event_handler(const struct inet6_ifaddr *ifa, unsigned long event,
++ struct net *net)
++{
++ const struct net_device *netdev = ifa->idev->dev;
++ int addr_type = ipv6_addr_type(&ifa->addr);
++ struct mptcp_fm_ns *fm_ns = fm_get_ns(net);
++ struct mptcp_addr_event mpevent;
++
++ if (ifa->scope > RT_SCOPE_LINK ||
++ addr_type == IPV6_ADDR_ANY ||
++ (addr_type & IPV6_ADDR_LOOPBACK) ||
++ (addr_type & IPV6_ADDR_LINKLOCAL))
++ return;
++
++ spin_lock_bh(&fm_ns->local_lock);
++
++ mpevent.family = AF_INET6;
++ mpevent.addr.in6 = ifa->addr;
++ mpevent.low_prio = (netdev->flags & IFF_MPBACKUP) ? 1 : 0;
++
++ if (event == NETDEV_DOWN || !netif_running(netdev) ||
++ (netdev->flags & IFF_NOMULTIPATH) || !(netdev->flags & IFF_UP))
++ mpevent.code = MPTCP_EVENT_DEL;
++ else if (event == NETDEV_UP)
++ mpevent.code = MPTCP_EVENT_ADD;
++ else if (event == NETDEV_CHANGE)
++ mpevent.code = MPTCP_EVENT_MOD;
++
++ mptcp_debug("%s created event for %pI6, code %u prio %u\n", __func__,
++ &ifa->addr, mpevent.code, mpevent.low_prio);
++ add_pm_event(net, &mpevent);
++
++ spin_unlock_bh(&fm_ns->local_lock);
++ return;
++}
++
++/* React on IPv6-addr add/rem-events */
++static int inet6_addr_event(struct notifier_block *this, unsigned long event,
++ void *ptr)
++{
++ struct inet6_ifaddr *ifa6 = (struct inet6_ifaddr *)ptr;
++ struct net *net = dev_net(ifa6->idev->dev);
++
++ if (!(event == NETDEV_UP || event == NETDEV_DOWN ||
++ event == NETDEV_CHANGE))
++ return NOTIFY_DONE;
++
++ if (ipv6_is_in_dad_state(ifa6))
++ dad_setup_timer(ifa6);
++ else
++ addr6_event_handler(ifa6, event, net);
++
++ return NOTIFY_DONE;
++}
++
++static struct notifier_block inet6_addr_notifier = {
++ .notifier_call = inet6_addr_event,
++};
++
++#endif
++
++/* React on ifup/down-events */
++static int netdev_event(struct notifier_block *this, unsigned long event,
++ void *ptr)
++{
++ const struct net_device *dev = netdev_notifier_info_to_dev(ptr);
++ struct in_device *in_dev;
++#if IS_ENABLED(CONFIG_IPV6)
++ struct inet6_dev *in6_dev;
++#endif
++
++ if (!(event == NETDEV_UP || event == NETDEV_DOWN ||
++ event == NETDEV_CHANGE))
++ return NOTIFY_DONE;
++
++ rcu_read_lock();
++ in_dev = __in_dev_get_rtnl(dev);
++
++ if (in_dev) {
++ for_ifa(in_dev) {
++ mptcp_pm_inetaddr_event(NULL, event, ifa);
++ } endfor_ifa(in_dev);
++ }
++
++#if IS_ENABLED(CONFIG_IPV6)
++ in6_dev = __in6_dev_get(dev);
++
++ if (in6_dev) {
++ struct inet6_ifaddr *ifa6;
++ list_for_each_entry(ifa6, &in6_dev->addr_list, if_list)
++ inet6_addr_event(NULL, event, ifa6);
++ }
++#endif
++
++ rcu_read_unlock();
++ return NOTIFY_DONE;
++}
++
++static struct notifier_block mptcp_pm_netdev_notifier = {
++ .notifier_call = netdev_event,
++};
++
++static void full_mesh_add_raddr(struct mptcp_cb *mpcb,
++ const union inet_addr *addr,
++ sa_family_t family, __be16 port, u8 id)
++{
++ if (family == AF_INET)
++ mptcp_addv4_raddr(mpcb, &addr->in, port, id);
++ else
++ mptcp_addv6_raddr(mpcb, &addr->in6, port, id);
++}
++
++static void full_mesh_new_session(const struct sock *meta_sk)
++{
++ struct mptcp_loc_addr *mptcp_local;
++ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
++ const struct mptcp_fm_ns *fm_ns = fm_get_ns(sock_net(meta_sk));
++ int i, index;
++ union inet_addr saddr, daddr;
++ sa_family_t family;
++ bool meta_v4 = meta_sk->sk_family == AF_INET;
++
++ /* Init local variables necessary for the rest */
++ if (meta_sk->sk_family == AF_INET || mptcp_v6_is_v4_mapped(meta_sk)) {
++ saddr.ip = inet_sk(meta_sk)->inet_saddr;
++ daddr.ip = inet_sk(meta_sk)->inet_daddr;
++ family = AF_INET;
++#if IS_ENABLED(CONFIG_IPV6)
++ } else {
++ saddr.in6 = inet6_sk(meta_sk)->saddr;
++ daddr.in6 = meta_sk->sk_v6_daddr;
++ family = AF_INET6;
++#endif
++ }
++
++ rcu_read_lock();
++ mptcp_local = rcu_dereference(fm_ns->local);
++
++ index = mptcp_find_address(mptcp_local, family, &saddr);
++ if (index < 0)
++ goto fallback;
++
++ full_mesh_add_raddr(mpcb, &daddr, family, 0, 0);
++ mptcp_set_init_addr_bit(mpcb, &daddr, family, index);
++
++ /* Initialize workqueue-struct */
++ INIT_WORK(&fmp->subflow_work, create_subflow_worker);
++ INIT_DELAYED_WORK(&fmp->subflow_retry_work, retry_subflow_worker);
++ fmp->mpcb = mpcb;
++
++ if (!meta_v4 && inet6_sk(meta_sk)->ipv6only)
++ goto skip_ipv4;
++
++ /* Look for the address among the local addresses */
++ mptcp_for_each_bit_set(mptcp_local->loc4_bits, i) {
++ __be32 ifa_address = mptcp_local->locaddr4[i].addr.s_addr;
++
++ /* We do not need to announce the initial subflow's address again */
++ if (family == AF_INET && saddr.ip == ifa_address)
++ continue;
++
++ fmp->add_addr++;
++ mpcb->addr_signal = 1;
++ }
++
++skip_ipv4:
++#if IS_ENABLED(CONFIG_IPV6)
++ /* skip IPv6 addresses if meta-socket is IPv4 */
++ if (meta_v4)
++ goto skip_ipv6;
++
++ mptcp_for_each_bit_set(mptcp_local->loc6_bits, i) {
++ const struct in6_addr *ifa6 = &mptcp_local->locaddr6[i].addr;
++
++ /* We do not need to announce the initial subflow's address again */
++ if (family == AF_INET6 && ipv6_addr_equal(&saddr.in6, ifa6))
++ continue;
++
++ fmp->add_addr++;
++ mpcb->addr_signal = 1;
++ }
++
++skip_ipv6:
++#endif
++
++ rcu_read_unlock();
++
++ if (family == AF_INET)
++ fmp->announced_addrs_v4 |= (1 << index);
++ else
++ fmp->announced_addrs_v6 |= (1 << index);
++
++ for (i = fmp->add_addr; i && fmp->add_addr; i--)
++ tcp_send_ack(mpcb->master_sk);
++
++ return;
++
++fallback:
++ rcu_read_unlock();
++ mptcp_fallback_default(mpcb);
++ return;
++}
++
++static void full_mesh_create_subflows(struct sock *meta_sk)
++{
++ const struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
++
++ if (mpcb->infinite_mapping_snd || mpcb->infinite_mapping_rcv ||
++ mpcb->send_infinite_mapping ||
++ mpcb->server_side || sock_flag(meta_sk, SOCK_DEAD))
++ return;
++
++ if (mpcb->master_sk &&
++ !tcp_sk(mpcb->master_sk)->mptcp->fully_established)
++ return;
++
++ if (!work_pending(&fmp->subflow_work)) {
++ sock_hold(meta_sk);
++ queue_work(mptcp_wq, &fmp->subflow_work);
++ }
++}
++
++/* Called upon release_sock, if the socket was owned by the user during
++ * a path-management event.
++ */
++static void full_mesh_release_sock(struct sock *meta_sk)
++{
++ struct mptcp_loc_addr *mptcp_local;
++ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
++ const struct mptcp_fm_ns *fm_ns = fm_get_ns(sock_net(meta_sk));
++ struct sock *sk, *tmpsk;
++ bool meta_v4 = meta_sk->sk_family == AF_INET;
++ int i;
++
++ rcu_read_lock();
++ mptcp_local = rcu_dereference(fm_ns->local);
++
++ if (!meta_v4 && inet6_sk(meta_sk)->ipv6only)
++ goto skip_ipv4;
++
++ /* First, detect modifications or additions */
++ mptcp_for_each_bit_set(mptcp_local->loc4_bits, i) {
++ struct in_addr ifa = mptcp_local->locaddr4[i].addr;
++ bool found = false;
++
++ mptcp_for_each_sk(mpcb, sk) {
++ struct tcp_sock *tp = tcp_sk(sk);
++
++ if (sk->sk_family == AF_INET6 &&
++ !mptcp_v6_is_v4_mapped(sk))
++ continue;
++
++ if (inet_sk(sk)->inet_saddr != ifa.s_addr)
++ continue;
++
++ found = true;
++
++ if (mptcp_local->locaddr4[i].low_prio != tp->mptcp->low_prio) {
++ tp->mptcp->send_mp_prio = 1;
++ tp->mptcp->low_prio = mptcp_local->locaddr4[i].low_prio;
++
++ tcp_send_ack(sk);
++ }
++ }
++
++ if (!found) {
++ fmp->add_addr++;
++ mpcb->addr_signal = 1;
++
++ sk = mptcp_select_ack_sock(meta_sk);
++ if (sk)
++ tcp_send_ack(sk);
++ full_mesh_create_subflows(meta_sk);
++ }
++ }
++
++skip_ipv4:
++#if IS_ENABLED(CONFIG_IPV6)
++ /* skip IPv6 addresses if meta-socket is IPv4 */
++ if (meta_v4)
++ goto removal;
++
++ mptcp_for_each_bit_set(mptcp_local->loc6_bits, i) {
++ struct in6_addr ifa = mptcp_local->locaddr6[i].addr;
++ bool found = false;
++
++ mptcp_for_each_sk(mpcb, sk) {
++ struct tcp_sock *tp = tcp_sk(sk);
++
++ if (sk->sk_family == AF_INET ||
++ mptcp_v6_is_v4_mapped(sk))
++ continue;
++
++ if (!ipv6_addr_equal(&inet6_sk(sk)->saddr, &ifa))
++ continue;
++
++ found = true;
++
++ if (mptcp_local->locaddr6[i].low_prio != tp->mptcp->low_prio) {
++ tp->mptcp->send_mp_prio = 1;
++ tp->mptcp->low_prio = mptcp_local->locaddr6[i].low_prio;
++
++ tcp_send_ack(sk);
++ }
++ }
++
++ if (!found) {
++ fmp->add_addr++;
++ mpcb->addr_signal = 1;
++
++ sk = mptcp_select_ack_sock(meta_sk);
++ if (sk)
++ tcp_send_ack(sk);
++ full_mesh_create_subflows(meta_sk);
++ }
++ }
++
++removal:
++#endif
++
++ /* Now, detect address-removals */
++ mptcp_for_each_sk_safe(mpcb, sk, tmpsk) {
++ bool shall_remove = true;
++
++ if (sk->sk_family == AF_INET || mptcp_v6_is_v4_mapped(sk)) {
++ mptcp_for_each_bit_set(mptcp_local->loc4_bits, i) {
++ if (inet_sk(sk)->inet_saddr == mptcp_local->locaddr4[i].addr.s_addr) {
++ shall_remove = false;
++ break;
++ }
++ }
++ } else {
++ mptcp_for_each_bit_set(mptcp_local->loc6_bits, i) {
++ if (ipv6_addr_equal(&inet6_sk(sk)->saddr, &mptcp_local->locaddr6[i].addr)) {
++ shall_remove = false;
++ break;
++ }
++ }
++ }
++
++ if (shall_remove) {
++ /* Reinject, so that pf = 1 and so we
++ * won't select this one as the
++ * ack-sock.
++ */
++ mptcp_reinject_data(sk, 0);
++
++ announce_remove_addr(tcp_sk(sk)->mptcp->loc_id,
++ meta_sk);
++
++ mptcp_sub_force_close(sk);
++ }
++ }
++
++ /* Just call it optimistically. It actually cannot do any harm */
++ update_addr_bitfields(meta_sk, mptcp_local);
++
++ rcu_read_unlock();
++}
++
++static int full_mesh_get_local_id(sa_family_t family, union inet_addr *addr,
++ struct net *net, bool *low_prio)
++{
++ struct mptcp_loc_addr *mptcp_local;
++ const struct mptcp_fm_ns *fm_ns = fm_get_ns(net);
++ int index, id = -1;
++
++ /* Handle the backup-flows */
++ rcu_read_lock();
++ mptcp_local = rcu_dereference(fm_ns->local);
++
++ index = mptcp_find_address(mptcp_local, family, addr);
++
++ if (index != -1) {
++ if (family == AF_INET) {
++ id = mptcp_local->locaddr4[index].loc4_id;
++ *low_prio = mptcp_local->locaddr4[index].low_prio;
++ } else {
++ id = mptcp_local->locaddr6[index].loc6_id;
++ *low_prio = mptcp_local->locaddr6[index].low_prio;
++ }
++ }
++
++
++ rcu_read_unlock();
++
++ return id;
++}
++
++static void full_mesh_addr_signal(struct sock *sk, unsigned *size,
++ struct tcp_out_options *opts,
++ struct sk_buff *skb)
++{
++ const struct tcp_sock *tp = tcp_sk(sk);
++ struct mptcp_cb *mpcb = tp->mpcb;
++ struct sock *meta_sk = mpcb->meta_sk;
++ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
++ struct mptcp_loc_addr *mptcp_local;
++ struct mptcp_fm_ns *fm_ns = fm_get_ns(sock_net(sk));
++ int remove_addr_len;
++ u8 unannouncedv4 = 0, unannouncedv6 = 0;
++ bool meta_v4 = meta_sk->sk_family == AF_INET;
++
++ mpcb->addr_signal = 0;
++
++ if (likely(!fmp->add_addr))
++ goto remove_addr;
++
++ rcu_read_lock();
++ mptcp_local = rcu_dereference(fm_ns->local);
++
++ if (!meta_v4 && inet6_sk(meta_sk)->ipv6only)
++ goto skip_ipv4;
++
++ /* IPv4 */
++ unannouncedv4 = (~fmp->announced_addrs_v4) & mptcp_local->loc4_bits;
++ if (unannouncedv4 &&
++ MAX_TCP_OPTION_SPACE - *size >= MPTCP_SUB_LEN_ADD_ADDR4_ALIGN) {
++ int ind = mptcp_find_free_index(~unannouncedv4);
++
++ opts->options |= OPTION_MPTCP;
++ opts->mptcp_options |= OPTION_ADD_ADDR;
++ opts->add_addr4.addr_id = mptcp_local->locaddr4[ind].loc4_id;
++ opts->add_addr4.addr = mptcp_local->locaddr4[ind].addr;
++ opts->add_addr_v4 = 1;
++
++ if (skb) {
++ fmp->announced_addrs_v4 |= (1 << ind);
++ fmp->add_addr--;
++ }
++ *size += MPTCP_SUB_LEN_ADD_ADDR4_ALIGN;
++ }
++
++ if (meta_v4)
++ goto skip_ipv6;
++
++skip_ipv4:
++ /* IPv6 */
++ unannouncedv6 = (~fmp->announced_addrs_v6) & mptcp_local->loc6_bits;
++ if (unannouncedv6 &&
++ MAX_TCP_OPTION_SPACE - *size >= MPTCP_SUB_LEN_ADD_ADDR6_ALIGN) {
++ int ind = mptcp_find_free_index(~unannouncedv6);
++
++ opts->options |= OPTION_MPTCP;
++ opts->mptcp_options |= OPTION_ADD_ADDR;
++ opts->add_addr6.addr_id = mptcp_local->locaddr6[ind].loc6_id;
++ opts->add_addr6.addr = mptcp_local->locaddr6[ind].addr;
++ opts->add_addr_v6 = 1;
++
++ if (skb) {
++ fmp->announced_addrs_v6 |= (1 << ind);
++ fmp->add_addr--;
++ }
++ *size += MPTCP_SUB_LEN_ADD_ADDR6_ALIGN;
++ }
++
++skip_ipv6:
++ rcu_read_unlock();
++
++ if (!unannouncedv4 && !unannouncedv6 && skb)
++ fmp->add_addr--;
++
++remove_addr:
++ if (likely(!fmp->remove_addrs))
++ goto exit;
++
++ remove_addr_len = mptcp_sub_len_remove_addr_align(fmp->remove_addrs);
++ if (MAX_TCP_OPTION_SPACE - *size < remove_addr_len)
++ goto exit;
++
++ opts->options |= OPTION_MPTCP;
++ opts->mptcp_options |= OPTION_REMOVE_ADDR;
++ opts->remove_addrs = fmp->remove_addrs;
++ *size += remove_addr_len;
++ if (skb)
++ fmp->remove_addrs = 0;
++
++exit:
++ mpcb->addr_signal = !!(fmp->add_addr || fmp->remove_addrs);
++}
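The ADD_ADDR selection in `full_mesh_addr_signal()` reduces to bitmask arithmetic: an address is still to be advertised if it is configured locally (`loc4_bits`/`loc6_bits`) but its bit is not yet set in the per-connection announced mask. A standalone sketch with illustrative names:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper mirroring the computation
 *   unannouncedv4 = (~fmp->announced_addrs_v4) & mptcp_local->loc4_bits;
 * returns the mask of local addresses not yet announced. */
static uint8_t unannounced(uint8_t announced, uint8_t loc_bits)
{
	return (uint8_t)(~announced & loc_bits);
}
```

`mptcp_find_free_index(~unannouncedv4)` then picks one set bit out of that mask to build the option for this segment.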
++
++static void full_mesh_rem_raddr(struct mptcp_cb *mpcb, u8 rem_id)
++{
++ mptcp_v4_rem_raddress(mpcb, rem_id);
++ mptcp_v6_rem_raddress(mpcb, rem_id);
++}
++
++/* Output /proc/net/mptcp_fullmesh */
++static int mptcp_fm_seq_show(struct seq_file *seq, void *v)
++{
++ const struct net *net = seq->private;
++ struct mptcp_loc_addr *mptcp_local;
++ const struct mptcp_fm_ns *fm_ns = fm_get_ns(net);
++ int i;
++
++ seq_printf(seq, "Index, Address-ID, Backup, IP-address\n");
++
++ rcu_read_lock_bh();
++ mptcp_local = rcu_dereference(fm_ns->local);
++
++ seq_printf(seq, "IPv4, next v4-index: %u\n", mptcp_local->next_v4_index);
++
++ mptcp_for_each_bit_set(mptcp_local->loc4_bits, i) {
++ struct mptcp_loc4 *loc4 = &mptcp_local->locaddr4[i];
++
++ seq_printf(seq, "%u, %u, %u, %pI4\n", i, loc4->loc4_id,
++ loc4->low_prio, &loc4->addr);
++ }
++
++ seq_printf(seq, "IPv6, next v6-index: %u\n", mptcp_local->next_v6_index);
++
++ mptcp_for_each_bit_set(mptcp_local->loc6_bits, i) {
++ struct mptcp_loc6 *loc6 = &mptcp_local->locaddr6[i];
++
++ seq_printf(seq, "%u, %u, %u, %pI6\n", i, loc6->loc6_id,
++ loc6->low_prio, &loc6->addr);
++ }
++ rcu_read_unlock_bh();
++
++ return 0;
++}
++
++static int mptcp_fm_seq_open(struct inode *inode, struct file *file)
++{
++ return single_open_net(inode, file, mptcp_fm_seq_show);
++}
++
++static const struct file_operations mptcp_fm_seq_fops = {
++ .owner = THIS_MODULE,
++ .open = mptcp_fm_seq_open,
++ .read = seq_read,
++ .llseek = seq_lseek,
++ .release = single_release_net,
++};
++
++static int mptcp_fm_init_net(struct net *net)
++{
++ struct mptcp_loc_addr *mptcp_local;
++ struct mptcp_fm_ns *fm_ns;
++ int err = 0;
++
++ fm_ns = kzalloc(sizeof(*fm_ns), GFP_KERNEL);
++ if (!fm_ns)
++ return -ENOBUFS;
++
++ mptcp_local = kzalloc(sizeof(*mptcp_local), GFP_KERNEL);
++ if (!mptcp_local) {
++ err = -ENOBUFS;
++ goto err_mptcp_local;
++ }
++
++ if (!proc_create("mptcp_fullmesh", S_IRUGO, net->proc_net,
++ &mptcp_fm_seq_fops)) {
++ err = -ENOMEM;
++ goto err_seq_fops;
++ }
++
++ mptcp_local->next_v4_index = 1;
++
++ rcu_assign_pointer(fm_ns->local, mptcp_local);
++ INIT_DELAYED_WORK(&fm_ns->address_worker, mptcp_address_worker);
++ INIT_LIST_HEAD(&fm_ns->events);
++ spin_lock_init(&fm_ns->local_lock);
++ fm_ns->net = net;
++ net->mptcp.path_managers[MPTCP_PM_FULLMESH] = fm_ns;
++
++ return 0;
++err_seq_fops:
++ kfree(mptcp_local);
++err_mptcp_local:
++ kfree(fm_ns);
++ return err;
++}
++
++static void mptcp_fm_exit_net(struct net *net)
++{
++ struct mptcp_addr_event *eventq, *tmp;
++ struct mptcp_fm_ns *fm_ns;
++ struct mptcp_loc_addr *mptcp_local;
++
++ fm_ns = fm_get_ns(net);
++ cancel_delayed_work_sync(&fm_ns->address_worker);
++
++ rcu_read_lock_bh();
++
++ mptcp_local = rcu_dereference_bh(fm_ns->local);
++ kfree(mptcp_local);
++
++ spin_lock(&fm_ns->local_lock);
++ list_for_each_entry_safe(eventq, tmp, &fm_ns->events, list) {
++ list_del(&eventq->list);
++ kfree(eventq);
++ }
++ spin_unlock(&fm_ns->local_lock);
++
++ rcu_read_unlock_bh();
++
++ remove_proc_entry("mptcp_fullmesh", net->proc_net);
++
++ kfree(fm_ns);
++}
++
++static struct pernet_operations full_mesh_net_ops = {
++ .init = mptcp_fm_init_net,
++ .exit = mptcp_fm_exit_net,
++};
++
++static struct mptcp_pm_ops full_mesh __read_mostly = {
++ .new_session = full_mesh_new_session,
++ .release_sock = full_mesh_release_sock,
++ .fully_established = full_mesh_create_subflows,
++ .new_remote_address = full_mesh_create_subflows,
++ .get_local_id = full_mesh_get_local_id,
++ .addr_signal = full_mesh_addr_signal,
++ .add_raddr = full_mesh_add_raddr,
++ .rem_raddr = full_mesh_rem_raddr,
++ .name = "fullmesh",
++ .owner = THIS_MODULE,
++};
++
++/* General initialization of MPTCP_PM */
++static int __init full_mesh_register(void)
++{
++ int ret;
++
++ BUILD_BUG_ON(sizeof(struct fullmesh_priv) > MPTCP_PM_SIZE);
++
++ ret = register_pernet_subsys(&full_mesh_net_ops);
++ if (ret)
++ goto out;
++
++ ret = register_inetaddr_notifier(&mptcp_pm_inetaddr_notifier);
++ if (ret)
++ goto err_reg_inetaddr;
++ ret = register_netdevice_notifier(&mptcp_pm_netdev_notifier);
++ if (ret)
++ goto err_reg_netdev;
++
++#if IS_ENABLED(CONFIG_IPV6)
++ ret = register_inet6addr_notifier(&inet6_addr_notifier);
++ if (ret)
++ goto err_reg_inet6addr;
++#endif
++
++ ret = mptcp_register_path_manager(&full_mesh);
++ if (ret)
++ goto err_reg_pm;
++
++out:
++ return ret;
++
++
++err_reg_pm:
++#if IS_ENABLED(CONFIG_IPV6)
++ unregister_inet6addr_notifier(&inet6_addr_notifier);
++err_reg_inet6addr:
++#endif
++ unregister_netdevice_notifier(&mptcp_pm_netdev_notifier);
++err_reg_netdev:
++ unregister_inetaddr_notifier(&mptcp_pm_inetaddr_notifier);
++err_reg_inetaddr:
++ unregister_pernet_subsys(&full_mesh_net_ops);
++ goto out;
++}
++
++static void full_mesh_unregister(void)
++{
++#if IS_ENABLED(CONFIG_IPV6)
++ unregister_inet6addr_notifier(&inet6_addr_notifier);
++#endif
++ unregister_netdevice_notifier(&mptcp_pm_netdev_notifier);
++ unregister_inetaddr_notifier(&mptcp_pm_inetaddr_notifier);
++ unregister_pernet_subsys(&full_mesh_net_ops);
++ mptcp_unregister_path_manager(&full_mesh);
++}
++
++module_init(full_mesh_register);
++module_exit(full_mesh_unregister);
++
++MODULE_AUTHOR("Christoph Paasch");
++MODULE_LICENSE("GPL");
++MODULE_DESCRIPTION("Full-Mesh MPTCP");
++MODULE_VERSION("0.88");
+diff --git a/net/mptcp/mptcp_input.c b/net/mptcp/mptcp_input.c
+new file mode 100644
+index 000000000000..43704ccb639e
+--- /dev/null
++++ b/net/mptcp/mptcp_input.c
+@@ -0,0 +1,2405 @@
++/*
++ * MPTCP implementation - Sending side
++ *
++ * Initial Design & Implementation:
++ * Sébastien Barré <sebastien.barre@uclouvain.be>
++ *
++ * Current Maintainer & Author:
++ * Christoph Paasch <christoph.paasch@uclouvain.be>
++ *
++ * Additional authors:
++ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
++ * Gregory Detal <gregory.detal@uclouvain.be>
++ * Fabien Duchêne <fabien.duchene@uclouvain.be>
++ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
++ * Lavkesh Lahngir <lavkesh51@gmail.com>
++ * Andreas Ripke <ripke@neclab.eu>
++ * Vlad Dogaru <vlad.dogaru@intel.com>
++ * Octavian Purdila <octavian.purdila@intel.com>
++ * John Ronan <jronan@tssg.org>
++ * Catalin Nicutar <catalin.nicutar@gmail.com>
++ * Brandon Heller <brandonh@stanford.edu>
++ *
++ *
++ * This program is free software; you can redistribute it and/or
++ * modify it under the terms of the GNU General Public License
++ * as published by the Free Software Foundation; either version
++ * 2 of the License, or (at your option) any later version.
++ */
++
++#include <asm/unaligned.h>
++
++#include <net/mptcp.h>
++#include <net/mptcp_v4.h>
++#include <net/mptcp_v6.h>
++
++#include <linux/kconfig.h>
++
++/* is seq1 < seq2 ? */
++static inline bool before64(const u64 seq1, const u64 seq2)
++{
++ return (s64)(seq1 - seq2) < 0;
++}
++
++/* is seq1 > seq2 ? */
++#define after64(seq1, seq2) before64(seq2, seq1)
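`before64()` above is the standard wrap-safe sequence comparison: the unsigned subtraction wraps modulo 2^64, so casting the difference to a signed type orders two sequence numbers correctly even across a wrap, as long as they lie within 2^63 of each other. The same trick in a standalone form:

```c
#include <assert.h>
#include <stdint.h>

/* Wrap-safe "is seq1 < seq2?" on 64-bit sequence space, as in the
 * patch's before64()/after64(). */
static int before64(uint64_t seq1, uint64_t seq2)
{
	return (int64_t)(seq1 - seq2) < 0;
}

#define after64(seq1, seq2) before64(seq2, seq1)
```

Note that `UINT64_MAX` compares as "before" `0` here, which is exactly the behavior wanted for a sequence space that wraps.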
++
++static inline void mptcp_become_fully_estab(struct sock *sk)
++{
++ tcp_sk(sk)->mptcp->fully_established = 1;
++
++ if (is_master_tp(tcp_sk(sk)) &&
++ tcp_sk(sk)->mpcb->pm_ops->fully_established)
++ tcp_sk(sk)->mpcb->pm_ops->fully_established(mptcp_meta_sk(sk));
++}
++
++/* Similar to tcp_tso_acked without any memory accounting */
++static inline int mptcp_tso_acked_reinject(const struct sock *meta_sk,
++ struct sk_buff *skb)
++{
++ const struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ u32 packets_acked, len;
++
++ BUG_ON(!after(TCP_SKB_CB(skb)->end_seq, meta_tp->snd_una));
++
++ packets_acked = tcp_skb_pcount(skb);
++
++ if (skb_unclone(skb, GFP_ATOMIC))
++ return 0;
++
++ len = meta_tp->snd_una - TCP_SKB_CB(skb)->seq;
++ __pskb_trim_head(skb, len);
++
++ TCP_SKB_CB(skb)->seq += len;
++ skb->ip_summed = CHECKSUM_PARTIAL;
++ skb->truesize -= len;
++
++ /* Any change of skb->len requires recalculation of tso factor. */
++ if (tcp_skb_pcount(skb) > 1)
++ tcp_set_skb_tso_segs(meta_sk, skb, tcp_skb_mss(skb));
++ packets_acked -= tcp_skb_pcount(skb);
++
++ if (packets_acked) {
++ BUG_ON(tcp_skb_pcount(skb) == 0);
++ BUG_ON(!before(TCP_SKB_CB(skb)->seq, TCP_SKB_CB(skb)->end_seq));
++ }
++
++ return packets_acked;
++}
++
++/**
++ * Cleans the meta-socket retransmission queue and the reinject-queue.
++ * @meta_sk must be the meta-socket.
++ */
++static void mptcp_clean_rtx_queue(struct sock *meta_sk, u32 prior_snd_una)
++{
++ struct sk_buff *skb, *tmp;
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct mptcp_cb *mpcb = meta_tp->mpcb;
++ bool acked = false;
++ u32 acked_pcount;
++
++ while ((skb = tcp_write_queue_head(meta_sk)) &&
++ skb != tcp_send_head(meta_sk)) {
++ bool fully_acked = true;
++
++ if (before(meta_tp->snd_una, TCP_SKB_CB(skb)->end_seq)) {
++ if (tcp_skb_pcount(skb) == 1 ||
++ !after(meta_tp->snd_una, TCP_SKB_CB(skb)->seq))
++ break;
++
++ acked_pcount = tcp_tso_acked(meta_sk, skb);
++ if (!acked_pcount)
++ break;
++
++ fully_acked = false;
++ } else {
++ acked_pcount = tcp_skb_pcount(skb);
++ }
++
++ acked = true;
++ meta_tp->packets_out -= acked_pcount;
++ meta_tp->retrans_stamp = 0;
++
++ if (!fully_acked)
++ break;
++
++ tcp_unlink_write_queue(skb, meta_sk);
++
++ if (mptcp_is_data_fin(skb)) {
++ struct sock *sk_it;
++
++ /* DATA_FIN has been acknowledged - now we can close
++ * the subflows
++ */
++ mptcp_for_each_sk(mpcb, sk_it) {
++ unsigned long delay = 0;
++
++ /* If we are the passive closer, don't trigger
++ * subflow-fin until the subflow has been finned
++ * by the peer - thus we add a delay.
++ */
++ if (mpcb->passive_close &&
++ sk_it->sk_state == TCP_ESTABLISHED)
++ delay = inet_csk(sk_it)->icsk_rto << 3;
++
++ mptcp_sub_close(sk_it, delay);
++ }
++ }
++ sk_wmem_free_skb(meta_sk, skb);
++ }
++ /* Remove acknowledged data from the reinject queue */
++ skb_queue_walk_safe(&mpcb->reinject_queue, skb, tmp) {
++ if (before(meta_tp->snd_una, TCP_SKB_CB(skb)->end_seq)) {
++ if (tcp_skb_pcount(skb) == 1 ||
++ !after(meta_tp->snd_una, TCP_SKB_CB(skb)->seq))
++ break;
++
++ mptcp_tso_acked_reinject(meta_sk, skb);
++ break;
++ }
++
++ __skb_unlink(skb, &mpcb->reinject_queue);
++ __kfree_skb(skb);
++ }
++
++ if (likely(between(meta_tp->snd_up, prior_snd_una, meta_tp->snd_una)))
++ meta_tp->snd_up = meta_tp->snd_una;
++
++ if (acked) {
++ tcp_rearm_rto(meta_sk);
++ /* Normally this is done in tcp_try_undo_loss - but MPTCP
++ * does not call this function.
++ */
++ inet_csk(meta_sk)->icsk_retransmits = 0;
++ }
++}
++
++/* Inspired by tcp_rcv_state_process */
++static int mptcp_rcv_state_process(struct sock *meta_sk, struct sock *sk,
++ const struct sk_buff *skb, u32 data_seq,
++ u16 data_len)
++{
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk), *tp = tcp_sk(sk);
++ const struct tcphdr *th = tcp_hdr(skb);
++
++ /* State-machine handling if FIN has been enqueued and it has
++ * been acked (snd_una == write_seq) - it's important that this
++ * here is after sk_wmem_free_skb because otherwise
++ * sk_forward_alloc is wrong upon inet_csk_destroy_sock()
++ */
++ switch (meta_sk->sk_state) {
++ case TCP_FIN_WAIT1: {
++ struct dst_entry *dst;
++ int tmo;
++
++ if (meta_tp->snd_una != meta_tp->write_seq)
++ break;
++
++ tcp_set_state(meta_sk, TCP_FIN_WAIT2);
++ meta_sk->sk_shutdown |= SEND_SHUTDOWN;
++
++ dst = __sk_dst_get(sk);
++ if (dst)
++ dst_confirm(dst);
++
++ if (!sock_flag(meta_sk, SOCK_DEAD)) {
++ /* Wake up lingering close() */
++ meta_sk->sk_state_change(meta_sk);
++ break;
++ }
++
++ if (meta_tp->linger2 < 0 ||
++ (data_len &&
++ after(data_seq + data_len - (mptcp_is_data_fin2(skb, tp) ? 1 : 0),
++ meta_tp->rcv_nxt))) {
++ mptcp_send_active_reset(meta_sk, GFP_ATOMIC);
++ tcp_done(meta_sk);
++ NET_INC_STATS_BH(sock_net(meta_sk), LINUX_MIB_TCPABORTONDATA);
++ return 1;
++ }
++
++ tmo = tcp_fin_time(meta_sk);
++ if (tmo > TCP_TIMEWAIT_LEN) {
++ inet_csk_reset_keepalive_timer(meta_sk, tmo - TCP_TIMEWAIT_LEN);
++ } else if (mptcp_is_data_fin2(skb, tp) || sock_owned_by_user(meta_sk)) {
++ /* Bad case. We could lose such FIN otherwise.
++ * It is not a big problem, but it looks confusing
++ * and not so rare event. We still can lose it now,
++ * if it spins in bh_lock_sock(), but it is really
++ * marginal case.
++ */
++ inet_csk_reset_keepalive_timer(meta_sk, tmo);
++ } else {
++ meta_tp->ops->time_wait(meta_sk, TCP_FIN_WAIT2, tmo);
++ }
++ break;
++ }
++ case TCP_CLOSING:
++ case TCP_LAST_ACK:
++ if (meta_tp->snd_una == meta_tp->write_seq) {
++ tcp_done(meta_sk);
++ return 1;
++ }
++ break;
++ }
++
++ /* step 7: process the segment text */
++ switch (meta_sk->sk_state) {
++ case TCP_FIN_WAIT1:
++ case TCP_FIN_WAIT2:
++ /* RFC 793 says to queue data in these states,
++ * RFC 1122 says we MUST send a reset.
++ * BSD 4.4 also does reset.
++ */
++ if (meta_sk->sk_shutdown & RCV_SHUTDOWN) {
++ if (TCP_SKB_CB(skb)->end_seq != TCP_SKB_CB(skb)->seq &&
++ after(TCP_SKB_CB(skb)->end_seq - th->fin, tp->rcv_nxt) &&
++ !mptcp_is_data_fin2(skb, tp)) {
++ NET_INC_STATS_BH(sock_net(meta_sk), LINUX_MIB_TCPABORTONDATA);
++ mptcp_send_active_reset(meta_sk, GFP_ATOMIC);
++ tcp_reset(meta_sk);
++ return 1;
++ }
++ }
++ break;
++ }
++
++ return 0;
++}
++
++/**
++ * @return:
++ * i) 1: Everything's fine.
++ * ii) -1: A reset has been sent on the subflow - csum-failure
++ * iii) 0: csum-failure but no reset sent, because it's the last subflow.
++ * Last packet should not be destroyed by the caller because it has
++ * been done here.
++ */
++static int mptcp_verif_dss_csum(struct sock *sk)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct sk_buff *tmp, *tmp1, *last = NULL;
++ __wsum csum_tcp = 0; /* cumulative checksum of pld + mptcp-header */
++ int ans = 1, overflowed = 0, offset = 0, dss_csum_added = 0;
++ int iter = 0;
++
++ skb_queue_walk_safe(&sk->sk_receive_queue, tmp, tmp1) {
++ unsigned int csum_len;
++
++ if (before(tp->mptcp->map_subseq + tp->mptcp->map_data_len, TCP_SKB_CB(tmp)->end_seq))
++ /* Mapping ends in the middle of the packet -
++ * csum only these bytes
++ */
++ csum_len = tp->mptcp->map_subseq + tp->mptcp->map_data_len - TCP_SKB_CB(tmp)->seq;
++ else
++ csum_len = tmp->len;
++
++ offset = 0;
++ if (overflowed) {
++ char first_word[4];
++ first_word[0] = 0;
++ first_word[1] = 0;
++ first_word[2] = 0;
++ first_word[3] = *(tmp->data);
++ csum_tcp = csum_partial(first_word, 4, csum_tcp);
++ offset = 1;
++ csum_len--;
++ overflowed = 0;
++ }
++
++ csum_tcp = skb_checksum(tmp, offset, csum_len, csum_tcp);
++
++ /* Was it an odd length? Then we have to merge the next byte
++ * correctly (see above)
++ */
++ if (csum_len != (csum_len & (~1)))
++ overflowed = 1;
++
++ if (mptcp_is_data_seq(tmp) && !dss_csum_added) {
++ __be32 data_seq = htonl((u32)(tp->mptcp->map_data_seq >> 32));
++
++ /* If a 64-bit dss is present, we increase the offset
++ * by 4 bytes, as the high-order 64-bits will be added
++ * in the final csum_partial-call.
++ */
++ u32 offset = skb_transport_offset(tmp) +
++ TCP_SKB_CB(tmp)->dss_off;
++ if (TCP_SKB_CB(tmp)->mptcp_flags & MPTCPHDR_SEQ64_SET)
++ offset += 4;
++
++ csum_tcp = skb_checksum(tmp, offset,
++ MPTCP_SUB_LEN_SEQ_CSUM,
++ csum_tcp);
++
++ csum_tcp = csum_partial(&data_seq,
++ sizeof(data_seq), csum_tcp);
++
++ dss_csum_added = 1; /* Just do it once */
++ }
++ last = tmp;
++ iter++;
++
++ if (!skb_queue_is_last(&sk->sk_receive_queue, tmp) &&
++ !before(TCP_SKB_CB(tmp1)->seq,
++ tp->mptcp->map_subseq + tp->mptcp->map_data_len))
++ break;
++ }
++
++ /* Now, checksum must be 0 */
++ if (unlikely(csum_fold(csum_tcp))) {
++ pr_err("%s csum is wrong: %#x data_seq %u dss_csum_added %d overflowed %d iterations %d\n",
++ __func__, csum_fold(csum_tcp), TCP_SKB_CB(last)->seq,
++ dss_csum_added, overflowed, iter);
++
++ tp->mptcp->send_mp_fail = 1;
++
++ /* map_data_seq is the data-seq number of the
++ * mapping we are currently checking
++ */
++ tp->mpcb->csum_cutoff_seq = tp->mptcp->map_data_seq;
++
++ if (tp->mpcb->cnt_subflows > 1) {
++ mptcp_send_reset(sk);
++ ans = -1;
++ } else {
++ tp->mpcb->send_infinite_mapping = 1;
++
++ /* Need to purge the rcv-queue as it's no more valid */
++ while ((tmp = __skb_dequeue(&sk->sk_receive_queue)) != NULL) {
++ tp->copied_seq = TCP_SKB_CB(tmp)->end_seq;
++ kfree_skb(tmp);
++ }
++
++ ans = 0;
++ }
++ }
++
++ return ans;
++}
++
++static inline void mptcp_prepare_skb(struct sk_buff *skb,
++ const struct sock *sk)
++{
++ const struct tcp_sock *tp = tcp_sk(sk);
++ struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
++ u32 inc = 0;
++
++ /* If skb is the end of this mapping (end is always at mapping-boundary
++ * thanks to the splitting/trimming), then we need to increase
++ * data-end-seq by 1 if this here is a data-fin.
++ *
++ * We need to do -1 because end_seq includes the subflow-FIN.
++ */
++ if (tp->mptcp->map_data_fin &&
++ (tcb->end_seq - (tcp_hdr(skb)->fin ? 1 : 0)) ==
++ (tp->mptcp->map_subseq + tp->mptcp->map_data_len)) {
++ inc = 1;
++
++ /* We manually set the fin-flag if it is a data-fin. For easy
++ * processing in tcp_recvmsg.
++ */
++ tcp_hdr(skb)->fin = 1;
++ } else {
++ /* We may have a subflow-fin with data but without data-fin */
++ tcp_hdr(skb)->fin = 0;
++ }
++
++ /* Adapt data-seq's to the packet itself. We effectively transform the
++ * dss-mapping to a per-packet granularity. This is necessary to
++ * correctly handle overlapping mappings coming from different
++ * subflows. Otherwise it would be a complete mess.
++ */
++ tcb->seq = ((u32)tp->mptcp->map_data_seq) + tcb->seq - tp->mptcp->map_subseq;
++ tcb->end_seq = tcb->seq + skb->len + inc;
++}
++
++/**
++ * @return: 1 if the segment has been eaten and can be suppressed,
++ * otherwise 0.
++ */
++static inline int mptcp_direct_copy(const struct sk_buff *skb,
++ struct sock *meta_sk)
++{
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ int chunk = min_t(unsigned int, skb->len, meta_tp->ucopy.len);
++ int eaten = 0;
++
++ __set_current_state(TASK_RUNNING);
++
++ local_bh_enable();
++ if (!skb_copy_datagram_iovec(skb, 0, meta_tp->ucopy.iov, chunk)) {
++ meta_tp->ucopy.len -= chunk;
++ meta_tp->copied_seq += chunk;
++ eaten = (chunk == skb->len);
++ tcp_rcv_space_adjust(meta_sk);
++ }
++ local_bh_disable();
++ return eaten;
++}
++
++static inline void mptcp_reset_mapping(struct tcp_sock *tp)
++{
++ tp->mptcp->map_data_len = 0;
++ tp->mptcp->map_data_seq = 0;
++ tp->mptcp->map_subseq = 0;
++ tp->mptcp->map_data_fin = 0;
++ tp->mptcp->mapping_present = 0;
++}
++
++/* The DSS-mapping received on the sk only covers the second half of the skb
++ * (cut at seq). We trim the head from the skb.
++ * Data will be freed upon kfree().
++ *
++ * Inspired by tcp_trim_head().
++ */
++static void mptcp_skb_trim_head(struct sk_buff *skb, struct sock *sk, u32 seq)
++{
++ int len = seq - TCP_SKB_CB(skb)->seq;
++ u32 new_seq = TCP_SKB_CB(skb)->seq + len;
++
++ if (len < skb_headlen(skb))
++ __skb_pull(skb, len);
++ else
++ __pskb_trim_head(skb, len - skb_headlen(skb));
++
++ TCP_SKB_CB(skb)->seq = new_seq;
++
++ skb->truesize -= len;
++ atomic_sub(len, &sk->sk_rmem_alloc);
++ sk_mem_uncharge(sk, len);
++}
++
++/* The DSS-mapping received on the sk only covers the first half of the skb
++ * (cut at seq). We create a second skb (@return), and queue it in the rcv-queue
++ * as further packets may resolve the mapping of the second half of data.
++ *
++ * Inspired by tcp_fragment().
++ */
++static int mptcp_skb_split_tail(struct sk_buff *skb, struct sock *sk, u32 seq)
++{
++ struct sk_buff *buff;
++ int nsize;
++ int nlen, len;
++
++ len = seq - TCP_SKB_CB(skb)->seq;
++ nsize = skb_headlen(skb) - len + tcp_sk(sk)->tcp_header_len;
++ if (nsize < 0)
++ nsize = 0;
++
++ /* Get a new skb... force flag on. */
++ buff = alloc_skb(nsize, GFP_ATOMIC);
++ if (buff == NULL)
++ return -ENOMEM;
++
++ skb_reserve(buff, tcp_sk(sk)->tcp_header_len);
++ skb_reset_transport_header(buff);
++
++ tcp_hdr(buff)->fin = tcp_hdr(skb)->fin;
++ tcp_hdr(skb)->fin = 0;
++
++ /* We absolutely need to call skb_set_owner_r before refreshing the
++ * truesize of buff, otherwise the moved data will be accounted twice.
++ */
++ skb_set_owner_r(buff, sk);
++ nlen = skb->len - len - nsize;
++ buff->truesize += nlen;
++ skb->truesize -= nlen;
++
++ /* Correct the sequence numbers. */
++ TCP_SKB_CB(buff)->seq = TCP_SKB_CB(skb)->seq + len;
++ TCP_SKB_CB(buff)->end_seq = TCP_SKB_CB(skb)->end_seq;
++ TCP_SKB_CB(skb)->end_seq = TCP_SKB_CB(buff)->seq;
++
++ skb_split(skb, buff, len);
++
++ __skb_queue_after(&sk->sk_receive_queue, skb, buff);
++
++ return 0;
++}
++
++/* @return: 0 everything is fine. Just continue processing
++ * 1 subflow is broken stop everything
++ * -1 this packet was broken - continue with the next one.
++ */
++static int mptcp_prevalidate_skb(struct sock *sk, struct sk_buff *skb)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++
++ /* If we are in infinite mode, the subflow-fin is in fact a data-fin. */
++ if (!skb->len && tcp_hdr(skb)->fin && !mptcp_is_data_fin(skb) &&
++ !tp->mpcb->infinite_mapping_rcv) {
++ /* Remove a pure subflow-fin from the queue and increase
++ * copied_seq.
++ */
++ tp->copied_seq = TCP_SKB_CB(skb)->end_seq;
++ __skb_unlink(skb, &sk->sk_receive_queue);
++ __kfree_skb(skb);
++ return -1;
++ }
++
++ /* If we are not yet fully established and do not know the mapping for
++ * this segment, this path has to fallback to infinite or be torn down.
++ */
++ if (!tp->mptcp->fully_established && !mptcp_is_data_seq(skb) &&
++ !tp->mptcp->mapping_present && !tp->mpcb->infinite_mapping_rcv) {
++ pr_err("%s %#x will fallback - pi %d from %pS, seq %u\n",
++ __func__, tp->mpcb->mptcp_loc_token,
++ tp->mptcp->path_index, __builtin_return_address(0),
++ TCP_SKB_CB(skb)->seq);
++
++ if (!is_master_tp(tp)) {
++ mptcp_send_reset(sk);
++ return 1;
++ }
++
++ tp->mpcb->infinite_mapping_snd = 1;
++ tp->mpcb->infinite_mapping_rcv = 1;
++ /* We do a seamless fallback and should not send an infinite mapping. */
++ tp->mpcb->send_infinite_mapping = 0;
++ tp->mptcp->fully_established = 1;
++ }
++
++ /* Receiver-side becomes fully established when a whole rcv-window has
++ * been received without the need to fallback due to the previous
++ * condition.
++ */
++ if (!tp->mptcp->fully_established) {
++ tp->mptcp->init_rcv_wnd -= skb->len;
++ if (tp->mptcp->init_rcv_wnd < 0)
++ mptcp_become_fully_estab(sk);
++ }
++
++ return 0;
++}
++
++/* @return: 0 everything is fine. Just continue processing
++ * 1 subflow is broken stop everything
++ * -1 this packet was broken - continue with the next one.
++ */
++static int mptcp_detect_mapping(struct sock *sk, struct sk_buff *skb)
++{
++ struct tcp_sock *tp = tcp_sk(sk), *meta_tp = mptcp_meta_tp(tp);
++ struct mptcp_cb *mpcb = tp->mpcb;
++ struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
++ u32 *ptr;
++ u32 data_seq, sub_seq, data_len, tcp_end_seq;
++
++ /* If we are in infinite-mapping-mode, the subflow is guaranteed to be
++ * in-order at the data-level. Thus data-seq-numbers can be inferred
++ * from what is expected at the data-level.
++ */
++ if (mpcb->infinite_mapping_rcv) {
++ tp->mptcp->map_data_seq = mptcp_get_rcv_nxt_64(meta_tp);
++ tp->mptcp->map_subseq = tcb->seq;
++ tp->mptcp->map_data_len = skb->len;
++ tp->mptcp->map_data_fin = tcp_hdr(skb)->fin;
++ tp->mptcp->mapping_present = 1;
++ return 0;
++ }
++
++ /* No mapping here? Exit - it is either already set or still on its way */
++ if (!mptcp_is_data_seq(skb)) {
++ /* Too many packets without a mapping - this subflow is broken */
++ if (!tp->mptcp->mapping_present &&
++ tp->rcv_nxt - tp->copied_seq > 65536) {
++ mptcp_send_reset(sk);
++ return 1;
++ }
++
++ return 0;
++ }
++
++ ptr = mptcp_skb_set_data_seq(skb, &data_seq, mpcb);
++ ptr++;
++ sub_seq = get_unaligned_be32(ptr) + tp->mptcp->rcv_isn;
++ ptr++;
++ data_len = get_unaligned_be16(ptr);
++
++ /* If it's an empty skb with DATA_FIN, sub_seq must get fixed.
++ * The draft sets it to 0, but we really would like to have the
++ * real value, to allow easy handling afterwards here in this
++ * function.
++ */
++ if (mptcp_is_data_fin(skb) && skb->len == 0)
++ sub_seq = TCP_SKB_CB(skb)->seq;
++
++ /* If there is already a mapping - we check if it maps with the current
++ * one. If not - we reset.
++ */
++ if (tp->mptcp->mapping_present &&
++ (data_seq != (u32)tp->mptcp->map_data_seq ||
++ sub_seq != tp->mptcp->map_subseq ||
++ data_len != tp->mptcp->map_data_len + tp->mptcp->map_data_fin ||
++ mptcp_is_data_fin(skb) != tp->mptcp->map_data_fin)) {
++ /* Mapping in packet is different from what we want */
++ pr_err("%s Mappings do not match!\n", __func__);
++ pr_err("%s dseq %u mdseq %u, sseq %u msseq %u dlen %u mdlen %u dfin %d mdfin %d\n",
++ __func__, data_seq, (u32)tp->mptcp->map_data_seq,
++ sub_seq, tp->mptcp->map_subseq, data_len,
++ tp->mptcp->map_data_len, mptcp_is_data_fin(skb),
++ tp->mptcp->map_data_fin);
++ mptcp_send_reset(sk);
++ return 1;
++ }
++
++ /* If the previous check was good, the current mapping is valid and we exit. */
++ if (tp->mptcp->mapping_present)
++ return 0;
++
++ /* Mapping not yet set on this subflow - we set it here! */
++
++ if (!data_len) {
++ mpcb->infinite_mapping_rcv = 1;
++ tp->mptcp->fully_established = 1;
++ /* We need to repeat mp_fail's until the sender fell
++ * back to infinite-mapping - here we stop repeating it.
++ */
++ tp->mptcp->send_mp_fail = 0;
++
++ /* We have to fixup data_len - it must be the same as skb->len */
++ data_len = skb->len + (mptcp_is_data_fin(skb) ? 1 : 0);
++ sub_seq = tcb->seq;
++
++ /* TODO kill all other subflows than this one */
++ /* data_seq and so on are set correctly */
++
++ /* At this point, the meta-ofo-queue has to be emptied,
++ * as the following data is guaranteed to be in-order at
++ * the data and subflow-level
++ */
++ mptcp_purge_ofo_queue(meta_tp);
++ }
++
++ /* We are sending mp-fail's and thus are in fallback mode.
++ * Ignore packets which do not announce the fallback and still
++ * want to provide a mapping.
++ */
++ if (tp->mptcp->send_mp_fail) {
++ tp->copied_seq = TCP_SKB_CB(skb)->end_seq;
++ __skb_unlink(skb, &sk->sk_receive_queue);
++ __kfree_skb(skb);
++ return -1;
++ }
++
++ /* FIN increased the mapping-length by 1 */
++ if (mptcp_is_data_fin(skb))
++ data_len--;
++
++ /* Subflow-sequences of the packet must be
++ * (at least partially) part of the DSS-mapping's
++ * subflow-sequence-space.
++ *
++ * Basically the mapping is not valid, if either of the
++ * following conditions is true:
++ *
++ * 1. It's not a data_fin and
++ * MPTCP-sub_seq >= TCP-end_seq
++ *
++ * 2. It's a data_fin and TCP-end_seq > TCP-seq and
++ * MPTCP-sub_seq >= TCP-end_seq
++ *
++ * The previous two can be merged into:
++ * TCP-end_seq > TCP-seq and MPTCP-sub_seq >= TCP-end_seq
++ * Because if it's not a data-fin, TCP-end_seq > TCP-seq
++ *
++ * 3. It's a data_fin and skb->len == 0 and
++ * MPTCP-sub_seq > TCP-end_seq
++ *
++ * 4. It's not a data_fin and TCP-end_seq > TCP-seq and
++ * MPTCP-sub_seq + MPTCP-data_len <= TCP-seq
++ *
++ * 5. MPTCP-sub_seq is prior to what we already copied (copied_seq)
++ */
++
++ /* subflow-fin is not part of the mapping - ignore it here! */
++ tcp_end_seq = tcb->end_seq - tcp_hdr(skb)->fin;
++ if ((!before(sub_seq, tcb->end_seq) && after(tcp_end_seq, tcb->seq)) ||
++ (mptcp_is_data_fin(skb) && skb->len == 0 && after(sub_seq, tcb->end_seq)) ||
++ (!after(sub_seq + data_len, tcb->seq) && after(tcp_end_seq, tcb->seq)) ||
++ before(sub_seq, tp->copied_seq)) {
++ /* Subflow-sequences of packet is different from what is in the
++ * packet's dss-mapping. The peer is misbehaving - reset
++ */
++ pr_err("%s Packet's mapping does not map to the DSS sub_seq %u "
++ "end_seq %u, tcp_end_seq %u seq %u dfin %u len %u data_len %u "
++ "copied_seq %u\n", __func__, sub_seq, tcb->end_seq, tcp_end_seq, tcb->seq, mptcp_is_data_fin(skb),
++ skb->len, data_len, tp->copied_seq);
++ mptcp_send_reset(sk);
++ return 1;
++ }
++
++ /* Did the DSS have 64-bit seqnums? */
++ if (!(tcb->mptcp_flags & MPTCPHDR_SEQ64_SET)) {
++ /* Wrapped around? */
++ if (unlikely(after(data_seq, meta_tp->rcv_nxt) && data_seq < meta_tp->rcv_nxt)) {
++ tp->mptcp->map_data_seq = mptcp_get_data_seq_64(mpcb, !mpcb->rcv_hiseq_index, data_seq);
++ } else {
++ /* Else, access the default high-order bits */
++ tp->mptcp->map_data_seq = mptcp_get_data_seq_64(mpcb, mpcb->rcv_hiseq_index, data_seq);
++ }
++ } else {
++ tp->mptcp->map_data_seq = mptcp_get_data_seq_64(mpcb, (tcb->mptcp_flags & MPTCPHDR_SEQ64_INDEX) ? 1 : 0, data_seq);
++
++ if (unlikely(tcb->mptcp_flags & MPTCPHDR_SEQ64_OFO)) {
++ /* We make sure that the data_seq is invalid.
++ * It will be dropped later.
++ */
++ tp->mptcp->map_data_seq += 0xFFFFFFFF;
++ tp->mptcp->map_data_seq += 0xFFFFFFFF;
++ }
++ }
++
++ tp->mptcp->map_data_len = data_len;
++ tp->mptcp->map_subseq = sub_seq;
++ tp->mptcp->map_data_fin = mptcp_is_data_fin(skb) ? 1 : 0;
++ tp->mptcp->mapping_present = 1;
++
++ return 0;
++}
++
++/* Similar to tcp_sequence(...) */
++static inline bool mptcp_sequence(const struct tcp_sock *meta_tp,
++ u64 data_seq, u64 end_data_seq)
++{
++ const struct mptcp_cb *mpcb = meta_tp->mpcb;
++ u64 rcv_wup64;
++
++ /* Wrap-around? */
++ if (meta_tp->rcv_wup > meta_tp->rcv_nxt) {
++ rcv_wup64 = ((u64)(mpcb->rcv_high_order[mpcb->rcv_hiseq_index] - 1) << 32) |
++ meta_tp->rcv_wup;
++ } else {
++ rcv_wup64 = mptcp_get_data_seq_64(mpcb, mpcb->rcv_hiseq_index,
++ meta_tp->rcv_wup);
++ }
++
++ return !before64(end_data_seq, rcv_wup64) &&
++ !after64(data_seq, mptcp_get_rcv_nxt_64(meta_tp) + tcp_receive_window(meta_tp));
++}
++
++/* @return: 0 everything is fine. Just continue processing
++ * -1 this packet was broken - continue with the next one.
++ */
++static int mptcp_validate_mapping(struct sock *sk, struct sk_buff *skb)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct sk_buff *tmp, *tmp1;
++ u32 tcp_end_seq;
++
++ if (!tp->mptcp->mapping_present)
++ return 0;
++
++ /* Either the new skb gave us the mapping and the first segment
++ * in the sub-rcv-queue has to be trimmed ...
++ */
++ tmp = skb_peek(&sk->sk_receive_queue);
++ if (before(TCP_SKB_CB(tmp)->seq, tp->mptcp->map_subseq) &&
++ after(TCP_SKB_CB(tmp)->end_seq, tp->mptcp->map_subseq))
++ mptcp_skb_trim_head(tmp, sk, tp->mptcp->map_subseq);
++
++ /* ... or the new skb (tail) has to be split at the end. */
++ tcp_end_seq = TCP_SKB_CB(skb)->end_seq - (tcp_hdr(skb)->fin ? 1 : 0);
++ if (after(tcp_end_seq, tp->mptcp->map_subseq + tp->mptcp->map_data_len)) {
++ u32 seq = tp->mptcp->map_subseq + tp->mptcp->map_data_len;
++ if (mptcp_skb_split_tail(skb, sk, seq)) { /* Allocation failed */
++ /* TODO : maybe handle this here better.
++ * We now just force meta-retransmission.
++ */
++ tp->copied_seq = TCP_SKB_CB(skb)->end_seq;
++ __skb_unlink(skb, &sk->sk_receive_queue);
++ __kfree_skb(skb);
++ return -1;
++ }
++ }
++
++ /* Now, remove old sk_buff's from the receive-queue.
++ * This may happen if the mapping has been lost for these segments and
++ * the next mapping has already been received.
++ */
++ if (before(TCP_SKB_CB(skb_peek(&sk->sk_receive_queue))->seq, tp->mptcp->map_subseq)) {
++ skb_queue_walk_safe(&sk->sk_receive_queue, tmp1, tmp) {
++ if (!before(TCP_SKB_CB(tmp1)->seq, tp->mptcp->map_subseq))
++ break;
++
++ tp->copied_seq = TCP_SKB_CB(tmp1)->end_seq;
++ __skb_unlink(tmp1, &sk->sk_receive_queue);
++
++ /* Impossible that we could free skb here, because its
++ * mapping is known to be valid from previous checks
++ */
++ __kfree_skb(tmp1);
++ }
++ }
++
++ return 0;
++}
++
++/* @return: 0 everything is fine. Just continue processing
++ * 1 subflow is broken stop everything
++ * -1 this mapping has been put in the meta-receive-queue
++ * -2 this mapping has been eaten by the application
++ */
++static int mptcp_queue_skb(struct sock *sk)
++{
++ struct tcp_sock *tp = tcp_sk(sk), *meta_tp = mptcp_meta_tp(tp);
++ struct sock *meta_sk = mptcp_meta_sk(sk);
++ struct mptcp_cb *mpcb = tp->mpcb;
++ struct sk_buff *tmp, *tmp1;
++ u64 rcv_nxt64 = mptcp_get_rcv_nxt_64(meta_tp);
++ bool data_queued = false;
++
++ /* Have we not yet received the full mapping? */
++ if (!tp->mptcp->mapping_present ||
++ before(tp->rcv_nxt, tp->mptcp->map_subseq + tp->mptcp->map_data_len))
++ return 0;
++
++ /* Is this an overlapping mapping? rcv_nxt >= end_data_seq
++ * OR
++ * This mapping is out of window
++ */
++ if (!before64(rcv_nxt64, tp->mptcp->map_data_seq + tp->mptcp->map_data_len + tp->mptcp->map_data_fin) ||
++ !mptcp_sequence(meta_tp, tp->mptcp->map_data_seq,
++ tp->mptcp->map_data_seq + tp->mptcp->map_data_len + tp->mptcp->map_data_fin)) {
++ skb_queue_walk_safe(&sk->sk_receive_queue, tmp1, tmp) {
++ __skb_unlink(tmp1, &sk->sk_receive_queue);
++ tp->copied_seq = TCP_SKB_CB(tmp1)->end_seq;
++ __kfree_skb(tmp1);
++
++ if (!skb_queue_empty(&sk->sk_receive_queue) &&
++ !before(TCP_SKB_CB(tmp)->seq,
++ tp->mptcp->map_subseq + tp->mptcp->map_data_len))
++ break;
++ }
++
++ mptcp_reset_mapping(tp);
++
++ return -1;
++ }
++
++ /* Record it, because we want to send our data_fin on the same path */
++ if (tp->mptcp->map_data_fin) {
++ mpcb->dfin_path_index = tp->mptcp->path_index;
++ mpcb->dfin_combined = !!(sk->sk_shutdown & RCV_SHUTDOWN);
++ }
++
++ /* Verify the checksum */
++ if (mpcb->dss_csum && !mpcb->infinite_mapping_rcv) {
++ int ret = mptcp_verif_dss_csum(sk);
++
++ if (ret <= 0) {
++ mptcp_reset_mapping(tp);
++ return 1;
++ }
++ }
++
++ if (before64(rcv_nxt64, tp->mptcp->map_data_seq)) {
++ /* Segments have to go to the meta-ofo-queue */
++ skb_queue_walk_safe(&sk->sk_receive_queue, tmp1, tmp) {
++ tp->copied_seq = TCP_SKB_CB(tmp1)->end_seq;
++ mptcp_prepare_skb(tmp1, sk);
++ __skb_unlink(tmp1, &sk->sk_receive_queue);
++ /* MUST be done here, because fragstolen may be true later.
++ * Then, kfree_skb_partial will not account the memory.
++ */
++ skb_orphan(tmp1);
++
++ if (!mpcb->in_time_wait) /* In time-wait, do not receive data */
++ mptcp_add_meta_ofo_queue(meta_sk, tmp1, sk);
++ else
++ __kfree_skb(tmp1);
++
++ if (!skb_queue_empty(&sk->sk_receive_queue) &&
++ !before(TCP_SKB_CB(tmp)->seq,
++ tp->mptcp->map_subseq + tp->mptcp->map_data_len))
++ break;
++ }
++ tcp_enter_quickack_mode(sk);
++ } else {
++ /* Ready for the meta-rcv-queue */
++ skb_queue_walk_safe(&sk->sk_receive_queue, tmp1, tmp) {
++ int eaten = 0;
++ bool copied_early = false;
++ bool fragstolen = false;
++ u32 old_rcv_nxt = meta_tp->rcv_nxt;
++
++ tp->copied_seq = TCP_SKB_CB(tmp1)->end_seq;
++ mptcp_prepare_skb(tmp1, sk);
++ __skb_unlink(tmp1, &sk->sk_receive_queue);
++ /* MUST be done here, because fragstolen may be true.
++ * Then, kfree_skb_partial will not account the memory.
++ */
++ skb_orphan(tmp1);
++
++ /* This segment has already been received */
++ if (!after(TCP_SKB_CB(tmp1)->end_seq, meta_tp->rcv_nxt)) {
++ __kfree_skb(tmp1);
++ goto next;
++ }
++
++#ifdef CONFIG_NET_DMA
++ if (TCP_SKB_CB(tmp1)->seq == meta_tp->rcv_nxt &&
++ meta_tp->ucopy.task == current &&
++ meta_tp->copied_seq == meta_tp->rcv_nxt &&
++ tmp1->len <= meta_tp->ucopy.len &&
++ sock_owned_by_user(meta_sk) &&
++ tcp_dma_try_early_copy(meta_sk, tmp1, 0)) {
++ copied_early = true;
++ eaten = 1;
++ }
++#endif
++
++ /* Is direct copy possible ? */
++ if (TCP_SKB_CB(tmp1)->seq == meta_tp->rcv_nxt &&
++ meta_tp->ucopy.task == current &&
++ meta_tp->copied_seq == meta_tp->rcv_nxt &&
++ meta_tp->ucopy.len && sock_owned_by_user(meta_sk) &&
++ !copied_early)
++ eaten = mptcp_direct_copy(tmp1, meta_sk);
++
++ if (mpcb->in_time_wait) /* In time-wait, do not receive data */
++ eaten = 1;
++
++ if (!eaten)
++ eaten = tcp_queue_rcv(meta_sk, tmp1, 0, &fragstolen);
++
++ meta_tp->rcv_nxt = TCP_SKB_CB(tmp1)->end_seq;
++ mptcp_check_rcvseq_wrap(meta_tp, old_rcv_nxt);
++
++#ifdef CONFIG_NET_DMA
++ if (copied_early)
++ meta_tp->cleanup_rbuf(meta_sk, tmp1->len);
++#endif
++
++ if (tcp_hdr(tmp1)->fin && !mpcb->in_time_wait)
++ mptcp_fin(meta_sk);
++
++ /* Check if this fills a gap in the ofo queue */
++ if (!skb_queue_empty(&meta_tp->out_of_order_queue))
++ mptcp_ofo_queue(meta_sk);
++
++#ifdef CONFIG_NET_DMA
++ if (copied_early)
++ __skb_queue_tail(&meta_sk->sk_async_wait_queue,
++ tmp1);
++ else
++#endif
++ if (eaten)
++ kfree_skb_partial(tmp1, fragstolen);
++
++ data_queued = true;
++next:
++ if (!skb_queue_empty(&sk->sk_receive_queue) &&
++ !before(TCP_SKB_CB(tmp)->seq,
++ tp->mptcp->map_subseq + tp->mptcp->map_data_len))
++ break;
++ }
++ }
++
++ inet_csk(meta_sk)->icsk_ack.lrcvtime = tcp_time_stamp;
++ mptcp_reset_mapping(tp);
++
++ return data_queued ? -1 : -2;
++}
++
++void mptcp_data_ready(struct sock *sk)
++{
++ struct sock *meta_sk = mptcp_meta_sk(sk);
++ struct sk_buff *skb, *tmp;
++ int queued = 0;
++
++ /* restart before the check, because mptcp_fin might have changed the
++ * state.
++ */
++restart:
++ /* If the meta cannot receive data, there is no point in pushing data.
++ * If we are in time-wait, we may still be waiting for the final FIN.
++ * So, we should proceed with the processing.
++ */
++ if (!mptcp_sk_can_recv(meta_sk) && !tcp_sk(sk)->mpcb->in_time_wait) {
++ skb_queue_purge(&sk->sk_receive_queue);
++ tcp_sk(sk)->copied_seq = tcp_sk(sk)->rcv_nxt;
++ goto exit;
++ }
++
++ /* Iterate over all segments, detect their mapping (if we don't have
++ * one yet), validate them and push everything one level higher.
++ */
++ skb_queue_walk_safe(&sk->sk_receive_queue, skb, tmp) {
++ int ret;
++ /* Pre-validation - e.g., early fallback */
++ ret = mptcp_prevalidate_skb(sk, skb);
++ if (ret < 0)
++ goto restart;
++ else if (ret > 0)
++ break;
++
++ /* Set the current mapping */
++ ret = mptcp_detect_mapping(sk, skb);
++ if (ret < 0)
++ goto restart;
++ else if (ret > 0)
++ break;
++
++ /* Validation */
++ if (mptcp_validate_mapping(sk, skb) < 0)
++ goto restart;
++
++ /* Push a level higher */
++ ret = mptcp_queue_skb(sk);
++ if (ret < 0) {
++ if (ret == -1)
++ queued = ret;
++ goto restart;
++ } else if (ret == 0) {
++ continue;
++ } else { /* ret == 1 */
++ break;
++ }
++ }
++
++exit:
++ if (tcp_sk(sk)->close_it) {
++ tcp_send_ack(sk);
++ tcp_sk(sk)->ops->time_wait(sk, TCP_TIME_WAIT, 0);
++ }
++
++ if (queued == -1 && !sock_flag(meta_sk, SOCK_DEAD))
++ meta_sk->sk_data_ready(meta_sk);
++}
++
++
++int mptcp_check_req(struct sk_buff *skb, struct net *net)
++{
++ const struct tcphdr *th = tcp_hdr(skb);
++ struct sock *meta_sk = NULL;
++
++ /* MPTCP structures not initialized */
++ if (mptcp_init_failed)
++ return 0;
++
++ if (skb->protocol == htons(ETH_P_IP))
++ meta_sk = mptcp_v4_search_req(th->source, ip_hdr(skb)->saddr,
++ ip_hdr(skb)->daddr, net);
++#if IS_ENABLED(CONFIG_IPV6)
++ else /* IPv6 */
++ meta_sk = mptcp_v6_search_req(th->source, &ipv6_hdr(skb)->saddr,
++ &ipv6_hdr(skb)->daddr, net);
++#endif /* CONFIG_IPV6 */
++
++ if (!meta_sk)
++ return 0;
++
++ TCP_SKB_CB(skb)->mptcp_flags |= MPTCPHDR_JOIN;
++
++ bh_lock_sock_nested(meta_sk);
++ if (sock_owned_by_user(meta_sk)) {
++ skb->sk = meta_sk;
++ if (unlikely(sk_add_backlog(meta_sk, skb,
++ meta_sk->sk_rcvbuf + meta_sk->sk_sndbuf))) {
++ bh_unlock_sock(meta_sk);
++ NET_INC_STATS_BH(net, LINUX_MIB_TCPBACKLOGDROP);
++ sock_put(meta_sk); /* Taken by mptcp_search_req */
++ kfree_skb(skb);
++ return 1;
++ }
++ } else if (skb->protocol == htons(ETH_P_IP)) {
++ tcp_v4_do_rcv(meta_sk, skb);
++#if IS_ENABLED(CONFIG_IPV6)
++ } else { /* IPv6 */
++ tcp_v6_do_rcv(meta_sk, skb);
++#endif /* CONFIG_IPV6 */
++ }
++ bh_unlock_sock(meta_sk);
++ sock_put(meta_sk); /* Taken by mptcp_vX_search_req */
++ return 1;
++}
++
++struct mp_join *mptcp_find_join(const struct sk_buff *skb)
++{
++ const struct tcphdr *th = tcp_hdr(skb);
++ unsigned char *ptr;
++ int length = (th->doff * 4) - sizeof(struct tcphdr);
++
++ /* Jump through the options to check whether JOIN is there */
++ ptr = (unsigned char *)(th + 1);
++ while (length > 0) {
++ int opcode = *ptr++;
++ int opsize;
++
++ switch (opcode) {
++ case TCPOPT_EOL:
++ return NULL;
++ case TCPOPT_NOP: /* Ref: RFC 793 section 3.1 */
++ length--;
++ continue;
++ default:
++ opsize = *ptr++;
++ if (opsize < 2) /* "silly options" */
++ return NULL;
++ if (opsize > length)
++ return NULL; /* don't parse partial options */
++ if (opcode == TCPOPT_MPTCP &&
++ ((struct mptcp_option *)(ptr - 2))->sub == MPTCP_SUB_JOIN) {
++ return (struct mp_join *)(ptr - 2);
++ }
++ ptr += opsize - 2;
++ length -= opsize;
++ }
++ }
++ return NULL;
++}
++
++int mptcp_lookup_join(struct sk_buff *skb, struct inet_timewait_sock *tw)
++{
++ const struct mptcp_cb *mpcb;
++ struct sock *meta_sk;
++ u32 token;
++ bool meta_v4;
++ struct mp_join *join_opt = mptcp_find_join(skb);
++ if (!join_opt)
++ return 0;
++
++ /* MPTCP structures were not initialized, so return error */
++ if (mptcp_init_failed)
++ return -1;
++
++ token = join_opt->u.syn.token;
++ meta_sk = mptcp_hash_find(dev_net(skb_dst(skb)->dev), token);
++ if (!meta_sk) {
++ mptcp_debug("%s:mpcb not found:%x\n", __func__, token);
++ return -1;
++ }
++
++ meta_v4 = meta_sk->sk_family == AF_INET;
++ if (meta_v4) {
++ if (skb->protocol == htons(ETH_P_IPV6)) {
++ mptcp_debug("SYN+MP_JOIN with IPV6 address on pure IPV4 meta\n");
++ sock_put(meta_sk); /* Taken by mptcp_hash_find */
++ return -1;
++ }
++ } else if (skb->protocol == htons(ETH_P_IP) &&
++ inet6_sk(meta_sk)->ipv6only) {
++ mptcp_debug("SYN+MP_JOIN with IPV4 address on IPV6_V6ONLY meta\n");
++ sock_put(meta_sk); /* Taken by mptcp_hash_find */
++ return -1;
++ }
++
++ mpcb = tcp_sk(meta_sk)->mpcb;
++ if (mpcb->infinite_mapping_rcv || mpcb->send_infinite_mapping) {
++ /* We are in fallback-mode on the reception-side -
++ * no new subflows!
++ */
++ sock_put(meta_sk); /* Taken by mptcp_hash_find */
++ return -1;
++ }
++
++ /* Coming from time-wait-sock processing in tcp_v4_rcv.
++ * We have to deschedule it before continuing, because otherwise
++ * mptcp_v4_do_rcv will hit again on it inside tcp_v4_hnd_req.
++ */
++ if (tw) {
++ inet_twsk_deschedule(tw, &tcp_death_row);
++ inet_twsk_put(tw);
++ }
++
++ TCP_SKB_CB(skb)->mptcp_flags |= MPTCPHDR_JOIN;
++ /* OK, this is a new syn/join, let's create a new open request and
++ * send syn+ack
++ */
++ bh_lock_sock_nested(meta_sk);
++ if (sock_owned_by_user(meta_sk)) {
++ skb->sk = meta_sk;
++ if (unlikely(sk_add_backlog(meta_sk, skb,
++ meta_sk->sk_rcvbuf + meta_sk->sk_sndbuf))) {
++ bh_unlock_sock(meta_sk);
++ NET_INC_STATS_BH(sock_net(meta_sk),
++ LINUX_MIB_TCPBACKLOGDROP);
++ sock_put(meta_sk); /* Taken by mptcp_hash_find */
++ kfree_skb(skb);
++ return 1;
++ }
++ } else if (skb->protocol == htons(ETH_P_IP)) {
++ tcp_v4_do_rcv(meta_sk, skb);
++#if IS_ENABLED(CONFIG_IPV6)
++ } else {
++ tcp_v6_do_rcv(meta_sk, skb);
++#endif /* CONFIG_IPV6 */
++ }
++ bh_unlock_sock(meta_sk);
++ sock_put(meta_sk); /* Taken by mptcp_hash_find */
++ return 1;
++}
++
++int mptcp_do_join_short(struct sk_buff *skb,
++ const struct mptcp_options_received *mopt,
++ struct net *net)
++{
++ struct sock *meta_sk;
++ u32 token;
++ bool meta_v4;
++
++ token = mopt->mptcp_rem_token;
++ meta_sk = mptcp_hash_find(net, token);
++ if (!meta_sk) {
++ mptcp_debug("%s:mpcb not found:%x\n", __func__, token);
++ return -1;
++ }
++
++ meta_v4 = meta_sk->sk_family == AF_INET;
++ if (meta_v4) {
++ if (skb->protocol == htons(ETH_P_IPV6)) {
++ mptcp_debug("SYN+MP_JOIN with IPV6 address on pure IPV4 meta\n");
++ sock_put(meta_sk); /* Taken by mptcp_hash_find */
++ return -1;
++ }
++ } else if (skb->protocol == htons(ETH_P_IP) &&
++ inet6_sk(meta_sk)->ipv6only) {
++ mptcp_debug("SYN+MP_JOIN with IPV4 address on IPV6_V6ONLY meta\n");
++ sock_put(meta_sk); /* Taken by mptcp_hash_find */
++ return -1;
++ }
++
++ TCP_SKB_CB(skb)->mptcp_flags |= MPTCPHDR_JOIN;
++
++ /* OK, this is a new syn/join, let's create a new open request and
++ * send syn+ack
++ */
++ bh_lock_sock(meta_sk);
++
++ /* This check is also done in mptcp_vX_do_rcv. But, there we cannot
++ * call tcp_vX_send_reset, because we hold already two socket-locks.
++ * (the listener and the meta from above)
++ *
++ * And the send-reset will try to take yet another one (ip_send_reply).
++ * Thus, we propagate the reset up to tcp_rcv_state_process.
++ */
++ if (tcp_sk(meta_sk)->mpcb->infinite_mapping_rcv ||
++ tcp_sk(meta_sk)->mpcb->send_infinite_mapping ||
++ meta_sk->sk_state == TCP_CLOSE || !tcp_sk(meta_sk)->inside_tk_table) {
++ bh_unlock_sock(meta_sk);
++ sock_put(meta_sk); /* Taken by mptcp_hash_find */
++ return -1;
++ }
++
++ if (sock_owned_by_user(meta_sk)) {
++ skb->sk = meta_sk;
++ if (unlikely(sk_add_backlog(meta_sk, skb,
++ meta_sk->sk_rcvbuf + meta_sk->sk_sndbuf)))
++ NET_INC_STATS_BH(net, LINUX_MIB_TCPBACKLOGDROP);
++ else
++ /* Must make sure that upper layers won't free the
++ * skb if it is added to the backlog-queue.
++ */
++ skb_get(skb);
++ } else {
++ /* mptcp_v4_do_rcv tries to free the skb - we prevent this, as
++ * the skb will finally be freed by tcp_v4_do_rcv (where we are
++ * coming from)
++ */
++ skb_get(skb);
++ if (skb->protocol == htons(ETH_P_IP)) {
++ tcp_v4_do_rcv(meta_sk, skb);
++#if IS_ENABLED(CONFIG_IPV6)
++ } else { /* IPv6 */
++ tcp_v6_do_rcv(meta_sk, skb);
++#endif /* CONFIG_IPV6 */
++ }
++ }
++
++ bh_unlock_sock(meta_sk);
++ sock_put(meta_sk); /* Taken by mptcp_hash_find */
++ return 0;
++}
++
++/**
++ * Equivalent of tcp_fin() for MPTCP.
++ * Must only be called once the FIN is validly part of the data
++ * sequence-number space, i.e. not while there are still holes before it.
++ */
++void mptcp_fin(struct sock *meta_sk)
++{
++ struct sock *sk = NULL, *sk_it;
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct mptcp_cb *mpcb = meta_tp->mpcb;
++
++ mptcp_for_each_sk(mpcb, sk_it) {
++ if (tcp_sk(sk_it)->mptcp->path_index == mpcb->dfin_path_index) {
++ sk = sk_it;
++ break;
++ }
++ }
++
++ if (!sk || sk->sk_state == TCP_CLOSE)
++ sk = mptcp_select_ack_sock(meta_sk);
++
++ inet_csk_schedule_ack(sk);
++
++ meta_sk->sk_shutdown |= RCV_SHUTDOWN;
++ sock_set_flag(meta_sk, SOCK_DONE);
++
++ switch (meta_sk->sk_state) {
++ case TCP_SYN_RECV:
++ case TCP_ESTABLISHED:
++ /* Move to CLOSE_WAIT */
++ tcp_set_state(meta_sk, TCP_CLOSE_WAIT);
++ inet_csk(sk)->icsk_ack.pingpong = 1;
++ break;
++
++ case TCP_CLOSE_WAIT:
++ case TCP_CLOSING:
++ /* Received a retransmission of the FIN, do
++ * nothing.
++ */
++ break;
++ case TCP_LAST_ACK:
++ /* RFC793: Remain in the LAST-ACK state. */
++ break;
++
++ case TCP_FIN_WAIT1:
++ /* This case occurs when a simultaneous close
++ * happens, we must ack the received FIN and
++ * enter the CLOSING state.
++ */
++ tcp_send_ack(sk);
++ tcp_set_state(meta_sk, TCP_CLOSING);
++ break;
++ case TCP_FIN_WAIT2:
++ /* Received a FIN -- send ACK and enter TIME_WAIT. */
++ tcp_send_ack(sk);
++ meta_tp->ops->time_wait(meta_sk, TCP_TIME_WAIT, 0);
++ break;
++ default:
++ /* Only TCP_LISTEN and TCP_CLOSE are left; in these
++ * cases we should never reach this piece of code.
++ */
++ pr_err("%s: Impossible, meta_sk->sk_state=%d\n", __func__,
++ meta_sk->sk_state);
++ break;
++ }
++
++ /* It _is_ possible, that we have something out-of-order _after_ FIN.
++ * Probably, we should reset in this case. For now drop them.
++ */
++ mptcp_purge_ofo_queue(meta_tp);
++ sk_mem_reclaim(meta_sk);
++
++ if (!sock_flag(meta_sk, SOCK_DEAD)) {
++ meta_sk->sk_state_change(meta_sk);
++
++ /* Do not send POLL_HUP for half duplex close. */
++ if (meta_sk->sk_shutdown == SHUTDOWN_MASK ||
++ meta_sk->sk_state == TCP_CLOSE)
++ sk_wake_async(meta_sk, SOCK_WAKE_WAITD, POLL_HUP);
++ else
++ sk_wake_async(meta_sk, SOCK_WAKE_WAITD, POLL_IN);
++ }
++
++ return;
++}
++
++static void mptcp_xmit_retransmit_queue(struct sock *meta_sk)
++{
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct sk_buff *skb;
++
++ if (!meta_tp->packets_out)
++ return;
++
++ tcp_for_write_queue(skb, meta_sk) {
++ if (skb == tcp_send_head(meta_sk))
++ break;
++
++ if (mptcp_retransmit_skb(meta_sk, skb))
++ return;
++
++ if (skb == tcp_write_queue_head(meta_sk))
++ inet_csk_reset_xmit_timer(meta_sk, ICSK_TIME_RETRANS,
++ inet_csk(meta_sk)->icsk_rto,
++ TCP_RTO_MAX);
++ }
++}
++
++/* Handle the DATA_ACK */
++static void mptcp_data_ack(struct sock *sk, const struct sk_buff *skb)
++{
++ struct sock *meta_sk = mptcp_meta_sk(sk);
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk), *tp = tcp_sk(sk);
++ struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
++ u32 prior_snd_una = meta_tp->snd_una;
++ int prior_packets;
++ u32 nwin, data_ack, data_seq;
++ u16 data_len = 0;
++
++ /* A valid packet came in - subflow is operational again */
++ tp->pf = 0;
++
++ /* Even if there is no data-ack, we stop retransmitting.
++ * The exception is a SYN/ACK, which is just a retransmission.
++ */
++ if (tp->mptcp->pre_established && !tcp_hdr(skb)->syn) {
++ tp->mptcp->pre_established = 0;
++ sk_stop_timer(sk, &tp->mptcp->mptcp_ack_timer);
++ }
++
++ /* If we are in infinite mapping mode, rx_opt.data_ack has been
++ * set by mptcp_clean_rtx_infinite.
++ */
++ if (!(tcb->mptcp_flags & MPTCPHDR_ACK) && !tp->mpcb->infinite_mapping_snd)
++ goto exit;
++
++ data_ack = tp->mptcp->rx_opt.data_ack;
++
++ if (unlikely(!tp->mptcp->fully_established) &&
++ tp->mptcp->snt_isn + 1 != TCP_SKB_CB(skb)->ack_seq)
++ /* As soon as a subflow-data-ack (not acking syn, thus snt_isn + 1)
++ * includes a data-ack, we are fully established
++ */
++ mptcp_become_fully_estab(sk);
++
++ /* Get the data_seq */
++ if (mptcp_is_data_seq(skb)) {
++ data_seq = tp->mptcp->rx_opt.data_seq;
++ data_len = tp->mptcp->rx_opt.data_len;
++ } else {
++ data_seq = meta_tp->snd_wl1;
++ }
++
++ /* If the ack is older than previous acks
++ * then we can probably ignore it.
++ */
++ if (before(data_ack, prior_snd_una))
++ goto exit;
++
++ /* If the ack includes data we haven't sent yet, discard
++ * this segment (RFC793 Section 3.9).
++ */
++ if (after(data_ack, meta_tp->snd_nxt))
++ goto exit;
++
++ /*** Now, update the window - inspired by tcp_ack_update_window ***/
++ nwin = ntohs(tcp_hdr(skb)->window);
++
++ if (likely(!tcp_hdr(skb)->syn))
++ nwin <<= tp->rx_opt.snd_wscale;
++
++ if (tcp_may_update_window(meta_tp, data_ack, data_seq, nwin)) {
++ tcp_update_wl(meta_tp, data_seq);
++
++ /* Draft v09, Section 3.3.5:
++ * [...] It should only update its local receive window values
++ * when the largest sequence number allowed (i.e. DATA_ACK +
++ * receive window) increases. [...]
++ */
++ if (meta_tp->snd_wnd != nwin &&
++ !before(data_ack + nwin, tcp_wnd_end(meta_tp))) {
++ meta_tp->snd_wnd = nwin;
++
++ if (nwin > meta_tp->max_window)
++ meta_tp->max_window = nwin;
++ }
++ }
++ /*** Done, update the window ***/
++
++ /* We passed data and got it acked, remove any soft error
++ * log. Something worked...
++ */
++ sk->sk_err_soft = 0;
++ inet_csk(meta_sk)->icsk_probes_out = 0;
++ meta_tp->rcv_tstamp = tcp_time_stamp;
++ prior_packets = meta_tp->packets_out;
++ if (!prior_packets)
++ goto no_queue;
++
++ meta_tp->snd_una = data_ack;
++
++ mptcp_clean_rtx_queue(meta_sk, prior_snd_una);
++
++ /* We are in loss-state, and something got acked, retransmit the whole
++ * queue now!
++ */
++ if (inet_csk(meta_sk)->icsk_ca_state == TCP_CA_Loss &&
++ after(data_ack, prior_snd_una)) {
++ mptcp_xmit_retransmit_queue(meta_sk);
++ inet_csk(meta_sk)->icsk_ca_state = TCP_CA_Open;
++ }
++
++ /* Simplified version of tcp_new_space, because the snd-buffer
++ * is handled by all the subflows.
++ */
++ if (sock_flag(meta_sk, SOCK_QUEUE_SHRUNK)) {
++ sock_reset_flag(meta_sk, SOCK_QUEUE_SHRUNK);
++ if (meta_sk->sk_socket &&
++ test_bit(SOCK_NOSPACE, &meta_sk->sk_socket->flags))
++ meta_sk->sk_write_space(meta_sk);
++ }
++
++ if (meta_sk->sk_state != TCP_ESTABLISHED &&
++ mptcp_rcv_state_process(meta_sk, sk, skb, data_seq, data_len))
++ return;
++
++exit:
++ mptcp_push_pending_frames(meta_sk);
++
++ return;
++
++no_queue:
++ if (tcp_send_head(meta_sk))
++ tcp_ack_probe(meta_sk);
++
++ mptcp_push_pending_frames(meta_sk);
++
++ return;
++}
++
++void mptcp_clean_rtx_infinite(const struct sk_buff *skb, struct sock *sk)
++{
++ struct tcp_sock *tp = tcp_sk(sk), *meta_tp = tcp_sk(mptcp_meta_sk(sk));
++
++ if (!tp->mpcb->infinite_mapping_snd)
++ return;
++
++ /* The difference between both write_seq's represents the offset between
++ * data-sequence and subflow-sequence. As we are infinite, this must
++ * match.
++ *
++ * Thus, from this difference we can infer the meta snd_una.
++ */
++ tp->mptcp->rx_opt.data_ack = meta_tp->snd_nxt - tp->snd_nxt +
++ tp->snd_una;
++
++ mptcp_data_ack(sk, skb);
++}
++
++/* Static functions used by mptcp_parse_options */
++
++static void mptcp_send_reset_rem_id(const struct mptcp_cb *mpcb, u8 rem_id)
++{
++ struct sock *sk_it, *tmpsk;
++
++ mptcp_for_each_sk_safe(mpcb, sk_it, tmpsk) {
++ if (tcp_sk(sk_it)->mptcp->rem_id == rem_id) {
++ mptcp_reinject_data(sk_it, 0);
++ sk_it->sk_err = ECONNRESET;
++ if (tcp_need_reset(sk_it->sk_state))
++ tcp_sk(sk_it)->ops->send_active_reset(sk_it,
++ GFP_ATOMIC);
++ mptcp_sub_force_close(sk_it);
++ }
++ }
++}
++
++void mptcp_parse_options(const uint8_t *ptr, int opsize,
++ struct mptcp_options_received *mopt,
++ const struct sk_buff *skb)
++{
++ const struct mptcp_option *mp_opt = (struct mptcp_option *)ptr;
++
++ /* If the socket is mp-capable we would have a mopt. */
++ if (!mopt)
++ return;
++
++ switch (mp_opt->sub) {
++ case MPTCP_SUB_CAPABLE:
++ {
++ const struct mp_capable *mpcapable = (struct mp_capable *)ptr;
++
++ if (opsize != MPTCP_SUB_LEN_CAPABLE_SYN &&
++ opsize != MPTCP_SUB_LEN_CAPABLE_ACK) {
++ mptcp_debug("%s: mp_capable: bad option size %d\n",
++ __func__, opsize);
++ break;
++ }
++
++ if (!sysctl_mptcp_enabled)
++ break;
++
++ /* We only support MPTCP version 0 */
++ if (mpcapable->ver != 0)
++ break;
++
++ /* MPTCP-RFC 6824:
++ * "If receiving a message with the 'B' flag set to 1, and this
++ * is not understood, then this SYN MUST be silently ignored;
++ */
++ if (mpcapable->b) {
++ mopt->drop_me = 1;
++ break;
++ }
++
++ /* MPTCP-RFC 6824:
++ * "An implementation that only supports this method MUST set
++ * bit "H" to 1, and bits "C" through "G" to 0."
++ */
++ if (!mpcapable->h)
++ break;
++
++ mopt->saw_mpc = 1;
++ mopt->dss_csum = sysctl_mptcp_checksum || mpcapable->a;
++
++ if (opsize >= MPTCP_SUB_LEN_CAPABLE_SYN)
++ mopt->mptcp_key = mpcapable->sender_key;
++
++ break;
++ }
++ case MPTCP_SUB_JOIN:
++ {
++ const struct mp_join *mpjoin = (struct mp_join *)ptr;
++
++ if (opsize != MPTCP_SUB_LEN_JOIN_SYN &&
++ opsize != MPTCP_SUB_LEN_JOIN_SYNACK &&
++ opsize != MPTCP_SUB_LEN_JOIN_ACK) {
++ mptcp_debug("%s: mp_join: bad option size %d\n",
++ __func__, opsize);
++ break;
++ }
++
++ /* saw_mpc must be set, because in tcp_check_req we assume that
++ * it is set to support falling back to reg. TCP if a rexmitted
++ * SYN has no MP_CAPABLE or MP_JOIN
++ */
++ switch (opsize) {
++ case MPTCP_SUB_LEN_JOIN_SYN:
++ mopt->is_mp_join = 1;
++ mopt->saw_mpc = 1;
++ mopt->low_prio = mpjoin->b;
++ mopt->rem_id = mpjoin->addr_id;
++ mopt->mptcp_rem_token = mpjoin->u.syn.token;
++ mopt->mptcp_recv_nonce = mpjoin->u.syn.nonce;
++ break;
++ case MPTCP_SUB_LEN_JOIN_SYNACK:
++ mopt->saw_mpc = 1;
++ mopt->low_prio = mpjoin->b;
++ mopt->rem_id = mpjoin->addr_id;
++ mopt->mptcp_recv_tmac = mpjoin->u.synack.mac;
++ mopt->mptcp_recv_nonce = mpjoin->u.synack.nonce;
++ break;
++ case MPTCP_SUB_LEN_JOIN_ACK:
++ mopt->saw_mpc = 1;
++ mopt->join_ack = 1;
++ memcpy(mopt->mptcp_recv_mac, mpjoin->u.ack.mac, 20);
++ break;
++ }
++ break;
++ }
++ case MPTCP_SUB_DSS:
++ {
++ const struct mp_dss *mdss = (struct mp_dss *)ptr;
++ struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
++
++ /* We check opsize for the csum and non-csum case. We do this,
++ * because the draft says that the csum SHOULD be ignored if
++ * it has not been negotiated in the MP_CAPABLE but still is
++ * present in the data.
++ *
++ * It will get ignored later in mptcp_queue_skb.
++ */
++ if (opsize != mptcp_sub_len_dss(mdss, 0) &&
++ opsize != mptcp_sub_len_dss(mdss, 1)) {
++ mptcp_debug("%s: mp_dss: bad option size %d\n",
++ __func__, opsize);
++ break;
++ }
++
++ ptr += 4;
++
++ if (mdss->A) {
++ tcb->mptcp_flags |= MPTCPHDR_ACK;
++
++ if (mdss->a) {
++ mopt->data_ack = (u32) get_unaligned_be64(ptr);
++ ptr += MPTCP_SUB_LEN_ACK_64;
++ } else {
++ mopt->data_ack = get_unaligned_be32(ptr);
++ ptr += MPTCP_SUB_LEN_ACK;
++ }
++ }
++
++ tcb->dss_off = (ptr - skb_transport_header(skb));
++
++ if (mdss->M) {
++ if (mdss->m) {
++ u64 data_seq64 = get_unaligned_be64(ptr);
++
++ tcb->mptcp_flags |= MPTCPHDR_SEQ64_SET;
++ mopt->data_seq = (u32) data_seq64;
++
++ ptr += 12; /* 64-bit dseq + subseq */
++ } else {
++ mopt->data_seq = get_unaligned_be32(ptr);
++ ptr += 8; /* 32-bit dseq + subseq */
++ }
++ mopt->data_len = get_unaligned_be16(ptr);
++
++ tcb->mptcp_flags |= MPTCPHDR_SEQ;
++
++ /* Is a check-sum present? */
++ if (opsize == mptcp_sub_len_dss(mdss, 1))
++ tcb->mptcp_flags |= MPTCPHDR_DSS_CSUM;
++
++ /* DATA_FIN only possible with DSS-mapping */
++ if (mdss->F)
++ tcb->mptcp_flags |= MPTCPHDR_FIN;
++ }
++
++ break;
++ }
++ case MPTCP_SUB_ADD_ADDR:
++ {
++#if IS_ENABLED(CONFIG_IPV6)
++ const struct mp_add_addr *mpadd = (struct mp_add_addr *)ptr;
++
++ if ((mpadd->ipver == 4 && opsize != MPTCP_SUB_LEN_ADD_ADDR4 &&
++ opsize != MPTCP_SUB_LEN_ADD_ADDR4 + 2) ||
++ (mpadd->ipver == 6 && opsize != MPTCP_SUB_LEN_ADD_ADDR6 &&
++ opsize != MPTCP_SUB_LEN_ADD_ADDR6 + 2)) {
++#else
++ if (opsize != MPTCP_SUB_LEN_ADD_ADDR4 &&
++ opsize != MPTCP_SUB_LEN_ADD_ADDR4 + 2) {
++#endif /* CONFIG_IPV6 */
++ mptcp_debug("%s: mp_add_addr: bad option size %d\n",
++ __func__, opsize);
++ break;
++ }
++
++ /* We have to manually parse the options if we got two of them. */
++ if (mopt->saw_add_addr) {
++ mopt->more_add_addr = 1;
++ break;
++ }
++ mopt->saw_add_addr = 1;
++ mopt->add_addr_ptr = ptr;
++ break;
++ }
++ case MPTCP_SUB_REMOVE_ADDR:
++ if ((opsize - MPTCP_SUB_LEN_REMOVE_ADDR) < 0) {
++ mptcp_debug("%s: mp_remove_addr: bad option size %d\n",
++ __func__, opsize);
++ break;
++ }
++
++ if (mopt->saw_rem_addr) {
++ mopt->more_rem_addr = 1;
++ break;
++ }
++ mopt->saw_rem_addr = 1;
++ mopt->rem_addr_ptr = ptr;
++ break;
++ case MPTCP_SUB_PRIO:
++ {
++ const struct mp_prio *mpprio = (struct mp_prio *)ptr;
++
++ if (opsize != MPTCP_SUB_LEN_PRIO &&
++ opsize != MPTCP_SUB_LEN_PRIO_ADDR) {
++ mptcp_debug("%s: mp_prio: bad option size %d\n",
++ __func__, opsize);
++ break;
++ }
++
++ mopt->saw_low_prio = 1;
++ mopt->low_prio = mpprio->b;
++
++ if (opsize == MPTCP_SUB_LEN_PRIO_ADDR) {
++ mopt->saw_low_prio = 2;
++ mopt->prio_addr_id = mpprio->addr_id;
++ }
++ break;
++ }
++ case MPTCP_SUB_FAIL:
++ if (opsize != MPTCP_SUB_LEN_FAIL) {
++ mptcp_debug("%s: mp_fail: bad option size %d\n",
++ __func__, opsize);
++ break;
++ }
++ mopt->mp_fail = 1;
++ break;
++ case MPTCP_SUB_FCLOSE:
++ if (opsize != MPTCP_SUB_LEN_FCLOSE) {
++ mptcp_debug("%s: mp_fclose: bad option size %d\n",
++ __func__, opsize);
++ break;
++ }
++
++ mopt->mp_fclose = 1;
++ mopt->mptcp_key = ((struct mp_fclose *)ptr)->key;
++
++ break;
++ default:
++ mptcp_debug("%s: Received unknown subtype: %d\n",
++ __func__, mp_opt->sub);
++ break;
++ }
++}
++
++/** Parse only MPTCP options */
++void tcp_parse_mptcp_options(const struct sk_buff *skb,
++ struct mptcp_options_received *mopt)
++{
++ const struct tcphdr *th = tcp_hdr(skb);
++ int length = (th->doff * 4) - sizeof(struct tcphdr);
++ const unsigned char *ptr = (const unsigned char *)(th + 1);
++
++ while (length > 0) {
++ int opcode = *ptr++;
++ int opsize;
++
++ switch (opcode) {
++ case TCPOPT_EOL:
++ return;
++ case TCPOPT_NOP: /* Ref: RFC 793 section 3.1 */
++ length--;
++ continue;
++ default:
++ opsize = *ptr++;
++ if (opsize < 2) /* "silly options" */
++ return;
++ if (opsize > length)
++ return; /* don't parse partial options */
++ if (opcode == TCPOPT_MPTCP)
++ mptcp_parse_options(ptr - 2, opsize, mopt, skb);
++ }
++ ptr += opsize - 2;
++ length -= opsize;
++ }
++}
++
++int mptcp_check_rtt(const struct tcp_sock *tp, int time)
++{
++ struct mptcp_cb *mpcb = tp->mpcb;
++ struct sock *sk;
++ u32 rtt_max = 0;
++
++ /* In MPTCP, we take the max delay across all flows,
++ * in order to take into account meta-reordering buffers.
++ */
++ mptcp_for_each_sk(mpcb, sk) {
++ if (!mptcp_sk_can_recv(sk))
++ continue;
++
++ if (rtt_max < tcp_sk(sk)->rcv_rtt_est.rtt)
++ rtt_max = tcp_sk(sk)->rcv_rtt_est.rtt;
++ }
++ if (time < (rtt_max >> 3) || !rtt_max)
++ return 1;
++
++ return 0;
++}
++
++static void mptcp_handle_add_addr(const unsigned char *ptr, struct sock *sk)
++{
++ struct mp_add_addr *mpadd = (struct mp_add_addr *)ptr;
++ struct mptcp_cb *mpcb = tcp_sk(sk)->mpcb;
++ __be16 port = 0;
++ union inet_addr addr;
++ sa_family_t family;
++
++ if (mpadd->ipver == 4) {
++ if (mpadd->len == MPTCP_SUB_LEN_ADD_ADDR4 + 2)
++ port = mpadd->u.v4.port;
++ family = AF_INET;
++ addr.in = mpadd->u.v4.addr;
++#if IS_ENABLED(CONFIG_IPV6)
++ } else if (mpadd->ipver == 6) {
++ if (mpadd->len == MPTCP_SUB_LEN_ADD_ADDR6 + 2)
++ port = mpadd->u.v6.port;
++ family = AF_INET6;
++ addr.in6 = mpadd->u.v6.addr;
++#endif /* CONFIG_IPV6 */
++ } else {
++ return;
++ }
++
++ if (mpcb->pm_ops->add_raddr)
++ mpcb->pm_ops->add_raddr(mpcb, &addr, family, port, mpadd->addr_id);
++}
++
++static void mptcp_handle_rem_addr(const unsigned char *ptr, struct sock *sk)
++{
++ struct mp_remove_addr *mprem = (struct mp_remove_addr *)ptr;
++ int i;
++ u8 rem_id;
++ struct mptcp_cb *mpcb = tcp_sk(sk)->mpcb;
++
++ for (i = 0; i <= mprem->len - MPTCP_SUB_LEN_REMOVE_ADDR; i++) {
++ rem_id = (&mprem->addrs_id)[i];
++
++ if (mpcb->pm_ops->rem_raddr)
++ mpcb->pm_ops->rem_raddr(mpcb, rem_id);
++ mptcp_send_reset_rem_id(mpcb, rem_id);
++ }
++}
++
++static void mptcp_parse_addropt(const struct sk_buff *skb, struct sock *sk)
++{
++ struct tcphdr *th = tcp_hdr(skb);
++ unsigned char *ptr;
++ int length = (th->doff * 4) - sizeof(struct tcphdr);
++
++ /* Jump through the options to check whether ADD_ADDR is there */
++ ptr = (unsigned char *)(th + 1);
++ while (length > 0) {
++ int opcode = *ptr++;
++ int opsize;
++
++ switch (opcode) {
++ case TCPOPT_EOL:
++ return;
++ case TCPOPT_NOP:
++ length--;
++ continue;
++ default:
++ opsize = *ptr++;
++ if (opsize < 2)
++ return;
++ if (opsize > length)
++ return; /* don't parse partial options */
++ if (opcode == TCPOPT_MPTCP &&
++ ((struct mptcp_option *)ptr)->sub == MPTCP_SUB_ADD_ADDR) {
++#if IS_ENABLED(CONFIG_IPV6)
++ struct mp_add_addr *mpadd = (struct mp_add_addr *)ptr;
++ if ((mpadd->ipver == 4 && opsize != MPTCP_SUB_LEN_ADD_ADDR4 &&
++ opsize != MPTCP_SUB_LEN_ADD_ADDR4 + 2) ||
++ (mpadd->ipver == 6 && opsize != MPTCP_SUB_LEN_ADD_ADDR6 &&
++ opsize != MPTCP_SUB_LEN_ADD_ADDR6 + 2))
++#else
++ if (opsize != MPTCP_SUB_LEN_ADD_ADDR4 &&
++ opsize != MPTCP_SUB_LEN_ADD_ADDR4 + 2)
++#endif /* CONFIG_IPV6 */
++ goto cont;
++
++ mptcp_handle_add_addr(ptr, sk);
++ }
++ if (opcode == TCPOPT_MPTCP &&
++ ((struct mptcp_option *)ptr)->sub == MPTCP_SUB_REMOVE_ADDR) {
++ if ((opsize - MPTCP_SUB_LEN_REMOVE_ADDR) < 0)
++ goto cont;
++
++ mptcp_handle_rem_addr(ptr, sk);
++ }
++cont:
++ ptr += opsize - 2;
++ length -= opsize;
++ }
++ }
++ return;
++}
++
++static inline int mptcp_mp_fail_rcvd(struct sock *sk, const struct tcphdr *th)
++{
++ struct mptcp_tcp_sock *mptcp = tcp_sk(sk)->mptcp;
++ struct sock *meta_sk = mptcp_meta_sk(sk);
++ struct mptcp_cb *mpcb = tcp_sk(sk)->mpcb;
++
++ if (unlikely(mptcp->rx_opt.mp_fail)) {
++ mptcp->rx_opt.mp_fail = 0;
++
++ if (!th->rst && !mpcb->infinite_mapping_snd) {
++ struct sock *sk_it;
++
++ mpcb->send_infinite_mapping = 1;
++ /* We resend everything that has not been acknowledged */
++ meta_sk->sk_send_head = tcp_write_queue_head(meta_sk);
++
++ /* We artificially restart the whole send-queue. Thus,
++ * it is as if no packets are in flight
++ */
++ tcp_sk(meta_sk)->packets_out = 0;
++
++ /* If the snd_nxt already wrapped around, we have to
++ * undo the wrapping, as we are restarting from snd_una
++ * on.
++ */
++ if (tcp_sk(meta_sk)->snd_nxt < tcp_sk(meta_sk)->snd_una) {
++ mpcb->snd_high_order[mpcb->snd_hiseq_index] -= 2;
++ mpcb->snd_hiseq_index = mpcb->snd_hiseq_index ? 0 : 1;
++ }
++ tcp_sk(meta_sk)->snd_nxt = tcp_sk(meta_sk)->snd_una;
++
++ /* Trigger a sending on the meta. */
++ mptcp_push_pending_frames(meta_sk);
++
++ mptcp_for_each_sk(mpcb, sk_it) {
++ if (sk != sk_it)
++ mptcp_sub_force_close(sk_it);
++ }
++ }
++
++ return 0;
++ }
++
++ if (unlikely(mptcp->rx_opt.mp_fclose)) {
++ struct sock *sk_it, *tmpsk;
++
++ mptcp->rx_opt.mp_fclose = 0;
++ if (mptcp->rx_opt.mptcp_key != mpcb->mptcp_loc_key)
++ return 0;
++
++ if (tcp_need_reset(sk->sk_state))
++ tcp_sk(sk)->ops->send_active_reset(sk, GFP_ATOMIC);
++
++ mptcp_for_each_sk_safe(mpcb, sk_it, tmpsk)
++ mptcp_sub_force_close(sk_it);
++
++ tcp_reset(meta_sk);
++
++ return 1;
++ }
++
++ return 0;
++}
++
++static inline void mptcp_path_array_check(struct sock *meta_sk)
++{
++ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++
++ if (unlikely(mpcb->list_rcvd)) {
++ mpcb->list_rcvd = 0;
++ if (mpcb->pm_ops->new_remote_address)
++ mpcb->pm_ops->new_remote_address(meta_sk);
++ }
++}
++
++int mptcp_handle_options(struct sock *sk, const struct tcphdr *th,
++ const struct sk_buff *skb)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct mptcp_options_received *mopt = &tp->mptcp->rx_opt;
++
++ if (tp->mpcb->infinite_mapping_rcv || tp->mpcb->infinite_mapping_snd)
++ return 0;
++
++ if (mptcp_mp_fail_rcvd(sk, th))
++ return 1;
++
++ /* RFC 6824, Section 3.3:
++ * If a checksum is not present when its use has been negotiated, the
++ * receiver MUST close the subflow with a RST as it is considered broken.
++ */
++ if (mptcp_is_data_seq(skb) && tp->mpcb->dss_csum &&
++ !(TCP_SKB_CB(skb)->mptcp_flags & MPTCPHDR_DSS_CSUM)) {
++ if (tcp_need_reset(sk->sk_state))
++ tp->ops->send_active_reset(sk, GFP_ATOMIC);
++
++ mptcp_sub_force_close(sk);
++ return 1;
++ }
++
++ /* We have to acknowledge retransmissions of the third
++ * ack.
++ */
++ if (mopt->join_ack) {
++ tcp_send_delayed_ack(sk);
++ mopt->join_ack = 0;
++ }
++
++ if (mopt->saw_add_addr || mopt->saw_rem_addr) {
++ if (mopt->more_add_addr || mopt->more_rem_addr) {
++ mptcp_parse_addropt(skb, sk);
++ } else {
++ if (mopt->saw_add_addr)
++ mptcp_handle_add_addr(mopt->add_addr_ptr, sk);
++ if (mopt->saw_rem_addr)
++ mptcp_handle_rem_addr(mopt->rem_addr_ptr, sk);
++ }
++
++ mopt->more_add_addr = 0;
++ mopt->saw_add_addr = 0;
++ mopt->more_rem_addr = 0;
++ mopt->saw_rem_addr = 0;
++ }
++ if (mopt->saw_low_prio) {
++ if (mopt->saw_low_prio == 1) {
++ tp->mptcp->rcv_low_prio = mopt->low_prio;
++ } else {
++ struct sock *sk_it;
++ mptcp_for_each_sk(tp->mpcb, sk_it) {
++ struct mptcp_tcp_sock *mptcp = tcp_sk(sk_it)->mptcp;
++ if (mptcp->rem_id == mopt->prio_addr_id)
++ mptcp->rcv_low_prio = mopt->low_prio;
++ }
++ }
++ mopt->saw_low_prio = 0;
++ }
++
++ mptcp_data_ack(sk, skb);
++
++ mptcp_path_array_check(mptcp_meta_sk(sk));
++ /* Socket may have been mp_killed by a REMOVE_ADDR */
++ if (tp->mp_killed)
++ return 1;
++
++ return 0;
++}
++
++/* In case of fastopen, some data can already be in the write queue.
++ * We need to update the sequence number of the segments as they
++ * were initially TCP sequence numbers.
++ */
++static void mptcp_rcv_synsent_fastopen(struct sock *meta_sk)
++{
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct tcp_sock *master_tp = tcp_sk(meta_tp->mpcb->master_sk);
++ struct sk_buff *skb;
++ u32 new_mapping = meta_tp->write_seq - master_tp->snd_una;
++
++ /* There should only be one skb in the write queue: the data not
++ * acknowledged in the SYN+ACK. In this case, we need to map
++ * this data to data sequence numbers.
++ */
++ skb_queue_walk(&meta_sk->sk_write_queue, skb) {
++ /* If the server only acknowledges partially the data sent in
++ * the SYN, we need to trim the acknowledged part because
++ * we don't want to retransmit this already received data.
++ * When we reach this point, tcp_ack() has already cleaned up
++ * fully acked segments. However, tcp trims partially acked
++ * segments only when retransmitting. Since MPTCP comes into
++ * play only now, we will fake an initial transmit, and
++ * retransmit_skb() will not be called. The following fragment
++ * comes from __tcp_retransmit_skb().
++ */
++ if (before(TCP_SKB_CB(skb)->seq, master_tp->snd_una)) {
++ BUG_ON(before(TCP_SKB_CB(skb)->end_seq,
++ master_tp->snd_una));
++ /* tcp_trim_head() can only return -ENOMEM if the skb is
++ * cloned, which is not the case here (see
++ * tcp_send_syn_data()).
++ */
++ BUG_ON(tcp_trim_head(meta_sk, skb, master_tp->snd_una -
++ TCP_SKB_CB(skb)->seq));
++ }
++
++ TCP_SKB_CB(skb)->seq += new_mapping;
++ TCP_SKB_CB(skb)->end_seq += new_mapping;
++ }
++
++ /* We can advance write_seq by the number of bytes unacknowledged
++ * and that were mapped in the previous loop.
++ */
++ meta_tp->write_seq += master_tp->write_seq - master_tp->snd_una;
++
++ /* The packets from the master_sk will be handed over to it later.
++ * Until that time, its write queue is empty, and
++ * write_seq must align with snd_una.
++ */
++ master_tp->snd_nxt = master_tp->write_seq = master_tp->snd_una;
++ master_tp->packets_out = 0;
++
++ /* Although this data has already been sent over the subsk,
++ * it has never been sent over the meta_sk, so we rewind
++ * the send_head so that tcp considers it an initial send
++ * (instead of a retransmit).
++ */
++ meta_sk->sk_send_head = tcp_write_queue_head(meta_sk);
++}
++
++/* The skptr is needed, because if we become MPTCP-capable, we have to switch
++ * from meta-socket to master-socket.
++ *
++ * @return: 1 - we want to reset this connection
++ * 2 - we want to discard the received syn/ack
++ * 0 - everything is fine - continue
++ */
++int mptcp_rcv_synsent_state_process(struct sock *sk, struct sock **skptr,
++ const struct sk_buff *skb,
++ const struct mptcp_options_received *mopt)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++
++ if (mptcp(tp)) {
++ u8 hash_mac_check[20];
++ struct mptcp_cb *mpcb = tp->mpcb;
++
++ mptcp_hmac_sha1((u8 *)&mpcb->mptcp_rem_key,
++ (u8 *)&mpcb->mptcp_loc_key,
++ (u8 *)&tp->mptcp->rx_opt.mptcp_recv_nonce,
++ (u8 *)&tp->mptcp->mptcp_loc_nonce,
++ (u32 *)hash_mac_check);
++ if (memcmp(hash_mac_check,
++ (char *)&tp->mptcp->rx_opt.mptcp_recv_tmac, 8)) {
++ mptcp_sub_force_close(sk);
++ return 1;
++ }
++
++ /* Set this flag in order to postpone data sending
++ * until the 4th ack arrives.
++ */
++ tp->mptcp->pre_established = 1;
++ tp->mptcp->rcv_low_prio = tp->mptcp->rx_opt.low_prio;
++
++ mptcp_hmac_sha1((u8 *)&mpcb->mptcp_loc_key,
++ (u8 *)&mpcb->mptcp_rem_key,
++ (u8 *)&tp->mptcp->mptcp_loc_nonce,
++ (u8 *)&tp->mptcp->rx_opt.mptcp_recv_nonce,
++ (u32 *)&tp->mptcp->sender_mac[0]);
++
++ } else if (mopt->saw_mpc) {
++ struct sock *meta_sk = sk;
++
++ if (mptcp_create_master_sk(sk, mopt->mptcp_key,
++ ntohs(tcp_hdr(skb)->window)))
++ return 2;
++
++ sk = tcp_sk(sk)->mpcb->master_sk;
++ *skptr = sk;
++ tp = tcp_sk(sk);
++
++ /* If fastopen was used data might be in the send queue. We
++ * need to update their sequence number to MPTCP-level seqno.
++ * Note that it can happen in rare cases that fastopen_req is
++ * NULL and syn_data is 0 but fastopen indeed occurred and
++ * data has been queued in the write queue (but not sent).
++ * Example of such rare cases: connect is non-blocking and
++ * TFO is configured to work without cookies.
++ */
++ if (!skb_queue_empty(&meta_sk->sk_write_queue))
++ mptcp_rcv_synsent_fastopen(meta_sk);
++
++ /* -1, because the SYN consumed 1 byte. In case of TFO, we
++ * start the subflow-sequence number as if the data of the SYN
++ * is not part of any mapping.
++ */
++ tp->mptcp->snt_isn = tp->snd_una - 1;
++ tp->mpcb->dss_csum = mopt->dss_csum;
++ tp->mptcp->include_mpc = 1;
++
++ /* Ensure that fastopen is handled at the meta-level. */
++ tp->fastopen_req = NULL;
++
++ sk_set_socket(sk, mptcp_meta_sk(sk)->sk_socket);
++ sk->sk_wq = mptcp_meta_sk(sk)->sk_wq;
++
++ /* hold in sk_clone_lock due to initialization to 2 */
++ sock_put(sk);
++ } else {
++ tp->request_mptcp = 0;
++
++ if (tp->inside_tk_table)
++ mptcp_hash_remove(tp);
++ }
++
++ if (mptcp(tp))
++ tp->mptcp->rcv_isn = TCP_SKB_CB(skb)->seq;
++
++ return 0;
++}
++
++bool mptcp_should_expand_sndbuf(const struct sock *sk)
++{
++ const struct sock *sk_it;
++ const struct sock *meta_sk = mptcp_meta_sk(sk);
++ const struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ int cnt_backups = 0;
++ int backup_available = 0;
++
++ /* We circumvent this check in tcp_check_space, because we want to
++ * always call sk_write_space. So, we reproduce the check here.
++ */
++ if (!meta_sk->sk_socket ||
++ !test_bit(SOCK_NOSPACE, &meta_sk->sk_socket->flags))
++ return false;
++
++ /* If the user specified a specific send buffer setting, do
++ * not modify it.
++ */
++ if (meta_sk->sk_userlocks & SOCK_SNDBUF_LOCK)
++ return false;
++
++ /* If we are under global TCP memory pressure, do not expand. */
++ if (sk_under_memory_pressure(meta_sk))
++ return false;
++
++ /* If we are under soft global TCP memory pressure, do not expand. */
++ if (sk_memory_allocated(meta_sk) >= sk_prot_mem_limits(meta_sk, 0))
++ return false;
++
++
++ /* For MPTCP we look for a subsocket that could send data.
++ * If we found one, then we update the send-buffer.
++ */
++ mptcp_for_each_sk(meta_tp->mpcb, sk_it) {
++ struct tcp_sock *tp_it = tcp_sk(sk_it);
++
++ if (!mptcp_sk_can_send(sk_it))
++ continue;
++
++ /* Backup-flows have to be counted - if there is no other
++ * subflow we take the backup-flow into account.
++ */
++ if (tp_it->mptcp->rcv_low_prio || tp_it->mptcp->low_prio)
++ cnt_backups++;
++
++ if (tp_it->packets_out < tp_it->snd_cwnd) {
++ if (tp_it->mptcp->rcv_low_prio || tp_it->mptcp->low_prio) {
++ backup_available = 1;
++ continue;
++ }
++ return true;
++ }
++ }
++
++ /* Backup-flow is available for sending - update send-buffer */
++ if (meta_tp->mpcb->cnt_established == cnt_backups && backup_available)
++ return true;
++ return false;
++}
++
++void mptcp_init_buffer_space(struct sock *sk)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct sock *meta_sk = mptcp_meta_sk(sk);
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ int space;
++
++ tcp_init_buffer_space(sk);
++
++ if (is_master_tp(tp)) {
++ meta_tp->rcvq_space.space = meta_tp->rcv_wnd;
++ meta_tp->rcvq_space.time = tcp_time_stamp;
++ meta_tp->rcvq_space.seq = meta_tp->copied_seq;
++
++ /* If there is only one subflow, we just use regular TCP
++ * autotuning. User-locks are handled already by
++ * tcp_init_buffer_space
++ */
++ meta_tp->window_clamp = tp->window_clamp;
++ meta_tp->rcv_ssthresh = tp->rcv_ssthresh;
++ meta_sk->sk_rcvbuf = sk->sk_rcvbuf;
++ meta_sk->sk_sndbuf = sk->sk_sndbuf;
++
++ return;
++ }
++
++ if (meta_sk->sk_userlocks & SOCK_RCVBUF_LOCK)
++ goto snd_buf;
++
++ /* Adding a new subflow to the rcv-buffer space. We make a simple
++ * addition, to give some space to allow traffic on the new subflow.
++ * Autotuning will increase it further later on.
++ */
++ space = min(meta_sk->sk_rcvbuf + sk->sk_rcvbuf, sysctl_tcp_rmem[2]);
++ if (space > meta_sk->sk_rcvbuf) {
++ meta_tp->window_clamp += tp->window_clamp;
++ meta_tp->rcv_ssthresh += tp->rcv_ssthresh;
++ meta_sk->sk_rcvbuf = space;
++ }
++
++snd_buf:
++ if (meta_sk->sk_userlocks & SOCK_SNDBUF_LOCK)
++ return;
++
++ /* Adding a new subflow to the send-buffer space. We make a simple
++ * addition, to give some space to allow traffic on the new subflow.
++ * Autotuning will increase it further later on.
++ */
++ space = min(meta_sk->sk_sndbuf + sk->sk_sndbuf, sysctl_tcp_wmem[2]);
++ if (space > meta_sk->sk_sndbuf) {
++ meta_sk->sk_sndbuf = space;
++ meta_sk->sk_write_space(meta_sk);
++ }
++}
++
++void mptcp_tcp_set_rto(struct sock *sk)
++{
++ tcp_set_rto(sk);
++ mptcp_set_rto(sk);
++}
+diff --git a/net/mptcp/mptcp_ipv4.c b/net/mptcp/mptcp_ipv4.c
+new file mode 100644
+index 000000000000..1183d1305d35
+--- /dev/null
++++ b/net/mptcp/mptcp_ipv4.c
+@@ -0,0 +1,483 @@
++/*
++ * MPTCP implementation - IPv4-specific functions
++ *
++ * Initial Design & Implementation:
++ * Sébastien Barré <sebastien.barre@uclouvain.be>
++ *
++ * Current Maintainer:
++ * Christoph Paasch <christoph.paasch@uclouvain.be>
++ *
++ * Additional authors:
++ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
++ * Gregory Detal <gregory.detal@uclouvain.be>
++ * Fabien Duchêne <fabien.duchene@uclouvain.be>
++ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
++ * Lavkesh Lahngir <lavkesh51@gmail.com>
++ * Andreas Ripke <ripke@neclab.eu>
++ * Vlad Dogaru <vlad.dogaru@intel.com>
++ * Octavian Purdila <octavian.purdila@intel.com>
++ * John Ronan <jronan@tssg.org>
++ * Catalin Nicutar <catalin.nicutar@gmail.com>
++ * Brandon Heller <brandonh@stanford.edu>
++ *
++ *
++ * This program is free software; you can redistribute it and/or
++ * modify it under the terms of the GNU General Public License
++ * as published by the Free Software Foundation; either version
++ * 2 of the License, or (at your option) any later version.
++ */
++
++#include <linux/export.h>
++#include <linux/ip.h>
++#include <linux/list.h>
++#include <linux/skbuff.h>
++#include <linux/spinlock.h>
++#include <linux/tcp.h>
++
++#include <net/inet_common.h>
++#include <net/inet_connection_sock.h>
++#include <net/mptcp.h>
++#include <net/mptcp_v4.h>
++#include <net/request_sock.h>
++#include <net/tcp.h>
++
++u32 mptcp_v4_get_nonce(__be32 saddr, __be32 daddr, __be16 sport, __be16 dport)
++{
++ u32 hash[MD5_DIGEST_WORDS];
++
++ hash[0] = (__force u32)saddr;
++ hash[1] = (__force u32)daddr;
++ hash[2] = ((__force u16)sport << 16) + (__force u16)dport;
++ hash[3] = mptcp_seed++;
++
++ md5_transform(hash, mptcp_secret);
++
++ return hash[0];
++}
++
++u64 mptcp_v4_get_key(__be32 saddr, __be32 daddr, __be16 sport, __be16 dport)
++{
++ u32 hash[MD5_DIGEST_WORDS];
++
++ hash[0] = (__force u32)saddr;
++ hash[1] = (__force u32)daddr;
++ hash[2] = ((__force u16)sport << 16) + (__force u16)dport;
++ hash[3] = mptcp_seed++;
++
++ md5_transform(hash, mptcp_secret);
++
++ return *((u64 *)hash);
++}
++
++
++static void mptcp_v4_reqsk_destructor(struct request_sock *req)
++{
++ mptcp_reqsk_destructor(req);
++
++ tcp_v4_reqsk_destructor(req);
++}
++
++static int mptcp_v4_init_req(struct request_sock *req, struct sock *sk,
++ struct sk_buff *skb)
++{
++ tcp_request_sock_ipv4_ops.init_req(req, sk, skb);
++ mptcp_reqsk_init(req, skb);
++
++ return 0;
++}
++
++static int mptcp_v4_join_init_req(struct request_sock *req, struct sock *sk,
++ struct sk_buff *skb)
++{
++ struct mptcp_request_sock *mtreq = mptcp_rsk(req);
++ struct mptcp_cb *mpcb = tcp_sk(sk)->mpcb;
++ union inet_addr addr;
++ int loc_id;
++ bool low_prio = false;
++
++ /* We need to do this as early as possible. Because, if we fail later
++ * (e.g., get_local_id), then reqsk_free tries to remove the
++ * request-socket from the htb in mptcp_hash_request_remove as pprev
++ * may be different from NULL.
++ */
++ mtreq->hash_entry.pprev = NULL;
++
++ tcp_request_sock_ipv4_ops.init_req(req, sk, skb);
++
++ mtreq->mptcp_loc_nonce = mptcp_v4_get_nonce(ip_hdr(skb)->saddr,
++ ip_hdr(skb)->daddr,
++ tcp_hdr(skb)->source,
++ tcp_hdr(skb)->dest);
++ addr.ip = inet_rsk(req)->ir_loc_addr;
++ loc_id = mpcb->pm_ops->get_local_id(AF_INET, &addr, sock_net(sk), &low_prio);
++ if (loc_id == -1)
++ return -1;
++ mtreq->loc_id = loc_id;
++ mtreq->low_prio = low_prio;
++
++ mptcp_join_reqsk_init(mpcb, req, skb);
++
++ return 0;
++}
++
++/* Similar to tcp_request_sock_ops */
++struct request_sock_ops mptcp_request_sock_ops __read_mostly = {
++ .family = PF_INET,
++ .obj_size = sizeof(struct mptcp_request_sock),
++ .rtx_syn_ack = tcp_rtx_synack,
++ .send_ack = tcp_v4_reqsk_send_ack,
++ .destructor = mptcp_v4_reqsk_destructor,
++ .send_reset = tcp_v4_send_reset,
++ .syn_ack_timeout = tcp_syn_ack_timeout,
++};
++
++static void mptcp_v4_reqsk_queue_hash_add(struct sock *meta_sk,
++ struct request_sock *req,
++ const unsigned long timeout)
++{
++ const u32 h1 = inet_synq_hash(inet_rsk(req)->ir_rmt_addr,
++ inet_rsk(req)->ir_rmt_port,
++ 0, MPTCP_HASH_SIZE);
++ /* We cannot call inet_csk_reqsk_queue_hash_add(), because we do not
++ * want to reset the keepalive-timer (responsible for retransmitting
++ * SYN/ACKs). We do not retransmit SYN/ACKs+MP_JOINs, because we cannot
++ * overload the keepalive timer. Also, it's not a big deal, because the
++ * third ACK of the MP_JOIN-handshake is sent in a reliable manner. So,
++ * if the third ACK gets lost, the client will handle the retransmission
++ * anyways. If our SYN/ACK gets lost, the client will retransmit the
++ * SYN.
++ */
++ struct inet_connection_sock *meta_icsk = inet_csk(meta_sk);
++ struct listen_sock *lopt = meta_icsk->icsk_accept_queue.listen_opt;
++ const u32 h2 = inet_synq_hash(inet_rsk(req)->ir_rmt_addr,
++ inet_rsk(req)->ir_rmt_port,
++ lopt->hash_rnd, lopt->nr_table_entries);
++
++ reqsk_queue_hash_req(&meta_icsk->icsk_accept_queue, h2, req, timeout);
++ if (reqsk_queue_added(&meta_icsk->icsk_accept_queue) == 0)
++ mptcp_reset_synack_timer(meta_sk, timeout);
++
++ rcu_read_lock();
++ spin_lock(&mptcp_reqsk_hlock);
++ hlist_nulls_add_head_rcu(&mptcp_rsk(req)->hash_entry, &mptcp_reqsk_htb[h1]);
++ spin_unlock(&mptcp_reqsk_hlock);
++ rcu_read_unlock();
++}
++
++/* Similar to tcp_v4_conn_request */
++static int mptcp_v4_join_request(struct sock *meta_sk, struct sk_buff *skb)
++{
++ return tcp_conn_request(&mptcp_request_sock_ops,
++ &mptcp_join_request_sock_ipv4_ops,
++ meta_sk, skb);
++}
++
++/* We only process join requests here. (either the SYN or the final ACK) */
++int mptcp_v4_do_rcv(struct sock *meta_sk, struct sk_buff *skb)
++{
++ const struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct sock *child, *rsk = NULL;
++ int ret;
++
++ if (!(TCP_SKB_CB(skb)->mptcp_flags & MPTCPHDR_JOIN)) {
++ struct tcphdr *th = tcp_hdr(skb);
++ const struct iphdr *iph = ip_hdr(skb);
++ struct sock *sk;
++
++ sk = inet_lookup_established(sock_net(meta_sk), &tcp_hashinfo,
++ iph->saddr, th->source, iph->daddr,
++ th->dest, inet_iif(skb));
++
++ if (!sk) {
++ kfree_skb(skb);
++ return 0;
++ }
++ if (is_meta_sk(sk)) {
++ WARN("%s Did not find a sub-sk - found the meta instead!\n", __func__);
++ kfree_skb(skb);
++ sock_put(sk);
++ return 0;
++ }
++
++ if (sk->sk_state == TCP_TIME_WAIT) {
++ inet_twsk_put(inet_twsk(sk));
++ kfree_skb(skb);
++ return 0;
++ }
++
++ ret = tcp_v4_do_rcv(sk, skb);
++ sock_put(sk);
++
++ return ret;
++ }
++ TCP_SKB_CB(skb)->mptcp_flags = 0;
++
++ /* Has been removed from the tk-table. Thus, no new subflows.
++ *
++ * Check for close-state is necessary, because we may have been closed
++ * without passing by mptcp_close().
++ *
++ * When falling back, no new subflows are allowed either.
++ */
++ if (meta_sk->sk_state == TCP_CLOSE || !tcp_sk(meta_sk)->inside_tk_table ||
++ mpcb->infinite_mapping_rcv || mpcb->send_infinite_mapping)
++ goto reset_and_discard;
++
++ child = tcp_v4_hnd_req(meta_sk, skb);
++
++ if (!child)
++ goto discard;
++
++ if (child != meta_sk) {
++ sock_rps_save_rxhash(child, skb);
++ /* We don't call tcp_child_process here, because we already
++ * hold the meta-sk-lock and are sure that it is not owned
++ * by the user.
++ */
++ ret = tcp_rcv_state_process(child, skb, tcp_hdr(skb), skb->len);
++ bh_unlock_sock(child);
++ sock_put(child);
++ if (ret) {
++ rsk = child;
++ goto reset_and_discard;
++ }
++ } else {
++ if (tcp_hdr(skb)->syn) {
++ mptcp_v4_join_request(meta_sk, skb);
++ goto discard;
++ }
++ goto reset_and_discard;
++ }
++ return 0;
++
++reset_and_discard:
++ if (reqsk_queue_len(&inet_csk(meta_sk)->icsk_accept_queue)) {
++ const struct tcphdr *th = tcp_hdr(skb);
++ const struct iphdr *iph = ip_hdr(skb);
++ struct request_sock **prev, *req;
++ /* If we end up here, it means we should not have matched on the
++ * request-socket. But, because the request-sock queue is only
++ * destroyed in mptcp_close, the socket may actually already be
++ * in close-state (e.g., through shutdown()) while still having
++ * pending request sockets.
++ */
++ req = inet_csk_search_req(meta_sk, &prev, th->source,
++ iph->saddr, iph->daddr);
++ if (req) {
++ inet_csk_reqsk_queue_unlink(meta_sk, req, prev);
++ reqsk_queue_removed(&inet_csk(meta_sk)->icsk_accept_queue,
++ req);
++ reqsk_free(req);
++ }
++ }
++
++ tcp_v4_send_reset(rsk, skb);
++discard:
++ kfree_skb(skb);
++ return 0;
++}
++
++/* After this, the ref count of the meta_sk associated with the request_sock
++ * is incremented. Thus it is the responsibility of the caller
++ * to call sock_put() when the reference is not needed anymore.
++ */
++struct sock *mptcp_v4_search_req(const __be16 rport, const __be32 raddr,
++ const __be32 laddr, const struct net *net)
++{
++ const struct mptcp_request_sock *mtreq;
++ struct sock *meta_sk = NULL;
++ const struct hlist_nulls_node *node;
++ const u32 hash = inet_synq_hash(raddr, rport, 0, MPTCP_HASH_SIZE);
++
++ rcu_read_lock();
++begin:
++ hlist_nulls_for_each_entry_rcu(mtreq, node, &mptcp_reqsk_htb[hash],
++ hash_entry) {
++ struct inet_request_sock *ireq = inet_rsk(rev_mptcp_rsk(mtreq));
++ meta_sk = mtreq->mptcp_mpcb->meta_sk;
++
++ if (ireq->ir_rmt_port == rport &&
++ ireq->ir_rmt_addr == raddr &&
++ ireq->ir_loc_addr == laddr &&
++ rev_mptcp_rsk(mtreq)->rsk_ops->family == AF_INET &&
++ net_eq(net, sock_net(meta_sk)))
++ goto found;
++ meta_sk = NULL;
++ }
++ /* A request-socket is destroyed by RCU. So, it might have been recycled
++ * and put into another hash-table list. So, after the lookup we may
++ * end up in a different list. So, we may need to restart.
++ *
++ * See also the comment in __inet_lookup_established.
++ */
++ if (get_nulls_value(node) != hash + MPTCP_REQSK_NULLS_BASE)
++ goto begin;
++
++found:
++ if (meta_sk && unlikely(!atomic_inc_not_zero(&meta_sk->sk_refcnt)))
++ meta_sk = NULL;
++ rcu_read_unlock();
++
++ return meta_sk;
++}
++
++/* Create a new IPv4 subflow.
++ *
++ * We are in user-context and the meta-sock-lock is held.
++ */
++int mptcp_init4_subsockets(struct sock *meta_sk, const struct mptcp_loc4 *loc,
++ struct mptcp_rem4 *rem)
++{
++ struct tcp_sock *tp;
++ struct sock *sk;
++ struct sockaddr_in loc_in, rem_in;
++ struct socket sock;
++ int ret;
++
++ /** First, create and prepare the new socket */
++
++ sock.type = meta_sk->sk_socket->type;
++ sock.state = SS_UNCONNECTED;
++ sock.wq = meta_sk->sk_socket->wq;
++ sock.file = meta_sk->sk_socket->file;
++ sock.ops = NULL;
++
++ ret = inet_create(sock_net(meta_sk), &sock, IPPROTO_TCP, 1);
++ if (unlikely(ret < 0)) {
++ mptcp_debug("%s inet_create failed ret: %d\n", __func__, ret);
++ return ret;
++ }
++
++ sk = sock.sk;
++ tp = tcp_sk(sk);
++
++ /* All subsockets need the MPTCP-lock-class */
++ lockdep_set_class_and_name(&(sk)->sk_lock.slock, &meta_slock_key, "slock-AF_INET-MPTCP");
++ lockdep_init_map(&(sk)->sk_lock.dep_map, "sk_lock-AF_INET-MPTCP", &meta_key, 0);
++
++ if (mptcp_add_sock(meta_sk, sk, loc->loc4_id, rem->rem4_id, GFP_KERNEL))
++ goto error;
++
++ tp->mptcp->slave_sk = 1;
++ tp->mptcp->low_prio = loc->low_prio;
++
++ /* Initializing the timer for an MPTCP subflow */
++ setup_timer(&tp->mptcp->mptcp_ack_timer, mptcp_ack_handler, (unsigned long)sk);
++
++ /** Then, connect the socket to the peer */
++ loc_in.sin_family = AF_INET;
++ rem_in.sin_family = AF_INET;
++ loc_in.sin_port = 0;
++ if (rem->port)
++ rem_in.sin_port = rem->port;
++ else
++ rem_in.sin_port = inet_sk(meta_sk)->inet_dport;
++ loc_in.sin_addr = loc->addr;
++ rem_in.sin_addr = rem->addr;
++
++ ret = sock.ops->bind(&sock, (struct sockaddr *)&loc_in, sizeof(struct sockaddr_in));
++ if (ret < 0) {
++ mptcp_debug("%s: MPTCP subsocket bind() failed, error %d\n",
++ __func__, ret);
++ goto error;
++ }
++
++ mptcp_debug("%s: token %#x pi %d src_addr:%pI4:%d dst_addr:%pI4:%d\n",
++ __func__, tcp_sk(meta_sk)->mpcb->mptcp_loc_token,
++ tp->mptcp->path_index, &loc_in.sin_addr,
++ ntohs(loc_in.sin_port), &rem_in.sin_addr,
++ ntohs(rem_in.sin_port));
++
++ if (tcp_sk(meta_sk)->mpcb->pm_ops->init_subsocket_v4)
++ tcp_sk(meta_sk)->mpcb->pm_ops->init_subsocket_v4(sk, rem->addr);
++
++ ret = sock.ops->connect(&sock, (struct sockaddr *)&rem_in,
++ sizeof(struct sockaddr_in), O_NONBLOCK);
++ if (ret < 0 && ret != -EINPROGRESS) {
++ mptcp_debug("%s: MPTCP subsocket connect() failed, error %d\n",
++ __func__, ret);
++ goto error;
++ }
++
++ sk_set_socket(sk, meta_sk->sk_socket);
++ sk->sk_wq = meta_sk->sk_wq;
++
++ return 0;
++
++error:
++ /* May happen if mptcp_add_sock fails first */
++ if (!mptcp(tp)) {
++ tcp_close(sk, 0);
++ } else {
++ local_bh_disable();
++ mptcp_sub_force_close(sk);
++ local_bh_enable();
++ }
++ return ret;
++}
++EXPORT_SYMBOL(mptcp_init4_subsockets);
++
++const struct inet_connection_sock_af_ops mptcp_v4_specific = {
++ .queue_xmit = ip_queue_xmit,
++ .send_check = tcp_v4_send_check,
++ .rebuild_header = inet_sk_rebuild_header,
++ .sk_rx_dst_set = inet_sk_rx_dst_set,
++ .conn_request = mptcp_conn_request,
++ .syn_recv_sock = tcp_v4_syn_recv_sock,
++ .net_header_len = sizeof(struct iphdr),
++ .setsockopt = ip_setsockopt,
++ .getsockopt = ip_getsockopt,
++ .addr2sockaddr = inet_csk_addr2sockaddr,
++ .sockaddr_len = sizeof(struct sockaddr_in),
++ .bind_conflict = inet_csk_bind_conflict,
++#ifdef CONFIG_COMPAT
++ .compat_setsockopt = compat_ip_setsockopt,
++ .compat_getsockopt = compat_ip_getsockopt,
++#endif
++};
++
++struct tcp_request_sock_ops mptcp_request_sock_ipv4_ops;
++struct tcp_request_sock_ops mptcp_join_request_sock_ipv4_ops;
++
++/* General initialization of IPv4 for MPTCP */
++int mptcp_pm_v4_init(void)
++{
++ int ret = 0;
++ struct request_sock_ops *ops = &mptcp_request_sock_ops;
++
++ mptcp_request_sock_ipv4_ops = tcp_request_sock_ipv4_ops;
++ mptcp_request_sock_ipv4_ops.init_req = mptcp_v4_init_req;
++
++ mptcp_join_request_sock_ipv4_ops = tcp_request_sock_ipv4_ops;
++ mptcp_join_request_sock_ipv4_ops.init_req = mptcp_v4_join_init_req;
++ mptcp_join_request_sock_ipv4_ops.queue_hash_add = mptcp_v4_reqsk_queue_hash_add;
++
++ ops->slab_name = kasprintf(GFP_KERNEL, "request_sock_%s", "MPTCP");
++ if (ops->slab_name == NULL) {
++ ret = -ENOMEM;
++ goto out;
++ }
++
++ ops->slab = kmem_cache_create(ops->slab_name, ops->obj_size, 0,
++ SLAB_DESTROY_BY_RCU|SLAB_HWCACHE_ALIGN,
++ NULL);
++
++ if (ops->slab == NULL) {
++ ret = -ENOMEM;
++ goto err_reqsk_create;
++ }
++
++out:
++ return ret;
++
++err_reqsk_create:
++ kfree(ops->slab_name);
++ ops->slab_name = NULL;
++ goto out;
++}
++
++void mptcp_pm_v4_undo(void)
++{
++ kmem_cache_destroy(mptcp_request_sock_ops.slab);
++ kfree(mptcp_request_sock_ops.slab_name);
++}
+diff --git a/net/mptcp/mptcp_ipv6.c b/net/mptcp/mptcp_ipv6.c
+new file mode 100644
+index 000000000000..1036973aa855
+--- /dev/null
++++ b/net/mptcp/mptcp_ipv6.c
+@@ -0,0 +1,518 @@
++/*
++ * MPTCP implementation - IPv6-specific functions
++ *
++ * Initial Design & Implementation:
++ * Sébastien Barré <sebastien.barre@uclouvain.be>
++ *
++ * Current Maintainer:
++ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
++ *
++ * Additional authors:
++ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
++ * Gregory Detal <gregory.detal@uclouvain.be>
++ * Fabien Duchêne <fabien.duchene@uclouvain.be>
++ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
++ * Lavkesh Lahngir <lavkesh51@gmail.com>
++ * Andreas Ripke <ripke@neclab.eu>
++ * Vlad Dogaru <vlad.dogaru@intel.com>
++ * Octavian Purdila <octavian.purdila@intel.com>
++ * John Ronan <jronan@tssg.org>
++ * Catalin Nicutar <catalin.nicutar@gmail.com>
++ * Brandon Heller <brandonh@stanford.edu>
++ *
++ *
++ * This program is free software; you can redistribute it and/or
++ * modify it under the terms of the GNU General Public License
++ * as published by the Free Software Foundation; either version
++ * 2 of the License, or (at your option) any later version.
++ */
++
++#include <linux/export.h>
++#include <linux/in6.h>
++#include <linux/kernel.h>
++
++#include <net/addrconf.h>
++#include <net/flow.h>
++#include <net/inet6_connection_sock.h>
++#include <net/inet6_hashtables.h>
++#include <net/inet_common.h>
++#include <net/ipv6.h>
++#include <net/ip6_checksum.h>
++#include <net/ip6_route.h>
++#include <net/mptcp.h>
++#include <net/mptcp_v6.h>
++#include <net/tcp.h>
++#include <net/transp_v6.h>
++
++__u32 mptcp_v6_get_nonce(const __be32 *saddr, const __be32 *daddr,
++ __be16 sport, __be16 dport)
++{
++ u32 secret[MD5_MESSAGE_BYTES / 4];
++ u32 hash[MD5_DIGEST_WORDS];
++ u32 i;
++
++ memcpy(hash, saddr, 16);
++ for (i = 0; i < 4; i++)
++ secret[i] = mptcp_secret[i] + (__force u32)daddr[i];
++ secret[4] = mptcp_secret[4] +
++ (((__force u16)sport << 16) + (__force u16)dport);
++ secret[5] = mptcp_seed++;
++ for (i = 6; i < MD5_MESSAGE_BYTES / 4; i++)
++ secret[i] = mptcp_secret[i];
++
++ md5_transform(hash, secret);
++
++ return hash[0];
++}
++
++u64 mptcp_v6_get_key(const __be32 *saddr, const __be32 *daddr,
++ __be16 sport, __be16 dport)
++{
++ u32 secret[MD5_MESSAGE_BYTES / 4];
++ u32 hash[MD5_DIGEST_WORDS];
++ u32 i;
++
++ memcpy(hash, saddr, 16);
++ for (i = 0; i < 4; i++)
++ secret[i] = mptcp_secret[i] + (__force u32)daddr[i];
++ secret[4] = mptcp_secret[4] +
++ (((__force u16)sport << 16) + (__force u16)dport);
++ secret[5] = mptcp_seed++;
++ for (i = 6; i < MD5_MESSAGE_BYTES / 4; i++)
++ secret[i] = mptcp_secret[i];
++
++ md5_transform(hash, secret);
++
++ return *((u64 *)hash);
++}
++
++static void mptcp_v6_reqsk_destructor(struct request_sock *req)
++{
++ mptcp_reqsk_destructor(req);
++
++ tcp_v6_reqsk_destructor(req);
++}
++
++static int mptcp_v6_init_req(struct request_sock *req, struct sock *sk,
++ struct sk_buff *skb)
++{
++ tcp_request_sock_ipv6_ops.init_req(req, sk, skb);
++ mptcp_reqsk_init(req, skb);
++
++ return 0;
++}
++
++static int mptcp_v6_join_init_req(struct request_sock *req, struct sock *sk,
++ struct sk_buff *skb)
++{
++ struct mptcp_request_sock *mtreq = mptcp_rsk(req);
++ struct mptcp_cb *mpcb = tcp_sk(sk)->mpcb;
++ union inet_addr addr;
++ int loc_id;
++ bool low_prio = false;
++
++ /* We need to do this as early as possible. Because, if we fail later
++ * (e.g., get_local_id), then reqsk_free tries to remove the
++ * request-socket from the htb in mptcp_hash_request_remove as pprev
++ * may be different from NULL.
++ */
++ mtreq->hash_entry.pprev = NULL;
++
++ tcp_request_sock_ipv6_ops.init_req(req, sk, skb);
++
++ mtreq->mptcp_loc_nonce = mptcp_v6_get_nonce(ipv6_hdr(skb)->saddr.s6_addr32,
++ ipv6_hdr(skb)->daddr.s6_addr32,
++ tcp_hdr(skb)->source,
++ tcp_hdr(skb)->dest);
++ addr.in6 = inet_rsk(req)->ir_v6_loc_addr;
++ loc_id = mpcb->pm_ops->get_local_id(AF_INET6, &addr, sock_net(sk), &low_prio);
++ if (loc_id == -1)
++ return -1;
++ mtreq->loc_id = loc_id;
++ mtreq->low_prio = low_prio;
++
++ mptcp_join_reqsk_init(mpcb, req, skb);
++
++ return 0;
++}
++
++/* Similar to tcp6_request_sock_ops */
++struct request_sock_ops mptcp6_request_sock_ops __read_mostly = {
++ .family = AF_INET6,
++ .obj_size = sizeof(struct mptcp_request_sock),
++ .rtx_syn_ack = tcp_v6_rtx_synack,
++ .send_ack = tcp_v6_reqsk_send_ack,
++ .destructor = mptcp_v6_reqsk_destructor,
++ .send_reset = tcp_v6_send_reset,
++ .syn_ack_timeout = tcp_syn_ack_timeout,
++};
++
++static void mptcp_v6_reqsk_queue_hash_add(struct sock *meta_sk,
++ struct request_sock *req,
++ const unsigned long timeout)
++{
++ const u32 h1 = inet6_synq_hash(&inet_rsk(req)->ir_v6_rmt_addr,
++ inet_rsk(req)->ir_rmt_port,
++ 0, MPTCP_HASH_SIZE);
++ /* We cannot call inet6_csk_reqsk_queue_hash_add(), because we do not
++ * want to reset the keepalive-timer (responsible for retransmitting
++ * SYN/ACKs). We do not retransmit SYN/ACKs+MP_JOINs, because we cannot
++ * overload the keepalive timer. Also, it's not a big deal, because the
++ * third ACK of the MP_JOIN-handshake is sent in a reliable manner. So,
++ * if the third ACK gets lost, the client will handle the retransmission
++ * anyways. If our SYN/ACK gets lost, the client will retransmit the
++ * SYN.
++ */
++ struct inet_connection_sock *meta_icsk = inet_csk(meta_sk);
++ struct listen_sock *lopt = meta_icsk->icsk_accept_queue.listen_opt;
++ const u32 h2 = inet6_synq_hash(&inet_rsk(req)->ir_v6_rmt_addr,
++ inet_rsk(req)->ir_rmt_port,
++ lopt->hash_rnd, lopt->nr_table_entries);
++
++ reqsk_queue_hash_req(&meta_icsk->icsk_accept_queue, h2, req, timeout);
++ if (reqsk_queue_added(&meta_icsk->icsk_accept_queue) == 0)
++ mptcp_reset_synack_timer(meta_sk, timeout);
++
++ rcu_read_lock();
++ spin_lock(&mptcp_reqsk_hlock);
++ hlist_nulls_add_head_rcu(&mptcp_rsk(req)->hash_entry, &mptcp_reqsk_htb[h1]);
++ spin_unlock(&mptcp_reqsk_hlock);
++ rcu_read_unlock();
++}
++
++static int mptcp_v6_join_request(struct sock *meta_sk, struct sk_buff *skb)
++{
++ return tcp_conn_request(&mptcp6_request_sock_ops,
++ &mptcp_join_request_sock_ipv6_ops,
++ meta_sk, skb);
++}
++
++int mptcp_v6_do_rcv(struct sock *meta_sk, struct sk_buff *skb)
++{
++ const struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct sock *child, *rsk = NULL;
++ int ret;
++
++ if (!(TCP_SKB_CB(skb)->mptcp_flags & MPTCPHDR_JOIN)) {
++ struct tcphdr *th = tcp_hdr(skb);
++ const struct ipv6hdr *ip6h = ipv6_hdr(skb);
++ struct sock *sk;
++
++ sk = __inet6_lookup_established(sock_net(meta_sk),
++ &tcp_hashinfo,
++ &ip6h->saddr, th->source,
++ &ip6h->daddr, ntohs(th->dest),
++ inet6_iif(skb));
++
++ if (!sk) {
++ kfree_skb(skb);
++ return 0;
++ }
++ if (is_meta_sk(sk)) {
++ WARN("%s Did not find a sub-sk!\n", __func__);
++ kfree_skb(skb);
++ sock_put(sk);
++ return 0;
++ }
++
++ if (sk->sk_state == TCP_TIME_WAIT) {
++ inet_twsk_put(inet_twsk(sk));
++ kfree_skb(skb);
++ return 0;
++ }
++
++ ret = tcp_v6_do_rcv(sk, skb);
++ sock_put(sk);
++
++ return ret;
++ }
++ TCP_SKB_CB(skb)->mptcp_flags = 0;
++
++ /* Has been removed from the tk-table. Thus, no new subflows.
++ *
++ * Check for close-state is necessary, because we may have been closed
++ * without passing by mptcp_close().
++ *
++ * When falling back, no new subflows are allowed either.
++ */
++ if (meta_sk->sk_state == TCP_CLOSE || !tcp_sk(meta_sk)->inside_tk_table ||
++ mpcb->infinite_mapping_rcv || mpcb->send_infinite_mapping)
++ goto reset_and_discard;
++
++ child = tcp_v6_hnd_req(meta_sk, skb);
++
++ if (!child)
++ goto discard;
++
++ if (child != meta_sk) {
++ sock_rps_save_rxhash(child, skb);
++ /* We don't call tcp_child_process here, because we already
++ * hold the meta-sk-lock and are sure that it is not owned
++ * by the user.
++ */
++ ret = tcp_rcv_state_process(child, skb, tcp_hdr(skb), skb->len);
++ bh_unlock_sock(child);
++ sock_put(child);
++ if (ret) {
++ rsk = child;
++ goto reset_and_discard;
++ }
++ } else {
++ if (tcp_hdr(skb)->syn) {
++ mptcp_v6_join_request(meta_sk, skb);
++ goto discard;
++ }
++ goto reset_and_discard;
++ }
++ return 0;
++
++reset_and_discard:
++ if (reqsk_queue_len(&inet_csk(meta_sk)->icsk_accept_queue)) {
++ const struct tcphdr *th = tcp_hdr(skb);
++ struct request_sock **prev, *req;
++ /* If we end up here, it means we should not have matched on the
++ * request-socket. But, because the request-sock queue is only
++ * destroyed in mptcp_close, the socket may actually already be
++ * in close-state (e.g., through shutdown()) while still having
++ * pending request sockets.
++ */
++ req = inet6_csk_search_req(meta_sk, &prev, th->source,
++ &ipv6_hdr(skb)->saddr,
++ &ipv6_hdr(skb)->daddr, inet6_iif(skb));
++ if (req) {
++ inet_csk_reqsk_queue_unlink(meta_sk, req, prev);
++ reqsk_queue_removed(&inet_csk(meta_sk)->icsk_accept_queue,
++ req);
++ reqsk_free(req);
++ }
++ }
++
++ tcp_v6_send_reset(rsk, skb);
++discard:
++ kfree_skb(skb);
++ return 0;
++}
++
++/* After this, the ref count of the meta_sk associated with the request_sock
++ * is incremented. Thus it is the responsibility of the caller
++ * to call sock_put() when the reference is not needed anymore.
++ */
++struct sock *mptcp_v6_search_req(const __be16 rport, const struct in6_addr *raddr,
++ const struct in6_addr *laddr, const struct net *net)
++{
++ const struct mptcp_request_sock *mtreq;
++ struct sock *meta_sk = NULL;
++ const struct hlist_nulls_node *node;
++ const u32 hash = inet6_synq_hash(raddr, rport, 0, MPTCP_HASH_SIZE);
++
++ rcu_read_lock();
++begin:
++ hlist_nulls_for_each_entry_rcu(mtreq, node, &mptcp_reqsk_htb[hash],
++ hash_entry) {
++ struct inet_request_sock *treq = inet_rsk(rev_mptcp_rsk(mtreq));
++ meta_sk = mtreq->mptcp_mpcb->meta_sk;
++
++ if (inet_rsk(rev_mptcp_rsk(mtreq))->ir_rmt_port == rport &&
++ rev_mptcp_rsk(mtreq)->rsk_ops->family == AF_INET6 &&
++ ipv6_addr_equal(&treq->ir_v6_rmt_addr, raddr) &&
++ ipv6_addr_equal(&treq->ir_v6_loc_addr, laddr) &&
++ net_eq(net, sock_net(meta_sk)))
++ goto found;
++ meta_sk = NULL;
++ }
++ /* A request-socket is destroyed by RCU. So, it might have been recycled
++ * and put into another hash-table list. So, after the lookup we may
++ * end up in a different list. So, we may need to restart.
++ *
++ * See also the comment in __inet_lookup_established.
++ */
++ if (get_nulls_value(node) != hash + MPTCP_REQSK_NULLS_BASE)
++ goto begin;
++
++found:
++ if (meta_sk && unlikely(!atomic_inc_not_zero(&meta_sk->sk_refcnt)))
++ meta_sk = NULL;
++ rcu_read_unlock();
++
++ return meta_sk;
++}
++
++/* Create a new IPv6 subflow.
++ *
++ * We are in user-context and the meta-sock-lock is held.
++ */
++int mptcp_init6_subsockets(struct sock *meta_sk, const struct mptcp_loc6 *loc,
++ struct mptcp_rem6 *rem)
++{
++ struct tcp_sock *tp;
++ struct sock *sk;
++ struct sockaddr_in6 loc_in, rem_in;
++ struct socket sock;
++ int ret;
++
++ /** First, create and prepare the new socket */
++
++ sock.type = meta_sk->sk_socket->type;
++ sock.state = SS_UNCONNECTED;
++ sock.wq = meta_sk->sk_socket->wq;
++ sock.file = meta_sk->sk_socket->file;
++ sock.ops = NULL;
++
++ ret = inet6_create(sock_net(meta_sk), &sock, IPPROTO_TCP, 1);
++ if (unlikely(ret < 0)) {
++ mptcp_debug("%s inet6_create failed ret: %d\n", __func__, ret);
++ return ret;
++ }
++
++ sk = sock.sk;
++ tp = tcp_sk(sk);
++
++ /* All subsockets need the MPTCP-lock-class */
++ lockdep_set_class_and_name(&(sk)->sk_lock.slock, &meta_slock_key, "slock-AF_INET-MPTCP");
++ lockdep_init_map(&(sk)->sk_lock.dep_map, "sk_lock-AF_INET-MPTCP", &meta_key, 0);
++
++ if (mptcp_add_sock(meta_sk, sk, loc->loc6_id, rem->rem6_id, GFP_KERNEL))
++ goto error;
++
++ tp->mptcp->slave_sk = 1;
++ tp->mptcp->low_prio = loc->low_prio;
++
++ /* Initializing the timer for an MPTCP subflow */
++ setup_timer(&tp->mptcp->mptcp_ack_timer, mptcp_ack_handler, (unsigned long)sk);
++
++ /** Then, connect the socket to the peer */
++ loc_in.sin6_family = AF_INET6;
++ rem_in.sin6_family = AF_INET6;
++ loc_in.sin6_port = 0;
++ if (rem->port)
++ rem_in.sin6_port = rem->port;
++ else
++ rem_in.sin6_port = inet_sk(meta_sk)->inet_dport;
++ loc_in.sin6_addr = loc->addr;
++ rem_in.sin6_addr = rem->addr;
++
++ ret = sock.ops->bind(&sock, (struct sockaddr *)&loc_in, sizeof(struct sockaddr_in6));
++ if (ret < 0) {
++ mptcp_debug("%s: MPTCP subsocket bind() failed, error %d\n",
++ __func__, ret);
++ goto error;
++ }
++
++ mptcp_debug("%s: token %#x pi %d src_addr:%pI6:%d dst_addr:%pI6:%d\n",
++ __func__, tcp_sk(meta_sk)->mpcb->mptcp_loc_token,
++ tp->mptcp->path_index, &loc_in.sin6_addr,
++ ntohs(loc_in.sin6_port), &rem_in.sin6_addr,
++ ntohs(rem_in.sin6_port));
++
++ if (tcp_sk(meta_sk)->mpcb->pm_ops->init_subsocket_v6)
++ tcp_sk(meta_sk)->mpcb->pm_ops->init_subsocket_v6(sk, rem->addr);
++
++ ret = sock.ops->connect(&sock, (struct sockaddr *)&rem_in,
++ sizeof(struct sockaddr_in6), O_NONBLOCK);
++ if (ret < 0 && ret != -EINPROGRESS) {
++ mptcp_debug("%s: MPTCP subsocket connect() failed, error %d\n",
++ __func__, ret);
++ goto error;
++ }
++
++ sk_set_socket(sk, meta_sk->sk_socket);
++ sk->sk_wq = meta_sk->sk_wq;
++
++ return 0;
++
++error:
++ /* May happen if mptcp_add_sock fails first */
++ if (!mptcp(tp)) {
++ tcp_close(sk, 0);
++ } else {
++ local_bh_disable();
++ mptcp_sub_force_close(sk);
++ local_bh_enable();
++ }
++ return ret;
++}
++EXPORT_SYMBOL(mptcp_init6_subsockets);
++
++const struct inet_connection_sock_af_ops mptcp_v6_specific = {
++ .queue_xmit = inet6_csk_xmit,
++ .send_check = tcp_v6_send_check,
++ .rebuild_header = inet6_sk_rebuild_header,
++ .sk_rx_dst_set = inet6_sk_rx_dst_set,
++ .conn_request = mptcp_conn_request,
++ .syn_recv_sock = tcp_v6_syn_recv_sock,
++ .net_header_len = sizeof(struct ipv6hdr),
++ .net_frag_header_len = sizeof(struct frag_hdr),
++ .setsockopt = ipv6_setsockopt,
++ .getsockopt = ipv6_getsockopt,
++ .addr2sockaddr = inet6_csk_addr2sockaddr,
++ .sockaddr_len = sizeof(struct sockaddr_in6),
++ .bind_conflict = inet6_csk_bind_conflict,
++#ifdef CONFIG_COMPAT
++ .compat_setsockopt = compat_ipv6_setsockopt,
++ .compat_getsockopt = compat_ipv6_getsockopt,
++#endif
++};
++
++const struct inet_connection_sock_af_ops mptcp_v6_mapped = {
++ .queue_xmit = ip_queue_xmit,
++ .send_check = tcp_v4_send_check,
++ .rebuild_header = inet_sk_rebuild_header,
++ .sk_rx_dst_set = inet_sk_rx_dst_set,
++ .conn_request = mptcp_conn_request,
++ .syn_recv_sock = tcp_v6_syn_recv_sock,
++ .net_header_len = sizeof(struct iphdr),
++ .setsockopt = ipv6_setsockopt,
++ .getsockopt = ipv6_getsockopt,
++ .addr2sockaddr = inet6_csk_addr2sockaddr,
++ .sockaddr_len = sizeof(struct sockaddr_in6),
++ .bind_conflict = inet6_csk_bind_conflict,
++#ifdef CONFIG_COMPAT
++ .compat_setsockopt = compat_ipv6_setsockopt,
++ .compat_getsockopt = compat_ipv6_getsockopt,
++#endif
++};
++
++struct tcp_request_sock_ops mptcp_request_sock_ipv6_ops;
++struct tcp_request_sock_ops mptcp_join_request_sock_ipv6_ops;
++
++int mptcp_pm_v6_init(void)
++{
++ int ret = 0;
++ struct request_sock_ops *ops = &mptcp6_request_sock_ops;
++
++ mptcp_request_sock_ipv6_ops = tcp_request_sock_ipv6_ops;
++ mptcp_request_sock_ipv6_ops.init_req = mptcp_v6_init_req;
++
++ mptcp_join_request_sock_ipv6_ops = tcp_request_sock_ipv6_ops;
++ mptcp_join_request_sock_ipv6_ops.init_req = mptcp_v6_join_init_req;
++ mptcp_join_request_sock_ipv6_ops.queue_hash_add = mptcp_v6_reqsk_queue_hash_add;
++
++ ops->slab_name = kasprintf(GFP_KERNEL, "request_sock_%s", "MPTCP6");
++ if (ops->slab_name == NULL) {
++ ret = -ENOMEM;
++ goto out;
++ }
++
++ ops->slab = kmem_cache_create(ops->slab_name, ops->obj_size, 0,
++ SLAB_DESTROY_BY_RCU|SLAB_HWCACHE_ALIGN,
++ NULL);
++
++ if (ops->slab == NULL) {
++ ret = -ENOMEM;
++ goto err_reqsk_create;
++ }
++
++out:
++ return ret;
++
++err_reqsk_create:
++ kfree(ops->slab_name);
++ ops->slab_name = NULL;
++ goto out;
++}
++
++void mptcp_pm_v6_undo(void)
++{
++ kmem_cache_destroy(mptcp6_request_sock_ops.slab);
++ kfree(mptcp6_request_sock_ops.slab_name);
++}
+diff --git a/net/mptcp/mptcp_ndiffports.c b/net/mptcp/mptcp_ndiffports.c
+new file mode 100644
+index 000000000000..6f5087983175
+--- /dev/null
++++ b/net/mptcp/mptcp_ndiffports.c
+@@ -0,0 +1,161 @@
++#include <linux/module.h>
++
++#include <net/mptcp.h>
++#include <net/mptcp_v4.h>
++
++#if IS_ENABLED(CONFIG_IPV6)
++#include <net/mptcp_v6.h>
++#endif
++
++struct ndiffports_priv {
++ /* Worker struct for subflow establishment */
++ struct work_struct subflow_work;
++
++ struct mptcp_cb *mpcb;
++};
++
++static int num_subflows __read_mostly = 2;
++module_param(num_subflows, int, 0644);
++MODULE_PARM_DESC(num_subflows, "choose the number of subflows per MPTCP connection");
++
++/**
++ * Create all new subflows by calling mptcp_initX_subsockets().
++ *
++ * This function uses a goto next_subflow to allow releasing the lock between
++ * the creation of new subflows, giving other processes a chance to do some
++ * work on the socket and potentially finish the communication.
++ **/
++static void create_subflow_worker(struct work_struct *work)
++{
++ const struct ndiffports_priv *pm_priv = container_of(work,
++ struct ndiffports_priv,
++ subflow_work);
++ struct mptcp_cb *mpcb = pm_priv->mpcb;
++ struct sock *meta_sk = mpcb->meta_sk;
++ int iter = 0;
++
++next_subflow:
++ if (iter) {
++ release_sock(meta_sk);
++ mutex_unlock(&mpcb->mpcb_mutex);
++
++ cond_resched();
++ }
++ mutex_lock(&mpcb->mpcb_mutex);
++ lock_sock_nested(meta_sk, SINGLE_DEPTH_NESTING);
++
++ iter++;
++
++ if (sock_flag(meta_sk, SOCK_DEAD))
++ goto exit;
++
++ if (mpcb->master_sk &&
++ !tcp_sk(mpcb->master_sk)->mptcp->fully_established)
++ goto exit;
++
++ if (num_subflows > iter && num_subflows > mpcb->cnt_subflows) {
++ if (meta_sk->sk_family == AF_INET ||
++ mptcp_v6_is_v4_mapped(meta_sk)) {
++ struct mptcp_loc4 loc;
++ struct mptcp_rem4 rem;
++
++ loc.addr.s_addr = inet_sk(meta_sk)->inet_saddr;
++ loc.loc4_id = 0;
++ loc.low_prio = 0;
++
++ rem.addr.s_addr = inet_sk(meta_sk)->inet_daddr;
++ rem.port = inet_sk(meta_sk)->inet_dport;
++ rem.rem4_id = 0; /* Default 0 */
++
++ mptcp_init4_subsockets(meta_sk, &loc, &rem);
++ } else {
++#if IS_ENABLED(CONFIG_IPV6)
++ struct mptcp_loc6 loc;
++ struct mptcp_rem6 rem;
++
++ loc.addr = inet6_sk(meta_sk)->saddr;
++ loc.loc6_id = 0;
++ loc.low_prio = 0;
++
++ rem.addr = meta_sk->sk_v6_daddr;
++ rem.port = inet_sk(meta_sk)->inet_dport;
++ rem.rem6_id = 0; /* Default 0 */
++
++ mptcp_init6_subsockets(meta_sk, &loc, &rem);
++#endif
++ }
++ goto next_subflow;
++ }
++
++exit:
++ release_sock(meta_sk);
++ mutex_unlock(&mpcb->mpcb_mutex);
++ sock_put(meta_sk);
++}
++
++static void ndiffports_new_session(const struct sock *meta_sk)
++{
++ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct ndiffports_priv *fmp = (struct ndiffports_priv *)&mpcb->mptcp_pm[0];
++
++ /* Initialize workqueue-struct */
++ INIT_WORK(&fmp->subflow_work, create_subflow_worker);
++ fmp->mpcb = mpcb;
++}
++
++static void ndiffports_create_subflows(struct sock *meta_sk)
++{
++ const struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct ndiffports_priv *pm_priv = (struct ndiffports_priv *)&mpcb->mptcp_pm[0];
++
++ if (mpcb->infinite_mapping_snd || mpcb->infinite_mapping_rcv ||
++ mpcb->send_infinite_mapping ||
++ mpcb->server_side || sock_flag(meta_sk, SOCK_DEAD))
++ return;
++
++ if (!work_pending(&pm_priv->subflow_work)) {
++ sock_hold(meta_sk);
++ queue_work(mptcp_wq, &pm_priv->subflow_work);
++ }
++}
++
++static int ndiffports_get_local_id(sa_family_t family, union inet_addr *addr,
++ struct net *net, bool *low_prio)
++{
++ return 0;
++}
++
++static struct mptcp_pm_ops ndiffports __read_mostly = {
++ .new_session = ndiffports_new_session,
++ .fully_established = ndiffports_create_subflows,
++ .get_local_id = ndiffports_get_local_id,
++ .name = "ndiffports",
++ .owner = THIS_MODULE,
++};
++
++/* General initialization of MPTCP_PM */
++static int __init ndiffports_register(void)
++{
++ BUILD_BUG_ON(sizeof(struct ndiffports_priv) > MPTCP_PM_SIZE);
++
++ if (mptcp_register_path_manager(&ndiffports))
++ goto exit;
++
++ return 0;
++
++exit:
++ return -1;
++}
++
++static void ndiffports_unregister(void)
++{
++ mptcp_unregister_path_manager(&ndiffports);
++}
++
++module_init(ndiffports_register);
++module_exit(ndiffports_unregister);
++
++MODULE_AUTHOR("Christoph Paasch");
++MODULE_LICENSE("GPL");
++MODULE_DESCRIPTION("NDIFF-PORTS MPTCP");
++MODULE_VERSION("0.88");
+diff --git a/net/mptcp/mptcp_ofo_queue.c b/net/mptcp/mptcp_ofo_queue.c
+new file mode 100644
+index 000000000000..ec4e98622637
+--- /dev/null
++++ b/net/mptcp/mptcp_ofo_queue.c
+@@ -0,0 +1,295 @@
++/*
++ * MPTCP implementation - Fast algorithm for MPTCP meta-reordering
++ *
++ * Initial Design & Implementation:
++ * Sébastien Barré <sebastien.barre@uclouvain.be>
++ *
++ * Current Maintainer & Author:
++ * Christoph Paasch <christoph.paasch@uclouvain.be>
++ *
++ * Additional authors:
++ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
++ * Gregory Detal <gregory.detal@uclouvain.be>
++ * Fabien Duchêne <fabien.duchene@uclouvain.be>
++ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
++ * Lavkesh Lahngir <lavkesh51@gmail.com>
++ * Andreas Ripke <ripke@neclab.eu>
++ * Vlad Dogaru <vlad.dogaru@intel.com>
++ * Octavian Purdila <octavian.purdila@intel.com>
++ * John Ronan <jronan@tssg.org>
++ * Catalin Nicutar <catalin.nicutar@gmail.com>
++ * Brandon Heller <brandonh@stanford.edu>
++ *
++ * This program is free software; you can redistribute it and/or
++ * modify it under the terms of the GNU General Public License
++ * as published by the Free Software Foundation; either version
++ * 2 of the License, or (at your option) any later version.
++ */
++
++#include <linux/skbuff.h>
++#include <linux/slab.h>
++#include <net/tcp.h>
++#include <net/mptcp.h>
++
++void mptcp_remove_shortcuts(const struct mptcp_cb *mpcb,
++ const struct sk_buff *skb)
++{
++ struct tcp_sock *tp;
++
++ mptcp_for_each_tp(mpcb, tp) {
++ if (tp->mptcp->shortcut_ofoqueue == skb) {
++ tp->mptcp->shortcut_ofoqueue = NULL;
++ return;
++ }
++ }
++}
++
++/* Does 'skb' fit after 'here' in the queue 'head'?
++ * If yes, we queue it and return 1.
++ */
++static int mptcp_ofo_queue_after(struct sk_buff_head *head,
++ struct sk_buff *skb, struct sk_buff *here,
++ const struct tcp_sock *tp)
++{
++ struct sock *meta_sk = tp->meta_sk;
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ u32 seq = TCP_SKB_CB(skb)->seq;
++ u32 end_seq = TCP_SKB_CB(skb)->end_seq;
++
++ /* We want to queue skb after here, thus seq >= here->end_seq */
++ if (before(seq, TCP_SKB_CB(here)->end_seq))
++ return 0;
++
++ if (seq == TCP_SKB_CB(here)->end_seq) {
++ bool fragstolen = false;
++
++ if (!tcp_try_coalesce(meta_sk, here, skb, &fragstolen)) {
++ __skb_queue_after(&meta_tp->out_of_order_queue, here, skb);
++ return 1;
++ } else {
++ kfree_skb_partial(skb, fragstolen);
++ return -1;
++ }
++ }
++
++ /* If here is the last one, we can always queue it */
++ if (skb_queue_is_last(head, here)) {
++ __skb_queue_after(head, here, skb);
++ return 1;
++ } else {
++ struct sk_buff *skb1 = skb_queue_next(head, here);
++ /* It's not the last one, but does it fit between 'here' and
++ * the one after 'here'? That is, does end_seq <= after_here->seq?
++ */
++ if (!after(end_seq, TCP_SKB_CB(skb1)->seq)) {
++ __skb_queue_after(head, here, skb);
++ return 1;
++ }
++ }
++
++ return 0;
++}
++
++static void try_shortcut(struct sk_buff *shortcut, struct sk_buff *skb,
++ struct sk_buff_head *head, struct tcp_sock *tp)
++{
++ struct sock *meta_sk = tp->meta_sk;
++ struct tcp_sock *tp_it, *meta_tp = tcp_sk(meta_sk);
++ struct mptcp_cb *mpcb = meta_tp->mpcb;
++ struct sk_buff *skb1, *best_shortcut = NULL;
++ u32 seq = TCP_SKB_CB(skb)->seq;
++ u32 end_seq = TCP_SKB_CB(skb)->end_seq;
++ u32 distance = 0xffffffff;
++
++ /* First, check the tp's shortcut */
++ if (!shortcut) {
++ if (skb_queue_empty(head)) {
++ __skb_queue_head(head, skb);
++ goto end;
++ }
++ } else {
++ int ret = mptcp_ofo_queue_after(head, skb, shortcut, tp);
++ /* Is the tp's shortcut a hit? If yes, we insert. */
++
++ if (ret) {
++ skb = (ret > 0) ? skb : NULL;
++ goto end;
++ }
++ }
++
++ /* Check the shortcuts of the other subsockets. */
++ mptcp_for_each_tp(mpcb, tp_it) {
++ shortcut = tp_it->mptcp->shortcut_ofoqueue;
++ /* Can we queue it here? If yes, do so! */
++ if (shortcut) {
++ int ret = mptcp_ofo_queue_after(head, skb, shortcut, tp);
++
++ if (ret) {
++ skb = (ret > 0) ? skb : NULL;
++ goto end;
++ }
++ }
++
++ /* Could not queue it, check if we are close.
++ * We are looking for a shortcut, close enough to seq to
++ * set skb1 prematurely and thus improve the subsequent lookup,
++ * which tries to find a skb1 so that skb1->seq <= seq.
++ *
++ * So, here we only take shortcuts whose shortcut->seq > seq,
++ * minimize the distance between shortcut->seq and seq, and
++ * set best_shortcut to the one with the minimal distance.
++ *
++ * That way, the subsequent while-loop is as short as possible.
++ */
++ if (shortcut && after(TCP_SKB_CB(shortcut)->seq, seq)) {
++ /* Are we closer than the current best shortcut? */
++ if ((u32)(TCP_SKB_CB(shortcut)->seq - seq) < distance) {
++ distance = (u32)(TCP_SKB_CB(shortcut)->seq - seq);
++ best_shortcut = shortcut;
++ }
++ }
++ }
++
++ if (best_shortcut)
++ skb1 = best_shortcut;
++ else
++ skb1 = skb_peek_tail(head);
++
++ if (seq == TCP_SKB_CB(skb1)->end_seq) {
++ bool fragstolen = false;
++
++ if (!tcp_try_coalesce(meta_sk, skb1, skb, &fragstolen)) {
++ __skb_queue_after(&meta_tp->out_of_order_queue, skb1, skb);
++ } else {
++ kfree_skb_partial(skb, fragstolen);
++ skb = NULL;
++ }
++
++ goto end;
++ }
++
++ /* Find the insertion point, starting from best_shortcut if available.
++ *
++ * Inspired from tcp_data_queue_ofo.
++ */
++ while (1) {
++ /* skb1->seq <= seq */
++ if (!after(TCP_SKB_CB(skb1)->seq, seq))
++ break;
++ if (skb_queue_is_first(head, skb1)) {
++ skb1 = NULL;
++ break;
++ }
++ skb1 = skb_queue_prev(head, skb1);
++ }
++
++ /* Does skb overlap the previous one? */
++ if (skb1 && before(seq, TCP_SKB_CB(skb1)->end_seq)) {
++ if (!after(end_seq, TCP_SKB_CB(skb1)->end_seq)) {
++ /* All the bits are present. */
++ __kfree_skb(skb);
++ skb = NULL;
++ goto end;
++ }
++ if (seq == TCP_SKB_CB(skb1)->seq) {
++ if (skb_queue_is_first(head, skb1))
++ skb1 = NULL;
++ else
++ skb1 = skb_queue_prev(head, skb1);
++ }
++ }
++ if (!skb1)
++ __skb_queue_head(head, skb);
++ else
++ __skb_queue_after(head, skb1, skb);
++
++ /* And clean segments covered by new one as whole. */
++ while (!skb_queue_is_last(head, skb)) {
++ skb1 = skb_queue_next(head, skb);
++
++ if (!after(end_seq, TCP_SKB_CB(skb1)->seq))
++ break;
++
++ __skb_unlink(skb1, head);
++ mptcp_remove_shortcuts(mpcb, skb1);
++ __kfree_skb(skb1);
++ }
++
++end:
++ if (skb) {
++ skb_set_owner_r(skb, meta_sk);
++ tp->mptcp->shortcut_ofoqueue = skb;
++ }
++
++ return;
++}
++
++/**
++ * mptcp_add_meta_ofo_queue - add skb to the meta out-of-order queue
++ * @sk: the subflow that received this skb.
++ */
++void mptcp_add_meta_ofo_queue(const struct sock *meta_sk, struct sk_buff *skb,
++ struct sock *sk)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++
++ try_shortcut(tp->mptcp->shortcut_ofoqueue, skb,
++ &tcp_sk(meta_sk)->out_of_order_queue, tp);
++}
++
++bool mptcp_prune_ofo_queue(struct sock *sk)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ bool res = false;
++
++ if (!skb_queue_empty(&tp->out_of_order_queue)) {
++ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_OFOPRUNED);
++ mptcp_purge_ofo_queue(tp);
++
++ /* No sack at the mptcp-level */
++ sk_mem_reclaim(sk);
++ res = true;
++ }
++
++ return res;
++}
++
++void mptcp_ofo_queue(struct sock *meta_sk)
++{
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct sk_buff *skb;
++
++ while ((skb = skb_peek(&meta_tp->out_of_order_queue)) != NULL) {
++ u32 old_rcv_nxt = meta_tp->rcv_nxt;
++ if (after(TCP_SKB_CB(skb)->seq, meta_tp->rcv_nxt))
++ break;
++
++ if (!after(TCP_SKB_CB(skb)->end_seq, meta_tp->rcv_nxt)) {
++ __skb_unlink(skb, &meta_tp->out_of_order_queue);
++ mptcp_remove_shortcuts(meta_tp->mpcb, skb);
++ __kfree_skb(skb);
++ continue;
++ }
++
++ __skb_unlink(skb, &meta_tp->out_of_order_queue);
++ mptcp_remove_shortcuts(meta_tp->mpcb, skb);
++
++ __skb_queue_tail(&meta_sk->sk_receive_queue, skb);
++ meta_tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq;
++ mptcp_check_rcvseq_wrap(meta_tp, old_rcv_nxt);
++
++ if (tcp_hdr(skb)->fin)
++ mptcp_fin(meta_sk);
++ }
++}
++
++void mptcp_purge_ofo_queue(struct tcp_sock *meta_tp)
++{
++ struct sk_buff_head *head = &meta_tp->out_of_order_queue;
++ struct sk_buff *skb, *tmp;
++
++ skb_queue_walk_safe(head, skb, tmp) {
++ __skb_unlink(skb, head);
++ mptcp_remove_shortcuts(meta_tp->mpcb, skb);
++ kfree_skb(skb);
++ }
++}
+diff --git a/net/mptcp/mptcp_olia.c b/net/mptcp/mptcp_olia.c
+new file mode 100644
+index 000000000000..53f5c43bb488
+--- /dev/null
++++ b/net/mptcp/mptcp_olia.c
+@@ -0,0 +1,311 @@
++/*
++ * MPTCP implementation - OPPORTUNISTIC LINKED INCREASES CONGESTION CONTROL:
++ *
++ * Algorithm design:
++ * Ramin Khalili <ramin.khalili@epfl.ch>
++ * Nicolas Gast <nicolas.gast@epfl.ch>
++ * Jean-Yves Le Boudec <jean-yves.leboudec@epfl.ch>
++ *
++ * Implementation:
++ * Ramin Khalili <ramin.khalili@epfl.ch>
++ *
++ * Ported to the official MPTCP-kernel:
++ * Christoph Paasch <christoph.paasch@uclouvain.be>
++ *
++ * This program is free software; you can redistribute it and/or
++ * modify it under the terms of the GNU General Public License
++ * as published by the Free Software Foundation; either version
++ * 2 of the License, or (at your option) any later version.
++ */
++
++
++#include <net/tcp.h>
++#include <net/mptcp.h>
++
++#include <linux/module.h>
++
++static int scale = 10;
++
++struct mptcp_olia {
++ u32 mptcp_loss1;
++ u32 mptcp_loss2;
++ u32 mptcp_loss3;
++ int epsilon_num;
++ u32 epsilon_den;
++ int mptcp_snd_cwnd_cnt;
++};
++
++static inline int mptcp_olia_sk_can_send(const struct sock *sk)
++{
++ return mptcp_sk_can_send(sk) && tcp_sk(sk)->srtt_us;
++}
++
++static inline u64 mptcp_olia_scale(u64 val, int scale)
++{
++ return (u64) val << scale;
++}
++
++/* Take care of the artificial inflation of cwnd (see RFC 5681)
++ * during the fast-retransmit phase.
++ */
++static u32 mptcp_get_crt_cwnd(struct sock *sk)
++{
++ const struct inet_connection_sock *icsk = inet_csk(sk);
++
++ if (icsk->icsk_ca_state == TCP_CA_Recovery)
++ return tcp_sk(sk)->snd_ssthresh;
++ else
++ return tcp_sk(sk)->snd_cwnd;
++}
++
++/* Return the denominator of the first term of the increase formula. */
++static u64 mptcp_get_rate(const struct mptcp_cb *mpcb, u32 path_rtt)
++{
++ struct sock *sk;
++ u64 rate = 1; /* We have to avoid a zero-rate because it is used as a divisor */
++
++ mptcp_for_each_sk(mpcb, sk) {
++ struct tcp_sock *tp = tcp_sk(sk);
++ u64 scaled_num;
++ u32 tmp_cwnd;
++
++ if (!mptcp_olia_sk_can_send(sk))
++ continue;
++
++ tmp_cwnd = mptcp_get_crt_cwnd(sk);
++ scaled_num = mptcp_olia_scale(tmp_cwnd, scale) * path_rtt;
++ rate += div_u64(scaled_num, tp->srtt_us);
++ }
++ rate *= rate;
++ return rate;
++}
++
++/* find the maximum cwnd, used to find set M */
++static u32 mptcp_get_max_cwnd(const struct mptcp_cb *mpcb)
++{
++ struct sock *sk;
++ u32 best_cwnd = 0;
++
++ mptcp_for_each_sk(mpcb, sk) {
++ u32 tmp_cwnd;
++
++ if (!mptcp_olia_sk_can_send(sk))
++ continue;
++
++ tmp_cwnd = mptcp_get_crt_cwnd(sk);
++ if (tmp_cwnd > best_cwnd)
++ best_cwnd = tmp_cwnd;
++ }
++ return best_cwnd;
++}
++
++static void mptcp_get_epsilon(const struct mptcp_cb *mpcb)
++{
++ struct mptcp_olia *ca;
++ struct tcp_sock *tp;
++ struct sock *sk;
++ u64 tmp_int, tmp_rtt, best_int = 0, best_rtt = 1;
++ u32 max_cwnd = 1, best_cwnd = 1, tmp_cwnd;
++ u8 M = 0, B_not_M = 0;
++
++ /* TODO - integrate this in the following loop - we just want to iterate once */
++
++ max_cwnd = mptcp_get_max_cwnd(mpcb);
++
++ /* find the best path */
++ mptcp_for_each_sk(mpcb, sk) {
++ tp = tcp_sk(sk);
++ ca = inet_csk_ca(sk);
++
++ if (!mptcp_olia_sk_can_send(sk))
++ continue;
++
++ tmp_rtt = (u64)tp->srtt_us * tp->srtt_us;
++ /* TODO - check here and rename variables */
++ tmp_int = max(ca->mptcp_loss3 - ca->mptcp_loss2,
++ ca->mptcp_loss2 - ca->mptcp_loss1);
++
++ tmp_cwnd = mptcp_get_crt_cwnd(sk);
++ if ((u64)tmp_int * best_rtt >= (u64)best_int * tmp_rtt) {
++ best_rtt = tmp_rtt;
++ best_int = tmp_int;
++ best_cwnd = tmp_cwnd;
++ }
++ }
++
++ /* TODO - integrate this here in mptcp_get_max_cwnd and in the previous loop */
++ /* find the size of M and B_not_M */
++ mptcp_for_each_sk(mpcb, sk) {
++ tp = tcp_sk(sk);
++ ca = inet_csk_ca(sk);
++
++ if (!mptcp_olia_sk_can_send(sk))
++ continue;
++
++ tmp_cwnd = mptcp_get_crt_cwnd(sk);
++ if (tmp_cwnd == max_cwnd) {
++ M++;
++ } else {
++ tmp_rtt = (u64)tp->srtt_us * tp->srtt_us;
++ tmp_int = max(ca->mptcp_loss3 - ca->mptcp_loss2,
++ ca->mptcp_loss2 - ca->mptcp_loss1);
++
++ if ((u64)tmp_int * best_rtt == (u64)best_int * tmp_rtt)
++ B_not_M++;
++ }
++ }
++
++ /* check if the path is in M or B_not_M and set the value of epsilon accordingly */
++ mptcp_for_each_sk(mpcb, sk) {
++ tp = tcp_sk(sk);
++ ca = inet_csk_ca(sk);
++
++ if (!mptcp_olia_sk_can_send(sk))
++ continue;
++
++ if (B_not_M == 0) {
++ ca->epsilon_num = 0;
++ ca->epsilon_den = 1;
++ } else {
++ tmp_rtt = (u64)tp->srtt_us * tp->srtt_us;
++ tmp_int = max(ca->mptcp_loss3 - ca->mptcp_loss2,
++ ca->mptcp_loss2 - ca->mptcp_loss1);
++ tmp_cwnd = mptcp_get_crt_cwnd(sk);
++
++ if (tmp_cwnd < max_cwnd &&
++ (u64)tmp_int * best_rtt == (u64)best_int * tmp_rtt) {
++ ca->epsilon_num = 1;
++ ca->epsilon_den = mpcb->cnt_established * B_not_M;
++ } else if (tmp_cwnd == max_cwnd) {
++ ca->epsilon_num = -1;
++ ca->epsilon_den = mpcb->cnt_established * M;
++ } else {
++ ca->epsilon_num = 0;
++ ca->epsilon_den = 1;
++ }
++ }
++ }
++}
++
++/* setting the initial values */
++static void mptcp_olia_init(struct sock *sk)
++{
++ const struct tcp_sock *tp = tcp_sk(sk);
++ struct mptcp_olia *ca = inet_csk_ca(sk);
++
++ if (mptcp(tp)) {
++ ca->mptcp_loss1 = tp->snd_una;
++ ca->mptcp_loss2 = tp->snd_una;
++ ca->mptcp_loss3 = tp->snd_una;
++ ca->mptcp_snd_cwnd_cnt = 0;
++ ca->epsilon_num = 0;
++ ca->epsilon_den = 1;
++ }
++}
++
++/* updating inter-loss distance and ssthresh */
++static void mptcp_olia_set_state(struct sock *sk, u8 new_state)
++{
++ if (!mptcp(tcp_sk(sk)))
++ return;
++
++ if (new_state == TCP_CA_Loss ||
++ new_state == TCP_CA_Recovery || new_state == TCP_CA_CWR) {
++ struct mptcp_olia *ca = inet_csk_ca(sk);
++
++ if (ca->mptcp_loss3 != ca->mptcp_loss2 &&
++ !inet_csk(sk)->icsk_retransmits) {
++ ca->mptcp_loss1 = ca->mptcp_loss2;
++ ca->mptcp_loss2 = ca->mptcp_loss3;
++ }
++ }
++}
++
++/* main algorithm */
++static void mptcp_olia_cong_avoid(struct sock *sk, u32 ack, u32 acked)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct mptcp_olia *ca = inet_csk_ca(sk);
++ const struct mptcp_cb *mpcb = tp->mpcb;
++
++ u64 inc_num, inc_den, rate, cwnd_scaled;
++
++ if (!mptcp(tp)) {
++ tcp_reno_cong_avoid(sk, ack, acked);
++ return;
++ }
++
++ ca->mptcp_loss3 = tp->snd_una;
++
++ if (!tcp_is_cwnd_limited(sk))
++ return;
++
++ /* slow start if it is in the safe area */
++ if (tp->snd_cwnd <= tp->snd_ssthresh) {
++ tcp_slow_start(tp, acked);
++ return;
++ }
++
++ mptcp_get_epsilon(mpcb);
++ rate = mptcp_get_rate(mpcb, tp->srtt_us);
++ cwnd_scaled = mptcp_olia_scale(tp->snd_cwnd, scale);
++ inc_den = ca->epsilon_den * tp->snd_cwnd * rate ? : 1;
++
++ /* calculate the increasing term, scaling is used to reduce the rounding effect */
++ if (ca->epsilon_num == -1) {
++ if (ca->epsilon_den * cwnd_scaled * cwnd_scaled < rate) {
++ inc_num = rate - ca->epsilon_den *
++ cwnd_scaled * cwnd_scaled;
++ ca->mptcp_snd_cwnd_cnt -= div64_u64(
++ mptcp_olia_scale(inc_num, scale), inc_den);
++ } else {
++ inc_num = ca->epsilon_den *
++ cwnd_scaled * cwnd_scaled - rate;
++ ca->mptcp_snd_cwnd_cnt += div64_u64(
++ mptcp_olia_scale(inc_num, scale), inc_den);
++ }
++ } else {
++ inc_num = ca->epsilon_num * rate +
++ ca->epsilon_den * cwnd_scaled * cwnd_scaled;
++ ca->mptcp_snd_cwnd_cnt += div64_u64(
++ mptcp_olia_scale(inc_num, scale), inc_den);
++ }
++
++
++ if (ca->mptcp_snd_cwnd_cnt >= (1 << scale) - 1) {
++ if (tp->snd_cwnd < tp->snd_cwnd_clamp)
++ tp->snd_cwnd++;
++ ca->mptcp_snd_cwnd_cnt = 0;
++ } else if (ca->mptcp_snd_cwnd_cnt <= 0 - (1 << scale) + 1) {
++ tp->snd_cwnd = max((int)1, (int)tp->snd_cwnd - 1);
++ ca->mptcp_snd_cwnd_cnt = 0;
++ }
++}
++
++static struct tcp_congestion_ops mptcp_olia = {
++ .init = mptcp_olia_init,
++ .ssthresh = tcp_reno_ssthresh,
++ .cong_avoid = mptcp_olia_cong_avoid,
++ .set_state = mptcp_olia_set_state,
++ .owner = THIS_MODULE,
++ .name = "olia",
++};
++
++static int __init mptcp_olia_register(void)
++{
++ BUILD_BUG_ON(sizeof(struct mptcp_olia) > ICSK_CA_PRIV_SIZE);
++ return tcp_register_congestion_control(&mptcp_olia);
++}
++
++static void __exit mptcp_olia_unregister(void)
++{
++ tcp_unregister_congestion_control(&mptcp_olia);
++}
++
++module_init(mptcp_olia_register);
++module_exit(mptcp_olia_unregister);
++
++MODULE_AUTHOR("Ramin Khalili, Nicolas Gast, Jean-Yves Le Boudec");
++MODULE_LICENSE("GPL");
++MODULE_DESCRIPTION("MPTCP COUPLED CONGESTION CONTROL");
++MODULE_VERSION("0.1");
+diff --git a/net/mptcp/mptcp_output.c b/net/mptcp/mptcp_output.c
+new file mode 100644
+index 000000000000..400ea254c078
+--- /dev/null
++++ b/net/mptcp/mptcp_output.c
+@@ -0,0 +1,1743 @@
++/*
++ * MPTCP implementation - Sending side
++ *
++ * Initial Design & Implementation:
++ * Sébastien Barré <sebastien.barre@uclouvain.be>
++ *
++ * Current Maintainer & Author:
++ * Christoph Paasch <christoph.paasch@uclouvain.be>
++ *
++ * Additional authors:
++ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
++ * Gregory Detal <gregory.detal@uclouvain.be>
++ * Fabien Duchêne <fabien.duchene@uclouvain.be>
++ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
++ * Lavkesh Lahngir <lavkesh51@gmail.com>
++ * Andreas Ripke <ripke@neclab.eu>
++ * Vlad Dogaru <vlad.dogaru@intel.com>
++ * Octavian Purdila <octavian.purdila@intel.com>
++ * John Ronan <jronan@tssg.org>
++ * Catalin Nicutar <catalin.nicutar@gmail.com>
++ * Brandon Heller <brandonh@stanford.edu>
++ *
++ *
++ * This program is free software; you can redistribute it and/or
++ * modify it under the terms of the GNU General Public License
++ * as published by the Free Software Foundation; either version
++ * 2 of the License, or (at your option) any later version.
++ */
++
++#include <linux/kconfig.h>
++#include <linux/skbuff.h>
++#include <linux/tcp.h>
++
++#include <net/mptcp.h>
++#include <net/mptcp_v4.h>
++#include <net/mptcp_v6.h>
++#include <net/sock.h>
++
++static const int mptcp_dss_len = MPTCP_SUB_LEN_DSS_ALIGN +
++ MPTCP_SUB_LEN_ACK_ALIGN +
++ MPTCP_SUB_LEN_SEQ_ALIGN;
++
++static inline int mptcp_sub_len_remove_addr(u16 bitfield)
++{
++ unsigned int c;
++ for (c = 0; bitfield; c++)
++ bitfield &= bitfield - 1;
++ return MPTCP_SUB_LEN_REMOVE_ADDR + c - 1;
++}
++
++int mptcp_sub_len_remove_addr_align(u16 bitfield)
++{
++ return ALIGN(mptcp_sub_len_remove_addr(bitfield), 4);
++}
++EXPORT_SYMBOL(mptcp_sub_len_remove_addr_align);
++
++/* get the data-seq and end-data-seq and store them again in the
++ * tcp_skb_cb
++ */
++static int mptcp_reconstruct_mapping(struct sk_buff *skb)
++{
++ const struct mp_dss *mpdss = (struct mp_dss *)TCP_SKB_CB(skb)->dss;
++ u32 *p32;
++ u16 *p16;
++
++ if (!mpdss->M)
++ return 1;
++
++ /* Move the pointer to the data-seq */
++ p32 = (u32 *)mpdss;
++ p32++;
++ if (mpdss->A) {
++ p32++;
++ if (mpdss->a)
++ p32++;
++ }
++
++ TCP_SKB_CB(skb)->seq = ntohl(*p32);
++
++ /* Get the data_len to calculate the end_data_seq */
++ p32++;
++ p32++;
++ p16 = (u16 *)p32;
++ TCP_SKB_CB(skb)->end_seq = ntohs(*p16) + TCP_SKB_CB(skb)->seq;
++
++ return 0;
++}
++
++static void mptcp_find_and_set_pathmask(const struct sock *meta_sk, struct sk_buff *skb)
++{
++ struct sk_buff *skb_it;
++
++ skb_it = tcp_write_queue_head(meta_sk);
++
++ tcp_for_write_queue_from(skb_it, meta_sk) {
++ if (skb_it == tcp_send_head(meta_sk))
++ break;
++
++ if (TCP_SKB_CB(skb_it)->seq == TCP_SKB_CB(skb)->seq) {
++ TCP_SKB_CB(skb)->path_mask = TCP_SKB_CB(skb_it)->path_mask;
++ break;
++ }
++ }
++}
++
++/* Reinject data from one TCP subflow to the meta_sk. If sk == NULL, we are
++ * coming from the meta-retransmit-timer
++ */
++static void __mptcp_reinject_data(struct sk_buff *orig_skb, struct sock *meta_sk,
++ struct sock *sk, int clone_it)
++{
++ struct sk_buff *skb, *skb1;
++ const struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct mptcp_cb *mpcb = meta_tp->mpcb;
++ u32 seq, end_seq;
++
++ if (clone_it) {
++ /* pskb_copy is necessary here, because the TCP/IP headers
++ * will be changed when the skb is reinjected on another
++ * subflow.
++ */
++ skb = pskb_copy_for_clone(orig_skb, GFP_ATOMIC);
++ } else {
++ __skb_unlink(orig_skb, &sk->sk_write_queue);
++ sock_set_flag(sk, SOCK_QUEUE_SHRUNK);
++ sk->sk_wmem_queued -= orig_skb->truesize;
++ sk_mem_uncharge(sk, orig_skb->truesize);
++ skb = orig_skb;
++ }
++ if (unlikely(!skb))
++ return;
++
++ if (sk && mptcp_reconstruct_mapping(skb)) {
++ __kfree_skb(skb);
++ return;
++ }
++
++ skb->sk = meta_sk;
++
++ /* If it reached already the destination, we don't have to reinject it */
++ if (!after(TCP_SKB_CB(skb)->end_seq, meta_tp->snd_una)) {
++ __kfree_skb(skb);
++ return;
++ }
++
++ /* Only reinject segments that are fully covered by the mapping */
++ if (skb->len + (mptcp_is_data_fin(skb) ? 1 : 0) !=
++ TCP_SKB_CB(skb)->end_seq - TCP_SKB_CB(skb)->seq) {
++ u32 seq = TCP_SKB_CB(skb)->seq;
++ u32 end_seq = TCP_SKB_CB(skb)->end_seq;
++
++ __kfree_skb(skb);
++
++ /* OK, now we have to look for the full mapping in the meta
++ * send-queue.
++ */
++ tcp_for_write_queue(skb, meta_sk) {
++ /* Not yet at the mapping? */
++ if (before(TCP_SKB_CB(skb)->seq, seq))
++ continue;
++ /* We have passed by the mapping */
++ if (after(TCP_SKB_CB(skb)->end_seq, end_seq))
++ return;
++
++ __mptcp_reinject_data(skb, meta_sk, NULL, 1);
++ }
++ return;
++ }
++
++ /* Segment goes back to the MPTCP-layer. So, we need to zero the
++ * path_mask/dss.
++ */
++ memset(TCP_SKB_CB(skb)->dss, 0, mptcp_dss_len);
++
++ /* We need to find out the path-mask from the meta-write-queue
++ * to properly select a subflow.
++ */
++ mptcp_find_and_set_pathmask(meta_sk, skb);
++
++ /* If it's empty, just add */
++ if (skb_queue_empty(&mpcb->reinject_queue)) {
++ skb_queue_head(&mpcb->reinject_queue, skb);
++ return;
++ }
++
++ /* Find the place to insert skb, or even 'drop' it if the
++ * data is already covered by other skbs in the reinject-queue.
++ *
++ * This is inspired by code from tcp_data_queue.
++ */
++
++ skb1 = skb_peek_tail(&mpcb->reinject_queue);
++ seq = TCP_SKB_CB(skb)->seq;
++ while (1) {
++ if (!after(TCP_SKB_CB(skb1)->seq, seq))
++ break;
++ if (skb_queue_is_first(&mpcb->reinject_queue, skb1)) {
++ skb1 = NULL;
++ break;
++ }
++ skb1 = skb_queue_prev(&mpcb->reinject_queue, skb1);
++ }
++
++ /* Does skb overlap the previous one? */
++ end_seq = TCP_SKB_CB(skb)->end_seq;
++ if (skb1 && before(seq, TCP_SKB_CB(skb1)->end_seq)) {
++ if (!after(end_seq, TCP_SKB_CB(skb1)->end_seq)) {
++ /* All the bits are present. Don't reinject */
++ __kfree_skb(skb);
++ return;
++ }
++ if (seq == TCP_SKB_CB(skb1)->seq) {
++ if (skb_queue_is_first(&mpcb->reinject_queue, skb1))
++ skb1 = NULL;
++ else
++ skb1 = skb_queue_prev(&mpcb->reinject_queue, skb1);
++ }
++ }
++ if (!skb1)
++ __skb_queue_head(&mpcb->reinject_queue, skb);
++ else
++ __skb_queue_after(&mpcb->reinject_queue, skb1, skb);
++
++ /* And clean segments covered by new one as whole. */
++ while (!skb_queue_is_last(&mpcb->reinject_queue, skb)) {
++ skb1 = skb_queue_next(&mpcb->reinject_queue, skb);
++
++ if (!after(end_seq, TCP_SKB_CB(skb1)->seq))
++ break;
++
++ __skb_unlink(skb1, &mpcb->reinject_queue);
++ __kfree_skb(skb1);
++ }
++ return;
++}
++
++/* Inserts data into the reinject queue */
++void mptcp_reinject_data(struct sock *sk, int clone_it)
++{
++ struct sk_buff *skb_it, *tmp;
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct sock *meta_sk = tp->meta_sk;
++
++ /* It has already been closed; there is really no point in reinjecting */
++ if (meta_sk->sk_state == TCP_CLOSE)
++ return;
++
++ skb_queue_walk_safe(&sk->sk_write_queue, skb_it, tmp) {
++ struct tcp_skb_cb *tcb = TCP_SKB_CB(skb_it);
++ /* Subflow SYNs and FINs are not reinjected.
++ *
++ * Neither are empty subflow-FINs with a data-fin;
++ * they are reinjected below (without the subflow-FIN flag).
++ */
++ if (tcb->tcp_flags & TCPHDR_SYN ||
++ (tcb->tcp_flags & TCPHDR_FIN && !mptcp_is_data_fin(skb_it)) ||
++ (tcb->tcp_flags & TCPHDR_FIN && mptcp_is_data_fin(skb_it) && !skb_it->len))
++ continue;
++
++ __mptcp_reinject_data(skb_it, meta_sk, sk, clone_it);
++ }
++
++ skb_it = tcp_write_queue_tail(meta_sk);
++ /* If sk has sent the empty data-fin, we have to reinject it too. */
++ if (skb_it && mptcp_is_data_fin(skb_it) && skb_it->len == 0 &&
++ TCP_SKB_CB(skb_it)->path_mask & mptcp_pi_to_flag(tp->mptcp->path_index)) {
++ __mptcp_reinject_data(skb_it, meta_sk, NULL, 1);
++ }
++
++ mptcp_push_pending_frames(meta_sk);
++
++ tp->pf = 1;
++}
++EXPORT_SYMBOL(mptcp_reinject_data);
++
++static void mptcp_combine_dfin(const struct sk_buff *skb, const struct sock *meta_sk,
++ struct sock *subsk)
++{
++ const struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct mptcp_cb *mpcb = meta_tp->mpcb;
++ struct sock *sk_it;
++ int all_empty = 1, all_acked;
++
++ /* In infinite mapping we always try to combine */
++ if (mpcb->infinite_mapping_snd && tcp_close_state(subsk)) {
++ subsk->sk_shutdown |= SEND_SHUTDOWN;
++ TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_FIN;
++ return;
++ }
++
++ /* Don't combine if they didn't combine; otherwise we end up in
++ * TIME_WAIT, even if our app is smart enough to avoid it.
++ */
++ if (meta_sk->sk_shutdown & RCV_SHUTDOWN) {
++ if (!mpcb->dfin_combined)
++ return;
++ }
++
++ /* If no other subflow has data to send, we can combine */
++ mptcp_for_each_sk(mpcb, sk_it) {
++ if (!mptcp_sk_can_send(sk_it))
++ continue;
++
++ if (!tcp_write_queue_empty(sk_it))
++ all_empty = 0;
++ }
++
++ /* If all data has been DATA_ACKed, we can combine.
++ * -1, because the data_fin consumed one byte
++ */
++ all_acked = (meta_tp->snd_una == (meta_tp->write_seq - 1));
++
++ if ((all_empty || all_acked) && tcp_close_state(subsk)) {
++ subsk->sk_shutdown |= SEND_SHUTDOWN;
++ TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_FIN;
++ }
++}
++
++static int mptcp_write_dss_mapping(const struct tcp_sock *tp, const struct sk_buff *skb,
++ __be32 *ptr)
++{
++ const struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
++ __be32 *start = ptr;
++ __u16 data_len;
++
++ *ptr++ = htonl(tcb->seq); /* data_seq */
++
++ /* If it's a non-data DATA_FIN, we set subseq to 0 (draft v7) */
++ if (mptcp_is_data_fin(skb) && skb->len == 0)
++ *ptr++ = 0; /* subseq */
++ else
++ *ptr++ = htonl(tp->write_seq - tp->mptcp->snt_isn); /* subseq */
++
++ if (tcb->mptcp_flags & MPTCPHDR_INF)
++ data_len = 0;
++ else
++ data_len = tcb->end_seq - tcb->seq;
++
++ if (tp->mpcb->dss_csum && data_len) {
++ __be16 *p16 = (__be16 *)ptr;
++ __be32 hdseq = mptcp_get_highorder_sndbits(skb, tp->mpcb);
++ __wsum csum;
++
++ *ptr = htonl(((data_len) << 16) |
++ (TCPOPT_EOL << 8) |
++ (TCPOPT_EOL));
++ csum = csum_partial(ptr - 2, 12, skb->csum);
++ p16++;
++ *p16++ = csum_fold(csum_partial(&hdseq, sizeof(hdseq), csum));
++ } else {
++ *ptr++ = htonl(((data_len) << 16) |
++ (TCPOPT_NOP << 8) |
++ (TCPOPT_NOP));
++ }
++
++ return ptr - start;
++}
++
++static int mptcp_write_dss_data_ack(const struct tcp_sock *tp, const struct sk_buff *skb,
++ __be32 *ptr)
++{
++ struct mp_dss *mdss = (struct mp_dss *)ptr;
++ __be32 *start = ptr;
++
++ mdss->kind = TCPOPT_MPTCP;
++ mdss->sub = MPTCP_SUB_DSS;
++ mdss->rsv1 = 0;
++ mdss->rsv2 = 0;
++ mdss->F = mptcp_is_data_fin(skb) ? 1 : 0;
++ mdss->m = 0;
++ mdss->M = mptcp_is_data_seq(skb) ? 1 : 0;
++ mdss->a = 0;
++ mdss->A = 1;
++ mdss->len = mptcp_sub_len_dss(mdss, tp->mpcb->dss_csum);
++ ptr++;
++
++ *ptr++ = htonl(mptcp_meta_tp(tp)->rcv_nxt);
++
++ return ptr - start;
++}
++
++/* RFC6824 states that once a particular subflow mapping has been sent
++ * out it must never be changed. However, packets may be split while
++ * they are in the retransmission queue (due to SACK or ACKs) and that
++ * arguably means that we would change the mapping (e.g. it splits it,
++ * or sends out a subset of the initial mapping).
++ *
++ * Furthermore, the skb checksum is not always preserved across splits
++ * (e.g. mptcp_fragment) which would mean that we need to recompute
++ * the DSS checksum in this case.
++ *
++ * To avoid this we save the initial DSS mapping which allows us to
++ * send the same DSS mapping even for fragmented retransmits.
++ */
++static void mptcp_save_dss_data_seq(const struct tcp_sock *tp, struct sk_buff *skb)
++{
++ struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
++ __be32 *ptr = (__be32 *)tcb->dss;
++
++ tcb->mptcp_flags |= MPTCPHDR_SEQ;
++
++ ptr += mptcp_write_dss_data_ack(tp, skb, ptr);
++ ptr += mptcp_write_dss_mapping(tp, skb, ptr);
++}
++
++/* Write the saved DSS mapping to the header */
++static int mptcp_write_dss_data_seq(const struct tcp_sock *tp, struct sk_buff *skb,
++ __be32 *ptr)
++{
++ __be32 *start = ptr;
++
++ memcpy(ptr, TCP_SKB_CB(skb)->dss, mptcp_dss_len);
++
++ /* update the data_ack */
++ start[1] = htonl(mptcp_meta_tp(tp)->rcv_nxt);
++
++ /* dss is in a union with inet_skb_parm and
++ * the IP layer expects zeroed IPCB fields.
++ */
++ memset(TCP_SKB_CB(skb)->dss, 0, mptcp_dss_len);
++
++ return mptcp_dss_len / sizeof(*ptr);
++}
++
++static bool mptcp_skb_entail(struct sock *sk, struct sk_buff *skb, int reinject)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ const struct sock *meta_sk = mptcp_meta_sk(sk);
++ const struct mptcp_cb *mpcb = tp->mpcb;
++ struct tcp_skb_cb *tcb;
++ struct sk_buff *subskb = NULL;
++
++ if (!reinject)
++ TCP_SKB_CB(skb)->mptcp_flags |= (mpcb->snd_hiseq_index ?
++ MPTCPHDR_SEQ64_INDEX : 0);
++
++ subskb = pskb_copy_for_clone(skb, GFP_ATOMIC);
++ if (!subskb)
++ return false;
++
++ /* At the subflow-level we need to call tcp_init_tso_segs again. We
++ * force this by setting gso_segs to 0. It has been set to 1 prior to
++ * the call to mptcp_skb_entail.
++ */
++ skb_shinfo(subskb)->gso_segs = 0;
++
++ TCP_SKB_CB(skb)->path_mask |= mptcp_pi_to_flag(tp->mptcp->path_index);
++
++ if (!(sk->sk_route_caps & NETIF_F_ALL_CSUM) &&
++ skb->ip_summed == CHECKSUM_PARTIAL) {
++ subskb->csum = skb->csum = skb_checksum(skb, 0, skb->len, 0);
++ subskb->ip_summed = skb->ip_summed = CHECKSUM_NONE;
++ }
++
++ tcb = TCP_SKB_CB(subskb);
++
++ if (tp->mpcb->send_infinite_mapping &&
++ !tp->mpcb->infinite_mapping_snd &&
++ !before(tcb->seq, mptcp_meta_tp(tp)->snd_nxt)) {
++ tp->mptcp->fully_established = 1;
++ tp->mpcb->infinite_mapping_snd = 1;
++ tp->mptcp->infinite_cutoff_seq = tp->write_seq;
++ tcb->mptcp_flags |= MPTCPHDR_INF;
++ }
++
++ if (mptcp_is_data_fin(subskb))
++ mptcp_combine_dfin(subskb, meta_sk, sk);
++
++ mptcp_save_dss_data_seq(tp, subskb);
++
++ tcb->seq = tp->write_seq;
++ tcb->sacked = 0; /* reset the sacked field: from the point of view
++ * of this subflow, we are sending a brand new
++ * segment
++ */
++ /* Take into account seg len */
++ tp->write_seq += subskb->len + ((tcb->tcp_flags & TCPHDR_FIN) ? 1 : 0);
++ tcb->end_seq = tp->write_seq;
++
++ /* If it's a non-payload DATA_FIN (also no subflow-fin), the
++ * segment is not part of the subflow but on a meta-only-level.
++ */
++ if (!mptcp_is_data_fin(subskb) || tcb->end_seq != tcb->seq) {
++ tcp_add_write_queue_tail(sk, subskb);
++ sk->sk_wmem_queued += subskb->truesize;
++ sk_mem_charge(sk, subskb->truesize);
++ } else {
++ int err;
++
++ /* Necessary to initialize for tcp_transmit_skb. mss of 1, as
++ * skb->len = 0 will force tso_segs to 1.
++ */
++ tcp_init_tso_segs(sk, subskb, 1);
++ /* Empty data-fins are sent immediately on the subflow */
++ TCP_SKB_CB(subskb)->when = tcp_time_stamp;
++ err = tcp_transmit_skb(sk, subskb, 1, GFP_ATOMIC);
++
++ /* It has not been queued, we can free it now. */
++ kfree_skb(subskb);
++
++ if (err)
++ return false;
++ }
++
++ if (!tp->mptcp->fully_established) {
++ tp->mptcp->second_packet = 1;
++ tp->mptcp->last_end_data_seq = TCP_SKB_CB(skb)->end_seq;
++ }
++
++ return true;
++}
++
++/* Fragment an skb and update the mptcp meta-data. Due to reinject, we
++ * might need to undo some operations done by tcp_fragment.
++ */
++static int mptcp_fragment(struct sock *meta_sk, struct sk_buff *skb, u32 len,
++ gfp_t gfp, int reinject)
++{
++ int ret, diff, old_factor;
++ struct sk_buff *buff;
++ u8 flags;
++
++ if (skb_headlen(skb) < len)
++ diff = skb->len - len;
++ else
++ diff = skb->data_len;
++ old_factor = tcp_skb_pcount(skb);
++
++ /* The mss_now in tcp_fragment is used to set the tso_segs of the skb.
++ * At the MPTCP-level we do not care about the absolute value. All we
++ * care about is that it is set to 1 for accurate packets_out
++ * accounting.
++ */
++ ret = tcp_fragment(meta_sk, skb, len, UINT_MAX, gfp);
++ if (ret)
++ return ret;
++
++ buff = skb->next;
++
++ flags = TCP_SKB_CB(skb)->mptcp_flags;
++ TCP_SKB_CB(skb)->mptcp_flags = flags & ~(MPTCPHDR_FIN);
++ TCP_SKB_CB(buff)->mptcp_flags = flags;
++ TCP_SKB_CB(buff)->path_mask = TCP_SKB_CB(skb)->path_mask;
++
++ /* If reinject == 1, the buff will be added to the reinject
++ * queue, which is currently not part of memory accounting. So
++ * undo the changes done by tcp_fragment and update the
++ * reinject queue. Also, undo changes to the packet counters.
++ */
++ if (reinject == 1) {
++ int undo = buff->truesize - diff;
++ meta_sk->sk_wmem_queued -= undo;
++ sk_mem_uncharge(meta_sk, undo);
++
++ tcp_sk(meta_sk)->mpcb->reinject_queue.qlen++;
++ meta_sk->sk_write_queue.qlen--;
++
++ if (!before(tcp_sk(meta_sk)->snd_nxt, TCP_SKB_CB(buff)->end_seq)) {
++ undo = old_factor - tcp_skb_pcount(skb) -
++ tcp_skb_pcount(buff);
++ if (undo)
++ tcp_adjust_pcount(meta_sk, skb, -undo);
++ }
++ }
++
++ return 0;
++}
++
++/* Inspired by tcp_write_wakeup */
++int mptcp_write_wakeup(struct sock *meta_sk)
++{
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct sk_buff *skb;
++ struct sock *sk_it;
++ int ans = 0;
++
++ if (meta_sk->sk_state == TCP_CLOSE)
++ return -1;
++
++ skb = tcp_send_head(meta_sk);
++ if (skb &&
++ before(TCP_SKB_CB(skb)->seq, tcp_wnd_end(meta_tp))) {
++ unsigned int mss;
++ unsigned int seg_size = tcp_wnd_end(meta_tp) - TCP_SKB_CB(skb)->seq;
++ struct sock *subsk = meta_tp->mpcb->sched_ops->get_subflow(meta_sk, skb, true);
++ struct tcp_sock *subtp;
++ if (!subsk)
++ goto window_probe;
++ subtp = tcp_sk(subsk);
++ mss = tcp_current_mss(subsk);
++
++ seg_size = min(tcp_wnd_end(meta_tp) - TCP_SKB_CB(skb)->seq,
++ tcp_wnd_end(subtp) - subtp->write_seq);
++
++ if (before(meta_tp->pushed_seq, TCP_SKB_CB(skb)->end_seq))
++ meta_tp->pushed_seq = TCP_SKB_CB(skb)->end_seq;
++
++ /* We are probing the opening of a window
++ * but the window size is != 0 - this must
++ * have been the result of SWS avoidance (sender)
++ */
++ if (seg_size < TCP_SKB_CB(skb)->end_seq - TCP_SKB_CB(skb)->seq ||
++ skb->len > mss) {
++ seg_size = min(seg_size, mss);
++ TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_PSH;
++ if (mptcp_fragment(meta_sk, skb, seg_size,
++ GFP_ATOMIC, 0))
++ return -1;
++ } else if (!tcp_skb_pcount(skb)) {
++ /* see mptcp_write_xmit on why we use UINT_MAX */
++ tcp_set_skb_tso_segs(meta_sk, skb, UINT_MAX);
++ }
++
++ TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_PSH;
++ if (!mptcp_skb_entail(subsk, skb, 0))
++ return -1;
++ TCP_SKB_CB(skb)->when = tcp_time_stamp;
++
++ mptcp_check_sndseq_wrap(meta_tp, TCP_SKB_CB(skb)->end_seq -
++ TCP_SKB_CB(skb)->seq);
++ tcp_event_new_data_sent(meta_sk, skb);
++
++ __tcp_push_pending_frames(subsk, mss, TCP_NAGLE_PUSH);
++
++ return 0;
++ } else {
++window_probe:
++ if (between(meta_tp->snd_up, meta_tp->snd_una + 1,
++ meta_tp->snd_una + 0xFFFF)) {
++ mptcp_for_each_sk(meta_tp->mpcb, sk_it) {
++ if (mptcp_sk_can_send_ack(sk_it))
++ tcp_xmit_probe_skb(sk_it, 1);
++ }
++ }
++
++ /* At least one of the tcp_xmit_probe_skb's has to succeed */
++ mptcp_for_each_sk(meta_tp->mpcb, sk_it) {
++ int ret;
++
++ if (!mptcp_sk_can_send_ack(sk_it))
++ continue;
++
++ ret = tcp_xmit_probe_skb(sk_it, 0);
++ if (unlikely(ret > 0))
++ ans = ret;
++ }
++ return ans;
++ }
++}
++
++bool mptcp_write_xmit(struct sock *meta_sk, unsigned int mss_now, int nonagle,
++ int push_one, gfp_t gfp)
++{
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk), *subtp;
++ struct sock *subsk = NULL;
++ struct mptcp_cb *mpcb = meta_tp->mpcb;
++ struct sk_buff *skb;
++ unsigned int sent_pkts;
++ int reinject = 0;
++ unsigned int sublimit;
++
++ sent_pkts = 0;
++
++ while ((skb = mpcb->sched_ops->next_segment(meta_sk, &reinject, &subsk,
++ &sublimit))) {
++ unsigned int limit;
++
++ subtp = tcp_sk(subsk);
++ mss_now = tcp_current_mss(subsk);
++
++ if (reinject == 1) {
++ if (!after(TCP_SKB_CB(skb)->end_seq, meta_tp->snd_una)) {
++ /* Segment already reached the peer, take the next one */
++ __skb_unlink(skb, &mpcb->reinject_queue);
++ __kfree_skb(skb);
++ continue;
++ }
++ }
++
++ /* If the segment was cloned (e.g. a meta retransmission),
++ * the header must be expanded/copied so that there is no
++ * corruption of TSO information.
++ */
++ if (skb_unclone(skb, GFP_ATOMIC))
++ break;
++
++ if (unlikely(!tcp_snd_wnd_test(meta_tp, skb, mss_now)))
++ break;
++
++ /* Force tso_segs to 1 by using UINT_MAX.
++ * We actually don't care about the exact number of segments
++ * emitted on the subflow. We need just to set tso_segs, because
++ * we still need an accurate packets_out count in
++ * tcp_event_new_data_sent.
++ */
++ tcp_set_skb_tso_segs(meta_sk, skb, UINT_MAX);
++
++ /* Check for nagle, regardless of tso_segs. If the segment is
++ * actually larger than mss_now (TSO segment), then
++ * tcp_nagle_check will have partial == false and always trigger
++ * the transmission.
++ * tcp_write_xmit has a TSO-level nagle check which is not
++ * subject to the MPTCP-level. It is based on the properties of
++ * the subflow, not the MPTCP-level.
++ */
++ if (unlikely(!tcp_nagle_test(meta_tp, skb, mss_now,
++ (tcp_skb_is_last(meta_sk, skb) ?
++ nonagle : TCP_NAGLE_PUSH))))
++ break;
++
++ limit = mss_now;
++ /* skb->len > mss_now is the equivalent of tso_segs > 1 in
++ * tcp_write_xmit. Otherwise split-point would return 0.
++ */
++ if (skb->len > mss_now && !tcp_urg_mode(meta_tp))
++ /* We limit the size of the skb so that it fits into the
++ * window. Call tcp_mss_split_point to avoid duplicating
++ * code.
++ * We really only care about fitting the skb into the
++ * window. That's why we use UINT_MAX. If the skb does
++ * not fit into the cwnd_quota or the NIC's max-segs
++ * limitation, it will be split by the subflow's
++ * tcp_write_xmit which does the appropriate call to
++ * tcp_mss_split_point.
++ */
++ limit = tcp_mss_split_point(meta_sk, skb, mss_now,
++ UINT_MAX / mss_now,
++ nonagle);
++
++ if (sublimit)
++ limit = min(limit, sublimit);
++
++ if (skb->len > limit &&
++ unlikely(mptcp_fragment(meta_sk, skb, limit, gfp, reinject)))
++ break;
++
++ if (!mptcp_skb_entail(subsk, skb, reinject))
++ break;
++ /* Nagle is handled at the MPTCP-layer, so
++ * always push on the subflow
++ */
++ __tcp_push_pending_frames(subsk, mss_now, TCP_NAGLE_PUSH);
++ TCP_SKB_CB(skb)->when = tcp_time_stamp;
++
++ if (!reinject) {
++ mptcp_check_sndseq_wrap(meta_tp,
++ TCP_SKB_CB(skb)->end_seq -
++ TCP_SKB_CB(skb)->seq);
++ tcp_event_new_data_sent(meta_sk, skb);
++ }
++
++ tcp_minshall_update(meta_tp, mss_now, skb);
++ sent_pkts += tcp_skb_pcount(skb);
++
++ if (reinject > 0) {
++ __skb_unlink(skb, &mpcb->reinject_queue);
++ kfree_skb(skb);
++ }
++
++ if (push_one)
++ break;
++ }
++
++ return !meta_tp->packets_out && tcp_send_head(meta_sk);
++}
++
++void mptcp_write_space(struct sock *sk)
++{
++ mptcp_push_pending_frames(mptcp_meta_sk(sk));
++}
++
++u32 __mptcp_select_window(struct sock *sk)
++{
++ struct inet_connection_sock *icsk = inet_csk(sk);
++ struct tcp_sock *tp = tcp_sk(sk), *meta_tp = mptcp_meta_tp(tp);
++ int mss, free_space, full_space, window;
++
++ /* MSS for the peer's data. Previous versions used mss_clamp
++ * here. I don't know if the value based on our guesses
++ * of peer's MSS is better for the performance. It's more correct
++ * but may be worse for the performance because of rcv_mss
++ * fluctuations. --SAW 1998/11/1
++ */
++ mss = icsk->icsk_ack.rcv_mss;
++ free_space = tcp_space(sk);
++ full_space = min_t(int, meta_tp->window_clamp,
++ tcp_full_space(sk));
++
++ if (mss > full_space)
++ mss = full_space;
++
++ if (free_space < (full_space >> 1)) {
++ icsk->icsk_ack.quick = 0;
++
++ if (tcp_memory_pressure)
++ /* TODO this has to be adapted when we support different
++ * MSS's among the subflows.
++ */
++ meta_tp->rcv_ssthresh = min(meta_tp->rcv_ssthresh,
++ 4U * meta_tp->advmss);
++
++ if (free_space < mss)
++ return 0;
++ }
++
++ if (free_space > meta_tp->rcv_ssthresh)
++ free_space = meta_tp->rcv_ssthresh;
++
++ /* Don't do rounding if we are using window scaling, since the
++ * scaled window will not line up with the MSS boundary anyway.
++ */
++ window = meta_tp->rcv_wnd;
++ if (tp->rx_opt.rcv_wscale) {
++ window = free_space;
++
++ /* Advertise enough space so that it won't get scaled away.
++ * Important case: prevent zero window announcement if
++ * 1<<rcv_wscale > mss.
++ */
++ if (((window >> tp->rx_opt.rcv_wscale) <<
++ tp->rx_opt.rcv_wscale) != window)
++ window = (((window >> tp->rx_opt.rcv_wscale) + 1)
++ << tp->rx_opt.rcv_wscale);
++ } else {
++ /* Get the largest window that is a nice multiple of mss.
++ * Window clamp already applied above.
++ * If our current window offering is within 1 mss of the
++ * free space we just keep it. This prevents the divide
++ * and multiply from happening most of the time.
++ * We also don't do any window rounding when the free space
++ * is too small.
++ */
++ if (window <= free_space - mss || window > free_space)
++ window = (free_space / mss) * mss;
++ else if (mss == full_space &&
++ free_space > window + (full_space >> 1))
++ window = free_space;
++ }
++
++ return window;
++}
++
++void mptcp_syn_options(const struct sock *sk, struct tcp_out_options *opts,
++ unsigned *remaining)
++{
++ const struct tcp_sock *tp = tcp_sk(sk);
++
++ opts->options |= OPTION_MPTCP;
++ if (is_master_tp(tp)) {
++ opts->mptcp_options |= OPTION_MP_CAPABLE | OPTION_TYPE_SYN;
++ *remaining -= MPTCP_SUB_LEN_CAPABLE_SYN_ALIGN;
++ opts->mp_capable.sender_key = tp->mptcp_loc_key;
++ opts->dss_csum = !!sysctl_mptcp_checksum;
++ } else {
++ const struct mptcp_cb *mpcb = tp->mpcb;
++
++ opts->mptcp_options |= OPTION_MP_JOIN | OPTION_TYPE_SYN;
++ *remaining -= MPTCP_SUB_LEN_JOIN_SYN_ALIGN;
++ opts->mp_join_syns.token = mpcb->mptcp_rem_token;
++ opts->mp_join_syns.low_prio = tp->mptcp->low_prio;
++ opts->addr_id = tp->mptcp->loc_id;
++ opts->mp_join_syns.sender_nonce = tp->mptcp->mptcp_loc_nonce;
++ }
++}
++
++void mptcp_synack_options(struct request_sock *req,
++ struct tcp_out_options *opts, unsigned *remaining)
++{
++ struct mptcp_request_sock *mtreq;
++ mtreq = mptcp_rsk(req);
++
++ opts->options |= OPTION_MPTCP;
++ /* MPCB not yet set - thus it's a new MPTCP-session */
++ if (!mtreq->is_sub) {
++ opts->mptcp_options |= OPTION_MP_CAPABLE | OPTION_TYPE_SYNACK;
++ opts->mp_capable.sender_key = mtreq->mptcp_loc_key;
++ opts->dss_csum = !!sysctl_mptcp_checksum || mtreq->dss_csum;
++ *remaining -= MPTCP_SUB_LEN_CAPABLE_SYN_ALIGN;
++ } else {
++ opts->mptcp_options |= OPTION_MP_JOIN | OPTION_TYPE_SYNACK;
++ opts->mp_join_syns.sender_truncated_mac =
++ mtreq->mptcp_hash_tmac;
++ opts->mp_join_syns.sender_nonce = mtreq->mptcp_loc_nonce;
++ opts->mp_join_syns.low_prio = mtreq->low_prio;
++ opts->addr_id = mtreq->loc_id;
++ *remaining -= MPTCP_SUB_LEN_JOIN_SYNACK_ALIGN;
++ }
++}
++
++void mptcp_established_options(struct sock *sk, struct sk_buff *skb,
++ struct tcp_out_options *opts, unsigned *size)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct mptcp_cb *mpcb = tp->mpcb;
++ const struct tcp_skb_cb *tcb = skb ? TCP_SKB_CB(skb) : NULL;
++
++ /* We are coming from tcp_current_mss with the meta_sk as an argument.
++ * It does not make sense to check for the options, because when the
++ * segment gets sent, another subflow will be chosen.
++ */
++ if (!skb && is_meta_sk(sk))
++ return;
++
++ /* In fallback mp_fail-mode, we have to repeat it until the fallback
++ * has been done by the sender
++ */
++ if (unlikely(tp->mptcp->send_mp_fail)) {
++ opts->options |= OPTION_MPTCP;
++ opts->mptcp_options |= OPTION_MP_FAIL;
++ *size += MPTCP_SUB_LEN_FAIL;
++ return;
++ }
++
++ if (unlikely(tp->send_mp_fclose)) {
++ opts->options |= OPTION_MPTCP;
++ opts->mptcp_options |= OPTION_MP_FCLOSE;
++ opts->mp_capable.receiver_key = mpcb->mptcp_rem_key;
++ *size += MPTCP_SUB_LEN_FCLOSE_ALIGN;
++ return;
++ }
++
++ /* 1. If we are the sender of the infinite-mapping, we need the
++ * MPTCPHDR_INF-flag, because a retransmission of the
++ * infinite-announcement still needs the mptcp-option.
++ *
++ * We need infinite_cutoff_seq, because retransmissions from before
++ * the infinite-cutoff-moment still need the MPTCP-signalling to stay
++ * consistent.
++ *
++ * 2. If we are the receiver of the infinite-mapping, we always skip
++ * mptcp-options, because acknowledgments from before the
++ * infinite-mapping point have already been sent out.
++ *
++ * I know, the whole infinite-mapping stuff is ugly...
++ *
++ * TODO: Handle wrapped data-sequence numbers
++ * (even if it's very unlikely)
++ */
++ if (unlikely(mpcb->infinite_mapping_snd) &&
++ ((mpcb->send_infinite_mapping && tcb &&
++ mptcp_is_data_seq(skb) &&
++ !(tcb->mptcp_flags & MPTCPHDR_INF) &&
++ !before(tcb->seq, tp->mptcp->infinite_cutoff_seq)) ||
++ !mpcb->send_infinite_mapping))
++ return;
++
++ if (unlikely(tp->mptcp->include_mpc)) {
++ opts->options |= OPTION_MPTCP;
++ opts->mptcp_options |= OPTION_MP_CAPABLE |
++ OPTION_TYPE_ACK;
++ *size += MPTCP_SUB_LEN_CAPABLE_ACK_ALIGN;
++ opts->mp_capable.sender_key = mpcb->mptcp_loc_key;
++ opts->mp_capable.receiver_key = mpcb->mptcp_rem_key;
++ opts->dss_csum = mpcb->dss_csum;
++
++ if (skb)
++ tp->mptcp->include_mpc = 0;
++ }
++ if (unlikely(tp->mptcp->pre_established)) {
++ opts->options |= OPTION_MPTCP;
++ opts->mptcp_options |= OPTION_MP_JOIN | OPTION_TYPE_ACK;
++ *size += MPTCP_SUB_LEN_JOIN_ACK_ALIGN;
++ }
++
++ if (!tp->mptcp->include_mpc && !tp->mptcp->pre_established) {
++ opts->options |= OPTION_MPTCP;
++ opts->mptcp_options |= OPTION_DATA_ACK;
++ /* If !skb, we come from tcp_current_mss and thus we always
++ * assume that the DSS-option will be set for the data-packet.
++ */
++ if (skb && !mptcp_is_data_seq(skb)) {
++ *size += MPTCP_SUB_LEN_ACK_ALIGN;
++ } else {
++ /* Doesn't matter, if csum included or not. It will be
++ * either 10 or 12, and thus aligned = 12
++ */
++ *size += MPTCP_SUB_LEN_ACK_ALIGN +
++ MPTCP_SUB_LEN_SEQ_ALIGN;
++ }
++
++ *size += MPTCP_SUB_LEN_DSS_ALIGN;
++ }
++
++ if (unlikely(mpcb->addr_signal) && mpcb->pm_ops->addr_signal)
++ mpcb->pm_ops->addr_signal(sk, size, opts, skb);
++
++ if (unlikely(tp->mptcp->send_mp_prio) &&
++ MAX_TCP_OPTION_SPACE - *size >= MPTCP_SUB_LEN_PRIO_ALIGN) {
++ opts->options |= OPTION_MPTCP;
++ opts->mptcp_options |= OPTION_MP_PRIO;
++ if (skb)
++ tp->mptcp->send_mp_prio = 0;
++ *size += MPTCP_SUB_LEN_PRIO_ALIGN;
++ }
++
++ return;
++}
++
++u16 mptcp_select_window(struct sock *sk)
++{
++ u16 new_win = tcp_select_window(sk);
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct tcp_sock *meta_tp = mptcp_meta_tp(tp);
++
++ meta_tp->rcv_wnd = tp->rcv_wnd;
++ meta_tp->rcv_wup = meta_tp->rcv_nxt;
++
++ return new_win;
++}
++
++void mptcp_options_write(__be32 *ptr, struct tcp_sock *tp,
++ const struct tcp_out_options *opts,
++ struct sk_buff *skb)
++{
++ if (unlikely(OPTION_MP_CAPABLE & opts->mptcp_options)) {
++ struct mp_capable *mpc = (struct mp_capable *)ptr;
++
++ mpc->kind = TCPOPT_MPTCP;
++
++ if ((OPTION_TYPE_SYN & opts->mptcp_options) ||
++ (OPTION_TYPE_SYNACK & opts->mptcp_options)) {
++ mpc->sender_key = opts->mp_capable.sender_key;
++ mpc->len = MPTCP_SUB_LEN_CAPABLE_SYN;
++ ptr += MPTCP_SUB_LEN_CAPABLE_SYN_ALIGN >> 2;
++ } else if (OPTION_TYPE_ACK & opts->mptcp_options) {
++ mpc->sender_key = opts->mp_capable.sender_key;
++ mpc->receiver_key = opts->mp_capable.receiver_key;
++ mpc->len = MPTCP_SUB_LEN_CAPABLE_ACK;
++ ptr += MPTCP_SUB_LEN_CAPABLE_ACK_ALIGN >> 2;
++ }
++
++ mpc->sub = MPTCP_SUB_CAPABLE;
++ mpc->ver = 0;
++ mpc->a = opts->dss_csum;
++ mpc->b = 0;
++ mpc->rsv = 0;
++ mpc->h = 1;
++ }
++
++ if (unlikely(OPTION_MP_JOIN & opts->mptcp_options)) {
++ struct mp_join *mpj = (struct mp_join *)ptr;
++
++ mpj->kind = TCPOPT_MPTCP;
++ mpj->sub = MPTCP_SUB_JOIN;
++ mpj->rsv = 0;
++
++ if (OPTION_TYPE_SYN & opts->mptcp_options) {
++ mpj->len = MPTCP_SUB_LEN_JOIN_SYN;
++ mpj->u.syn.token = opts->mp_join_syns.token;
++ mpj->u.syn.nonce = opts->mp_join_syns.sender_nonce;
++ mpj->b = opts->mp_join_syns.low_prio;
++ mpj->addr_id = opts->addr_id;
++ ptr += MPTCP_SUB_LEN_JOIN_SYN_ALIGN >> 2;
++ } else if (OPTION_TYPE_SYNACK & opts->mptcp_options) {
++ mpj->len = MPTCP_SUB_LEN_JOIN_SYNACK;
++ mpj->u.synack.mac =
++ opts->mp_join_syns.sender_truncated_mac;
++ mpj->u.synack.nonce = opts->mp_join_syns.sender_nonce;
++ mpj->b = opts->mp_join_syns.low_prio;
++ mpj->addr_id = opts->addr_id;
++ ptr += MPTCP_SUB_LEN_JOIN_SYNACK_ALIGN >> 2;
++ } else if (OPTION_TYPE_ACK & opts->mptcp_options) {
++ mpj->len = MPTCP_SUB_LEN_JOIN_ACK;
++ mpj->addr_id = 0; /* addr_id is rsv (RFC 6824, p. 21) */
++ memcpy(mpj->u.ack.mac, &tp->mptcp->sender_mac[0], 20);
++ ptr += MPTCP_SUB_LEN_JOIN_ACK_ALIGN >> 2;
++ }
++ }
++ if (unlikely(OPTION_ADD_ADDR & opts->mptcp_options)) {
++ struct mp_add_addr *mpadd = (struct mp_add_addr *)ptr;
++
++ mpadd->kind = TCPOPT_MPTCP;
++ if (opts->add_addr_v4) {
++ mpadd->len = MPTCP_SUB_LEN_ADD_ADDR4;
++ mpadd->sub = MPTCP_SUB_ADD_ADDR;
++ mpadd->ipver = 4;
++ mpadd->addr_id = opts->add_addr4.addr_id;
++ mpadd->u.v4.addr = opts->add_addr4.addr;
++ ptr += MPTCP_SUB_LEN_ADD_ADDR4_ALIGN >> 2;
++ } else if (opts->add_addr_v6) {
++ mpadd->len = MPTCP_SUB_LEN_ADD_ADDR6;
++ mpadd->sub = MPTCP_SUB_ADD_ADDR;
++ mpadd->ipver = 6;
++ mpadd->addr_id = opts->add_addr6.addr_id;
++ memcpy(&mpadd->u.v6.addr, &opts->add_addr6.addr,
++ sizeof(mpadd->u.v6.addr));
++ ptr += MPTCP_SUB_LEN_ADD_ADDR6_ALIGN >> 2;
++ }
++ }
++ if (unlikely(OPTION_REMOVE_ADDR & opts->mptcp_options)) {
++ struct mp_remove_addr *mprem = (struct mp_remove_addr *)ptr;
++ u8 *addrs_id;
++ int id, len, len_align;
++
++ len = mptcp_sub_len_remove_addr(opts->remove_addrs);
++ len_align = mptcp_sub_len_remove_addr_align(opts->remove_addrs);
++
++ mprem->kind = TCPOPT_MPTCP;
++ mprem->len = len;
++ mprem->sub = MPTCP_SUB_REMOVE_ADDR;
++ mprem->rsv = 0;
++ addrs_id = &mprem->addrs_id;
++
++ mptcp_for_each_bit_set(opts->remove_addrs, id)
++ *(addrs_id++) = id;
++
++ /* Fill the rest with NOP's */
++ if (len_align > len) {
++ int i;
++ for (i = 0; i < len_align - len; i++)
++ *(addrs_id++) = TCPOPT_NOP;
++ }
++
++ ptr += len_align >> 2;
++ }
++ if (unlikely(OPTION_MP_FAIL & opts->mptcp_options)) {
++ struct mp_fail *mpfail = (struct mp_fail *)ptr;
++
++ mpfail->kind = TCPOPT_MPTCP;
++ mpfail->len = MPTCP_SUB_LEN_FAIL;
++ mpfail->sub = MPTCP_SUB_FAIL;
++ mpfail->rsv1 = 0;
++ mpfail->rsv2 = 0;
++ mpfail->data_seq = htonll(tp->mpcb->csum_cutoff_seq);
++
++ ptr += MPTCP_SUB_LEN_FAIL_ALIGN >> 2;
++ }
++ if (unlikely(OPTION_MP_FCLOSE & opts->mptcp_options)) {
++ struct mp_fclose *mpfclose = (struct mp_fclose *)ptr;
++
++ mpfclose->kind = TCPOPT_MPTCP;
++ mpfclose->len = MPTCP_SUB_LEN_FCLOSE;
++ mpfclose->sub = MPTCP_SUB_FCLOSE;
++ mpfclose->rsv1 = 0;
++ mpfclose->rsv2 = 0;
++ mpfclose->key = opts->mp_capable.receiver_key;
++
++ ptr += MPTCP_SUB_LEN_FCLOSE_ALIGN >> 2;
++ }
++
++ if (OPTION_DATA_ACK & opts->mptcp_options) {
++ if (!mptcp_is_data_seq(skb))
++ ptr += mptcp_write_dss_data_ack(tp, skb, ptr);
++ else
++ ptr += mptcp_write_dss_data_seq(tp, skb, ptr);
++ }
++ if (unlikely(OPTION_MP_PRIO & opts->mptcp_options)) {
++ struct mp_prio *mpprio = (struct mp_prio *)ptr;
++
++ mpprio->kind = TCPOPT_MPTCP;
++ mpprio->len = MPTCP_SUB_LEN_PRIO;
++ mpprio->sub = MPTCP_SUB_PRIO;
++ mpprio->rsv = 0;
++ mpprio->b = tp->mptcp->low_prio;
++ mpprio->addr_id = TCPOPT_NOP;
++
++ ptr += MPTCP_SUB_LEN_PRIO_ALIGN >> 2;
++ }
++}
++
++/* Sends the datafin */
++void mptcp_send_fin(struct sock *meta_sk)
++{
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct sk_buff *skb = tcp_write_queue_tail(meta_sk);
++ int mss_now;
++
++ if ((1 << meta_sk->sk_state) & (TCPF_CLOSE_WAIT | TCPF_LAST_ACK))
++ meta_tp->mpcb->passive_close = 1;
++
++ /* Optimization, tack on the FIN if we have a queue of
++ * unsent frames. But be careful about outgoing SACKS
++ * and IP options.
++ */
++ mss_now = mptcp_current_mss(meta_sk);
++
++ if (tcp_send_head(meta_sk) != NULL) {
++ TCP_SKB_CB(skb)->mptcp_flags |= MPTCPHDR_FIN;
++ TCP_SKB_CB(skb)->end_seq++;
++ meta_tp->write_seq++;
++ } else {
++ /* Socket is locked, keep trying until memory is available. */
++ for (;;) {
++ skb = alloc_skb_fclone(MAX_TCP_HEADER,
++ meta_sk->sk_allocation);
++ if (skb)
++ break;
++ yield();
++ }
++ /* Reserve space for headers and prepare control bits. */
++ skb_reserve(skb, MAX_TCP_HEADER);
++
++ tcp_init_nondata_skb(skb, meta_tp->write_seq, TCPHDR_ACK);
++ TCP_SKB_CB(skb)->end_seq++;
++ TCP_SKB_CB(skb)->mptcp_flags |= MPTCPHDR_FIN;
++ tcp_queue_skb(meta_sk, skb);
++ }
++ __tcp_push_pending_frames(meta_sk, mss_now, TCP_NAGLE_OFF);
++}
++
++void mptcp_send_active_reset(struct sock *meta_sk, gfp_t priority)
++{
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct mptcp_cb *mpcb = meta_tp->mpcb;
++ struct sock *sk = NULL, *sk_it = NULL, *tmpsk;
++
++ if (!mpcb->cnt_subflows)
++ return;
++
++ WARN_ON(meta_tp->send_mp_fclose);
++
++ /* First - select a socket */
++ sk = mptcp_select_ack_sock(meta_sk);
++
++ /* May happen if no subflow is in an appropriate state */
++ if (!sk)
++ return;
++
++ /* We are in infinite mode - just send a reset */
++ if (mpcb->infinite_mapping_snd || mpcb->infinite_mapping_rcv) {
++ sk->sk_err = ECONNRESET;
++ if (tcp_need_reset(sk->sk_state))
++ tcp_send_active_reset(sk, priority);
++ mptcp_sub_force_close(sk);
++ return;
++ }
++
++ tcp_sk(sk)->send_mp_fclose = 1;
++ /* Reset all other subflows */
++
++ /* tcp_done must be handled with bh disabled */
++ if (!in_serving_softirq())
++ local_bh_disable();
++
++ mptcp_for_each_sk_safe(mpcb, sk_it, tmpsk) {
++ if (tcp_sk(sk_it)->send_mp_fclose)
++ continue;
++
++ sk_it->sk_err = ECONNRESET;
++ if (tcp_need_reset(sk_it->sk_state))
++ tcp_send_active_reset(sk_it, GFP_ATOMIC);
++ mptcp_sub_force_close(sk_it);
++ }
++
++ if (!in_serving_softirq())
++ local_bh_enable();
++
++ tcp_send_ack(sk);
++ inet_csk_reset_keepalive_timer(sk, inet_csk(sk)->icsk_rto);
++
++ meta_tp->send_mp_fclose = 1;
++}
++
++static void mptcp_ack_retransmit_timer(struct sock *sk)
++{
++ struct sk_buff *skb;
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct inet_connection_sock *icsk = inet_csk(sk);
++
++ if (inet_csk(sk)->icsk_af_ops->rebuild_header(sk))
++ goto out; /* Routing failure or similar */
++
++ if (!tp->retrans_stamp)
++ tp->retrans_stamp = tcp_time_stamp ? : 1;
++
++ if (tcp_write_timeout(sk)) {
++ tp->mptcp->pre_established = 0;
++ sk_stop_timer(sk, &tp->mptcp->mptcp_ack_timer);
++ tp->ops->send_active_reset(sk, GFP_ATOMIC);
++ goto out;
++ }
++
++ skb = alloc_skb(MAX_TCP_HEADER, GFP_ATOMIC);
++ if (skb == NULL) {
++ sk_reset_timer(sk, &tp->mptcp->mptcp_ack_timer,
++ jiffies + icsk->icsk_rto);
++ return;
++ }
++
++ /* Reserve space for headers and prepare control bits */
++ skb_reserve(skb, MAX_TCP_HEADER);
++ tcp_init_nondata_skb(skb, tp->snd_una, TCPHDR_ACK);
++
++ TCP_SKB_CB(skb)->when = tcp_time_stamp;
++ if (tcp_transmit_skb(sk, skb, 0, GFP_ATOMIC) > 0) {
++ /* Retransmission failed because of local congestion,
++ * do not backoff.
++ */
++ if (!icsk->icsk_retransmits)
++ icsk->icsk_retransmits = 1;
++ sk_reset_timer(sk, &tp->mptcp->mptcp_ack_timer,
++ jiffies + icsk->icsk_rto);
++ return;
++ }
++
++ icsk->icsk_retransmits++;
++ icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX);
++ sk_reset_timer(sk, &tp->mptcp->mptcp_ack_timer,
++ jiffies + icsk->icsk_rto);
++ if (retransmits_timed_out(sk, sysctl_tcp_retries1 + 1, 0, 0))
++ __sk_dst_reset(sk);
++
++out:;
++}
++
++void mptcp_ack_handler(unsigned long data)
++{
++ struct sock *sk = (struct sock *)data;
++ struct sock *meta_sk = mptcp_meta_sk(sk);
++
++ bh_lock_sock(meta_sk);
++ if (sock_owned_by_user(meta_sk)) {
++ /* Try again later */
++ sk_reset_timer(sk, &tcp_sk(sk)->mptcp->mptcp_ack_timer,
++ jiffies + (HZ / 20));
++ goto out_unlock;
++ }
++
++ if (sk->sk_state == TCP_CLOSE)
++ goto out_unlock;
++ if (!tcp_sk(sk)->mptcp->pre_established)
++ goto out_unlock;
++
++ mptcp_ack_retransmit_timer(sk);
++
++ sk_mem_reclaim(sk);
++
++out_unlock:
++ bh_unlock_sock(meta_sk);
++ sock_put(sk);
++}
++
++/* Similar to tcp_retransmit_skb
++ *
++ * The diff is that we handle the retransmission-stats (retrans_stamp) at the
++ * meta-level.
++ */
++int mptcp_retransmit_skb(struct sock *meta_sk, struct sk_buff *skb)
++{
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct sock *subsk;
++ unsigned int limit, mss_now;
++ int err = -1;
++
++ /* Do not send more than we queued. 1/4 is reserved for possible
++ * copying overhead: fragmentation, tunneling, mangling etc.
++ *
++ * This is a meta-retransmission thus we check on the meta-socket.
++ */
++ if (atomic_read(&meta_sk->sk_wmem_alloc) >
++ min(meta_sk->sk_wmem_queued + (meta_sk->sk_wmem_queued >> 2), meta_sk->sk_sndbuf)) {
++ return -EAGAIN;
++ }
++
++ /* We need to make sure that the retransmitted segment can be sent on a
++ * subflow right now. If it is too big, it needs to be fragmented.
++ */
++ subsk = meta_tp->mpcb->sched_ops->get_subflow(meta_sk, skb, false);
++ if (!subsk) {
++ /* We want to increase icsk_retransmits, thus return 0, so that
++ * mptcp_retransmit_timer enters the desired branch.
++ */
++ err = 0;
++ goto failed;
++ }
++ mss_now = tcp_current_mss(subsk);
++
++ /* If the segment was cloned (e.g. a meta retransmission), the header
++ * must be expanded/copied so that there is no corruption of TSO
++ * information.
++ */
++ if (skb_unclone(skb, GFP_ATOMIC)) {
++ err = -ENOMEM;
++ goto failed;
++ }
++
++ /* Must have been set by mptcp_write_xmit before */
++ BUG_ON(!tcp_skb_pcount(skb));
++
++ limit = mss_now;
++ /* skb->len > mss_now is the equivalent of tso_segs > 1 in
++ * tcp_write_xmit. Otherwise split-point would return 0.
++ */
++ if (skb->len > mss_now && !tcp_urg_mode(meta_tp))
++ limit = tcp_mss_split_point(meta_sk, skb, mss_now,
++ UINT_MAX / mss_now,
++ TCP_NAGLE_OFF);
++
++ if (skb->len > limit &&
++ unlikely(mptcp_fragment(meta_sk, skb, limit,
++ GFP_ATOMIC, 0)))
++ goto failed;
++
++ if (!mptcp_skb_entail(subsk, skb, -1))
++ goto failed;
++ TCP_SKB_CB(skb)->when = tcp_time_stamp;
++
++ /* Update global TCP statistics. */
++ TCP_INC_STATS(sock_net(meta_sk), TCP_MIB_RETRANSSEGS);
++
++ /* Diff to tcp_retransmit_skb */
++
++ /* Save stamp of the first retransmit. */
++ if (!meta_tp->retrans_stamp)
++ meta_tp->retrans_stamp = TCP_SKB_CB(skb)->when;
++
++ __tcp_push_pending_frames(subsk, mss_now, TCP_NAGLE_PUSH);
++
++ return 0;
++
++failed:
++ NET_INC_STATS_BH(sock_net(meta_sk), LINUX_MIB_TCPRETRANSFAIL);
++ return err;
++}
++
++/* Similar to tcp_retransmit_timer
++ *
++ * The diff is that we have to handle retransmissions of the FAST_CLOSE-message
++ * and that we don't have an srtt estimation at the meta-level.
++ */
++void mptcp_retransmit_timer(struct sock *meta_sk)
++{
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct mptcp_cb *mpcb = meta_tp->mpcb;
++ struct inet_connection_sock *meta_icsk = inet_csk(meta_sk);
++ int err;
++
++ /* In fallback, retransmission is handled at the subflow-level */
++ if (!meta_tp->packets_out || mpcb->infinite_mapping_snd ||
++ mpcb->send_infinite_mapping)
++ return;
++
++ WARN_ON(tcp_write_queue_empty(meta_sk));
++
++ if (!meta_tp->snd_wnd && !sock_flag(meta_sk, SOCK_DEAD) &&
++ !((1 << meta_sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV))) {
++ /* Receiver dastardly shrinks window. Our retransmits
++ * become zero probes, but we should not timeout this
++ * connection. If the socket is an orphan, time it out,
++ * we cannot allow such beasts to hang infinitely.
++ */
++ struct inet_sock *meta_inet = inet_sk(meta_sk);
++ if (meta_sk->sk_family == AF_INET) {
++ LIMIT_NETDEBUG(KERN_DEBUG "MPTCP: Peer %pI4:%u/%u unexpectedly shrunk window %u:%u (repaired)\n",
++ &meta_inet->inet_daddr,
++ ntohs(meta_inet->inet_dport),
++ meta_inet->inet_num, meta_tp->snd_una,
++ meta_tp->snd_nxt);
++ }
++#if IS_ENABLED(CONFIG_IPV6)
++ else if (meta_sk->sk_family == AF_INET6) {
++ LIMIT_NETDEBUG(KERN_DEBUG "MPTCP: Peer %pI6:%u/%u unexpectedly shrunk window %u:%u (repaired)\n",
++ &meta_sk->sk_v6_daddr,
++ ntohs(meta_inet->inet_dport),
++ meta_inet->inet_num, meta_tp->snd_una,
++ meta_tp->snd_nxt);
++ }
++#endif
++ if (tcp_time_stamp - meta_tp->rcv_tstamp > TCP_RTO_MAX) {
++ tcp_write_err(meta_sk);
++ return;
++ }
++
++ mptcp_retransmit_skb(meta_sk, tcp_write_queue_head(meta_sk));
++ goto out_reset_timer;
++ }
++
++ if (tcp_write_timeout(meta_sk))
++ return;
++
++ if (meta_icsk->icsk_retransmits == 0)
++ NET_INC_STATS_BH(sock_net(meta_sk), LINUX_MIB_TCPTIMEOUTS);
++
++ meta_icsk->icsk_ca_state = TCP_CA_Loss;
++
++ err = mptcp_retransmit_skb(meta_sk, tcp_write_queue_head(meta_sk));
++ if (err > 0) {
++ /* Retransmission failed because of local congestion,
++ * do not backoff.
++ */
++ if (!meta_icsk->icsk_retransmits)
++ meta_icsk->icsk_retransmits = 1;
++ inet_csk_reset_xmit_timer(meta_sk, ICSK_TIME_RETRANS,
++ min(meta_icsk->icsk_rto, TCP_RESOURCE_PROBE_INTERVAL),
++ TCP_RTO_MAX);
++ return;
++ }
++
++ /* Increase the timeout each time we retransmit. Note that
++ * we do not increase the rtt estimate. rto is initialized
++ * from rtt, but increases here. Jacobson (SIGCOMM 88) suggests
++ * that doubling rto each time is the least we can get away with.
++ * In KA9Q, Karn uses this for the first few times, and then
++ * goes to quadratic. netBSD doubles, but only goes up to *64,
++ * and clamps at 1 to 64 sec afterwards. Note that 120 sec is
++ * defined in the protocol as the maximum possible RTT. I guess
++ * we'll have to use something other than TCP to talk to the
++ * University of Mars.
++ *
++ * PAWS allows us longer timeouts and large windows, so once
++ * implemented ftp to mars will work nicely. We will have to fix
++ * the 120 second clamps though!
++ */
++ meta_icsk->icsk_backoff++;
++ meta_icsk->icsk_retransmits++;
++
++out_reset_timer:
++ /* If stream is thin, use linear timeouts. Since 'icsk_backoff' is
++ * used to reset timer, set to 0. Recalculate 'icsk_rto' as this
++ * might be increased if the stream oscillates between thin and thick,
++ * thus the old value might already be too high compared to the value
++ * set by 'tcp_set_rto' in tcp_input.c which resets the rto without
++ * backoff. Limit to TCP_THIN_LINEAR_RETRIES before initiating
++ * exponential backoff behaviour to avoid continuing to hammer
++ * linear-timeout retransmissions into a black hole.
++ */
++ if (meta_sk->sk_state == TCP_ESTABLISHED &&
++ (meta_tp->thin_lto || sysctl_tcp_thin_linear_timeouts) &&
++ tcp_stream_is_thin(meta_tp) &&
++ meta_icsk->icsk_retransmits <= TCP_THIN_LINEAR_RETRIES) {
++ meta_icsk->icsk_backoff = 0;
++ /* We cannot do the same as in tcp_write_timer because the
++ * srtt is not set here.
++ */
++ mptcp_set_rto(meta_sk);
++ } else {
++ /* Use normal (exponential) backoff */
++ meta_icsk->icsk_rto = min(meta_icsk->icsk_rto << 1, TCP_RTO_MAX);
++ }
++ inet_csk_reset_xmit_timer(meta_sk, ICSK_TIME_RETRANS, meta_icsk->icsk_rto, TCP_RTO_MAX);
++
++ return;
++}
++
++/* Modify values to an mptcp-level for the initial window of new subflows */
++void mptcp_select_initial_window(int __space, __u32 mss, __u32 *rcv_wnd,
++ __u32 *window_clamp, int wscale_ok,
++ __u8 *rcv_wscale, __u32 init_rcv_wnd,
++ const struct sock *sk)
++{
++ struct mptcp_cb *mpcb = tcp_sk(sk)->mpcb;
++
++ *window_clamp = mpcb->orig_window_clamp;
++ __space = tcp_win_from_space(mpcb->orig_sk_rcvbuf);
++
++ tcp_select_initial_window(__space, mss, rcv_wnd, window_clamp,
++ wscale_ok, rcv_wscale, init_rcv_wnd, sk);
++}
++
++static inline u64 mptcp_calc_rate(const struct sock *meta_sk, unsigned int mss,
++ unsigned int (*mss_cb)(struct sock *sk))
++{
++ struct sock *sk;
++ u64 rate = 0;
++
++ mptcp_for_each_sk(tcp_sk(meta_sk)->mpcb, sk) {
++ struct tcp_sock *tp = tcp_sk(sk);
++ int this_mss;
++ u64 this_rate;
++
++ if (!mptcp_sk_can_send(sk))
++ continue;
++
++ /* Do not consider subflows without a RTT estimation yet
++ * otherwise this_rate >>> rate.
++ */
++ if (unlikely(!tp->srtt_us))
++ continue;
++
++ this_mss = mss_cb(sk);
++
++ /* If this_mss is smaller than mss, it means that a segment will
++ * be split in two (or more) when pushed on this subflow. If
++ * you consider that mss = 1428 and this_mss = 1420, then two
++ * segments will be generated: a 1420-byte and an 8-byte segment.
++ * The latter introduces a large overhead, as two slots of the
++ * congestion window are used for a single data segment, roughly
++ * halving the potential throughput of this subflow. Indeed,
++ * 1428 bytes will be sent while 2840 could have been sent if
++ * mss == 1420, reducing the throughput by 2840 / 1428.
++ *
++ * The following algorithm takes this overhead into account
++ * when computing the potential throughput that MPTCP can
++ * achieve when generating mss-byte segments.
++ *
++ * The formula is the following:
++ * \sum_{\forall sub} ratio * \frac{mss * cwnd_sub}{rtt_sub}
++ * Where ratio is computed as follows:
++ * \frac{mss}{\ceil{mss / mss_sub} * mss_sub}
++ *
++ * ratio gives the reduction factor of the theoretical
++ * throughput a subflow can achieve if MPTCP uses a specific
++ * MSS value.
++ */
++ this_rate = div64_u64((u64)mss * mss * (USEC_PER_SEC << 3) *
++ max(tp->snd_cwnd, tp->packets_out),
++ (u64)tp->srtt_us *
++ DIV_ROUND_UP(mss, this_mss) * this_mss);
++ rate += this_rate;
++ }
++
++ return rate;
++}
++
++static unsigned int __mptcp_current_mss(const struct sock *meta_sk,
++ unsigned int (*mss_cb)(struct sock *sk))
++{
++ unsigned int mss = 0;
++ u64 rate = 0;
++ struct sock *sk;
++
++ mptcp_for_each_sk(tcp_sk(meta_sk)->mpcb, sk) {
++ int this_mss;
++ u64 this_rate;
++
++ if (!mptcp_sk_can_send(sk))
++ continue;
++
++ this_mss = mss_cb(sk);
++
++ /* Same mss values will produce the same throughput. */
++ if (this_mss == mss)
++ continue;
++
++ /* See whether using this mss value can theoretically improve
++ * the performances.
++ */
++ this_rate = mptcp_calc_rate(meta_sk, this_mss, mss_cb);
++ if (this_rate >= rate) {
++ mss = this_mss;
++ rate = this_rate;
++ }
++ }
++
++ return mss;
++}
++
++unsigned int mptcp_current_mss(struct sock *meta_sk)
++{
++ unsigned int mss = __mptcp_current_mss(meta_sk, tcp_current_mss);
++
++ /* If no subflow is available, we take a default-mss from the
++ * meta-socket.
++ */
++ return !mss ? tcp_current_mss(meta_sk) : mss;
++}
++
++static unsigned int mptcp_select_size_mss(struct sock *sk)
++{
++ return tcp_sk(sk)->mss_cache;
++}
++
++int mptcp_select_size(const struct sock *meta_sk, bool sg)
++{
++ unsigned int mss = __mptcp_current_mss(meta_sk, mptcp_select_size_mss);
++
++ if (sg) {
++ if (mptcp_sk_can_gso(meta_sk)) {
++ mss = SKB_WITH_OVERHEAD(2048 - MAX_TCP_HEADER);
++ } else {
++ int pgbreak = SKB_MAX_HEAD(MAX_TCP_HEADER);
++
++ if (mss >= pgbreak &&
++ mss <= pgbreak + (MAX_SKB_FRAGS - 1) * PAGE_SIZE)
++ mss = pgbreak;
++ }
++ }
++
++ return !mss ? tcp_sk(meta_sk)->mss_cache : mss;
++}
++
++int mptcp_check_snd_buf(const struct tcp_sock *tp)
++{
++ const struct sock *sk;
++ u32 rtt_max = tp->srtt_us;
++ u64 bw_est;
++
++ if (!tp->srtt_us)
++ return tp->reordering + 1;
++
++ mptcp_for_each_sk(tp->mpcb, sk) {
++ if (!mptcp_sk_can_send(sk))
++ continue;
++
++ if (rtt_max < tcp_sk(sk)->srtt_us)
++ rtt_max = tcp_sk(sk)->srtt_us;
++ }
++
++ bw_est = div64_u64(((u64)tp->snd_cwnd * rtt_max) << 16,
++ (u64)tp->srtt_us);
++
++ return max_t(unsigned int, (u32)(bw_est >> 16),
++ tp->reordering + 1);
++}
++
++unsigned int mptcp_xmit_size_goal(const struct sock *meta_sk, u32 mss_now,
++ int large_allowed)
++{
++ struct sock *sk;
++ u32 xmit_size_goal = 0;
++
++ if (large_allowed && mptcp_sk_can_gso(meta_sk)) {
++ mptcp_for_each_sk(tcp_sk(meta_sk)->mpcb, sk) {
++ int this_size_goal;
++
++ if (!mptcp_sk_can_send(sk))
++ continue;
++
++ this_size_goal = tcp_xmit_size_goal(sk, mss_now, 1);
++ if (this_size_goal > xmit_size_goal)
++ xmit_size_goal = this_size_goal;
++ }
++ }
++
++ return max(xmit_size_goal, mss_now);
++}
++
++/* Similar to tcp_trim_head - but we correctly copy the DSS-option */
++int mptcp_trim_head(struct sock *sk, struct sk_buff *skb, u32 len)
++{
++ if (skb_cloned(skb)) {
++ if (pskb_expand_head(skb, 0, 0, GFP_ATOMIC))
++ return -ENOMEM;
++ }
++
++ __pskb_trim_head(skb, len);
++
++ TCP_SKB_CB(skb)->seq += len;
++ skb->ip_summed = CHECKSUM_PARTIAL;
++
++ skb->truesize -= len;
++ sk->sk_wmem_queued -= len;
++ sk_mem_uncharge(sk, len);
++ sock_set_flag(sk, SOCK_QUEUE_SHRUNK);
++
++ /* Any change of skb->len requires recalculation of tso factor. */
++ if (tcp_skb_pcount(skb) > 1)
++ tcp_set_skb_tso_segs(sk, skb, tcp_skb_mss(skb));
++
++ return 0;
++}
+diff --git a/net/mptcp/mptcp_pm.c b/net/mptcp/mptcp_pm.c
+new file mode 100644
+index 000000000000..9542f950729f
+--- /dev/null
++++ b/net/mptcp/mptcp_pm.c
+@@ -0,0 +1,169 @@
++/*
++ * MPTCP implementation - MPTCP-subflow-management
++ *
++ * Initial Design & Implementation:
++ * Sébastien Barré <sebastien.barre@uclouvain.be>
++ *
++ * Current Maintainer & Author:
++ * Christoph Paasch <christoph.paasch@uclouvain.be>
++ *
++ * Additional authors:
++ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
++ * Gregory Detal <gregory.detal@uclouvain.be>
++ * Fabien Duchêne <fabien.duchene@uclouvain.be>
++ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
++ * Lavkesh Lahngir <lavkesh51@gmail.com>
++ * Andreas Ripke <ripke@neclab.eu>
++ * Vlad Dogaru <vlad.dogaru@intel.com>
++ * Octavian Purdila <octavian.purdila@intel.com>
++ * John Ronan <jronan@tssg.org>
++ * Catalin Nicutar <catalin.nicutar@gmail.com>
++ * Brandon Heller <brandonh@stanford.edu>
++ *
++ *
++ * This program is free software; you can redistribute it and/or
++ * modify it under the terms of the GNU General Public License
++ * as published by the Free Software Foundation; either version
++ * 2 of the License, or (at your option) any later version.
++ */
++
++
++#include <linux/module.h>
++#include <net/mptcp.h>
++
++static DEFINE_SPINLOCK(mptcp_pm_list_lock);
++static LIST_HEAD(mptcp_pm_list);
++
++static int mptcp_default_id(sa_family_t family, union inet_addr *addr,
++ struct net *net, bool *low_prio)
++{
++ return 0;
++}
++
++struct mptcp_pm_ops mptcp_pm_default = {
++ .get_local_id = mptcp_default_id, /* We do not care */
++ .name = "default",
++ .owner = THIS_MODULE,
++};
++
++static struct mptcp_pm_ops *mptcp_pm_find(const char *name)
++{
++ struct mptcp_pm_ops *e;
++
++ list_for_each_entry_rcu(e, &mptcp_pm_list, list) {
++ if (strcmp(e->name, name) == 0)
++ return e;
++ }
++
++ return NULL;
++}
++
++int mptcp_register_path_manager(struct mptcp_pm_ops *pm)
++{
++ int ret = 0;
++
++ if (!pm->get_local_id)
++ return -EINVAL;
++
++ spin_lock(&mptcp_pm_list_lock);
++ if (mptcp_pm_find(pm->name)) {
++ pr_notice("%s already registered\n", pm->name);
++ ret = -EEXIST;
++ } else {
++ list_add_tail_rcu(&pm->list, &mptcp_pm_list);
++ pr_info("%s registered\n", pm->name);
++ }
++ spin_unlock(&mptcp_pm_list_lock);
++
++ return ret;
++}
++EXPORT_SYMBOL_GPL(mptcp_register_path_manager);
++
++void mptcp_unregister_path_manager(struct mptcp_pm_ops *pm)
++{
++ spin_lock(&mptcp_pm_list_lock);
++ list_del_rcu(&pm->list);
++ spin_unlock(&mptcp_pm_list_lock);
++}
++EXPORT_SYMBOL_GPL(mptcp_unregister_path_manager);
++
++void mptcp_get_default_path_manager(char *name)
++{
++ struct mptcp_pm_ops *pm;
++
++ BUG_ON(list_empty(&mptcp_pm_list));
++
++ rcu_read_lock();
++ pm = list_entry(mptcp_pm_list.next, struct mptcp_pm_ops, list);
++ strncpy(name, pm->name, MPTCP_PM_NAME_MAX);
++ rcu_read_unlock();
++}
++
++int mptcp_set_default_path_manager(const char *name)
++{
++ struct mptcp_pm_ops *pm;
++ int ret = -ENOENT;
++
++ spin_lock(&mptcp_pm_list_lock);
++ pm = mptcp_pm_find(name);
++#ifdef CONFIG_MODULES
++ if (!pm && capable(CAP_NET_ADMIN)) {
++ spin_unlock(&mptcp_pm_list_lock);
++
++ request_module("mptcp_%s", name);
++ spin_lock(&mptcp_pm_list_lock);
++ pm = mptcp_pm_find(name);
++ }
++#endif
++
++ if (pm) {
++ list_move(&pm->list, &mptcp_pm_list);
++ ret = 0;
++ } else {
++ pr_info("%s is not available\n", name);
++ }
++ spin_unlock(&mptcp_pm_list_lock);
++
++ return ret;
++}
++
++void mptcp_init_path_manager(struct mptcp_cb *mpcb)
++{
++ struct mptcp_pm_ops *pm;
++
++ rcu_read_lock();
++ list_for_each_entry_rcu(pm, &mptcp_pm_list, list) {
++ if (try_module_get(pm->owner)) {
++ mpcb->pm_ops = pm;
++ break;
++ }
++ }
++ rcu_read_unlock();
++}
++
++/* Manage refcounts on socket close. */
++void mptcp_cleanup_path_manager(struct mptcp_cb *mpcb)
++{
++ module_put(mpcb->pm_ops->owner);
++}
++
++/* Fallback to the default path-manager. */
++void mptcp_fallback_default(struct mptcp_cb *mpcb)
++{
++ struct mptcp_pm_ops *pm;
++
++ mptcp_cleanup_path_manager(mpcb);
++ pm = mptcp_pm_find("default");
++
++ /* Cannot fail - it's the default module */
++ try_module_get(pm->owner);
++ mpcb->pm_ops = pm;
++}
++EXPORT_SYMBOL_GPL(mptcp_fallback_default);
++
++/* Set default value from kernel configuration at bootup */
++static int __init mptcp_path_manager_default(void)
++{
++ return mptcp_set_default_path_manager(CONFIG_DEFAULT_MPTCP_PM);
++}
++late_initcall(mptcp_path_manager_default);
+diff --git a/net/mptcp/mptcp_rr.c b/net/mptcp/mptcp_rr.c
+new file mode 100644
+index 000000000000..93278f684069
+--- /dev/null
++++ b/net/mptcp/mptcp_rr.c
+@@ -0,0 +1,301 @@
++/* MPTCP Scheduler module selector. Highly inspired by tcp_cong.c */
++
++#include <linux/module.h>
++#include <net/mptcp.h>
++
++static unsigned char num_segments __read_mostly = 1;
++module_param(num_segments, byte, 0644);
++MODULE_PARM_DESC(num_segments, "The number of consecutive segments that are part of a burst");
++
++static bool cwnd_limited __read_mostly = 1;
++module_param(cwnd_limited, bool, 0644);
++MODULE_PARM_DESC(cwnd_limited, "if set to 1, the scheduler tries to fill the congestion-window on all subflows");
++
++struct rrsched_priv {
++ unsigned char quota;
++};
++
++static struct rrsched_priv *rrsched_get_priv(const struct tcp_sock *tp)
++{
++ return (struct rrsched_priv *)&tp->mptcp->mptcp_sched[0];
++}
++
++/* Is the sub-socket sk available to send the skb? */
++static bool mptcp_rr_is_available(const struct sock *sk, const struct sk_buff *skb,
++ bool zero_wnd_test, bool cwnd_test)
++{
++ const struct tcp_sock *tp = tcp_sk(sk);
++ unsigned int space, in_flight;
++
++ /* Set of states for which we are allowed to send data */
++ if (!mptcp_sk_can_send(sk))
++ return false;
++
++ /* We do not send data on this subflow unless it is
++ * fully established, i.e. the 4th ack has been received.
++ */
++ if (tp->mptcp->pre_established)
++ return false;
++
++ if (tp->pf)
++ return false;
++
++ if (inet_csk(sk)->icsk_ca_state == TCP_CA_Loss) {
++ /* If SACK is disabled, and we got a loss, TCP does not exit
++ * the loss-state until something above high_seq has been acked.
++ * (see tcp_try_undo_recovery)
++ *
++ * high_seq is the snd_nxt at the moment of the RTO. As soon
++ * as we have an RTO, we won't push data on the subflow.
++ * Thus, snd_una can never go beyond high_seq.
++ */
++ if (!tcp_is_reno(tp))
++ return false;
++ else if (tp->snd_una != tp->high_seq)
++ return false;
++ }
++
++ if (!tp->mptcp->fully_established) {
++ /* Make sure that we send in-order data */
++ if (skb && tp->mptcp->second_packet &&
++ tp->mptcp->last_end_data_seq != TCP_SKB_CB(skb)->seq)
++ return false;
++ }
++
++ if (!cwnd_test)
++ goto zero_wnd_test;
++
++ in_flight = tcp_packets_in_flight(tp);
++ /* Not even a single spot in the cwnd */
++ if (in_flight >= tp->snd_cwnd)
++ return false;
++
++ /* Now, check if what is queued in the subflow's send-queue
++ * already fills the cwnd.
++ */
++ space = (tp->snd_cwnd - in_flight) * tp->mss_cache;
++
++ if (tp->write_seq - tp->snd_nxt > space)
++ return false;
++
++zero_wnd_test:
++ if (zero_wnd_test && !before(tp->write_seq, tcp_wnd_end(tp)))
++ return false;
++
++ return true;
++}
++
++/* Are we not allowed to reinject this skb on tp? */
++static int mptcp_rr_dont_reinject_skb(const struct tcp_sock *tp, const struct sk_buff *skb)
++{
++ /* If the skb has already been enqueued in this sk, try to find
++ * another one.
++ */
++ return skb &&
++ /* Has the skb already been enqueued into this subsocket? */
++ mptcp_pi_to_flag(tp->mptcp->path_index) & TCP_SKB_CB(skb)->path_mask;
++}
++
++/* We just look for any subflow that is available */
++static struct sock *rr_get_available_subflow(struct sock *meta_sk,
++ struct sk_buff *skb,
++ bool zero_wnd_test)
++{
++ const struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct sock *sk, *bestsk = NULL, *backupsk = NULL;
++
++ /* Answer data_fin on same subflow!!! */
++ if (meta_sk->sk_shutdown & RCV_SHUTDOWN &&
++ skb && mptcp_is_data_fin(skb)) {
++ mptcp_for_each_sk(mpcb, sk) {
++ if (tcp_sk(sk)->mptcp->path_index == mpcb->dfin_path_index &&
++ mptcp_rr_is_available(sk, skb, zero_wnd_test, true))
++ return sk;
++ }
++ }
++
++ /* First, find the best subflow */
++ mptcp_for_each_sk(mpcb, sk) {
++ struct tcp_sock *tp = tcp_sk(sk);
++
++ if (!mptcp_rr_is_available(sk, skb, zero_wnd_test, true))
++ continue;
++
++ if (mptcp_rr_dont_reinject_skb(tp, skb)) {
++ backupsk = sk;
++ continue;
++ }
++
++ bestsk = sk;
++ }
++
++ if (bestsk) {
++ sk = bestsk;
++ } else if (backupsk) {
++ /* It has been sent on all subflows once - let's give it a
++ * chance again by restarting its pathmask.
++ */
++ if (skb)
++ TCP_SKB_CB(skb)->path_mask = 0;
++ sk = backupsk;
++ }
++
++ return sk;
++}
++
++/* Returns the next segment to be sent from the mptcp meta-queue.
++ * (chooses the reinject queue if any segment is waiting in it, otherwise,
++ * chooses the normal write queue).
++ * Sets *@reinject to 1 if the returned segment comes from the
++ * reinject queue. Sets it to 0 if it is the regular send-head of the meta-sk,
++ * and sets it to -1 if it is a meta-level retransmission to optimize the
++ * receive-buffer.
++ */
++static struct sk_buff *__mptcp_rr_next_segment(const struct sock *meta_sk, int *reinject)
++{
++ const struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct sk_buff *skb = NULL;
++
++ *reinject = 0;
++
++ /* If we are in fallback-mode, just take from the meta-send-queue */
++ if (mpcb->infinite_mapping_snd || mpcb->send_infinite_mapping)
++ return tcp_send_head(meta_sk);
++
++ skb = skb_peek(&mpcb->reinject_queue);
++
++ if (skb)
++ *reinject = 1;
++ else
++ skb = tcp_send_head(meta_sk);
++ return skb;
++}
++
++static struct sk_buff *mptcp_rr_next_segment(struct sock *meta_sk,
++ int *reinject,
++ struct sock **subsk,
++ unsigned int *limit)
++{
++ const struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct sock *sk_it, *choose_sk = NULL;
++ struct sk_buff *skb = __mptcp_rr_next_segment(meta_sk, reinject);
++ unsigned char split = num_segments;
++ unsigned char iter = 0, full_subs = 0;
++
++ /* As we set it, we have to reset it as well. */
++ *limit = 0;
++
++ if (!skb)
++ return NULL;
++
++ if (*reinject) {
++ *subsk = rr_get_available_subflow(meta_sk, skb, false);
++ if (!*subsk)
++ return NULL;
++
++ return skb;
++ }
++
++retry:
++
++ /* First, we look for a subflow that is currently being used */
++ mptcp_for_each_sk(mpcb, sk_it) {
++ struct tcp_sock *tp_it = tcp_sk(sk_it);
++ struct rrsched_priv *rsp = rrsched_get_priv(tp_it);
++
++ if (!mptcp_rr_is_available(sk_it, skb, false, cwnd_limited))
++ continue;
++
++ iter++;
++
++ /* Is this subflow currently being used? */
++ if (rsp->quota > 0 && rsp->quota < num_segments) {
++ split = num_segments - rsp->quota;
++ choose_sk = sk_it;
++ goto found;
++ }
++
++ /* Or, it's totally unused */
++ if (!rsp->quota) {
++ split = num_segments;
++ choose_sk = sk_it;
++ }
++
++ /* Or, it must then be fully used */
++ if (rsp->quota == num_segments)
++ full_subs++;
++ }
++
++ /* All considered subflows have a full quota, and we considered at
++ * least one.
++ */
++ if (iter && iter == full_subs) {
++ /* So, we restart this round by setting quota to 0 and retry
++ * to find a subflow.
++ */
++ mptcp_for_each_sk(mpcb, sk_it) {
++ struct tcp_sock *tp_it = tcp_sk(sk_it);
++ struct rrsched_priv *rsp = rrsched_get_priv(tp_it);
++
++ if (!mptcp_rr_is_available(sk_it, skb, false, cwnd_limited))
++ continue;
++
++ rsp->quota = 0;
++ }
++
++ goto retry;
++ }
++
++found:
++ if (choose_sk) {
++ unsigned int mss_now;
++ struct tcp_sock *choose_tp = tcp_sk(choose_sk);
++ struct rrsched_priv *rsp = rrsched_get_priv(choose_tp);
++
++ if (!mptcp_rr_is_available(choose_sk, skb, false, true))
++ return NULL;
++
++ *subsk = choose_sk;
++ mss_now = tcp_current_mss(*subsk);
++ *limit = split * mss_now;
++
++ if (skb->len > mss_now)
++ rsp->quota += DIV_ROUND_UP(skb->len, mss_now);
++ else
++ rsp->quota++;
++
++ return skb;
++ }
++
++ return NULL;
++}
++
++static struct mptcp_sched_ops mptcp_sched_rr = {
++ .get_subflow = rr_get_available_subflow,
++ .next_segment = mptcp_rr_next_segment,
++ .name = "roundrobin",
++ .owner = THIS_MODULE,
++};
++
++static int __init rr_register(void)
++{
++ BUILD_BUG_ON(sizeof(struct rrsched_priv) > MPTCP_SCHED_SIZE);
++
++ if (mptcp_register_scheduler(&mptcp_sched_rr))
++ return -1;
++
++ return 0;
++}
++
++static void rr_unregister(void)
++{
++ mptcp_unregister_scheduler(&mptcp_sched_rr);
++}
++
++module_init(rr_register);
++module_exit(rr_unregister);
++
++MODULE_AUTHOR("Christoph Paasch");
++MODULE_LICENSE("GPL");
++MODULE_DESCRIPTION("ROUNDROBIN MPTCP");
++MODULE_VERSION("0.89");
+diff --git a/net/mptcp/mptcp_sched.c b/net/mptcp/mptcp_sched.c
+new file mode 100644
+index 000000000000..6c7ff4eceac1
+--- /dev/null
++++ b/net/mptcp/mptcp_sched.c
+@@ -0,0 +1,493 @@
++/* MPTCP Scheduler module selector. Highly inspired by tcp_cong.c */
++
++#include <linux/module.h>
++#include <net/mptcp.h>
++
++static DEFINE_SPINLOCK(mptcp_sched_list_lock);
++static LIST_HEAD(mptcp_sched_list);
++
++struct defsched_priv {
++ u32 last_rbuf_opti;
++};
++
++static struct defsched_priv *defsched_get_priv(const struct tcp_sock *tp)
++{
++ return (struct defsched_priv *)&tp->mptcp->mptcp_sched[0];
++}
++
++/* Is the sub-socket sk available to send the skb? */
++static bool mptcp_is_available(struct sock *sk, const struct sk_buff *skb,
++ bool zero_wnd_test)
++{
++ const struct tcp_sock *tp = tcp_sk(sk);
++ unsigned int mss_now, space, in_flight;
++
++ /* Set of states for which we are allowed to send data */
++ if (!mptcp_sk_can_send(sk))
++ return false;
++
++ /* We do not send data on this subflow unless it is
++ * fully established, i.e. the 4th ack has been received.
++ */
++ if (tp->mptcp->pre_established)
++ return false;
++
++ if (tp->pf)
++ return false;
++
++ if (inet_csk(sk)->icsk_ca_state == TCP_CA_Loss) {
++ /* If SACK is disabled, and we got a loss, TCP does not exit
++ * the loss-state until something above high_seq has been acked.
++ * (see tcp_try_undo_recovery)
++ *
++ * high_seq is the snd_nxt at the moment of the RTO. As soon
++ * as we have an RTO, we won't push data on the subflow.
++ * Thus, snd_una can never go beyond high_seq.
++ */
++ if (!tcp_is_reno(tp))
++ return false;
++ else if (tp->snd_una != tp->high_seq)
++ return false;
++ }
++
++ if (!tp->mptcp->fully_established) {
++ /* Make sure that we send in-order data */
++ if (skb && tp->mptcp->second_packet &&
++ tp->mptcp->last_end_data_seq != TCP_SKB_CB(skb)->seq)
++ return false;
++ }
++
++ /* If TSQ is already throttling us, do not send on this subflow. When
++ * TSQ gets cleared the subflow becomes eligible again.
++ */
++ if (test_bit(TSQ_THROTTLED, &tp->tsq_flags))
++ return false;
++
++ in_flight = tcp_packets_in_flight(tp);
++ /* Not even a single spot in the cwnd */
++ if (in_flight >= tp->snd_cwnd)
++ return false;
++
++ /* Now, check if what is queued in the subflow's send-queue
++ * already fills the cwnd.
++ */
++ space = (tp->snd_cwnd - in_flight) * tp->mss_cache;
++
++ if (tp->write_seq - tp->snd_nxt > space)
++ return false;
++
++ if (zero_wnd_test && !before(tp->write_seq, tcp_wnd_end(tp)))
++ return false;
++
++ mss_now = tcp_current_mss(sk);
++
++ /* Don't send on this subflow if we bypass the allowed send-window at
++ * the per-subflow level. Similar to tcp_snd_wnd_test, but with a
++ * manually calculated end_seq (because at this point end_seq is
++ * still at the meta-level).
++ */
++ if (skb && !zero_wnd_test &&
++ after(tp->write_seq + min(skb->len, mss_now), tcp_wnd_end(tp)))
++ return false;
++
++ return true;
++}
++
++/* Are we not allowed to reinject this skb on tp? */
++static int mptcp_dont_reinject_skb(const struct tcp_sock *tp, const struct sk_buff *skb)
++{
++ /* If the skb has already been enqueued in this sk, try to find
++ * another one.
++ */
++ return skb &&
++ /* Has the skb already been enqueued into this subsocket? */
++ mptcp_pi_to_flag(tp->mptcp->path_index) & TCP_SKB_CB(skb)->path_mask;
++}
++
++/* This is the scheduler. This function decides on which flow to send
++ * a given MSS. If all subflows are found to be busy, NULL is returned
++ * The flow is selected based on the shortest RTT.
++ * If all paths have full cong windows, we simply return NULL.
++ *
++ * Additionally, this function is aware of the backup-subflows.
++ */
++static struct sock *get_available_subflow(struct sock *meta_sk,
++ struct sk_buff *skb,
++ bool zero_wnd_test)
++{
++ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct sock *sk, *bestsk = NULL, *lowpriosk = NULL, *backupsk = NULL;
++ u32 min_time_to_peer = 0xffffffff, lowprio_min_time_to_peer = 0xffffffff;
++ int cnt_backups = 0;
++
++ /* if there is only one subflow, bypass the scheduling function */
++ if (mpcb->cnt_subflows == 1) {
++ bestsk = (struct sock *)mpcb->connection_list;
++ if (!mptcp_is_available(bestsk, skb, zero_wnd_test))
++ bestsk = NULL;
++ return bestsk;
++ }
++
++ /* Answer data_fin on same subflow!!! */
++ if (meta_sk->sk_shutdown & RCV_SHUTDOWN &&
++ skb && mptcp_is_data_fin(skb)) {
++ mptcp_for_each_sk(mpcb, sk) {
++ if (tcp_sk(sk)->mptcp->path_index == mpcb->dfin_path_index &&
++ mptcp_is_available(sk, skb, zero_wnd_test))
++ return sk;
++ }
++ }
++
++ /* First, find the best subflow */
++ mptcp_for_each_sk(mpcb, sk) {
++ struct tcp_sock *tp = tcp_sk(sk);
++
++ if (tp->mptcp->rcv_low_prio || tp->mptcp->low_prio)
++ cnt_backups++;
++
++ if ((tp->mptcp->rcv_low_prio || tp->mptcp->low_prio) &&
++ tp->srtt_us < lowprio_min_time_to_peer) {
++ if (!mptcp_is_available(sk, skb, zero_wnd_test))
++ continue;
++
++ if (mptcp_dont_reinject_skb(tp, skb)) {
++ backupsk = sk;
++ continue;
++ }
++
++ lowprio_min_time_to_peer = tp->srtt_us;
++ lowpriosk = sk;
++ } else if (!(tp->mptcp->rcv_low_prio || tp->mptcp->low_prio) &&
++ tp->srtt_us < min_time_to_peer) {
++ if (!mptcp_is_available(sk, skb, zero_wnd_test))
++ continue;
++
++ if (mptcp_dont_reinject_skb(tp, skb)) {
++ backupsk = sk;
++ continue;
++ }
++
++ min_time_to_peer = tp->srtt_us;
++ bestsk = sk;
++ }
++ }
++
++ if (mpcb->cnt_established == cnt_backups && lowpriosk) {
++ sk = lowpriosk;
++ } else if (bestsk) {
++ sk = bestsk;
++ } else if (backupsk) {
++ /* It has been sent on all subflows once - let's give it a
++ * chance again by restarting its pathmask.
++ */
++ if (skb)
++ TCP_SKB_CB(skb)->path_mask = 0;
++ sk = backupsk;
++ }
++
++ return sk;
++}
++
++static struct sk_buff *mptcp_rcv_buf_optimization(struct sock *sk, int penal)
++{
++ struct sock *meta_sk;
++ const struct tcp_sock *tp = tcp_sk(sk);
++ struct tcp_sock *tp_it;
++ struct sk_buff *skb_head;
++ struct defsched_priv *dsp = defsched_get_priv(tp);
++
++ if (tp->mpcb->cnt_subflows == 1)
++ return NULL;
++
++ meta_sk = mptcp_meta_sk(sk);
++ skb_head = tcp_write_queue_head(meta_sk);
++
++ if (!skb_head || skb_head == tcp_send_head(meta_sk))
++ return NULL;
++
++ /* If penalization is optional (coming from mptcp_next_segment())
++ * and we are not send-buffer-limited, we do not penalize. The
++ * retransmission is just an optimization to fix the idle time due
++ * to the delay before we wake up the application.
++ */
++ if (!penal && sk_stream_memory_free(meta_sk))
++ goto retrans;
++
++ /* Only penalize again after an RTT has elapsed */
++ if (tcp_time_stamp - dsp->last_rbuf_opti < usecs_to_jiffies(tp->srtt_us >> 3))
++ goto retrans;
++
++ /* Half the cwnd of the slow flow */
++ mptcp_for_each_tp(tp->mpcb, tp_it) {
++ if (tp_it != tp &&
++ TCP_SKB_CB(skb_head)->path_mask & mptcp_pi_to_flag(tp_it->mptcp->path_index)) {
++ if (tp->srtt_us < tp_it->srtt_us && inet_csk((struct sock *)tp_it)->icsk_ca_state == TCP_CA_Open) {
++ tp_it->snd_cwnd = max(tp_it->snd_cwnd >> 1U, 1U);
++ if (tp_it->snd_ssthresh != TCP_INFINITE_SSTHRESH)
++ tp_it->snd_ssthresh = max(tp_it->snd_ssthresh >> 1U, 2U);
++
++ dsp->last_rbuf_opti = tcp_time_stamp;
++ }
++ break;
++ }
++ }
++
++retrans:
++
++ /* Segment not yet injected into this path? Take it!!! */
++ if (!(TCP_SKB_CB(skb_head)->path_mask & mptcp_pi_to_flag(tp->mptcp->path_index))) {
++ bool do_retrans = false;
++ mptcp_for_each_tp(tp->mpcb, tp_it) {
++ if (tp_it != tp &&
++ TCP_SKB_CB(skb_head)->path_mask & mptcp_pi_to_flag(tp_it->mptcp->path_index)) {
++ if (tp_it->snd_cwnd <= 4) {
++ do_retrans = true;
++ break;
++ }
++
++ if (4 * tp->srtt_us >= tp_it->srtt_us) {
++ do_retrans = false;
++ break;
++ } else {
++ do_retrans = true;
++ }
++ }
++ }
++
++ if (do_retrans && mptcp_is_available(sk, skb_head, false))
++ return skb_head;
++ }
++ return NULL;
++}
++
++/* Returns the next segment to be sent from the mptcp meta-queue
++ * (chooses the reinject queue if any segment is waiting in it; otherwise
++ * chooses the normal write queue).
++ * Sets *@reinject to 1 if the returned segment comes from the
++ * reinject queue. Sets it to 0 if it is the regular send-head of the meta-sk,
++ * and sets it to -1 if it is a meta-level retransmission to optimize the
++ * receive-buffer.
++ */
++static struct sk_buff *__mptcp_next_segment(struct sock *meta_sk, int *reinject)
++{
++ const struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct sk_buff *skb = NULL;
++
++ *reinject = 0;
++
++ /* If we are in fallback-mode, just take from the meta-send-queue */
++ if (mpcb->infinite_mapping_snd || mpcb->send_infinite_mapping)
++ return tcp_send_head(meta_sk);
++
++ skb = skb_peek(&mpcb->reinject_queue);
++
++ if (skb) {
++ *reinject = 1;
++ } else {
++ skb = tcp_send_head(meta_sk);
++
++ if (!skb && meta_sk->sk_socket &&
++ test_bit(SOCK_NOSPACE, &meta_sk->sk_socket->flags) &&
++ sk_stream_wspace(meta_sk) < sk_stream_min_wspace(meta_sk)) {
++ struct sock *subsk = get_available_subflow(meta_sk, NULL,
++ false);
++ if (!subsk)
++ return NULL;
++
++ skb = mptcp_rcv_buf_optimization(subsk, 0);
++ if (skb)
++ *reinject = -1;
++ }
++ }
++ return skb;
++}
++
++static struct sk_buff *mptcp_next_segment(struct sock *meta_sk,
++ int *reinject,
++ struct sock **subsk,
++ unsigned int *limit)
++{
++ struct sk_buff *skb = __mptcp_next_segment(meta_sk, reinject);
++ unsigned int mss_now;
++ struct tcp_sock *subtp;
++ u16 gso_max_segs;
++ u32 max_len, max_segs, window, needed;
++
++ /* As we set it, we have to reset it as well. */
++ *limit = 0;
++
++ if (!skb)
++ return NULL;
++
++ *subsk = get_available_subflow(meta_sk, skb, false);
++ if (!*subsk)
++ return NULL;
++
++ subtp = tcp_sk(*subsk);
++ mss_now = tcp_current_mss(*subsk);
++
++ if (!*reinject && unlikely(!tcp_snd_wnd_test(tcp_sk(meta_sk), skb, mss_now))) {
++ skb = mptcp_rcv_buf_optimization(*subsk, 1);
++ if (skb)
++ *reinject = -1;
++ else
++ return NULL;
++ }
++
++ /* No splitting required, as we will only send one single segment */
++ if (skb->len <= mss_now)
++ return skb;
++
++ /* The following is similar to tcp_mss_split_point, but
++ * we do not care about Nagle here, because we will anyway
++ * use TCP_NAGLE_PUSH, which overrides this.
++ *
++ * So, we first limit according to the cwnd/gso-size and then according
++ * to the subflow's window.
++ */
++
++ gso_max_segs = (*subsk)->sk_gso_max_segs;
++ if (!gso_max_segs) /* No gso supported on the subflow's NIC */
++ gso_max_segs = 1;
++ max_segs = min_t(unsigned int, tcp_cwnd_test(subtp, skb), gso_max_segs);
++ if (!max_segs)
++ return NULL;
++
++ max_len = mss_now * max_segs;
++ window = tcp_wnd_end(subtp) - subtp->write_seq;
++
++ needed = min(skb->len, window);
++ if (max_len <= skb->len)
++ /* Take max_len, which is actually the cwnd/gso-size limit */
++ *limit = max_len;
++ else
++ /* Or, take the window */
++ *limit = needed;
++
++ return skb;
++}
++
++static void defsched_init(struct sock *sk)
++{
++ struct defsched_priv *dsp = defsched_get_priv(tcp_sk(sk));
++
++ dsp->last_rbuf_opti = tcp_time_stamp;
++}
++
++struct mptcp_sched_ops mptcp_sched_default = {
++ .get_subflow = get_available_subflow,
++ .next_segment = mptcp_next_segment,
++ .init = defsched_init,
++ .name = "default",
++ .owner = THIS_MODULE,
++};
++
++static struct mptcp_sched_ops *mptcp_sched_find(const char *name)
++{
++ struct mptcp_sched_ops *e;
++
++ list_for_each_entry_rcu(e, &mptcp_sched_list, list) {
++ if (strcmp(e->name, name) == 0)
++ return e;
++ }
++
++ return NULL;
++}
++
++int mptcp_register_scheduler(struct mptcp_sched_ops *sched)
++{
++ int ret = 0;
++
++ if (!sched->get_subflow || !sched->next_segment)
++ return -EINVAL;
++
++ spin_lock(&mptcp_sched_list_lock);
++ if (mptcp_sched_find(sched->name)) {
++ pr_notice("%s already registered\n", sched->name);
++ ret = -EEXIST;
++ } else {
++ list_add_tail_rcu(&sched->list, &mptcp_sched_list);
++ pr_info("%s registered\n", sched->name);
++ }
++ spin_unlock(&mptcp_sched_list_lock);
++
++ return ret;
++}
++EXPORT_SYMBOL_GPL(mptcp_register_scheduler);
++
++void mptcp_unregister_scheduler(struct mptcp_sched_ops *sched)
++{
++ spin_lock(&mptcp_sched_list_lock);
++ list_del_rcu(&sched->list);
++ spin_unlock(&mptcp_sched_list_lock);
++}
++EXPORT_SYMBOL_GPL(mptcp_unregister_scheduler);
++
++void mptcp_get_default_scheduler(char *name)
++{
++ struct mptcp_sched_ops *sched;
++
++ BUG_ON(list_empty(&mptcp_sched_list));
++
++ rcu_read_lock();
++ sched = list_entry(mptcp_sched_list.next, struct mptcp_sched_ops, list);
++ strncpy(name, sched->name, MPTCP_SCHED_NAME_MAX);
++ rcu_read_unlock();
++}
++
++int mptcp_set_default_scheduler(const char *name)
++{
++ struct mptcp_sched_ops *sched;
++ int ret = -ENOENT;
++
++ spin_lock(&mptcp_sched_list_lock);
++ sched = mptcp_sched_find(name);
++#ifdef CONFIG_MODULES
++ if (!sched && capable(CAP_NET_ADMIN)) {
++ spin_unlock(&mptcp_sched_list_lock);
++
++ request_module("mptcp_%s", name);
++ spin_lock(&mptcp_sched_list_lock);
++ sched = mptcp_sched_find(name);
++ }
++#endif
++
++ if (sched) {
++ list_move(&sched->list, &mptcp_sched_list);
++ ret = 0;
++ } else {
++ pr_info("%s is not available\n", name);
++ }
++ spin_unlock(&mptcp_sched_list_lock);
++
++ return ret;
++}
++
++void mptcp_init_scheduler(struct mptcp_cb *mpcb)
++{
++ struct mptcp_sched_ops *sched;
++
++ rcu_read_lock();
++ list_for_each_entry_rcu(sched, &mptcp_sched_list, list) {
++ if (try_module_get(sched->owner)) {
++ mpcb->sched_ops = sched;
++ break;
++ }
++ }
++ rcu_read_unlock();
++}
++
++/* Manage refcounts on socket close. */
++void mptcp_cleanup_scheduler(struct mptcp_cb *mpcb)
++{
++ module_put(mpcb->sched_ops->owner);
++}
++
++/* Set default value from kernel configuration at bootup */
++static int __init mptcp_scheduler_default(void)
++{
++ BUILD_BUG_ON(sizeof(struct defsched_priv) > MPTCP_SCHED_SIZE);
++
++ return mptcp_set_default_scheduler(CONFIG_DEFAULT_MPTCP_SCHED);
++}
++late_initcall(mptcp_scheduler_default);
+diff --git a/net/mptcp/mptcp_wvegas.c b/net/mptcp/mptcp_wvegas.c
+new file mode 100644
+index 000000000000..29ca1d868d17
+--- /dev/null
++++ b/net/mptcp/mptcp_wvegas.c
+@@ -0,0 +1,268 @@
++/*
++ * MPTCP implementation - WEIGHTED VEGAS
++ *
++ * Algorithm design:
++ * Yu Cao <cyAnalyst@126.com>
++ * Mingwei Xu <xmw@csnet1.cs.tsinghua.edu.cn>
++ * Xiaoming Fu <fu@cs.uni-goettinggen.de>
++ *
++ * Implementation:
++ * Yu Cao <cyAnalyst@126.com>
++ * Enhuan Dong <deh13@mails.tsinghua.edu.cn>
++ *
++ * Ported to the official MPTCP-kernel:
++ * Christoph Paasch <christoph.paasch@uclouvain.be>
++ *
++ * This program is free software; you can redistribute it and/or
++ * modify it under the terms of the GNU General Public License
++ * as published by the Free Software Foundation; either version
++ * 2 of the License, or (at your option) any later version.
++ */
++
++#include <linux/skbuff.h>
++#include <net/tcp.h>
++#include <net/mptcp.h>
++#include <linux/module.h>
++#include <linux/tcp.h>
++
++static int initial_alpha = 2;
++static int total_alpha = 10;
++static int gamma = 1;
++
++module_param(initial_alpha, int, 0644);
++MODULE_PARM_DESC(initial_alpha, "initial alpha for all subflows");
++module_param(total_alpha, int, 0644);
++MODULE_PARM_DESC(total_alpha, "total alpha for all subflows");
++module_param(gamma, int, 0644);
++MODULE_PARM_DESC(gamma, "limit on increase (scale by 2)");
++
++#define MPTCP_WVEGAS_SCALE 16
++
++/* wVegas variables */
++struct wvegas {
++ u32 beg_snd_nxt; /* right edge during last RTT */
++ u8 doing_wvegas_now;/* if true, do wvegas for this RTT */
++
++ u16 cnt_rtt; /* # of RTTs measured within last RTT */
++ u32 sampled_rtt; /* cumulative RTTs measured within last RTT (in usec) */
++ u32 base_rtt; /* the min of all wVegas RTT measurements seen (in usec) */
++
++ u64 instant_rate; /* cwnd / srtt_us, unit: pkts/us * 2^16 */
++ u64 weight; /* the ratio of subflow's rate to the total rate, * 2^16 */
++ int alpha; /* alpha for each subflows */
++
++ u32 queue_delay; /* queue delay */
++};
++
++
++static inline u64 mptcp_wvegas_scale(u32 val, int scale)
++{
++ return (u64) val << scale;
++}
++
++static void wvegas_enable(const struct sock *sk)
++{
++ const struct tcp_sock *tp = tcp_sk(sk);
++ struct wvegas *wvegas = inet_csk_ca(sk);
++
++ wvegas->doing_wvegas_now = 1;
++
++ wvegas->beg_snd_nxt = tp->snd_nxt;
++
++ wvegas->cnt_rtt = 0;
++ wvegas->sampled_rtt = 0;
++
++ wvegas->instant_rate = 0;
++ wvegas->alpha = initial_alpha;
++ wvegas->weight = mptcp_wvegas_scale(1, MPTCP_WVEGAS_SCALE);
++
++ wvegas->queue_delay = 0;
++}
++
++static inline void wvegas_disable(const struct sock *sk)
++{
++ struct wvegas *wvegas = inet_csk_ca(sk);
++
++ wvegas->doing_wvegas_now = 0;
++}
++
++static void mptcp_wvegas_init(struct sock *sk)
++{
++ struct wvegas *wvegas = inet_csk_ca(sk);
++
++ wvegas->base_rtt = 0x7fffffff;
++ wvegas_enable(sk);
++}
++
++static inline u64 mptcp_wvegas_rate(u32 cwnd, u32 rtt_us)
++{
++ return div_u64(mptcp_wvegas_scale(cwnd, MPTCP_WVEGAS_SCALE), rtt_us);
++}
++
++static void mptcp_wvegas_pkts_acked(struct sock *sk, u32 cnt, s32 rtt_us)
++{
++ struct wvegas *wvegas = inet_csk_ca(sk);
++ u32 vrtt;
++
++ if (rtt_us < 0)
++ return;
++
++ vrtt = rtt_us + 1;
++
++ if (vrtt < wvegas->base_rtt)
++ wvegas->base_rtt = vrtt;
++
++ wvegas->sampled_rtt += vrtt;
++ wvegas->cnt_rtt++;
++}
++
++static void mptcp_wvegas_state(struct sock *sk, u8 ca_state)
++{
++ if (ca_state == TCP_CA_Open)
++ wvegas_enable(sk);
++ else
++ wvegas_disable(sk);
++}
++
++static void mptcp_wvegas_cwnd_event(struct sock *sk, enum tcp_ca_event event)
++{
++ if (event == CA_EVENT_CWND_RESTART) {
++ mptcp_wvegas_init(sk);
++ } else if (event == CA_EVENT_LOSS) {
++ struct wvegas *wvegas = inet_csk_ca(sk);
++ wvegas->instant_rate = 0;
++ }
++}
++
++static inline u32 mptcp_wvegas_ssthresh(const struct tcp_sock *tp)
++{
++ return min(tp->snd_ssthresh, tp->snd_cwnd - 1);
++}
++
++static u64 mptcp_wvegas_weight(const struct mptcp_cb *mpcb, const struct sock *sk)
++{
++ u64 total_rate = 0;
++ struct sock *sub_sk;
++ const struct wvegas *wvegas = inet_csk_ca(sk);
++
++ if (!mpcb)
++ return wvegas->weight;
++
++
++ mptcp_for_each_sk(mpcb, sub_sk) {
++ struct wvegas *sub_wvegas = inet_csk_ca(sub_sk);
++
++ /* sampled_rtt is initialized to 0 */
++ if (mptcp_sk_can_send(sub_sk) && (sub_wvegas->sampled_rtt > 0))
++ total_rate += sub_wvegas->instant_rate;
++ }
++
++ if (total_rate && wvegas->instant_rate)
++ return div64_u64(mptcp_wvegas_scale(wvegas->instant_rate, MPTCP_WVEGAS_SCALE), total_rate);
++ else
++ return wvegas->weight;
++}
++
++static void mptcp_wvegas_cong_avoid(struct sock *sk, u32 ack, u32 acked)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct wvegas *wvegas = inet_csk_ca(sk);
++
++ if (!wvegas->doing_wvegas_now) {
++ tcp_reno_cong_avoid(sk, ack, acked);
++ return;
++ }
++
++ if (after(ack, wvegas->beg_snd_nxt)) {
++ wvegas->beg_snd_nxt = tp->snd_nxt;
++
++ if (wvegas->cnt_rtt <= 2) {
++ tcp_reno_cong_avoid(sk, ack, acked);
++ } else {
++ u32 rtt, diff, q_delay;
++ u64 target_cwnd;
++
++ rtt = wvegas->sampled_rtt / wvegas->cnt_rtt;
++ target_cwnd = div_u64(((u64)tp->snd_cwnd * wvegas->base_rtt), rtt);
++
++ diff = div_u64((u64)tp->snd_cwnd * (rtt - wvegas->base_rtt), rtt);
++
++ if (diff > gamma && tp->snd_cwnd <= tp->snd_ssthresh) {
++ tp->snd_cwnd = min(tp->snd_cwnd, (u32)target_cwnd+1);
++ tp->snd_ssthresh = mptcp_wvegas_ssthresh(tp);
++
++ } else if (tp->snd_cwnd <= tp->snd_ssthresh) {
++ tcp_slow_start(tp, acked);
++ } else {
++ if (diff >= wvegas->alpha) {
++ wvegas->instant_rate = mptcp_wvegas_rate(tp->snd_cwnd, rtt);
++ wvegas->weight = mptcp_wvegas_weight(tp->mpcb, sk);
++ wvegas->alpha = max(2U, (u32)((wvegas->weight * total_alpha) >> MPTCP_WVEGAS_SCALE));
++ }
++ if (diff > wvegas->alpha) {
++ tp->snd_cwnd--;
++ tp->snd_ssthresh = mptcp_wvegas_ssthresh(tp);
++ } else if (diff < wvegas->alpha) {
++ tp->snd_cwnd++;
++ }
++
++ /* Try to drain the link queue if needed */
++ q_delay = rtt - wvegas->base_rtt;
++ if ((wvegas->queue_delay == 0) || (wvegas->queue_delay > q_delay))
++ wvegas->queue_delay = q_delay;
++
++ if (q_delay >= 2 * wvegas->queue_delay) {
++ u32 backoff_factor = div_u64(mptcp_wvegas_scale(wvegas->base_rtt, MPTCP_WVEGAS_SCALE), 2 * rtt);
++ tp->snd_cwnd = ((u64)tp->snd_cwnd * backoff_factor) >> MPTCP_WVEGAS_SCALE;
++ wvegas->queue_delay = 0;
++ }
++ }
++
++ if (tp->snd_cwnd < 2)
++ tp->snd_cwnd = 2;
++ else if (tp->snd_cwnd > tp->snd_cwnd_clamp)
++ tp->snd_cwnd = tp->snd_cwnd_clamp;
++
++ tp->snd_ssthresh = tcp_current_ssthresh(sk);
++ }
++
++ wvegas->cnt_rtt = 0;
++ wvegas->sampled_rtt = 0;
++ }
++ /* Use normal slow start */
++ else if (tp->snd_cwnd <= tp->snd_ssthresh)
++ tcp_slow_start(tp, acked);
++}
++
++
++static struct tcp_congestion_ops mptcp_wvegas __read_mostly = {
++ .init = mptcp_wvegas_init,
++ .ssthresh = tcp_reno_ssthresh,
++ .cong_avoid = mptcp_wvegas_cong_avoid,
++ .pkts_acked = mptcp_wvegas_pkts_acked,
++ .set_state = mptcp_wvegas_state,
++ .cwnd_event = mptcp_wvegas_cwnd_event,
++
++ .owner = THIS_MODULE,
++ .name = "wvegas",
++};
++
++static int __init mptcp_wvegas_register(void)
++{
++ BUILD_BUG_ON(sizeof(struct wvegas) > ICSK_CA_PRIV_SIZE);
++ tcp_register_congestion_control(&mptcp_wvegas);
++ return 0;
++}
++
++static void __exit mptcp_wvegas_unregister(void)
++{
++ tcp_unregister_congestion_control(&mptcp_wvegas);
++}
++
++module_init(mptcp_wvegas_register);
++module_exit(mptcp_wvegas_unregister);
++
++MODULE_AUTHOR("Yu Cao, Enhuan Dong");
++MODULE_LICENSE("GPL");
++MODULE_DESCRIPTION("MPTCP wVegas");
++MODULE_VERSION("0.1");
^ permalink raw reply related [flat|nested] 26+ messages in thread
* [gentoo-commits] proj/linux-patches:3.16 commit in: /
@ 2014-10-06 11:39 Mike Pagano
0 siblings, 0 replies; 26+ messages in thread
From: Mike Pagano @ 2014-10-06 11:39 UTC (permalink / raw
To: gentoo-commits
commit: f2f011b9a8a9057b75a30940d240fd4aaeb7d9e3
Author: Mike Pagano <mpagano <AT> gentoo <DOT> org>
AuthorDate: Mon Oct 6 11:39:51 2014 +0000
Commit: Mike Pagano <mpagano <AT> gentoo <DOT> org>
CommitDate: Mon Oct 6 11:39:51 2014 +0000
URL: http://sources.gentoo.org/gitweb/?p=proj/linux-patches.git;a=commit;h=f2f011b9
Remove dup.
---
2500_multipath-tcp-v3.16-872d7f6c6f4e.patch | 19230 --------------------------
1 file changed, 19230 deletions(-)
diff --git a/2500_multipath-tcp-v3.16-872d7f6c6f4e.patch b/2500_multipath-tcp-v3.16-872d7f6c6f4e.patch
deleted file mode 100644
index 3000da3..0000000
--- a/2500_multipath-tcp-v3.16-872d7f6c6f4e.patch
+++ /dev/null
@@ -1,19230 +0,0 @@
-diff --git a/drivers/infiniband/hw/cxgb4/cm.c b/drivers/infiniband/hw/cxgb4/cm.c
-index 768a0fb67dd6..5a46d91a8df9 100644
---- a/drivers/infiniband/hw/cxgb4/cm.c
-+++ b/drivers/infiniband/hw/cxgb4/cm.c
-@@ -3432,7 +3432,7 @@ static void build_cpl_pass_accept_req(struct sk_buff *skb, int stid , u8 tos)
- */
- memset(&tmp_opt, 0, sizeof(tmp_opt));
- tcp_clear_options(&tmp_opt);
-- tcp_parse_options(skb, &tmp_opt, 0, NULL);
-+ tcp_parse_options(skb, &tmp_opt, NULL, 0, NULL);
-
- req = (struct cpl_pass_accept_req *)__skb_push(skb, sizeof(*req));
- memset(req, 0, sizeof(*req));
-diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h
-index 2faef339d8f2..d86c853ffaad 100644
---- a/include/linux/ipv6.h
-+++ b/include/linux/ipv6.h
-@@ -256,16 +256,6 @@ static inline struct ipv6_pinfo * inet6_sk(const struct sock *__sk)
- return inet_sk(__sk)->pinet6;
- }
-
--static inline struct request_sock *inet6_reqsk_alloc(struct request_sock_ops *ops)
--{
-- struct request_sock *req = reqsk_alloc(ops);
--
-- if (req)
-- inet_rsk(req)->pktopts = NULL;
--
-- return req;
--}
--
- static inline struct raw6_sock *raw6_sk(const struct sock *sk)
- {
- return (struct raw6_sock *)sk;
-@@ -309,12 +299,6 @@ static inline struct ipv6_pinfo * inet6_sk(const struct sock *__sk)
- return NULL;
- }
-
--static inline struct inet6_request_sock *
-- inet6_rsk(const struct request_sock *rsk)
--{
-- return NULL;
--}
--
- static inline struct raw6_sock *raw6_sk(const struct sock *sk)
- {
- return NULL;
-diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
-index ec89301ada41..99ea4b0e3693 100644
---- a/include/linux/skbuff.h
-+++ b/include/linux/skbuff.h
-@@ -2784,8 +2784,10 @@ static inline bool __skb_checksum_validate_needed(struct sk_buff *skb,
- bool zero_okay,
- __sum16 check)
- {
-- if (skb_csum_unnecessary(skb) || (zero_okay && !check)) {
-- skb->csum_valid = 1;
-+ if (skb_csum_unnecessary(skb)) {
-+ return false;
-+ } else if (zero_okay && !check) {
-+ skb->ip_summed = CHECKSUM_UNNECESSARY;
- return false;
- }
-
-diff --git a/include/linux/tcp.h b/include/linux/tcp.h
-index a0513210798f..7bc2e078d6ca 100644
---- a/include/linux/tcp.h
-+++ b/include/linux/tcp.h
-@@ -53,7 +53,7 @@ static inline unsigned int tcp_optlen(const struct sk_buff *skb)
- /* TCP Fast Open */
- #define TCP_FASTOPEN_COOKIE_MIN 4 /* Min Fast Open Cookie size in bytes */
- #define TCP_FASTOPEN_COOKIE_MAX 16 /* Max Fast Open Cookie size in bytes */
--#define TCP_FASTOPEN_COOKIE_SIZE 8 /* the size employed by this impl. */
-+#define TCP_FASTOPEN_COOKIE_SIZE 4 /* the size employed by this impl. */
-
- /* TCP Fast Open Cookie as stored in memory */
- struct tcp_fastopen_cookie {
-@@ -72,6 +72,51 @@ struct tcp_sack_block {
- u32 end_seq;
- };
-
-+struct tcp_out_options {
-+ u16 options; /* bit field of OPTION_* */
-+ u8 ws; /* window scale, 0 to disable */
-+ u8 num_sack_blocks;/* number of SACK blocks to include */
-+ u8 hash_size; /* bytes in hash_location */
-+ u16 mss; /* 0 to disable */
-+ __u8 *hash_location; /* temporary pointer, overloaded */
-+ __u32 tsval, tsecr; /* need to include OPTION_TS */
-+ struct tcp_fastopen_cookie *fastopen_cookie; /* Fast open cookie */
-+#ifdef CONFIG_MPTCP
-+ u16 mptcp_options; /* bit field of MPTCP related OPTION_* */
-+ u8 dss_csum:1,
-+ add_addr_v4:1,
-+ add_addr_v6:1; /* dss-checksum required? */
-+
-+ union {
-+ struct {
-+ __u64 sender_key; /* sender's key for mptcp */
-+ __u64 receiver_key; /* receiver's key for mptcp */
-+ } mp_capable;
-+
-+ struct {
-+ __u64 sender_truncated_mac;
-+ __u32 sender_nonce;
-+ /* random number of the sender */
-+ __u32 token; /* token for mptcp */
-+ u8 low_prio:1;
-+ } mp_join_syns;
-+ };
-+
-+ struct {
-+ struct in_addr addr;
-+ u8 addr_id;
-+ } add_addr4;
-+
-+ struct {
-+ struct in6_addr addr;
-+ u8 addr_id;
-+ } add_addr6;
-+
-+ u16 remove_addrs; /* list of address id */
-+ u8 addr_id; /* address id (mp_join or add_address) */
-+#endif /* CONFIG_MPTCP */
-+};
-+
- /*These are used to set the sack_ok field in struct tcp_options_received */
- #define TCP_SACK_SEEN (1 << 0) /*1 = peer is SACK capable, */
- #define TCP_FACK_ENABLED (1 << 1) /*1 = FACK is enabled locally*/
-@@ -95,6 +140,9 @@ struct tcp_options_received {
- u16 mss_clamp; /* Maximal mss, negotiated at connection setup */
- };
-
-+struct mptcp_cb;
-+struct mptcp_tcp_sock;
-+
- static inline void tcp_clear_options(struct tcp_options_received *rx_opt)
- {
- rx_opt->tstamp_ok = rx_opt->sack_ok = 0;
-@@ -111,10 +159,7 @@ struct tcp_request_sock_ops;
-
- struct tcp_request_sock {
- struct inet_request_sock req;
--#ifdef CONFIG_TCP_MD5SIG
-- /* Only used by TCP MD5 Signature so far. */
- const struct tcp_request_sock_ops *af_specific;
--#endif
- struct sock *listener; /* needed for TFO */
- u32 rcv_isn;
- u32 snt_isn;
-@@ -130,6 +175,8 @@ static inline struct tcp_request_sock *tcp_rsk(const struct request_sock *req)
- return (struct tcp_request_sock *)req;
- }
-
-+struct tcp_md5sig_key;
-+
- struct tcp_sock {
- /* inet_connection_sock has to be the first member of tcp_sock */
- struct inet_connection_sock inet_conn;
-@@ -326,6 +373,37 @@ struct tcp_sock {
- * socket. Used to retransmit SYNACKs etc.
- */
- struct request_sock *fastopen_rsk;
-+
-+ /* MPTCP/TCP-specific callbacks */
-+ const struct tcp_sock_ops *ops;
-+
-+ struct mptcp_cb *mpcb;
-+ struct sock *meta_sk;
-+ /* We keep these flags even if CONFIG_MPTCP is not checked, because
-+ * it allows checking MPTCP capability just by checking the mpc flag,
-+ * rather than adding ifdefs everywhere.
-+ */
-+ u16 mpc:1, /* Other end is multipath capable */
-+ inside_tk_table:1, /* Is the tcp_sock inside the token-table? */
-+ send_mp_fclose:1,
-+ request_mptcp:1, /* Did we send out an MP_CAPABLE?
-+ * (this speeds up mptcp_doit() in tcp_recvmsg)
-+ */
-+ mptcp_enabled:1, /* Is MPTCP enabled from the application ? */
-+ pf:1, /* Potentially Failed state: when this flag is set, we
-+ * stop using the subflow
-+ */
-+ mp_killed:1, /* Killed with a tcp_done in mptcp? */
-+ was_meta_sk:1, /* This was a meta sk (in case of reuse) */
-+ is_master_sk,
-+ close_it:1, /* Must close socket in mptcp_data_ready? */
-+ closing:1;
-+ struct mptcp_tcp_sock *mptcp;
-+#ifdef CONFIG_MPTCP
-+ struct hlist_nulls_node tk_table;
-+ u32 mptcp_loc_token;
-+ u64 mptcp_loc_key;
-+#endif /* CONFIG_MPTCP */
- };
-
- enum tsq_flags {
-@@ -337,6 +415,8 @@ enum tsq_flags {
- TCP_MTU_REDUCED_DEFERRED, /* tcp_v{4|6}_err() could not call
- * tcp_v{4|6}_mtu_reduced()
- */
-+ MPTCP_PATH_MANAGER, /* MPTCP deferred creation of new subflows */
-+ MPTCP_SUB_DEFERRED, /* A subflow got deferred - process them */
- };
-
- static inline struct tcp_sock *tcp_sk(const struct sock *sk)
-@@ -355,6 +435,7 @@ struct tcp_timewait_sock {
- #ifdef CONFIG_TCP_MD5SIG
- struct tcp_md5sig_key *tw_md5_key;
- #endif
-+ struct mptcp_tw *mptcp_tw;
- };
-
- static inline struct tcp_timewait_sock *tcp_twsk(const struct sock *sk)
-diff --git a/include/net/inet6_connection_sock.h b/include/net/inet6_connection_sock.h
-index 74af137304be..83f63033897a 100644
---- a/include/net/inet6_connection_sock.h
-+++ b/include/net/inet6_connection_sock.h
-@@ -27,6 +27,8 @@ int inet6_csk_bind_conflict(const struct sock *sk,
-
- struct dst_entry *inet6_csk_route_req(struct sock *sk, struct flowi6 *fl6,
- const struct request_sock *req);
-+u32 inet6_synq_hash(const struct in6_addr *raddr, const __be16 rport,
-+ const u32 rnd, const u32 synq_hsize);
-
- struct request_sock *inet6_csk_search_req(const struct sock *sk,
- struct request_sock ***prevp,
-diff --git a/include/net/inet_common.h b/include/net/inet_common.h
-index fe7994c48b75..780f229f46a8 100644
---- a/include/net/inet_common.h
-+++ b/include/net/inet_common.h
-@@ -1,6 +1,8 @@
- #ifndef _INET_COMMON_H
- #define _INET_COMMON_H
-
-+#include <net/sock.h>
-+
- extern const struct proto_ops inet_stream_ops;
- extern const struct proto_ops inet_dgram_ops;
-
-@@ -13,6 +15,8 @@ struct sock;
- struct sockaddr;
- struct socket;
-
-+int inet_create(struct net *net, struct socket *sock, int protocol, int kern);
-+int inet6_create(struct net *net, struct socket *sock, int protocol, int kern);
- int inet_release(struct socket *sock);
- int inet_stream_connect(struct socket *sock, struct sockaddr *uaddr,
- int addr_len, int flags);
-diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
-index 7a4313887568..f62159e39839 100644
---- a/include/net/inet_connection_sock.h
-+++ b/include/net/inet_connection_sock.h
-@@ -30,6 +30,7 @@
-
- struct inet_bind_bucket;
- struct tcp_congestion_ops;
-+struct tcp_options_received;
-
- /*
- * Pointers to address related TCP functions
-@@ -243,6 +244,9 @@ static inline void inet_csk_reset_xmit_timer(struct sock *sk, const int what,
-
- struct sock *inet_csk_accept(struct sock *sk, int flags, int *err);
-
-+u32 inet_synq_hash(const __be32 raddr, const __be16 rport, const u32 rnd,
-+ const u32 synq_hsize);
-+
- struct request_sock *inet_csk_search_req(const struct sock *sk,
- struct request_sock ***prevp,
- const __be16 rport,
-diff --git a/include/net/inet_sock.h b/include/net/inet_sock.h
-index b1edf17bec01..6a32d8d6b85e 100644
---- a/include/net/inet_sock.h
-+++ b/include/net/inet_sock.h
-@@ -86,10 +86,14 @@ struct inet_request_sock {
- wscale_ok : 1,
- ecn_ok : 1,
- acked : 1,
-- no_srccheck: 1;
-+ no_srccheck: 1,
-+ mptcp_rqsk : 1,
-+ saw_mpc : 1;
- kmemcheck_bitfield_end(flags);
-- struct ip_options_rcu *opt;
-- struct sk_buff *pktopts;
-+ union {
-+ struct ip_options_rcu *opt;
-+ struct sk_buff *pktopts;
-+ };
- u32 ir_mark;
- };
-
-diff --git a/include/net/mptcp.h b/include/net/mptcp.h
-new file mode 100644
-index 000000000000..712780fc39e4
---- /dev/null
-+++ b/include/net/mptcp.h
-@@ -0,0 +1,1439 @@
-+/*
-+ * MPTCP implementation
-+ *
-+ * Initial Design & Implementation:
-+ * Sébastien Barré <sebastien.barre@uclouvain.be>
-+ *
-+ * Current Maintainer & Author:
-+ * Christoph Paasch <christoph.paasch@uclouvain.be>
-+ *
-+ * Additional authors:
-+ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
-+ * Gregory Detal <gregory.detal@uclouvain.be>
-+ * Fabien Duchêne <fabien.duchene@uclouvain.be>
-+ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
-+ * Lavkesh Lahngir <lavkesh51@gmail.com>
-+ * Andreas Ripke <ripke@neclab.eu>
-+ * Vlad Dogaru <vlad.dogaru@intel.com>
-+ * Octavian Purdila <octavian.purdila@intel.com>
-+ * John Ronan <jronan@tssg.org>
-+ * Catalin Nicutar <catalin.nicutar@gmail.com>
-+ * Brandon Heller <brandonh@stanford.edu>
-+ *
-+ *
-+ * This program is free software; you can redistribute it and/or
-+ * modify it under the terms of the GNU General Public License
-+ * as published by the Free Software Foundation; either version
-+ * 2 of the License, or (at your option) any later version.
-+ */
-+
-+#ifndef _MPTCP_H
-+#define _MPTCP_H
-+
-+#include <linux/inetdevice.h>
-+#include <linux/ipv6.h>
-+#include <linux/list.h>
-+#include <linux/net.h>
-+#include <linux/netpoll.h>
-+#include <linux/skbuff.h>
-+#include <linux/socket.h>
-+#include <linux/tcp.h>
-+#include <linux/kernel.h>
-+
-+#include <asm/byteorder.h>
-+#include <asm/unaligned.h>
-+#include <crypto/hash.h>
-+#include <net/tcp.h>
-+
-+#if defined(__LITTLE_ENDIAN_BITFIELD)
-+ #define ntohll(x) be64_to_cpu(x)
-+ #define htonll(x) cpu_to_be64(x)
-+#elif defined(__BIG_ENDIAN_BITFIELD)
-+ #define ntohll(x) (x)
-+ #define htonll(x) (x)
-+#endif
-+
-+struct mptcp_loc4 {
-+ u8 loc4_id;
-+ u8 low_prio:1;
-+ struct in_addr addr;
-+};
-+
-+struct mptcp_rem4 {
-+ u8 rem4_id;
-+ __be16 port;
-+ struct in_addr addr;
-+};
-+
-+struct mptcp_loc6 {
-+ u8 loc6_id;
-+ u8 low_prio:1;
-+ struct in6_addr addr;
-+};
-+
-+struct mptcp_rem6 {
-+ u8 rem6_id;
-+ __be16 port;
-+ struct in6_addr addr;
-+};
-+
-+struct mptcp_request_sock {
-+ struct tcp_request_sock req;
-+ /* hlist-nulls entry to the hash-table. Depending on whether this is a
-+ * a new MPTCP connection or an additional subflow, the request-socket
-+ * is either in the mptcp_reqsk_tk_htb or mptcp_reqsk_htb.
-+ */
-+ struct hlist_nulls_node hash_entry;
-+
-+ union {
-+ struct {
-+ /* Only on initial subflows */
-+ u64 mptcp_loc_key;
-+ u64 mptcp_rem_key;
-+ u32 mptcp_loc_token;
-+ };
-+
-+ struct {
-+ /* Only on additional subflows */
-+ struct mptcp_cb *mptcp_mpcb;
-+ u32 mptcp_rem_nonce;
-+ u32 mptcp_loc_nonce;
-+ u64 mptcp_hash_tmac;
-+ };
-+ };
-+
-+ u8 loc_id;
-+ u8 rem_id; /* Address-id in the MP_JOIN */
-+ u8 dss_csum:1,
-+ is_sub:1, /* Is this a new subflow? */
-+ low_prio:1, /* Interface set to low-prio? */
-+ rcv_low_prio:1;
-+};
-+
-+struct mptcp_options_received {
-+ u16 saw_mpc:1,
-+ dss_csum:1,
-+ drop_me:1,
-+
-+ is_mp_join:1,
-+ join_ack:1,
-+
-+ saw_low_prio:2, /* 0x1 - low-prio set for this subflow
-+ * 0x2 - low-prio set for another subflow
-+ */
-+ low_prio:1,
-+
-+ saw_add_addr:2, /* Saw at least one add_addr option:
-+ * 0x1: IPv4 - 0x2: IPv6
-+ */
-+ more_add_addr:1, /* Saw one more add-addr. */
-+
-+ saw_rem_addr:1, /* Saw at least one rem_addr option */
-+ more_rem_addr:1, /* Saw one more rem-addr. */
-+
-+ mp_fail:1,
-+ mp_fclose:1;
-+ u8 rem_id; /* Address-id in the MP_JOIN */
-+ u8 prio_addr_id; /* Address-id in the MP_PRIO */
-+
-+ const unsigned char *add_addr_ptr; /* Pointer to add-address option */
-+ const unsigned char *rem_addr_ptr; /* Pointer to rem-address option */
-+
-+ u32 data_ack;
-+ u32 data_seq;
-+ u16 data_len;
-+
-+ u32 mptcp_rem_token;/* Remote token */
-+
-+ /* Key inside the option (from mp_capable or fast_close) */
-+ u64 mptcp_key;
-+
-+ u32 mptcp_recv_nonce;
-+ u64 mptcp_recv_tmac;
-+ u8 mptcp_recv_mac[20];
-+};
-+
-+struct mptcp_tcp_sock {
-+ struct tcp_sock *next; /* Next subflow socket */
-+ struct hlist_node cb_list;
-+ struct mptcp_options_received rx_opt;
-+
-+ /* Those three fields record the current mapping */
-+ u64 map_data_seq;
-+ u32 map_subseq;
-+ u16 map_data_len;
-+ u16 slave_sk:1,
-+ fully_established:1,
-+ establish_increased:1,
-+ second_packet:1,
-+ attached:1,
-+ send_mp_fail:1,
-+ include_mpc:1,
-+ mapping_present:1,
-+ map_data_fin:1,
-+ low_prio:1, /* use this socket as backup */
-+ rcv_low_prio:1, /* Peer sent low-prio option to us */
-+ send_mp_prio:1, /* Trigger to send mp_prio on this socket */
-+ pre_established:1; /* State between sending 3rd ACK and
-+ * receiving the fourth ack of new subflows.
-+ */
-+
-+ /* isn: needed to translate abs to relative subflow seqnums */
-+ u32 snt_isn;
-+ u32 rcv_isn;
-+ u8 path_index;
-+ u8 loc_id;
-+ u8 rem_id;
-+
-+#define MPTCP_SCHED_SIZE 4
-+ u8 mptcp_sched[MPTCP_SCHED_SIZE] __aligned(8);
-+
-+ struct sk_buff *shortcut_ofoqueue; /* Shortcut to the current modified
-+ * skb in the ofo-queue.
-+ */
-+
-+ int init_rcv_wnd;
-+ u32 infinite_cutoff_seq;
-+ struct delayed_work work;
-+ u32 mptcp_loc_nonce;
-+ struct tcp_sock *tp; /* Where is my daddy? */
-+ u32 last_end_data_seq;
-+
-+ /* MP_JOIN subflow: timer for retransmitting the 3rd ack */
-+ struct timer_list mptcp_ack_timer;
-+
-+ /* HMAC of the third ack */
-+ char sender_mac[20];
-+};
-+
-+struct mptcp_tw {
-+ struct list_head list;
-+ u64 loc_key;
-+ u64 rcv_nxt;
-+ struct mptcp_cb __rcu *mpcb;
-+ u8 meta_tw:1,
-+ in_list:1;
-+};
-+
-+#define MPTCP_PM_NAME_MAX 16
-+struct mptcp_pm_ops {
-+ struct list_head list;
-+
-+ /* Signal the creation of a new MPTCP-session. */
-+ void (*new_session)(const struct sock *meta_sk);
-+ void (*release_sock)(struct sock *meta_sk);
-+ void (*fully_established)(struct sock *meta_sk);
-+ void (*new_remote_address)(struct sock *meta_sk);
-+ int (*get_local_id)(sa_family_t family, union inet_addr *addr,
-+ struct net *net, bool *low_prio);
-+ void (*addr_signal)(struct sock *sk, unsigned *size,
-+ struct tcp_out_options *opts, struct sk_buff *skb);
-+ void (*add_raddr)(struct mptcp_cb *mpcb, const union inet_addr *addr,
-+ sa_family_t family, __be16 port, u8 id);
-+ void (*rem_raddr)(struct mptcp_cb *mpcb, u8 rem_id);
-+ void (*init_subsocket_v4)(struct sock *sk, struct in_addr addr);
-+ void (*init_subsocket_v6)(struct sock *sk, struct in6_addr addr);
-+
-+ char name[MPTCP_PM_NAME_MAX];
-+ struct module *owner;
-+};
-+
-+#define MPTCP_SCHED_NAME_MAX 16
-+struct mptcp_sched_ops {
-+ struct list_head list;
-+
-+ struct sock * (*get_subflow)(struct sock *meta_sk,
-+ struct sk_buff *skb,
-+ bool zero_wnd_test);
-+ struct sk_buff * (*next_segment)(struct sock *meta_sk,
-+ int *reinject,
-+ struct sock **subsk,
-+ unsigned int *limit);
-+ void (*init)(struct sock *sk);
-+
-+ char name[MPTCP_SCHED_NAME_MAX];
-+ struct module *owner;
-+};
-+
-+struct mptcp_cb {
-+ /* list of sockets in this multipath connection */
-+ struct tcp_sock *connection_list;
-+ /* list of sockets that need a call to release_cb */
-+ struct hlist_head callback_list;
-+
-+ /* High-order bits of 64-bit sequence numbers */
-+ u32 snd_high_order[2];
-+ u32 rcv_high_order[2];
-+
-+ u16 send_infinite_mapping:1,
-+ in_time_wait:1,
-+ list_rcvd:1, /* XXX TO REMOVE */
-+ addr_signal:1, /* Path-manager wants us to call addr_signal */
-+ dss_csum:1,
-+ server_side:1,
-+ infinite_mapping_rcv:1,
-+ infinite_mapping_snd:1,
-+ dfin_combined:1, /* Was the DFIN combined with subflow-fin? */
-+ passive_close:1,
-+ snd_hiseq_index:1, /* Index in snd_high_order of snd_nxt */
-+ rcv_hiseq_index:1; /* Index in rcv_high_order of rcv_nxt */
-+
-+ /* socket count in this connection */
-+ u8 cnt_subflows;
-+ u8 cnt_established;
-+
-+ struct mptcp_sched_ops *sched_ops;
-+
-+ struct sk_buff_head reinject_queue;
-+ /* First cache-line boundary is here minus 8 bytes. But from the
-+ * reinject-queue only the next and prev pointers are regularly
-+ * accessed. Thus, the whole data-path is on a single cache-line.
-+ */
-+
-+ u64 csum_cutoff_seq;
-+
-+ /***** Start of fields, used for connection closure */
-+ spinlock_t tw_lock;
-+ unsigned char mptw_state;
-+ u8 dfin_path_index;
-+
-+ struct list_head tw_list;
-+
-+ /***** Start of fields, used for subflow establishment and closure */
-+ atomic_t mpcb_refcnt;
-+
-+ /* Mutex needed, because otherwise mptcp_close will complain that the
-+ * socket is owned by the user.
-+ * E.g., mptcp_sub_close_wq is taking the meta-lock.
-+ */
-+ struct mutex mpcb_mutex;
-+
-+ /***** Start of fields, used for subflow establishment */
-+ struct sock *meta_sk;
-+
-+ /* Master socket, also part of the connection_list, this
-+ * socket is the one that the application sees.
-+ */
-+ struct sock *master_sk;
-+
-+ __u64 mptcp_loc_key;
-+ __u64 mptcp_rem_key;
-+ __u32 mptcp_loc_token;
-+ __u32 mptcp_rem_token;
-+
-+#define MPTCP_PM_SIZE 608
-+ u8 mptcp_pm[MPTCP_PM_SIZE] __aligned(8);
-+ struct mptcp_pm_ops *pm_ops;
-+
-+ u32 path_index_bits;
-+ /* Next pi to pick up in case a new path becomes available */
-+ u8 next_path_index;
-+
-+ /* Original snd/rcvbuf of the initial subflow.
-+ * Used for the new subflows on the server-side to allow correct
-+ * autotuning
-+ */
-+ int orig_sk_rcvbuf;
-+ int orig_sk_sndbuf;
-+ u32 orig_window_clamp;
-+
-+ /* Timer for retransmitting SYN/ACK+MP_JOIN */
-+ struct timer_list synack_timer;
-+};
-+
-+#define MPTCP_SUB_CAPABLE 0
-+#define MPTCP_SUB_LEN_CAPABLE_SYN 12
-+#define MPTCP_SUB_LEN_CAPABLE_SYN_ALIGN 12
-+#define MPTCP_SUB_LEN_CAPABLE_ACK 20
-+#define MPTCP_SUB_LEN_CAPABLE_ACK_ALIGN 20
-+
-+#define MPTCP_SUB_JOIN 1
-+#define MPTCP_SUB_LEN_JOIN_SYN 12
-+#define MPTCP_SUB_LEN_JOIN_SYN_ALIGN 12
-+#define MPTCP_SUB_LEN_JOIN_SYNACK 16
-+#define MPTCP_SUB_LEN_JOIN_SYNACK_ALIGN 16
-+#define MPTCP_SUB_LEN_JOIN_ACK 24
-+#define MPTCP_SUB_LEN_JOIN_ACK_ALIGN 24
-+
-+#define MPTCP_SUB_DSS 2
-+#define MPTCP_SUB_LEN_DSS 4
-+#define MPTCP_SUB_LEN_DSS_ALIGN 4
-+
-+/* Lengths for seq and ack are the ones without the generic MPTCP-option header,
-+ * as they are part of the DSS-option.
-+ * To get the total length, just add the different options together.
-+ */
-+#define MPTCP_SUB_LEN_SEQ 10
-+#define MPTCP_SUB_LEN_SEQ_CSUM 12
-+#define MPTCP_SUB_LEN_SEQ_ALIGN 12
-+
-+#define MPTCP_SUB_LEN_SEQ_64 14
-+#define MPTCP_SUB_LEN_SEQ_CSUM_64 16
-+#define MPTCP_SUB_LEN_SEQ_64_ALIGN 16
-+
-+#define MPTCP_SUB_LEN_ACK 4
-+#define MPTCP_SUB_LEN_ACK_ALIGN 4
-+
-+#define MPTCP_SUB_LEN_ACK_64 8
-+#define MPTCP_SUB_LEN_ACK_64_ALIGN 8
-+
-+/* This is the "default" option-length we will send out most often.
-+ * MPTCP DSS-header
-+ * 32-bit data sequence number
-+ * 32-bit data ack
-+ *
-+ * It is necessary to calculate the effective MSS we will be using when
-+ * sending data.
-+ */
-+#define MPTCP_SUB_LEN_DSM_ALIGN (MPTCP_SUB_LEN_DSS_ALIGN + \
-+ MPTCP_SUB_LEN_SEQ_ALIGN + \
-+ MPTCP_SUB_LEN_ACK_ALIGN)
-+
-+#define MPTCP_SUB_ADD_ADDR 3
-+#define MPTCP_SUB_LEN_ADD_ADDR4 8
-+#define MPTCP_SUB_LEN_ADD_ADDR6 20
-+#define MPTCP_SUB_LEN_ADD_ADDR4_ALIGN 8
-+#define MPTCP_SUB_LEN_ADD_ADDR6_ALIGN 20
-+
-+#define MPTCP_SUB_REMOVE_ADDR 4
-+#define MPTCP_SUB_LEN_REMOVE_ADDR 4
-+
-+#define MPTCP_SUB_PRIO 5
-+#define MPTCP_SUB_LEN_PRIO 3
-+#define MPTCP_SUB_LEN_PRIO_ADDR 4
-+#define MPTCP_SUB_LEN_PRIO_ALIGN 4
-+
-+#define MPTCP_SUB_FAIL 6
-+#define MPTCP_SUB_LEN_FAIL 12
-+#define MPTCP_SUB_LEN_FAIL_ALIGN 12
-+
-+#define MPTCP_SUB_FCLOSE 7
-+#define MPTCP_SUB_LEN_FCLOSE 12
-+#define MPTCP_SUB_LEN_FCLOSE_ALIGN 12
-+
-+
-+#define OPTION_MPTCP (1 << 5)
-+
-+#ifdef CONFIG_MPTCP
-+
-+/* Used for checking if the mptcp initialization has been successful */
-+extern bool mptcp_init_failed;
-+
-+/* MPTCP options */
-+#define OPTION_TYPE_SYN (1 << 0)
-+#define OPTION_TYPE_SYNACK (1 << 1)
-+#define OPTION_TYPE_ACK (1 << 2)
-+#define OPTION_MP_CAPABLE (1 << 3)
-+#define OPTION_DATA_ACK (1 << 4)
-+#define OPTION_ADD_ADDR (1 << 5)
-+#define OPTION_MP_JOIN (1 << 6)
-+#define OPTION_MP_FAIL (1 << 7)
-+#define OPTION_MP_FCLOSE (1 << 8)
-+#define OPTION_REMOVE_ADDR (1 << 9)
-+#define OPTION_MP_PRIO (1 << 10)
-+
-+/* MPTCP flags: both TX and RX */
-+#define MPTCPHDR_SEQ 0x01 /* DSS.M option is present */
-+#define MPTCPHDR_FIN 0x02 /* DSS.F option is present */
-+#define MPTCPHDR_SEQ64_INDEX 0x04 /* index of seq in mpcb->snd_high_order */
-+/* MPTCP flags: RX only */
-+#define MPTCPHDR_ACK 0x08
-+#define MPTCPHDR_SEQ64_SET 0x10 /* Did we receive a 64-bit seq number? */
-+#define MPTCPHDR_SEQ64_OFO 0x20 /* Is it not in our circular array? */
-+#define MPTCPHDR_DSS_CSUM 0x40
-+#define MPTCPHDR_JOIN 0x80
-+/* MPTCP flags: TX only */
-+#define MPTCPHDR_INF 0x08
-+
-+struct mptcp_option {
-+ __u8 kind;
-+ __u8 len;
-+#if defined(__LITTLE_ENDIAN_BITFIELD)
-+ __u8 ver:4,
-+ sub:4;
-+#elif defined(__BIG_ENDIAN_BITFIELD)
-+ __u8 sub:4,
-+ ver:4;
-+#else
-+#error "Adjust your <asm/byteorder.h> defines"
-+#endif
-+};
-+
-+struct mp_capable {
-+ __u8 kind;
-+ __u8 len;
-+#if defined(__LITTLE_ENDIAN_BITFIELD)
-+ __u8 ver:4,
-+ sub:4;
-+ __u8 h:1,
-+ rsv:5,
-+ b:1,
-+ a:1;
-+#elif defined(__BIG_ENDIAN_BITFIELD)
-+ __u8 sub:4,
-+ ver:4;
-+ __u8 a:1,
-+ b:1,
-+ rsv:5,
-+ h:1;
-+#else
-+#error "Adjust your <asm/byteorder.h> defines"
-+#endif
-+ __u64 sender_key;
-+ __u64 receiver_key;
-+} __attribute__((__packed__));
-+
-+struct mp_join {
-+ __u8 kind;
-+ __u8 len;
-+#if defined(__LITTLE_ENDIAN_BITFIELD)
-+ __u8 b:1,
-+ rsv:3,
-+ sub:4;
-+#elif defined(__BIG_ENDIAN_BITFIELD)
-+ __u8 sub:4,
-+ rsv:3,
-+ b:1;
-+#else
-+#error "Adjust your <asm/byteorder.h> defines"
-+#endif
-+ __u8 addr_id;
-+ union {
-+ struct {
-+ u32 token;
-+ u32 nonce;
-+ } syn;
-+ struct {
-+ __u64 mac;
-+ u32 nonce;
-+ } synack;
-+ struct {
-+ __u8 mac[20];
-+ } ack;
-+ } u;
-+} __attribute__((__packed__));
-+
-+struct mp_dss {
-+ __u8 kind;
-+ __u8 len;
-+#if defined(__LITTLE_ENDIAN_BITFIELD)
-+ __u16 rsv1:4,
-+ sub:4,
-+ A:1,
-+ a:1,
-+ M:1,
-+ m:1,
-+ F:1,
-+ rsv2:3;
-+#elif defined(__BIG_ENDIAN_BITFIELD)
-+ __u16 sub:4,
-+ rsv1:4,
-+ rsv2:3,
-+ F:1,
-+ m:1,
-+ M:1,
-+ a:1,
-+ A:1;
-+#else
-+#error "Adjust your <asm/byteorder.h> defines"
-+#endif
-+};
-+
-+struct mp_add_addr {
-+ __u8 kind;
-+ __u8 len;
-+#if defined(__LITTLE_ENDIAN_BITFIELD)
-+ __u8 ipver:4,
-+ sub:4;
-+#elif defined(__BIG_ENDIAN_BITFIELD)
-+ __u8 sub:4,
-+ ipver:4;
-+#else
-+#error "Adjust your <asm/byteorder.h> defines"
-+#endif
-+ __u8 addr_id;
-+ union {
-+ struct {
-+ struct in_addr addr;
-+ __be16 port;
-+ } v4;
-+ struct {
-+ struct in6_addr addr;
-+ __be16 port;
-+ } v6;
-+ } u;
-+} __attribute__((__packed__));
-+
-+struct mp_remove_addr {
-+ __u8 kind;
-+ __u8 len;
-+#if defined(__LITTLE_ENDIAN_BITFIELD)
-+ __u8 rsv:4,
-+ sub:4;
-+#elif defined(__BIG_ENDIAN_BITFIELD)
-+ __u8 sub:4,
-+ rsv:4;
-+#else
-+#error "Adjust your <asm/byteorder.h> defines"
-+#endif
-+ /* list of addr_id */
-+ __u8 addrs_id;
-+};
-+
-+struct mp_fail {
-+ __u8 kind;
-+ __u8 len;
-+#if defined(__LITTLE_ENDIAN_BITFIELD)
-+ __u16 rsv1:4,
-+ sub:4,
-+ rsv2:8;
-+#elif defined(__BIG_ENDIAN_BITFIELD)
-+ __u16 sub:4,
-+ rsv1:4,
-+ rsv2:8;
-+#else
-+#error "Adjust your <asm/byteorder.h> defines"
-+#endif
-+ __be64 data_seq;
-+} __attribute__((__packed__));
-+
-+struct mp_fclose {
-+ __u8 kind;
-+ __u8 len;
-+#if defined(__LITTLE_ENDIAN_BITFIELD)
-+ __u16 rsv1:4,
-+ sub:4,
-+ rsv2:8;
-+#elif defined(__BIG_ENDIAN_BITFIELD)
-+ __u16 sub:4,
-+ rsv1:4,
-+ rsv2:8;
-+#else
-+#error "Adjust your <asm/byteorder.h> defines"
-+#endif
-+ __u64 key;
-+} __attribute__((__packed__));
-+
-+struct mp_prio {
-+ __u8 kind;
-+ __u8 len;
-+#if defined(__LITTLE_ENDIAN_BITFIELD)
-+ __u8 b:1,
-+ rsv:3,
-+ sub:4;
-+#elif defined(__BIG_ENDIAN_BITFIELD)
-+ __u8 sub:4,
-+ rsv:3,
-+ b:1;
-+#else
-+#error "Adjust your <asm/byteorder.h> defines"
-+#endif
-+ __u8 addr_id;
-+} __attribute__((__packed__));
-+
-+static inline int mptcp_sub_len_dss(const struct mp_dss *m, const int csum)
-+{
-+ return 4 + m->A * (4 + m->a * 4) + m->M * (10 + m->m * 4 + csum * 2);
-+}
-+
-+#define MPTCP_APP 2
-+
-+extern int sysctl_mptcp_enabled;
-+extern int sysctl_mptcp_checksum;
-+extern int sysctl_mptcp_debug;
-+extern int sysctl_mptcp_syn_retries;
-+
-+extern struct workqueue_struct *mptcp_wq;
-+
-+#define mptcp_debug(fmt, args...) \
-+ do { \
-+ if (unlikely(sysctl_mptcp_debug)) \
-+ pr_err(__FILE__ ": " fmt, ##args); \
-+ } while (0)
-+
-+/* Iterates over all subflows */
-+#define mptcp_for_each_tp(mpcb, tp) \
-+ for ((tp) = (mpcb)->connection_list; (tp); (tp) = (tp)->mptcp->next)
-+
-+#define mptcp_for_each_sk(mpcb, sk) \
-+ for ((sk) = (struct sock *)(mpcb)->connection_list; \
-+ sk; \
-+ sk = (struct sock *)tcp_sk(sk)->mptcp->next)
-+
-+#define mptcp_for_each_sk_safe(__mpcb, __sk, __temp) \
-+ for (__sk = (struct sock *)(__mpcb)->connection_list, \
-+ __temp = __sk ? (struct sock *)tcp_sk(__sk)->mptcp->next : NULL; \
-+ __sk; \
-+ __sk = __temp, \
-+ __temp = __sk ? (struct sock *)tcp_sk(__sk)->mptcp->next : NULL)
-+
-+/* Iterates over all bits set to 1 in a bitset */
-+#define mptcp_for_each_bit_set(b, i) \
-+ for (i = ffs(b) - 1; i >= 0; i = ffs(b >> (i + 1) << (i + 1)) - 1)
-+
-+#define mptcp_for_each_bit_unset(b, i) \
-+ mptcp_for_each_bit_set(~b, i)
-+
-+extern struct lock_class_key meta_key;
-+extern struct lock_class_key meta_slock_key;
-+extern u32 mptcp_secret[MD5_MESSAGE_BYTES / 4];
-+
-+/* This is needed to ensure that two subsequent key/nonce-generations result in
-+ * different keys/nonces if the IPs and ports are the same.
-+ */
-+extern u32 mptcp_seed;
-+
-+#define MPTCP_HASH_SIZE 1024
-+
-+extern struct hlist_nulls_head tk_hashtable[MPTCP_HASH_SIZE];
-+
-+/* This second hashtable is needed to retrieve request socks
-+ * created as a result of a join request. While the SYN contains
-+ * the token, the final ack does not, so we need a separate hashtable
-+ * to retrieve the mpcb.
-+ */
-+extern struct hlist_nulls_head mptcp_reqsk_htb[MPTCP_HASH_SIZE];
-+extern spinlock_t mptcp_reqsk_hlock; /* hashtable protection */
-+
-+/* Lock, protecting the two hash-tables that hold the token. Namely,
-+ * mptcp_reqsk_tk_htb and tk_hashtable
-+ */
-+extern spinlock_t mptcp_tk_hashlock; /* hashtable protection */
-+
-+/* Request-sockets can be hashed in the tk_htb for collision-detection or in
-+ * the regular htb for join-connections. We need to define different NULLS
-+ * values so that we can correctly detect a request-socket that has been
-+ * recycled. See also c25eb3bfb9729.
-+ */
-+#define MPTCP_REQSK_NULLS_BASE (1U << 29)
-+
-+
-+void mptcp_data_ready(struct sock *sk);
-+void mptcp_write_space(struct sock *sk);
-+
-+void mptcp_add_meta_ofo_queue(const struct sock *meta_sk, struct sk_buff *skb,
-+ struct sock *sk);
-+void mptcp_ofo_queue(struct sock *meta_sk);
-+void mptcp_purge_ofo_queue(struct tcp_sock *meta_tp);
-+void mptcp_cleanup_rbuf(struct sock *meta_sk, int copied);
-+int mptcp_add_sock(struct sock *meta_sk, struct sock *sk, u8 loc_id, u8 rem_id,
-+ gfp_t flags);
-+void mptcp_del_sock(struct sock *sk);
-+void mptcp_update_metasocket(struct sock *sock, const struct sock *meta_sk);
-+void mptcp_reinject_data(struct sock *orig_sk, int clone_it);
-+void mptcp_update_sndbuf(const struct tcp_sock *tp);
-+void mptcp_send_fin(struct sock *meta_sk);
-+void mptcp_send_active_reset(struct sock *meta_sk, gfp_t priority);
-+bool mptcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
-+ int push_one, gfp_t gfp);
-+void tcp_parse_mptcp_options(const struct sk_buff *skb,
-+ struct mptcp_options_received *mopt);
-+void mptcp_parse_options(const uint8_t *ptr, int opsize,
-+ struct mptcp_options_received *mopt,
-+ const struct sk_buff *skb);
-+void mptcp_syn_options(const struct sock *sk, struct tcp_out_options *opts,
-+ unsigned *remaining);
-+void mptcp_synack_options(struct request_sock *req,
-+ struct tcp_out_options *opts,
-+ unsigned *remaining);
-+void mptcp_established_options(struct sock *sk, struct sk_buff *skb,
-+ struct tcp_out_options *opts, unsigned *size);
-+void mptcp_options_write(__be32 *ptr, struct tcp_sock *tp,
-+ const struct tcp_out_options *opts,
-+ struct sk_buff *skb);
-+void mptcp_close(struct sock *meta_sk, long timeout);
-+int mptcp_doit(struct sock *sk);
-+int mptcp_create_master_sk(struct sock *meta_sk, __u64 remote_key, u32 window);
-+int mptcp_check_req_fastopen(struct sock *child, struct request_sock *req);
-+int mptcp_check_req_master(struct sock *sk, struct sock *child,
-+ struct request_sock *req,
-+ struct request_sock **prev);
-+struct sock *mptcp_check_req_child(struct sock *sk, struct sock *child,
-+ struct request_sock *req,
-+ struct request_sock **prev,
-+ const struct mptcp_options_received *mopt);
-+u32 __mptcp_select_window(struct sock *sk);
-+void mptcp_select_initial_window(int __space, __u32 mss, __u32 *rcv_wnd,
-+ __u32 *window_clamp, int wscale_ok,
-+ __u8 *rcv_wscale, __u32 init_rcv_wnd,
-+ const struct sock *sk);
-+unsigned int mptcp_current_mss(struct sock *meta_sk);
-+int mptcp_select_size(const struct sock *meta_sk, bool sg);
-+void mptcp_key_sha1(u64 key, u32 *token, u64 *idsn);
-+void mptcp_hmac_sha1(u8 *key_1, u8 *key_2, u8 *rand_1, u8 *rand_2,
-+ u32 *hash_out);
-+void mptcp_clean_rtx_infinite(const struct sk_buff *skb, struct sock *sk);
-+void mptcp_fin(struct sock *meta_sk);
-+void mptcp_retransmit_timer(struct sock *meta_sk);
-+int mptcp_write_wakeup(struct sock *meta_sk);
-+void mptcp_sub_close_wq(struct work_struct *work);
-+void mptcp_sub_close(struct sock *sk, unsigned long delay);
-+struct sock *mptcp_select_ack_sock(const struct sock *meta_sk);
-+void mptcp_fallback_meta_sk(struct sock *meta_sk);
-+int mptcp_backlog_rcv(struct sock *meta_sk, struct sk_buff *skb);
-+void mptcp_ack_handler(unsigned long);
-+int mptcp_check_rtt(const struct tcp_sock *tp, int time);
-+int mptcp_check_snd_buf(const struct tcp_sock *tp);
-+int mptcp_handle_options(struct sock *sk, const struct tcphdr *th,
-+ const struct sk_buff *skb);
-+void __init mptcp_init(void);
-+int mptcp_trim_head(struct sock *sk, struct sk_buff *skb, u32 len);
-+void mptcp_destroy_sock(struct sock *sk);
-+int mptcp_rcv_synsent_state_process(struct sock *sk, struct sock **skptr,
-+ const struct sk_buff *skb,
-+ const struct mptcp_options_received *mopt);
-+unsigned int mptcp_xmit_size_goal(const struct sock *meta_sk, u32 mss_now,
-+ int large_allowed);
-+int mptcp_init_tw_sock(struct sock *sk, struct tcp_timewait_sock *tw);
-+void mptcp_twsk_destructor(struct tcp_timewait_sock *tw);
-+void mptcp_time_wait(struct sock *sk, int state, int timeo);
-+void mptcp_disconnect(struct sock *sk);
-+bool mptcp_should_expand_sndbuf(const struct sock *sk);
-+int mptcp_retransmit_skb(struct sock *meta_sk, struct sk_buff *skb);
-+void mptcp_tsq_flags(struct sock *sk);
-+void mptcp_tsq_sub_deferred(struct sock *meta_sk);
-+struct mp_join *mptcp_find_join(const struct sk_buff *skb);
-+void mptcp_hash_remove_bh(struct tcp_sock *meta_tp);
-+void mptcp_hash_remove(struct tcp_sock *meta_tp);
-+struct sock *mptcp_hash_find(const struct net *net, const u32 token);
-+int mptcp_lookup_join(struct sk_buff *skb, struct inet_timewait_sock *tw);
-+int mptcp_do_join_short(struct sk_buff *skb,
-+ const struct mptcp_options_received *mopt,
-+ struct net *net);
-+void mptcp_reqsk_destructor(struct request_sock *req);
-+void mptcp_reqsk_new_mptcp(struct request_sock *req,
-+ const struct mptcp_options_received *mopt,
-+ const struct sk_buff *skb);
-+int mptcp_check_req(struct sk_buff *skb, struct net *net);
-+void mptcp_connect_init(struct sock *sk);
-+void mptcp_sub_force_close(struct sock *sk);
-+int mptcp_sub_len_remove_addr_align(u16 bitfield);
-+void mptcp_remove_shortcuts(const struct mptcp_cb *mpcb,
-+ const struct sk_buff *skb);
-+void mptcp_init_buffer_space(struct sock *sk);
-+void mptcp_join_reqsk_init(struct mptcp_cb *mpcb, const struct request_sock *req,
-+ struct sk_buff *skb);
-+void mptcp_reqsk_init(struct request_sock *req, const struct sk_buff *skb);
-+int mptcp_conn_request(struct sock *sk, struct sk_buff *skb);
-+void mptcp_init_congestion_control(struct sock *sk);
-+
-+/* MPTCP-path-manager registration/initialization functions */
-+int mptcp_register_path_manager(struct mptcp_pm_ops *pm);
-+void mptcp_unregister_path_manager(struct mptcp_pm_ops *pm);
-+void mptcp_init_path_manager(struct mptcp_cb *mpcb);
-+void mptcp_cleanup_path_manager(struct mptcp_cb *mpcb);
-+void mptcp_fallback_default(struct mptcp_cb *mpcb);
-+void mptcp_get_default_path_manager(char *name);
-+int mptcp_set_default_path_manager(const char *name);
-+extern struct mptcp_pm_ops mptcp_pm_default;
-+
-+/* MPTCP-scheduler registration/initialization functions */
-+int mptcp_register_scheduler(struct mptcp_sched_ops *sched);
-+void mptcp_unregister_scheduler(struct mptcp_sched_ops *sched);
-+void mptcp_init_scheduler(struct mptcp_cb *mpcb);
-+void mptcp_cleanup_scheduler(struct mptcp_cb *mpcb);
-+void mptcp_get_default_scheduler(char *name);
-+int mptcp_set_default_scheduler(const char *name);
-+extern struct mptcp_sched_ops mptcp_sched_default;
-+
-+static inline void mptcp_reset_synack_timer(struct sock *meta_sk,
-+ unsigned long len)
-+{
-+ sk_reset_timer(meta_sk, &tcp_sk(meta_sk)->mpcb->synack_timer,
-+ jiffies + len);
-+}
-+
-+static inline void mptcp_delete_synack_timer(struct sock *meta_sk)
-+{
-+ sk_stop_timer(meta_sk, &tcp_sk(meta_sk)->mpcb->synack_timer);
-+}
-+
-+static inline bool is_mptcp_enabled(const struct sock *sk)
-+{
-+ if (!sysctl_mptcp_enabled || mptcp_init_failed)
-+ return false;
-+
-+ if (sysctl_mptcp_enabled == MPTCP_APP && !tcp_sk(sk)->mptcp_enabled)
-+ return false;
-+
-+ return true;
-+}
-+
-+static inline int mptcp_pi_to_flag(int pi)
-+{
-+ return 1 << (pi - 1);
-+}
-+
-+static inline
-+struct mptcp_request_sock *mptcp_rsk(const struct request_sock *req)
-+{
-+ return (struct mptcp_request_sock *)req;
-+}
-+
-+static inline
-+struct request_sock *rev_mptcp_rsk(const struct mptcp_request_sock *req)
-+{
-+ return (struct request_sock *)req;
-+}
-+
-+static inline bool mptcp_can_sendpage(struct sock *sk)
-+{
-+ struct sock *sk_it;
-+
-+ if (tcp_sk(sk)->mpcb->dss_csum)
-+ return false;
-+
-+ mptcp_for_each_sk(tcp_sk(sk)->mpcb, sk_it) {
-+ if (!(sk_it->sk_route_caps & NETIF_F_SG) ||
-+ !(sk_it->sk_route_caps & NETIF_F_ALL_CSUM))
-+ return false;
-+ }
-+
-+ return true;
-+}
-+
-+static inline void mptcp_push_pending_frames(struct sock *meta_sk)
-+{
-+ /* We check packets_out and send-head here. TCP only checks the
-+ * send-head. But MPTCP also checks packets_out, as this is an
-+ * indication that we might want to do opportunistic reinjection.
-+ */
-+ if (tcp_sk(meta_sk)->packets_out || tcp_send_head(meta_sk)) {
-+ struct tcp_sock *tp = tcp_sk(meta_sk);
-+
-+ /* We don't care about the MSS, because it will be set in
-+ * mptcp_write_xmit.
-+ */
-+ __tcp_push_pending_frames(meta_sk, 0, tp->nonagle);
-+ }
-+}
-+
-+static inline void mptcp_send_reset(struct sock *sk)
-+{
-+ tcp_sk(sk)->ops->send_active_reset(sk, GFP_ATOMIC);
-+ mptcp_sub_force_close(sk);
-+}
-+
-+static inline bool mptcp_is_data_seq(const struct sk_buff *skb)
-+{
-+ return TCP_SKB_CB(skb)->mptcp_flags & MPTCPHDR_SEQ;
-+}
-+
-+static inline bool mptcp_is_data_fin(const struct sk_buff *skb)
-+{
-+ return TCP_SKB_CB(skb)->mptcp_flags & MPTCPHDR_FIN;
-+}
-+
-+/* Is it a data-fin while in infinite mapping mode?
-+ * In infinite mode, a subflow-fin is in fact a data-fin.
-+ */
-+static inline bool mptcp_is_data_fin2(const struct sk_buff *skb,
-+ const struct tcp_sock *tp)
-+{
-+ return mptcp_is_data_fin(skb) ||
-+ (tp->mpcb->infinite_mapping_rcv && tcp_hdr(skb)->fin);
-+}
-+
-+static inline u8 mptcp_get_64_bit(u64 data_seq, struct mptcp_cb *mpcb)
-+{
-+ u64 data_seq_high = (u32)(data_seq >> 32);
-+
-+ if (mpcb->rcv_high_order[0] == data_seq_high)
-+ return 0;
-+ else if (mpcb->rcv_high_order[1] == data_seq_high)
-+ return MPTCPHDR_SEQ64_INDEX;
-+ else
-+ return MPTCPHDR_SEQ64_OFO;
-+}
-+
-+/* Sets the data_seq and returns pointer to the in-skb field of the data_seq.
-+ * If the packet has a 64-bit dseq, the pointer points to the last 32 bits.
-+ */
-+static inline __u32 *mptcp_skb_set_data_seq(const struct sk_buff *skb,
-+ u32 *data_seq,
-+ struct mptcp_cb *mpcb)
-+{
-+ __u32 *ptr = (__u32 *)(skb_transport_header(skb) + TCP_SKB_CB(skb)->dss_off);
-+
-+ if (TCP_SKB_CB(skb)->mptcp_flags & MPTCPHDR_SEQ64_SET) {
-+ u64 data_seq64 = get_unaligned_be64(ptr);
-+
-+ if (mpcb)
-+ TCP_SKB_CB(skb)->mptcp_flags |= mptcp_get_64_bit(data_seq64, mpcb);
-+
-+ *data_seq = (u32)data_seq64;
-+ ptr++;
-+ } else {
-+ *data_seq = get_unaligned_be32(ptr);
-+ }
-+
-+ return ptr;
-+}
-+
-+static inline struct sock *mptcp_meta_sk(const struct sock *sk)
-+{
-+ return tcp_sk(sk)->meta_sk;
-+}
-+
-+static inline struct tcp_sock *mptcp_meta_tp(const struct tcp_sock *tp)
-+{
-+ return tcp_sk(tp->meta_sk);
-+}
-+
-+static inline int is_meta_tp(const struct tcp_sock *tp)
-+{
-+ return tp->mpcb && mptcp_meta_tp(tp) == tp;
-+}
-+
-+static inline int is_meta_sk(const struct sock *sk)
-+{
-+ return sk->sk_type == SOCK_STREAM && sk->sk_protocol == IPPROTO_TCP &&
-+ mptcp(tcp_sk(sk)) && mptcp_meta_sk(sk) == sk;
-+}
-+
-+static inline int is_master_tp(const struct tcp_sock *tp)
-+{
-+ return !mptcp(tp) || (!tp->mptcp->slave_sk && !is_meta_tp(tp));
-+}
-+
-+static inline void mptcp_hash_request_remove(struct request_sock *req)
-+{
-+ int in_softirq = 0;
-+
-+ if (hlist_nulls_unhashed(&mptcp_rsk(req)->hash_entry))
-+ return;
-+
-+ if (in_softirq()) {
-+ spin_lock(&mptcp_reqsk_hlock);
-+ in_softirq = 1;
-+ } else {
-+ spin_lock_bh(&mptcp_reqsk_hlock);
-+ }
-+
-+ hlist_nulls_del_init_rcu(&mptcp_rsk(req)->hash_entry);
-+
-+ if (in_softirq)
-+ spin_unlock(&mptcp_reqsk_hlock);
-+ else
-+ spin_unlock_bh(&mptcp_reqsk_hlock);
-+}
-+
-+static inline void mptcp_init_mp_opt(struct mptcp_options_received *mopt)
-+{
-+ mopt->saw_mpc = 0;
-+ mopt->dss_csum = 0;
-+ mopt->drop_me = 0;
-+
-+ mopt->is_mp_join = 0;
-+ mopt->join_ack = 0;
-+
-+ mopt->saw_low_prio = 0;
-+ mopt->low_prio = 0;
-+
-+ mopt->saw_add_addr = 0;
-+ mopt->more_add_addr = 0;
-+
-+ mopt->saw_rem_addr = 0;
-+ mopt->more_rem_addr = 0;
-+
-+ mopt->mp_fail = 0;
-+ mopt->mp_fclose = 0;
-+}
-+
-+static inline void mptcp_reset_mopt(struct tcp_sock *tp)
-+{
-+ struct mptcp_options_received *mopt = &tp->mptcp->rx_opt;
-+
-+ mopt->saw_low_prio = 0;
-+ mopt->saw_add_addr = 0;
-+ mopt->more_add_addr = 0;
-+ mopt->saw_rem_addr = 0;
-+ mopt->more_rem_addr = 0;
-+ mopt->join_ack = 0;
-+ mopt->mp_fail = 0;
-+ mopt->mp_fclose = 0;
-+}
-+
-+static inline __be32 mptcp_get_highorder_sndbits(const struct sk_buff *skb,
-+ const struct mptcp_cb *mpcb)
-+{
-+ return htonl(mpcb->snd_high_order[(TCP_SKB_CB(skb)->mptcp_flags &
-+ MPTCPHDR_SEQ64_INDEX) ? 1 : 0]);
-+}
-+
-+static inline u64 mptcp_get_data_seq_64(const struct mptcp_cb *mpcb, int index,
-+ u32 data_seq_32)
-+{
-+ return ((u64)mpcb->rcv_high_order[index] << 32) | data_seq_32;
-+}
-+
-+static inline u64 mptcp_get_rcv_nxt_64(const struct tcp_sock *meta_tp)
-+{
-+ struct mptcp_cb *mpcb = meta_tp->mpcb;
-+ return mptcp_get_data_seq_64(mpcb, mpcb->rcv_hiseq_index,
-+ meta_tp->rcv_nxt);
-+}
-+
-+static inline void mptcp_check_sndseq_wrap(struct tcp_sock *meta_tp, int inc)
-+{
-+ if (unlikely(meta_tp->snd_nxt > meta_tp->snd_nxt + inc)) {
-+ struct mptcp_cb *mpcb = meta_tp->mpcb;
-+ mpcb->snd_hiseq_index = mpcb->snd_hiseq_index ? 0 : 1;
-+ mpcb->snd_high_order[mpcb->snd_hiseq_index] += 2;
-+ }
-+}
-+
-+static inline void mptcp_check_rcvseq_wrap(struct tcp_sock *meta_tp,
-+ u32 old_rcv_nxt)
-+{
-+ if (unlikely(old_rcv_nxt > meta_tp->rcv_nxt)) {
-+ struct mptcp_cb *mpcb = meta_tp->mpcb;
-+ mpcb->rcv_high_order[mpcb->rcv_hiseq_index] += 2;
-+ mpcb->rcv_hiseq_index = mpcb->rcv_hiseq_index ? 0 : 1;
-+ }
-+}
-+
-+static inline int mptcp_sk_can_send(const struct sock *sk)
-+{
-+ return tcp_passive_fastopen(sk) ||
-+ ((1 << sk->sk_state) & (TCPF_ESTABLISHED | TCPF_CLOSE_WAIT) &&
-+ !tcp_sk(sk)->mptcp->pre_established);
-+}
-+
-+static inline int mptcp_sk_can_recv(const struct sock *sk)
-+{
-+ return (1 << sk->sk_state) & (TCPF_ESTABLISHED | TCPF_FIN_WAIT1 | TCPF_FIN_WAIT2);
-+}
-+
-+static inline int mptcp_sk_can_send_ack(const struct sock *sk)
-+{
-+ return !((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV |
-+ TCPF_CLOSE | TCPF_LISTEN)) &&
-+ !tcp_sk(sk)->mptcp->pre_established;
-+}
-+
-+/* Only support GSO if all subflows support it */
-+static inline bool mptcp_sk_can_gso(const struct sock *meta_sk)
-+{
-+ struct sock *sk;
-+
-+ if (tcp_sk(meta_sk)->mpcb->dss_csum)
-+ return false;
-+
-+ mptcp_for_each_sk(tcp_sk(meta_sk)->mpcb, sk) {
-+ if (!mptcp_sk_can_send(sk))
-+ continue;
-+ if (!sk_can_gso(sk))
-+ return false;
-+ }
-+ return true;
-+}
-+
-+static inline bool mptcp_can_sg(const struct sock *meta_sk)
-+{
-+ struct sock *sk;
-+
-+ if (tcp_sk(meta_sk)->mpcb->dss_csum)
-+ return false;
-+
-+ mptcp_for_each_sk(tcp_sk(meta_sk)->mpcb, sk) {
-+ if (!mptcp_sk_can_send(sk))
-+ continue;
-+ if (!(sk->sk_route_caps & NETIF_F_SG))
-+ return false;
-+ }
-+ return true;
-+}
-+
-+static inline void mptcp_set_rto(struct sock *sk)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ struct sock *sk_it;
-+ struct inet_connection_sock *micsk = inet_csk(mptcp_meta_sk(sk));
-+ __u32 max_rto = 0;
-+
-+ /* We are in recovery-phase on the MPTCP-level. Do not update the
-+ * RTO, because this would kill exponential backoff.
-+ */
-+ if (micsk->icsk_retransmits)
-+ return;
-+
-+ mptcp_for_each_sk(tp->mpcb, sk_it) {
-+ if (mptcp_sk_can_send(sk_it) &&
-+ inet_csk(sk_it)->icsk_rto > max_rto)
-+ max_rto = inet_csk(sk_it)->icsk_rto;
-+ }
-+ if (max_rto) {
-+ micsk->icsk_rto = max_rto << 1;
-+
-+ /* A successful rto-measurement - reset backoff counter */
-+ micsk->icsk_backoff = 0;
-+ }
-+}
-+
-+static inline int mptcp_sysctl_syn_retries(void)
-+{
-+ return sysctl_mptcp_syn_retries;
-+}
-+
-+static inline void mptcp_sub_close_passive(struct sock *sk)
-+{
-+ struct sock *meta_sk = mptcp_meta_sk(sk);
-+ struct tcp_sock *tp = tcp_sk(sk), *meta_tp = tcp_sk(meta_sk);
-+
-+ /* Only close, if the app did a send-shutdown (passive close), and we
-+ * received the data-ack of the data-fin.
-+ */
-+ if (tp->mpcb->passive_close && meta_tp->snd_una == meta_tp->write_seq)
-+ mptcp_sub_close(sk, 0);
-+}
-+
-+static inline bool mptcp_fallback_infinite(struct sock *sk, int flag)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+
-+ /* If data has been acknowledged on the meta-level, fully_established
-+ * will have been set before and thus we will not fall back to infinite
-+ * mapping.
-+ */
-+ if (likely(tp->mptcp->fully_established))
-+ return false;
-+
-+ if (!(flag & MPTCP_FLAG_DATA_ACKED))
-+ return false;
-+
-+ /* Don't fall back twice ;) */
-+ if (tp->mpcb->infinite_mapping_snd)
-+ return false;
-+
-+ pr_err("%s %#x will fallback - pi %d, src %pI4 dst %pI4 from %pS\n",
-+ __func__, tp->mpcb->mptcp_loc_token, tp->mptcp->path_index,
-+ &inet_sk(sk)->inet_saddr, &inet_sk(sk)->inet_daddr,
-+ __builtin_return_address(0));
-+ if (!is_master_tp(tp))
-+ return true;
-+
-+ tp->mpcb->infinite_mapping_snd = 1;
-+ tp->mpcb->infinite_mapping_rcv = 1;
-+ tp->mptcp->fully_established = 1;
-+
-+ return false;
-+}
-+
-+/* Find the first index whose bit in the bit-field == 0 */
-+static inline u8 mptcp_set_new_pathindex(struct mptcp_cb *mpcb)
-+{
-+ u8 base = mpcb->next_path_index;
-+ int i;
-+
-+ /* Start at 1, because 0 is reserved for the meta-sk */
-+ mptcp_for_each_bit_unset(mpcb->path_index_bits >> base, i) {
-+ if (i + base < 1)
-+ continue;
-+ if (i + base >= sizeof(mpcb->path_index_bits) * 8)
-+ break;
-+ i += base;
-+ mpcb->path_index_bits |= (1 << i);
-+ mpcb->next_path_index = i + 1;
-+ return i;
-+ }
-+ mptcp_for_each_bit_unset(mpcb->path_index_bits, i) {
-+ if (i >= sizeof(mpcb->path_index_bits) * 8)
-+ break;
-+ if (i < 1)
-+ continue;
-+ mpcb->path_index_bits |= (1 << i);
-+ mpcb->next_path_index = i + 1;
-+ return i;
-+ }
-+
-+ return 0;
-+}
-+
-+static inline bool mptcp_v6_is_v4_mapped(const struct sock *sk)
-+{
-+ return sk->sk_family == AF_INET6 &&
-+ ipv6_addr_type(&inet6_sk(sk)->saddr) == IPV6_ADDR_MAPPED;
-+}
-+
-+/* TCP and MPTCP mpc flag-depending functions */
-+u16 mptcp_select_window(struct sock *sk);
-+void mptcp_init_buffer_space(struct sock *sk);
-+void mptcp_tcp_set_rto(struct sock *sk);
-+
-+/* TCP and MPTCP flag-depending functions */
-+bool mptcp_prune_ofo_queue(struct sock *sk);
-+
-+#else /* CONFIG_MPTCP */
-+#define mptcp_debug(fmt, args...) \
-+ do { \
-+ } while (0)
-+
-+/* Without MPTCP, we just do one iteration
-+ * over the only socket available. This assumes that
-+ * the sk/tp arg is the socket in that case.
-+ */
-+#define mptcp_for_each_sk(mpcb, sk)
-+#define mptcp_for_each_sk_safe(__mpcb, __sk, __temp)
-+
-+static inline bool mptcp_is_data_fin(const struct sk_buff *skb)
-+{
-+ return false;
-+}
-+static inline bool mptcp_is_data_seq(const struct sk_buff *skb)
-+{
-+ return false;
-+}
-+static inline struct sock *mptcp_meta_sk(const struct sock *sk)
-+{
-+ return NULL;
-+}
-+static inline struct tcp_sock *mptcp_meta_tp(const struct tcp_sock *tp)
-+{
-+ return NULL;
-+}
-+static inline int is_meta_sk(const struct sock *sk)
-+{
-+ return 0;
-+}
-+static inline int is_master_tp(const struct tcp_sock *tp)
-+{
-+ return 0;
-+}
-+static inline void mptcp_purge_ofo_queue(struct tcp_sock *meta_tp) {}
-+static inline void mptcp_del_sock(const struct sock *sk) {}
-+static inline void mptcp_update_metasocket(struct sock *sock, const struct sock *meta_sk) {}
-+static inline void mptcp_reinject_data(struct sock *orig_sk, int clone_it) {}
-+static inline void mptcp_update_sndbuf(const struct tcp_sock *tp) {}
-+static inline void mptcp_clean_rtx_infinite(const struct sk_buff *skb,
-+ const struct sock *sk) {}
-+static inline void mptcp_sub_close(struct sock *sk, unsigned long delay) {}
-+static inline void mptcp_set_rto(const struct sock *sk) {}
-+static inline void mptcp_send_fin(const struct sock *meta_sk) {}
-+static inline void mptcp_parse_options(const uint8_t *ptr, const int opsize,
-+ const struct mptcp_options_received *mopt,
-+ const struct sk_buff *skb) {}
-+static inline void mptcp_syn_options(const struct sock *sk,
-+ struct tcp_out_options *opts,
-+ unsigned *remaining) {}
-+static inline void mptcp_synack_options(struct request_sock *req,
-+ struct tcp_out_options *opts,
-+ unsigned *remaining) {}
-+
-+static inline void mptcp_established_options(struct sock *sk,
-+ struct sk_buff *skb,
-+ struct tcp_out_options *opts,
-+ unsigned *size) {}
-+static inline void mptcp_options_write(__be32 *ptr, struct tcp_sock *tp,
-+ const struct tcp_out_options *opts,
-+ struct sk_buff *skb) {}
-+static inline void mptcp_close(struct sock *meta_sk, long timeout) {}
-+static inline int mptcp_doit(struct sock *sk)
-+{
-+ return 0;
-+}
-+static inline int mptcp_check_req_fastopen(struct sock *child,
-+ struct request_sock *req)
-+{
-+ return 1;
-+}
-+static inline int mptcp_check_req_master(const struct sock *sk,
-+ const struct sock *child,
-+ struct request_sock *req,
-+ struct request_sock **prev)
-+{
-+ return 1;
-+}
-+static inline struct sock *mptcp_check_req_child(struct sock *sk,
-+ struct sock *child,
-+ struct request_sock *req,
-+ struct request_sock **prev,
-+ const struct mptcp_options_received *mopt)
-+{
-+ return NULL;
-+}
-+static inline unsigned int mptcp_current_mss(struct sock *meta_sk)
-+{
-+ return 0;
-+}
-+static inline int mptcp_select_size(const struct sock *meta_sk, bool sg)
-+{
-+ return 0;
-+}
-+static inline void mptcp_sub_close_passive(struct sock *sk) {}
-+static inline bool mptcp_fallback_infinite(const struct sock *sk, int flag)
-+{
-+ return false;
-+}
-+static inline void mptcp_init_mp_opt(const struct mptcp_options_received *mopt) {}
-+static inline int mptcp_check_rtt(const struct tcp_sock *tp, int time)
-+{
-+ return 0;
-+}
-+static inline int mptcp_check_snd_buf(const struct tcp_sock *tp)
-+{
-+ return 0;
-+}
-+static inline int mptcp_sysctl_syn_retries(void)
-+{
-+ return 0;
-+}
-+static inline void mptcp_send_reset(const struct sock *sk) {}
-+static inline int mptcp_handle_options(struct sock *sk,
-+ const struct tcphdr *th,
-+ struct sk_buff *skb)
-+{
-+ return 0;
-+}
-+static inline void mptcp_reset_mopt(struct tcp_sock *tp) {}
-+static inline void __init mptcp_init(void) {}
-+static inline int mptcp_trim_head(struct sock *sk, struct sk_buff *skb, u32 len)
-+{
-+ return 0;
-+}
-+static inline bool mptcp_sk_can_gso(const struct sock *sk)
-+{
-+ return false;
-+}
-+static inline bool mptcp_can_sg(const struct sock *meta_sk)
-+{
-+ return false;
-+}
-+static inline unsigned int mptcp_xmit_size_goal(const struct sock *meta_sk,
-+ u32 mss_now, int large_allowed)
-+{
-+ return 0;
-+}
-+static inline void mptcp_destroy_sock(struct sock *sk) {}
-+static inline int mptcp_rcv_synsent_state_process(struct sock *sk,
-+ struct sock **skptr,
-+ struct sk_buff *skb,
-+ const struct mptcp_options_received *mopt)
-+{
-+ return 0;
-+}
-+static inline bool mptcp_can_sendpage(struct sock *sk)
-+{
-+ return false;
-+}
-+static inline int mptcp_init_tw_sock(struct sock *sk,
-+ struct tcp_timewait_sock *tw)
-+{
-+ return 0;
-+}
-+static inline void mptcp_twsk_destructor(struct tcp_timewait_sock *tw) {}
-+static inline void mptcp_disconnect(struct sock *sk) {}
-+static inline void mptcp_tsq_flags(struct sock *sk) {}
-+static inline void mptcp_tsq_sub_deferred(struct sock *meta_sk) {}
-+static inline void mptcp_hash_remove_bh(struct tcp_sock *meta_tp) {}
-+static inline void mptcp_hash_remove(struct tcp_sock *meta_tp) {}
-+static inline void mptcp_reqsk_new_mptcp(struct request_sock *req,
-+ const struct tcp_options_received *rx_opt,
-+ const struct mptcp_options_received *mopt,
-+ const struct sk_buff *skb) {}
-+static inline void mptcp_remove_shortcuts(const struct mptcp_cb *mpcb,
-+ const struct sk_buff *skb) {}
-+static inline void mptcp_delete_synack_timer(struct sock *meta_sk) {}
-+#endif /* CONFIG_MPTCP */
-+
-+#endif /* _MPTCP_H */
-diff --git a/include/net/mptcp_v4.h b/include/net/mptcp_v4.h
-new file mode 100644
-index 000000000000..93ad97c77c5a
---- /dev/null
-+++ b/include/net/mptcp_v4.h
-@@ -0,0 +1,67 @@
-+/*
-+ * MPTCP implementation
-+ *
-+ * Initial Design & Implementation:
-+ * Sébastien Barré <sebastien.barre@uclouvain.be>
-+ *
-+ * Current Maintainer & Author:
-+ * Christoph Paasch <christoph.paasch@uclouvain.be>
-+ *
-+ * Additional authors:
-+ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
-+ * Gregory Detal <gregory.detal@uclouvain.be>
-+ * Fabien Duchêne <fabien.duchene@uclouvain.be>
-+ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
-+ * Lavkesh Lahngir <lavkesh51@gmail.com>
-+ * Andreas Ripke <ripke@neclab.eu>
-+ * Vlad Dogaru <vlad.dogaru@intel.com>
-+ * Octavian Purdila <octavian.purdila@intel.com>
-+ * John Ronan <jronan@tssg.org>
-+ * Catalin Nicutar <catalin.nicutar@gmail.com>
-+ * Brandon Heller <brandonh@stanford.edu>
-+ *
-+ *
-+ * This program is free software; you can redistribute it and/or
-+ * modify it under the terms of the GNU General Public License
-+ * as published by the Free Software Foundation; either version
-+ * 2 of the License, or (at your option) any later version.
-+ */
-+
-+#ifndef MPTCP_V4_H_
-+#define MPTCP_V4_H_
-+
-+
-+#include <linux/in.h>
-+#include <linux/skbuff.h>
-+#include <net/mptcp.h>
-+#include <net/request_sock.h>
-+#include <net/sock.h>
-+
-+extern struct request_sock_ops mptcp_request_sock_ops;
-+extern const struct inet_connection_sock_af_ops mptcp_v4_specific;
-+extern struct tcp_request_sock_ops mptcp_request_sock_ipv4_ops;
-+extern struct tcp_request_sock_ops mptcp_join_request_sock_ipv4_ops;
-+
-+#ifdef CONFIG_MPTCP
-+
-+int mptcp_v4_do_rcv(struct sock *meta_sk, struct sk_buff *skb);
-+struct sock *mptcp_v4_search_req(const __be16 rport, const __be32 raddr,
-+ const __be32 laddr, const struct net *net);
-+int mptcp_init4_subsockets(struct sock *meta_sk, const struct mptcp_loc4 *loc,
-+ struct mptcp_rem4 *rem);
-+int mptcp_pm_v4_init(void);
-+void mptcp_pm_v4_undo(void);
-+u32 mptcp_v4_get_nonce(__be32 saddr, __be32 daddr, __be16 sport, __be16 dport);
-+u64 mptcp_v4_get_key(__be32 saddr, __be32 daddr, __be16 sport, __be16 dport);
-+
-+#else
-+
-+static inline int mptcp_v4_do_rcv(const struct sock *meta_sk,
-+ const struct sk_buff *skb)
-+{
-+ return 0;
-+}
-+
-+#endif /* CONFIG_MPTCP */
-+
-+#endif /* MPTCP_V4_H_ */
-diff --git a/include/net/mptcp_v6.h b/include/net/mptcp_v6.h
-new file mode 100644
-index 000000000000..49a4f30ccd4d
---- /dev/null
-+++ b/include/net/mptcp_v6.h
-@@ -0,0 +1,69 @@
-+/*
-+ * MPTCP implementation
-+ *
-+ * Initial Design & Implementation:
-+ * Sébastien Barré <sebastien.barre@uclouvain.be>
-+ *
-+ * Current Maintainer & Author:
-+ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
-+ *
-+ * Additional authors:
-+ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
-+ * Gregory Detal <gregory.detal@uclouvain.be>
-+ * Fabien Duchêne <fabien.duchene@uclouvain.be>
-+ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
-+ * Lavkesh Lahngir <lavkesh51@gmail.com>
-+ * Andreas Ripke <ripke@neclab.eu>
-+ * Vlad Dogaru <vlad.dogaru@intel.com>
-+ * Octavian Purdila <octavian.purdila@intel.com>
-+ * John Ronan <jronan@tssg.org>
-+ * Catalin Nicutar <catalin.nicutar@gmail.com>
-+ * Brandon Heller <brandonh@stanford.edu>
-+ *
-+ *
-+ * This program is free software; you can redistribute it and/or
-+ * modify it under the terms of the GNU General Public License
-+ * as published by the Free Software Foundation; either version
-+ * 2 of the License, or (at your option) any later version.
-+ */
-+
-+#ifndef _MPTCP_V6_H
-+#define _MPTCP_V6_H
-+
-+#include <linux/in6.h>
-+#include <net/if_inet6.h>
-+
-+#include <net/mptcp.h>
-+
-+
-+#ifdef CONFIG_MPTCP
-+extern const struct inet_connection_sock_af_ops mptcp_v6_mapped;
-+extern const struct inet_connection_sock_af_ops mptcp_v6_specific;
-+extern struct request_sock_ops mptcp6_request_sock_ops;
-+extern struct tcp_request_sock_ops mptcp_request_sock_ipv6_ops;
-+extern struct tcp_request_sock_ops mptcp_join_request_sock_ipv6_ops;
-+
-+int mptcp_v6_do_rcv(struct sock *meta_sk, struct sk_buff *skb);
-+struct sock *mptcp_v6_search_req(const __be16 rport, const struct in6_addr *raddr,
-+ const struct in6_addr *laddr, const struct net *net);
-+int mptcp_init6_subsockets(struct sock *meta_sk, const struct mptcp_loc6 *loc,
-+ struct mptcp_rem6 *rem);
-+int mptcp_pm_v6_init(void);
-+void mptcp_pm_v6_undo(void);
-+__u32 mptcp_v6_get_nonce(const __be32 *saddr, const __be32 *daddr,
-+ __be16 sport, __be16 dport);
-+u64 mptcp_v6_get_key(const __be32 *saddr, const __be32 *daddr,
-+ __be16 sport, __be16 dport);
-+
-+#else /* CONFIG_MPTCP */
-+
-+#define mptcp_v6_mapped ipv6_mapped
-+
-+static inline int mptcp_v6_do_rcv(struct sock *meta_sk, struct sk_buff *skb)
-+{
-+ return 0;
-+}
-+
-+#endif /* CONFIG_MPTCP */
-+
-+#endif /* _MPTCP_V6_H */
-diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
-index 361d26077196..bae95a11c531 100644
---- a/include/net/net_namespace.h
-+++ b/include/net/net_namespace.h
-@@ -16,6 +16,7 @@
- #include <net/netns/packet.h>
- #include <net/netns/ipv4.h>
- #include <net/netns/ipv6.h>
-+#include <net/netns/mptcp.h>
- #include <net/netns/ieee802154_6lowpan.h>
- #include <net/netns/sctp.h>
- #include <net/netns/dccp.h>
-@@ -92,6 +93,9 @@ struct net {
- #if IS_ENABLED(CONFIG_IPV6)
- struct netns_ipv6 ipv6;
- #endif
-+#if IS_ENABLED(CONFIG_MPTCP)
-+ struct netns_mptcp mptcp;
-+#endif
- #if IS_ENABLED(CONFIG_IEEE802154_6LOWPAN)
- struct netns_ieee802154_lowpan ieee802154_lowpan;
- #endif
-diff --git a/include/net/netns/mptcp.h b/include/net/netns/mptcp.h
-new file mode 100644
-index 000000000000..bad418b04cc8
---- /dev/null
-+++ b/include/net/netns/mptcp.h
-@@ -0,0 +1,44 @@
-+/*
-+ * MPTCP implementation - MPTCP namespace
-+ *
-+ * Initial Design & Implementation:
-+ * Sébastien Barré <sebastien.barre@uclouvain.be>
-+ *
-+ * Current Maintainer:
-+ * Christoph Paasch <christoph.paasch@uclouvain.be>
-+ *
-+ * Additional authors:
-+ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
-+ * Gregory Detal <gregory.detal@uclouvain.be>
-+ * Fabien Duchêne <fabien.duchene@uclouvain.be>
-+ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
-+ * Lavkesh Lahngir <lavkesh51@gmail.com>
-+ * Andreas Ripke <ripke@neclab.eu>
-+ * Vlad Dogaru <vlad.dogaru@intel.com>
-+ * Octavian Purdila <octavian.purdila@intel.com>
-+ * John Ronan <jronan@tssg.org>
-+ * Catalin Nicutar <catalin.nicutar@gmail.com>
-+ * Brandon Heller <brandonh@stanford.edu>
-+ *
-+ *
-+ * This program is free software; you can redistribute it and/or
-+ * modify it under the terms of the GNU General Public License
-+ * as published by the Free Software Foundation; either version
-+ * 2 of the License, or (at your option) any later version.
-+ */
-+
-+#ifndef __NETNS_MPTCP_H__
-+#define __NETNS_MPTCP_H__
-+
-+#include <linux/compiler.h>
-+
-+enum {
-+ MPTCP_PM_FULLMESH = 0,
-+ MPTCP_PM_MAX
-+};
-+
-+struct netns_mptcp {
-+ void *path_managers[MPTCP_PM_MAX];
-+};
-+
-+#endif /* __NETNS_MPTCP_H__ */
-diff --git a/include/net/request_sock.h b/include/net/request_sock.h
-index 7f830ff67f08..e79e87a8e1a6 100644
---- a/include/net/request_sock.h
-+++ b/include/net/request_sock.h
-@@ -164,7 +164,7 @@ struct request_sock_queue {
- };
-
- int reqsk_queue_alloc(struct request_sock_queue *queue,
-- unsigned int nr_table_entries);
-+ unsigned int nr_table_entries, gfp_t flags);
-
- void __reqsk_queue_destroy(struct request_sock_queue *queue);
- void reqsk_queue_destroy(struct request_sock_queue *queue);
-diff --git a/include/net/sock.h b/include/net/sock.h
-index 156350745700..0e23cae8861f 100644
---- a/include/net/sock.h
-+++ b/include/net/sock.h
-@@ -901,6 +901,16 @@ void sk_clear_memalloc(struct sock *sk);
-
- int sk_wait_data(struct sock *sk, long *timeo);
-
-+/* START - needed for MPTCP */
-+struct sock *sk_prot_alloc(struct proto *prot, gfp_t priority, int family);
-+void sock_lock_init(struct sock *sk);
-+
-+extern struct lock_class_key af_callback_keys[AF_MAX];
-+extern char *const af_family_clock_key_strings[AF_MAX+1];
-+
-+#define SK_FLAGS_TIMESTAMP ((1UL << SOCK_TIMESTAMP) | (1UL << SOCK_TIMESTAMPING_RX_SOFTWARE))
-+/* END - needed for MPTCP */
-+
- struct request_sock_ops;
- struct timewait_sock_ops;
- struct inet_hashinfo;
-diff --git a/include/net/tcp.h b/include/net/tcp.h
-index 7286db80e8b8..ff92e74cd684 100644
---- a/include/net/tcp.h
-+++ b/include/net/tcp.h
-@@ -177,6 +177,7 @@ void tcp_time_wait(struct sock *sk, int state, int timeo);
- #define TCPOPT_SACK 5 /* SACK Block */
- #define TCPOPT_TIMESTAMP 8 /* Better RTT estimations/PAWS */
- #define TCPOPT_MD5SIG 19 /* MD5 Signature (RFC2385) */
-+#define TCPOPT_MPTCP 30
- #define TCPOPT_EXP 254 /* Experimental */
- /* Magic number to be after the option value for sharing TCP
- * experimental options. See draft-ietf-tcpm-experimental-options-00.txt
-@@ -229,6 +230,27 @@ void tcp_time_wait(struct sock *sk, int state, int timeo);
- #define TFO_SERVER_WO_SOCKOPT1 0x400
- #define TFO_SERVER_WO_SOCKOPT2 0x800
-
-+/* Flags from tcp_input.c for tcp_ack */
-+#define FLAG_DATA 0x01 /* Incoming frame contained data. */
-+#define FLAG_WIN_UPDATE 0x02 /* Incoming ACK was a window update. */
-+#define FLAG_DATA_ACKED 0x04 /* This ACK acknowledged new data. */
-+#define FLAG_RETRANS_DATA_ACKED 0x08 /* "" "" some of which was retransmitted. */
-+#define FLAG_SYN_ACKED 0x10 /* This ACK acknowledged SYN. */
-+#define FLAG_DATA_SACKED 0x20 /* New SACK. */
-+#define FLAG_ECE 0x40 /* ECE in this ACK */
-+#define FLAG_SLOWPATH 0x100 /* Do not skip RFC checks for window update.*/
-+#define FLAG_ORIG_SACK_ACKED 0x200 /* Never retransmitted data are (s)acked */
-+#define FLAG_SND_UNA_ADVANCED 0x400 /* Snd_una was changed (!= FLAG_DATA_ACKED) */
-+#define FLAG_DSACKING_ACK 0x800 /* SACK blocks contained D-SACK info */
-+#define FLAG_SACK_RENEGING 0x2000 /* snd_una advanced to a sacked seq */
-+#define FLAG_UPDATE_TS_RECENT 0x4000 /* tcp_replace_ts_recent() */
-+#define MPTCP_FLAG_DATA_ACKED 0x8000
-+
-+#define FLAG_ACKED (FLAG_DATA_ACKED|FLAG_SYN_ACKED)
-+#define FLAG_NOT_DUP (FLAG_DATA|FLAG_WIN_UPDATE|FLAG_ACKED)
-+#define FLAG_CA_ALERT (FLAG_DATA_SACKED|FLAG_ECE)
-+#define FLAG_FORWARD_PROGRESS (FLAG_ACKED|FLAG_DATA_SACKED)
-+
- extern struct inet_timewait_death_row tcp_death_row;
-
- /* sysctl variables for tcp */
-@@ -344,6 +366,107 @@ extern struct proto tcp_prot;
- #define TCP_ADD_STATS_USER(net, field, val) SNMP_ADD_STATS_USER((net)->mib.tcp_statistics, field, val)
- #define TCP_ADD_STATS(net, field, val) SNMP_ADD_STATS((net)->mib.tcp_statistics, field, val)
-
-+/**** START - Exports needed for MPTCP ****/
-+extern const struct tcp_request_sock_ops tcp_request_sock_ipv4_ops;
-+extern const struct tcp_request_sock_ops tcp_request_sock_ipv6_ops;
-+
-+struct mptcp_options_received;
-+
-+void tcp_enter_quickack_mode(struct sock *sk);
-+int tcp_close_state(struct sock *sk);
-+void tcp_minshall_update(struct tcp_sock *tp, unsigned int mss_now,
-+ const struct sk_buff *skb);
-+int tcp_xmit_probe_skb(struct sock *sk, int urgent);
-+void tcp_event_new_data_sent(struct sock *sk, const struct sk_buff *skb);
-+int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
-+ gfp_t gfp_mask);
-+unsigned int tcp_mss_split_point(const struct sock *sk,
-+ const struct sk_buff *skb,
-+ unsigned int mss_now,
-+ unsigned int max_segs,
-+ int nonagle);
-+bool tcp_nagle_test(const struct tcp_sock *tp, const struct sk_buff *skb,
-+ unsigned int cur_mss, int nonagle);
-+bool tcp_snd_wnd_test(const struct tcp_sock *tp, const struct sk_buff *skb,
-+ unsigned int cur_mss);
-+unsigned int tcp_cwnd_test(const struct tcp_sock *tp, const struct sk_buff *skb);
-+int tcp_init_tso_segs(const struct sock *sk, struct sk_buff *skb,
-+ unsigned int mss_now);
-+void __pskb_trim_head(struct sk_buff *skb, int len);
-+void tcp_queue_skb(struct sock *sk, struct sk_buff *skb);
-+void tcp_init_nondata_skb(struct sk_buff *skb, u32 seq, u8 flags);
-+void tcp_reset(struct sock *sk);
-+bool tcp_may_update_window(const struct tcp_sock *tp, const u32 ack,
-+ const u32 ack_seq, const u32 nwin);
-+bool tcp_urg_mode(const struct tcp_sock *tp);
-+void tcp_ack_probe(struct sock *sk);
-+void tcp_rearm_rto(struct sock *sk);
-+int tcp_write_timeout(struct sock *sk);
-+bool retransmits_timed_out(struct sock *sk, unsigned int boundary,
-+ unsigned int timeout, bool syn_set);
-+void tcp_write_err(struct sock *sk);
-+void tcp_adjust_pcount(struct sock *sk, const struct sk_buff *skb, int decr);
-+void tcp_set_skb_tso_segs(const struct sock *sk, struct sk_buff *skb,
-+ unsigned int mss_now);
-+
-+int tcp_v4_rtx_synack(struct sock *sk, struct request_sock *req);
-+void tcp_v4_reqsk_send_ack(struct sock *sk, struct sk_buff *skb,
-+ struct request_sock *req);
-+__u32 tcp_v4_init_sequence(const struct sk_buff *skb);
-+int tcp_v4_send_synack(struct sock *sk, struct dst_entry *dst,
-+ struct flowi *fl,
-+ struct request_sock *req,
-+ u16 queue_mapping,
-+ struct tcp_fastopen_cookie *foc);
-+void tcp_v4_send_reset(struct sock *sk, struct sk_buff *skb);
-+struct ip_options_rcu *tcp_v4_save_options(struct sk_buff *skb);
-+struct sock *tcp_v4_hnd_req(struct sock *sk, struct sk_buff *skb);
-+void tcp_v4_reqsk_destructor(struct request_sock *req);
-+
-+int tcp_v6_rtx_synack(struct sock *sk, struct request_sock *req);
-+void tcp_v6_reqsk_send_ack(struct sock *sk, struct sk_buff *skb,
-+ struct request_sock *req);
-+__u32 tcp_v6_init_sequence(const struct sk_buff *skb);
-+int tcp_v6_send_synack(struct sock *sk, struct dst_entry *dst,
-+ struct flowi *fl, struct request_sock *req,
-+ u16 queue_mapping, struct tcp_fastopen_cookie *foc);
-+void tcp_v6_send_reset(struct sock *sk, struct sk_buff *skb);
-+int tcp_v6_do_rcv(struct sock *sk, struct sk_buff *skb);
-+int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len);
-+void tcp_v6_destroy_sock(struct sock *sk);
-+void inet6_sk_rx_dst_set(struct sock *sk, const struct sk_buff *skb);
-+void tcp_v6_hash(struct sock *sk);
-+struct sock *tcp_v6_hnd_req(struct sock *sk,struct sk_buff *skb);
-+struct sock *tcp_v6_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
-+ struct request_sock *req,
-+ struct dst_entry *dst);
-+void tcp_v6_reqsk_destructor(struct request_sock *req);
-+
-+unsigned int tcp_xmit_size_goal(struct sock *sk, u32 mss_now,
-+ int large_allowed);
-+u32 tcp_tso_acked(struct sock *sk, struct sk_buff *skb);
-+
-+void skb_clone_fraglist(struct sk_buff *skb);
-+void copy_skb_header(struct sk_buff *new, const struct sk_buff *old);
-+
-+void inet_twsk_free(struct inet_timewait_sock *tw);
-+int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb);
-+/* These states need RST on ABORT according to RFC793 */
-+static inline bool tcp_need_reset(int state)
-+{
-+ return (1 << state) &
-+ (TCPF_ESTABLISHED | TCPF_CLOSE_WAIT | TCPF_FIN_WAIT1 |
-+ TCPF_FIN_WAIT2 | TCPF_SYN_RECV);
-+}
-+
-+bool tcp_dma_try_early_copy(struct sock *sk, struct sk_buff *skb,
-+ int hlen);
-+int __must_check tcp_queue_rcv(struct sock *sk, struct sk_buff *skb, int hdrlen,
-+ bool *fragstolen);
-+bool tcp_try_coalesce(struct sock *sk, struct sk_buff *to,
-+ struct sk_buff *from, bool *fragstolen);
-+/**** END - Exports needed for MPTCP ****/
-+
- void tcp_tasklet_init(void);
-
- void tcp_v4_err(struct sk_buff *skb, u32);
-@@ -440,6 +563,7 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
- size_t len, int nonblock, int flags, int *addr_len);
- void tcp_parse_options(const struct sk_buff *skb,
- struct tcp_options_received *opt_rx,
-+ struct mptcp_options_received *mopt_rx,
- int estab, struct tcp_fastopen_cookie *foc);
- const u8 *tcp_parse_md5sig_option(const struct tcphdr *th);
-
-@@ -493,14 +617,8 @@ static inline u32 tcp_cookie_time(void)
-
- u32 __cookie_v4_init_sequence(const struct iphdr *iph, const struct tcphdr *th,
- u16 *mssp);
--__u32 cookie_v4_init_sequence(struct sock *sk, struct sk_buff *skb, __u16 *mss);
--#else
--static inline __u32 cookie_v4_init_sequence(struct sock *sk,
-- struct sk_buff *skb,
-- __u16 *mss)
--{
-- return 0;
--}
-+__u32 cookie_v4_init_sequence(struct sock *sk, const struct sk_buff *skb,
-+ __u16 *mss);
- #endif
-
- __u32 cookie_init_timestamp(struct request_sock *req);
-@@ -516,13 +634,6 @@ u32 __cookie_v6_init_sequence(const struct ipv6hdr *iph,
- const struct tcphdr *th, u16 *mssp);
- __u32 cookie_v6_init_sequence(struct sock *sk, const struct sk_buff *skb,
- __u16 *mss);
--#else
--static inline __u32 cookie_v6_init_sequence(struct sock *sk,
-- struct sk_buff *skb,
-- __u16 *mss)
--{
-- return 0;
--}
- #endif
- /* tcp_output.c */
-
-@@ -551,10 +662,17 @@ void tcp_send_delayed_ack(struct sock *sk);
- void tcp_send_loss_probe(struct sock *sk);
- bool tcp_schedule_loss_probe(struct sock *sk);
-
-+u16 tcp_select_window(struct sock *sk);
-+bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
-+ int push_one, gfp_t gfp);
-+
- /* tcp_input.c */
- void tcp_resume_early_retransmit(struct sock *sk);
- void tcp_rearm_rto(struct sock *sk);
- void tcp_reset(struct sock *sk);
-+void tcp_set_rto(struct sock *sk);
-+bool tcp_should_expand_sndbuf(const struct sock *sk);
-+bool tcp_prune_ofo_queue(struct sock *sk);
-
- /* tcp_timer.c */
- void tcp_init_xmit_timers(struct sock *);
-@@ -703,14 +821,27 @@ void tcp_send_window_probe(struct sock *sk);
- */
- struct tcp_skb_cb {
- union {
-- struct inet_skb_parm h4;
-+ union {
-+ struct inet_skb_parm h4;
- #if IS_ENABLED(CONFIG_IPV6)
-- struct inet6_skb_parm h6;
-+ struct inet6_skb_parm h6;
- #endif
-- } header; /* For incoming frames */
-+ } header; /* For incoming frames */
-+#ifdef CONFIG_MPTCP
-+ union { /* For MPTCP outgoing frames */
-+ __u32 path_mask; /* paths that tried to send this skb */
-+ __u32 dss[6]; /* DSS options */
-+ };
-+#endif
-+ };
- __u32 seq; /* Starting sequence number */
- __u32 end_seq; /* SEQ + FIN + SYN + datalen */
- __u32 when; /* used to compute rtt's */
-+#ifdef CONFIG_MPTCP
-+ __u8 mptcp_flags; /* flags for the MPTCP layer */
-+ __u8 dss_off; /* Number of 4-byte words until
-+ * seq-number */
-+#endif
- __u8 tcp_flags; /* TCP header flags. (tcp[13]) */
-
- __u8 sacked; /* State flags for SACK/FACK. */
-@@ -1075,7 +1206,8 @@ u32 tcp_default_init_rwnd(u32 mss);
- /* Determine a window scaling and initial window to offer. */
- void tcp_select_initial_window(int __space, __u32 mss, __u32 *rcv_wnd,
- __u32 *window_clamp, int wscale_ok,
-- __u8 *rcv_wscale, __u32 init_rcv_wnd);
-+ __u8 *rcv_wscale, __u32 init_rcv_wnd,
-+ const struct sock *sk);
-
- static inline int tcp_win_from_space(int space)
- {
-@@ -1084,15 +1216,34 @@ static inline int tcp_win_from_space(int space)
- space - (space>>sysctl_tcp_adv_win_scale);
- }
-
-+#ifdef CONFIG_MPTCP
-+extern struct static_key mptcp_static_key;
-+static inline bool mptcp(const struct tcp_sock *tp)
-+{
-+ return static_key_false(&mptcp_static_key) && tp->mpc;
-+}
-+#else
-+static inline bool mptcp(const struct tcp_sock *tp)
-+{
-+ return 0;
-+}
-+#endif
-+
- /* Note: caller must be prepared to deal with negative returns */
- static inline int tcp_space(const struct sock *sk)
- {
-+ if (mptcp(tcp_sk(sk)))
-+ sk = tcp_sk(sk)->meta_sk;
-+
- return tcp_win_from_space(sk->sk_rcvbuf -
- atomic_read(&sk->sk_rmem_alloc));
- }
-
- static inline int tcp_full_space(const struct sock *sk)
- {
-+ if (mptcp(tcp_sk(sk)))
-+ sk = tcp_sk(sk)->meta_sk;
-+
- return tcp_win_from_space(sk->sk_rcvbuf);
- }
-
-@@ -1115,6 +1266,8 @@ static inline void tcp_openreq_init(struct request_sock *req,
- ireq->wscale_ok = rx_opt->wscale_ok;
- ireq->acked = 0;
- ireq->ecn_ok = 0;
-+ ireq->mptcp_rqsk = 0;
-+ ireq->saw_mpc = 0;
- ireq->ir_rmt_port = tcp_hdr(skb)->source;
- ireq->ir_num = ntohs(tcp_hdr(skb)->dest);
- }
-@@ -1585,6 +1738,11 @@ int tcp4_proc_init(void);
- void tcp4_proc_exit(void);
- #endif
-
-+int tcp_rtx_synack(struct sock *sk, struct request_sock *req);
-+int tcp_conn_request(struct request_sock_ops *rsk_ops,
-+ const struct tcp_request_sock_ops *af_ops,
-+ struct sock *sk, struct sk_buff *skb);
-+
- /* TCP af-specific functions */
- struct tcp_sock_af_ops {
- #ifdef CONFIG_TCP_MD5SIG
-@@ -1601,7 +1759,32 @@ struct tcp_sock_af_ops {
- #endif
- };
-
-+/* TCP/MPTCP-specific functions */
-+struct tcp_sock_ops {
-+ u32 (*__select_window)(struct sock *sk);
-+ u16 (*select_window)(struct sock *sk);
-+ void (*select_initial_window)(int __space, __u32 mss, __u32 *rcv_wnd,
-+ __u32 *window_clamp, int wscale_ok,
-+ __u8 *rcv_wscale, __u32 init_rcv_wnd,
-+ const struct sock *sk);
-+ void (*init_buffer_space)(struct sock *sk);
-+ void (*set_rto)(struct sock *sk);
-+ bool (*should_expand_sndbuf)(const struct sock *sk);
-+ void (*send_fin)(struct sock *sk);
-+ bool (*write_xmit)(struct sock *sk, unsigned int mss_now, int nonagle,
-+ int push_one, gfp_t gfp);
-+ void (*send_active_reset)(struct sock *sk, gfp_t priority);
-+ int (*write_wakeup)(struct sock *sk);
-+ bool (*prune_ofo_queue)(struct sock *sk);
-+ void (*retransmit_timer)(struct sock *sk);
-+ void (*time_wait)(struct sock *sk, int state, int timeo);
-+ void (*cleanup_rbuf)(struct sock *sk, int copied);
-+ void (*init_congestion_control)(struct sock *sk);
-+};
-+extern const struct tcp_sock_ops tcp_specific;
-+
- struct tcp_request_sock_ops {
-+ u16 mss_clamp;
- #ifdef CONFIG_TCP_MD5SIG
- struct tcp_md5sig_key *(*md5_lookup) (struct sock *sk,
- struct request_sock *req);
-@@ -1611,8 +1794,39 @@ struct tcp_request_sock_ops {
- const struct request_sock *req,
- const struct sk_buff *skb);
- #endif
-+ int (*init_req)(struct request_sock *req, struct sock *sk,
-+ struct sk_buff *skb);
-+#ifdef CONFIG_SYN_COOKIES
-+ __u32 (*cookie_init_seq)(struct sock *sk, const struct sk_buff *skb,
-+ __u16 *mss);
-+#endif
-+ struct dst_entry *(*route_req)(struct sock *sk, struct flowi *fl,
-+ const struct request_sock *req,
-+ bool *strict);
-+ __u32 (*init_seq)(const struct sk_buff *skb);
-+ int (*send_synack)(struct sock *sk, struct dst_entry *dst,
-+ struct flowi *fl, struct request_sock *req,
-+ u16 queue_mapping, struct tcp_fastopen_cookie *foc);
-+ void (*queue_hash_add)(struct sock *sk, struct request_sock *req,
-+ const unsigned long timeout);
- };
-
-+#ifdef CONFIG_SYN_COOKIES
-+static inline __u32 cookie_init_sequence(const struct tcp_request_sock_ops *ops,
-+ struct sock *sk, struct sk_buff *skb,
-+ __u16 *mss)
-+{
-+ return ops->cookie_init_seq(sk, skb, mss);
-+}
-+#else
-+static inline __u32 cookie_init_sequence(const struct tcp_request_sock_ops *ops,
-+ struct sock *sk, struct sk_buff *skb,
-+ __u16 *mss)
-+{
-+ return 0;
-+}
-+#endif
-+
- int tcpv4_offload_init(void);
-
- void tcp_v4_init(void);
-diff --git a/include/uapi/linux/if.h b/include/uapi/linux/if.h
-index 9cf2394f0bcf..c2634b6ed854 100644
---- a/include/uapi/linux/if.h
-+++ b/include/uapi/linux/if.h
-@@ -109,6 +109,9 @@ enum net_device_flags {
- #define IFF_DORMANT IFF_DORMANT
- #define IFF_ECHO IFF_ECHO
-
-+#define IFF_NOMULTIPATH 0x80000 /* Disable for MPTCP */
-+#define IFF_MPBACKUP 0x100000 /* Use as backup path for MPTCP */
-+
- #define IFF_VOLATILE (IFF_LOOPBACK|IFF_POINTOPOINT|IFF_BROADCAST|IFF_ECHO|\
- IFF_MASTER|IFF_SLAVE|IFF_RUNNING|IFF_LOWER_UP|IFF_DORMANT)
-
-diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
-index 3b9718328d8b..487475681d84 100644
---- a/include/uapi/linux/tcp.h
-+++ b/include/uapi/linux/tcp.h
-@@ -112,6 +112,7 @@ enum {
- #define TCP_FASTOPEN 23 /* Enable FastOpen on listeners */
- #define TCP_TIMESTAMP 24
- #define TCP_NOTSENT_LOWAT 25 /* limit number of unsent bytes in write queue */
-+#define MPTCP_ENABLED 26
-
- struct tcp_repair_opt {
- __u32 opt_code;
-diff --git a/net/Kconfig b/net/Kconfig
-index d92afe4204d9..96b58593ad5e 100644
---- a/net/Kconfig
-+++ b/net/Kconfig
-@@ -79,6 +79,7 @@ if INET
- source "net/ipv4/Kconfig"
- source "net/ipv6/Kconfig"
- source "net/netlabel/Kconfig"
-+source "net/mptcp/Kconfig"
-
- endif # if INET
-
-diff --git a/net/Makefile b/net/Makefile
-index cbbbe6d657ca..244bac1435b1 100644
---- a/net/Makefile
-+++ b/net/Makefile
-@@ -20,6 +20,7 @@ obj-$(CONFIG_INET) += ipv4/
- obj-$(CONFIG_XFRM) += xfrm/
- obj-$(CONFIG_UNIX) += unix/
- obj-$(CONFIG_NET) += ipv6/
-+obj-$(CONFIG_MPTCP) += mptcp/
- obj-$(CONFIG_PACKET) += packet/
- obj-$(CONFIG_NET_KEY) += key/
- obj-$(CONFIG_BRIDGE) += bridge/
-diff --git a/net/core/dev.c b/net/core/dev.c
-index 367a586d0c8a..215d2757fbf6 100644
---- a/net/core/dev.c
-+++ b/net/core/dev.c
-@@ -5420,7 +5420,7 @@ int __dev_change_flags(struct net_device *dev, unsigned int flags)
-
- dev->flags = (flags & (IFF_DEBUG | IFF_NOTRAILERS | IFF_NOARP |
- IFF_DYNAMIC | IFF_MULTICAST | IFF_PORTSEL |
-- IFF_AUTOMEDIA)) |
-+ IFF_AUTOMEDIA | IFF_NOMULTIPATH | IFF_MPBACKUP)) |
- (dev->flags & (IFF_UP | IFF_VOLATILE | IFF_PROMISC |
- IFF_ALLMULTI));
-
-diff --git a/net/core/request_sock.c b/net/core/request_sock.c
-index 467f326126e0..909dfa13f499 100644
---- a/net/core/request_sock.c
-+++ b/net/core/request_sock.c
-@@ -38,7 +38,8 @@ int sysctl_max_syn_backlog = 256;
- EXPORT_SYMBOL(sysctl_max_syn_backlog);
-
- int reqsk_queue_alloc(struct request_sock_queue *queue,
-- unsigned int nr_table_entries)
-+ unsigned int nr_table_entries,
-+ gfp_t flags)
- {
- size_t lopt_size = sizeof(struct listen_sock);
- struct listen_sock *lopt;
-@@ -48,9 +49,11 @@ int reqsk_queue_alloc(struct request_sock_queue *queue,
- nr_table_entries = roundup_pow_of_two(nr_table_entries + 1);
- lopt_size += nr_table_entries * sizeof(struct request_sock *);
- if (lopt_size > PAGE_SIZE)
-- lopt = vzalloc(lopt_size);
-+ lopt = __vmalloc(lopt_size,
-+ flags | __GFP_HIGHMEM | __GFP_ZERO,
-+ PAGE_KERNEL);
- else
-- lopt = kzalloc(lopt_size, GFP_KERNEL);
-+ lopt = kzalloc(lopt_size, flags);
- if (lopt == NULL)
- return -ENOMEM;
-
-diff --git a/net/core/skbuff.c b/net/core/skbuff.c
-index c1a33033cbe2..8abc5d60fbe3 100644
---- a/net/core/skbuff.c
-+++ b/net/core/skbuff.c
-@@ -472,7 +472,7 @@ static inline void skb_drop_fraglist(struct sk_buff *skb)
- skb_drop_list(&skb_shinfo(skb)->frag_list);
- }
-
--static void skb_clone_fraglist(struct sk_buff *skb)
-+void skb_clone_fraglist(struct sk_buff *skb)
- {
- struct sk_buff *list;
-
-@@ -897,7 +897,7 @@ static void skb_headers_offset_update(struct sk_buff *skb, int off)
- skb->inner_mac_header += off;
- }
-
--static void copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
-+void copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
- {
- __copy_skb_header(new, old);
-
-diff --git a/net/core/sock.c b/net/core/sock.c
-index 026e01f70274..359295523177 100644
---- a/net/core/sock.c
-+++ b/net/core/sock.c
-@@ -136,6 +136,11 @@
-
- #include <trace/events/sock.h>
-
-+#ifdef CONFIG_MPTCP
-+#include <net/mptcp.h>
-+#include <net/inet_common.h>
-+#endif
-+
- #ifdef CONFIG_INET
- #include <net/tcp.h>
- #endif
-@@ -280,7 +285,7 @@ static const char *const af_family_slock_key_strings[AF_MAX+1] = {
- "slock-AF_IEEE802154", "slock-AF_CAIF" , "slock-AF_ALG" ,
- "slock-AF_NFC" , "slock-AF_VSOCK" ,"slock-AF_MAX"
- };
--static const char *const af_family_clock_key_strings[AF_MAX+1] = {
-+char *const af_family_clock_key_strings[AF_MAX+1] = {
- "clock-AF_UNSPEC", "clock-AF_UNIX" , "clock-AF_INET" ,
- "clock-AF_AX25" , "clock-AF_IPX" , "clock-AF_APPLETALK",
- "clock-AF_NETROM", "clock-AF_BRIDGE" , "clock-AF_ATMPVC" ,
-@@ -301,7 +306,7 @@ static const char *const af_family_clock_key_strings[AF_MAX+1] = {
- * sk_callback_lock locking rules are per-address-family,
- * so split the lock classes by using a per-AF key:
- */
--static struct lock_class_key af_callback_keys[AF_MAX];
-+struct lock_class_key af_callback_keys[AF_MAX];
-
- /* Take into consideration the size of the struct sk_buff overhead in the
- * determination of these values, since that is non-constant across
-@@ -422,8 +427,6 @@ static void sock_warn_obsolete_bsdism(const char *name)
- }
- }
-
--#define SK_FLAGS_TIMESTAMP ((1UL << SOCK_TIMESTAMP) | (1UL << SOCK_TIMESTAMPING_RX_SOFTWARE))
--
- static void sock_disable_timestamp(struct sock *sk, unsigned long flags)
- {
- if (sk->sk_flags & flags) {
-@@ -1253,8 +1256,25 @@ lenout:
- *
- * (We also register the sk_lock with the lock validator.)
- */
--static inline void sock_lock_init(struct sock *sk)
--{
-+void sock_lock_init(struct sock *sk)
-+{
-+#ifdef CONFIG_MPTCP
-+ /* Reclassify the lock-class for subflows */
-+ if (sk->sk_type == SOCK_STREAM && sk->sk_protocol == IPPROTO_TCP)
-+ if (mptcp(tcp_sk(sk)) || tcp_sk(sk)->is_master_sk) {
-+ sock_lock_init_class_and_name(sk, "slock-AF_INET-MPTCP",
-+ &meta_slock_key,
-+ "sk_lock-AF_INET-MPTCP",
-+ &meta_key);
-+
-+ /* We don't yet have the mptcp-point.
-+ * Thus we still need inet_sock_destruct
-+ */
-+ sk->sk_destruct = inet_sock_destruct;
-+ return;
-+ }
-+#endif
-+
- sock_lock_init_class_and_name(sk,
- af_family_slock_key_strings[sk->sk_family],
- af_family_slock_keys + sk->sk_family,
-@@ -1301,7 +1321,7 @@ void sk_prot_clear_portaddr_nulls(struct sock *sk, int size)
- }
- EXPORT_SYMBOL(sk_prot_clear_portaddr_nulls);
-
--static struct sock *sk_prot_alloc(struct proto *prot, gfp_t priority,
-+struct sock *sk_prot_alloc(struct proto *prot, gfp_t priority,
- int family)
- {
- struct sock *sk;
-diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
-index 4db3c2a1679c..04cb17d4b0ce 100644
---- a/net/dccp/ipv6.c
-+++ b/net/dccp/ipv6.c
-@@ -386,7 +386,7 @@ static int dccp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
- if (sk_acceptq_is_full(sk) && inet_csk_reqsk_queue_young(sk) > 1)
- goto drop;
-
-- req = inet6_reqsk_alloc(&dccp6_request_sock_ops);
-+ req = inet_reqsk_alloc(&dccp6_request_sock_ops);
- if (req == NULL)
- goto drop;
-
-diff --git a/net/ipv4/Kconfig b/net/ipv4/Kconfig
-index 05c57f0fcabe..630434db0085 100644
---- a/net/ipv4/Kconfig
-+++ b/net/ipv4/Kconfig
-@@ -556,6 +556,30 @@ config TCP_CONG_ILLINOIS
- For further details see:
- http://www.ews.uiuc.edu/~shaoliu/tcpillinois/index.html
-
-+config TCP_CONG_COUPLED
-+ tristate "MPTCP COUPLED CONGESTION CONTROL"
-+ depends on MPTCP
-+ default n
-+ ---help---
-+ MultiPath TCP Coupled Congestion Control
-+ To enable it, just put 'coupled' in tcp_congestion_control
-+
-+config TCP_CONG_OLIA
-+ tristate "MPTCP Opportunistic Linked Increase"
-+ depends on MPTCP
-+ default n
-+ ---help---
-+ MultiPath TCP Opportunistic Linked Increase Congestion Control
-+ To enable it, just put 'olia' in tcp_congestion_control
-+
-+config TCP_CONG_WVEGAS
-+ tristate "MPTCP WVEGAS CONGESTION CONTROL"
-+ depends on MPTCP
-+ default n
-+ ---help---
-+ wVegas congestion control for MPTCP
-+ To enable it, just put 'wvegas' in tcp_congestion_control
-+
- choice
- prompt "Default TCP congestion control"
- default DEFAULT_CUBIC
-@@ -584,6 +608,15 @@ choice
- config DEFAULT_WESTWOOD
- bool "Westwood" if TCP_CONG_WESTWOOD=y
-
-+ config DEFAULT_COUPLED
-+ bool "Coupled" if TCP_CONG_COUPLED=y
-+
-+ config DEFAULT_OLIA
-+ bool "Olia" if TCP_CONG_OLIA=y
-+
-+ config DEFAULT_WVEGAS
-+ bool "Wvegas" if TCP_CONG_WVEGAS=y
-+
- config DEFAULT_RENO
- bool "Reno"
-
-@@ -605,6 +638,8 @@ config DEFAULT_TCP_CONG
- default "vegas" if DEFAULT_VEGAS
- default "westwood" if DEFAULT_WESTWOOD
- default "veno" if DEFAULT_VENO
-+ default "coupled" if DEFAULT_COUPLED
-+ default "wvegas" if DEFAULT_WVEGAS
- default "reno" if DEFAULT_RENO
- default "cubic"
-
-diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
-index d156b3c5f363..4afd6d8d9028 100644
---- a/net/ipv4/af_inet.c
-+++ b/net/ipv4/af_inet.c
-@@ -104,6 +104,7 @@
- #include <net/ip_fib.h>
- #include <net/inet_connection_sock.h>
- #include <net/tcp.h>
-+#include <net/mptcp.h>
- #include <net/udp.h>
- #include <net/udplite.h>
- #include <net/ping.h>
-@@ -246,8 +247,7 @@ EXPORT_SYMBOL(inet_listen);
- * Create an inet socket.
- */
-
--static int inet_create(struct net *net, struct socket *sock, int protocol,
-- int kern)
-+int inet_create(struct net *net, struct socket *sock, int protocol, int kern)
- {
- struct sock *sk;
- struct inet_protosw *answer;
-@@ -676,6 +676,23 @@ int inet_accept(struct socket *sock, struct socket *newsock, int flags)
- lock_sock(sk2);
-
- sock_rps_record_flow(sk2);
-+
-+ if (sk2->sk_protocol == IPPROTO_TCP && mptcp(tcp_sk(sk2))) {
-+ struct sock *sk_it = sk2;
-+
-+ mptcp_for_each_sk(tcp_sk(sk2)->mpcb, sk_it)
-+ sock_rps_record_flow(sk_it);
-+
-+ if (tcp_sk(sk2)->mpcb->master_sk) {
-+ sk_it = tcp_sk(sk2)->mpcb->master_sk;
-+
-+ write_lock_bh(&sk_it->sk_callback_lock);
-+ sk_it->sk_wq = newsock->wq;
-+ sk_it->sk_socket = newsock;
-+ write_unlock_bh(&sk_it->sk_callback_lock);
-+ }
-+ }
-+
- WARN_ON(!((1 << sk2->sk_state) &
- (TCPF_ESTABLISHED | TCPF_SYN_RECV |
- TCPF_CLOSE_WAIT | TCPF_CLOSE)));
-@@ -1763,6 +1780,9 @@ static int __init inet_init(void)
-
- ip_init();
-
-+ /* We must initialize MPTCP before TCP. */
-+ mptcp_init();
-+
- tcp_v4_init();
-
- /* Setup TCP slab cache for open requests. */
-diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
-index 14d02ea905b6..7d734d8af19b 100644
---- a/net/ipv4/inet_connection_sock.c
-+++ b/net/ipv4/inet_connection_sock.c
-@@ -23,6 +23,7 @@
- #include <net/route.h>
- #include <net/tcp_states.h>
- #include <net/xfrm.h>
-+#include <net/mptcp.h>
-
- #ifdef INET_CSK_DEBUG
- const char inet_csk_timer_bug_msg[] = "inet_csk BUG: unknown timer value\n";
-@@ -465,8 +466,8 @@ no_route:
- }
- EXPORT_SYMBOL_GPL(inet_csk_route_child_sock);
-
--static inline u32 inet_synq_hash(const __be32 raddr, const __be16 rport,
-- const u32 rnd, const u32 synq_hsize)
-+u32 inet_synq_hash(const __be32 raddr, const __be16 rport, const u32 rnd,
-+ const u32 synq_hsize)
- {
- return jhash_2words((__force u32)raddr, (__force u32)rport, rnd) & (synq_hsize - 1);
- }
-@@ -647,7 +648,7 @@ void inet_csk_reqsk_queue_prune(struct sock *parent,
-
- lopt->clock_hand = i;
-
-- if (lopt->qlen)
-+ if (lopt->qlen && !is_meta_sk(parent))
- inet_csk_reset_keepalive_timer(parent, interval);
- }
- EXPORT_SYMBOL_GPL(inet_csk_reqsk_queue_prune);
-@@ -664,7 +665,9 @@ struct sock *inet_csk_clone_lock(const struct sock *sk,
- const struct request_sock *req,
- const gfp_t priority)
- {
-- struct sock *newsk = sk_clone_lock(sk, priority);
-+ struct sock *newsk;
-+
-+ newsk = sk_clone_lock(sk, priority);
-
- if (newsk != NULL) {
- struct inet_connection_sock *newicsk = inet_csk(newsk);
-@@ -743,7 +746,8 @@ int inet_csk_listen_start(struct sock *sk, const int nr_table_entries)
- {
- struct inet_sock *inet = inet_sk(sk);
- struct inet_connection_sock *icsk = inet_csk(sk);
-- int rc = reqsk_queue_alloc(&icsk->icsk_accept_queue, nr_table_entries);
-+ int rc = reqsk_queue_alloc(&icsk->icsk_accept_queue, nr_table_entries,
-+ GFP_KERNEL);
-
- if (rc != 0)
- return rc;
-@@ -801,9 +805,14 @@ void inet_csk_listen_stop(struct sock *sk)
-
- while ((req = acc_req) != NULL) {
- struct sock *child = req->sk;
-+ bool mutex_taken = false;
-
- acc_req = req->dl_next;
-
-+ if (is_meta_sk(child)) {
-+ mutex_lock(&tcp_sk(child)->mpcb->mpcb_mutex);
-+ mutex_taken = true;
-+ }
- local_bh_disable();
- bh_lock_sock(child);
- WARN_ON(sock_owned_by_user(child));
-@@ -832,6 +841,8 @@ void inet_csk_listen_stop(struct sock *sk)
-
- bh_unlock_sock(child);
- local_bh_enable();
-+ if (mutex_taken)
-+ mutex_unlock(&tcp_sk(child)->mpcb->mpcb_mutex);
- sock_put(child);
-
- sk_acceptq_removed(sk);
-diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
-index c86624b36a62..0ff3fe004d62 100644
---- a/net/ipv4/syncookies.c
-+++ b/net/ipv4/syncookies.c
-@@ -170,7 +170,8 @@ u32 __cookie_v4_init_sequence(const struct iphdr *iph, const struct tcphdr *th,
- }
- EXPORT_SYMBOL_GPL(__cookie_v4_init_sequence);
-
--__u32 cookie_v4_init_sequence(struct sock *sk, struct sk_buff *skb, __u16 *mssp)
-+__u32 cookie_v4_init_sequence(struct sock *sk, const struct sk_buff *skb,
-+ __u16 *mssp)
- {
- const struct iphdr *iph = ip_hdr(skb);
- const struct tcphdr *th = tcp_hdr(skb);
-@@ -284,7 +285,7 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb,
-
- /* check for timestamp cookie support */
- memset(&tcp_opt, 0, sizeof(tcp_opt));
-- tcp_parse_options(skb, &tcp_opt, 0, NULL);
-+ tcp_parse_options(skb, &tcp_opt, NULL, 0, NULL);
-
- if (!cookie_check_timestamp(&tcp_opt, sock_net(sk), &ecn_ok))
- goto out;
-@@ -355,10 +356,10 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb,
- /* Try to redo what tcp_v4_send_synack did. */
- req->window_clamp = tp->window_clamp ? :dst_metric(&rt->dst, RTAX_WINDOW);
-
-- tcp_select_initial_window(tcp_full_space(sk), req->mss,
-- &req->rcv_wnd, &req->window_clamp,
-- ireq->wscale_ok, &rcv_wscale,
-- dst_metric(&rt->dst, RTAX_INITRWND));
-+ tp->ops->select_initial_window(tcp_full_space(sk), req->mss,
-+ &req->rcv_wnd, &req->window_clamp,
-+ ireq->wscale_ok, &rcv_wscale,
-+ dst_metric(&rt->dst, RTAX_INITRWND), sk);
-
- ireq->rcv_wscale = rcv_wscale;
-
-diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
-index 9d2118e5fbc7..2cb89f886d45 100644
---- a/net/ipv4/tcp.c
-+++ b/net/ipv4/tcp.c
-@@ -271,6 +271,7 @@
-
- #include <net/icmp.h>
- #include <net/inet_common.h>
-+#include <net/mptcp.h>
- #include <net/tcp.h>
- #include <net/xfrm.h>
- #include <net/ip.h>
-@@ -371,6 +372,24 @@ static int retrans_to_secs(u8 retrans, int timeout, int rto_max)
- return period;
- }
-
-+const struct tcp_sock_ops tcp_specific = {
-+ .__select_window = __tcp_select_window,
-+ .select_window = tcp_select_window,
-+ .select_initial_window = tcp_select_initial_window,
-+ .init_buffer_space = tcp_init_buffer_space,
-+ .set_rto = tcp_set_rto,
-+ .should_expand_sndbuf = tcp_should_expand_sndbuf,
-+ .init_congestion_control = tcp_init_congestion_control,
-+ .send_fin = tcp_send_fin,
-+ .write_xmit = tcp_write_xmit,
-+ .send_active_reset = tcp_send_active_reset,
-+ .write_wakeup = tcp_write_wakeup,
-+ .prune_ofo_queue = tcp_prune_ofo_queue,
-+ .retransmit_timer = tcp_retransmit_timer,
-+ .time_wait = tcp_time_wait,
-+ .cleanup_rbuf = tcp_cleanup_rbuf,
-+};
-+
- /* Address-family independent initialization for a tcp_sock.
- *
- * NOTE: A lot of things set to zero explicitly by call to
-@@ -419,6 +438,8 @@ void tcp_init_sock(struct sock *sk)
- sk->sk_sndbuf = sysctl_tcp_wmem[1];
- sk->sk_rcvbuf = sysctl_tcp_rmem[1];
-
-+ tp->ops = &tcp_specific;
-+
- local_bh_disable();
- sock_update_memcg(sk);
- sk_sockets_allocated_inc(sk);
-@@ -726,6 +747,14 @@ ssize_t tcp_splice_read(struct socket *sock, loff_t *ppos,
- int ret;
-
- sock_rps_record_flow(sk);
-+
-+#ifdef CONFIG_MPTCP
-+ if (mptcp(tcp_sk(sk))) {
-+ struct sock *sk_it;
-+ mptcp_for_each_sk(tcp_sk(sk)->mpcb, sk_it)
-+ sock_rps_record_flow(sk_it);
-+ }
-+#endif
- /*
- * We can't seek on a socket input
- */
-@@ -821,8 +850,7 @@ struct sk_buff *sk_stream_alloc_skb(struct sock *sk, int size, gfp_t gfp)
- return NULL;
- }
-
--static unsigned int tcp_xmit_size_goal(struct sock *sk, u32 mss_now,
-- int large_allowed)
-+unsigned int tcp_xmit_size_goal(struct sock *sk, u32 mss_now, int large_allowed)
- {
- struct tcp_sock *tp = tcp_sk(sk);
- u32 xmit_size_goal, old_size_goal;
-@@ -872,8 +900,13 @@ static int tcp_send_mss(struct sock *sk, int *size_goal, int flags)
- {
- int mss_now;
-
-- mss_now = tcp_current_mss(sk);
-- *size_goal = tcp_xmit_size_goal(sk, mss_now, !(flags & MSG_OOB));
-+ if (mptcp(tcp_sk(sk))) {
-+ mss_now = mptcp_current_mss(sk);
-+ *size_goal = mptcp_xmit_size_goal(sk, mss_now, !(flags & MSG_OOB));
-+ } else {
-+ mss_now = tcp_current_mss(sk);
-+ *size_goal = tcp_xmit_size_goal(sk, mss_now, !(flags & MSG_OOB));
-+ }
-
- return mss_now;
- }
-@@ -892,11 +925,32 @@ static ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
- * is fully established.
- */
- if (((1 << sk->sk_state) & ~(TCPF_ESTABLISHED | TCPF_CLOSE_WAIT)) &&
-- !tcp_passive_fastopen(sk)) {
-+ !tcp_passive_fastopen(mptcp(tp) && tp->mpcb->master_sk ?
-+ tp->mpcb->master_sk : sk)) {
- if ((err = sk_stream_wait_connect(sk, &timeo)) != 0)
- goto out_err;
- }
-
-+ if (mptcp(tp)) {
-+ struct sock *sk_it = sk;
-+
-+	 /* We must check this with the socket lock held because we iterate
-+ * over the subflows.
-+ */
-+ if (!mptcp_can_sendpage(sk)) {
-+ ssize_t ret;
-+
-+ release_sock(sk);
-+ ret = sock_no_sendpage(sk->sk_socket, page, offset,
-+ size, flags);
-+ lock_sock(sk);
-+ return ret;
-+ }
-+
-+ mptcp_for_each_sk(tp->mpcb, sk_it)
-+ sock_rps_record_flow(sk_it);
-+ }
-+
- clear_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags);
-
- mss_now = tcp_send_mss(sk, &size_goal, flags);
-@@ -1001,8 +1055,9 @@ int tcp_sendpage(struct sock *sk, struct page *page, int offset,
- {
- ssize_t res;
-
-- if (!(sk->sk_route_caps & NETIF_F_SG) ||
-- !(sk->sk_route_caps & NETIF_F_ALL_CSUM))
-+ /* If MPTCP is enabled, we check it later after establishment */
-+ if (!mptcp(tcp_sk(sk)) && (!(sk->sk_route_caps & NETIF_F_SG) ||
-+ !(sk->sk_route_caps & NETIF_F_ALL_CSUM)))
- return sock_no_sendpage(sk->sk_socket, page, offset, size,
- flags);
-
-@@ -1018,6 +1073,9 @@ static inline int select_size(const struct sock *sk, bool sg)
- const struct tcp_sock *tp = tcp_sk(sk);
- int tmp = tp->mss_cache;
-
-+ if (mptcp(tp))
-+ return mptcp_select_size(sk, sg);
-+
- if (sg) {
- if (sk_can_gso(sk)) {
- /* Small frames wont use a full page:
-@@ -1100,11 +1158,18 @@ int tcp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
- * is fully established.
- */
- if (((1 << sk->sk_state) & ~(TCPF_ESTABLISHED | TCPF_CLOSE_WAIT)) &&
-- !tcp_passive_fastopen(sk)) {
-+ !tcp_passive_fastopen(mptcp(tp) && tp->mpcb->master_sk ?
-+ tp->mpcb->master_sk : sk)) {
- if ((err = sk_stream_wait_connect(sk, &timeo)) != 0)
- goto do_error;
- }
-
-+ if (mptcp(tp)) {
-+ struct sock *sk_it = sk;
-+ mptcp_for_each_sk(tp->mpcb, sk_it)
-+ sock_rps_record_flow(sk_it);
-+ }
-+
- if (unlikely(tp->repair)) {
- if (tp->repair_queue == TCP_RECV_QUEUE) {
- copied = tcp_send_rcvq(sk, msg, size);
-@@ -1132,7 +1197,10 @@ int tcp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
- if (sk->sk_err || (sk->sk_shutdown & SEND_SHUTDOWN))
- goto out_err;
-
-- sg = !!(sk->sk_route_caps & NETIF_F_SG);
-+ if (mptcp(tp))
-+ sg = mptcp_can_sg(sk);
-+ else
-+ sg = !!(sk->sk_route_caps & NETIF_F_SG);
-
- while (--iovlen >= 0) {
- size_t seglen = iov->iov_len;
-@@ -1183,8 +1251,15 @@ new_segment:
-
- /*
- * Check whether we can use HW checksum.
-+ *
-+ * If dss-csum is enabled, we do not do hw-csum.
-+ * In case of non-mptcp we check the
-+ * device-capabilities.
-+	 * later in mptcp_write_xmit. In case of mptcp, hw-csums will be handled
-+ * later in mptcp_write_xmit.
- */
-- if (sk->sk_route_caps & NETIF_F_ALL_CSUM)
-+ if (((mptcp(tp) && !tp->mpcb->dss_csum) || !mptcp(tp)) &&
-+ (mptcp(tp) || sk->sk_route_caps & NETIF_F_ALL_CSUM))
- skb->ip_summed = CHECKSUM_PARTIAL;
-
- skb_entail(sk, skb);
-@@ -1422,7 +1497,7 @@ void tcp_cleanup_rbuf(struct sock *sk, int copied)
-
- /* Optimize, __tcp_select_window() is not cheap. */
- if (2*rcv_window_now <= tp->window_clamp) {
-- __u32 new_window = __tcp_select_window(sk);
-+ __u32 new_window = tp->ops->__select_window(sk);
-
- /* Send ACK now, if this read freed lots of space
- * in our buffer. Certainly, new_window is new window.
-@@ -1587,7 +1662,7 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
- /* Clean up data we have read: This will do ACK frames. */
- if (copied > 0) {
- tcp_recv_skb(sk, seq, &offset);
-- tcp_cleanup_rbuf(sk, copied);
-+ tp->ops->cleanup_rbuf(sk, copied);
- }
- return copied;
- }
-@@ -1623,6 +1698,14 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
-
- lock_sock(sk);
-
-+#ifdef CONFIG_MPTCP
-+ if (mptcp(tp)) {
-+ struct sock *sk_it;
-+ mptcp_for_each_sk(tp->mpcb, sk_it)
-+ sock_rps_record_flow(sk_it);
-+ }
-+#endif
-+
- err = -ENOTCONN;
- if (sk->sk_state == TCP_LISTEN)
- goto out;
-@@ -1761,7 +1844,7 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
- }
- }
-
-- tcp_cleanup_rbuf(sk, copied);
-+ tp->ops->cleanup_rbuf(sk, copied);
-
- if (!sysctl_tcp_low_latency && tp->ucopy.task == user_recv) {
- /* Install new reader */
-@@ -1813,7 +1896,7 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
- if (tp->rcv_wnd == 0 &&
- !skb_queue_empty(&sk->sk_async_wait_queue)) {
- tcp_service_net_dma(sk, true);
-- tcp_cleanup_rbuf(sk, copied);
-+ tp->ops->cleanup_rbuf(sk, copied);
- } else
- dma_async_issue_pending(tp->ucopy.dma_chan);
- }
-@@ -1993,7 +2076,7 @@ skip_copy:
- */
-
- /* Clean up data we have read: This will do ACK frames. */
-- tcp_cleanup_rbuf(sk, copied);
-+ tp->ops->cleanup_rbuf(sk, copied);
-
- release_sock(sk);
- return copied;
-@@ -2070,7 +2153,7 @@ static const unsigned char new_state[16] = {
- /* TCP_CLOSING */ TCP_CLOSING,
- };
-
--static int tcp_close_state(struct sock *sk)
-+int tcp_close_state(struct sock *sk)
- {
- int next = (int)new_state[sk->sk_state];
- int ns = next & TCP_STATE_MASK;
-@@ -2100,7 +2183,7 @@ void tcp_shutdown(struct sock *sk, int how)
- TCPF_SYN_RECV | TCPF_CLOSE_WAIT)) {
- /* Clear out any half completed packets. FIN if needed. */
- if (tcp_close_state(sk))
-- tcp_send_fin(sk);
-+ tcp_sk(sk)->ops->send_fin(sk);
- }
- }
- EXPORT_SYMBOL(tcp_shutdown);
-@@ -2125,6 +2208,11 @@ void tcp_close(struct sock *sk, long timeout)
- int data_was_unread = 0;
- int state;
-
-+ if (is_meta_sk(sk)) {
-+ mptcp_close(sk, timeout);
-+ return;
-+ }
-+
- lock_sock(sk);
- sk->sk_shutdown = SHUTDOWN_MASK;
-
-@@ -2167,7 +2255,7 @@ void tcp_close(struct sock *sk, long timeout)
- /* Unread data was tossed, zap the connection. */
- NET_INC_STATS_USER(sock_net(sk), LINUX_MIB_TCPABORTONCLOSE);
- tcp_set_state(sk, TCP_CLOSE);
-- tcp_send_active_reset(sk, sk->sk_allocation);
-+ tcp_sk(sk)->ops->send_active_reset(sk, sk->sk_allocation);
- } else if (sock_flag(sk, SOCK_LINGER) && !sk->sk_lingertime) {
- /* Check zero linger _after_ checking for unread data. */
- sk->sk_prot->disconnect(sk, 0);
-@@ -2247,7 +2335,7 @@ adjudge_to_death:
- struct tcp_sock *tp = tcp_sk(sk);
- if (tp->linger2 < 0) {
- tcp_set_state(sk, TCP_CLOSE);
-- tcp_send_active_reset(sk, GFP_ATOMIC);
-+ tp->ops->send_active_reset(sk, GFP_ATOMIC);
- NET_INC_STATS_BH(sock_net(sk),
- LINUX_MIB_TCPABORTONLINGER);
- } else {
-@@ -2257,7 +2345,8 @@ adjudge_to_death:
- inet_csk_reset_keepalive_timer(sk,
- tmo - TCP_TIMEWAIT_LEN);
- } else {
-- tcp_time_wait(sk, TCP_FIN_WAIT2, tmo);
-+ tcp_sk(sk)->ops->time_wait(sk, TCP_FIN_WAIT2,
-+ tmo);
- goto out;
- }
- }
-@@ -2266,7 +2355,7 @@ adjudge_to_death:
- sk_mem_reclaim(sk);
- if (tcp_check_oom(sk, 0)) {
- tcp_set_state(sk, TCP_CLOSE);
-- tcp_send_active_reset(sk, GFP_ATOMIC);
-+ tcp_sk(sk)->ops->send_active_reset(sk, GFP_ATOMIC);
- NET_INC_STATS_BH(sock_net(sk),
- LINUX_MIB_TCPABORTONMEMORY);
- }
-@@ -2291,15 +2380,6 @@ out:
- }
- EXPORT_SYMBOL(tcp_close);
-
--/* These states need RST on ABORT according to RFC793 */
--
--static inline bool tcp_need_reset(int state)
--{
-- return (1 << state) &
-- (TCPF_ESTABLISHED | TCPF_CLOSE_WAIT | TCPF_FIN_WAIT1 |
-- TCPF_FIN_WAIT2 | TCPF_SYN_RECV);
--}
--
- int tcp_disconnect(struct sock *sk, int flags)
- {
- struct inet_sock *inet = inet_sk(sk);
-@@ -2322,7 +2402,7 @@ int tcp_disconnect(struct sock *sk, int flags)
- /* The last check adjusts for discrepancy of Linux wrt. RFC
- * states
- */
-- tcp_send_active_reset(sk, gfp_any());
-+ tp->ops->send_active_reset(sk, gfp_any());
- sk->sk_err = ECONNRESET;
- } else if (old_state == TCP_SYN_SENT)
- sk->sk_err = ECONNRESET;
-@@ -2340,6 +2420,13 @@ int tcp_disconnect(struct sock *sk, int flags)
- if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK))
- inet_reset_saddr(sk);
-
-+ if (is_meta_sk(sk)) {
-+ mptcp_disconnect(sk);
-+ } else {
-+ if (tp->inside_tk_table)
-+ mptcp_hash_remove_bh(tp);
-+ }
-+
- sk->sk_shutdown = 0;
- sock_reset_flag(sk, SOCK_DONE);
- tp->srtt_us = 0;
-@@ -2632,6 +2719,12 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
- break;
-
- case TCP_DEFER_ACCEPT:
-+ /* An established MPTCP-connection (mptcp(tp) only returns true
-+ * if the socket is established) should not use DEFER on new
-+ * subflows.
-+ */
-+ if (mptcp(tp))
-+ break;
- /* Translate value in seconds to number of retransmits */
- icsk->icsk_accept_queue.rskq_defer_accept =
- secs_to_retrans(val, TCP_TIMEOUT_INIT / HZ,
-@@ -2659,7 +2752,7 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
- (TCPF_ESTABLISHED | TCPF_CLOSE_WAIT) &&
- inet_csk_ack_scheduled(sk)) {
- icsk->icsk_ack.pending |= ICSK_ACK_PUSHED;
-- tcp_cleanup_rbuf(sk, 1);
-+ tp->ops->cleanup_rbuf(sk, 1);
- if (!(val & 1))
- icsk->icsk_ack.pingpong = 1;
- }
-@@ -2699,6 +2792,18 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
- tp->notsent_lowat = val;
- sk->sk_write_space(sk);
- break;
-+#ifdef CONFIG_MPTCP
-+ case MPTCP_ENABLED:
-+ if (sk->sk_state == TCP_CLOSE || sk->sk_state == TCP_LISTEN) {
-+ if (val)
-+ tp->mptcp_enabled = 1;
-+ else
-+ tp->mptcp_enabled = 0;
-+ } else {
-+ err = -EPERM;
-+ }
-+ break;
-+#endif
- default:
- err = -ENOPROTOOPT;
- break;
-@@ -2931,6 +3036,11 @@ static int do_tcp_getsockopt(struct sock *sk, int level,
- case TCP_NOTSENT_LOWAT:
- val = tp->notsent_lowat;
- break;
-+#ifdef CONFIG_MPTCP
-+ case MPTCP_ENABLED:
-+ val = tp->mptcp_enabled;
-+ break;
-+#endif
- default:
- return -ENOPROTOOPT;
- }
-@@ -3120,8 +3230,11 @@ void tcp_done(struct sock *sk)
- if (sk->sk_state == TCP_SYN_SENT || sk->sk_state == TCP_SYN_RECV)
- TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_ATTEMPTFAILS);
-
-+ WARN_ON(sk->sk_state == TCP_CLOSE);
- tcp_set_state(sk, TCP_CLOSE);
-+
- tcp_clear_xmit_timers(sk);
-+
- if (req != NULL)
- reqsk_fastopen_remove(sk, req, false);
-
-diff --git a/net/ipv4/tcp_fastopen.c b/net/ipv4/tcp_fastopen.c
-index 9771563ab564..5c230d96c4c1 100644
---- a/net/ipv4/tcp_fastopen.c
-+++ b/net/ipv4/tcp_fastopen.c
-@@ -7,6 +7,7 @@
- #include <linux/rculist.h>
- #include <net/inetpeer.h>
- #include <net/tcp.h>
-+#include <net/mptcp.h>
-
- int sysctl_tcp_fastopen __read_mostly = TFO_CLIENT_ENABLE;
-
-@@ -133,7 +134,7 @@ static bool tcp_fastopen_create_child(struct sock *sk,
- {
- struct tcp_sock *tp;
- struct request_sock_queue *queue = &inet_csk(sk)->icsk_accept_queue;
-- struct sock *child;
-+ struct sock *child, *meta_sk;
-
- req->num_retrans = 0;
- req->num_timeout = 0;
-@@ -176,13 +177,6 @@ static bool tcp_fastopen_create_child(struct sock *sk,
- /* Add the child socket directly into the accept queue */
- inet_csk_reqsk_queue_add(sk, req, child);
-
-- /* Now finish processing the fastopen child socket. */
-- inet_csk(child)->icsk_af_ops->rebuild_header(child);
-- tcp_init_congestion_control(child);
-- tcp_mtup_init(child);
-- tcp_init_metrics(child);
-- tcp_init_buffer_space(child);
--
- /* Queue the data carried in the SYN packet. We need to first
- * bump skb's refcnt because the caller will attempt to free it.
- *
-@@ -199,8 +193,24 @@ static bool tcp_fastopen_create_child(struct sock *sk,
- tp->syn_data_acked = 1;
- }
- tcp_rsk(req)->rcv_nxt = tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq;
-+
-+ meta_sk = child;
-+ if (!mptcp_check_req_fastopen(meta_sk, req)) {
-+ child = tcp_sk(meta_sk)->mpcb->master_sk;
-+ tp = tcp_sk(child);
-+ }
-+
-+ /* Now finish processing the fastopen child socket. */
-+ inet_csk(child)->icsk_af_ops->rebuild_header(child);
-+ tp->ops->init_congestion_control(child);
-+ tcp_mtup_init(child);
-+ tcp_init_metrics(child);
-+ tp->ops->init_buffer_space(child);
-+
- sk->sk_data_ready(sk);
-- bh_unlock_sock(child);
-+ if (mptcp(tcp_sk(child)))
-+ bh_unlock_sock(child);
-+ bh_unlock_sock(meta_sk);
- sock_put(child);
- WARN_ON(req->sk == NULL);
- return true;
-diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
-index 40639c288dc2..3273bb69f387 100644
---- a/net/ipv4/tcp_input.c
-+++ b/net/ipv4/tcp_input.c
-@@ -74,6 +74,9 @@
- #include <linux/ipsec.h>
- #include <asm/unaligned.h>
- #include <net/netdma.h>
-+#include <net/mptcp.h>
-+#include <net/mptcp_v4.h>
-+#include <net/mptcp_v6.h>
-
- int sysctl_tcp_timestamps __read_mostly = 1;
- int sysctl_tcp_window_scaling __read_mostly = 1;
-@@ -99,25 +102,6 @@ int sysctl_tcp_thin_dupack __read_mostly;
- int sysctl_tcp_moderate_rcvbuf __read_mostly = 1;
- int sysctl_tcp_early_retrans __read_mostly = 3;
-
--#define FLAG_DATA 0x01 /* Incoming frame contained data. */
--#define FLAG_WIN_UPDATE 0x02 /* Incoming ACK was a window update. */
--#define FLAG_DATA_ACKED 0x04 /* This ACK acknowledged new data. */
--#define FLAG_RETRANS_DATA_ACKED 0x08 /* "" "" some of which was retransmitted. */
--#define FLAG_SYN_ACKED 0x10 /* This ACK acknowledged SYN. */
--#define FLAG_DATA_SACKED 0x20 /* New SACK. */
--#define FLAG_ECE 0x40 /* ECE in this ACK */
--#define FLAG_SLOWPATH 0x100 /* Do not skip RFC checks for window update.*/
--#define FLAG_ORIG_SACK_ACKED 0x200 /* Never retransmitted data are (s)acked */
--#define FLAG_SND_UNA_ADVANCED 0x400 /* Snd_una was changed (!= FLAG_DATA_ACKED) */
--#define FLAG_DSACKING_ACK 0x800 /* SACK blocks contained D-SACK info */
--#define FLAG_SACK_RENEGING 0x2000 /* snd_una advanced to a sacked seq */
--#define FLAG_UPDATE_TS_RECENT 0x4000 /* tcp_replace_ts_recent() */
--
--#define FLAG_ACKED (FLAG_DATA_ACKED|FLAG_SYN_ACKED)
--#define FLAG_NOT_DUP (FLAG_DATA|FLAG_WIN_UPDATE|FLAG_ACKED)
--#define FLAG_CA_ALERT (FLAG_DATA_SACKED|FLAG_ECE)
--#define FLAG_FORWARD_PROGRESS (FLAG_ACKED|FLAG_DATA_SACKED)
--
- #define TCP_REMNANT (TCP_FLAG_FIN|TCP_FLAG_URG|TCP_FLAG_SYN|TCP_FLAG_PSH)
- #define TCP_HP_BITS (~(TCP_RESERVED_BITS|TCP_FLAG_PSH))
-
-@@ -181,7 +165,7 @@ static void tcp_incr_quickack(struct sock *sk)
- icsk->icsk_ack.quick = min(quickacks, TCP_MAX_QUICKACKS);
- }
-
--static void tcp_enter_quickack_mode(struct sock *sk)
-+void tcp_enter_quickack_mode(struct sock *sk)
- {
- struct inet_connection_sock *icsk = inet_csk(sk);
- tcp_incr_quickack(sk);
-@@ -283,8 +267,12 @@ static void tcp_sndbuf_expand(struct sock *sk)
- per_mss = roundup_pow_of_two(per_mss) +
- SKB_DATA_ALIGN(sizeof(struct sk_buff));
-
-- nr_segs = max_t(u32, TCP_INIT_CWND, tp->snd_cwnd);
-- nr_segs = max_t(u32, nr_segs, tp->reordering + 1);
-+ if (mptcp(tp)) {
-+ nr_segs = mptcp_check_snd_buf(tp);
-+ } else {
-+ nr_segs = max_t(u32, TCP_INIT_CWND, tp->snd_cwnd);
-+ nr_segs = max_t(u32, nr_segs, tp->reordering + 1);
-+ }
-
- /* Fast Recovery (RFC 5681 3.2) :
- * Cubic needs 1.7 factor, rounded to 2 to include
-@@ -292,8 +280,16 @@ static void tcp_sndbuf_expand(struct sock *sk)
- */
- sndmem = 2 * nr_segs * per_mss;
-
-- if (sk->sk_sndbuf < sndmem)
-+	/* MPTCP: after this, sndmem is the new contribution of the
-+ * current subflow to the aggregated sndbuf */
-+ if (sk->sk_sndbuf < sndmem) {
-+ int old_sndbuf = sk->sk_sndbuf;
- sk->sk_sndbuf = min(sndmem, sysctl_tcp_wmem[2]);
-+ /* MPTCP: ok, the subflow sndbuf has grown, reflect
-+	 * this in the aggregate buffer. */
-+ if (mptcp(tp) && old_sndbuf != sk->sk_sndbuf)
-+ mptcp_update_sndbuf(tp);
-+ }
- }
-
- /* 2. Tuning advertised window (window_clamp, rcv_ssthresh)
-@@ -342,10 +338,12 @@ static int __tcp_grow_window(const struct sock *sk, const struct sk_buff *skb)
- static void tcp_grow_window(struct sock *sk, const struct sk_buff *skb)
- {
- struct tcp_sock *tp = tcp_sk(sk);
-+ struct sock *meta_sk = mptcp(tp) ? mptcp_meta_sk(sk) : sk;
-+ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
-
- /* Check #1 */
-- if (tp->rcv_ssthresh < tp->window_clamp &&
-- (int)tp->rcv_ssthresh < tcp_space(sk) &&
-+ if (meta_tp->rcv_ssthresh < meta_tp->window_clamp &&
-+ (int)meta_tp->rcv_ssthresh < tcp_space(sk) &&
- !sk_under_memory_pressure(sk)) {
- int incr;
-
-@@ -353,14 +351,14 @@ static void tcp_grow_window(struct sock *sk, const struct sk_buff *skb)
- * will fit to rcvbuf in future.
- */
- if (tcp_win_from_space(skb->truesize) <= skb->len)
-- incr = 2 * tp->advmss;
-+ incr = 2 * meta_tp->advmss;
- else
-- incr = __tcp_grow_window(sk, skb);
-+ incr = __tcp_grow_window(meta_sk, skb);
-
- if (incr) {
- incr = max_t(int, incr, 2 * skb->len);
-- tp->rcv_ssthresh = min(tp->rcv_ssthresh + incr,
-- tp->window_clamp);
-+ meta_tp->rcv_ssthresh = min(meta_tp->rcv_ssthresh + incr,
-+ meta_tp->window_clamp);
- inet_csk(sk)->icsk_ack.quick |= 1;
- }
- }
-@@ -543,7 +541,10 @@ void tcp_rcv_space_adjust(struct sock *sk)
- int copied;
-
- time = tcp_time_stamp - tp->rcvq_space.time;
-- if (time < (tp->rcv_rtt_est.rtt >> 3) || tp->rcv_rtt_est.rtt == 0)
-+ if (mptcp(tp)) {
-+ if (mptcp_check_rtt(tp, time))
-+ return;
-+ } else if (time < (tp->rcv_rtt_est.rtt >> 3) || tp->rcv_rtt_est.rtt == 0)
- return;
-
- /* Number of bytes copied to user in last RTT */
-@@ -761,7 +762,7 @@ static void tcp_update_pacing_rate(struct sock *sk)
- /* Calculate rto without backoff. This is the second half of Van Jacobson's
- * routine referred to above.
- */
--static void tcp_set_rto(struct sock *sk)
-+void tcp_set_rto(struct sock *sk)
- {
- const struct tcp_sock *tp = tcp_sk(sk);
- /* Old crap is replaced with new one. 8)
-@@ -1376,7 +1377,11 @@ static struct sk_buff *tcp_shift_skb_data(struct sock *sk, struct sk_buff *skb,
- int len;
- int in_sack;
-
-- if (!sk_can_gso(sk))
-+ /* For MPTCP we cannot shift skb-data and remove one skb from the
-+	 * send-queue, because this will make us lose the DSS-option (which
-+ * is stored in TCP_SKB_CB(skb)->dss) of the skb we are removing.
-+ */
-+ if (!sk_can_gso(sk) || mptcp(tp))
- goto fallback;
-
- /* Normally R but no L won't result in plain S */
-@@ -2915,7 +2920,7 @@ static inline bool tcp_ack_update_rtt(struct sock *sk, const int flag,
- return false;
-
- tcp_rtt_estimator(sk, seq_rtt_us);
-- tcp_set_rto(sk);
-+ tp->ops->set_rto(sk);
-
- /* RFC6298: only reset backoff on valid RTT measurement. */
- inet_csk(sk)->icsk_backoff = 0;
-@@ -3000,7 +3005,7 @@ void tcp_resume_early_retransmit(struct sock *sk)
- }
-
- /* If we get here, the whole TSO packet has not been acked. */
--static u32 tcp_tso_acked(struct sock *sk, struct sk_buff *skb)
-+u32 tcp_tso_acked(struct sock *sk, struct sk_buff *skb)
- {
- struct tcp_sock *tp = tcp_sk(sk);
- u32 packets_acked;
-@@ -3095,6 +3100,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, int prior_fackets,
- */
- if (!(scb->tcp_flags & TCPHDR_SYN)) {
- flag |= FLAG_DATA_ACKED;
-+ if (mptcp(tp) && mptcp_is_data_seq(skb))
-+ flag |= MPTCP_FLAG_DATA_ACKED;
- } else {
- flag |= FLAG_SYN_ACKED;
- tp->retrans_stamp = 0;
-@@ -3189,7 +3196,7 @@ static int tcp_clean_rtx_queue(struct sock *sk, int prior_fackets,
- return flag;
- }
-
--static void tcp_ack_probe(struct sock *sk)
-+void tcp_ack_probe(struct sock *sk)
- {
- const struct tcp_sock *tp = tcp_sk(sk);
- struct inet_connection_sock *icsk = inet_csk(sk);
-@@ -3236,9 +3243,8 @@ static inline bool tcp_may_raise_cwnd(const struct sock *sk, const int flag)
- /* Check that window update is acceptable.
- * The function assumes that snd_una<=ack<=snd_next.
- */
--static inline bool tcp_may_update_window(const struct tcp_sock *tp,
-- const u32 ack, const u32 ack_seq,
-- const u32 nwin)
-+bool tcp_may_update_window(const struct tcp_sock *tp, const u32 ack,
-+ const u32 ack_seq, const u32 nwin)
- {
- return after(ack, tp->snd_una) ||
- after(ack_seq, tp->snd_wl1) ||
-@@ -3357,7 +3363,7 @@ static void tcp_process_tlp_ack(struct sock *sk, u32 ack, int flag)
- }
-
- /* This routine deals with incoming acks, but not outgoing ones. */
--static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
-+static int tcp_ack(struct sock *sk, struct sk_buff *skb, int flag)
- {
- struct inet_connection_sock *icsk = inet_csk(sk);
- struct tcp_sock *tp = tcp_sk(sk);
-@@ -3449,6 +3455,16 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
- sack_rtt_us);
- acked -= tp->packets_out;
-
-+ if (mptcp(tp)) {
-+ if (mptcp_fallback_infinite(sk, flag)) {
-+ pr_err("%s resetting flow\n", __func__);
-+ mptcp_send_reset(sk);
-+ goto invalid_ack;
-+ }
-+
-+ mptcp_clean_rtx_infinite(skb, sk);
-+ }
-+
- /* Advance cwnd if state allows */
- if (tcp_may_raise_cwnd(sk, flag))
- tcp_cong_avoid(sk, ack, acked);
-@@ -3512,8 +3528,9 @@ old_ack:
- * the fast version below fails.
- */
- void tcp_parse_options(const struct sk_buff *skb,
-- struct tcp_options_received *opt_rx, int estab,
-- struct tcp_fastopen_cookie *foc)
-+ struct tcp_options_received *opt_rx,
-+ struct mptcp_options_received *mopt,
-+ int estab, struct tcp_fastopen_cookie *foc)
- {
- const unsigned char *ptr;
- const struct tcphdr *th = tcp_hdr(skb);
-@@ -3596,6 +3613,9 @@ void tcp_parse_options(const struct sk_buff *skb,
- */
- break;
- #endif
-+ case TCPOPT_MPTCP:
-+ mptcp_parse_options(ptr - 2, opsize, mopt, skb);
-+ break;
- case TCPOPT_EXP:
- /* Fast Open option shares code 254 using a
- * 16 bits magic number. It's valid only in
-@@ -3657,8 +3677,8 @@ static bool tcp_fast_parse_options(const struct sk_buff *skb,
- if (tcp_parse_aligned_timestamp(tp, th))
- return true;
- }
--
-- tcp_parse_options(skb, &tp->rx_opt, 1, NULL);
-+ tcp_parse_options(skb, &tp->rx_opt, mptcp(tp) ? &tp->mptcp->rx_opt : NULL,
-+ 1, NULL);
- if (tp->rx_opt.saw_tstamp && tp->rx_opt.rcv_tsecr)
- tp->rx_opt.rcv_tsecr -= tp->tsoffset;
-
-@@ -3831,6 +3851,8 @@ static void tcp_fin(struct sock *sk)
- dst = __sk_dst_get(sk);
- if (!dst || !dst_metric(dst, RTAX_QUICKACK))
- inet_csk(sk)->icsk_ack.pingpong = 1;
-+ if (mptcp(tp))
-+ mptcp_sub_close_passive(sk);
- break;
-
- case TCP_CLOSE_WAIT:
-@@ -3852,9 +3874,16 @@ static void tcp_fin(struct sock *sk)
- tcp_set_state(sk, TCP_CLOSING);
- break;
- case TCP_FIN_WAIT2:
-+ if (mptcp(tp)) {
-+ /* The socket will get closed by mptcp_data_ready.
-+ * We first have to process all data-sequences.
-+ */
-+ tp->close_it = 1;
-+ break;
-+ }
- /* Received a FIN -- send ACK and enter TIME_WAIT. */
- tcp_send_ack(sk);
-- tcp_time_wait(sk, TCP_TIME_WAIT, 0);
-+ tp->ops->time_wait(sk, TCP_TIME_WAIT, 0);
- break;
- default:
- /* Only TCP_LISTEN and TCP_CLOSE are left, in these
-@@ -3876,6 +3905,10 @@ static void tcp_fin(struct sock *sk)
- if (!sock_flag(sk, SOCK_DEAD)) {
- sk->sk_state_change(sk);
-
-+ /* Don't wake up MPTCP-subflows */
-+ if (mptcp(tp))
-+ return;
-+
- /* Do not send POLL_HUP for half duplex close. */
- if (sk->sk_shutdown == SHUTDOWN_MASK ||
- sk->sk_state == TCP_CLOSE)
-@@ -4073,7 +4106,11 @@ static void tcp_ofo_queue(struct sock *sk)
- tcp_dsack_extend(sk, TCP_SKB_CB(skb)->seq, dsack);
- }
-
-- if (!after(TCP_SKB_CB(skb)->end_seq, tp->rcv_nxt)) {
-+ /* In case of MPTCP, the segment may be empty if it's a
-+ * non-data DATA_FIN. (see beginning of tcp_data_queue)
-+ */
-+ if (!after(TCP_SKB_CB(skb)->end_seq, tp->rcv_nxt) &&
-+ !(mptcp(tp) && TCP_SKB_CB(skb)->end_seq == TCP_SKB_CB(skb)->seq)) {
- SOCK_DEBUG(sk, "ofo packet was already received\n");
- __skb_unlink(skb, &tp->out_of_order_queue);
- __kfree_skb(skb);
-@@ -4091,12 +4128,14 @@ static void tcp_ofo_queue(struct sock *sk)
- }
- }
-
--static bool tcp_prune_ofo_queue(struct sock *sk);
- static int tcp_prune_queue(struct sock *sk);
-
- static int tcp_try_rmem_schedule(struct sock *sk, struct sk_buff *skb,
- unsigned int size)
- {
-+ if (mptcp(tcp_sk(sk)))
-+ sk = mptcp_meta_sk(sk);
-+
- if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf ||
- !sk_rmem_schedule(sk, skb, size)) {
-
-@@ -4104,7 +4143,7 @@ static int tcp_try_rmem_schedule(struct sock *sk, struct sk_buff *skb,
- return -1;
-
- if (!sk_rmem_schedule(sk, skb, size)) {
-- if (!tcp_prune_ofo_queue(sk))
-+ if (!tcp_sk(sk)->ops->prune_ofo_queue(sk))
- return -1;
-
- if (!sk_rmem_schedule(sk, skb, size))
-@@ -4127,15 +4166,16 @@ static int tcp_try_rmem_schedule(struct sock *sk, struct sk_buff *skb,
- * Better try to coalesce them right now to avoid future collapses.
- * Returns true if caller should free @from instead of queueing it
- */
--static bool tcp_try_coalesce(struct sock *sk,
-- struct sk_buff *to,
-- struct sk_buff *from,
-- bool *fragstolen)
-+bool tcp_try_coalesce(struct sock *sk, struct sk_buff *to, struct sk_buff *from,
-+ bool *fragstolen)
- {
- int delta;
-
- *fragstolen = false;
-
-+ if (mptcp(tcp_sk(sk)) && !is_meta_sk(sk))
-+ return false;
-+
- if (tcp_hdr(from)->fin)
- return false;
-
-@@ -4225,7 +4265,9 @@ static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
-
- /* Do skb overlap to previous one? */
- if (skb1 && before(seq, TCP_SKB_CB(skb1)->end_seq)) {
-- if (!after(end_seq, TCP_SKB_CB(skb1)->end_seq)) {
-+ /* MPTCP allows non-data data-fin to be in the ofo-queue */
-+ if (!after(end_seq, TCP_SKB_CB(skb1)->end_seq) &&
-+ !(mptcp(tp) && end_seq == seq)) {
- /* All the bits are present. Drop. */
- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPOFOMERGE);
- __kfree_skb(skb);
-@@ -4263,6 +4305,9 @@ static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
- end_seq);
- break;
- }
-+ /* MPTCP allows non-data data-fin to be in the ofo-queue */
-+ if (mptcp(tp) && TCP_SKB_CB(skb1)->seq == TCP_SKB_CB(skb1)->end_seq)
-+ continue;
- __skb_unlink(skb1, &tp->out_of_order_queue);
- tcp_dsack_extend(sk, TCP_SKB_CB(skb1)->seq,
- TCP_SKB_CB(skb1)->end_seq);
-@@ -4280,8 +4325,8 @@ end:
- }
- }
-
--static int __must_check tcp_queue_rcv(struct sock *sk, struct sk_buff *skb, int hdrlen,
-- bool *fragstolen)
-+int __must_check tcp_queue_rcv(struct sock *sk, struct sk_buff *skb, int hdrlen,
-+ bool *fragstolen)
- {
- int eaten;
- struct sk_buff *tail = skb_peek_tail(&sk->sk_receive_queue);
-@@ -4343,7 +4388,10 @@ static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)
- int eaten = -1;
- bool fragstolen = false;
-
-- if (TCP_SKB_CB(skb)->seq == TCP_SKB_CB(skb)->end_seq)
-+ /* If no data is present, but a data_fin is in the options, we still
-+ * have to call mptcp_queue_skb later on. */
-+ if (TCP_SKB_CB(skb)->seq == TCP_SKB_CB(skb)->end_seq &&
-+ !(mptcp(tp) && mptcp_is_data_fin(skb)))
- goto drop;
-
- skb_dst_drop(skb);
-@@ -4389,7 +4437,7 @@ queue_and_out:
- eaten = tcp_queue_rcv(sk, skb, 0, &fragstolen);
- }
- tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq;
-- if (skb->len)
-+ if (skb->len || mptcp_is_data_fin(skb))
- tcp_event_data_recv(sk, skb);
- if (th->fin)
- tcp_fin(sk);
-@@ -4411,7 +4459,11 @@ queue_and_out:
-
- if (eaten > 0)
- kfree_skb_partial(skb, fragstolen);
-- if (!sock_flag(sk, SOCK_DEAD))
-+ if (!sock_flag(sk, SOCK_DEAD) || mptcp(tp))
-+ /* MPTCP: we always have to call data_ready, because
-+ * we may be about to receive a data-fin, which still
-+ * must get queued.
-+ */
- sk->sk_data_ready(sk);
- return;
- }
-@@ -4463,6 +4515,8 @@ static struct sk_buff *tcp_collapse_one(struct sock *sk, struct sk_buff *skb,
- next = skb_queue_next(list, skb);
-
- __skb_unlink(skb, list);
-+ if (mptcp(tcp_sk(sk)))
-+ mptcp_remove_shortcuts(tcp_sk(sk)->mpcb, skb);
- __kfree_skb(skb);
- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPRCVCOLLAPSED);
-
-@@ -4630,7 +4684,7 @@ static void tcp_collapse_ofo_queue(struct sock *sk)
- * Purge the out-of-order queue.
- * Return true if queue was pruned.
- */
--static bool tcp_prune_ofo_queue(struct sock *sk)
-+bool tcp_prune_ofo_queue(struct sock *sk)
- {
- struct tcp_sock *tp = tcp_sk(sk);
- bool res = false;
-@@ -4686,7 +4740,7 @@ static int tcp_prune_queue(struct sock *sk)
- /* Collapsing did not help, destructive actions follow.
- * This must not ever occur. */
-
-- tcp_prune_ofo_queue(sk);
-+ tp->ops->prune_ofo_queue(sk);
-
- if (atomic_read(&sk->sk_rmem_alloc) <= sk->sk_rcvbuf)
- return 0;
-@@ -4702,7 +4756,29 @@ static int tcp_prune_queue(struct sock *sk)
- return -1;
- }
-
--static bool tcp_should_expand_sndbuf(const struct sock *sk)
-+/* RFC2861, slow part. Adjust cwnd, after it was not full during one rto.
-+ * As additional protections, we do not touch cwnd in retransmission phases,
-+ * and if application hit its sndbuf limit recently.
-+ */
-+void tcp_cwnd_application_limited(struct sock *sk)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+
-+ if (inet_csk(sk)->icsk_ca_state == TCP_CA_Open &&
-+ sk->sk_socket && !test_bit(SOCK_NOSPACE, &sk->sk_socket->flags)) {
-+ /* Limited by application or receiver window. */
-+ u32 init_win = tcp_init_cwnd(tp, __sk_dst_get(sk));
-+ u32 win_used = max(tp->snd_cwnd_used, init_win);
-+ if (win_used < tp->snd_cwnd) {
-+ tp->snd_ssthresh = tcp_current_ssthresh(sk);
-+ tp->snd_cwnd = (tp->snd_cwnd + win_used) >> 1;
-+ }
-+ tp->snd_cwnd_used = 0;
-+ }
-+ tp->snd_cwnd_stamp = tcp_time_stamp;
-+}
-+
-+bool tcp_should_expand_sndbuf(const struct sock *sk)
- {
- const struct tcp_sock *tp = tcp_sk(sk);
-
-@@ -4737,7 +4813,7 @@ static void tcp_new_space(struct sock *sk)
- {
- struct tcp_sock *tp = tcp_sk(sk);
-
-- if (tcp_should_expand_sndbuf(sk)) {
-+ if (tp->ops->should_expand_sndbuf(sk)) {
- tcp_sndbuf_expand(sk);
- tp->snd_cwnd_stamp = tcp_time_stamp;
- }
-@@ -4749,8 +4825,9 @@ static void tcp_check_space(struct sock *sk)
- {
- if (sock_flag(sk, SOCK_QUEUE_SHRUNK)) {
- sock_reset_flag(sk, SOCK_QUEUE_SHRUNK);
-- if (sk->sk_socket &&
-- test_bit(SOCK_NOSPACE, &sk->sk_socket->flags))
-+ if (mptcp(tcp_sk(sk)) ||
-+ (sk->sk_socket &&
-+ test_bit(SOCK_NOSPACE, &sk->sk_socket->flags)))
- tcp_new_space(sk);
- }
- }
-@@ -4773,7 +4850,7 @@ static void __tcp_ack_snd_check(struct sock *sk, int ofo_possible)
- /* ... and right edge of window advances far enough.
- * (tcp_recvmsg() will send ACK otherwise). Or...
- */
-- __tcp_select_window(sk) >= tp->rcv_wnd) ||
-+ tp->ops->__select_window(sk) >= tp->rcv_wnd) ||
- /* We ACK each frame or... */
- tcp_in_quickack_mode(sk) ||
- /* We have out of order data. */
-@@ -4875,6 +4952,10 @@ static void tcp_urg(struct sock *sk, struct sk_buff *skb, const struct tcphdr *t
- {
- struct tcp_sock *tp = tcp_sk(sk);
-
-+ /* MPTCP urgent data is not yet supported */
-+ if (mptcp(tp))
-+ return;
-+
- /* Check if we get a new urgent pointer - normally not. */
- if (th->urg)
- tcp_check_urg(sk, th);
-@@ -4942,8 +5023,7 @@ static inline bool tcp_checksum_complete_user(struct sock *sk,
- }
-
- #ifdef CONFIG_NET_DMA
--static bool tcp_dma_try_early_copy(struct sock *sk, struct sk_buff *skb,
-- int hlen)
-+bool tcp_dma_try_early_copy(struct sock *sk, struct sk_buff *skb, int hlen)
- {
- struct tcp_sock *tp = tcp_sk(sk);
- int chunk = skb->len - hlen;
-@@ -5052,9 +5132,15 @@ syn_challenge:
- goto discard;
- }
-
-+ /* If valid: post process the received MPTCP options. */
-+ if (mptcp(tp) && mptcp_handle_options(sk, th, skb))
-+ goto discard;
-+
- return true;
-
- discard:
-+ if (mptcp(tp))
-+ mptcp_reset_mopt(tp);
- __kfree_skb(skb);
- return false;
- }
-@@ -5106,6 +5192,10 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
-
- tp->rx_opt.saw_tstamp = 0;
-
-+ /* MPTCP: force slowpath. */
-+ if (mptcp(tp))
-+ goto slow_path;
-+
- /* pred_flags is 0xS?10 << 16 + snd_wnd
- * if header_prediction is to be made
- * 'S' will always be tp->tcp_header_len >> 2
-@@ -5205,7 +5295,7 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPHPHITSTOUSER);
- }
- if (copied_early)
-- tcp_cleanup_rbuf(sk, skb->len);
-+ tp->ops->cleanup_rbuf(sk, skb->len);
- }
- if (!eaten) {
- if (tcp_checksum_complete_user(sk, skb))
-@@ -5313,14 +5403,14 @@ void tcp_finish_connect(struct sock *sk, struct sk_buff *skb)
-
- tcp_init_metrics(sk);
-
-- tcp_init_congestion_control(sk);
-+ tp->ops->init_congestion_control(sk);
-
- /* Prevent spurious tcp_cwnd_restart() on first data
- * packet.
- */
- tp->lsndtime = tcp_time_stamp;
-
-- tcp_init_buffer_space(sk);
-+ tp->ops->init_buffer_space(sk);
-
- if (sock_flag(sk, SOCK_KEEPOPEN))
- inet_csk_reset_keepalive_timer(sk, keepalive_time_when(tp));
-@@ -5350,7 +5440,7 @@ static bool tcp_rcv_fastopen_synack(struct sock *sk, struct sk_buff *synack,
- /* Get original SYNACK MSS value if user MSS sets mss_clamp */
- tcp_clear_options(&opt);
- opt.user_mss = opt.mss_clamp = 0;
-- tcp_parse_options(synack, &opt, 0, NULL);
-+ tcp_parse_options(synack, &opt, NULL, 0, NULL);
- mss = opt.mss_clamp;
- }
-
-@@ -5365,7 +5455,11 @@ static bool tcp_rcv_fastopen_synack(struct sock *sk, struct sk_buff *synack,
-
- tcp_fastopen_cache_set(sk, mss, cookie, syn_drop);
-
-- if (data) { /* Retransmit unacked data in SYN */
-+ /* In mptcp case, we do not rely on "retransmit", but instead on
-+ * "transmit", because if fastopen data is not acked, the retransmission
-+ * becomes the first MPTCP data (see mptcp_rcv_synsent_fastopen).
-+ */
-+ if (data && !mptcp(tp)) { /* Retransmit unacked data in SYN */
- tcp_for_write_queue_from(data, sk) {
- if (data == tcp_send_head(sk) ||
- __tcp_retransmit_skb(sk, data))
-@@ -5388,8 +5482,11 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
- struct tcp_sock *tp = tcp_sk(sk);
- struct tcp_fastopen_cookie foc = { .len = -1 };
- int saved_clamp = tp->rx_opt.mss_clamp;
-+ struct mptcp_options_received mopt;
-+ mptcp_init_mp_opt(&mopt);
-
-- tcp_parse_options(skb, &tp->rx_opt, 0, &foc);
-+ tcp_parse_options(skb, &tp->rx_opt,
-+ mptcp(tp) ? &tp->mptcp->rx_opt : &mopt, 0, &foc);
- if (tp->rx_opt.saw_tstamp && tp->rx_opt.rcv_tsecr)
- tp->rx_opt.rcv_tsecr -= tp->tsoffset;
-
-@@ -5448,6 +5545,30 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
- tcp_init_wl(tp, TCP_SKB_CB(skb)->seq);
- tcp_ack(sk, skb, FLAG_SLOWPATH);
-
-+ if (tp->request_mptcp || mptcp(tp)) {
-+ int ret;
-+ ret = mptcp_rcv_synsent_state_process(sk, &sk,
-+ skb, &mopt);
-+
-+ /* May have changed if we support MPTCP */
-+ tp = tcp_sk(sk);
-+ icsk = inet_csk(sk);
-+
-+ if (ret == 1)
-+ goto reset_and_undo;
-+ if (ret == 2)
-+ goto discard;
-+ }
-+
-+ if (mptcp(tp) && !is_master_tp(tp)) {
-+ /* Timer for repeating the ACK until an answer
-+ * arrives. Used only when establishing an additional
-+ * subflow inside of an MPTCP connection.
-+ */
-+ sk_reset_timer(sk, &tp->mptcp->mptcp_ack_timer,
-+ jiffies + icsk->icsk_rto);
-+ }
-+
- /* Ok.. it's good. Set up sequence numbers and
- * move to established.
- */
-@@ -5474,6 +5595,11 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
- tp->tcp_header_len = sizeof(struct tcphdr);
- }
-
-+ if (mptcp(tp)) {
-+ tp->tcp_header_len += MPTCP_SUB_LEN_DSM_ALIGN;
-+ tp->advmss -= MPTCP_SUB_LEN_DSM_ALIGN;
-+ }
-+
- if (tcp_is_sack(tp) && sysctl_tcp_fack)
- tcp_enable_fack(tp);
-
-@@ -5494,9 +5620,12 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
- tcp_rcv_fastopen_synack(sk, skb, &foc))
- return -1;
-
-- if (sk->sk_write_pending ||
-+ /* With MPTCP we cannot send data on the third ack due to the
-+ * lack of option-space to combine with an MP_CAPABLE.
-+ */
-+ if (!mptcp(tp) && (sk->sk_write_pending ||
- icsk->icsk_accept_queue.rskq_defer_accept ||
-- icsk->icsk_ack.pingpong) {
-+ icsk->icsk_ack.pingpong)) {
- /* Save one ACK. Data will be ready after
- * several ticks, if write_pending is set.
- *
-@@ -5536,6 +5665,7 @@ discard:
- tcp_paws_reject(&tp->rx_opt, 0))
- goto discard_and_undo;
-
-+ /* TODO - check this here for MPTCP */
- if (th->syn) {
- /* We see SYN without ACK. It is attempt of
- * simultaneous connect with crossed SYNs.
-@@ -5552,6 +5682,11 @@ discard:
- tp->tcp_header_len = sizeof(struct tcphdr);
- }
-
-+ if (mptcp(tp)) {
-+ tp->tcp_header_len += MPTCP_SUB_LEN_DSM_ALIGN;
-+ tp->advmss -= MPTCP_SUB_LEN_DSM_ALIGN;
-+ }
-+
- tp->rcv_nxt = TCP_SKB_CB(skb)->seq + 1;
- tp->rcv_wup = TCP_SKB_CB(skb)->seq + 1;
-
-@@ -5610,6 +5745,7 @@ reset_and_undo:
-
- int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
- const struct tcphdr *th, unsigned int len)
-+ __releases(&sk->sk_lock.slock)
- {
- struct tcp_sock *tp = tcp_sk(sk);
- struct inet_connection_sock *icsk = inet_csk(sk);
-@@ -5661,6 +5797,16 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
-
- case TCP_SYN_SENT:
- queued = tcp_rcv_synsent_state_process(sk, skb, th, len);
-+ if (is_meta_sk(sk)) {
-+ sk = tcp_sk(sk)->mpcb->master_sk;
-+ tp = tcp_sk(sk);
-+
-+ /* Need to call it here, because it will announce new
-+ * addresses, which can only be done after the third ack
-+ * of the 3-way handshake.
-+ */
-+ mptcp_update_metasocket(sk, tp->meta_sk);
-+ }
- if (queued >= 0)
- return queued;
-
-@@ -5668,6 +5814,8 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
- tcp_urg(sk, skb, th);
- __kfree_skb(skb);
- tcp_data_snd_check(sk);
-+ if (mptcp(tp) && is_master_tp(tp))
-+ bh_unlock_sock(sk);
- return 0;
- }
-
-@@ -5706,11 +5854,11 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
- synack_stamp = tp->lsndtime;
- /* Make sure socket is routed, for correct metrics. */
- icsk->icsk_af_ops->rebuild_header(sk);
-- tcp_init_congestion_control(sk);
-+ tp->ops->init_congestion_control(sk);
-
- tcp_mtup_init(sk);
- tp->copied_seq = tp->rcv_nxt;
-- tcp_init_buffer_space(sk);
-+ tp->ops->init_buffer_space(sk);
- }
- smp_mb();
- tcp_set_state(sk, TCP_ESTABLISHED);
-@@ -5730,6 +5878,8 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
-
- if (tp->rx_opt.tstamp_ok)
- tp->advmss -= TCPOLEN_TSTAMP_ALIGNED;
-+ if (mptcp(tp))
-+ tp->advmss -= MPTCP_SUB_LEN_DSM_ALIGN;
-
- if (req) {
- /* Re-arm the timer because data may have been sent out.
-@@ -5751,6 +5901,12 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
-
- tcp_initialize_rcv_mss(sk);
- tcp_fast_path_on(tp);
-+ /* Send an ACK when establishing a new
-+ * MPTCP subflow, i.e. using an MP_JOIN
-+ * subtype.
-+ */
-+ if (mptcp(tp) && !is_master_tp(tp))
-+ tcp_send_ack(sk);
- break;
-
- case TCP_FIN_WAIT1: {
-@@ -5802,7 +5958,8 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
- tmo = tcp_fin_time(sk);
- if (tmo > TCP_TIMEWAIT_LEN) {
- inet_csk_reset_keepalive_timer(sk, tmo - TCP_TIMEWAIT_LEN);
-- } else if (th->fin || sock_owned_by_user(sk)) {
-+ } else if (th->fin || mptcp_is_data_fin(skb) ||
-+ sock_owned_by_user(sk)) {
- /* Bad case. We could lose such FIN otherwise.
- * It is not a big problem, but it looks confusing
- * and not so rare event. We still can lose it now,
-@@ -5811,7 +5968,7 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
- */
- inet_csk_reset_keepalive_timer(sk, tmo);
- } else {
-- tcp_time_wait(sk, TCP_FIN_WAIT2, tmo);
-+ tp->ops->time_wait(sk, TCP_FIN_WAIT2, tmo);
- goto discard;
- }
- break;
-@@ -5819,7 +5976,7 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
-
- case TCP_CLOSING:
- if (tp->snd_una == tp->write_seq) {
-- tcp_time_wait(sk, TCP_TIME_WAIT, 0);
-+ tp->ops->time_wait(sk, TCP_TIME_WAIT, 0);
- goto discard;
- }
- break;
-@@ -5831,6 +5988,9 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
- goto discard;
- }
- break;
-+ case TCP_CLOSE:
-+ if (tp->mp_killed)
-+ goto discard;
- }
-
- /* step 6: check the URG bit */
-@@ -5851,7 +6011,11 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
- */
- if (sk->sk_shutdown & RCV_SHUTDOWN) {
- if (TCP_SKB_CB(skb)->end_seq != TCP_SKB_CB(skb)->seq &&
-- after(TCP_SKB_CB(skb)->end_seq - th->fin, tp->rcv_nxt)) {
-+ after(TCP_SKB_CB(skb)->end_seq - th->fin, tp->rcv_nxt) &&
-+ !mptcp(tp)) {
-+ /* In case of mptcp, the reset is handled by
-+ * mptcp_rcv_state_process
-+ */
- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPABORTONDATA);
- tcp_reset(sk);
- return 1;
-@@ -5877,3 +6041,154 @@ discard:
- return 0;
- }
- EXPORT_SYMBOL(tcp_rcv_state_process);
-+
-+static inline void pr_drop_req(struct request_sock *req, __u16 port, int family)
-+{
-+ struct inet_request_sock *ireq = inet_rsk(req);
-+
-+ if (family == AF_INET)
-+ LIMIT_NETDEBUG(KERN_DEBUG pr_fmt("drop open request from %pI4/%u\n"),
-+ &ireq->ir_rmt_addr, port);
-+#if IS_ENABLED(CONFIG_IPV6)
-+ else if (family == AF_INET6)
-+ LIMIT_NETDEBUG(KERN_DEBUG pr_fmt("drop open request from %pI6/%u\n"),
-+ &ireq->ir_v6_rmt_addr, port);
-+#endif
-+}
-+
-+int tcp_conn_request(struct request_sock_ops *rsk_ops,
-+ const struct tcp_request_sock_ops *af_ops,
-+ struct sock *sk, struct sk_buff *skb)
-+{
-+ struct tcp_options_received tmp_opt;
-+ struct request_sock *req;
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ struct dst_entry *dst = NULL;
-+ __u32 isn = TCP_SKB_CB(skb)->when;
-+ bool want_cookie = false, fastopen;
-+ struct flowi fl;
-+ struct tcp_fastopen_cookie foc = { .len = -1 };
-+ int err;
-+
-+
-+ /* TW buckets are converted to open requests without
-+ * limitations, they conserve resources and peer is
-+ * evidently real one.
-+ */
-+ if ((sysctl_tcp_syncookies == 2 ||
-+ inet_csk_reqsk_queue_is_full(sk)) && !isn) {
-+ want_cookie = tcp_syn_flood_action(sk, skb, rsk_ops->slab_name);
-+ if (!want_cookie)
-+ goto drop;
-+ }
-+
-+
-+ /* Accept backlog is full. If we have already queued enough
-+ * of warm entries in syn queue, drop request. It is better than
-+ * clogging syn queue with openreqs with exponentially increasing
-+ * timeout.
-+ */
-+ if (sk_acceptq_is_full(sk) && inet_csk_reqsk_queue_young(sk) > 1) {
-+ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENOVERFLOWS);
-+ goto drop;
-+ }
-+
-+ req = inet_reqsk_alloc(rsk_ops);
-+ if (!req)
-+ goto drop;
-+
-+ tcp_rsk(req)->af_specific = af_ops;
-+
-+ tcp_clear_options(&tmp_opt);
-+ tmp_opt.mss_clamp = af_ops->mss_clamp;
-+ tmp_opt.user_mss = tp->rx_opt.user_mss;
-+ tcp_parse_options(skb, &tmp_opt, NULL, 0, want_cookie ? NULL : &foc);
-+
-+ if (want_cookie && !tmp_opt.saw_tstamp)
-+ tcp_clear_options(&tmp_opt);
-+
-+ tmp_opt.tstamp_ok = tmp_opt.saw_tstamp;
-+ tcp_openreq_init(req, &tmp_opt, skb);
-+
-+ if (af_ops->init_req(req, sk, skb))
-+ goto drop_and_free;
-+
-+ if (security_inet_conn_request(sk, skb, req))
-+ goto drop_and_free;
-+
-+ if (!want_cookie || tmp_opt.tstamp_ok)
-+ TCP_ECN_create_request(req, skb, sock_net(sk));
-+
-+ if (want_cookie) {
-+ isn = cookie_init_sequence(af_ops, sk, skb, &req->mss);
-+ req->cookie_ts = tmp_opt.tstamp_ok;
-+ } else if (!isn) {
-+ /* VJ's idea. We save last timestamp seen
-+ * from the destination in peer table, when entering
-+ * state TIME-WAIT, and check against it before
-+ * accepting new connection request.
-+ *
-+ * If "isn" is not zero, this request hit alive
-+ * timewait bucket, so that all the necessary checks
-+ * are made in the function processing timewait state.
-+ */
-+ if (tmp_opt.saw_tstamp && tcp_death_row.sysctl_tw_recycle) {
-+ bool strict;
-+
-+ dst = af_ops->route_req(sk, &fl, req, &strict);
-+ if (dst && strict &&
-+ !tcp_peer_is_proven(req, dst, true)) {
-+ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_PAWSPASSIVEREJECTED);
-+ goto drop_and_release;
-+ }
-+ }
-+ /* Kill the following clause, if you dislike this way. */
-+ else if (!sysctl_tcp_syncookies &&
-+ (sysctl_max_syn_backlog - inet_csk_reqsk_queue_len(sk) <
-+ (sysctl_max_syn_backlog >> 2)) &&
-+ !tcp_peer_is_proven(req, dst, false)) {
-+ /* Without syncookies last quarter of
-+ * backlog is filled with destinations,
-+ * proven to be alive.
-+ * It means that we continue to communicate
-+ * to destinations, already remembered
-+ * to the moment of synflood.
-+ */
-+ pr_drop_req(req, ntohs(tcp_hdr(skb)->source),
-+ rsk_ops->family);
-+ goto drop_and_release;
-+ }
-+
-+ isn = af_ops->init_seq(skb);
-+ }
-+ if (!dst) {
-+ dst = af_ops->route_req(sk, &fl, req, NULL);
-+ if (!dst)
-+ goto drop_and_free;
-+ }
-+
-+ tcp_rsk(req)->snt_isn = isn;
-+ tcp_openreq_init_rwin(req, sk, dst);
-+ fastopen = !want_cookie &&
-+ tcp_try_fastopen(sk, skb, req, &foc, dst);
-+ err = af_ops->send_synack(sk, dst, &fl, req,
-+ skb_get_queue_mapping(skb), &foc);
-+ if (!fastopen) {
-+ if (err || want_cookie)
-+ goto drop_and_free;
-+
-+ tcp_rsk(req)->listener = NULL;
-+ af_ops->queue_hash_add(sk, req, TCP_TIMEOUT_INIT);
-+ }
-+
-+ return 0;
-+
-+drop_and_release:
-+ dst_release(dst);
-+drop_and_free:
-+ reqsk_free(req);
-+drop:
-+ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENDROPS);
-+ return 0;
-+}
-+EXPORT_SYMBOL(tcp_conn_request);
-diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
-index 77cccda1ad0c..c77017f600f1 100644
---- a/net/ipv4/tcp_ipv4.c
-+++ b/net/ipv4/tcp_ipv4.c
-@@ -67,6 +67,8 @@
- #include <net/icmp.h>
- #include <net/inet_hashtables.h>
- #include <net/tcp.h>
-+#include <net/mptcp.h>
-+#include <net/mptcp_v4.h>
- #include <net/transp_v6.h>
- #include <net/ipv6.h>
- #include <net/inet_common.h>
-@@ -99,7 +101,7 @@ static int tcp_v4_md5_hash_hdr(char *md5_hash, const struct tcp_md5sig_key *key,
- struct inet_hashinfo tcp_hashinfo;
- EXPORT_SYMBOL(tcp_hashinfo);
-
--static inline __u32 tcp_v4_init_sequence(const struct sk_buff *skb)
-+__u32 tcp_v4_init_sequence(const struct sk_buff *skb)
- {
- return secure_tcp_sequence_number(ip_hdr(skb)->daddr,
- ip_hdr(skb)->saddr,
-@@ -334,7 +336,7 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
- struct inet_sock *inet;
- const int type = icmp_hdr(icmp_skb)->type;
- const int code = icmp_hdr(icmp_skb)->code;
-- struct sock *sk;
-+ struct sock *sk, *meta_sk;
- struct sk_buff *skb;
- struct request_sock *fastopen;
- __u32 seq, snd_una;
-@@ -358,13 +360,19 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
- return;
- }
-
-- bh_lock_sock(sk);
-+ tp = tcp_sk(sk);
-+ if (mptcp(tp))
-+ meta_sk = mptcp_meta_sk(sk);
-+ else
-+ meta_sk = sk;
-+
-+ bh_lock_sock(meta_sk);
- /* If too many ICMPs get dropped on busy
- * servers this needs to be solved differently.
- * We do take care of PMTU discovery (RFC1191) special case :
- * we can receive locally generated ICMP messages while socket is held.
- */
-- if (sock_owned_by_user(sk)) {
-+ if (sock_owned_by_user(meta_sk)) {
- if (!(type == ICMP_DEST_UNREACH && code == ICMP_FRAG_NEEDED))
- NET_INC_STATS_BH(net, LINUX_MIB_LOCKDROPPEDICMPS);
- }
-@@ -377,7 +385,6 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
- }
-
- icsk = inet_csk(sk);
-- tp = tcp_sk(sk);
- seq = ntohl(th->seq);
- /* XXX (TFO) - tp->snd_una should be ISN (tcp_create_openreq_child() */
- fastopen = tp->fastopen_rsk;
-@@ -411,11 +418,13 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
- goto out;
-
- tp->mtu_info = info;
-- if (!sock_owned_by_user(sk)) {
-+ if (!sock_owned_by_user(meta_sk)) {
- tcp_v4_mtu_reduced(sk);
- } else {
- if (!test_and_set_bit(TCP_MTU_REDUCED_DEFERRED, &tp->tsq_flags))
- sock_hold(sk);
-+ if (mptcp(tp))
-+ mptcp_tsq_flags(sk);
- }
- goto out;
- }
-@@ -429,7 +438,7 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
- !icsk->icsk_backoff || fastopen)
- break;
-
-- if (sock_owned_by_user(sk))
-+ if (sock_owned_by_user(meta_sk))
- break;
-
- icsk->icsk_backoff--;
-@@ -463,7 +472,7 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
- switch (sk->sk_state) {
- struct request_sock *req, **prev;
- case TCP_LISTEN:
-- if (sock_owned_by_user(sk))
-+ if (sock_owned_by_user(meta_sk))
- goto out;
-
- req = inet_csk_search_req(sk, &prev, th->dest,
-@@ -499,7 +508,7 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
- if (fastopen && fastopen->sk == NULL)
- break;
-
-- if (!sock_owned_by_user(sk)) {
-+ if (!sock_owned_by_user(meta_sk)) {
- sk->sk_err = err;
-
- sk->sk_error_report(sk);
-@@ -528,7 +537,7 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
- */
-
- inet = inet_sk(sk);
-- if (!sock_owned_by_user(sk) && inet->recverr) {
-+ if (!sock_owned_by_user(meta_sk) && inet->recverr) {
- sk->sk_err = err;
- sk->sk_error_report(sk);
- } else { /* Only an error on timeout */
-@@ -536,7 +545,7 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
- }
-
- out:
-- bh_unlock_sock(sk);
-+ bh_unlock_sock(meta_sk);
- sock_put(sk);
- }
-
-@@ -578,7 +587,7 @@ EXPORT_SYMBOL(tcp_v4_send_check);
- * Exception: precedence violation. We do not implement it in any case.
- */
-
--static void tcp_v4_send_reset(struct sock *sk, struct sk_buff *skb)
-+void tcp_v4_send_reset(struct sock *sk, struct sk_buff *skb)
- {
- const struct tcphdr *th = tcp_hdr(skb);
- struct {
-@@ -702,10 +711,10 @@ release_sk1:
- outside socket context is ugly, certainly. What can I do?
- */
-
--static void tcp_v4_send_ack(struct sk_buff *skb, u32 seq, u32 ack,
-+static void tcp_v4_send_ack(struct sk_buff *skb, u32 seq, u32 ack, u32 data_ack,
- u32 win, u32 tsval, u32 tsecr, int oif,
- struct tcp_md5sig_key *key,
-- int reply_flags, u8 tos)
-+ int reply_flags, u8 tos, int mptcp)
- {
- const struct tcphdr *th = tcp_hdr(skb);
- struct {
-@@ -714,6 +723,10 @@ static void tcp_v4_send_ack(struct sk_buff *skb, u32 seq, u32 ack,
- #ifdef CONFIG_TCP_MD5SIG
- + (TCPOLEN_MD5SIG_ALIGNED >> 2)
- #endif
-+#ifdef CONFIG_MPTCP
-+ + ((MPTCP_SUB_LEN_DSS >> 2) +
-+ (MPTCP_SUB_LEN_ACK >> 2))
-+#endif
- ];
- } rep;
- struct ip_reply_arg arg;
-@@ -758,6 +771,21 @@ static void tcp_v4_send_ack(struct sk_buff *skb, u32 seq, u32 ack,
- ip_hdr(skb)->daddr, &rep.th);
- }
- #endif
-+#ifdef CONFIG_MPTCP
-+ if (mptcp) {
-+ int offset = (tsecr) ? 3 : 0;
-+ /* Construction of 32-bit data_ack */
-+ rep.opt[offset++] = htonl((TCPOPT_MPTCP << 24) |
-+ ((MPTCP_SUB_LEN_DSS + MPTCP_SUB_LEN_ACK) << 16) |
-+ (0x20 << 8) |
-+ (0x01));
-+ rep.opt[offset] = htonl(data_ack);
-+
-+ arg.iov[0].iov_len += MPTCP_SUB_LEN_DSS + MPTCP_SUB_LEN_ACK;
-+ rep.th.doff = arg.iov[0].iov_len / 4;
-+ }
-+#endif /* CONFIG_MPTCP */
-+
- arg.flags = reply_flags;
- arg.csum = csum_tcpudp_nofold(ip_hdr(skb)->daddr,
- ip_hdr(skb)->saddr, /* XXX */
-@@ -776,36 +804,44 @@ static void tcp_v4_timewait_ack(struct sock *sk, struct sk_buff *skb)
- {
- struct inet_timewait_sock *tw = inet_twsk(sk);
- struct tcp_timewait_sock *tcptw = tcp_twsk(sk);
-+ u32 data_ack = 0;
-+ int mptcp = 0;
-+
-+ if (tcptw->mptcp_tw && tcptw->mptcp_tw->meta_tw) {
-+ data_ack = (u32)tcptw->mptcp_tw->rcv_nxt;
-+ mptcp = 1;
-+ }
-
- tcp_v4_send_ack(skb, tcptw->tw_snd_nxt, tcptw->tw_rcv_nxt,
-+ data_ack,
- tcptw->tw_rcv_wnd >> tw->tw_rcv_wscale,
- tcp_time_stamp + tcptw->tw_ts_offset,
- tcptw->tw_ts_recent,
- tw->tw_bound_dev_if,
- tcp_twsk_md5_key(tcptw),
- tw->tw_transparent ? IP_REPLY_ARG_NOSRCCHECK : 0,
-- tw->tw_tos
-+ tw->tw_tos, mptcp
- );
-
- inet_twsk_put(tw);
- }
-
--static void tcp_v4_reqsk_send_ack(struct sock *sk, struct sk_buff *skb,
-- struct request_sock *req)
-+void tcp_v4_reqsk_send_ack(struct sock *sk, struct sk_buff *skb,
-+ struct request_sock *req)
- {
- /* sk->sk_state == TCP_LISTEN -> for regular TCP_SYN_RECV
- * sk->sk_state == TCP_SYN_RECV -> for Fast Open.
- */
- tcp_v4_send_ack(skb, (sk->sk_state == TCP_LISTEN) ?
- tcp_rsk(req)->snt_isn + 1 : tcp_sk(sk)->snd_nxt,
-- tcp_rsk(req)->rcv_nxt, req->rcv_wnd,
-+ tcp_rsk(req)->rcv_nxt, 0, req->rcv_wnd,
- tcp_time_stamp,
- req->ts_recent,
- 0,
- tcp_md5_do_lookup(sk, (union tcp_md5_addr *)&ip_hdr(skb)->daddr,
- AF_INET),
- inet_rsk(req)->no_srccheck ? IP_REPLY_ARG_NOSRCCHECK : 0,
-- ip_hdr(skb)->tos);
-+ ip_hdr(skb)->tos, 0);
- }
-
- /*
-@@ -813,10 +849,11 @@ static void tcp_v4_reqsk_send_ack(struct sock *sk, struct sk_buff *skb,
- * This still operates on a request_sock only, not on a big
- * socket.
- */
--static int tcp_v4_send_synack(struct sock *sk, struct dst_entry *dst,
-- struct request_sock *req,
-- u16 queue_mapping,
-- struct tcp_fastopen_cookie *foc)
-+int tcp_v4_send_synack(struct sock *sk, struct dst_entry *dst,
-+ struct flowi *fl,
-+ struct request_sock *req,
-+ u16 queue_mapping,
-+ struct tcp_fastopen_cookie *foc)
- {
- const struct inet_request_sock *ireq = inet_rsk(req);
- struct flowi4 fl4;
-@@ -844,21 +881,10 @@ static int tcp_v4_send_synack(struct sock *sk, struct dst_entry *dst,
- return err;
- }
-
--static int tcp_v4_rtx_synack(struct sock *sk, struct request_sock *req)
--{
-- int res = tcp_v4_send_synack(sk, NULL, req, 0, NULL);
--
-- if (!res) {
-- TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_RETRANSSEGS);
-- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPSYNRETRANS);
-- }
-- return res;
--}
--
- /*
- * IPv4 request_sock destructor.
- */
--static void tcp_v4_reqsk_destructor(struct request_sock *req)
-+void tcp_v4_reqsk_destructor(struct request_sock *req)
- {
- kfree(inet_rsk(req)->opt);
- }
-@@ -896,7 +922,7 @@ EXPORT_SYMBOL(tcp_syn_flood_action);
- /*
- * Save and compile IPv4 options into the request_sock if needed.
- */
--static struct ip_options_rcu *tcp_v4_save_options(struct sk_buff *skb)
-+struct ip_options_rcu *tcp_v4_save_options(struct sk_buff *skb)
- {
- const struct ip_options *opt = &(IPCB(skb)->opt);
- struct ip_options_rcu *dopt = NULL;
-@@ -1237,161 +1263,71 @@ static bool tcp_v4_inbound_md5_hash(struct sock *sk, const struct sk_buff *skb)
-
- #endif
-
-+static int tcp_v4_init_req(struct request_sock *req, struct sock *sk,
-+ struct sk_buff *skb)
-+{
-+ struct inet_request_sock *ireq = inet_rsk(req);
-+
-+ ireq->ir_loc_addr = ip_hdr(skb)->daddr;
-+ ireq->ir_rmt_addr = ip_hdr(skb)->saddr;
-+ ireq->no_srccheck = inet_sk(sk)->transparent;
-+ ireq->opt = tcp_v4_save_options(skb);
-+ ireq->ir_mark = inet_request_mark(sk, skb);
-+
-+ return 0;
-+}
-+
-+static struct dst_entry *tcp_v4_route_req(struct sock *sk, struct flowi *fl,
-+ const struct request_sock *req,
-+ bool *strict)
-+{
-+ struct dst_entry *dst = inet_csk_route_req(sk, &fl->u.ip4, req);
-+
-+ if (strict) {
-+ if (fl->u.ip4.daddr == inet_rsk(req)->ir_rmt_addr)
-+ *strict = true;
-+ else
-+ *strict = false;
-+ }
-+
-+ return dst;
-+}
-+
- struct request_sock_ops tcp_request_sock_ops __read_mostly = {
- .family = PF_INET,
- .obj_size = sizeof(struct tcp_request_sock),
-- .rtx_syn_ack = tcp_v4_rtx_synack,
-+ .rtx_syn_ack = tcp_rtx_synack,
- .send_ack = tcp_v4_reqsk_send_ack,
- .destructor = tcp_v4_reqsk_destructor,
- .send_reset = tcp_v4_send_reset,
- .syn_ack_timeout = tcp_syn_ack_timeout,
- };
-
-+const struct tcp_request_sock_ops tcp_request_sock_ipv4_ops = {
-+ .mss_clamp = TCP_MSS_DEFAULT,
- #ifdef CONFIG_TCP_MD5SIG
--static const struct tcp_request_sock_ops tcp_request_sock_ipv4_ops = {
- .md5_lookup = tcp_v4_reqsk_md5_lookup,
- .calc_md5_hash = tcp_v4_md5_hash_skb,
--};
- #endif
-+ .init_req = tcp_v4_init_req,
-+#ifdef CONFIG_SYN_COOKIES
-+ .cookie_init_seq = cookie_v4_init_sequence,
-+#endif
-+ .route_req = tcp_v4_route_req,
-+ .init_seq = tcp_v4_init_sequence,
-+ .send_synack = tcp_v4_send_synack,
-+ .queue_hash_add = inet_csk_reqsk_queue_hash_add,
-+};
-
- int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
- {
-- struct tcp_options_received tmp_opt;
-- struct request_sock *req;
-- struct inet_request_sock *ireq;
-- struct tcp_sock *tp = tcp_sk(sk);
-- struct dst_entry *dst = NULL;
-- __be32 saddr = ip_hdr(skb)->saddr;
-- __be32 daddr = ip_hdr(skb)->daddr;
-- __u32 isn = TCP_SKB_CB(skb)->when;
-- bool want_cookie = false, fastopen;
-- struct flowi4 fl4;
-- struct tcp_fastopen_cookie foc = { .len = -1 };
-- int err;
--
- /* Never answer to SYNs send to broadcast or multicast */
- if (skb_rtable(skb)->rt_flags & (RTCF_BROADCAST | RTCF_MULTICAST))
- goto drop;
-
-- /* TW buckets are converted to open requests without
-- * limitations, they conserve resources and peer is
-- * evidently real one.
-- */
-- if ((sysctl_tcp_syncookies == 2 ||
-- inet_csk_reqsk_queue_is_full(sk)) && !isn) {
-- want_cookie = tcp_syn_flood_action(sk, skb, "TCP");
-- if (!want_cookie)
-- goto drop;
-- }
--
-- /* Accept backlog is full. If we have already queued enough
-- * of warm entries in syn queue, drop request. It is better than
-- * clogging syn queue with openreqs with exponentially increasing
-- * timeout.
-- */
-- if (sk_acceptq_is_full(sk) && inet_csk_reqsk_queue_young(sk) > 1) {
-- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENOVERFLOWS);
-- goto drop;
-- }
--
-- req = inet_reqsk_alloc(&tcp_request_sock_ops);
-- if (!req)
-- goto drop;
--
--#ifdef CONFIG_TCP_MD5SIG
-- tcp_rsk(req)->af_specific = &tcp_request_sock_ipv4_ops;
--#endif
--
-- tcp_clear_options(&tmp_opt);
-- tmp_opt.mss_clamp = TCP_MSS_DEFAULT;
-- tmp_opt.user_mss = tp->rx_opt.user_mss;
-- tcp_parse_options(skb, &tmp_opt, 0, want_cookie ? NULL : &foc);
--
-- if (want_cookie && !tmp_opt.saw_tstamp)
-- tcp_clear_options(&tmp_opt);
--
-- tmp_opt.tstamp_ok = tmp_opt.saw_tstamp;
-- tcp_openreq_init(req, &tmp_opt, skb);
-+ return tcp_conn_request(&tcp_request_sock_ops,
-+ &tcp_request_sock_ipv4_ops, sk, skb);
-
-- ireq = inet_rsk(req);
-- ireq->ir_loc_addr = daddr;
-- ireq->ir_rmt_addr = saddr;
-- ireq->no_srccheck = inet_sk(sk)->transparent;
-- ireq->opt = tcp_v4_save_options(skb);
-- ireq->ir_mark = inet_request_mark(sk, skb);
--
-- if (security_inet_conn_request(sk, skb, req))
-- goto drop_and_free;
--
-- if (!want_cookie || tmp_opt.tstamp_ok)
-- TCP_ECN_create_request(req, skb, sock_net(sk));
--
-- if (want_cookie) {
-- isn = cookie_v4_init_sequence(sk, skb, &req->mss);
-- req->cookie_ts = tmp_opt.tstamp_ok;
-- } else if (!isn) {
-- /* VJ's idea. We save last timestamp seen
-- * from the destination in peer table, when entering
-- * state TIME-WAIT, and check against it before
-- * accepting new connection request.
-- *
-- * If "isn" is not zero, this request hit alive
-- * timewait bucket, so that all the necessary checks
-- * are made in the function processing timewait state.
-- */
-- if (tmp_opt.saw_tstamp &&
-- tcp_death_row.sysctl_tw_recycle &&
-- (dst = inet_csk_route_req(sk, &fl4, req)) != NULL &&
-- fl4.daddr == saddr) {
-- if (!tcp_peer_is_proven(req, dst, true)) {
-- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_PAWSPASSIVEREJECTED);
-- goto drop_and_release;
-- }
-- }
-- /* Kill the following clause, if you dislike this way. */
-- else if (!sysctl_tcp_syncookies &&
-- (sysctl_max_syn_backlog - inet_csk_reqsk_queue_len(sk) <
-- (sysctl_max_syn_backlog >> 2)) &&
-- !tcp_peer_is_proven(req, dst, false)) {
-- /* Without syncookies last quarter of
-- * backlog is filled with destinations,
-- * proven to be alive.
-- * It means that we continue to communicate
-- * to destinations, already remembered
-- * to the moment of synflood.
-- */
-- LIMIT_NETDEBUG(KERN_DEBUG pr_fmt("drop open request from %pI4/%u\n"),
-- &saddr, ntohs(tcp_hdr(skb)->source));
-- goto drop_and_release;
-- }
--
-- isn = tcp_v4_init_sequence(skb);
-- }
-- if (!dst && (dst = inet_csk_route_req(sk, &fl4, req)) == NULL)
-- goto drop_and_free;
--
-- tcp_rsk(req)->snt_isn = isn;
-- tcp_rsk(req)->snt_synack = tcp_time_stamp;
-- tcp_openreq_init_rwin(req, sk, dst);
-- fastopen = !want_cookie &&
-- tcp_try_fastopen(sk, skb, req, &foc, dst);
-- err = tcp_v4_send_synack(sk, dst, req,
-- skb_get_queue_mapping(skb), &foc);
-- if (!fastopen) {
-- if (err || want_cookie)
-- goto drop_and_free;
--
-- tcp_rsk(req)->snt_synack = tcp_time_stamp;
-- tcp_rsk(req)->listener = NULL;
-- inet_csk_reqsk_queue_hash_add(sk, req, TCP_TIMEOUT_INIT);
-- }
--
-- return 0;
--
--drop_and_release:
-- dst_release(dst);
--drop_and_free:
-- reqsk_free(req);
- drop:
- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENDROPS);
- return 0;
-@@ -1497,7 +1433,7 @@ put_and_exit:
- }
- EXPORT_SYMBOL(tcp_v4_syn_recv_sock);
-
--static struct sock *tcp_v4_hnd_req(struct sock *sk, struct sk_buff *skb)
-+struct sock *tcp_v4_hnd_req(struct sock *sk, struct sk_buff *skb)
- {
- struct tcphdr *th = tcp_hdr(skb);
- const struct iphdr *iph = ip_hdr(skb);
-@@ -1514,8 +1450,15 @@ static struct sock *tcp_v4_hnd_req(struct sock *sk, struct sk_buff *skb)
-
- if (nsk) {
- if (nsk->sk_state != TCP_TIME_WAIT) {
-+ /* Don't lock the meta-sk again. It has been locked
-+ * before mptcp_v4_do_rcv.
-+ */
-+ if (mptcp(tcp_sk(nsk)) && !is_meta_sk(sk))
-+ bh_lock_sock(mptcp_meta_sk(nsk));
- bh_lock_sock(nsk);
-+
- return nsk;
-+
- }
- inet_twsk_put(inet_twsk(nsk));
- return NULL;
-@@ -1550,6 +1493,9 @@ int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
- goto discard;
- #endif
-
-+ if (is_meta_sk(sk))
-+ return mptcp_v4_do_rcv(sk, skb);
-+
- if (sk->sk_state == TCP_ESTABLISHED) { /* Fast path */
- struct dst_entry *dst = sk->sk_rx_dst;
-
-@@ -1681,7 +1627,7 @@ bool tcp_prequeue(struct sock *sk, struct sk_buff *skb)
- } else if (skb_queue_len(&tp->ucopy.prequeue) == 1) {
- wake_up_interruptible_sync_poll(sk_sleep(sk),
- POLLIN | POLLRDNORM | POLLRDBAND);
-- if (!inet_csk_ack_scheduled(sk))
-+ if (!inet_csk_ack_scheduled(sk) && !mptcp(tp))
- inet_csk_reset_xmit_timer(sk, ICSK_TIME_DACK,
- (3 * tcp_rto_min(sk)) / 4,
- TCP_RTO_MAX);
-@@ -1698,7 +1644,7 @@ int tcp_v4_rcv(struct sk_buff *skb)
- {
- const struct iphdr *iph;
- const struct tcphdr *th;
-- struct sock *sk;
-+ struct sock *sk, *meta_sk = NULL;
- int ret;
- struct net *net = dev_net(skb->dev);
-
-@@ -1732,18 +1678,42 @@ int tcp_v4_rcv(struct sk_buff *skb)
- TCP_SKB_CB(skb)->end_seq = (TCP_SKB_CB(skb)->seq + th->syn + th->fin +
- skb->len - th->doff * 4);
- TCP_SKB_CB(skb)->ack_seq = ntohl(th->ack_seq);
-+#ifdef CONFIG_MPTCP
-+ TCP_SKB_CB(skb)->mptcp_flags = 0;
-+ TCP_SKB_CB(skb)->dss_off = 0;
-+#endif
- TCP_SKB_CB(skb)->when = 0;
- TCP_SKB_CB(skb)->ip_dsfield = ipv4_get_dsfield(iph);
- TCP_SKB_CB(skb)->sacked = 0;
-
- sk = __inet_lookup_skb(&tcp_hashinfo, skb, th->source, th->dest);
-- if (!sk)
-- goto no_tcp_socket;
-
- process:
-- if (sk->sk_state == TCP_TIME_WAIT)
-+ if (sk && sk->sk_state == TCP_TIME_WAIT)
- goto do_time_wait;
-
-+#ifdef CONFIG_MPTCP
-+ if (!sk && th->syn && !th->ack) {
-+ int ret = mptcp_lookup_join(skb, NULL);
-+
-+ if (ret < 0) {
-+ tcp_v4_send_reset(NULL, skb);
-+ goto discard_it;
-+ } else if (ret > 0) {
-+ return 0;
-+ }
-+ }
-+
-+ /* Is there a pending request sock for this segment ? */
-+ if ((!sk || sk->sk_state == TCP_LISTEN) && mptcp_check_req(skb, net)) {
-+ if (sk)
-+ sock_put(sk);
-+ return 0;
-+ }
-+#endif
-+ if (!sk)
-+ goto no_tcp_socket;
-+
- if (unlikely(iph->ttl < inet_sk(sk)->min_ttl)) {
- NET_INC_STATS_BH(net, LINUX_MIB_TCPMINTTLDROP);
- goto discard_and_relse;
-@@ -1759,11 +1729,21 @@ process:
- sk_mark_napi_id(sk, skb);
- skb->dev = NULL;
-
-- bh_lock_sock_nested(sk);
-+ if (mptcp(tcp_sk(sk))) {
-+ meta_sk = mptcp_meta_sk(sk);
-+
-+ bh_lock_sock_nested(meta_sk);
-+ if (sock_owned_by_user(meta_sk))
-+ skb->sk = sk;
-+ } else {
-+ meta_sk = sk;
-+ bh_lock_sock_nested(sk);
-+ }
-+
- ret = 0;
-- if (!sock_owned_by_user(sk)) {
-+ if (!sock_owned_by_user(meta_sk)) {
- #ifdef CONFIG_NET_DMA
-- struct tcp_sock *tp = tcp_sk(sk);
-+ struct tcp_sock *tp = tcp_sk(meta_sk);
- if (!tp->ucopy.dma_chan && tp->ucopy.pinned_list)
- tp->ucopy.dma_chan = net_dma_find_channel();
- if (tp->ucopy.dma_chan)
-@@ -1771,16 +1751,16 @@ process:
- else
- #endif
- {
-- if (!tcp_prequeue(sk, skb))
-+ if (!tcp_prequeue(meta_sk, skb))
- ret = tcp_v4_do_rcv(sk, skb);
- }
-- } else if (unlikely(sk_add_backlog(sk, skb,
-- sk->sk_rcvbuf + sk->sk_sndbuf))) {
-- bh_unlock_sock(sk);
-+ } else if (unlikely(sk_add_backlog(meta_sk, skb,
-+ meta_sk->sk_rcvbuf + meta_sk->sk_sndbuf))) {
-+ bh_unlock_sock(meta_sk);
- NET_INC_STATS_BH(net, LINUX_MIB_TCPBACKLOGDROP);
- goto discard_and_relse;
- }
-- bh_unlock_sock(sk);
-+ bh_unlock_sock(meta_sk);
-
- sock_put(sk);
-
-@@ -1835,6 +1815,18 @@ do_time_wait:
- sk = sk2;
- goto process;
- }
-+#ifdef CONFIG_MPTCP
-+ if (th->syn && !th->ack) {
-+ int ret = mptcp_lookup_join(skb, inet_twsk(sk));
-+
-+ if (ret < 0) {
-+ tcp_v4_send_reset(NULL, skb);
-+ goto discard_it;
-+ } else if (ret > 0) {
-+ return 0;
-+ }
-+ }
-+#endif
- /* Fall through to ACK */
- }
- case TCP_TW_ACK:
-@@ -1900,7 +1892,12 @@ static int tcp_v4_init_sock(struct sock *sk)
-
- tcp_init_sock(sk);
-
-- icsk->icsk_af_ops = &ipv4_specific;
-+#ifdef CONFIG_MPTCP
-+ if (is_mptcp_enabled(sk))
-+ icsk->icsk_af_ops = &mptcp_v4_specific;
-+ else
-+#endif
-+ icsk->icsk_af_ops = &ipv4_specific;
-
- #ifdef CONFIG_TCP_MD5SIG
- tcp_sk(sk)->af_specific = &tcp_sock_ipv4_specific;
-@@ -1917,6 +1914,11 @@ void tcp_v4_destroy_sock(struct sock *sk)
-
- tcp_cleanup_congestion_control(sk);
-
-+ if (mptcp(tp))
-+ mptcp_destroy_sock(sk);
-+ if (tp->inside_tk_table)
-+ mptcp_hash_remove(tp);
-+
- /* Cleanup up the write buffer. */
- tcp_write_queue_purge(sk);
-
-@@ -2481,6 +2483,19 @@ void tcp4_proc_exit(void)
- }
- #endif /* CONFIG_PROC_FS */
-
-+#ifdef CONFIG_MPTCP
-+static void tcp_v4_clear_sk(struct sock *sk, int size)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+
-+ /* We do not want to clear the tk_table field, because of RCU lookups */
-+ sk_prot_clear_nulls(sk, offsetof(struct tcp_sock, tk_table));
-+
-+ size -= offsetof(struct tcp_sock, tk_table) + sizeof(tp->tk_table);
-+ memset((char *)&tp->tk_table + sizeof(tp->tk_table), 0, size);
-+}
-+#endif
-+
- struct proto tcp_prot = {
- .name = "TCP",
- .owner = THIS_MODULE,
-@@ -2528,6 +2543,9 @@ struct proto tcp_prot = {
- .destroy_cgroup = tcp_destroy_cgroup,
- .proto_cgroup = tcp_proto_cgroup,
- #endif
-+#ifdef CONFIG_MPTCP
-+ .clear_sk = tcp_v4_clear_sk,
-+#endif
- };
- EXPORT_SYMBOL(tcp_prot);
-
-diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
-index e68e0d4af6c9..ae6946857dff 100644
---- a/net/ipv4/tcp_minisocks.c
-+++ b/net/ipv4/tcp_minisocks.c
-@@ -18,11 +18,13 @@
- * Jorge Cwik, <jorge@laser.satlink.net>
- */
-
-+#include <linux/kconfig.h>
- #include <linux/mm.h>
- #include <linux/module.h>
- #include <linux/slab.h>
- #include <linux/sysctl.h>
- #include <linux/workqueue.h>
-+#include <net/mptcp.h>
- #include <net/tcp.h>
- #include <net/inet_common.h>
- #include <net/xfrm.h>
-@@ -95,10 +97,13 @@ tcp_timewait_state_process(struct inet_timewait_sock *tw, struct sk_buff *skb,
- struct tcp_options_received tmp_opt;
- struct tcp_timewait_sock *tcptw = tcp_twsk((struct sock *)tw);
- bool paws_reject = false;
-+ struct mptcp_options_received mopt;
-
- tmp_opt.saw_tstamp = 0;
- if (th->doff > (sizeof(*th) >> 2) && tcptw->tw_ts_recent_stamp) {
-- tcp_parse_options(skb, &tmp_opt, 0, NULL);
-+ mptcp_init_mp_opt(&mopt);
-+
-+ tcp_parse_options(skb, &tmp_opt, &mopt, 0, NULL);
-
- if (tmp_opt.saw_tstamp) {
- tmp_opt.rcv_tsecr -= tcptw->tw_ts_offset;
-@@ -106,6 +111,11 @@ tcp_timewait_state_process(struct inet_timewait_sock *tw, struct sk_buff *skb,
- tmp_opt.ts_recent_stamp = tcptw->tw_ts_recent_stamp;
- paws_reject = tcp_paws_reject(&tmp_opt, th->rst);
- }
-+
-+ if (unlikely(mopt.mp_fclose) && tcptw->mptcp_tw) {
-+ if (mopt.mptcp_key == tcptw->mptcp_tw->loc_key)
-+ goto kill_with_rst;
-+ }
- }
-
- if (tw->tw_substate == TCP_FIN_WAIT2) {
-@@ -128,6 +138,16 @@ tcp_timewait_state_process(struct inet_timewait_sock *tw, struct sk_buff *skb,
- if (!th->ack ||
- !after(TCP_SKB_CB(skb)->end_seq, tcptw->tw_rcv_nxt) ||
- TCP_SKB_CB(skb)->end_seq == TCP_SKB_CB(skb)->seq) {
-+ /* If mptcp_is_data_fin() returns true, we are sure that
-+ * mopt has been initialized - otherwise it would not
-+ * be a DATA_FIN.
-+ */
-+ if (tcptw->mptcp_tw && tcptw->mptcp_tw->meta_tw &&
-+ mptcp_is_data_fin(skb) &&
-+ TCP_SKB_CB(skb)->seq == tcptw->tw_rcv_nxt &&
-+ mopt.data_seq + 1 == (u32)tcptw->mptcp_tw->rcv_nxt)
-+ return TCP_TW_ACK;
-+
- inet_twsk_put(tw);
- return TCP_TW_SUCCESS;
- }
-@@ -290,6 +310,15 @@ void tcp_time_wait(struct sock *sk, int state, int timeo)
- tcptw->tw_ts_recent_stamp = tp->rx_opt.ts_recent_stamp;
- tcptw->tw_ts_offset = tp->tsoffset;
-
-+ if (mptcp(tp)) {
-+ if (mptcp_init_tw_sock(sk, tcptw)) {
-+ inet_twsk_free(tw);
-+ goto exit;
-+ }
-+ } else {
-+ tcptw->mptcp_tw = NULL;
-+ }
-+
- #if IS_ENABLED(CONFIG_IPV6)
- if (tw->tw_family == PF_INET6) {
- struct ipv6_pinfo *np = inet6_sk(sk);
-@@ -347,15 +376,18 @@ void tcp_time_wait(struct sock *sk, int state, int timeo)
- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPTIMEWAITOVERFLOW);
- }
-
-+exit:
- tcp_update_metrics(sk);
- tcp_done(sk);
- }
-
- void tcp_twsk_destructor(struct sock *sk)
- {
--#ifdef CONFIG_TCP_MD5SIG
- struct tcp_timewait_sock *twsk = tcp_twsk(sk);
-
-+ if (twsk->mptcp_tw)
-+ mptcp_twsk_destructor(twsk);
-+#ifdef CONFIG_TCP_MD5SIG
- if (twsk->tw_md5_key)
- kfree_rcu(twsk->tw_md5_key, rcu);
- #endif
-@@ -382,13 +414,14 @@ void tcp_openreq_init_rwin(struct request_sock *req,
- req->window_clamp = tcp_full_space(sk);
-
- /* tcp_full_space because it is guaranteed to be the first packet */
-- tcp_select_initial_window(tcp_full_space(sk),
-- mss - (ireq->tstamp_ok ? TCPOLEN_TSTAMP_ALIGNED : 0),
-+ tp->ops->select_initial_window(tcp_full_space(sk),
-+ mss - (ireq->tstamp_ok ? TCPOLEN_TSTAMP_ALIGNED : 0) -
-+ (ireq->saw_mpc ? MPTCP_SUB_LEN_DSM_ALIGN : 0),
- &req->rcv_wnd,
- &req->window_clamp,
- ireq->wscale_ok,
- &rcv_wscale,
-- dst_metric(dst, RTAX_INITRWND));
-+ dst_metric(dst, RTAX_INITRWND), sk);
- ireq->rcv_wscale = rcv_wscale;
- }
- EXPORT_SYMBOL(tcp_openreq_init_rwin);
-@@ -499,6 +532,8 @@ struct sock *tcp_create_openreq_child(struct sock *sk, struct request_sock *req,
- newtp->rx_opt.ts_recent_stamp = 0;
- newtp->tcp_header_len = sizeof(struct tcphdr);
- }
-+ if (ireq->saw_mpc)
-+ newtp->tcp_header_len += MPTCP_SUB_LEN_DSM_ALIGN;
- newtp->tsoffset = 0;
- #ifdef CONFIG_TCP_MD5SIG
- newtp->md5sig_info = NULL; /*XXX*/
-@@ -535,16 +570,20 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
- bool fastopen)
- {
- struct tcp_options_received tmp_opt;
-+ struct mptcp_options_received mopt;
- struct sock *child;
- const struct tcphdr *th = tcp_hdr(skb);
- __be32 flg = tcp_flag_word(th) & (TCP_FLAG_RST|TCP_FLAG_SYN|TCP_FLAG_ACK);
- bool paws_reject = false;
-
-- BUG_ON(fastopen == (sk->sk_state == TCP_LISTEN));
-+ BUG_ON(!mptcp(tcp_sk(sk)) && fastopen == (sk->sk_state == TCP_LISTEN));
-
- tmp_opt.saw_tstamp = 0;
-+
-+ mptcp_init_mp_opt(&mopt);
-+
- if (th->doff > (sizeof(struct tcphdr)>>2)) {
-- tcp_parse_options(skb, &tmp_opt, 0, NULL);
-+ tcp_parse_options(skb, &tmp_opt, &mopt, 0, NULL);
-
- if (tmp_opt.saw_tstamp) {
- tmp_opt.ts_recent = req->ts_recent;
-@@ -583,7 +622,14 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
- *
- * Reset timer after retransmitting SYNACK, similar to
- * the idea of fast retransmit in recovery.
-+ *
-+ * Fall back to TCP if MP_CAPABLE is not set.
- */
-+
-+ if (inet_rsk(req)->saw_mpc && !mopt.saw_mpc)
-+ inet_rsk(req)->saw_mpc = false;
-+
-+
- if (!inet_rtx_syn_ack(sk, req))
- req->expires = min(TCP_TIMEOUT_INIT << req->num_timeout,
- TCP_RTO_MAX) + jiffies;
-@@ -718,9 +764,21 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
- * socket is created, wait for troubles.
- */
- child = inet_csk(sk)->icsk_af_ops->syn_recv_sock(sk, skb, req, NULL);
-+
- if (child == NULL)
- goto listen_overflow;
-
-+ if (!is_meta_sk(sk)) {
-+ int ret = mptcp_check_req_master(sk, child, req, prev);
-+ if (ret < 0)
-+ goto listen_overflow;
-+
-+ /* MPTCP-supported */
-+ if (!ret)
-+ return tcp_sk(child)->mpcb->master_sk;
-+ } else {
-+ return mptcp_check_req_child(sk, child, req, prev, &mopt);
-+ }
- inet_csk_reqsk_queue_unlink(sk, req, prev);
- inet_csk_reqsk_queue_removed(sk, req);
-
-@@ -746,7 +804,17 @@ embryonic_reset:
- tcp_reset(sk);
- }
- if (!fastopen) {
-- inet_csk_reqsk_queue_drop(sk, req, prev);
-+ if (is_meta_sk(sk)) {
-+ /* We want to avoid stopping the keepalive-timer and so
-+ * avoid ending up in inet_csk_reqsk_queue_removed ...
-+ */
-+ inet_csk_reqsk_queue_unlink(sk, req, prev);
-+ if (reqsk_queue_removed(&inet_csk(sk)->icsk_accept_queue, req) == 0)
-+ mptcp_delete_synack_timer(sk);
-+ reqsk_free(req);
-+ } else {
-+ inet_csk_reqsk_queue_drop(sk, req, prev);
-+ }
- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_EMBRYONICRSTS);
- }
- return NULL;
-@@ -770,8 +838,9 @@ int tcp_child_process(struct sock *parent, struct sock *child,
- {
- int ret = 0;
- int state = child->sk_state;
-+ struct sock *meta_sk = mptcp(tcp_sk(child)) ? mptcp_meta_sk(child) : child;
-
-- if (!sock_owned_by_user(child)) {
-+ if (!sock_owned_by_user(meta_sk)) {
- ret = tcp_rcv_state_process(child, skb, tcp_hdr(skb),
- skb->len);
- /* Wakeup parent, send SIGIO */
-@@ -782,10 +851,14 @@ int tcp_child_process(struct sock *parent, struct sock *child,
- * in main socket hash table and lock on listening
- * socket does not protect us more.
- */
-- __sk_add_backlog(child, skb);
-+ if (mptcp(tcp_sk(child)))
-+ skb->sk = child;
-+ __sk_add_backlog(meta_sk, skb);
- }
-
-- bh_unlock_sock(child);
-+ if (mptcp(tcp_sk(child)))
-+ bh_unlock_sock(child);
-+ bh_unlock_sock(meta_sk);
- sock_put(child);
- return ret;
- }
-diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
-index 179b51e6bda3..efd31b6c5784 100644
---- a/net/ipv4/tcp_output.c
-+++ b/net/ipv4/tcp_output.c
-@@ -36,6 +36,12 @@
-
- #define pr_fmt(fmt) "TCP: " fmt
-
-+#include <net/mptcp.h>
-+#include <net/mptcp_v4.h>
-+#if IS_ENABLED(CONFIG_IPV6)
-+#include <net/mptcp_v6.h>
-+#endif
-+#include <net/ipv6.h>
- #include <net/tcp.h>
-
- #include <linux/compiler.h>
-@@ -68,11 +74,8 @@ int sysctl_tcp_slow_start_after_idle __read_mostly = 1;
- unsigned int sysctl_tcp_notsent_lowat __read_mostly = UINT_MAX;
- EXPORT_SYMBOL(sysctl_tcp_notsent_lowat);
-
--static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
-- int push_one, gfp_t gfp);
--
- /* Account for new data that has been sent to the network. */
--static void tcp_event_new_data_sent(struct sock *sk, const struct sk_buff *skb)
-+void tcp_event_new_data_sent(struct sock *sk, const struct sk_buff *skb)
- {
- struct inet_connection_sock *icsk = inet_csk(sk);
- struct tcp_sock *tp = tcp_sk(sk);
-@@ -214,7 +217,7 @@ u32 tcp_default_init_rwnd(u32 mss)
- void tcp_select_initial_window(int __space, __u32 mss,
- __u32 *rcv_wnd, __u32 *window_clamp,
- int wscale_ok, __u8 *rcv_wscale,
-- __u32 init_rcv_wnd)
-+ __u32 init_rcv_wnd, const struct sock *sk)
- {
- unsigned int space = (__space < 0 ? 0 : __space);
-
-@@ -269,12 +272,16 @@ EXPORT_SYMBOL(tcp_select_initial_window);
- * value can be stuffed directly into th->window for an outgoing
- * frame.
- */
--static u16 tcp_select_window(struct sock *sk)
-+u16 tcp_select_window(struct sock *sk)
- {
- struct tcp_sock *tp = tcp_sk(sk);
- u32 old_win = tp->rcv_wnd;
-- u32 cur_win = tcp_receive_window(tp);
-- u32 new_win = __tcp_select_window(sk);
-+ /* The window must never shrink at the meta-level. At the subflow we
-+ * have to allow this. Otherwise we may announce a window too large
-+ * for the current meta-level sk_rcvbuf.
-+ */
-+ u32 cur_win = tcp_receive_window(mptcp(tp) ? tcp_sk(mptcp_meta_sk(sk)) : tp);
-+ u32 new_win = tp->ops->__select_window(sk);
-
- /* Never shrink the offered window */
- if (new_win < cur_win) {
-@@ -290,6 +297,7 @@ static u16 tcp_select_window(struct sock *sk)
- LINUX_MIB_TCPWANTZEROWINDOWADV);
- new_win = ALIGN(cur_win, 1 << tp->rx_opt.rcv_wscale);
- }
-+
- tp->rcv_wnd = new_win;
- tp->rcv_wup = tp->rcv_nxt;
-
-@@ -374,7 +382,7 @@ static inline void TCP_ECN_send(struct sock *sk, struct sk_buff *skb,
- /* Constructs common control bits of non-data skb. If SYN/FIN is present,
- * auto increment end seqno.
- */
--static void tcp_init_nondata_skb(struct sk_buff *skb, u32 seq, u8 flags)
-+void tcp_init_nondata_skb(struct sk_buff *skb, u32 seq, u8 flags)
- {
- struct skb_shared_info *shinfo = skb_shinfo(skb);
-
-@@ -394,7 +402,7 @@ static void tcp_init_nondata_skb(struct sk_buff *skb, u32 seq, u8 flags)
- TCP_SKB_CB(skb)->end_seq = seq;
- }
-
--static inline bool tcp_urg_mode(const struct tcp_sock *tp)
-+bool tcp_urg_mode(const struct tcp_sock *tp)
- {
- return tp->snd_una != tp->snd_up;
- }
-@@ -404,17 +412,7 @@ static inline bool tcp_urg_mode(const struct tcp_sock *tp)
- #define OPTION_MD5 (1 << 2)
- #define OPTION_WSCALE (1 << 3)
- #define OPTION_FAST_OPEN_COOKIE (1 << 8)
--
--struct tcp_out_options {
-- u16 options; /* bit field of OPTION_* */
-- u16 mss; /* 0 to disable */
-- u8 ws; /* window scale, 0 to disable */
-- u8 num_sack_blocks; /* number of SACK blocks to include */
-- u8 hash_size; /* bytes in hash_location */
-- __u8 *hash_location; /* temporary pointer, overloaded */
-- __u32 tsval, tsecr; /* need to include OPTION_TS */
-- struct tcp_fastopen_cookie *fastopen_cookie; /* Fast open cookie */
--};
-+/* Before adding here - take a look at OPTION_MPTCP in include/net/mptcp.h */
-
- /* Write previously computed TCP options to the packet.
- *
-@@ -430,7 +428,7 @@ struct tcp_out_options {
- * (but it may well be that other scenarios fail similarly).
- */
- static void tcp_options_write(__be32 *ptr, struct tcp_sock *tp,
-- struct tcp_out_options *opts)
-+ struct tcp_out_options *opts, struct sk_buff *skb)
- {
- u16 options = opts->options; /* mungable copy */
-
-@@ -513,6 +511,9 @@ static void tcp_options_write(__be32 *ptr, struct tcp_sock *tp,
- }
- ptr += (foc->len + 3) >> 2;
- }
-+
-+ if (unlikely(OPTION_MPTCP & opts->options))
-+ mptcp_options_write(ptr, tp, opts, skb);
- }
-
- /* Compute TCP options for SYN packets. This is not the final
-@@ -564,6 +565,8 @@ static unsigned int tcp_syn_options(struct sock *sk, struct sk_buff *skb,
- if (unlikely(!(OPTION_TS & opts->options)))
- remaining -= TCPOLEN_SACKPERM_ALIGNED;
- }
-+ if (tp->request_mptcp || mptcp(tp))
-+ mptcp_syn_options(sk, opts, &remaining);
-
- if (fastopen && fastopen->cookie.len >= 0) {
- u32 need = TCPOLEN_EXP_FASTOPEN_BASE + fastopen->cookie.len;
-@@ -637,6 +640,9 @@ static unsigned int tcp_synack_options(struct sock *sk,
- }
- }
-
-+ if (ireq->saw_mpc)
-+ mptcp_synack_options(req, opts, &remaining);
-+
- return MAX_TCP_OPTION_SPACE - remaining;
- }
-
-@@ -670,16 +676,22 @@ static unsigned int tcp_established_options(struct sock *sk, struct sk_buff *skb
- opts->tsecr = tp->rx_opt.ts_recent;
- size += TCPOLEN_TSTAMP_ALIGNED;
- }
-+ if (mptcp(tp))
-+ mptcp_established_options(sk, skb, opts, &size);
-
- eff_sacks = tp->rx_opt.num_sacks + tp->rx_opt.dsack;
- if (unlikely(eff_sacks)) {
-- const unsigned int remaining = MAX_TCP_OPTION_SPACE - size;
-- opts->num_sack_blocks =
-- min_t(unsigned int, eff_sacks,
-- (remaining - TCPOLEN_SACK_BASE_ALIGNED) /
-- TCPOLEN_SACK_PERBLOCK);
-- size += TCPOLEN_SACK_BASE_ALIGNED +
-- opts->num_sack_blocks * TCPOLEN_SACK_PERBLOCK;
-+ const unsigned remaining = MAX_TCP_OPTION_SPACE - size;
-+ if (remaining < TCPOLEN_SACK_BASE_ALIGNED)
-+ opts->num_sack_blocks = 0;
-+ else
-+ opts->num_sack_blocks =
-+ min_t(unsigned int, eff_sacks,
-+ (remaining - TCPOLEN_SACK_BASE_ALIGNED) /
-+ TCPOLEN_SACK_PERBLOCK);
-+ if (opts->num_sack_blocks)
-+ size += TCPOLEN_SACK_BASE_ALIGNED +
-+ opts->num_sack_blocks * TCPOLEN_SACK_PERBLOCK;
- }
-
- return size;
-@@ -711,8 +723,8 @@ static void tcp_tsq_handler(struct sock *sk)
- if ((1 << sk->sk_state) &
- (TCPF_ESTABLISHED | TCPF_FIN_WAIT1 | TCPF_CLOSING |
- TCPF_CLOSE_WAIT | TCPF_LAST_ACK))
-- tcp_write_xmit(sk, tcp_current_mss(sk), tcp_sk(sk)->nonagle,
-- 0, GFP_ATOMIC);
-+ tcp_sk(sk)->ops->write_xmit(sk, tcp_current_mss(sk),
-+ tcp_sk(sk)->nonagle, 0, GFP_ATOMIC);
- }
- /*
- * One tasklet per cpu tries to send more skbs.
-@@ -727,7 +739,7 @@ static void tcp_tasklet_func(unsigned long data)
- unsigned long flags;
- struct list_head *q, *n;
- struct tcp_sock *tp;
-- struct sock *sk;
-+ struct sock *sk, *meta_sk;
-
- local_irq_save(flags);
- list_splice_init(&tsq->head, &list);
-@@ -738,15 +750,25 @@ static void tcp_tasklet_func(unsigned long data)
- list_del(&tp->tsq_node);
-
- sk = (struct sock *)tp;
-- bh_lock_sock(sk);
-+ meta_sk = mptcp(tp) ? mptcp_meta_sk(sk) : sk;
-+ bh_lock_sock(meta_sk);
-
-- if (!sock_owned_by_user(sk)) {
-+ if (!sock_owned_by_user(meta_sk)) {
- tcp_tsq_handler(sk);
-+ if (mptcp(tp))
-+ tcp_tsq_handler(meta_sk);
- } else {
-+ if (mptcp(tp) && sk->sk_state == TCP_CLOSE)
-+ goto exit;
-+
- /* defer the work to tcp_release_cb() */
- set_bit(TCP_TSQ_DEFERRED, &tp->tsq_flags);
-+
-+ if (mptcp(tp))
-+ mptcp_tsq_flags(sk);
- }
-- bh_unlock_sock(sk);
-+exit:
-+ bh_unlock_sock(meta_sk);
-
- clear_bit(TSQ_QUEUED, &tp->tsq_flags);
- sk_free(sk);
-@@ -756,7 +778,10 @@ static void tcp_tasklet_func(unsigned long data)
- #define TCP_DEFERRED_ALL ((1UL << TCP_TSQ_DEFERRED) | \
- (1UL << TCP_WRITE_TIMER_DEFERRED) | \
- (1UL << TCP_DELACK_TIMER_DEFERRED) | \
-- (1UL << TCP_MTU_REDUCED_DEFERRED))
-+ (1UL << TCP_MTU_REDUCED_DEFERRED) | \
-+ (1UL << MPTCP_PATH_MANAGER) | \
-+ (1UL << MPTCP_SUB_DEFERRED))
-+
- /**
- * tcp_release_cb - tcp release_sock() callback
- * @sk: socket
-@@ -803,6 +828,13 @@ void tcp_release_cb(struct sock *sk)
- sk->sk_prot->mtu_reduced(sk);
- __sock_put(sk);
- }
-+ if (flags & (1UL << MPTCP_PATH_MANAGER)) {
-+ if (tcp_sk(sk)->mpcb->pm_ops->release_sock)
-+ tcp_sk(sk)->mpcb->pm_ops->release_sock(sk);
-+ __sock_put(sk);
-+ }
-+ if (flags & (1UL << MPTCP_SUB_DEFERRED))
-+ mptcp_tsq_sub_deferred(sk);
- }
- EXPORT_SYMBOL(tcp_release_cb);
-
-@@ -862,8 +894,8 @@ void tcp_wfree(struct sk_buff *skb)
- * We are working here with either a clone of the original
- * SKB, or a fresh unique copy made by the retransmit engine.
- */
--static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
-- gfp_t gfp_mask)
-+int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
-+ gfp_t gfp_mask)
- {
- const struct inet_connection_sock *icsk = inet_csk(sk);
- struct inet_sock *inet;
-@@ -933,7 +965,7 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
- */
- th->window = htons(min(tp->rcv_wnd, 65535U));
- } else {
-- th->window = htons(tcp_select_window(sk));
-+ th->window = htons(tp->ops->select_window(sk));
- }
- th->check = 0;
- th->urg_ptr = 0;
-@@ -949,7 +981,7 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
- }
- }
-
-- tcp_options_write((__be32 *)(th + 1), tp, &opts);
-+ tcp_options_write((__be32 *)(th + 1), tp, &opts, skb);
- if (likely((tcb->tcp_flags & TCPHDR_SYN) == 0))
- TCP_ECN_send(sk, skb, tcp_header_size);
-
-@@ -988,7 +1020,7 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
- * NOTE: probe0 timer is not checked, do not forget tcp_push_pending_frames,
- * otherwise socket can stall.
- */
--static void tcp_queue_skb(struct sock *sk, struct sk_buff *skb)
-+void tcp_queue_skb(struct sock *sk, struct sk_buff *skb)
- {
- struct tcp_sock *tp = tcp_sk(sk);
-
-@@ -1001,15 +1033,16 @@ static void tcp_queue_skb(struct sock *sk, struct sk_buff *skb)
- }
-
- /* Initialize TSO segments for a packet. */
--static void tcp_set_skb_tso_segs(const struct sock *sk, struct sk_buff *skb,
-- unsigned int mss_now)
-+void tcp_set_skb_tso_segs(const struct sock *sk, struct sk_buff *skb,
-+ unsigned int mss_now)
- {
- struct skb_shared_info *shinfo = skb_shinfo(skb);
-
- /* Make sure we own this skb before messing gso_size/gso_segs */
- WARN_ON_ONCE(skb_cloned(skb));
-
-- if (skb->len <= mss_now || skb->ip_summed == CHECKSUM_NONE) {
-+ if (skb->len <= mss_now || (is_meta_sk(sk) && !mptcp_sk_can_gso(sk)) ||
-+ (!is_meta_sk(sk) && !sk_can_gso(sk)) || skb->ip_summed == CHECKSUM_NONE) {
- /* Avoid the costly divide in the normal
- * non-TSO case.
- */
-@@ -1041,7 +1074,7 @@ static void tcp_adjust_fackets_out(struct sock *sk, const struct sk_buff *skb,
- /* Pcount in the middle of the write queue got changed, we need to do various
- * tweaks to fix counters
- */
--static void tcp_adjust_pcount(struct sock *sk, const struct sk_buff *skb, int decr)
-+void tcp_adjust_pcount(struct sock *sk, const struct sk_buff *skb, int decr)
- {
- struct tcp_sock *tp = tcp_sk(sk);
-
-@@ -1164,7 +1197,7 @@ int tcp_fragment(struct sock *sk, struct sk_buff *skb, u32 len,
- * eventually). The difference is that pulled data not copied, but
- * immediately discarded.
- */
--static void __pskb_trim_head(struct sk_buff *skb, int len)
-+void __pskb_trim_head(struct sk_buff *skb, int len)
- {
- struct skb_shared_info *shinfo;
- int i, k, eat;
-@@ -1205,6 +1238,9 @@ static void __pskb_trim_head(struct sk_buff *skb, int len)
- /* Remove acked data from a packet in the transmit queue. */
- int tcp_trim_head(struct sock *sk, struct sk_buff *skb, u32 len)
- {
-+ if (mptcp(tcp_sk(sk)) && !is_meta_sk(sk) && mptcp_is_data_seq(skb))
-+ return mptcp_trim_head(sk, skb, len);
-+
- if (skb_unclone(skb, GFP_ATOMIC))
- return -ENOMEM;
-
-@@ -1222,6 +1258,15 @@ int tcp_trim_head(struct sock *sk, struct sk_buff *skb, u32 len)
- if (tcp_skb_pcount(skb) > 1)
- tcp_set_skb_tso_segs(sk, skb, tcp_skb_mss(skb));
-
-+#ifdef CONFIG_MPTCP
-+ /* Some data got acked - we assume that the seq-number reached the dest.
-+ * Anyway, our MPTCP-option has been trimmed above - we lost it here.
-+ * Only remove the SEQ if the call does not come from a meta retransmit.
-+ */
-+ if (mptcp(tcp_sk(sk)) && !is_meta_sk(sk))
-+ TCP_SKB_CB(skb)->mptcp_flags &= ~MPTCPHDR_SEQ;
-+#endif
-+
- return 0;
- }
-
-@@ -1379,6 +1424,7 @@ unsigned int tcp_current_mss(struct sock *sk)
-
- return mss_now;
- }
-+EXPORT_SYMBOL(tcp_current_mss);
-
- /* RFC2861, slow part. Adjust cwnd, after it was not full during one rto.
- * As additional protections, we do not touch cwnd in retransmission phases,
-@@ -1446,8 +1492,8 @@ static bool tcp_minshall_check(const struct tcp_sock *tp)
- * But we can avoid doing the divide again given we already have
- * skb_pcount = skb->len / mss_now
- */
--static void tcp_minshall_update(struct tcp_sock *tp, unsigned int mss_now,
-- const struct sk_buff *skb)
-+void tcp_minshall_update(struct tcp_sock *tp, unsigned int mss_now,
-+ const struct sk_buff *skb)
- {
- if (skb->len < tcp_skb_pcount(skb) * mss_now)
- tp->snd_sml = TCP_SKB_CB(skb)->end_seq;
-@@ -1468,11 +1514,11 @@ static bool tcp_nagle_check(bool partial, const struct tcp_sock *tp,
- (!nonagle && tp->packets_out && tcp_minshall_check(tp)));
- }
- /* Returns the portion of skb which can be sent right away */
--static unsigned int tcp_mss_split_point(const struct sock *sk,
-- const struct sk_buff *skb,
-- unsigned int mss_now,
-- unsigned int max_segs,
-- int nonagle)
-+unsigned int tcp_mss_split_point(const struct sock *sk,
-+ const struct sk_buff *skb,
-+ unsigned int mss_now,
-+ unsigned int max_segs,
-+ int nonagle)
- {
- const struct tcp_sock *tp = tcp_sk(sk);
- u32 partial, needed, window, max_len;
-@@ -1502,13 +1548,14 @@ static unsigned int tcp_mss_split_point(const struct sock *sk,
- /* Can at least one segment of SKB be sent right now, according to the
- * congestion window rules? If so, return how many segments are allowed.
- */
--static inline unsigned int tcp_cwnd_test(const struct tcp_sock *tp,
-- const struct sk_buff *skb)
-+unsigned int tcp_cwnd_test(const struct tcp_sock *tp,
-+ const struct sk_buff *skb)
- {
- u32 in_flight, cwnd;
-
- /* Don't be strict about the congestion window for the final FIN. */
-- if ((TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN) &&
-+ if (skb &&
-+ (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN) &&
- tcp_skb_pcount(skb) == 1)
- return 1;
-
-@@ -1524,8 +1571,8 @@ static inline unsigned int tcp_cwnd_test(const struct tcp_sock *tp,
- * This must be invoked the first time we consider transmitting
- * SKB onto the wire.
- */
--static int tcp_init_tso_segs(const struct sock *sk, struct sk_buff *skb,
-- unsigned int mss_now)
-+int tcp_init_tso_segs(const struct sock *sk, struct sk_buff *skb,
-+ unsigned int mss_now)
- {
- int tso_segs = tcp_skb_pcount(skb);
-
-@@ -1540,8 +1587,8 @@ static int tcp_init_tso_segs(const struct sock *sk, struct sk_buff *skb,
- /* Return true if the Nagle test allows this packet to be
- * sent now.
- */
--static inline bool tcp_nagle_test(const struct tcp_sock *tp, const struct sk_buff *skb,
-- unsigned int cur_mss, int nonagle)
-+bool tcp_nagle_test(const struct tcp_sock *tp, const struct sk_buff *skb,
-+ unsigned int cur_mss, int nonagle)
- {
- /* Nagle rule does not apply to frames, which sit in the middle of the
- * write_queue (they have no chances to get new data).
-@@ -1553,7 +1600,8 @@ static inline bool tcp_nagle_test(const struct tcp_sock *tp, const struct sk_buf
- return true;
-
- /* Don't use the nagle rule for urgent data (or for the final FIN). */
-- if (tcp_urg_mode(tp) || (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN))
-+ if (tcp_urg_mode(tp) || (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN) ||
-+ mptcp_is_data_fin(skb))
- return true;
-
- if (!tcp_nagle_check(skb->len < cur_mss, tp, nonagle))
-@@ -1563,9 +1611,8 @@ static inline bool tcp_nagle_test(const struct tcp_sock *tp, const struct sk_buf
- }
-
- /* Does at least the first segment of SKB fit into the send window? */
--static bool tcp_snd_wnd_test(const struct tcp_sock *tp,
-- const struct sk_buff *skb,
-- unsigned int cur_mss)
-+bool tcp_snd_wnd_test(const struct tcp_sock *tp, const struct sk_buff *skb,
-+ unsigned int cur_mss)
- {
- u32 end_seq = TCP_SKB_CB(skb)->end_seq;
-
-@@ -1676,7 +1723,7 @@ static bool tcp_tso_should_defer(struct sock *sk, struct sk_buff *skb,
- u32 send_win, cong_win, limit, in_flight;
- int win_divisor;
-
-- if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN)
-+ if ((TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN) || mptcp_is_data_fin(skb))
- goto send_now;
-
- if (icsk->icsk_ca_state != TCP_CA_Open)
-@@ -1888,7 +1935,7 @@ static int tcp_mtu_probe(struct sock *sk)
- * Returns true, if no segments are in flight and we have queued segments,
- * but cannot send anything now because of SWS or another problem.
- */
--static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
-+bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
- int push_one, gfp_t gfp)
- {
- struct tcp_sock *tp = tcp_sk(sk);
-@@ -1900,7 +1947,11 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
-
- sent_pkts = 0;
-
-- if (!push_one) {
-+ /* pmtu not yet supported with MPTCP. Should be possible, by early
-+ * exiting the loop inside tcp_mtu_probe, making sure that only one
-+ * single DSS-mapping gets probed.
-+ */
-+ if (!push_one && !mptcp(tp)) {
- /* Do MTU probing. */
- result = tcp_mtu_probe(sk);
- if (!result) {
-@@ -2099,7 +2150,8 @@ void tcp_send_loss_probe(struct sock *sk)
- int err = -1;
-
- if (tcp_send_head(sk) != NULL) {
-- err = tcp_write_xmit(sk, mss, TCP_NAGLE_OFF, 2, GFP_ATOMIC);
-+ err = tp->ops->write_xmit(sk, mss, TCP_NAGLE_OFF, 2,
-+ GFP_ATOMIC);
- goto rearm_timer;
- }
-
-@@ -2159,8 +2211,8 @@ void __tcp_push_pending_frames(struct sock *sk, unsigned int cur_mss,
- if (unlikely(sk->sk_state == TCP_CLOSE))
- return;
-
-- if (tcp_write_xmit(sk, cur_mss, nonagle, 0,
-- sk_gfp_atomic(sk, GFP_ATOMIC)))
-+ if (tcp_sk(sk)->ops->write_xmit(sk, cur_mss, nonagle, 0,
-+ sk_gfp_atomic(sk, GFP_ATOMIC)))
- tcp_check_probe_timer(sk);
- }
-
-@@ -2173,7 +2225,8 @@ void tcp_push_one(struct sock *sk, unsigned int mss_now)
-
- BUG_ON(!skb || skb->len < mss_now);
-
-- tcp_write_xmit(sk, mss_now, TCP_NAGLE_PUSH, 1, sk->sk_allocation);
-+ tcp_sk(sk)->ops->write_xmit(sk, mss_now, TCP_NAGLE_PUSH, 1,
-+ sk->sk_allocation);
- }
-
- /* This function returns the amount that we can raise the
-@@ -2386,6 +2439,10 @@ static void tcp_retrans_try_collapse(struct sock *sk, struct sk_buff *to,
- if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_SYN)
- return;
-
-+ /* Currently not supported for MPTCP - but it should be possible */
-+ if (mptcp(tp))
-+ return;
-+
- tcp_for_write_queue_from_safe(skb, tmp, sk) {
- if (!tcp_can_collapse(sk, skb))
- break;
-@@ -2843,7 +2900,7 @@ struct sk_buff *tcp_make_synack(struct sock *sk, struct dst_entry *dst,
-
- /* RFC1323: The window in SYN & SYN/ACK segments is never scaled. */
- th->window = htons(min(req->rcv_wnd, 65535U));
-- tcp_options_write((__be32 *)(th + 1), tp, &opts);
-+ tcp_options_write((__be32 *)(th + 1), tp, &opts, skb);
- th->doff = (tcp_header_size >> 2);
- TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_OUTSEGS);
-
-@@ -2897,13 +2954,13 @@ static void tcp_connect_init(struct sock *sk)
- (tp->window_clamp > tcp_full_space(sk) || tp->window_clamp == 0))
- tp->window_clamp = tcp_full_space(sk);
-
-- tcp_select_initial_window(tcp_full_space(sk),
-- tp->advmss - (tp->rx_opt.ts_recent_stamp ? tp->tcp_header_len - sizeof(struct tcphdr) : 0),
-- &tp->rcv_wnd,
-- &tp->window_clamp,
-- sysctl_tcp_window_scaling,
-- &rcv_wscale,
-- dst_metric(dst, RTAX_INITRWND));
-+ tp->ops->select_initial_window(tcp_full_space(sk),
-+ tp->advmss - (tp->rx_opt.ts_recent_stamp ? tp->tcp_header_len - sizeof(struct tcphdr) : 0),
-+ &tp->rcv_wnd,
-+ &tp->window_clamp,
-+ sysctl_tcp_window_scaling,
-+ &rcv_wscale,
-+ dst_metric(dst, RTAX_INITRWND), sk);
-
- tp->rx_opt.rcv_wscale = rcv_wscale;
- tp->rcv_ssthresh = tp->rcv_wnd;
-@@ -2927,6 +2984,36 @@ static void tcp_connect_init(struct sock *sk)
- inet_csk(sk)->icsk_rto = TCP_TIMEOUT_INIT;
- inet_csk(sk)->icsk_retransmits = 0;
- tcp_clear_retrans(tp);
-+
-+#ifdef CONFIG_MPTCP
-+ if (sysctl_mptcp_enabled && mptcp_doit(sk)) {
-+ if (is_master_tp(tp)) {
-+ tp->request_mptcp = 1;
-+ mptcp_connect_init(sk);
-+ } else if (tp->mptcp) {
-+ struct inet_sock *inet = inet_sk(sk);
-+
-+ tp->mptcp->snt_isn = tp->write_seq;
-+ tp->mptcp->init_rcv_wnd = tp->rcv_wnd;
-+
-+ /* Set nonce for new subflows */
-+ if (sk->sk_family == AF_INET)
-+ tp->mptcp->mptcp_loc_nonce = mptcp_v4_get_nonce(
-+ inet->inet_saddr,
-+ inet->inet_daddr,
-+ inet->inet_sport,
-+ inet->inet_dport);
-+#if IS_ENABLED(CONFIG_IPV6)
-+ else
-+ tp->mptcp->mptcp_loc_nonce = mptcp_v6_get_nonce(
-+ inet6_sk(sk)->saddr.s6_addr32,
-+ sk->sk_v6_daddr.s6_addr32,
-+ inet->inet_sport,
-+ inet->inet_dport);
-+#endif
-+ }
-+ }
-+#endif
- }
-
- static void tcp_connect_queue_skb(struct sock *sk, struct sk_buff *skb)
-@@ -3176,6 +3263,7 @@ void tcp_send_ack(struct sock *sk)
- TCP_SKB_CB(buff)->when = tcp_time_stamp;
- tcp_transmit_skb(sk, buff, 0, sk_gfp_atomic(sk, GFP_ATOMIC));
- }
-+EXPORT_SYMBOL(tcp_send_ack);
-
- /* This routine sends a packet with an out of date sequence
- * number. It assumes the other end will try to ack it.
-@@ -3188,7 +3276,7 @@ void tcp_send_ack(struct sock *sk)
- * one is with SEG.SEQ=SND.UNA to deliver urgent pointer, another is
- * out-of-date with SND.UNA-1 to probe window.
- */
--static int tcp_xmit_probe_skb(struct sock *sk, int urgent)
-+int tcp_xmit_probe_skb(struct sock *sk, int urgent)
- {
- struct tcp_sock *tp = tcp_sk(sk);
- struct sk_buff *skb;
-@@ -3270,7 +3358,7 @@ void tcp_send_probe0(struct sock *sk)
- struct tcp_sock *tp = tcp_sk(sk);
- int err;
-
-- err = tcp_write_wakeup(sk);
-+ err = tp->ops->write_wakeup(sk);
-
- if (tp->packets_out || !tcp_send_head(sk)) {
- /* Cancel probe timer, if it is not required. */
-@@ -3301,3 +3389,18 @@ void tcp_send_probe0(struct sock *sk)
- TCP_RTO_MAX);
- }
- }
-+
-+int tcp_rtx_synack(struct sock *sk, struct request_sock *req)
-+{
-+ const struct tcp_request_sock_ops *af_ops = tcp_rsk(req)->af_specific;
-+ struct flowi fl;
-+ int res;
-+
-+ res = af_ops->send_synack(sk, NULL, &fl, req, 0, NULL);
-+ if (!res) {
-+ TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_RETRANSSEGS);
-+ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPSYNRETRANS);
-+ }
-+ return res;
-+}
-+EXPORT_SYMBOL(tcp_rtx_synack);
-diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
-index 286227abed10..966b873cbf3e 100644
---- a/net/ipv4/tcp_timer.c
-+++ b/net/ipv4/tcp_timer.c
-@@ -20,6 +20,7 @@
-
- #include <linux/module.h>
- #include <linux/gfp.h>
-+#include <net/mptcp.h>
- #include <net/tcp.h>
-
- int sysctl_tcp_syn_retries __read_mostly = TCP_SYN_RETRIES;
-@@ -32,7 +33,7 @@ int sysctl_tcp_retries2 __read_mostly = TCP_RETR2;
- int sysctl_tcp_orphan_retries __read_mostly;
- int sysctl_tcp_thin_linear_timeouts __read_mostly;
-
--static void tcp_write_err(struct sock *sk)
-+void tcp_write_err(struct sock *sk)
- {
- sk->sk_err = sk->sk_err_soft ? : ETIMEDOUT;
- sk->sk_error_report(sk);
-@@ -74,7 +75,7 @@ static int tcp_out_of_resources(struct sock *sk, int do_reset)
- (!tp->snd_wnd && !tp->packets_out))
- do_reset = 1;
- if (do_reset)
-- tcp_send_active_reset(sk, GFP_ATOMIC);
-+ tp->ops->send_active_reset(sk, GFP_ATOMIC);
- tcp_done(sk);
- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPABORTONMEMORY);
- return 1;
-@@ -124,10 +125,8 @@ static void tcp_mtu_probing(struct inet_connection_sock *icsk, struct sock *sk)
- * retransmissions with an initial RTO of TCP_RTO_MIN or TCP_TIMEOUT_INIT if
- * syn_set flag is set.
- */
--static bool retransmits_timed_out(struct sock *sk,
-- unsigned int boundary,
-- unsigned int timeout,
-- bool syn_set)
-+bool retransmits_timed_out(struct sock *sk, unsigned int boundary,
-+ unsigned int timeout, bool syn_set)
- {
- unsigned int linear_backoff_thresh, start_ts;
- unsigned int rto_base = syn_set ? TCP_TIMEOUT_INIT : TCP_RTO_MIN;
-@@ -153,7 +152,7 @@ static bool retransmits_timed_out(struct sock *sk,
- }
-
- /* A write timeout has occurred. Process the after effects. */
--static int tcp_write_timeout(struct sock *sk)
-+int tcp_write_timeout(struct sock *sk)
- {
- struct inet_connection_sock *icsk = inet_csk(sk);
- struct tcp_sock *tp = tcp_sk(sk);
-@@ -171,6 +170,10 @@ static int tcp_write_timeout(struct sock *sk)
- }
- retry_until = icsk->icsk_syn_retries ? : sysctl_tcp_syn_retries;
- syn_set = true;
-+ /* Stop retransmitting MP_CAPABLE options in SYN if timed out. */
-+ if (tcp_sk(sk)->request_mptcp &&
-+ icsk->icsk_retransmits >= mptcp_sysctl_syn_retries())
-+ tcp_sk(sk)->request_mptcp = 0;
- } else {
- if (retransmits_timed_out(sk, sysctl_tcp_retries1, 0, 0)) {
- /* Black hole detection */
-@@ -251,18 +254,22 @@ out:
- static void tcp_delack_timer(unsigned long data)
- {
- struct sock *sk = (struct sock *)data;
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ struct sock *meta_sk = mptcp(tp) ? mptcp_meta_sk(sk) : sk;
-
-- bh_lock_sock(sk);
-- if (!sock_owned_by_user(sk)) {
-+ bh_lock_sock(meta_sk);
-+ if (!sock_owned_by_user(meta_sk)) {
- tcp_delack_timer_handler(sk);
- } else {
- inet_csk(sk)->icsk_ack.blocked = 1;
-- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_DELAYEDACKLOCKED);
-+ NET_INC_STATS_BH(sock_net(meta_sk), LINUX_MIB_DELAYEDACKLOCKED);
- /* deleguate our work to tcp_release_cb() */
- if (!test_and_set_bit(TCP_DELACK_TIMER_DEFERRED, &tcp_sk(sk)->tsq_flags))
- sock_hold(sk);
-+ if (mptcp(tp))
-+ mptcp_tsq_flags(sk);
- }
-- bh_unlock_sock(sk);
-+ bh_unlock_sock(meta_sk);
- sock_put(sk);
- }
-
-@@ -479,6 +486,10 @@ out_reset_timer:
- __sk_dst_reset(sk);
-
- out:;
-+ if (mptcp(tp)) {
-+ mptcp_reinject_data(sk, 1);
-+ mptcp_set_rto(sk);
-+ }
- }
-
- void tcp_write_timer_handler(struct sock *sk)
-@@ -505,7 +516,7 @@ void tcp_write_timer_handler(struct sock *sk)
- break;
- case ICSK_TIME_RETRANS:
- icsk->icsk_pending = 0;
-- tcp_retransmit_timer(sk);
-+ tcp_sk(sk)->ops->retransmit_timer(sk);
- break;
- case ICSK_TIME_PROBE0:
- icsk->icsk_pending = 0;
-@@ -520,16 +531,19 @@ out:
- static void tcp_write_timer(unsigned long data)
- {
- struct sock *sk = (struct sock *)data;
-+ struct sock *meta_sk = mptcp(tcp_sk(sk)) ? mptcp_meta_sk(sk) : sk;
-
-- bh_lock_sock(sk);
-- if (!sock_owned_by_user(sk)) {
-+ bh_lock_sock(meta_sk);
-+ if (!sock_owned_by_user(meta_sk)) {
- tcp_write_timer_handler(sk);
- } else {
- /* deleguate our work to tcp_release_cb() */
- if (!test_and_set_bit(TCP_WRITE_TIMER_DEFERRED, &tcp_sk(sk)->tsq_flags))
- sock_hold(sk);
-+ if (mptcp(tcp_sk(sk)))
-+ mptcp_tsq_flags(sk);
- }
-- bh_unlock_sock(sk);
-+ bh_unlock_sock(meta_sk);
- sock_put(sk);
- }
-
-@@ -566,11 +580,12 @@ static void tcp_keepalive_timer (unsigned long data)
- struct sock *sk = (struct sock *) data;
- struct inet_connection_sock *icsk = inet_csk(sk);
- struct tcp_sock *tp = tcp_sk(sk);
-+ struct sock *meta_sk = mptcp(tp) ? mptcp_meta_sk(sk) : sk;
- u32 elapsed;
-
- /* Only process if socket is not in use. */
-- bh_lock_sock(sk);
-- if (sock_owned_by_user(sk)) {
-+ bh_lock_sock(meta_sk);
-+ if (sock_owned_by_user(meta_sk)) {
- /* Try again later. */
- inet_csk_reset_keepalive_timer (sk, HZ/20);
- goto out;
-@@ -581,16 +596,38 @@ static void tcp_keepalive_timer (unsigned long data)
- goto out;
- }
-
-+ if (tp->send_mp_fclose) {
-+ /* MUST do this before tcp_write_timeout, because retrans_stamp
-+ * may have been set to 0 in another part while we are
-+ * retransmitting MP_FASTCLOSE. Then, we would crash, because
-+ * retransmits_timed_out accesses the meta-write-queue.
-+ *
-+ * We make sure that the timestamp is != 0.
-+ */
-+ if (!tp->retrans_stamp)
-+ tp->retrans_stamp = tcp_time_stamp ? : 1;
-+
-+ if (tcp_write_timeout(sk))
-+ goto out;
-+
-+ tcp_send_ack(sk);
-+ icsk->icsk_retransmits++;
-+
-+ icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX);
-+ elapsed = icsk->icsk_rto;
-+ goto resched;
-+ }
-+
- if (sk->sk_state == TCP_FIN_WAIT2 && sock_flag(sk, SOCK_DEAD)) {
- if (tp->linger2 >= 0) {
- const int tmo = tcp_fin_time(sk) - TCP_TIMEWAIT_LEN;
-
- if (tmo > 0) {
-- tcp_time_wait(sk, TCP_FIN_WAIT2, tmo);
-+ tp->ops->time_wait(sk, TCP_FIN_WAIT2, tmo);
- goto out;
- }
- }
-- tcp_send_active_reset(sk, GFP_ATOMIC);
-+ tp->ops->send_active_reset(sk, GFP_ATOMIC);
- goto death;
- }
-
-@@ -614,11 +651,11 @@ static void tcp_keepalive_timer (unsigned long data)
- icsk->icsk_probes_out > 0) ||
- (icsk->icsk_user_timeout == 0 &&
- icsk->icsk_probes_out >= keepalive_probes(tp))) {
-- tcp_send_active_reset(sk, GFP_ATOMIC);
-+ tp->ops->send_active_reset(sk, GFP_ATOMIC);
- tcp_write_err(sk);
- goto out;
- }
-- if (tcp_write_wakeup(sk) <= 0) {
-+ if (tp->ops->write_wakeup(sk) <= 0) {
- icsk->icsk_probes_out++;
- elapsed = keepalive_intvl_when(tp);
- } else {
-@@ -642,7 +679,7 @@ death:
- tcp_done(sk);
-
- out:
-- bh_unlock_sock(sk);
-+ bh_unlock_sock(meta_sk);
- sock_put(sk);
- }
-
-diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
-index 5667b3003af9..7139c2973fd2 100644
---- a/net/ipv6/addrconf.c
-+++ b/net/ipv6/addrconf.c
-@@ -760,6 +760,7 @@ void inet6_ifa_finish_destroy(struct inet6_ifaddr *ifp)
-
- kfree_rcu(ifp, rcu);
- }
-+EXPORT_SYMBOL(inet6_ifa_finish_destroy);
-
- static void
- ipv6_link_dev_addr(struct inet6_dev *idev, struct inet6_ifaddr *ifp)
-diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
-index 7cb4392690dd..7057afbca4df 100644
---- a/net/ipv6/af_inet6.c
-+++ b/net/ipv6/af_inet6.c
-@@ -97,8 +97,7 @@ static __inline__ struct ipv6_pinfo *inet6_sk_generic(struct sock *sk)
- return (struct ipv6_pinfo *)(((u8 *)sk) + offset);
- }
-
--static int inet6_create(struct net *net, struct socket *sock, int protocol,
-- int kern)
-+int inet6_create(struct net *net, struct socket *sock, int protocol, int kern)
- {
- struct inet_sock *inet;
- struct ipv6_pinfo *np;
-diff --git a/net/ipv6/inet6_connection_sock.c b/net/ipv6/inet6_connection_sock.c
-index a245e5ddffbd..99c892b8992d 100644
---- a/net/ipv6/inet6_connection_sock.c
-+++ b/net/ipv6/inet6_connection_sock.c
-@@ -96,8 +96,8 @@ struct dst_entry *inet6_csk_route_req(struct sock *sk,
- /*
- * request_sock (formerly open request) hash tables.
- */
--static u32 inet6_synq_hash(const struct in6_addr *raddr, const __be16 rport,
-- const u32 rnd, const u32 synq_hsize)
-+u32 inet6_synq_hash(const struct in6_addr *raddr, const __be16 rport,
-+ const u32 rnd, const u32 synq_hsize)
- {
- u32 c;
-
-diff --git a/net/ipv6/ipv6_sockglue.c b/net/ipv6/ipv6_sockglue.c
-index edb58aff4ae7..ea4d9fda0927 100644
---- a/net/ipv6/ipv6_sockglue.c
-+++ b/net/ipv6/ipv6_sockglue.c
-@@ -48,6 +48,8 @@
- #include <net/addrconf.h>
- #include <net/inet_common.h>
- #include <net/tcp.h>
-+#include <net/mptcp.h>
-+#include <net/mptcp_v4.h>
- #include <net/udp.h>
- #include <net/udplite.h>
- #include <net/xfrm.h>
-@@ -196,7 +198,12 @@ static int do_ipv6_setsockopt(struct sock *sk, int level, int optname,
- sock_prot_inuse_add(net, &tcp_prot, 1);
- local_bh_enable();
- sk->sk_prot = &tcp_prot;
-- icsk->icsk_af_ops = &ipv4_specific;
-+#ifdef CONFIG_MPTCP
-+ if (is_mptcp_enabled(sk))
-+ icsk->icsk_af_ops = &mptcp_v4_specific;
-+ else
-+#endif
-+ icsk->icsk_af_ops = &ipv4_specific;
- sk->sk_socket->ops = &inet_stream_ops;
- sk->sk_family = PF_INET;
- tcp_sync_mss(sk, icsk->icsk_pmtu_cookie);
-diff --git a/net/ipv6/syncookies.c b/net/ipv6/syncookies.c
-index a822b880689b..b2b38869d795 100644
---- a/net/ipv6/syncookies.c
-+++ b/net/ipv6/syncookies.c
-@@ -181,13 +181,13 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
-
- /* check for timestamp cookie support */
- memset(&tcp_opt, 0, sizeof(tcp_opt));
-- tcp_parse_options(skb, &tcp_opt, 0, NULL);
-+ tcp_parse_options(skb, &tcp_opt, NULL, 0, NULL);
-
- if (!cookie_check_timestamp(&tcp_opt, sock_net(sk), &ecn_ok))
- goto out;
-
- ret = NULL;
-- req = inet6_reqsk_alloc(&tcp6_request_sock_ops);
-+ req = inet_reqsk_alloc(&tcp6_request_sock_ops);
- if (!req)
- goto out;
-
-@@ -255,10 +255,10 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
- }
-
- req->window_clamp = tp->window_clamp ? :dst_metric(dst, RTAX_WINDOW);
-- tcp_select_initial_window(tcp_full_space(sk), req->mss,
-- &req->rcv_wnd, &req->window_clamp,
-- ireq->wscale_ok, &rcv_wscale,
-- dst_metric(dst, RTAX_INITRWND));
-+ tp->ops->select_initial_window(tcp_full_space(sk), req->mss,
-+ &req->rcv_wnd, &req->window_clamp,
-+ ireq->wscale_ok, &rcv_wscale,
-+ dst_metric(dst, RTAX_INITRWND), sk);
-
- ireq->rcv_wscale = rcv_wscale;
-
-diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
-index 229239ad96b1..fda94d71666e 100644
---- a/net/ipv6/tcp_ipv6.c
-+++ b/net/ipv6/tcp_ipv6.c
-@@ -63,6 +63,8 @@
- #include <net/inet_common.h>
- #include <net/secure_seq.h>
- #include <net/tcp_memcontrol.h>
-+#include <net/mptcp.h>
-+#include <net/mptcp_v6.h>
- #include <net/busy_poll.h>
-
- #include <linux/proc_fs.h>
-@@ -71,12 +73,6 @@
- #include <linux/crypto.h>
- #include <linux/scatterlist.h>
-
--static void tcp_v6_send_reset(struct sock *sk, struct sk_buff *skb);
--static void tcp_v6_reqsk_send_ack(struct sock *sk, struct sk_buff *skb,
-- struct request_sock *req);
--
--static int tcp_v6_do_rcv(struct sock *sk, struct sk_buff *skb);
--
- static const struct inet_connection_sock_af_ops ipv6_mapped;
- static const struct inet_connection_sock_af_ops ipv6_specific;
- #ifdef CONFIG_TCP_MD5SIG
-@@ -90,7 +86,7 @@ static struct tcp_md5sig_key *tcp_v6_md5_do_lookup(struct sock *sk,
- }
- #endif
-
--static void inet6_sk_rx_dst_set(struct sock *sk, const struct sk_buff *skb)
-+void inet6_sk_rx_dst_set(struct sock *sk, const struct sk_buff *skb)
- {
- struct dst_entry *dst = skb_dst(skb);
- const struct rt6_info *rt = (const struct rt6_info *)dst;
-@@ -102,10 +98,11 @@ static void inet6_sk_rx_dst_set(struct sock *sk, const struct sk_buff *skb)
- inet6_sk(sk)->rx_dst_cookie = rt->rt6i_node->fn_sernum;
- }
-
--static void tcp_v6_hash(struct sock *sk)
-+void tcp_v6_hash(struct sock *sk)
- {
- if (sk->sk_state != TCP_CLOSE) {
-- if (inet_csk(sk)->icsk_af_ops == &ipv6_mapped) {
-+ if (inet_csk(sk)->icsk_af_ops == &ipv6_mapped ||
-+ inet_csk(sk)->icsk_af_ops == &mptcp_v6_mapped) {
- tcp_prot.hash(sk);
- return;
- }
-@@ -115,7 +112,7 @@ static void tcp_v6_hash(struct sock *sk)
- }
- }
-
--static __u32 tcp_v6_init_sequence(const struct sk_buff *skb)
-+__u32 tcp_v6_init_sequence(const struct sk_buff *skb)
- {
- return secure_tcpv6_sequence_number(ipv6_hdr(skb)->daddr.s6_addr32,
- ipv6_hdr(skb)->saddr.s6_addr32,
-@@ -123,7 +120,7 @@ static __u32 tcp_v6_init_sequence(const struct sk_buff *skb)
- tcp_hdr(skb)->source);
- }
-
--static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
-+int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
- int addr_len)
- {
- struct sockaddr_in6 *usin = (struct sockaddr_in6 *) uaddr;
-@@ -215,7 +212,12 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
- sin.sin_port = usin->sin6_port;
- sin.sin_addr.s_addr = usin->sin6_addr.s6_addr32[3];
-
-- icsk->icsk_af_ops = &ipv6_mapped;
-+#ifdef CONFIG_MPTCP
-+ if (is_mptcp_enabled(sk))
-+ icsk->icsk_af_ops = &mptcp_v6_mapped;
-+ else
-+#endif
-+ icsk->icsk_af_ops = &ipv6_mapped;
- sk->sk_backlog_rcv = tcp_v4_do_rcv;
- #ifdef CONFIG_TCP_MD5SIG
- tp->af_specific = &tcp_sock_ipv6_mapped_specific;
-@@ -225,7 +227,12 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
-
- if (err) {
- icsk->icsk_ext_hdr_len = exthdrlen;
-- icsk->icsk_af_ops = &ipv6_specific;
-+#ifdef CONFIG_MPTCP
-+ if (is_mptcp_enabled(sk))
-+ icsk->icsk_af_ops = &mptcp_v6_specific;
-+ else
-+#endif
-+ icsk->icsk_af_ops = &ipv6_specific;
- sk->sk_backlog_rcv = tcp_v6_do_rcv;
- #ifdef CONFIG_TCP_MD5SIG
- tp->af_specific = &tcp_sock_ipv6_specific;
-@@ -337,7 +344,7 @@ static void tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
- const struct ipv6hdr *hdr = (const struct ipv6hdr *)skb->data;
- const struct tcphdr *th = (struct tcphdr *)(skb->data+offset);
- struct ipv6_pinfo *np;
-- struct sock *sk;
-+ struct sock *sk, *meta_sk;
- int err;
- struct tcp_sock *tp;
- struct request_sock *fastopen;
-@@ -358,8 +365,14 @@ static void tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
- return;
- }
-
-- bh_lock_sock(sk);
-- if (sock_owned_by_user(sk) && type != ICMPV6_PKT_TOOBIG)
-+ tp = tcp_sk(sk);
-+ if (mptcp(tp))
-+ meta_sk = mptcp_meta_sk(sk);
-+ else
-+ meta_sk = sk;
-+
-+ bh_lock_sock(meta_sk);
-+ if (sock_owned_by_user(meta_sk) && type != ICMPV6_PKT_TOOBIG)
- NET_INC_STATS_BH(net, LINUX_MIB_LOCKDROPPEDICMPS);
-
- if (sk->sk_state == TCP_CLOSE)
-@@ -370,7 +383,6 @@ static void tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
- goto out;
- }
-
-- tp = tcp_sk(sk);
- seq = ntohl(th->seq);
- /* XXX (TFO) - tp->snd_una should be ISN (tcp_create_openreq_child() */
- fastopen = tp->fastopen_rsk;
-@@ -403,11 +415,15 @@ static void tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
- goto out;
-
- tp->mtu_info = ntohl(info);
-- if (!sock_owned_by_user(sk))
-+ if (!sock_owned_by_user(meta_sk))
- tcp_v6_mtu_reduced(sk);
-- else if (!test_and_set_bit(TCP_MTU_REDUCED_DEFERRED,
-+ else {
-+ if (!test_and_set_bit(TCP_MTU_REDUCED_DEFERRED,
- &tp->tsq_flags))
-- sock_hold(sk);
-+ sock_hold(sk);
-+ if (mptcp(tp))
-+ mptcp_tsq_flags(sk);
-+ }
- goto out;
- }
-
-@@ -417,7 +433,7 @@ static void tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
- switch (sk->sk_state) {
- struct request_sock *req, **prev;
- case TCP_LISTEN:
-- if (sock_owned_by_user(sk))
-+ if (sock_owned_by_user(meta_sk))
- goto out;
-
- req = inet6_csk_search_req(sk, &prev, th->dest, &hdr->daddr,
-@@ -447,7 +463,7 @@ static void tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
- if (fastopen && fastopen->sk == NULL)
- break;
-
-- if (!sock_owned_by_user(sk)) {
-+ if (!sock_owned_by_user(meta_sk)) {
- sk->sk_err = err;
- sk->sk_error_report(sk); /* Wake people up to see the error (see connect in sock.c) */
-
-@@ -457,26 +473,27 @@ static void tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
- goto out;
- }
-
-- if (!sock_owned_by_user(sk) && np->recverr) {
-+ if (!sock_owned_by_user(meta_sk) && np->recverr) {
- sk->sk_err = err;
- sk->sk_error_report(sk);
- } else
- sk->sk_err_soft = err;
-
- out:
-- bh_unlock_sock(sk);
-+ bh_unlock_sock(meta_sk);
- sock_put(sk);
- }
-
-
--static int tcp_v6_send_synack(struct sock *sk, struct dst_entry *dst,
-- struct flowi6 *fl6,
-- struct request_sock *req,
-- u16 queue_mapping,
-- struct tcp_fastopen_cookie *foc)
-+int tcp_v6_send_synack(struct sock *sk, struct dst_entry *dst,
-+ struct flowi *fl,
-+ struct request_sock *req,
-+ u16 queue_mapping,
-+ struct tcp_fastopen_cookie *foc)
- {
- struct inet_request_sock *ireq = inet_rsk(req);
- struct ipv6_pinfo *np = inet6_sk(sk);
-+ struct flowi6 *fl6 = &fl->u.ip6;
- struct sk_buff *skb;
- int err = -ENOMEM;
-
-@@ -497,18 +514,21 @@ static int tcp_v6_send_synack(struct sock *sk, struct dst_entry *dst,
- skb_set_queue_mapping(skb, queue_mapping);
- err = ip6_xmit(sk, skb, fl6, np->opt, np->tclass);
- err = net_xmit_eval(err);
-+ if (!tcp_rsk(req)->snt_synack && !err)
-+ tcp_rsk(req)->snt_synack = tcp_time_stamp;
- }
-
- done:
- return err;
- }
-
--static int tcp_v6_rtx_synack(struct sock *sk, struct request_sock *req)
-+int tcp_v6_rtx_synack(struct sock *sk, struct request_sock *req)
- {
-- struct flowi6 fl6;
-+ const struct tcp_request_sock_ops *af_ops = tcp_rsk(req)->af_specific;
-+ struct flowi fl;
- int res;
-
-- res = tcp_v6_send_synack(sk, NULL, &fl6, req, 0, NULL);
-+ res = af_ops->send_synack(sk, NULL, &fl, req, 0, NULL);
- if (!res) {
- TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_RETRANSSEGS);
- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPSYNRETRANS);
-@@ -516,7 +536,7 @@ static int tcp_v6_rtx_synack(struct sock *sk, struct request_sock *req)
- return res;
- }
-
--static void tcp_v6_reqsk_destructor(struct request_sock *req)
-+void tcp_v6_reqsk_destructor(struct request_sock *req)
- {
- kfree_skb(inet_rsk(req)->pktopts);
- }
-@@ -718,27 +738,74 @@ static int tcp_v6_inbound_md5_hash(struct sock *sk, const struct sk_buff *skb)
- }
- #endif
-
-+static int tcp_v6_init_req(struct request_sock *req, struct sock *sk,
-+ struct sk_buff *skb)
-+{
-+ struct inet_request_sock *ireq = inet_rsk(req);
-+ struct ipv6_pinfo *np = inet6_sk(sk);
-+
-+ ireq->ir_v6_rmt_addr = ipv6_hdr(skb)->saddr;
-+ ireq->ir_v6_loc_addr = ipv6_hdr(skb)->daddr;
-+
-+ ireq->ir_iif = sk->sk_bound_dev_if;
-+ ireq->ir_mark = inet_request_mark(sk, skb);
-+
-+ /* So that link locals have meaning */
-+ if (!sk->sk_bound_dev_if &&
-+ ipv6_addr_type(&ireq->ir_v6_rmt_addr) & IPV6_ADDR_LINKLOCAL)
-+ ireq->ir_iif = inet6_iif(skb);
-+
-+ if (!TCP_SKB_CB(skb)->when &&
-+ (ipv6_opt_accepted(sk, skb) || np->rxopt.bits.rxinfo ||
-+ np->rxopt.bits.rxoinfo || np->rxopt.bits.rxhlim ||
-+ np->rxopt.bits.rxohlim || np->repflow)) {
-+ atomic_inc(&skb->users);
-+ ireq->pktopts = skb;
-+ }
-+
-+ return 0;
-+}
-+
-+static struct dst_entry *tcp_v6_route_req(struct sock *sk, struct flowi *fl,
-+ const struct request_sock *req,
-+ bool *strict)
-+{
-+ if (strict)
-+ *strict = true;
-+ return inet6_csk_route_req(sk, &fl->u.ip6, req);
-+}
-+
- struct request_sock_ops tcp6_request_sock_ops __read_mostly = {
- .family = AF_INET6,
- .obj_size = sizeof(struct tcp6_request_sock),
-- .rtx_syn_ack = tcp_v6_rtx_synack,
-+ .rtx_syn_ack = tcp_rtx_synack,
- .send_ack = tcp_v6_reqsk_send_ack,
- .destructor = tcp_v6_reqsk_destructor,
- .send_reset = tcp_v6_send_reset,
- .syn_ack_timeout = tcp_syn_ack_timeout,
- };
-
-+const struct tcp_request_sock_ops tcp_request_sock_ipv6_ops = {
-+ .mss_clamp = IPV6_MIN_MTU - sizeof(struct tcphdr) -
-+ sizeof(struct ipv6hdr),
- #ifdef CONFIG_TCP_MD5SIG
--static const struct tcp_request_sock_ops tcp_request_sock_ipv6_ops = {
- .md5_lookup = tcp_v6_reqsk_md5_lookup,
- .calc_md5_hash = tcp_v6_md5_hash_skb,
--};
- #endif
-+ .init_req = tcp_v6_init_req,
-+#ifdef CONFIG_SYN_COOKIES
-+ .cookie_init_seq = cookie_v6_init_sequence,
-+#endif
-+ .route_req = tcp_v6_route_req,
-+ .init_seq = tcp_v6_init_sequence,
-+ .send_synack = tcp_v6_send_synack,
-+ .queue_hash_add = inet6_csk_reqsk_queue_hash_add,
-+};
-
--static void tcp_v6_send_response(struct sk_buff *skb, u32 seq, u32 ack, u32 win,
-- u32 tsval, u32 tsecr, int oif,
-- struct tcp_md5sig_key *key, int rst, u8 tclass,
-- u32 label)
-+static void tcp_v6_send_response(struct sk_buff *skb, u32 seq, u32 ack,
-+ u32 data_ack, u32 win, u32 tsval, u32 tsecr,
-+ int oif, struct tcp_md5sig_key *key, int rst,
-+ u8 tclass, u32 label, int mptcp)
- {
- const struct tcphdr *th = tcp_hdr(skb);
- struct tcphdr *t1;
-@@ -756,7 +823,10 @@ static void tcp_v6_send_response(struct sk_buff *skb, u32 seq, u32 ack, u32 win,
- if (key)
- tot_len += TCPOLEN_MD5SIG_ALIGNED;
- #endif
--
-+#ifdef CONFIG_MPTCP
-+ if (mptcp)
-+ tot_len += MPTCP_SUB_LEN_DSS + MPTCP_SUB_LEN_ACK;
-+#endif
- buff = alloc_skb(MAX_HEADER + sizeof(struct ipv6hdr) + tot_len,
- GFP_ATOMIC);
- if (buff == NULL)
-@@ -794,6 +864,17 @@ static void tcp_v6_send_response(struct sk_buff *skb, u32 seq, u32 ack, u32 win,
- tcp_v6_md5_hash_hdr((__u8 *)topt, key,
- &ipv6_hdr(skb)->saddr,
- &ipv6_hdr(skb)->daddr, t1);
-+ topt += 4;
-+ }
-+#endif
-+#ifdef CONFIG_MPTCP
-+ if (mptcp) {
-+ /* Construction of 32-bit data_ack */
-+ *topt++ = htonl((TCPOPT_MPTCP << 24) |
-+ ((MPTCP_SUB_LEN_DSS + MPTCP_SUB_LEN_ACK) << 16) |
-+ (0x20 << 8) |
-+ (0x01));
-+ *topt++ = htonl(data_ack);
- }
- #endif
-
-@@ -834,7 +915,7 @@ static void tcp_v6_send_response(struct sk_buff *skb, u32 seq, u32 ack, u32 win,
- kfree_skb(buff);
- }
-
--static void tcp_v6_send_reset(struct sock *sk, struct sk_buff *skb)
-+void tcp_v6_send_reset(struct sock *sk, struct sk_buff *skb)
- {
- const struct tcphdr *th = tcp_hdr(skb);
- u32 seq = 0, ack_seq = 0;
-@@ -891,7 +972,7 @@ static void tcp_v6_send_reset(struct sock *sk, struct sk_buff *skb)
- (th->doff << 2);
-
- oif = sk ? sk->sk_bound_dev_if : 0;
-- tcp_v6_send_response(skb, seq, ack_seq, 0, 0, 0, oif, key, 1, 0, 0);
-+ tcp_v6_send_response(skb, seq, ack_seq, 0, 0, 0, 0, oif, key, 1, 0, 0, 0);
-
- #ifdef CONFIG_TCP_MD5SIG
- release_sk1:
-@@ -902,45 +983,52 @@ release_sk1:
- #endif
- }
-
--static void tcp_v6_send_ack(struct sk_buff *skb, u32 seq, u32 ack,
-+static void tcp_v6_send_ack(struct sk_buff *skb, u32 seq, u32 ack, u32 data_ack,
- u32 win, u32 tsval, u32 tsecr, int oif,
- struct tcp_md5sig_key *key, u8 tclass,
-- u32 label)
-+ u32 label, int mptcp)
- {
-- tcp_v6_send_response(skb, seq, ack, win, tsval, tsecr, oif, key, 0, tclass,
-- label);
-+ tcp_v6_send_response(skb, seq, ack, data_ack, win, tsval, tsecr, oif,
-+ key, 0, tclass, label, mptcp);
- }
-
- static void tcp_v6_timewait_ack(struct sock *sk, struct sk_buff *skb)
- {
- struct inet_timewait_sock *tw = inet_twsk(sk);
- struct tcp_timewait_sock *tcptw = tcp_twsk(sk);
-+ u32 data_ack = 0;
-+ int mptcp = 0;
-
-+ if (tcptw->mptcp_tw && tcptw->mptcp_tw->meta_tw) {
-+ data_ack = (u32)tcptw->mptcp_tw->rcv_nxt;
-+ mptcp = 1;
-+ }
- tcp_v6_send_ack(skb, tcptw->tw_snd_nxt, tcptw->tw_rcv_nxt,
-+ data_ack,
- tcptw->tw_rcv_wnd >> tw->tw_rcv_wscale,
- tcp_time_stamp + tcptw->tw_ts_offset,
- tcptw->tw_ts_recent, tw->tw_bound_dev_if, tcp_twsk_md5_key(tcptw),
-- tw->tw_tclass, (tw->tw_flowlabel << 12));
-+ tw->tw_tclass, (tw->tw_flowlabel << 12), mptcp);
-
- inet_twsk_put(tw);
- }
-
--static void tcp_v6_reqsk_send_ack(struct sock *sk, struct sk_buff *skb,
-- struct request_sock *req)
-+void tcp_v6_reqsk_send_ack(struct sock *sk, struct sk_buff *skb,
-+ struct request_sock *req)
- {
- /* sk->sk_state == TCP_LISTEN -> for regular TCP_SYN_RECV
- * sk->sk_state == TCP_SYN_RECV -> for Fast Open.
- */
- tcp_v6_send_ack(skb, (sk->sk_state == TCP_LISTEN) ?
- tcp_rsk(req)->snt_isn + 1 : tcp_sk(sk)->snd_nxt,
-- tcp_rsk(req)->rcv_nxt,
-+ tcp_rsk(req)->rcv_nxt, 0,
- req->rcv_wnd, tcp_time_stamp, req->ts_recent, sk->sk_bound_dev_if,
- tcp_v6_md5_do_lookup(sk, &ipv6_hdr(skb)->daddr),
-- 0, 0);
-+ 0, 0, 0);
- }
-
-
--static struct sock *tcp_v6_hnd_req(struct sock *sk, struct sk_buff *skb)
-+struct sock *tcp_v6_hnd_req(struct sock *sk, struct sk_buff *skb)
- {
- struct request_sock *req, **prev;
- const struct tcphdr *th = tcp_hdr(skb);
-@@ -959,7 +1047,13 @@ static struct sock *tcp_v6_hnd_req(struct sock *sk, struct sk_buff *skb)
-
- if (nsk) {
- if (nsk->sk_state != TCP_TIME_WAIT) {
-+ /* Don't lock again the meta-sk. It has been locked
-+ * before mptcp_v6_do_rcv.
-+ */
-+ if (mptcp(tcp_sk(nsk)) && !is_meta_sk(sk))
-+ bh_lock_sock(mptcp_meta_sk(nsk));
- bh_lock_sock(nsk);
-+
- return nsk;
- }
- inet_twsk_put(inet_twsk(nsk));
-@@ -973,161 +1067,25 @@ static struct sock *tcp_v6_hnd_req(struct sock *sk, struct sk_buff *skb)
- return sk;
- }
-
--/* FIXME: this is substantially similar to the ipv4 code.
-- * Can some kind of merge be done? -- erics
-- */
--static int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
-+int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
- {
-- struct tcp_options_received tmp_opt;
-- struct request_sock *req;
-- struct inet_request_sock *ireq;
-- struct ipv6_pinfo *np = inet6_sk(sk);
-- struct tcp_sock *tp = tcp_sk(sk);
-- __u32 isn = TCP_SKB_CB(skb)->when;
-- struct dst_entry *dst = NULL;
-- struct tcp_fastopen_cookie foc = { .len = -1 };
-- bool want_cookie = false, fastopen;
-- struct flowi6 fl6;
-- int err;
--
- if (skb->protocol == htons(ETH_P_IP))
- return tcp_v4_conn_request(sk, skb);
-
- if (!ipv6_unicast_destination(skb))
- goto drop;
-
-- if ((sysctl_tcp_syncookies == 2 ||
-- inet_csk_reqsk_queue_is_full(sk)) && !isn) {
-- want_cookie = tcp_syn_flood_action(sk, skb, "TCPv6");
-- if (!want_cookie)
-- goto drop;
-- }
--
-- if (sk_acceptq_is_full(sk) && inet_csk_reqsk_queue_young(sk) > 1) {
-- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENOVERFLOWS);
-- goto drop;
-- }
--
-- req = inet6_reqsk_alloc(&tcp6_request_sock_ops);
-- if (req == NULL)
-- goto drop;
--
--#ifdef CONFIG_TCP_MD5SIG
-- tcp_rsk(req)->af_specific = &tcp_request_sock_ipv6_ops;
--#endif
--
-- tcp_clear_options(&tmp_opt);
-- tmp_opt.mss_clamp = IPV6_MIN_MTU - sizeof(struct tcphdr) - sizeof(struct ipv6hdr);
-- tmp_opt.user_mss = tp->rx_opt.user_mss;
-- tcp_parse_options(skb, &tmp_opt, 0, want_cookie ? NULL : &foc);
--
-- if (want_cookie && !tmp_opt.saw_tstamp)
-- tcp_clear_options(&tmp_opt);
-+ return tcp_conn_request(&tcp6_request_sock_ops,
-+ &tcp_request_sock_ipv6_ops, sk, skb);
-
-- tmp_opt.tstamp_ok = tmp_opt.saw_tstamp;
-- tcp_openreq_init(req, &tmp_opt, skb);
--
-- ireq = inet_rsk(req);
-- ireq->ir_v6_rmt_addr = ipv6_hdr(skb)->saddr;
-- ireq->ir_v6_loc_addr = ipv6_hdr(skb)->daddr;
-- if (!want_cookie || tmp_opt.tstamp_ok)
-- TCP_ECN_create_request(req, skb, sock_net(sk));
--
-- ireq->ir_iif = sk->sk_bound_dev_if;
-- ireq->ir_mark = inet_request_mark(sk, skb);
--
-- /* So that link locals have meaning */
-- if (!sk->sk_bound_dev_if &&
-- ipv6_addr_type(&ireq->ir_v6_rmt_addr) & IPV6_ADDR_LINKLOCAL)
-- ireq->ir_iif = inet6_iif(skb);
--
-- if (!isn) {
-- if (ipv6_opt_accepted(sk, skb) ||
-- np->rxopt.bits.rxinfo || np->rxopt.bits.rxoinfo ||
-- np->rxopt.bits.rxhlim || np->rxopt.bits.rxohlim ||
-- np->repflow) {
-- atomic_inc(&skb->users);
-- ireq->pktopts = skb;
-- }
--
-- if (want_cookie) {
-- isn = cookie_v6_init_sequence(sk, skb, &req->mss);
-- req->cookie_ts = tmp_opt.tstamp_ok;
-- goto have_isn;
-- }
--
-- /* VJ's idea. We save last timestamp seen
-- * from the destination in peer table, when entering
-- * state TIME-WAIT, and check against it before
-- * accepting new connection request.
-- *
-- * If "isn" is not zero, this request hit alive
-- * timewait bucket, so that all the necessary checks
-- * are made in the function processing timewait state.
-- */
-- if (tmp_opt.saw_tstamp &&
-- tcp_death_row.sysctl_tw_recycle &&
-- (dst = inet6_csk_route_req(sk, &fl6, req)) != NULL) {
-- if (!tcp_peer_is_proven(req, dst, true)) {
-- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_PAWSPASSIVEREJECTED);
-- goto drop_and_release;
-- }
-- }
-- /* Kill the following clause, if you dislike this way. */
-- else if (!sysctl_tcp_syncookies &&
-- (sysctl_max_syn_backlog - inet_csk_reqsk_queue_len(sk) <
-- (sysctl_max_syn_backlog >> 2)) &&
-- !tcp_peer_is_proven(req, dst, false)) {
-- /* Without syncookies last quarter of
-- * backlog is filled with destinations,
-- * proven to be alive.
-- * It means that we continue to communicate
-- * to destinations, already remembered
-- * to the moment of synflood.
-- */
-- LIMIT_NETDEBUG(KERN_DEBUG "TCP: drop open request from %pI6/%u\n",
-- &ireq->ir_v6_rmt_addr, ntohs(tcp_hdr(skb)->source));
-- goto drop_and_release;
-- }
--
-- isn = tcp_v6_init_sequence(skb);
-- }
--have_isn:
--
-- if (security_inet_conn_request(sk, skb, req))
-- goto drop_and_release;
--
-- if (!dst && (dst = inet6_csk_route_req(sk, &fl6, req)) == NULL)
-- goto drop_and_free;
--
-- tcp_rsk(req)->snt_isn = isn;
-- tcp_rsk(req)->snt_synack = tcp_time_stamp;
-- tcp_openreq_init_rwin(req, sk, dst);
-- fastopen = !want_cookie &&
-- tcp_try_fastopen(sk, skb, req, &foc, dst);
-- err = tcp_v6_send_synack(sk, dst, &fl6, req,
-- skb_get_queue_mapping(skb), &foc);
-- if (!fastopen) {
-- if (err || want_cookie)
-- goto drop_and_free;
--
-- tcp_rsk(req)->listener = NULL;
-- inet6_csk_reqsk_queue_hash_add(sk, req, TCP_TIMEOUT_INIT);
-- }
-- return 0;
--
--drop_and_release:
-- dst_release(dst);
--drop_and_free:
-- reqsk_free(req);
- drop:
- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENDROPS);
- return 0; /* don't send reset */
- }
-
--static struct sock *tcp_v6_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
-- struct request_sock *req,
-- struct dst_entry *dst)
-+struct sock *tcp_v6_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
-+ struct request_sock *req,
-+ struct dst_entry *dst)
- {
- struct inet_request_sock *ireq;
- struct ipv6_pinfo *newnp, *np = inet6_sk(sk);
-@@ -1165,7 +1123,12 @@ static struct sock *tcp_v6_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
-
- newsk->sk_v6_rcv_saddr = newnp->saddr;
-
-- inet_csk(newsk)->icsk_af_ops = &ipv6_mapped;
-+#ifdef CONFIG_MPTCP
-+ if (is_mptcp_enabled(newsk))
-+ inet_csk(newsk)->icsk_af_ops = &mptcp_v6_mapped;
-+ else
-+#endif
-+ inet_csk(newsk)->icsk_af_ops = &ipv6_mapped;
- newsk->sk_backlog_rcv = tcp_v4_do_rcv;
- #ifdef CONFIG_TCP_MD5SIG
- newtp->af_specific = &tcp_sock_ipv6_mapped_specific;
-@@ -1329,7 +1292,7 @@ out:
- * This is because we cannot sleep with the original spinlock
- * held.
- */
--static int tcp_v6_do_rcv(struct sock *sk, struct sk_buff *skb)
-+int tcp_v6_do_rcv(struct sock *sk, struct sk_buff *skb)
- {
- struct ipv6_pinfo *np = inet6_sk(sk);
- struct tcp_sock *tp;
-@@ -1351,6 +1314,9 @@ static int tcp_v6_do_rcv(struct sock *sk, struct sk_buff *skb)
- goto discard;
- #endif
-
-+ if (is_meta_sk(sk))
-+ return mptcp_v6_do_rcv(sk, skb);
-+
- if (sk_filter(sk, skb))
- goto discard;
-
-@@ -1472,7 +1438,7 @@ static int tcp_v6_rcv(struct sk_buff *skb)
- {
- const struct tcphdr *th;
- const struct ipv6hdr *hdr;
-- struct sock *sk;
-+ struct sock *sk, *meta_sk = NULL;
- int ret;
- struct net *net = dev_net(skb->dev);
-
-@@ -1503,18 +1469,43 @@ static int tcp_v6_rcv(struct sk_buff *skb)
- TCP_SKB_CB(skb)->end_seq = (TCP_SKB_CB(skb)->seq + th->syn + th->fin +
- skb->len - th->doff*4);
- TCP_SKB_CB(skb)->ack_seq = ntohl(th->ack_seq);
-+#ifdef CONFIG_MPTCP
-+ TCP_SKB_CB(skb)->mptcp_flags = 0;
-+ TCP_SKB_CB(skb)->dss_off = 0;
-+#endif
- TCP_SKB_CB(skb)->when = 0;
- TCP_SKB_CB(skb)->ip_dsfield = ipv6_get_dsfield(hdr);
- TCP_SKB_CB(skb)->sacked = 0;
-
- sk = __inet6_lookup_skb(&tcp_hashinfo, skb, th->source, th->dest);
-- if (!sk)
-- goto no_tcp_socket;
-
- process:
-- if (sk->sk_state == TCP_TIME_WAIT)
-+ if (sk && sk->sk_state == TCP_TIME_WAIT)
- goto do_time_wait;
-
-+#ifdef CONFIG_MPTCP
-+ if (!sk && th->syn && !th->ack) {
-+ int ret = mptcp_lookup_join(skb, NULL);
-+
-+ if (ret < 0) {
-+ tcp_v6_send_reset(NULL, skb);
-+ goto discard_it;
-+ } else if (ret > 0) {
-+ return 0;
-+ }
-+ }
-+
-+ /* Is there a pending request sock for this segment ? */
-+ if ((!sk || sk->sk_state == TCP_LISTEN) && mptcp_check_req(skb, net)) {
-+ if (sk)
-+ sock_put(sk);
-+ return 0;
-+ }
-+#endif
-+
-+ if (!sk)
-+ goto no_tcp_socket;
-+
- if (hdr->hop_limit < inet6_sk(sk)->min_hopcount) {
- NET_INC_STATS_BH(net, LINUX_MIB_TCPMINTTLDROP);
- goto discard_and_relse;
-@@ -1529,11 +1520,21 @@ process:
- sk_mark_napi_id(sk, skb);
- skb->dev = NULL;
-
-- bh_lock_sock_nested(sk);
-+ if (mptcp(tcp_sk(sk))) {
-+ meta_sk = mptcp_meta_sk(sk);
-+
-+ bh_lock_sock_nested(meta_sk);
-+ if (sock_owned_by_user(meta_sk))
-+ skb->sk = sk;
-+ } else {
-+ meta_sk = sk;
-+ bh_lock_sock_nested(sk);
-+ }
-+
- ret = 0;
-- if (!sock_owned_by_user(sk)) {
-+ if (!sock_owned_by_user(meta_sk)) {
- #ifdef CONFIG_NET_DMA
-- struct tcp_sock *tp = tcp_sk(sk);
-+ struct tcp_sock *tp = tcp_sk(meta_sk);
- if (!tp->ucopy.dma_chan && tp->ucopy.pinned_list)
- tp->ucopy.dma_chan = net_dma_find_channel();
- if (tp->ucopy.dma_chan)
-@@ -1541,16 +1542,17 @@ process:
- else
- #endif
- {
-- if (!tcp_prequeue(sk, skb))
-+ if (!tcp_prequeue(meta_sk, skb))
- ret = tcp_v6_do_rcv(sk, skb);
- }
-- } else if (unlikely(sk_add_backlog(sk, skb,
-- sk->sk_rcvbuf + sk->sk_sndbuf))) {
-- bh_unlock_sock(sk);
-+ } else if (unlikely(sk_add_backlog(meta_sk, skb,
-+ meta_sk->sk_rcvbuf + meta_sk->sk_sndbuf))) {
-+ bh_unlock_sock(meta_sk);
- NET_INC_STATS_BH(net, LINUX_MIB_TCPBACKLOGDROP);
- goto discard_and_relse;
- }
-- bh_unlock_sock(sk);
-+
-+ bh_unlock_sock(meta_sk);
-
- sock_put(sk);
- return ret ? -1 : 0;
-@@ -1607,6 +1609,18 @@ do_time_wait:
- sk = sk2;
- goto process;
- }
-+#ifdef CONFIG_MPTCP
-+ if (th->syn && !th->ack) {
-+ int ret = mptcp_lookup_join(skb, inet_twsk(sk));
-+
-+ if (ret < 0) {
-+ tcp_v6_send_reset(NULL, skb);
-+ goto discard_it;
-+ } else if (ret > 0) {
-+ return 0;
-+ }
-+ }
-+#endif
- /* Fall through to ACK */
- }
- case TCP_TW_ACK:
-@@ -1657,7 +1671,7 @@ static void tcp_v6_early_demux(struct sk_buff *skb)
- }
- }
-
--static struct timewait_sock_ops tcp6_timewait_sock_ops = {
-+struct timewait_sock_ops tcp6_timewait_sock_ops = {
- .twsk_obj_size = sizeof(struct tcp6_timewait_sock),
- .twsk_unique = tcp_twsk_unique,
- .twsk_destructor = tcp_twsk_destructor,
-@@ -1730,7 +1744,12 @@ static int tcp_v6_init_sock(struct sock *sk)
-
- tcp_init_sock(sk);
-
-- icsk->icsk_af_ops = &ipv6_specific;
-+#ifdef CONFIG_MPTCP
-+ if (is_mptcp_enabled(sk))
-+ icsk->icsk_af_ops = &mptcp_v6_specific;
-+ else
-+#endif
-+ icsk->icsk_af_ops = &ipv6_specific;
-
- #ifdef CONFIG_TCP_MD5SIG
- tcp_sk(sk)->af_specific = &tcp_sock_ipv6_specific;
-@@ -1739,7 +1758,7 @@ static int tcp_v6_init_sock(struct sock *sk)
- return 0;
- }
-
--static void tcp_v6_destroy_sock(struct sock *sk)
-+void tcp_v6_destroy_sock(struct sock *sk)
- {
- tcp_v4_destroy_sock(sk);
- inet6_destroy_sock(sk);
-@@ -1924,12 +1943,28 @@ void tcp6_proc_exit(struct net *net)
- static void tcp_v6_clear_sk(struct sock *sk, int size)
- {
- struct inet_sock *inet = inet_sk(sk);
-+#ifdef CONFIG_MPTCP
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ /* size_tk_table goes from the end of tk_table to the end of sk */
-+ int size_tk_table = size - offsetof(struct tcp_sock, tk_table) -
-+ sizeof(tp->tk_table);
-+#endif
-
- /* we do not want to clear pinet6 field, because of RCU lookups */
- sk_prot_clear_nulls(sk, offsetof(struct inet_sock, pinet6));
-
- size -= offsetof(struct inet_sock, pinet6) + sizeof(inet->pinet6);
-+
-+#ifdef CONFIG_MPTCP
-+ /* We zero out only from pinet6 to tk_table */
-+ size -= size_tk_table + sizeof(tp->tk_table);
-+#endif
- memset(&inet->pinet6 + 1, 0, size);
-+
-+#ifdef CONFIG_MPTCP
-+ memset((char *)&tp->tk_table + sizeof(tp->tk_table), 0, size_tk_table);
-+#endif
-+
- }
-
- struct proto tcpv6_prot = {
-diff --git a/net/mptcp/Kconfig b/net/mptcp/Kconfig
-new file mode 100644
-index 000000000000..cdfc03adabf8
---- /dev/null
-+++ b/net/mptcp/Kconfig
-@@ -0,0 +1,115 @@
-+#
-+# MPTCP configuration
-+#
-+config MPTCP
-+ bool "MPTCP protocol"
-+ depends on (IPV6=y || IPV6=n)
-+ ---help---
-+ This replaces the normal TCP stack with a Multipath TCP stack,
-+ able to use several paths at once.
-+
-+menuconfig MPTCP_PM_ADVANCED
-+ bool "MPTCP: advanced path-manager control"
-+ depends on MPTCP=y
-+ ---help---
-+ Support for selection of different path-managers. You should choose 'Y' here,
-+ because otherwise you will not actively create new MPTCP-subflows.
-+
-+if MPTCP_PM_ADVANCED
-+
-+config MPTCP_FULLMESH
-+ tristate "MPTCP Full-Mesh Path-Manager"
-+ depends on MPTCP=y
-+ ---help---
-+ This path-management module will create a full-mesh among all IP-addresses.
-+
-+config MPTCP_NDIFFPORTS
-+ tristate "MPTCP ndiff-ports"
-+ depends on MPTCP=y
-+ ---help---
-+ This path-management module will create multiple subflows between the same
-+ pair of IP-addresses, modifying the source-port. You can set the number
-+ of subflows via the mptcp_ndiffports-sysctl.
-+
-+config MPTCP_BINDER
-+ tristate "MPTCP Binder"
-+ depends on (MPTCP=y)
-+ ---help---
-+ This path-management module works like ndiffports, and adds the sysctl
-+ option to set the gateway (and/or path to) per each additional subflow
-+ via Loose Source Routing (IPv4 only).
-+
-+choice
-+ prompt "Default MPTCP Path-Manager"
-+ default DEFAULT
-+ help
-+ Select the Path-Manager of your choice
-+
-+ config DEFAULT_FULLMESH
-+ bool "Full mesh" if MPTCP_FULLMESH=y
-+
-+ config DEFAULT_NDIFFPORTS
-+ bool "ndiff-ports" if MPTCP_NDIFFPORTS=y
-+
-+ config DEFAULT_BINDER
-+ bool "binder" if MPTCP_BINDER=y
-+
-+ config DEFAULT_DUMMY
-+ bool "Default"
-+
-+endchoice
-+
-+endif
-+
-+config DEFAULT_MPTCP_PM
-+ string
-+ default "default" if DEFAULT_DUMMY
-+ default "fullmesh" if DEFAULT_FULLMESH
-+ default "ndiffports" if DEFAULT_NDIFFPORTS
-+ default "binder" if DEFAULT_BINDER
-+ default "default"
-+
-+menuconfig MPTCP_SCHED_ADVANCED
-+ bool "MPTCP: advanced scheduler control"
-+ depends on MPTCP=y
-+ ---help---
-+ Support for selection of different schedulers. You should choose 'Y' here,
-+ if you want to choose a different scheduler than the default one.
-+
-+if MPTCP_SCHED_ADVANCED
-+
-+config MPTCP_ROUNDROBIN
-+ tristate "MPTCP Round-Robin"
-+ depends on (MPTCP=y)
-+ ---help---
-+ This is a very simple round-robin scheduler. Probably has bad performance
-+ but might be interesting for researchers.
-+
-+choice
-+ prompt "Default MPTCP Scheduler"
-+ default DEFAULT
-+ help
-+ Select the Scheduler of your choice
-+
-+ config DEFAULT_SCHEDULER
-+ bool "Default"
-+ ---help---
-+ This is the default scheduler, sending first on the subflow
-+ with the lowest RTT.
-+
-+ config DEFAULT_ROUNDROBIN
-+ bool "Round-Robin" if MPTCP_ROUNDROBIN=y
-+ ---help---
-+ This is the round-robin scheduler, sending in a round-robin
-+ fashion.
-+
-+endchoice
-+endif
-+
-+config DEFAULT_MPTCP_SCHED
-+ string
-+ depends on (MPTCP=y)
-+ default "default" if DEFAULT_SCHEDULER
-+ default "roundrobin" if DEFAULT_ROUNDROBIN
-+ default "default"
-+
-diff --git a/net/mptcp/Makefile b/net/mptcp/Makefile
-new file mode 100644
-index 000000000000..35561a7012e3
---- /dev/null
-+++ b/net/mptcp/Makefile
-@@ -0,0 +1,20 @@
-+#
-+## Makefile for MultiPath TCP support code.
-+#
-+#
-+
-+obj-$(CONFIG_MPTCP) += mptcp.o
-+
-+mptcp-y := mptcp_ctrl.o mptcp_ipv4.o mptcp_ofo_queue.o mptcp_pm.o \
-+ mptcp_output.o mptcp_input.o mptcp_sched.o
-+
-+obj-$(CONFIG_TCP_CONG_COUPLED) += mptcp_coupled.o
-+obj-$(CONFIG_TCP_CONG_OLIA) += mptcp_olia.o
-+obj-$(CONFIG_TCP_CONG_WVEGAS) += mptcp_wvegas.o
-+obj-$(CONFIG_MPTCP_FULLMESH) += mptcp_fullmesh.o
-+obj-$(CONFIG_MPTCP_NDIFFPORTS) += mptcp_ndiffports.o
-+obj-$(CONFIG_MPTCP_BINDER) += mptcp_binder.o
-+obj-$(CONFIG_MPTCP_ROUNDROBIN) += mptcp_rr.o
-+
-+mptcp-$(subst m,y,$(CONFIG_IPV6)) += mptcp_ipv6.o
-+
-diff --git a/net/mptcp/mptcp_binder.c b/net/mptcp/mptcp_binder.c
-new file mode 100644
-index 000000000000..95d8da560715
---- /dev/null
-+++ b/net/mptcp/mptcp_binder.c
-@@ -0,0 +1,487 @@
-+#include <linux/module.h>
-+
-+#include <net/mptcp.h>
-+#include <net/mptcp_v4.h>
-+
-+#include <linux/route.h>
-+#include <linux/inet.h>
-+#include <linux/mroute.h>
-+#include <linux/spinlock_types.h>
-+#include <net/inet_ecn.h>
-+#include <net/route.h>
-+#include <net/xfrm.h>
-+#include <net/compat.h>
-+#include <linux/slab.h>
-+
-+#define MPTCP_GW_MAX_LISTS 10
-+#define MPTCP_GW_LIST_MAX_LEN 6
-+#define MPTCP_GW_SYSCTL_MAX_LEN (15 * MPTCP_GW_LIST_MAX_LEN * \
-+ MPTCP_GW_MAX_LISTS)
-+
-+struct mptcp_gw_list {
-+ struct in_addr list[MPTCP_GW_MAX_LISTS][MPTCP_GW_LIST_MAX_LEN];
-+ u8 len[MPTCP_GW_MAX_LISTS];
-+};
-+
-+struct binder_priv {
-+ /* Worker struct for subflow establishment */
-+ struct work_struct subflow_work;
-+
-+ struct mptcp_cb *mpcb;
-+
-+ /* Prevent multiple sub-sockets concurrently iterating over sockets */
-+ spinlock_t *flow_lock;
-+};
-+
-+static struct mptcp_gw_list *mptcp_gws;
-+static rwlock_t mptcp_gws_lock;
-+
-+static int mptcp_binder_ndiffports __read_mostly = 1;
-+
-+static char sysctl_mptcp_binder_gateways[MPTCP_GW_SYSCTL_MAX_LEN] __read_mostly;
-+
-+static int mptcp_get_avail_list_ipv4(struct sock *sk)
-+{
-+ int i, j, list_taken, opt_ret, opt_len;
-+ unsigned char *opt_ptr, *opt_end_ptr, opt[MAX_IPOPTLEN];
-+
-+ for (i = 0; i < MPTCP_GW_MAX_LISTS; ++i) {
-+ if (mptcp_gws->len[i] == 0)
-+ goto error;
-+
-+ mptcp_debug("mptcp_get_avail_list_ipv4: List %i\n", i);
-+ list_taken = 0;
-+
-+ /* Loop through all sub-sockets in this connection */
-+ mptcp_for_each_sk(tcp_sk(sk)->mpcb, sk) {
-+ mptcp_debug("mptcp_get_avail_list_ipv4: Next sock\n");
-+
-+ /* Reset length and options buffer, then retrieve
-+ * from socket
-+ */
-+ opt_len = MAX_IPOPTLEN;
-+ memset(opt, 0, MAX_IPOPTLEN);
-+ opt_ret = ip_getsockopt(sk, IPPROTO_IP,
-+ IP_OPTIONS, opt, &opt_len);
-+ if (opt_ret < 0) {
-+ mptcp_debug(KERN_ERR "%s: MPTCP subsocket getsockopt() IP_OPTIONS failed, error %d\n",
-+ __func__, opt_ret);
-+ goto error;
-+ }
-+
-+ /* If socket has no options, it has no stake in this list */
-+ if (opt_len <= 0)
-+ continue;
-+
-+ /* Iterate options buffer */
-+ for (opt_ptr = &opt[0]; opt_ptr < &opt[opt_len]; opt_ptr++) {
-+ if (*opt_ptr == IPOPT_LSRR) {
-+ mptcp_debug("mptcp_get_avail_list_ipv4: LSRR options found\n");
-+ goto sock_lsrr;
-+ }
-+ }
-+ continue;
-+
-+sock_lsrr:
-+ /* Pointer to the 2nd to last address */
-+ opt_end_ptr = opt_ptr+(*(opt_ptr+1))-4;
-+
-+ /* Addresses start 3 bytes after type offset */
-+ opt_ptr += 3;
-+ j = 0;
-+
-+ /* Different length lists cannot be the same */
-+ if ((opt_end_ptr-opt_ptr)/4 != mptcp_gws->len[i])
-+ continue;
-+
-+ /* Iterate if we are still inside options list
-+ * and sysctl list
-+ */
-+ while (opt_ptr < opt_end_ptr && j < mptcp_gws->len[i]) {
-+ /* If there is a different address, this list must
-+ * not be set on this socket
-+ */
-+ if (memcmp(&mptcp_gws->list[i][j], opt_ptr, 4))
-+ break;
-+
-+ /* Jump 4 bytes to next address */
-+ opt_ptr += 4;
-+ j++;
-+ }
-+
-+ /* Reached the end without a differing address, lists
-+ * are therefore identical.
-+ */
-+ if (j == mptcp_gws->len[i]) {
-+ mptcp_debug("mptcp_get_avail_list_ipv4: List already used\n");
-+ list_taken = 1;
-+ break;
-+ }
-+ }
-+
-+ /* Free list found if not taken by a socket */
-+ if (!list_taken) {
-+ mptcp_debug("mptcp_get_avail_list_ipv4: List free\n");
-+ break;
-+ }
-+ }
-+
-+ if (i >= MPTCP_GW_MAX_LISTS)
-+ goto error;
-+
-+ return i;
-+error:
-+ return -1;
-+}
-+
-+/* The list of addresses is parsed each time a new connection is opened,
-+ * to make sure it's up to date. In case of error, all the lists are
-+ * marked as unavailable and the subflow's fingerprint is set to 0.
-+ */
-+static void mptcp_v4_add_lsrr(struct sock *sk, struct in_addr addr)
-+{
-+ int i, j, ret;
-+ unsigned char opt[MAX_IPOPTLEN] = {0};
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ struct binder_priv *fmp = (struct binder_priv *)&tp->mpcb->mptcp_pm[0];
-+
-+ /* Read lock: multiple sockets can read LSRR addresses at the same
-+ * time, but writes are done in mutual exclusion.
-+ * Spin lock: must search for free list for one socket at a time, or
-+ * multiple sockets could take the same list.
-+ */
-+ read_lock(&mptcp_gws_lock);
-+ spin_lock(fmp->flow_lock);
-+
-+ i = mptcp_get_avail_list_ipv4(sk);
-+
-+ /* Execution enters here only if a free path is found.
-+ */
-+ if (i >= 0) {
-+ opt[0] = IPOPT_NOP;
-+ opt[1] = IPOPT_LSRR;
-+ opt[2] = sizeof(mptcp_gws->list[i][0].s_addr) *
-+ (mptcp_gws->len[i] + 1) + 3;
-+ opt[3] = IPOPT_MINOFF;
-+ for (j = 0; j < mptcp_gws->len[i]; ++j)
-+ memcpy(opt + 4 +
-+ (j * sizeof(mptcp_gws->list[i][0].s_addr)),
-+ &mptcp_gws->list[i][j].s_addr,
-+ sizeof(mptcp_gws->list[i][0].s_addr));
-+ /* Final destination must be part of IP_OPTIONS parameter. */
-+ memcpy(opt + 4 + (j * sizeof(addr.s_addr)), &addr.s_addr,
-+ sizeof(addr.s_addr));
-+
-+ /* setsockopt must be inside the lock, otherwise another
-+ * subflow could fail to see that we have taken a list.
-+ */
-+ ret = ip_setsockopt(sk, IPPROTO_IP, IP_OPTIONS, opt,
-+ 4 + sizeof(mptcp_gws->list[i][0].s_addr)
-+ * (mptcp_gws->len[i] + 1));
-+
-+ if (ret < 0) {
-+ mptcp_debug(KERN_ERR "%s: MPTCP subsock setsockopt() IP_OPTIONS failed, error %d\n",
-+ __func__, ret);
-+ }
-+ }
-+
-+ spin_unlock(fmp->flow_lock);
-+ read_unlock(&mptcp_gws_lock);
-+
-+ return;
-+}
-+
-+/* Parses gateways string for a list of paths to different
-+ * gateways, and stores them for use with the Loose Source Routing (LSRR)
-+ * socket option. Each list must have "," separated addresses, and the lists
-+ * themselves must be separated by "-". Returns -1 in case one or more of the
-+ * addresses is not a valid ipv4/6 address.
-+ */
-+static int mptcp_parse_gateway_ipv4(char *gateways)
-+{
-+ int i, j, k, ret;
-+ char *tmp_string = NULL;
-+ struct in_addr tmp_addr;
-+
-+ tmp_string = kzalloc(16, GFP_KERNEL);
-+ if (tmp_string == NULL)
-+ return -ENOMEM;
-+
-+ write_lock(&mptcp_gws_lock);
-+
-+ memset(mptcp_gws, 0, sizeof(struct mptcp_gw_list));
-+
-+ /* A TMP string is used since inet_pton needs a null terminated string
-+ * but we do not want to modify the sysctl for obvious reasons.
-+ * i will iterate over the SYSCTL string, j will iterate over the
-+ * temporary string where each IP is copied into, k will iterate over
-+ * the IPs in each list.
-+ */
-+ for (i = j = k = 0;
-+ i < MPTCP_GW_SYSCTL_MAX_LEN && k < MPTCP_GW_MAX_LISTS;
-+ ++i) {
-+ if (gateways[i] == '-' || gateways[i] == ',' || gateways[i] == '\0') {
-+ /* If the temp IP is empty and the current list is
-+ * empty, we are done.
-+ */
-+ if (j == 0 && mptcp_gws->len[k] == 0)
-+ break;
-+
-+ /* Terminate the temp IP string, then if it is
-+ * non-empty parse the IP and copy it.
-+ */
-+ tmp_string[j] = '\0';
-+ if (j > 0) {
-+ mptcp_debug("mptcp_parse_gateway_list tmp: %s i: %d\n", tmp_string, i);
-+
-+ ret = in4_pton(tmp_string, strlen(tmp_string),
-+ (u8 *)&tmp_addr.s_addr, '\0',
-+ NULL);
-+
-+ if (ret) {
-+ mptcp_debug("mptcp_parse_gateway_list ret: %d s_addr: %pI4\n",
-+ ret,
-+ &tmp_addr.s_addr);
-+ memcpy(&mptcp_gws->list[k][mptcp_gws->len[k]].s_addr,
-+ &tmp_addr.s_addr,
-+ sizeof(tmp_addr.s_addr));
-+ mptcp_gws->len[k]++;
-+ j = 0;
-+ tmp_string[j] = '\0';
-+ /* Since we can't impose a limit to
-+ * what the user can input, make sure
-+ * there are not too many IPs in the
-+ * SYSCTL string.
-+ */
-+ if (mptcp_gws->len[k] > MPTCP_GW_LIST_MAX_LEN) {
-+ mptcp_debug("mptcp_parse_gateway_list too many members in list %i: max %i\n",
-+ k,
-+ MPTCP_GW_LIST_MAX_LEN);
-+ goto error;
-+ }
-+ } else {
-+ goto error;
-+ }
-+ }
-+
-+ if (gateways[i] == '-' || gateways[i] == '\0')
-+ ++k;
-+ } else {
-+ tmp_string[j] = gateways[i];
-+ ++j;
-+ }
-+ }
-+
-+ /* Number of flows is number of gateway lists plus master flow */
-+ mptcp_binder_ndiffports = k+1;
-+
-+ write_unlock(&mptcp_gws_lock);
-+ kfree(tmp_string);
-+
-+ return 0;
-+
-+error:
-+ memset(mptcp_gws, 0, sizeof(struct mptcp_gw_list));
-+ memset(gateways, 0, sizeof(char) * MPTCP_GW_SYSCTL_MAX_LEN);
-+ write_unlock(&mptcp_gws_lock);
-+ kfree(tmp_string);
-+ return -1;
-+}
-+
-+/**
-+ * Create all new subflows, by doing calls to mptcp_initX_subsockets
-+ *
-+ * This function uses a goto next_subflow, to allow releasing the lock between
-+ * new subflows and giving other processes a chance to do some work on the
-+ * socket and potentially finishing the communication.
-+ **/
-+static void create_subflow_worker(struct work_struct *work)
-+{
-+ const struct binder_priv *pm_priv = container_of(work,
-+ struct binder_priv,
-+ subflow_work);
-+ struct mptcp_cb *mpcb = pm_priv->mpcb;
-+ struct sock *meta_sk = mpcb->meta_sk;
-+ int iter = 0;
-+
-+next_subflow:
-+ if (iter) {
-+ release_sock(meta_sk);
-+ mutex_unlock(&mpcb->mpcb_mutex);
-+
-+ cond_resched();
-+ }
-+ mutex_lock(&mpcb->mpcb_mutex);
-+ lock_sock_nested(meta_sk, SINGLE_DEPTH_NESTING);
-+
-+ iter++;
-+
-+ if (sock_flag(meta_sk, SOCK_DEAD))
-+ goto exit;
-+
-+ if (mpcb->master_sk &&
-+ !tcp_sk(mpcb->master_sk)->mptcp->fully_established)
-+ goto exit;
-+
-+ if (mptcp_binder_ndiffports > iter &&
-+ mptcp_binder_ndiffports > mpcb->cnt_subflows) {
-+ struct mptcp_loc4 loc;
-+ struct mptcp_rem4 rem;
-+
-+ loc.addr.s_addr = inet_sk(meta_sk)->inet_saddr;
-+ loc.loc4_id = 0;
-+ loc.low_prio = 0;
-+
-+ rem.addr.s_addr = inet_sk(meta_sk)->inet_daddr;
-+ rem.port = inet_sk(meta_sk)->inet_dport;
-+ rem.rem4_id = 0; /* Default 0 */
-+
-+ mptcp_init4_subsockets(meta_sk, &loc, &rem);
-+
-+ goto next_subflow;
-+ }
-+
-+exit:
-+ release_sock(meta_sk);
-+ mutex_unlock(&mpcb->mpcb_mutex);
-+ sock_put(meta_sk);
-+}
-+
-+static void binder_new_session(const struct sock *meta_sk)
-+{
-+ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
-+ struct binder_priv *fmp = (struct binder_priv *)&mpcb->mptcp_pm[0];
-+ static DEFINE_SPINLOCK(flow_lock);
-+
-+#if IS_ENABLED(CONFIG_IPV6)
-+ if (meta_sk->sk_family == AF_INET6 &&
-+ !mptcp_v6_is_v4_mapped(meta_sk)) {
-+ mptcp_fallback_default(mpcb);
-+ return;
-+ }
-+#endif
-+
-+ /* Initialize workqueue-struct */
-+ INIT_WORK(&fmp->subflow_work, create_subflow_worker);
-+ fmp->mpcb = mpcb;
-+
-+ fmp->flow_lock = &flow_lock;
-+}
-+
-+static void binder_create_subflows(struct sock *meta_sk)
-+{
-+ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
-+ struct binder_priv *pm_priv = (struct binder_priv *)&mpcb->mptcp_pm[0];
-+
-+ if (mpcb->infinite_mapping_snd || mpcb->infinite_mapping_rcv ||
-+ mpcb->send_infinite_mapping ||
-+ mpcb->server_side || sock_flag(meta_sk, SOCK_DEAD))
-+ return;
-+
-+ if (!work_pending(&pm_priv->subflow_work)) {
-+ sock_hold(meta_sk);
-+ queue_work(mptcp_wq, &pm_priv->subflow_work);
-+ }
-+}
-+
-+static int binder_get_local_id(sa_family_t family, union inet_addr *addr,
-+ struct net *net, bool *low_prio)
-+{
-+ return 0;
-+}
-+
-+/* Callback functions, executed when sysctl mptcp.mptcp_gateways is updated.
-+ * Inspired from proc_tcp_congestion_control().
-+ */
-+static int proc_mptcp_gateways(ctl_table *ctl, int write,
-+ void __user *buffer, size_t *lenp,
-+ loff_t *ppos)
-+{
-+ int ret;
-+ ctl_table tbl = {
-+ .maxlen = MPTCP_GW_SYSCTL_MAX_LEN,
-+ };
-+
-+ if (write) {
-+ tbl.data = kzalloc(MPTCP_GW_SYSCTL_MAX_LEN, GFP_KERNEL);
-+ if (tbl.data == NULL)
-+ return -1;
-+ ret = proc_dostring(&tbl, write, buffer, lenp, ppos);
-+ if (ret == 0) {
-+ ret = mptcp_parse_gateway_ipv4(tbl.data);
-+ memcpy(ctl->data, tbl.data, MPTCP_GW_SYSCTL_MAX_LEN);
-+ }
-+ kfree(tbl.data);
-+ } else {
-+ ret = proc_dostring(ctl, write, buffer, lenp, ppos);
-+ }
-+
-+
-+ return ret;
-+}
-+
-+static struct mptcp_pm_ops binder __read_mostly = {
-+ .new_session = binder_new_session,
-+ .fully_established = binder_create_subflows,
-+ .get_local_id = binder_get_local_id,
-+ .init_subsocket_v4 = mptcp_v4_add_lsrr,
-+ .name = "binder",
-+ .owner = THIS_MODULE,
-+};
-+
-+static struct ctl_table binder_table[] = {
-+ {
-+ .procname = "mptcp_binder_gateways",
-+ .data = &sysctl_mptcp_binder_gateways,
-+ .maxlen = sizeof(char) * MPTCP_GW_SYSCTL_MAX_LEN,
-+ .mode = 0644,
-+ .proc_handler = &proc_mptcp_gateways
-+ },
-+ { }
-+};
-+
-+struct ctl_table_header *mptcp_sysctl_binder;
-+
-+/* General initialization of MPTCP_PM */
-+static int __init binder_register(void)
-+{
-+ mptcp_gws = kzalloc(sizeof(*mptcp_gws), GFP_KERNEL);
-+ if (!mptcp_gws)
-+ return -ENOMEM;
-+
-+ rwlock_init(&mptcp_gws_lock);
-+
-+ BUILD_BUG_ON(sizeof(struct binder_priv) > MPTCP_PM_SIZE);
-+
-+ mptcp_sysctl_binder = register_net_sysctl(&init_net, "net/mptcp",
-+ binder_table);
-+ if (!mptcp_sysctl_binder)
-+ goto sysctl_fail;
-+
-+ if (mptcp_register_path_manager(&binder))
-+ goto pm_failed;
-+
-+ return 0;
-+
-+pm_failed:
-+ unregister_net_sysctl_table(mptcp_sysctl_binder);
-+sysctl_fail:
-+ kfree(mptcp_gws);
-+
-+ return -1;
-+}
-+
-+static void binder_unregister(void)
-+{
-+ mptcp_unregister_path_manager(&binder);
-+ unregister_net_sysctl_table(mptcp_sysctl_binder);
-+ kfree(mptcp_gws);
-+}
-+
-+module_init(binder_register);
-+module_exit(binder_unregister);
-+
-+MODULE_AUTHOR("Luca Boccassi, Duncan Eastoe, Christoph Paasch (ndiffports)");
-+MODULE_LICENSE("GPL");
-+MODULE_DESCRIPTION("BINDER MPTCP");
-+MODULE_VERSION("0.1");
-diff --git a/net/mptcp/mptcp_coupled.c b/net/mptcp/mptcp_coupled.c
-new file mode 100644
-index 000000000000..5d761164eb85
---- /dev/null
-+++ b/net/mptcp/mptcp_coupled.c
-@@ -0,0 +1,270 @@
-+/*
-+ * MPTCP implementation - Linked Increase congestion control Algorithm (LIA)
-+ *
-+ * Initial Design & Implementation:
-+ * Sébastien Barré <sebastien.barre@uclouvain.be>
-+ *
-+ * Current Maintainer & Author:
-+ * Christoph Paasch <christoph.paasch@uclouvain.be>
-+ *
-+ * Additional authors:
-+ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
-+ * Gregory Detal <gregory.detal@uclouvain.be>
-+ * Fabien Duchêne <fabien.duchene@uclouvain.be>
-+ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
-+ * Lavkesh Lahngir <lavkesh51@gmail.com>
-+ * Andreas Ripke <ripke@neclab.eu>
-+ * Vlad Dogaru <vlad.dogaru@intel.com>
-+ * Octavian Purdila <octavian.purdila@intel.com>
-+ * John Ronan <jronan@tssg.org>
-+ * Catalin Nicutar <catalin.nicutar@gmail.com>
-+ * Brandon Heller <brandonh@stanford.edu>
-+ *
-+ *
-+ * This program is free software; you can redistribute it and/or
-+ * modify it under the terms of the GNU General Public License
-+ * as published by the Free Software Foundation; either version
-+ * 2 of the License, or (at your option) any later version.
-+ */
-+#include <net/tcp.h>
-+#include <net/mptcp.h>
-+
-+#include <linux/module.h>
-+
-+/* Scaling is done in the numerator with alpha_scale_num and in the denominator
-+ * with alpha_scale_den.
-+ *
-+ * To downscale, we just need to use alpha_scale.
-+ *
-+ * We have: alpha_scale = alpha_scale_num / (alpha_scale_den ^ 2)
-+ */
-+static int alpha_scale_den = 10;
-+static int alpha_scale_num = 32;
-+static int alpha_scale = 12;
-+
-+struct mptcp_ccc {
-+ u64 alpha;
-+ bool forced_update;
-+};
-+
-+static inline int mptcp_ccc_sk_can_send(const struct sock *sk)
-+{
-+ return mptcp_sk_can_send(sk) && tcp_sk(sk)->srtt_us;
-+}
-+
-+static inline u64 mptcp_get_alpha(const struct sock *meta_sk)
-+{
-+ return ((struct mptcp_ccc *)inet_csk_ca(meta_sk))->alpha;
-+}
-+
-+static inline void mptcp_set_alpha(const struct sock *meta_sk, u64 alpha)
-+{
-+ ((struct mptcp_ccc *)inet_csk_ca(meta_sk))->alpha = alpha;
-+}
-+
-+static inline u64 mptcp_ccc_scale(u32 val, int scale)
-+{
-+ return (u64) val << scale;
-+}
-+
-+static inline bool mptcp_get_forced(const struct sock *meta_sk)
-+{
-+ return ((struct mptcp_ccc *)inet_csk_ca(meta_sk))->forced_update;
-+}
-+
-+static inline void mptcp_set_forced(const struct sock *meta_sk, bool force)
-+{
-+ ((struct mptcp_ccc *)inet_csk_ca(meta_sk))->forced_update = force;
-+}
-+
-+static void mptcp_ccc_recalc_alpha(const struct sock *sk)
-+{
-+ const struct mptcp_cb *mpcb = tcp_sk(sk)->mpcb;
-+ const struct sock *sub_sk;
-+ int best_cwnd = 0, best_rtt = 0, can_send = 0;
-+ u64 max_numerator = 0, sum_denominator = 0, alpha = 1;
-+
-+ if (!mpcb)
-+ return;
-+
-+ /* Only one subflow left - fall back to normal reno-behavior
-+ * (set alpha to 1)
-+ */
-+ if (mpcb->cnt_established <= 1)
-+ goto exit;
-+
-+ /* Do regular alpha-calculation for multiple subflows */
-+
-+ /* Find the max numerator of the alpha-calculation */
-+ mptcp_for_each_sk(mpcb, sub_sk) {
-+ struct tcp_sock *sub_tp = tcp_sk(sub_sk);
-+ u64 tmp;
-+
-+ if (!mptcp_ccc_sk_can_send(sub_sk))
-+ continue;
-+
-+ can_send++;
-+
-+ /* We need to look for the path that provides the max value.
-+ * Integer overflow is not possible here, because
-+ * tmp is a u64.
-+ */
-+ tmp = div64_u64(mptcp_ccc_scale(sub_tp->snd_cwnd,
-+ alpha_scale_num), (u64)sub_tp->srtt_us * sub_tp->srtt_us);
-+
-+ if (tmp >= max_numerator) {
-+ max_numerator = tmp;
-+ best_cwnd = sub_tp->snd_cwnd;
-+ best_rtt = sub_tp->srtt_us;
-+ }
-+ }
-+
-+ /* No subflow is able to send - we don't care anymore */
-+ if (unlikely(!can_send))
-+ goto exit;
-+
-+ /* Calculate the denominator */
-+ mptcp_for_each_sk(mpcb, sub_sk) {
-+ struct tcp_sock *sub_tp = tcp_sk(sub_sk);
-+
-+ if (!mptcp_ccc_sk_can_send(sub_sk))
-+ continue;
-+
-+ sum_denominator += div_u64(
-+ mptcp_ccc_scale(sub_tp->snd_cwnd,
-+ alpha_scale_den) * best_rtt,
-+ sub_tp->srtt_us);
-+ }
-+ sum_denominator *= sum_denominator;
-+ if (unlikely(!sum_denominator)) {
-+ pr_err("%s: sum_denominator == 0, cnt_established:%d\n",
-+ __func__, mpcb->cnt_established);
-+ mptcp_for_each_sk(mpcb, sub_sk) {
-+ struct tcp_sock *sub_tp = tcp_sk(sub_sk);
-+ pr_err("%s: pi:%d, state:%d, rtt:%u, cwnd: %u\n",
-+ __func__, sub_tp->mptcp->path_index,
-+ sub_sk->sk_state, sub_tp->srtt_us,
-+ sub_tp->snd_cwnd);
-+ }
-+ }
-+
-+ alpha = div64_u64(mptcp_ccc_scale(best_cwnd, alpha_scale_num), sum_denominator);
-+
-+ if (unlikely(!alpha))
-+ alpha = 1;
-+
-+exit:
-+ mptcp_set_alpha(mptcp_meta_sk(sk), alpha);
-+}
-+
-+static void mptcp_ccc_init(struct sock *sk)
-+{
-+ if (mptcp(tcp_sk(sk))) {
-+ mptcp_set_forced(mptcp_meta_sk(sk), 0);
-+ mptcp_set_alpha(mptcp_meta_sk(sk), 1);
-+ }
-+ /* If this is not an MPTCP socket, behave like reno: return */
-+}
-+
-+static void mptcp_ccc_cwnd_event(struct sock *sk, enum tcp_ca_event event)
-+{
-+ if (event == CA_EVENT_LOSS)
-+ mptcp_ccc_recalc_alpha(sk);
-+}
-+
-+static void mptcp_ccc_set_state(struct sock *sk, u8 ca_state)
-+{
-+ if (!mptcp(tcp_sk(sk)))
-+ return;
-+
-+ mptcp_set_forced(mptcp_meta_sk(sk), 1);
-+}
-+
-+static void mptcp_ccc_cong_avoid(struct sock *sk, u32 ack, u32 acked)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ const struct mptcp_cb *mpcb = tp->mpcb;
-+ int snd_cwnd;
-+
-+ if (!mptcp(tp)) {
-+ tcp_reno_cong_avoid(sk, ack, acked);
-+ return;
-+ }
-+
-+ if (!tcp_is_cwnd_limited(sk))
-+ return;
-+
-+ if (tp->snd_cwnd <= tp->snd_ssthresh) {
-+ /* In "safe" area, increase. */
-+ tcp_slow_start(tp, acked);
-+ mptcp_ccc_recalc_alpha(sk);
-+ return;
-+ }
-+
-+ if (mptcp_get_forced(mptcp_meta_sk(sk))) {
-+ mptcp_ccc_recalc_alpha(sk);
-+ mptcp_set_forced(mptcp_meta_sk(sk), 0);
-+ }
-+
-+ if (mpcb->cnt_established > 1) {
-+ u64 alpha = mptcp_get_alpha(mptcp_meta_sk(sk));
-+
-+ /* This may happen if, at initialization time, the mpcb
-+ * was not yet attached to the sock, and thus
-+ * initializing alpha failed.
-+ */
-+ if (unlikely(!alpha))
-+ alpha = 1;
-+
-+ snd_cwnd = (int)div_u64((u64)mptcp_ccc_scale(1, alpha_scale),
-+ alpha);
-+
-+ /* snd_cwnd_cnt >= max (scale * tot_cwnd / alpha, cwnd)
-+ * Thus, we select here the max value.
-+ */
-+ if (snd_cwnd < tp->snd_cwnd)
-+ snd_cwnd = tp->snd_cwnd;
-+ } else {
-+ snd_cwnd = tp->snd_cwnd;
-+ }
-+
-+ if (tp->snd_cwnd_cnt >= snd_cwnd) {
-+ if (tp->snd_cwnd < tp->snd_cwnd_clamp) {
-+ tp->snd_cwnd++;
-+ mptcp_ccc_recalc_alpha(sk);
-+ }
-+
-+ tp->snd_cwnd_cnt = 0;
-+ } else {
-+ tp->snd_cwnd_cnt++;
-+ }
-+}
-+
-+static struct tcp_congestion_ops mptcp_ccc = {
-+ .init = mptcp_ccc_init,
-+ .ssthresh = tcp_reno_ssthresh,
-+ .cong_avoid = mptcp_ccc_cong_avoid,
-+ .cwnd_event = mptcp_ccc_cwnd_event,
-+ .set_state = mptcp_ccc_set_state,
-+ .owner = THIS_MODULE,
-+ .name = "lia",
-+};
-+
-+static int __init mptcp_ccc_register(void)
-+{
-+ BUILD_BUG_ON(sizeof(struct mptcp_ccc) > ICSK_CA_PRIV_SIZE);
-+ return tcp_register_congestion_control(&mptcp_ccc);
-+}
-+
-+static void __exit mptcp_ccc_unregister(void)
-+{
-+ tcp_unregister_congestion_control(&mptcp_ccc);
-+}
-+
-+module_init(mptcp_ccc_register);
-+module_exit(mptcp_ccc_unregister);
-+
-+MODULE_AUTHOR("Christoph Paasch, Sébastien Barré");
-+MODULE_LICENSE("GPL");
-+MODULE_DESCRIPTION("MPTCP LINKED INCREASE CONGESTION CONTROL ALGORITHM");
-+MODULE_VERSION("0.1");
-diff --git a/net/mptcp/mptcp_ctrl.c b/net/mptcp/mptcp_ctrl.c
-new file mode 100644
-index 000000000000..28dfa0479f5e
---- /dev/null
-+++ b/net/mptcp/mptcp_ctrl.c
-@@ -0,0 +1,2401 @@
-+/*
-+ * MPTCP implementation - MPTCP-control
-+ *
-+ * Initial Design & Implementation:
-+ * Sébastien Barré <sebastien.barre@uclouvain.be>
-+ *
-+ * Current Maintainer & Author:
-+ * Christoph Paasch <christoph.paasch@uclouvain.be>
-+ *
-+ * Additional authors:
-+ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
-+ * Gregory Detal <gregory.detal@uclouvain.be>
-+ * Fabien Duchêne <fabien.duchene@uclouvain.be>
-+ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
-+ * Lavkesh Lahngir <lavkesh51@gmail.com>
-+ * Andreas Ripke <ripke@neclab.eu>
-+ * Vlad Dogaru <vlad.dogaru@intel.com>
-+ * Octavian Purdila <octavian.purdila@intel.com>
-+ * John Ronan <jronan@tssg.org>
-+ * Catalin Nicutar <catalin.nicutar@gmail.com>
-+ * Brandon Heller <brandonh@stanford.edu>
-+ *
-+ *
-+ * This program is free software; you can redistribute it and/or
-+ * modify it under the terms of the GNU General Public License
-+ * as published by the Free Software Foundation; either version
-+ * 2 of the License, or (at your option) any later version.
-+ */
-+
-+#include <net/inet_common.h>
-+#include <net/inet6_hashtables.h>
-+#include <net/ipv6.h>
-+#include <net/ip6_checksum.h>
-+#include <net/mptcp.h>
-+#include <net/mptcp_v4.h>
-+#if IS_ENABLED(CONFIG_IPV6)
-+#include <net/ip6_route.h>
-+#include <net/mptcp_v6.h>
-+#endif
-+#include <net/sock.h>
-+#include <net/tcp.h>
-+#include <net/tcp_states.h>
-+#include <net/transp_v6.h>
-+#include <net/xfrm.h>
-+
-+#include <linux/cryptohash.h>
-+#include <linux/kconfig.h>
-+#include <linux/module.h>
-+#include <linux/netpoll.h>
-+#include <linux/list.h>
-+#include <linux/jhash.h>
-+#include <linux/tcp.h>
-+#include <linux/net.h>
-+#include <linux/in.h>
-+#include <linux/random.h>
-+#include <linux/inetdevice.h>
-+#include <linux/workqueue.h>
-+#include <linux/atomic.h>
-+#include <linux/sysctl.h>
-+
-+static struct kmem_cache *mptcp_sock_cache __read_mostly;
-+static struct kmem_cache *mptcp_cb_cache __read_mostly;
-+static struct kmem_cache *mptcp_tw_cache __read_mostly;
-+
-+int sysctl_mptcp_enabled __read_mostly = 1;
-+int sysctl_mptcp_checksum __read_mostly = 1;
-+int sysctl_mptcp_debug __read_mostly;
-+EXPORT_SYMBOL(sysctl_mptcp_debug);
-+int sysctl_mptcp_syn_retries __read_mostly = 3;
-+
-+bool mptcp_init_failed __read_mostly;
-+
-+struct static_key mptcp_static_key = STATIC_KEY_INIT_FALSE;
-+EXPORT_SYMBOL(mptcp_static_key);
-+
-+static int proc_mptcp_path_manager(ctl_table *ctl, int write,
-+ void __user *buffer, size_t *lenp,
-+ loff_t *ppos)
-+{
-+ char val[MPTCP_PM_NAME_MAX];
-+ ctl_table tbl = {
-+ .data = val,
-+ .maxlen = MPTCP_PM_NAME_MAX,
-+ };
-+ int ret;
-+
-+ mptcp_get_default_path_manager(val);
-+
-+ ret = proc_dostring(&tbl, write, buffer, lenp, ppos);
-+ if (write && ret == 0)
-+ ret = mptcp_set_default_path_manager(val);
-+ return ret;
-+}
-+
-+static int proc_mptcp_scheduler(ctl_table *ctl, int write,
-+ void __user *buffer, size_t *lenp,
-+ loff_t *ppos)
-+{
-+ char val[MPTCP_SCHED_NAME_MAX];
-+ ctl_table tbl = {
-+ .data = val,
-+ .maxlen = MPTCP_SCHED_NAME_MAX,
-+ };
-+ int ret;
-+
-+ mptcp_get_default_scheduler(val);
-+
-+ ret = proc_dostring(&tbl, write, buffer, lenp, ppos);
-+ if (write && ret == 0)
-+ ret = mptcp_set_default_scheduler(val);
-+ return ret;
-+}
-+
-+static struct ctl_table mptcp_table[] = {
-+ {
-+ .procname = "mptcp_enabled",
-+ .data = &sysctl_mptcp_enabled,
-+ .maxlen = sizeof(int),
-+ .mode = 0644,
-+ .proc_handler = &proc_dointvec
-+ },
-+ {
-+ .procname = "mptcp_checksum",
-+ .data = &sysctl_mptcp_checksum,
-+ .maxlen = sizeof(int),
-+ .mode = 0644,
-+ .proc_handler = &proc_dointvec
-+ },
-+ {
-+ .procname = "mptcp_debug",
-+ .data = &sysctl_mptcp_debug,
-+ .maxlen = sizeof(int),
-+ .mode = 0644,
-+ .proc_handler = &proc_dointvec
-+ },
-+ {
-+ .procname = "mptcp_syn_retries",
-+ .data = &sysctl_mptcp_syn_retries,
-+ .maxlen = sizeof(int),
-+ .mode = 0644,
-+ .proc_handler = &proc_dointvec
-+ },
-+ {
-+ .procname = "mptcp_path_manager",
-+ .mode = 0644,
-+ .maxlen = MPTCP_PM_NAME_MAX,
-+ .proc_handler = proc_mptcp_path_manager,
-+ },
-+ {
-+ .procname = "mptcp_scheduler",
-+ .mode = 0644,
-+ .maxlen = MPTCP_SCHED_NAME_MAX,
-+ .proc_handler = proc_mptcp_scheduler,
-+ },
-+ { }
-+};
-+
-+static inline u32 mptcp_hash_tk(u32 token)
-+{
-+ return token % MPTCP_HASH_SIZE;
-+}
-+
-+struct hlist_nulls_head tk_hashtable[MPTCP_HASH_SIZE];
-+EXPORT_SYMBOL(tk_hashtable);
-+
-+/* This second hashtable is needed to retrieve request socks
-+ * created as a result of a join request. While the SYN contains
-+ * the token, the final ack does not, so we need a separate hashtable
-+ * to retrieve the mpcb.
-+ */
-+struct hlist_nulls_head mptcp_reqsk_htb[MPTCP_HASH_SIZE];
-+spinlock_t mptcp_reqsk_hlock; /* hashtable protection */
-+
-+/* The following hash table is used to avoid collision of token */
-+static struct hlist_nulls_head mptcp_reqsk_tk_htb[MPTCP_HASH_SIZE];
-+spinlock_t mptcp_tk_hashlock; /* hashtable protection */
-+
-+static bool mptcp_reqsk_find_tk(const u32 token)
-+{
-+ const u32 hash = mptcp_hash_tk(token);
-+ const struct mptcp_request_sock *mtreqsk;
-+ const struct hlist_nulls_node *node;
-+
-+begin:
-+ hlist_nulls_for_each_entry_rcu(mtreqsk, node,
-+ &mptcp_reqsk_tk_htb[hash], hash_entry) {
-+ if (token == mtreqsk->mptcp_loc_token)
-+ return true;
-+ }
-+ /* A request-socket is destroyed by RCU. So, it might have been recycled
-+ * and put into another hash-table list. So, after the lookup we may
-+ * end up in a different list. So, we may need to restart.
-+ *
-+ * See also the comment in __inet_lookup_established.
-+ */
-+ if (get_nulls_value(node) != hash)
-+ goto begin;
-+ return false;
-+}
-+
-+static void mptcp_reqsk_insert_tk(struct request_sock *reqsk, const u32 token)
-+{
-+ u32 hash = mptcp_hash_tk(token);
-+
-+ hlist_nulls_add_head_rcu(&mptcp_rsk(reqsk)->hash_entry,
-+ &mptcp_reqsk_tk_htb[hash]);
-+}
-+
-+static void mptcp_reqsk_remove_tk(const struct request_sock *reqsk)
-+{
-+ rcu_read_lock();
-+ spin_lock(&mptcp_tk_hashlock);
-+ hlist_nulls_del_init_rcu(&mptcp_rsk(reqsk)->hash_entry);
-+ spin_unlock(&mptcp_tk_hashlock);
-+ rcu_read_unlock();
-+}
-+
-+void mptcp_reqsk_destructor(struct request_sock *req)
-+{
-+ if (!mptcp_rsk(req)->is_sub) {
-+ if (in_softirq()) {
-+ mptcp_reqsk_remove_tk(req);
-+ } else {
-+ rcu_read_lock_bh();
-+ spin_lock(&mptcp_tk_hashlock);
-+ hlist_nulls_del_init_rcu(&mptcp_rsk(req)->hash_entry);
-+ spin_unlock(&mptcp_tk_hashlock);
-+ rcu_read_unlock_bh();
-+ }
-+ } else {
-+ mptcp_hash_request_remove(req);
-+ }
-+}
-+
-+static void __mptcp_hash_insert(struct tcp_sock *meta_tp, const u32 token)
-+{
-+ u32 hash = mptcp_hash_tk(token);
-+ hlist_nulls_add_head_rcu(&meta_tp->tk_table, &tk_hashtable[hash]);
-+ meta_tp->inside_tk_table = 1;
-+}
-+
-+static bool mptcp_find_token(u32 token)
-+{
-+ const u32 hash = mptcp_hash_tk(token);
-+ const struct tcp_sock *meta_tp;
-+ const struct hlist_nulls_node *node;
-+
-+begin:
-+ hlist_nulls_for_each_entry_rcu(meta_tp, node, &tk_hashtable[hash], tk_table) {
-+ if (token == meta_tp->mptcp_loc_token)
-+ return true;
-+ }
-+ /* A TCP-socket is destroyed by RCU. So, it might have been recycled
-+ * and put into another hash-table list. So, after the lookup we may
-+ * end up in a different list. So, we may need to restart.
-+ *
-+ * See also the comment in __inet_lookup_established.
-+ */
-+ if (get_nulls_value(node) != hash)
-+ goto begin;
-+ return false;
-+}
-+
-+static void mptcp_set_key_reqsk(struct request_sock *req,
-+ const struct sk_buff *skb)
-+{
-+ const struct inet_request_sock *ireq = inet_rsk(req);
-+ struct mptcp_request_sock *mtreq = mptcp_rsk(req);
-+
-+ if (skb->protocol == htons(ETH_P_IP)) {
-+ mtreq->mptcp_loc_key = mptcp_v4_get_key(ip_hdr(skb)->saddr,
-+ ip_hdr(skb)->daddr,
-+ htons(ireq->ir_num),
-+ ireq->ir_rmt_port);
-+#if IS_ENABLED(CONFIG_IPV6)
-+ } else {
-+ mtreq->mptcp_loc_key = mptcp_v6_get_key(ipv6_hdr(skb)->saddr.s6_addr32,
-+ ipv6_hdr(skb)->daddr.s6_addr32,
-+ htons(ireq->ir_num),
-+ ireq->ir_rmt_port);
-+#endif
-+ }
-+
-+ mptcp_key_sha1(mtreq->mptcp_loc_key, &mtreq->mptcp_loc_token, NULL);
-+}
-+
-+/* New MPTCP-connection request, prepare a new token for the meta-socket that
-+ * will be created in mptcp_check_req_master(), and store the received token.
-+ */
-+void mptcp_reqsk_new_mptcp(struct request_sock *req,
-+ const struct mptcp_options_received *mopt,
-+ const struct sk_buff *skb)
-+{
-+ struct mptcp_request_sock *mtreq = mptcp_rsk(req);
-+
-+ inet_rsk(req)->saw_mpc = 1;
-+
-+ rcu_read_lock();
-+ spin_lock(&mptcp_tk_hashlock);
-+ do {
-+ mptcp_set_key_reqsk(req, skb);
-+ } while (mptcp_reqsk_find_tk(mtreq->mptcp_loc_token) ||
-+ mptcp_find_token(mtreq->mptcp_loc_token));
-+
-+ mptcp_reqsk_insert_tk(req, mtreq->mptcp_loc_token);
-+ spin_unlock(&mptcp_tk_hashlock);
-+ rcu_read_unlock();
-+ mtreq->mptcp_rem_key = mopt->mptcp_key;
-+}
-+
-+static void mptcp_set_key_sk(const struct sock *sk)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ const struct inet_sock *isk = inet_sk(sk);
-+
-+ if (sk->sk_family == AF_INET)
-+ tp->mptcp_loc_key = mptcp_v4_get_key(isk->inet_saddr,
-+ isk->inet_daddr,
-+ isk->inet_sport,
-+ isk->inet_dport);
-+#if IS_ENABLED(CONFIG_IPV6)
-+ else
-+ tp->mptcp_loc_key = mptcp_v6_get_key(inet6_sk(sk)->saddr.s6_addr32,
-+ sk->sk_v6_daddr.s6_addr32,
-+ isk->inet_sport,
-+ isk->inet_dport);
-+#endif
-+
-+ mptcp_key_sha1(tp->mptcp_loc_key,
-+ &tp->mptcp_loc_token, NULL);
-+}
-+
-+void mptcp_connect_init(struct sock *sk)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+
-+ rcu_read_lock_bh();
-+ spin_lock(&mptcp_tk_hashlock);
-+ do {
-+ mptcp_set_key_sk(sk);
-+ } while (mptcp_reqsk_find_tk(tp->mptcp_loc_token) ||
-+ mptcp_find_token(tp->mptcp_loc_token));
-+
-+ __mptcp_hash_insert(tp, tp->mptcp_loc_token);
-+ spin_unlock(&mptcp_tk_hashlock);
-+ rcu_read_unlock_bh();
-+}
-+
-+/**
-+ * This function takes a reference on the meta-socket it returns.
-+ * It is the responsibility of the caller to release that reference
-+ * when done with the socket.
-+ */
-+struct sock *mptcp_hash_find(const struct net *net, const u32 token)
-+{
-+ const u32 hash = mptcp_hash_tk(token);
-+ const struct tcp_sock *meta_tp;
-+ struct sock *meta_sk = NULL;
-+ const struct hlist_nulls_node *node;
-+
-+ rcu_read_lock();
-+begin:
-+ hlist_nulls_for_each_entry_rcu(meta_tp, node, &tk_hashtable[hash],
-+ tk_table) {
-+ meta_sk = (struct sock *)meta_tp;
-+ if (token == meta_tp->mptcp_loc_token &&
-+ net_eq(net, sock_net(meta_sk))) {
-+ if (unlikely(!atomic_inc_not_zero(&meta_sk->sk_refcnt)))
-+ goto out;
-+ if (unlikely(token != meta_tp->mptcp_loc_token ||
-+ !net_eq(net, sock_net(meta_sk)))) {
-+ sock_gen_put(meta_sk);
-+ goto begin;
-+ }
-+ goto found;
-+ }
-+ }
-+ /* A TCP-socket is destroyed by RCU. So, it might have been recycled
-+ * and put into another hash-table list. So, after the lookup we may
-+ * end up in a different list. So, we may need to restart.
-+ *
-+ * See also the comment in __inet_lookup_established.
-+ */
-+ if (get_nulls_value(node) != hash)
-+ goto begin;
-+out:
-+ meta_sk = NULL;
-+found:
-+ rcu_read_unlock();
-+ return meta_sk;
-+}
-+
-+void mptcp_hash_remove_bh(struct tcp_sock *meta_tp)
-+{
-+ /* remove from the token hashtable */
-+ rcu_read_lock_bh();
-+ spin_lock(&mptcp_tk_hashlock);
-+ hlist_nulls_del_init_rcu(&meta_tp->tk_table);
-+ meta_tp->inside_tk_table = 0;
-+ spin_unlock(&mptcp_tk_hashlock);
-+ rcu_read_unlock_bh();
-+}
-+
-+void mptcp_hash_remove(struct tcp_sock *meta_tp)
-+{
-+ rcu_read_lock();
-+ spin_lock(&mptcp_tk_hashlock);
-+ hlist_nulls_del_init_rcu(&meta_tp->tk_table);
-+ meta_tp->inside_tk_table = 0;
-+ spin_unlock(&mptcp_tk_hashlock);
-+ rcu_read_unlock();
-+}
-+
-+struct sock *mptcp_select_ack_sock(const struct sock *meta_sk)
-+{
-+ const struct tcp_sock *meta_tp = tcp_sk(meta_sk);
-+ struct sock *sk, *rttsk = NULL, *lastsk = NULL;
-+ u32 min_time = 0, last_active = 0;
-+
-+ mptcp_for_each_sk(meta_tp->mpcb, sk) {
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ u32 elapsed;
-+
-+ if (!mptcp_sk_can_send_ack(sk) || tp->pf)
-+ continue;
-+
-+ elapsed = keepalive_time_elapsed(tp);
-+
-+ /* We take the one with the lowest RTT within a reasonable
-+ * (meta-RTO) timeframe.
-+ */
-+ if (elapsed < inet_csk(meta_sk)->icsk_rto) {
-+ if (!min_time || tp->srtt_us < min_time) {
-+ min_time = tp->srtt_us;
-+ rttsk = sk;
-+ }
-+ continue;
-+ }
-+
-+ /* Otherwise, we just take the most recent active */
-+ if (!rttsk && (!last_active || elapsed < last_active)) {
-+ last_active = elapsed;
-+ lastsk = sk;
-+ }
-+ }
-+
-+ if (rttsk)
-+ return rttsk;
-+
-+ return lastsk;
-+}
-+EXPORT_SYMBOL(mptcp_select_ack_sock);
-+
-+static void mptcp_sock_def_error_report(struct sock *sk)
-+{
-+ const struct mptcp_cb *mpcb = tcp_sk(sk)->mpcb;
-+
-+ if (!sock_flag(sk, SOCK_DEAD))
-+ mptcp_sub_close(sk, 0);
-+
-+ if (mpcb->infinite_mapping_rcv || mpcb->infinite_mapping_snd ||
-+ mpcb->send_infinite_mapping) {
-+ struct sock *meta_sk = mptcp_meta_sk(sk);
-+
-+ meta_sk->sk_err = sk->sk_err;
-+ meta_sk->sk_err_soft = sk->sk_err_soft;
-+
-+ if (!sock_flag(meta_sk, SOCK_DEAD))
-+ meta_sk->sk_error_report(meta_sk);
-+
-+ tcp_done(meta_sk);
-+ }
-+
-+ sk->sk_err = 0;
-+}
-+
-+static void mptcp_mpcb_put(struct mptcp_cb *mpcb)
-+{
-+ if (atomic_dec_and_test(&mpcb->mpcb_refcnt)) {
-+ mptcp_cleanup_path_manager(mpcb);
-+ mptcp_cleanup_scheduler(mpcb);
-+ kmem_cache_free(mptcp_cb_cache, mpcb);
-+ }
-+}
-+
-+static void mptcp_sock_destruct(struct sock *sk)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+
-+ inet_sock_destruct(sk);
-+
-+ if (!is_meta_sk(sk) && !tp->was_meta_sk) {
-+ BUG_ON(!hlist_unhashed(&tp->mptcp->cb_list));
-+
-+ kmem_cache_free(mptcp_sock_cache, tp->mptcp);
-+ tp->mptcp = NULL;
-+
-+ /* Taken when mpcb pointer was set */
-+ sock_put(mptcp_meta_sk(sk));
-+ mptcp_mpcb_put(tp->mpcb);
-+ } else {
-+ struct mptcp_cb *mpcb = tp->mpcb;
-+ struct mptcp_tw *mptw;
-+
-+ /* The mpcb is disappearing - we can make the final
-+ * update to the rcv_nxt of the time-wait-sock and remove
-+ * its reference to the mpcb.
-+ */
-+ spin_lock_bh(&mpcb->tw_lock);
-+ list_for_each_entry_rcu(mptw, &mpcb->tw_list, list) {
-+ list_del_rcu(&mptw->list);
-+ mptw->in_list = 0;
-+ mptcp_mpcb_put(mpcb);
-+ rcu_assign_pointer(mptw->mpcb, NULL);
-+ }
-+ spin_unlock_bh(&mpcb->tw_lock);
-+
-+ mptcp_mpcb_put(mpcb);
-+
-+ mptcp_debug("%s destroying meta-sk\n", __func__);
-+ }
-+
-+ WARN_ON(!static_key_false(&mptcp_static_key));
-+ /* Must be the last call, because is_meta_sk() above still needs the
-+ * static key
-+ */
-+ static_key_slow_dec(&mptcp_static_key);
-+}
-+
-+void mptcp_destroy_sock(struct sock *sk)
-+{
-+ if (is_meta_sk(sk)) {
-+ struct sock *sk_it, *tmpsk;
-+
-+ __skb_queue_purge(&tcp_sk(sk)->mpcb->reinject_queue);
-+ mptcp_purge_ofo_queue(tcp_sk(sk));
-+
-+ /* We have to close all remaining subflows. Normally, they
-+ * should all be about to get closed. But, if the kernel is
-+ * forcing a closure (e.g., tcp_write_err), the subflows might
-+ * not have been closed properly (as we are waiting for the
-+ * DATA_ACK of the DATA_FIN).
-+ */
-+ mptcp_for_each_sk_safe(tcp_sk(sk)->mpcb, sk_it, tmpsk) {
-+ /* tcp_close() was already called - we are waiting for the
-+ * graceful closure, or we are retransmitting fast-close on
-+ * the subflow. The reset (or timeout) will kill the
-+ * subflow.
-+ */
-+ if (tcp_sk(sk_it)->closing ||
-+ tcp_sk(sk_it)->send_mp_fclose)
-+ continue;
-+
-+ /* Let the delayed work run first, to prevent the time-wait state */
-+ if (delayed_work_pending(&tcp_sk(sk_it)->mptcp->work))
-+ continue;
-+
-+ mptcp_sub_close(sk_it, 0);
-+ }
-+
-+ mptcp_delete_synack_timer(sk);
-+ } else {
-+ mptcp_del_sock(sk);
-+ }
-+}
-+
-+static void mptcp_set_state(struct sock *sk)
-+{
-+ struct sock *meta_sk = mptcp_meta_sk(sk);
-+
-+ /* Meta is not yet established - wake up the application */
-+ if ((1 << meta_sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV) &&
-+ sk->sk_state == TCP_ESTABLISHED) {
-+ tcp_set_state(meta_sk, TCP_ESTABLISHED);
-+
-+ if (!sock_flag(meta_sk, SOCK_DEAD)) {
-+ meta_sk->sk_state_change(meta_sk);
-+ sk_wake_async(meta_sk, SOCK_WAKE_IO, POLL_OUT);
-+ }
-+ }
-+
-+ if (sk->sk_state == TCP_ESTABLISHED) {
-+ tcp_sk(sk)->mptcp->establish_increased = 1;
-+ tcp_sk(sk)->mpcb->cnt_established++;
-+ }
-+}
-+
-+void mptcp_init_congestion_control(struct sock *sk)
-+{
-+ struct inet_connection_sock *icsk = inet_csk(sk);
-+ struct inet_connection_sock *meta_icsk = inet_csk(mptcp_meta_sk(sk));
-+ const struct tcp_congestion_ops *ca = meta_icsk->icsk_ca_ops;
-+
-+ /* The application didn't set the congestion control to use -
-+ * fall back to the default one.
-+ */
-+ if (ca == &tcp_init_congestion_ops)
-+ goto use_default;
-+
-+ /* Use the same congestion control as set by the user. If the
-+ * module is not available, fall back to the default one.
-+ */
-+ if (!try_module_get(ca->owner)) {
-+ pr_warn("%s: fallback to the system default CC\n", __func__);
-+ goto use_default;
-+ }
-+
-+ icsk->icsk_ca_ops = ca;
-+ if (icsk->icsk_ca_ops->init)
-+ icsk->icsk_ca_ops->init(sk);
-+
-+ return;
-+
-+use_default:
-+ icsk->icsk_ca_ops = &tcp_init_congestion_ops;
-+ tcp_init_congestion_control(sk);
-+}
-+
-+u32 mptcp_secret[MD5_MESSAGE_BYTES / 4] ____cacheline_aligned;
-+u32 mptcp_seed = 0;
-+
-+void mptcp_key_sha1(u64 key, u32 *token, u64 *idsn)
-+{
-+ u32 workspace[SHA_WORKSPACE_WORDS];
-+ u32 mptcp_hashed_key[SHA_DIGEST_WORDS];
-+ u8 input[64];
-+ int i;
-+
-+ memset(workspace, 0, sizeof(workspace));
-+
-+ /* Initialize input with appropriate padding */
-+ memset(&input[9], 0, sizeof(input) - 10); /* -10, because the last byte
-+ * is explicitly set too
-+ */
-+ memcpy(input, &key, sizeof(key)); /* Copy key to the msg beginning */
-+ input[8] = 0x80; /* Padding: First bit after message = 1 */
-+ input[63] = 0x40; /* Padding: Length of the message = 64 bits */
-+
-+ sha_init(mptcp_hashed_key);
-+ sha_transform(mptcp_hashed_key, input, workspace);
-+
-+ for (i = 0; i < 5; i++)
-+ mptcp_hashed_key[i] = cpu_to_be32(mptcp_hashed_key[i]);
-+
-+ if (token)
-+ *token = mptcp_hashed_key[0];
-+ if (idsn)
-+ *idsn = *((u64 *)&mptcp_hashed_key[3]);
-+}
-+
-+void mptcp_hmac_sha1(u8 *key_1, u8 *key_2, u8 *rand_1, u8 *rand_2,
-+ u32 *hash_out)
-+{
-+ u32 workspace[SHA_WORKSPACE_WORDS];
-+ u8 input[128]; /* 2 512-bit blocks */
-+ int i;
-+
-+ memset(workspace, 0, sizeof(workspace));
-+
-+ /* Generate key xored with ipad */
-+ memset(input, 0x36, 64);
-+ for (i = 0; i < 8; i++)
-+ input[i] ^= key_1[i];
-+ for (i = 0; i < 8; i++)
-+ input[i + 8] ^= key_2[i];
-+
-+ memcpy(&input[64], rand_1, 4);
-+ memcpy(&input[68], rand_2, 4);
-+ input[72] = 0x80; /* Padding: First bit after message = 1 */
-+ memset(&input[73], 0, 53);
-+
-+ /* Padding: Length of the message = 512 + 64 bits */
-+ input[126] = 0x02;
-+ input[127] = 0x40;
-+
-+ sha_init(hash_out);
-+ sha_transform(hash_out, input, workspace);
-+ memset(workspace, 0, sizeof(workspace));
-+
-+ sha_transform(hash_out, &input[64], workspace);
-+ memset(workspace, 0, sizeof(workspace));
-+
-+ for (i = 0; i < 5; i++)
-+ hash_out[i] = cpu_to_be32(hash_out[i]);
-+
-+ /* Prepare second part of hmac */
-+ memset(input, 0x5C, 64);
-+ for (i = 0; i < 8; i++)
-+ input[i] ^= key_1[i];
-+ for (i = 0; i < 8; i++)
-+ input[i + 8] ^= key_2[i];
-+
-+ memcpy(&input[64], hash_out, 20);
-+ input[84] = 0x80;
-+ memset(&input[85], 0, 41);
-+
-+ /* Padding: Length of the message = 512 + 160 bits */
-+ input[126] = 0x02;
-+ input[127] = 0xA0;
-+
-+ sha_init(hash_out);
-+ sha_transform(hash_out, input, workspace);
-+ memset(workspace, 0, sizeof(workspace));
-+
-+ sha_transform(hash_out, &input[64], workspace);
-+
-+ for (i = 0; i < 5; i++)
-+ hash_out[i] = cpu_to_be32(hash_out[i]);
-+}
-+
-+static void mptcp_mpcb_inherit_sockopts(struct sock *meta_sk, struct sock *master_sk)
-+{
-+ /* Socket-options handled by sk_clone_lock while creating the meta-sk.
-+ * ======
-+ * SO_SNDBUF, SO_SNDBUFFORCE, SO_RCVBUF, SO_RCVBUFFORCE, SO_RCVLOWAT,
-+ * SO_RCVTIMEO, SO_SNDTIMEO, SO_ATTACH_FILTER, SO_DETACH_FILTER,
-+ * TCP_NODELAY, TCP_CORK
-+ *
-+ * Socket-options handled in this function here
-+ * ======
-+ * TCP_DEFER_ACCEPT
-+ * SO_KEEPALIVE
-+ *
-+ * Socket-options on the todo-list
-+ * ======
-+ * SO_BINDTODEVICE - should probably prevent creation of new subsocks
-+ * across other devices. - what about the api-draft?
-+ * SO_DEBUG
-+ * SO_REUSEADDR - probably we don't care about this
-+ * SO_DONTROUTE, SO_BROADCAST
-+ * SO_OOBINLINE
-+ * SO_LINGER
-+ * SO_TIMESTAMP* - I don't think this is of concern for a SOCK_STREAM
-+ * SO_PASSSEC - I don't think this is of concern for a SOCK_STREAM
-+ * SO_RXQ_OVFL
-+ * TCP_COOKIE_TRANSACTIONS
-+ * TCP_MAXSEG
-+ * TCP_THIN_* - Handled by sk_clone_lock, but we need to support this
-+ * in mptcp_retransmit_timer. AND we need to check what is
-+ * about the subsockets.
-+ * TCP_LINGER2
-+ * TCP_WINDOW_CLAMP
-+ * TCP_USER_TIMEOUT
-+ * TCP_MD5SIG
-+ *
-+ * Socket-options of no concern for the meta-socket (but for the subsocket)
-+ * ======
-+ * SO_PRIORITY
-+ * SO_MARK
-+ * TCP_CONGESTION
-+ * TCP_SYNCNT
-+ * TCP_QUICKACK
-+ */
-+
-+ /* DEFER_ACCEPT should not be set on the meta, as we want to accept new subflows directly */
-+ inet_csk(meta_sk)->icsk_accept_queue.rskq_defer_accept = 0;
-+
-+ /* Keepalives are handled entirely at the MPTCP-layer */
-+ if (sock_flag(meta_sk, SOCK_KEEPOPEN)) {
-+ inet_csk_reset_keepalive_timer(meta_sk,
-+ keepalive_time_when(tcp_sk(meta_sk)));
-+ sock_reset_flag(master_sk, SOCK_KEEPOPEN);
-+ inet_csk_delete_keepalive_timer(master_sk);
-+ }
-+
-+ /* Do not propagate subflow-errors up to the MPTCP-layer */
-+ inet_sk(master_sk)->recverr = 0;
-+}
-+
-+static void mptcp_sub_inherit_sockopts(const struct sock *meta_sk, struct sock *sub_sk)
-+{
-+ /* IP_TOS also goes to the subflow. */
-+ if (inet_sk(sub_sk)->tos != inet_sk(meta_sk)->tos) {
-+ inet_sk(sub_sk)->tos = inet_sk(meta_sk)->tos;
-+ sub_sk->sk_priority = meta_sk->sk_priority;
-+ sk_dst_reset(sub_sk);
-+ }
-+
-+ /* Inherit SO_REUSEADDR */
-+ sub_sk->sk_reuse = meta_sk->sk_reuse;
-+
-+ /* Inherit snd/rcv-buffer locks */
-+ sub_sk->sk_userlocks = meta_sk->sk_userlocks & ~SOCK_BINDPORT_LOCK;
-+
-+ /* Nagle/Cork is forced off on the subflows. It is handled at the meta-layer */
-+ tcp_sk(sub_sk)->nonagle = TCP_NAGLE_OFF|TCP_NAGLE_PUSH;
-+
-+ /* Keepalives are handled entirely at the MPTCP-layer */
-+ if (sock_flag(sub_sk, SOCK_KEEPOPEN)) {
-+ sock_reset_flag(sub_sk, SOCK_KEEPOPEN);
-+ inet_csk_delete_keepalive_timer(sub_sk);
-+ }
-+
-+ /* Do not propagate subflow-errors up to the MPTCP-layer */
-+ inet_sk(sub_sk)->recverr = 0;
-+}
-+
-+int mptcp_backlog_rcv(struct sock *meta_sk, struct sk_buff *skb)
-+{
-+ /* skb->sk may be NULL if we receive a packet immediately after the
-+ * SYN/ACK + MP_CAPABLE.
-+ */
-+ struct sock *sk = skb->sk ? skb->sk : meta_sk;
-+ int ret = 0;
-+
-+ skb->sk = NULL;
-+
-+ if (unlikely(!atomic_inc_not_zero(&sk->sk_refcnt))) {
-+ kfree_skb(skb);
-+ return 0;
-+ }
-+
-+ if (sk->sk_family == AF_INET)
-+ ret = tcp_v4_do_rcv(sk, skb);
-+#if IS_ENABLED(CONFIG_IPV6)
-+ else
-+ ret = tcp_v6_do_rcv(sk, skb);
-+#endif
-+
-+ sock_put(sk);
-+ return ret;
-+}
-+
-+struct lock_class_key meta_key;
-+struct lock_class_key meta_slock_key;
-+
-+static void mptcp_synack_timer_handler(unsigned long data)
-+{
-+ struct sock *meta_sk = (struct sock *) data;
-+ struct listen_sock *lopt = inet_csk(meta_sk)->icsk_accept_queue.listen_opt;
-+
-+ /* Only process if socket is not in use. */
-+ bh_lock_sock(meta_sk);
-+
-+ if (sock_owned_by_user(meta_sk)) {
-+ /* Try again later. */
-+ mptcp_reset_synack_timer(meta_sk, HZ/20);
-+ goto out;
-+ }
-+
-+ /* May happen if the queue got destroyed in mptcp_close() */
-+ if (!lopt)
-+ goto out;
-+
-+ inet_csk_reqsk_queue_prune(meta_sk, TCP_SYNQ_INTERVAL,
-+ TCP_TIMEOUT_INIT, TCP_RTO_MAX);
-+
-+ if (lopt->qlen)
-+ mptcp_reset_synack_timer(meta_sk, TCP_SYNQ_INTERVAL);
-+
-+out:
-+ bh_unlock_sock(meta_sk);
-+ sock_put(meta_sk);
-+}
-+
-+static const struct tcp_sock_ops mptcp_meta_specific = {
-+ .__select_window = __mptcp_select_window,
-+ .select_window = mptcp_select_window,
-+ .select_initial_window = mptcp_select_initial_window,
-+ .init_buffer_space = mptcp_init_buffer_space,
-+ .set_rto = mptcp_tcp_set_rto,
-+ .should_expand_sndbuf = mptcp_should_expand_sndbuf,
-+ .init_congestion_control = mptcp_init_congestion_control,
-+ .send_fin = mptcp_send_fin,
-+ .write_xmit = mptcp_write_xmit,
-+ .send_active_reset = mptcp_send_active_reset,
-+ .write_wakeup = mptcp_write_wakeup,
-+ .prune_ofo_queue = mptcp_prune_ofo_queue,
-+ .retransmit_timer = mptcp_retransmit_timer,
-+ .time_wait = mptcp_time_wait,
-+ .cleanup_rbuf = mptcp_cleanup_rbuf,
-+};
-+
-+static const struct tcp_sock_ops mptcp_sub_specific = {
-+ .__select_window = __mptcp_select_window,
-+ .select_window = mptcp_select_window,
-+ .select_initial_window = mptcp_select_initial_window,
-+ .init_buffer_space = mptcp_init_buffer_space,
-+ .set_rto = mptcp_tcp_set_rto,
-+ .should_expand_sndbuf = mptcp_should_expand_sndbuf,
-+ .init_congestion_control = mptcp_init_congestion_control,
-+ .send_fin = tcp_send_fin,
-+ .write_xmit = tcp_write_xmit,
-+ .send_active_reset = tcp_send_active_reset,
-+ .write_wakeup = tcp_write_wakeup,
-+ .prune_ofo_queue = tcp_prune_ofo_queue,
-+ .retransmit_timer = tcp_retransmit_timer,
-+ .time_wait = tcp_time_wait,
-+ .cleanup_rbuf = tcp_cleanup_rbuf,
-+};
-+
-+static int mptcp_alloc_mpcb(struct sock *meta_sk, __u64 remote_key, u32 window)
-+{
-+ struct mptcp_cb *mpcb;
-+ struct sock *master_sk;
-+ struct inet_connection_sock *master_icsk, *meta_icsk = inet_csk(meta_sk);
-+ struct tcp_sock *master_tp, *meta_tp = tcp_sk(meta_sk);
-+ u64 idsn;
-+
-+ dst_release(meta_sk->sk_rx_dst);
-+ meta_sk->sk_rx_dst = NULL;
-+ /* This flag is set to announce sock_lock_init to
-+ * reclassify the lock-class of the master socket.
-+ */
-+ meta_tp->is_master_sk = 1;
-+ master_sk = sk_clone_lock(meta_sk, GFP_ATOMIC | __GFP_ZERO);
-+ meta_tp->is_master_sk = 0;
-+ if (!master_sk)
-+ return -ENOBUFS;
-+
-+ master_tp = tcp_sk(master_sk);
-+ master_icsk = inet_csk(master_sk);
-+
-+ mpcb = kmem_cache_zalloc(mptcp_cb_cache, GFP_ATOMIC);
-+ if (!mpcb) {
-+ /* sk_free() (and __sk_free()) requires wmem_alloc to be 1.
-+ * All the rest is set to 0 thanks to __GFP_ZERO above.
-+ */
-+ atomic_set(&master_sk->sk_wmem_alloc, 1);
-+ sk_free(master_sk);
-+ return -ENOBUFS;
-+ }
-+
-+#if IS_ENABLED(CONFIG_IPV6)
-+ if (meta_icsk->icsk_af_ops == &mptcp_v6_mapped) {
-+ struct ipv6_pinfo *newnp, *np = inet6_sk(meta_sk);
-+
-+ inet_sk(master_sk)->pinet6 = &((struct tcp6_sock *)master_sk)->inet6;
-+
-+ newnp = inet6_sk(master_sk);
-+ memcpy(newnp, np, sizeof(struct ipv6_pinfo));
-+
-+ newnp->ipv6_mc_list = NULL;
-+ newnp->ipv6_ac_list = NULL;
-+ newnp->ipv6_fl_list = NULL;
-+ newnp->opt = NULL;
-+ newnp->pktoptions = NULL;
-+ (void)xchg(&newnp->rxpmtu, NULL);
-+ } else if (meta_sk->sk_family == AF_INET6) {
-+ struct ipv6_pinfo *newnp, *np = inet6_sk(meta_sk);
-+
-+ inet_sk(master_sk)->pinet6 = &((struct tcp6_sock *)master_sk)->inet6;
-+
-+ newnp = inet6_sk(master_sk);
-+ memcpy(newnp, np, sizeof(struct ipv6_pinfo));
-+
-+ newnp->hop_limit = -1;
-+ newnp->mcast_hops = IPV6_DEFAULT_MCASTHOPS;
-+ newnp->mc_loop = 1;
-+ newnp->pmtudisc = IPV6_PMTUDISC_WANT;
-+ newnp->ipv6only = sock_net(master_sk)->ipv6.sysctl.bindv6only;
-+ }
-+#endif
-+
-+ meta_tp->mptcp = NULL;
-+
-+ /* Store the keys and generate the peer's token */
-+ mpcb->mptcp_loc_key = meta_tp->mptcp_loc_key;
-+ mpcb->mptcp_loc_token = meta_tp->mptcp_loc_token;
-+
-+ /* Generate Initial data-sequence-numbers */
-+ mptcp_key_sha1(mpcb->mptcp_loc_key, NULL, &idsn);
-+ idsn = ntohll(idsn) + 1;
-+ mpcb->snd_high_order[0] = idsn >> 32;
-+ mpcb->snd_high_order[1] = mpcb->snd_high_order[0] - 1;
-+
-+ meta_tp->write_seq = (u32)idsn;
-+ meta_tp->snd_sml = meta_tp->write_seq;
-+ meta_tp->snd_una = meta_tp->write_seq;
-+ meta_tp->snd_nxt = meta_tp->write_seq;
-+ meta_tp->pushed_seq = meta_tp->write_seq;
-+ meta_tp->snd_up = meta_tp->write_seq;
-+
-+ mpcb->mptcp_rem_key = remote_key;
-+ mptcp_key_sha1(mpcb->mptcp_rem_key, &mpcb->mptcp_rem_token, &idsn);
-+ idsn = ntohll(idsn) + 1;
-+ mpcb->rcv_high_order[0] = idsn >> 32;
-+ mpcb->rcv_high_order[1] = mpcb->rcv_high_order[0] + 1;
-+ meta_tp->copied_seq = (u32) idsn;
-+ meta_tp->rcv_nxt = (u32) idsn;
-+ meta_tp->rcv_wup = (u32) idsn;
-+
-+ meta_tp->snd_wl1 = meta_tp->rcv_nxt - 1;
-+ meta_tp->snd_wnd = window;
-+ meta_tp->retrans_stamp = 0; /* Set in tcp_connect() */
-+
-+ meta_tp->packets_out = 0;
-+ meta_icsk->icsk_probes_out = 0;
-+
-+ /* Set mptcp-pointers */
-+ master_tp->mpcb = mpcb;
-+ master_tp->meta_sk = meta_sk;
-+ meta_tp->mpcb = mpcb;
-+ meta_tp->meta_sk = meta_sk;
-+ mpcb->meta_sk = meta_sk;
-+ mpcb->master_sk = master_sk;
-+
-+ meta_tp->was_meta_sk = 0;
-+
-+ /* Initialize the queues */
-+ skb_queue_head_init(&mpcb->reinject_queue);
-+ skb_queue_head_init(&master_tp->out_of_order_queue);
-+ tcp_prequeue_init(master_tp);
-+ INIT_LIST_HEAD(&master_tp->tsq_node);
-+
-+ master_tp->tsq_flags = 0;
-+
-+ mutex_init(&mpcb->mpcb_mutex);
-+
-+ /* Init the accept_queue structure. We support a queue of 32 pending
-+ * connections; it does not need to be huge, since we only store
-+ * pending subflow creations here.
-+ */
-+ if (reqsk_queue_alloc(&meta_icsk->icsk_accept_queue, 32, GFP_ATOMIC)) {
-+ inet_put_port(master_sk);
-+ kmem_cache_free(mptcp_cb_cache, mpcb);
-+ sk_free(master_sk);
-+ return -ENOMEM;
-+ }
-+
-+ /* Redefine function-pointers as the meta-sk is now fully ready */
-+ static_key_slow_inc(&mptcp_static_key);
-+ meta_tp->mpc = 1;
-+ meta_tp->ops = &mptcp_meta_specific;
-+
-+ meta_sk->sk_backlog_rcv = mptcp_backlog_rcv;
-+ meta_sk->sk_destruct = mptcp_sock_destruct;
-+
-+ /* Meta-level retransmit timer */
-+ meta_icsk->icsk_rto *= 2; /* Double of initial - rto */
-+
-+ tcp_init_xmit_timers(master_sk);
-+ /* Has been set for sending out the SYN */
-+ inet_csk_clear_xmit_timer(meta_sk, ICSK_TIME_RETRANS);
-+
-+ if (!meta_tp->inside_tk_table) {
-+ /* Add the meta_tp to the token hashtable - coming from server-side */
-+ rcu_read_lock();
-+ spin_lock(&mptcp_tk_hashlock);
-+
-+ __mptcp_hash_insert(meta_tp, mpcb->mptcp_loc_token);
-+
-+ spin_unlock(&mptcp_tk_hashlock);
-+ rcu_read_unlock();
-+ }
-+ master_tp->inside_tk_table = 0;
-+
-+ /* Init time-wait stuff */
-+ INIT_LIST_HEAD(&mpcb->tw_list);
-+ spin_lock_init(&mpcb->tw_lock);
-+
-+ INIT_HLIST_HEAD(&mpcb->callback_list);
-+
-+ mptcp_mpcb_inherit_sockopts(meta_sk, master_sk);
-+
-+ mpcb->orig_sk_rcvbuf = meta_sk->sk_rcvbuf;
-+ mpcb->orig_sk_sndbuf = meta_sk->sk_sndbuf;
-+ mpcb->orig_window_clamp = meta_tp->window_clamp;
-+
-+ /* The meta is directly linked - set refcnt to 1 */
-+ atomic_set(&mpcb->mpcb_refcnt, 1);
-+
-+ mptcp_init_path_manager(mpcb);
-+ mptcp_init_scheduler(mpcb);
-+
-+ setup_timer(&mpcb->synack_timer, mptcp_synack_timer_handler,
-+ (unsigned long)meta_sk);
-+
-+ mptcp_debug("%s: created mpcb with token %#x\n",
-+ __func__, mpcb->mptcp_loc_token);
-+
-+ return 0;
-+}
-+
-+void mptcp_fallback_meta_sk(struct sock *meta_sk)
-+{
-+ kfree(inet_csk(meta_sk)->icsk_accept_queue.listen_opt);
-+ kmem_cache_free(mptcp_cb_cache, tcp_sk(meta_sk)->mpcb);
-+}
-+
-+int mptcp_add_sock(struct sock *meta_sk, struct sock *sk, u8 loc_id, u8 rem_id,
-+ gfp_t flags)
-+{
-+ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
-+ struct tcp_sock *tp = tcp_sk(sk);
-+
-+ tp->mptcp = kmem_cache_zalloc(mptcp_sock_cache, flags);
-+ if (!tp->mptcp)
-+ return -ENOMEM;
-+
-+ tp->mptcp->path_index = mptcp_set_new_pathindex(mpcb);
-+ /* No more space for more subflows? */
-+ if (!tp->mptcp->path_index) {
-+ kmem_cache_free(mptcp_sock_cache, tp->mptcp);
-+ return -EPERM;
-+ }
-+
-+ INIT_HLIST_NODE(&tp->mptcp->cb_list);
-+
-+ tp->mptcp->tp = tp;
-+ tp->mpcb = mpcb;
-+ tp->meta_sk = meta_sk;
-+
-+ static_key_slow_inc(&mptcp_static_key);
-+ tp->mpc = 1;
-+ tp->ops = &mptcp_sub_specific;
-+
-+ tp->mptcp->loc_id = loc_id;
-+ tp->mptcp->rem_id = rem_id;
-+ if (mpcb->sched_ops->init)
-+ mpcb->sched_ops->init(sk);
-+
-+ /* The corresponding sock_put is in mptcp_sock_destruct(). It cannot be
-+ * included in mptcp_del_sock(), because the mpcb must remain alive
-+ * until the last subsocket is completely destroyed.
-+ */
-+ sock_hold(meta_sk);
-+ atomic_inc(&mpcb->mpcb_refcnt);
-+
-+ tp->mptcp->next = mpcb->connection_list;
-+ mpcb->connection_list = tp;
-+ tp->mptcp->attached = 1;
-+
-+ mpcb->cnt_subflows++;
-+ atomic_add(atomic_read(&((struct sock *)tp)->sk_rmem_alloc),
-+ &meta_sk->sk_rmem_alloc);
-+
-+ mptcp_sub_inherit_sockopts(meta_sk, sk);
-+ INIT_DELAYED_WORK(&tp->mptcp->work, mptcp_sub_close_wq);
-+
-+ /* As we successfully allocated the mptcp_tcp_sock, we have to
-+ * change the function-pointers here (for sk_destruct to work correctly)
-+ */
-+ sk->sk_error_report = mptcp_sock_def_error_report;
-+ sk->sk_data_ready = mptcp_data_ready;
-+ sk->sk_write_space = mptcp_write_space;
-+ sk->sk_state_change = mptcp_set_state;
-+ sk->sk_destruct = mptcp_sock_destruct;
-+
-+ if (sk->sk_family == AF_INET)
-+ mptcp_debug("%s: token %#x pi %d, src_addr:%pI4:%d dst_addr:%pI4:%d, cnt_subflows now %d\n",
-+ __func__ , mpcb->mptcp_loc_token,
-+ tp->mptcp->path_index,
-+ &((struct inet_sock *)tp)->inet_saddr,
-+ ntohs(((struct inet_sock *)tp)->inet_sport),
-+ &((struct inet_sock *)tp)->inet_daddr,
-+ ntohs(((struct inet_sock *)tp)->inet_dport),
-+ mpcb->cnt_subflows);
-+#if IS_ENABLED(CONFIG_IPV6)
-+ else
-+ mptcp_debug("%s: token %#x pi %d, src_addr:%pI6:%d dst_addr:%pI6:%d, cnt_subflows now %d\n",
-+ __func__ , mpcb->mptcp_loc_token,
-+ tp->mptcp->path_index, &inet6_sk(sk)->saddr,
-+ ntohs(((struct inet_sock *)tp)->inet_sport),
-+ &sk->sk_v6_daddr,
-+ ntohs(((struct inet_sock *)tp)->inet_dport),
-+ mpcb->cnt_subflows);
-+#endif
-+
-+ return 0;
-+}
-+
-+void mptcp_del_sock(struct sock *sk)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk), *tp_prev;
-+ struct mptcp_cb *mpcb;
-+
-+ if (!tp->mptcp || !tp->mptcp->attached)
-+ return;
-+
-+ mpcb = tp->mpcb;
-+ tp_prev = mpcb->connection_list;
-+
-+ mptcp_debug("%s: Removing subsock tok %#x pi:%d state %d is_meta? %d\n",
-+ __func__, mpcb->mptcp_loc_token, tp->mptcp->path_index,
-+ sk->sk_state, is_meta_sk(sk));
-+
-+ if (tp_prev == tp) {
-+ mpcb->connection_list = tp->mptcp->next;
-+ } else {
-+ for (; tp_prev && tp_prev->mptcp->next; tp_prev = tp_prev->mptcp->next) {
-+ if (tp_prev->mptcp->next == tp) {
-+ tp_prev->mptcp->next = tp->mptcp->next;
-+ break;
-+ }
-+ }
-+ }
-+ mpcb->cnt_subflows--;
-+ if (tp->mptcp->establish_increased)
-+ mpcb->cnt_established--;
-+
-+ tp->mptcp->next = NULL;
-+ tp->mptcp->attached = 0;
-+ mpcb->path_index_bits &= ~(1 << tp->mptcp->path_index);
-+
-+ if (!skb_queue_empty(&sk->sk_write_queue))
-+ mptcp_reinject_data(sk, 0);
-+
-+ if (is_master_tp(tp))
-+ mpcb->master_sk = NULL;
-+ else if (tp->mptcp->pre_established)
-+ sk_stop_timer(sk, &tp->mptcp->mptcp_ack_timer);
-+
-+ rcu_assign_pointer(inet_sk(sk)->inet_opt, NULL);
-+}
-+
-+/* Updates the metasocket ULID/port data, based on the given sock.
-+ * The argument sock must be the sock accessible to the application.
-+ * In this function, we update the meta socket info, based on the changes
-+ * in the application socket (bind, address allocation, ...)
-+ */
-+void mptcp_update_metasocket(struct sock *sk, const struct sock *meta_sk)
-+{
-+ if (tcp_sk(sk)->mpcb->pm_ops->new_session)
-+ tcp_sk(sk)->mpcb->pm_ops->new_session(meta_sk);
-+
-+ tcp_sk(sk)->mptcp->send_mp_prio = tcp_sk(sk)->mptcp->low_prio;
-+}
-+
-+/* Clean up the receive buffer for full frames taken by the user,
-+ * then send an ACK if necessary. COPIED is the number of bytes
-+ * tcp_recvmsg has given to the user so far, it speeds up the
-+ * calculation of whether or not we must ACK for the sake of
-+ * a window update.
-+ */
-+void mptcp_cleanup_rbuf(struct sock *meta_sk, int copied)
-+{
-+ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
-+ struct sock *sk;
-+ __u32 rcv_window_now = 0;
-+
-+ if (copied > 0 && !(meta_sk->sk_shutdown & RCV_SHUTDOWN)) {
-+ rcv_window_now = tcp_receive_window(meta_tp);
-+
-+ if (2 * rcv_window_now > meta_tp->window_clamp)
-+ rcv_window_now = 0;
-+ }
-+
-+ mptcp_for_each_sk(meta_tp->mpcb, sk) {
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ const struct inet_connection_sock *icsk = inet_csk(sk);
-+
-+ if (!mptcp_sk_can_send_ack(sk))
-+ continue;
-+
-+ if (!inet_csk_ack_scheduled(sk))
-+ goto second_part;
-+ /* Delayed ACKs frequently hit locked sockets during bulk
-+ * receive.
-+ */
-+ if (icsk->icsk_ack.blocked ||
-+ /* Once-per-two-segments ACK was not sent by tcp_input.c */
-+ tp->rcv_nxt - tp->rcv_wup > icsk->icsk_ack.rcv_mss ||
-+ /* If this read emptied read buffer, we send ACK, if
-+ * connection is not bidirectional, user drained
-+ * receive buffer and there was a small segment
-+ * in queue.
-+ */
-+ (copied > 0 &&
-+ ((icsk->icsk_ack.pending & ICSK_ACK_PUSHED2) ||
-+ ((icsk->icsk_ack.pending & ICSK_ACK_PUSHED) &&
-+ !icsk->icsk_ack.pingpong)) &&
-+ !atomic_read(&meta_sk->sk_rmem_alloc))) {
-+ tcp_send_ack(sk);
-+ continue;
-+ }
-+
-+second_part:
-+ /* This is the second part of tcp_cleanup_rbuf */
-+ if (rcv_window_now) {
-+ __u32 new_window = tp->ops->__select_window(sk);
-+
-+ /* Send ACK now, if this read freed lots of space
-+ * in our buffer. Certainly, new_window is new window.
-+ * We can advertise it now, if it is not less than
-+ * current one.
-+ * "Lots" means "at least twice" here.
-+ */
-+ if (new_window && new_window >= 2 * rcv_window_now)
-+ tcp_send_ack(sk);
-+ }
-+ }
-+}
-+
-+static int mptcp_sub_send_fin(struct sock *sk)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ struct sk_buff *skb = tcp_write_queue_tail(sk);
-+ int mss_now;
-+
-+ /* Optimization, tack on the FIN if we have a queue of
-+ * unsent frames. But be careful about outgoing SACKS
-+ * and IP options.
-+ */
-+ mss_now = tcp_current_mss(sk);
-+
-+ if (tcp_send_head(sk) != NULL) {
-+ TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_FIN;
-+ TCP_SKB_CB(skb)->end_seq++;
-+ tp->write_seq++;
-+ } else {
-+ skb = alloc_skb_fclone(MAX_TCP_HEADER, GFP_ATOMIC);
-+ if (!skb)
-+ return 1;
-+
-+ /* Reserve space for headers and prepare control bits. */
-+ skb_reserve(skb, MAX_TCP_HEADER);
-+ /* FIN eats a sequence byte, write_seq advanced by tcp_queue_skb(). */
-+ tcp_init_nondata_skb(skb, tp->write_seq,
-+ TCPHDR_ACK | TCPHDR_FIN);
-+ tcp_queue_skb(sk, skb);
-+ }
-+ __tcp_push_pending_frames(sk, mss_now, TCP_NAGLE_OFF);
-+
-+ return 0;
-+}
-+
-+void mptcp_sub_close_wq(struct work_struct *work)
-+{
-+ struct tcp_sock *tp = container_of(work, struct mptcp_tcp_sock, work.work)->tp;
-+ struct sock *sk = (struct sock *)tp;
-+ struct sock *meta_sk = mptcp_meta_sk(sk);
-+
-+ mutex_lock(&tp->mpcb->mpcb_mutex);
-+ lock_sock_nested(meta_sk, SINGLE_DEPTH_NESTING);
-+
-+ if (sock_flag(sk, SOCK_DEAD))
-+ goto exit;
-+
-+ /* We come from tcp_disconnect. We are sure that meta_sk is set */
-+ if (!mptcp(tp)) {
-+ tp->closing = 1;
-+ sock_rps_reset_flow(sk);
-+ tcp_close(sk, 0);
-+ goto exit;
-+ }
-+
-+ if (meta_sk->sk_shutdown == SHUTDOWN_MASK || sk->sk_state == TCP_CLOSE) {
-+ tp->closing = 1;
-+ sock_rps_reset_flow(sk);
-+ tcp_close(sk, 0);
-+ } else if (tcp_close_state(sk)) {
-+ sk->sk_shutdown |= SEND_SHUTDOWN;
-+ tcp_send_fin(sk);
-+ }
-+
-+exit:
-+ release_sock(meta_sk);
-+ mutex_unlock(&tp->mpcb->mpcb_mutex);
-+ sock_put(sk);
-+}
-+
-+void mptcp_sub_close(struct sock *sk, unsigned long delay)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ struct delayed_work *work = &tcp_sk(sk)->mptcp->work;
-+
-+ /* We are already closing - e.g., call from sock_def_error_report upon
-+ * tcp_disconnect in tcp_close.
-+ */
-+ if (tp->closing)
-+ return;
-+
-+ /* Work already scheduled? */
-+ if (work_pending(&work->work)) {
-+ /* Work present - who will be first? */
-+ if (jiffies + delay > work->timer.expires)
-+ return;
-+
-+ /* Try canceling - if it fails, work will be executed soon */
-+ if (!cancel_delayed_work(work))
-+ return;
-+ sock_put(sk);
-+ }
-+
-+ if (!delay) {
-+ unsigned char old_state = sk->sk_state;
-+
-+ /* If we are in user-context we can directly do the closing
-+ * procedure. No need to schedule a work-queue.
-+ */
-+ if (!in_softirq()) {
-+ if (sock_flag(sk, SOCK_DEAD))
-+ return;
-+
-+ if (!mptcp(tp)) {
-+ tp->closing = 1;
-+ sock_rps_reset_flow(sk);
-+ tcp_close(sk, 0);
-+ return;
-+ }
-+
-+ if (mptcp_meta_sk(sk)->sk_shutdown == SHUTDOWN_MASK ||
-+ sk->sk_state == TCP_CLOSE) {
-+ tp->closing = 1;
-+ sock_rps_reset_flow(sk);
-+ tcp_close(sk, 0);
-+ } else if (tcp_close_state(sk)) {
-+ sk->sk_shutdown |= SEND_SHUTDOWN;
-+ tcp_send_fin(sk);
-+ }
-+
-+ return;
-+ }
-+
-+ /* We directly send the FIN, because it may take a long time
-+ * until the work-queue gets scheduled...
-+ *
-+ * If mptcp_sub_send_fin returns 1, it failed and thus we reset
-+ * the old state so that tcp_close will finally send the fin
-+ * in user-context.
-+ */
-+ if (!sk->sk_err && old_state != TCP_CLOSE &&
-+ tcp_close_state(sk) && mptcp_sub_send_fin(sk)) {
-+ if (old_state == TCP_ESTABLISHED)
-+ TCP_INC_STATS(sock_net(sk), TCP_MIB_CURRESTAB);
-+ sk->sk_state = old_state;
-+ }
-+ }
-+
-+ sock_hold(sk);
-+ queue_delayed_work(mptcp_wq, work, delay);
-+}
-+
-+void mptcp_sub_force_close(struct sock *sk)
-+{
-+ /* The below tcp_done may have freed the socket, if it is already dead.
-+ * Thus, we are not allowed to access it afterwards. That's why
-+ * we have to store the dead-state in this local variable.
-+ */
-+ int sock_is_dead = sock_flag(sk, SOCK_DEAD);
-+
-+ tcp_sk(sk)->mp_killed = 1;
-+
-+ if (sk->sk_state != TCP_CLOSE)
-+ tcp_done(sk);
-+
-+ if (!sock_is_dead)
-+ mptcp_sub_close(sk, 0);
-+}
-+EXPORT_SYMBOL(mptcp_sub_force_close);
-+
-+/* Update the mpcb send window, based on the contributions
-+ * of each subflow
-+ */
-+void mptcp_update_sndbuf(const struct tcp_sock *tp)
-+{
-+ struct sock *meta_sk = tp->meta_sk, *sk;
-+ int new_sndbuf = 0, old_sndbuf = meta_sk->sk_sndbuf;
-+
-+ mptcp_for_each_sk(tp->mpcb, sk) {
-+ if (!mptcp_sk_can_send(sk))
-+ continue;
-+
-+ new_sndbuf += sk->sk_sndbuf;
-+
-+ if (new_sndbuf > sysctl_tcp_wmem[2] || new_sndbuf < 0) {
-+ new_sndbuf = sysctl_tcp_wmem[2];
-+ break;
-+ }
-+ }
-+ meta_sk->sk_sndbuf = max(min(new_sndbuf, sysctl_tcp_wmem[2]), meta_sk->sk_sndbuf);
-+
-+ /* The subflow's call to sk_write_space in tcp_new_space ends up in
-+ * mptcp_write_space.
-+ * It has nothing to do with waking up the application.
-+ * So, we do it here.
-+ */
-+ if (old_sndbuf != meta_sk->sk_sndbuf)
-+ meta_sk->sk_write_space(meta_sk);
-+}
-+
-+void mptcp_close(struct sock *meta_sk, long timeout)
-+{
-+ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
-+ struct sock *sk_it, *tmpsk;
-+ struct mptcp_cb *mpcb = meta_tp->mpcb;
-+ struct sk_buff *skb;
-+ int data_was_unread = 0;
-+ int state;
-+
-+ mptcp_debug("%s: Close of meta_sk with tok %#x\n",
-+ __func__, mpcb->mptcp_loc_token);
-+
-+ mutex_lock(&mpcb->mpcb_mutex);
-+ lock_sock(meta_sk);
-+
-+ if (meta_tp->inside_tk_table) {
-+ /* Detach the mpcb from the token hashtable */
-+ mptcp_hash_remove_bh(meta_tp);
-+ reqsk_queue_destroy(&inet_csk(meta_sk)->icsk_accept_queue);
-+ }
-+
-+ meta_sk->sk_shutdown = SHUTDOWN_MASK;
-+ /* We need to flush the recv. buffs. We do this only on the
-+ * descriptor close, not protocol-sourced closes, because the
-+ * reader process may not have drained the data yet!
-+ */
-+ while ((skb = __skb_dequeue(&meta_sk->sk_receive_queue)) != NULL) {
-+ u32 len = TCP_SKB_CB(skb)->end_seq - TCP_SKB_CB(skb)->seq -
-+ tcp_hdr(skb)->fin;
-+ data_was_unread += len;
-+ __kfree_skb(skb);
-+ }
-+
-+ sk_mem_reclaim(meta_sk);
-+
-+ /* If socket has been already reset (e.g. in tcp_reset()) - kill it. */
-+ if (meta_sk->sk_state == TCP_CLOSE) {
-+ mptcp_for_each_sk_safe(mpcb, sk_it, tmpsk) {
-+ if (tcp_sk(sk_it)->send_mp_fclose)
-+ continue;
-+ mptcp_sub_close(sk_it, 0);
-+ }
-+ goto adjudge_to_death;
-+ }
-+
-+ if (data_was_unread) {
-+ /* Unread data was tossed, zap the connection. */
-+ NET_INC_STATS_USER(sock_net(meta_sk), LINUX_MIB_TCPABORTONCLOSE);
-+ tcp_set_state(meta_sk, TCP_CLOSE);
-+ tcp_sk(meta_sk)->ops->send_active_reset(meta_sk,
-+ meta_sk->sk_allocation);
-+ } else if (sock_flag(meta_sk, SOCK_LINGER) && !meta_sk->sk_lingertime) {
-+ /* Check zero linger _after_ checking for unread data. */
-+ meta_sk->sk_prot->disconnect(meta_sk, 0);
-+ NET_INC_STATS_USER(sock_net(meta_sk), LINUX_MIB_TCPABORTONDATA);
-+ } else if (tcp_close_state(meta_sk)) {
-+ mptcp_send_fin(meta_sk);
-+ } else if (meta_tp->snd_una == meta_tp->write_seq) {
-+ /* The DATA_FIN has been sent and acknowledged
-+ * (e.g., by sk_shutdown). Close all the other subflows
-+ */
-+ mptcp_for_each_sk_safe(mpcb, sk_it, tmpsk) {
-+ unsigned long delay = 0;
-+ /* If we are the passive closer, don't trigger
-+ * the subflow-FIN until the peer has sent its FIN
-+ * on the subflow - thus we add a delay.
-+ */
-+ if (mpcb->passive_close &&
-+ sk_it->sk_state == TCP_ESTABLISHED)
-+ delay = inet_csk(sk_it)->icsk_rto << 3;
-+
-+ mptcp_sub_close(sk_it, delay);
-+ }
-+ }
-+
-+ sk_stream_wait_close(meta_sk, timeout);
-+
-+adjudge_to_death:
-+ state = meta_sk->sk_state;
-+ sock_hold(meta_sk);
-+ sock_orphan(meta_sk);
-+
-+ /* socket will be freed after mptcp_close - we have to prevent
-+ * access from the subflows.
-+ */
-+ mptcp_for_each_sk(mpcb, sk_it) {
-+ /* Similar to sock_orphan, but we don't set it DEAD, because
-+ * the callbacks are still set and must be called.
-+ */
-+ write_lock_bh(&sk_it->sk_callback_lock);
-+ sk_set_socket(sk_it, NULL);
-+ sk_it->sk_wq = NULL;
-+ write_unlock_bh(&sk_it->sk_callback_lock);
-+ }
-+
-+ /* It is the last release_sock in its life. It will remove backlog. */
-+ release_sock(meta_sk);
-+
-+ /* Now socket is owned by kernel and we acquire BH lock
-+ * to finish close. No need to check for user refs.
-+ */
-+ local_bh_disable();
-+ bh_lock_sock(meta_sk);
-+ WARN_ON(sock_owned_by_user(meta_sk));
-+
-+ percpu_counter_inc(meta_sk->sk_prot->orphan_count);
-+
-+ /* Have we already been destroyed by a softirq or backlog? */
-+ if (state != TCP_CLOSE && meta_sk->sk_state == TCP_CLOSE)
-+ goto out;
-+
-+ /* This is a (useful) BSD violating of the RFC. There is a
-+ * problem with TCP as specified in that the other end could
-+ * keep a socket open forever with no application left this end.
-+ * We use a 3 minute timeout (about the same as BSD) then kill
-+ * our end. If they send after that then tough - BUT: long enough
-+ * that we won't make the old 4*rto = almost no time - whoops
-+ * reset mistake.
-+ *
-+ * Nope, it was not mistake. It is really desired behaviour
-+ * f.e. on http servers, when such sockets are useless, but
-+ * consume significant resources. Let's do it with special
-+ * linger2 option. --ANK
-+ */
-+
-+ if (meta_sk->sk_state == TCP_FIN_WAIT2) {
-+ if (meta_tp->linger2 < 0) {
-+ tcp_set_state(meta_sk, TCP_CLOSE);
-+ meta_tp->ops->send_active_reset(meta_sk, GFP_ATOMIC);
-+ NET_INC_STATS_BH(sock_net(meta_sk),
-+ LINUX_MIB_TCPABORTONLINGER);
-+ } else {
-+ const int tmo = tcp_fin_time(meta_sk);
-+
-+ if (tmo > TCP_TIMEWAIT_LEN) {
-+ inet_csk_reset_keepalive_timer(meta_sk,
-+ tmo - TCP_TIMEWAIT_LEN);
-+ } else {
-+ meta_tp->ops->time_wait(meta_sk, TCP_FIN_WAIT2,
-+ tmo);
-+ goto out;
-+ }
-+ }
-+ }
-+ if (meta_sk->sk_state != TCP_CLOSE) {
-+ sk_mem_reclaim(meta_sk);
-+ if (tcp_too_many_orphans(meta_sk, 0)) {
-+ if (net_ratelimit())
-+ pr_info("MPTCP: too many orphaned sockets\n");
-+ tcp_set_state(meta_sk, TCP_CLOSE);
-+ meta_tp->ops->send_active_reset(meta_sk, GFP_ATOMIC);
-+ NET_INC_STATS_BH(sock_net(meta_sk),
-+ LINUX_MIB_TCPABORTONMEMORY);
-+ }
-+ }
-+
-+
-+ if (meta_sk->sk_state == TCP_CLOSE)
-+ inet_csk_destroy_sock(meta_sk);
-+ /* Otherwise, socket is reprieved until protocol close. */
-+
-+out:
-+ bh_unlock_sock(meta_sk);
-+ local_bh_enable();
-+ mutex_unlock(&mpcb->mpcb_mutex);
-+ sock_put(meta_sk); /* Taken by sock_hold */
-+}
-+
-+void mptcp_disconnect(struct sock *sk)
-+{
-+ struct sock *subsk, *tmpsk;
-+ struct tcp_sock *tp = tcp_sk(sk);
-+
-+ mptcp_delete_synack_timer(sk);
-+
-+ __skb_queue_purge(&tp->mpcb->reinject_queue);
-+
-+ if (tp->inside_tk_table) {
-+ mptcp_hash_remove_bh(tp);
-+ reqsk_queue_destroy(&inet_csk(tp->meta_sk)->icsk_accept_queue);
-+ }
-+
-+ local_bh_disable();
-+ mptcp_for_each_sk_safe(tp->mpcb, subsk, tmpsk) {
-+ /* The socket will get removed from the subsocket-list
-+ * and made non-mptcp by setting mpc to 0.
-+ *
-+ * This is necessary, because tcp_disconnect assumes
-+ * that the connection is completely dead afterwards.
-+ * Thus we need to do a mptcp_del_sock. Due to this call
-+ * we have to make it non-mptcp.
-+ *
-+ * We have to lock the socket, because we set mpc to 0.
-+ * An incoming packet would take the subsocket's lock
-+ * and go on into the receive-path.
-+ * This would be a race.
-+ */
-+
-+ bh_lock_sock(subsk);
-+ mptcp_del_sock(subsk);
-+ tcp_sk(subsk)->mpc = 0;
-+ tcp_sk(subsk)->ops = &tcp_specific;
-+ mptcp_sub_force_close(subsk);
-+ bh_unlock_sock(subsk);
-+ }
-+ local_bh_enable();
-+
-+ tp->was_meta_sk = 1;
-+ tp->mpc = 0;
-+ tp->ops = &tcp_specific;
-+}
-+
-+
-+/* Returns 1 if we should enable MPTCP for that socket. */
-+int mptcp_doit(struct sock *sk)
-+{
-+ /* Do not allow MPTCP enabling if the MPTCP initialization failed */
-+ if (mptcp_init_failed)
-+ return 0;
-+
-+ if (sysctl_mptcp_enabled == MPTCP_APP && !tcp_sk(sk)->mptcp_enabled)
-+ return 0;
-+
-+ /* Socket may already be established (e.g., called from tcp_recvmsg) */
-+ if (mptcp(tcp_sk(sk)) || tcp_sk(sk)->request_mptcp)
-+ return 1;
-+
-+ /* Don't do mptcp over loopback */
-+ if (sk->sk_family == AF_INET &&
-+ (ipv4_is_loopback(inet_sk(sk)->inet_daddr) ||
-+ ipv4_is_loopback(inet_sk(sk)->inet_saddr)))
-+ return 0;
-+#if IS_ENABLED(CONFIG_IPV6)
-+ if (sk->sk_family == AF_INET6 &&
-+ (ipv6_addr_loopback(&sk->sk_v6_daddr) ||
-+ ipv6_addr_loopback(&inet6_sk(sk)->saddr)))
-+ return 0;
-+#endif
-+ if (mptcp_v6_is_v4_mapped(sk) &&
-+ ipv4_is_loopback(inet_sk(sk)->inet_saddr))
-+ return 0;
-+
-+#ifdef CONFIG_TCP_MD5SIG
-+ /* If TCP_MD5SIG is enabled, do not do MPTCP - there is no Option-Space */
-+ if (tcp_sk(sk)->af_specific->md5_lookup(sk, sk))
-+ return 0;
-+#endif
-+
-+ return 1;
-+}
-+
-+int mptcp_create_master_sk(struct sock *meta_sk, __u64 remote_key, u32 window)
-+{
-+ struct tcp_sock *master_tp;
-+ struct sock *master_sk;
-+
-+ if (mptcp_alloc_mpcb(meta_sk, remote_key, window))
-+ goto err_alloc_mpcb;
-+
-+ master_sk = tcp_sk(meta_sk)->mpcb->master_sk;
-+ master_tp = tcp_sk(master_sk);
-+
-+ if (mptcp_add_sock(meta_sk, master_sk, 0, 0, GFP_ATOMIC))
-+ goto err_add_sock;
-+
-+ if (__inet_inherit_port(meta_sk, master_sk) < 0)
-+ goto err_add_sock;
-+
-+ meta_sk->sk_prot->unhash(meta_sk);
-+
-+ if (master_sk->sk_family == AF_INET || mptcp_v6_is_v4_mapped(master_sk))
-+ __inet_hash_nolisten(master_sk, NULL);
-+#if IS_ENABLED(CONFIG_IPV6)
-+ else
-+ __inet6_hash(master_sk, NULL);
-+#endif
-+
-+ master_tp->mptcp->init_rcv_wnd = master_tp->rcv_wnd;
-+
-+ return 0;
-+
-+err_add_sock:
-+ mptcp_fallback_meta_sk(meta_sk);
-+
-+ inet_csk_prepare_forced_close(master_sk);
-+ tcp_done(master_sk);
-+ inet_csk_prepare_forced_close(meta_sk);
-+ tcp_done(meta_sk);
-+
-+err_alloc_mpcb:
-+ return -ENOBUFS;
-+}
-+
-+static int __mptcp_check_req_master(struct sock *child,
-+ struct request_sock *req)
-+{
-+ struct tcp_sock *child_tp = tcp_sk(child);
-+ struct sock *meta_sk = child;
-+ struct mptcp_cb *mpcb;
-+ struct mptcp_request_sock *mtreq;
-+
-+ /* Never contained an MP_CAPABLE */
-+ if (!inet_rsk(req)->mptcp_rqsk)
-+ return 1;
-+
-+ if (!inet_rsk(req)->saw_mpc) {
-+ /* Fallback to regular TCP, because we saw one SYN without
-+ * MP_CAPABLE. In tcp_check_req we continue the regular path.
-+ * But, the socket has been added to the reqsk_tk_htb, so we
-+ * must still remove it.
-+ */
-+ mptcp_reqsk_remove_tk(req);
-+ return 1;
-+ }
-+
-+ /* Just set these values to pass them to mptcp_alloc_mpcb */
-+ mtreq = mptcp_rsk(req);
-+ child_tp->mptcp_loc_key = mtreq->mptcp_loc_key;
-+ child_tp->mptcp_loc_token = mtreq->mptcp_loc_token;
-+
-+ if (mptcp_create_master_sk(meta_sk, mtreq->mptcp_rem_key,
-+ child_tp->snd_wnd))
-+ return -ENOBUFS;
-+
-+ child = tcp_sk(child)->mpcb->master_sk;
-+ child_tp = tcp_sk(child);
-+ mpcb = child_tp->mpcb;
-+
-+ child_tp->mptcp->snt_isn = tcp_rsk(req)->snt_isn;
-+ child_tp->mptcp->rcv_isn = tcp_rsk(req)->rcv_isn;
-+
-+ mpcb->dss_csum = mtreq->dss_csum;
-+ mpcb->server_side = 1;
-+
-+ /* Will be moved to ESTABLISHED by tcp_rcv_state_process() */
-+ mptcp_update_metasocket(child, meta_sk);
-+
-+ /* Needs to be done here additionally, because when accepting a
-+ * new connection we pass by __reqsk_free and not reqsk_free.
-+ */
-+ mptcp_reqsk_remove_tk(req);
-+
-+ /* Hold when creating the meta-sk in tcp_vX_syn_recv_sock. */
-+ sock_put(meta_sk);
-+
-+ return 0;
-+}
-+
-+int mptcp_check_req_fastopen(struct sock *child, struct request_sock *req)
-+{
-+ struct sock *meta_sk = child, *master_sk;
-+ struct sk_buff *skb;
-+ u32 new_mapping;
-+ int ret;
-+
-+ ret = __mptcp_check_req_master(child, req);
-+ if (ret)
-+ return ret;
-+
-+ master_sk = tcp_sk(meta_sk)->mpcb->master_sk;
-+
-+ /* We need to rewind copied_seq as it is set to IDSN + 1 and as we have
-+ * pre-MPTCP data in the receive queue.
-+ */
-+ tcp_sk(meta_sk)->copied_seq -= tcp_sk(master_sk)->rcv_nxt -
-+ tcp_rsk(req)->rcv_isn - 1;
-+
-+ /* Map subflow sequence number to data sequence numbers. We need to map
-+ * these data to [IDSN - len - 1, IDSN[.
-+ */
-+ new_mapping = tcp_sk(meta_sk)->copied_seq - tcp_rsk(req)->rcv_isn - 1;
-+
-+ /* There should be only one skb: the SYN + data. */
-+ skb_queue_walk(&meta_sk->sk_receive_queue, skb) {
-+ TCP_SKB_CB(skb)->seq += new_mapping;
-+ TCP_SKB_CB(skb)->end_seq += new_mapping;
-+ }
-+
-+ /* With fastopen we change the semantics of the relative subflow
-+ * sequence numbers to deal with middleboxes that could add/remove
-+ * multiple bytes in the SYN. We chose to start counting at rcv_nxt - 1
-+ * instead of the regular TCP ISN.
-+ */
-+ tcp_sk(master_sk)->mptcp->rcv_isn = tcp_sk(master_sk)->rcv_nxt - 1;
-+
-+ /* We need to update copied_seq of the master_sk to account for the
-+ * already moved data to the meta receive queue.
-+ */
-+ tcp_sk(master_sk)->copied_seq = tcp_sk(master_sk)->rcv_nxt;
-+
-+ /* Handled by the master_sk */
-+ tcp_sk(meta_sk)->fastopen_rsk = NULL;
-+
-+ return 0;
-+}
-+
-+int mptcp_check_req_master(struct sock *sk, struct sock *child,
-+ struct request_sock *req,
-+ struct request_sock **prev)
-+{
-+ struct sock *meta_sk = child;
-+ int ret;
-+
-+ ret = __mptcp_check_req_master(child, req);
-+ if (ret)
-+ return ret;
-+
-+ inet_csk_reqsk_queue_unlink(sk, req, prev);
-+ inet_csk_reqsk_queue_removed(sk, req);
-+ inet_csk_reqsk_queue_add(sk, req, meta_sk);
-+
-+ return 0;
-+}
-+
-+struct sock *mptcp_check_req_child(struct sock *meta_sk, struct sock *child,
-+ struct request_sock *req,
-+ struct request_sock **prev,
-+ const struct mptcp_options_received *mopt)
-+{
-+ struct tcp_sock *child_tp = tcp_sk(child);
-+ struct mptcp_request_sock *mtreq = mptcp_rsk(req);
-+ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
-+ u8 hash_mac_check[20];
-+
-+ child_tp->inside_tk_table = 0;
-+
-+ if (!mopt->join_ack)
-+ goto teardown;
-+
-+ mptcp_hmac_sha1((u8 *)&mpcb->mptcp_rem_key,
-+ (u8 *)&mpcb->mptcp_loc_key,
-+ (u8 *)&mtreq->mptcp_rem_nonce,
-+ (u8 *)&mtreq->mptcp_loc_nonce,
-+ (u32 *)hash_mac_check);
-+
-+ if (memcmp(hash_mac_check, (char *)&mopt->mptcp_recv_mac, 20))
-+ goto teardown;
-+
-+ /* Point it to the same struct socket and wq as the meta_sk */
-+ sk_set_socket(child, meta_sk->sk_socket);
-+ child->sk_wq = meta_sk->sk_wq;
-+
-+ if (mptcp_add_sock(meta_sk, child, mtreq->loc_id, mtreq->rem_id, GFP_ATOMIC)) {
-+ /* Has been inherited, but now child_tp->mptcp is NULL */
-+ child_tp->mpc = 0;
-+ child_tp->ops = &tcp_specific;
-+
-+ /* TODO when we support acking the third ack for new subflows,
-+ * we should silently discard this third ack, by returning NULL.
-+ *
-+ * Maybe, at the retransmission we will have enough memory to
-+ * fully add the socket to the meta-sk.
-+ */
-+ goto teardown;
-+ }
-+
-+ /* The child is a clone of the meta socket, we must now reset
-+ * some of the fields
-+ */
-+ child_tp->mptcp->rcv_low_prio = mtreq->rcv_low_prio;
-+
-+ /* We should allow proper increase of the snd/rcv-buffers. Thus, we
-+ * use the original values instead of the bloated up ones from the
-+ * clone.
-+ */
-+ child->sk_sndbuf = mpcb->orig_sk_sndbuf;
-+ child->sk_rcvbuf = mpcb->orig_sk_rcvbuf;
-+
-+ child_tp->mptcp->slave_sk = 1;
-+ child_tp->mptcp->snt_isn = tcp_rsk(req)->snt_isn;
-+ child_tp->mptcp->rcv_isn = tcp_rsk(req)->rcv_isn;
-+ child_tp->mptcp->init_rcv_wnd = req->rcv_wnd;
-+
-+ child_tp->tsq_flags = 0;
-+
-+ /* Subflows do not use the accept queue, as they
-+ * are attached immediately to the mpcb.
-+ */
-+ inet_csk_reqsk_queue_unlink(meta_sk, req, prev);
-+ reqsk_queue_removed(&inet_csk(meta_sk)->icsk_accept_queue, req);
-+ reqsk_free(req);
-+ return child;
-+
-+teardown:
-+ /* Drop this request - sock creation failed. */
-+ inet_csk_reqsk_queue_unlink(meta_sk, req, prev);
-+ reqsk_queue_removed(&inet_csk(meta_sk)->icsk_accept_queue, req);
-+ reqsk_free(req);
-+ inet_csk_prepare_forced_close(child);
-+ tcp_done(child);
-+ return meta_sk;
-+}
-+
-+int mptcp_init_tw_sock(struct sock *sk, struct tcp_timewait_sock *tw)
-+{
-+ struct mptcp_tw *mptw;
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ struct mptcp_cb *mpcb = tp->mpcb;
-+
-+ /* A subsocket in tw can only receive data. So, if we are in
-+ * infinite-receive, then we should not reply with a data-ack or act
-+ * upon general MPTCP-signaling. We prevent this by simply not creating
-+ * the mptcp_tw_sock.
-+ */
-+ if (mpcb->infinite_mapping_rcv) {
-+ tw->mptcp_tw = NULL;
-+ return 0;
-+ }
-+
-+ /* Alloc MPTCP-tw-sock */
-+ mptw = kmem_cache_alloc(mptcp_tw_cache, GFP_ATOMIC);
-+ if (!mptw)
-+ return -ENOBUFS;
-+
-+ atomic_inc(&mpcb->mpcb_refcnt);
-+
-+ tw->mptcp_tw = mptw;
-+ mptw->loc_key = mpcb->mptcp_loc_key;
-+ mptw->meta_tw = mpcb->in_time_wait;
-+ if (mptw->meta_tw) {
-+ mptw->rcv_nxt = mptcp_get_rcv_nxt_64(mptcp_meta_tp(tp));
-+ if (mpcb->mptw_state != TCP_TIME_WAIT)
-+ mptw->rcv_nxt++;
-+ }
-+ rcu_assign_pointer(mptw->mpcb, mpcb);
-+
-+ spin_lock(&mpcb->tw_lock);
-+ list_add_rcu(&mptw->list, &tp->mpcb->tw_list);
-+ mptw->in_list = 1;
-+ spin_unlock(&mpcb->tw_lock);
-+
-+ return 0;
-+}
-+
-+void mptcp_twsk_destructor(struct tcp_timewait_sock *tw)
-+{
-+ struct mptcp_cb *mpcb;
-+
-+ rcu_read_lock();
-+ mpcb = rcu_dereference(tw->mptcp_tw->mpcb);
-+
-+ /* If we are still holding a ref to the mpcb, we have to remove ourself
-+ * from the list and drop the ref properly.
-+ */
-+ if (mpcb && atomic_inc_not_zero(&mpcb->mpcb_refcnt)) {
-+ spin_lock(&mpcb->tw_lock);
-+ if (tw->mptcp_tw->in_list) {
-+ list_del_rcu(&tw->mptcp_tw->list);
-+ tw->mptcp_tw->in_list = 0;
-+ }
-+ spin_unlock(&mpcb->tw_lock);
-+
-+ /* Twice, because we increased it above */
-+ mptcp_mpcb_put(mpcb);
-+ mptcp_mpcb_put(mpcb);
-+ }
-+
-+ rcu_read_unlock();
-+
-+ kmem_cache_free(mptcp_tw_cache, tw->mptcp_tw);
-+}
-+
-+/* Updates the rcv_nxt of the time-wait-socks and allows them to ack a
-+ * data-fin.
-+ */
-+void mptcp_time_wait(struct sock *sk, int state, int timeo)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ struct mptcp_tw *mptw;
-+
-+ /* Used for sockets that go into tw after the meta
-+ * (see mptcp_init_tw_sock())
-+ */
-+ tp->mpcb->in_time_wait = 1;
-+ tp->mpcb->mptw_state = state;
-+
-+ /* Update the time-wait-sock's information */
-+ rcu_read_lock_bh();
-+ list_for_each_entry_rcu(mptw, &tp->mpcb->tw_list, list) {
-+ mptw->meta_tw = 1;
-+ mptw->rcv_nxt = mptcp_get_rcv_nxt_64(tp);
-+
-+ /* We want to ack a DATA_FIN, but are still in FIN_WAIT_2 -
-+ * pretend as if the DATA_FIN has already reached us, so that
-+ * the checks in tcp_timewait_state_process will succeed when
-+ * the DATA_FIN comes in.
-+ */
-+ if (state != TCP_TIME_WAIT)
-+ mptw->rcv_nxt++;
-+ }
-+ rcu_read_unlock_bh();
-+
-+ tcp_done(sk);
-+}
-+
-+void mptcp_tsq_flags(struct sock *sk)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ struct sock *meta_sk = mptcp_meta_sk(sk);
-+
-+ /* It will be handled as a regular deferred-call */
-+ if (is_meta_sk(sk))
-+ return;
-+
-+ if (hlist_unhashed(&tp->mptcp->cb_list)) {
-+ hlist_add_head(&tp->mptcp->cb_list, &tp->mpcb->callback_list);
-+ /* We need to hold it here, as the sock_hold is not assured
-+ * by the release_sock as it is done in regular TCP.
-+ *
-+ * The subsocket may get inet_csk_destroy'd while it is inside
-+ * the callback_list.
-+ */
-+ sock_hold(sk);
-+ }
-+
-+ if (!test_and_set_bit(MPTCP_SUB_DEFERRED, &tcp_sk(meta_sk)->tsq_flags))
-+ sock_hold(meta_sk);
-+}
-+
-+void mptcp_tsq_sub_deferred(struct sock *meta_sk)
-+{
-+ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
-+ struct mptcp_tcp_sock *mptcp;
-+ struct hlist_node *tmp;
-+
-+ BUG_ON(!is_meta_sk(meta_sk) && !meta_tp->was_meta_sk);
-+
-+ __sock_put(meta_sk);
-+ hlist_for_each_entry_safe(mptcp, tmp, &meta_tp->mpcb->callback_list, cb_list) {
-+ struct tcp_sock *tp = mptcp->tp;
-+ struct sock *sk = (struct sock *)tp;
-+
-+ hlist_del_init(&mptcp->cb_list);
-+ sk->sk_prot->release_cb(sk);
-+ /* Final sock_put (cfr. mptcp_tsq_flags) */
-+ sock_put(sk);
-+ }
-+}
-+
-+void mptcp_join_reqsk_init(struct mptcp_cb *mpcb, const struct request_sock *req,
-+ struct sk_buff *skb)
-+{
-+ struct mptcp_request_sock *mtreq = mptcp_rsk(req);
-+ struct mptcp_options_received mopt;
-+ u8 mptcp_hash_mac[20];
-+
-+ mptcp_init_mp_opt(&mopt);
-+ tcp_parse_mptcp_options(skb, &mopt);
-+
-+ mtreq = mptcp_rsk(req);
-+ mtreq->mptcp_mpcb = mpcb;
-+ mtreq->is_sub = 1;
-+ inet_rsk(req)->mptcp_rqsk = 1;
-+
-+ mtreq->mptcp_rem_nonce = mopt.mptcp_recv_nonce;
-+
-+ mptcp_hmac_sha1((u8 *)&mpcb->mptcp_loc_key,
-+ (u8 *)&mpcb->mptcp_rem_key,
-+ (u8 *)&mtreq->mptcp_loc_nonce,
-+ (u8 *)&mtreq->mptcp_rem_nonce, (u32 *)mptcp_hash_mac);
-+ mtreq->mptcp_hash_tmac = *(u64 *)mptcp_hash_mac;
-+
-+ mtreq->rem_id = mopt.rem_id;
-+ mtreq->rcv_low_prio = mopt.low_prio;
-+ inet_rsk(req)->saw_mpc = 1;
-+}
-+
-+void mptcp_reqsk_init(struct request_sock *req, const struct sk_buff *skb)
-+{
-+ struct mptcp_options_received mopt;
-+ struct mptcp_request_sock *mreq = mptcp_rsk(req);
-+
-+ mptcp_init_mp_opt(&mopt);
-+ tcp_parse_mptcp_options(skb, &mopt);
-+
-+ mreq->is_sub = 0;
-+ inet_rsk(req)->mptcp_rqsk = 1;
-+ mreq->dss_csum = mopt.dss_csum;
-+ mreq->hash_entry.pprev = NULL;
-+
-+ mptcp_reqsk_new_mptcp(req, &mopt, skb);
-+}
-+
-+int mptcp_conn_request(struct sock *sk, struct sk_buff *skb)
-+{
-+ struct mptcp_options_received mopt;
-+ const struct tcp_sock *tp = tcp_sk(sk);
-+ __u32 isn = TCP_SKB_CB(skb)->when;
-+ bool want_cookie = false;
-+
-+ if ((sysctl_tcp_syncookies == 2 ||
-+ inet_csk_reqsk_queue_is_full(sk)) && !isn) {
-+ want_cookie = tcp_syn_flood_action(sk, skb,
-+ mptcp_request_sock_ops.slab_name);
-+ if (!want_cookie)
-+ goto drop;
-+ }
-+
-+ mptcp_init_mp_opt(&mopt);
-+ tcp_parse_mptcp_options(skb, &mopt);
-+
-+ if (mopt.is_mp_join)
-+ return mptcp_do_join_short(skb, &mopt, sock_net(sk));
-+ if (mopt.drop_me)
-+ goto drop;
-+
-+ if (sysctl_mptcp_enabled == MPTCP_APP && !tp->mptcp_enabled)
-+ mopt.saw_mpc = 0;
-+
-+ if (skb->protocol == htons(ETH_P_IP)) {
-+ if (mopt.saw_mpc && !want_cookie) {
-+ if (skb_rtable(skb)->rt_flags &
-+ (RTCF_BROADCAST | RTCF_MULTICAST))
-+ goto drop;
-+
-+ return tcp_conn_request(&mptcp_request_sock_ops,
-+ &mptcp_request_sock_ipv4_ops,
-+ sk, skb);
-+ }
-+
-+ return tcp_v4_conn_request(sk, skb);
-+#if IS_ENABLED(CONFIG_IPV6)
-+ } else {
-+ if (mopt.saw_mpc && !want_cookie) {
-+ if (!ipv6_unicast_destination(skb))
-+ goto drop;
-+
-+ return tcp_conn_request(&mptcp6_request_sock_ops,
-+ &mptcp_request_sock_ipv6_ops,
-+ sk, skb);
-+ }
-+
-+ return tcp_v6_conn_request(sk, skb);
-+#endif
-+ }
-+drop:
-+ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENDROPS);
-+ return 0;
-+}
-+
-+struct workqueue_struct *mptcp_wq;
-+EXPORT_SYMBOL(mptcp_wq);
-+
-+/* Output /proc/net/mptcp */
-+static int mptcp_pm_seq_show(struct seq_file *seq, void *v)
-+{
-+ struct tcp_sock *meta_tp;
-+ const struct net *net = seq->private;
-+ int i, n = 0;
-+
-+ seq_printf(seq, " sl loc_tok rem_tok v6 local_address remote_address st ns tx_queue rx_queue inode");
-+ seq_putc(seq, '\n');
-+
-+ for (i = 0; i < MPTCP_HASH_SIZE; i++) {
-+ struct hlist_nulls_node *node;
-+ rcu_read_lock_bh();
-+ hlist_nulls_for_each_entry_rcu(meta_tp, node,
-+ &tk_hashtable[i], tk_table) {
-+ struct mptcp_cb *mpcb = meta_tp->mpcb;
-+ struct sock *meta_sk = (struct sock *)meta_tp;
-+ struct inet_sock *isk = inet_sk(meta_sk);
-+
-+ if (!mptcp(meta_tp) || !net_eq(net, sock_net(meta_sk)))
-+ continue;
-+
-+ if (capable(CAP_NET_ADMIN)) {
-+ seq_printf(seq, "%4d: %04X %04X ", n++,
-+ mpcb->mptcp_loc_token,
-+ mpcb->mptcp_rem_token);
-+ } else {
-+ seq_printf(seq, "%4d: %04X %04X ", n++, -1, -1);
-+ }
-+ if (meta_sk->sk_family == AF_INET ||
-+ mptcp_v6_is_v4_mapped(meta_sk)) {
-+ seq_printf(seq, " 0 %08X:%04X %08X:%04X ",
-+ isk->inet_rcv_saddr,
-+ ntohs(isk->inet_sport),
-+ isk->inet_daddr,
-+ ntohs(isk->inet_dport));
-+#if IS_ENABLED(CONFIG_IPV6)
-+ } else if (meta_sk->sk_family == AF_INET6) {
-+ struct in6_addr *src = &meta_sk->sk_v6_rcv_saddr;
-+ struct in6_addr *dst = &meta_sk->sk_v6_daddr;
-+ seq_printf(seq, " 1 %08X%08X%08X%08X:%04X %08X%08X%08X%08X:%04X",
-+ src->s6_addr32[0], src->s6_addr32[1],
-+ src->s6_addr32[2], src->s6_addr32[3],
-+ ntohs(isk->inet_sport),
-+ dst->s6_addr32[0], dst->s6_addr32[1],
-+ dst->s6_addr32[2], dst->s6_addr32[3],
-+ ntohs(isk->inet_dport));
-+#endif
-+ }
-+ seq_printf(seq, " %02X %02X %08X:%08X %lu",
-+ meta_sk->sk_state, mpcb->cnt_subflows,
-+ meta_tp->write_seq - meta_tp->snd_una,
-+ max_t(int, meta_tp->rcv_nxt -
-+ meta_tp->copied_seq, 0),
-+ sock_i_ino(meta_sk));
-+ seq_putc(seq, '\n');
-+ }
-+
-+ rcu_read_unlock_bh();
-+ }
-+
-+ return 0;
-+}
-+
-+static int mptcp_pm_seq_open(struct inode *inode, struct file *file)
-+{
-+ return single_open_net(inode, file, mptcp_pm_seq_show);
-+}
-+
-+static const struct file_operations mptcp_pm_seq_fops = {
-+ .owner = THIS_MODULE,
-+ .open = mptcp_pm_seq_open,
-+ .read = seq_read,
-+ .llseek = seq_lseek,
-+ .release = single_release_net,
-+};
-+
-+static int mptcp_pm_init_net(struct net *net)
-+{
-+ if (!proc_create("mptcp", S_IRUGO, net->proc_net, &mptcp_pm_seq_fops))
-+ return -ENOMEM;
-+
-+ return 0;
-+}
-+
-+static void mptcp_pm_exit_net(struct net *net)
-+{
-+ remove_proc_entry("mptcp", net->proc_net);
-+}
-+
-+static struct pernet_operations mptcp_pm_proc_ops = {
-+ .init = mptcp_pm_init_net,
-+ .exit = mptcp_pm_exit_net,
-+};
-+
-+/* General initialization of mptcp */
-+void __init mptcp_init(void)
-+{
-+ int i;
-+ struct ctl_table_header *mptcp_sysctl;
-+
-+ mptcp_sock_cache = kmem_cache_create("mptcp_sock",
-+ sizeof(struct mptcp_tcp_sock),
-+ 0, SLAB_HWCACHE_ALIGN,
-+ NULL);
-+ if (!mptcp_sock_cache)
-+ goto mptcp_sock_cache_failed;
-+
-+ mptcp_cb_cache = kmem_cache_create("mptcp_cb", sizeof(struct mptcp_cb),
-+ 0, SLAB_DESTROY_BY_RCU|SLAB_HWCACHE_ALIGN,
-+ NULL);
-+ if (!mptcp_cb_cache)
-+ goto mptcp_cb_cache_failed;
-+
-+ mptcp_tw_cache = kmem_cache_create("mptcp_tw", sizeof(struct mptcp_tw),
-+ 0, SLAB_DESTROY_BY_RCU|SLAB_HWCACHE_ALIGN,
-+ NULL);
-+ if (!mptcp_tw_cache)
-+ goto mptcp_tw_cache_failed;
-+
-+ get_random_bytes(mptcp_secret, sizeof(mptcp_secret));
-+
-+ mptcp_wq = alloc_workqueue("mptcp_wq", WQ_UNBOUND | WQ_MEM_RECLAIM, 8);
-+ if (!mptcp_wq)
-+ goto alloc_workqueue_failed;
-+
-+ for (i = 0; i < MPTCP_HASH_SIZE; i++) {
-+ INIT_HLIST_NULLS_HEAD(&tk_hashtable[i], i);
-+ INIT_HLIST_NULLS_HEAD(&mptcp_reqsk_htb[i],
-+ i + MPTCP_REQSK_NULLS_BASE);
-+ INIT_HLIST_NULLS_HEAD(&mptcp_reqsk_tk_htb[i], i);
-+ }
-+
-+ spin_lock_init(&mptcp_reqsk_hlock);
-+ spin_lock_init(&mptcp_tk_hashlock);
-+
-+ if (register_pernet_subsys(&mptcp_pm_proc_ops))
-+ goto pernet_failed;
-+
-+#if IS_ENABLED(CONFIG_IPV6)
-+ if (mptcp_pm_v6_init())
-+ goto mptcp_pm_v6_failed;
-+#endif
-+ if (mptcp_pm_v4_init())
-+ goto mptcp_pm_v4_failed;
-+
-+ mptcp_sysctl = register_net_sysctl(&init_net, "net/mptcp", mptcp_table);
-+ if (!mptcp_sysctl)
-+ goto register_sysctl_failed;
-+
-+ if (mptcp_register_path_manager(&mptcp_pm_default))
-+ goto register_pm_failed;
-+
-+ if (mptcp_register_scheduler(&mptcp_sched_default))
-+ goto register_sched_failed;
-+
-+ pr_info("MPTCP: Stable release v0.89.0-rc");
-+
-+ mptcp_init_failed = false;
-+
-+ return;
-+
-+register_sched_failed:
-+ mptcp_unregister_path_manager(&mptcp_pm_default);
-+register_pm_failed:
-+ unregister_net_sysctl_table(mptcp_sysctl);
-+register_sysctl_failed:
-+ mptcp_pm_v4_undo();
-+mptcp_pm_v4_failed:
-+#if IS_ENABLED(CONFIG_IPV6)
-+ mptcp_pm_v6_undo();
-+mptcp_pm_v6_failed:
-+#endif
-+ unregister_pernet_subsys(&mptcp_pm_proc_ops);
-+pernet_failed:
-+ destroy_workqueue(mptcp_wq);
-+alloc_workqueue_failed:
-+ kmem_cache_destroy(mptcp_tw_cache);
-+mptcp_tw_cache_failed:
-+ kmem_cache_destroy(mptcp_cb_cache);
-+mptcp_cb_cache_failed:
-+ kmem_cache_destroy(mptcp_sock_cache);
-+mptcp_sock_cache_failed:
-+ mptcp_init_failed = true;
-+}
-diff --git a/net/mptcp/mptcp_fullmesh.c b/net/mptcp/mptcp_fullmesh.c
-new file mode 100644
-index 000000000000..3a54413ce25b
---- /dev/null
-+++ b/net/mptcp/mptcp_fullmesh.c
-@@ -0,0 +1,1722 @@
-+#include <linux/module.h>
-+
-+#include <net/mptcp.h>
-+#include <net/mptcp_v4.h>
-+
-+#if IS_ENABLED(CONFIG_IPV6)
-+#include <net/mptcp_v6.h>
-+#include <net/addrconf.h>
-+#endif
-+
-+enum {
-+ MPTCP_EVENT_ADD = 1,
-+ MPTCP_EVENT_DEL,
-+ MPTCP_EVENT_MOD,
-+};
-+
-+#define MPTCP_SUBFLOW_RETRY_DELAY 1000
-+
-+/* Max number of local or remote addresses we can store.
-+ * When changing, see the bitfield below in fullmesh_rem4/6.
-+ */
-+#define MPTCP_MAX_ADDR 8
-+
-+struct fullmesh_rem4 {
-+ u8 rem4_id;
-+ u8 bitfield;
-+ u8 retry_bitfield;
-+ __be16 port;
-+ struct in_addr addr;
-+};
-+
-+struct fullmesh_rem6 {
-+ u8 rem6_id;
-+ u8 bitfield;
-+ u8 retry_bitfield;
-+ __be16 port;
-+ struct in6_addr addr;
-+};
-+
-+struct mptcp_loc_addr {
-+ struct mptcp_loc4 locaddr4[MPTCP_MAX_ADDR];
-+ u8 loc4_bits;
-+ u8 next_v4_index;
-+
-+ struct mptcp_loc6 locaddr6[MPTCP_MAX_ADDR];
-+ u8 loc6_bits;
-+ u8 next_v6_index;
-+};
-+
-+struct mptcp_addr_event {
-+ struct list_head list;
-+ unsigned short family;
-+ u8 code:7,
-+ low_prio:1;
-+ union inet_addr addr;
-+};
-+
-+struct fullmesh_priv {
-+ /* Worker struct for subflow establishment */
-+ struct work_struct subflow_work;
-+ /* Delayed worker, when the routing-tables are not yet ready. */
-+ struct delayed_work subflow_retry_work;
-+
-+ /* Remote addresses */
-+ struct fullmesh_rem4 remaddr4[MPTCP_MAX_ADDR];
-+ struct fullmesh_rem6 remaddr6[MPTCP_MAX_ADDR];
-+
-+ struct mptcp_cb *mpcb;
-+
-+ u16 remove_addrs; /* Addresses to remove */
-+ u8 announced_addrs_v4; /* IPv4 Addresses we did announce */
-+ u8 announced_addrs_v6; /* IPv6 Addresses we did announce */
-+
-+ u8 add_addr; /* Are we sending an add_addr? */
-+
-+ u8 rem4_bits;
-+ u8 rem6_bits;
-+};
-+
-+struct mptcp_fm_ns {
-+ struct mptcp_loc_addr __rcu *local;
-+ spinlock_t local_lock; /* Protecting the above pointer */
-+ struct list_head events;
-+ struct delayed_work address_worker;
-+
-+ struct net *net;
-+};
-+
-+static struct mptcp_pm_ops full_mesh __read_mostly;
-+
-+static void full_mesh_create_subflows(struct sock *meta_sk);
-+
-+static struct mptcp_fm_ns *fm_get_ns(const struct net *net)
-+{
-+ return (struct mptcp_fm_ns *)net->mptcp.path_managers[MPTCP_PM_FULLMESH];
-+}
-+
-+static struct fullmesh_priv *fullmesh_get_priv(const struct mptcp_cb *mpcb)
-+{
-+ return (struct fullmesh_priv *)&mpcb->mptcp_pm[0];
-+}
-+
-+/* Find the first free index in the bitfield */
-+static int __mptcp_find_free_index(u8 bitfield, u8 base)
-+{
-+ int i;
-+
-+ /* There are no free bits anyway... */
-+ if (bitfield == 0xff)
-+ goto exit;
-+
-+ i = ffs(~(bitfield >> base)) - 1;
-+ if (i < 0)
-+ goto exit;
-+
-+ /* No free bits when starting at base, try from 0 on */
-+ if (i + base >= sizeof(bitfield) * 8)
-+ return __mptcp_find_free_index(bitfield, 0);
-+
-+ return i + base;
-+exit:
-+ return -1;
-+}
-+
-+static int mptcp_find_free_index(u8 bitfield)
-+{
-+ return __mptcp_find_free_index(bitfield, 0);
-+}
-+
-+static void mptcp_addv4_raddr(struct mptcp_cb *mpcb,
-+ const struct in_addr *addr,
-+ __be16 port, u8 id)
-+{
-+ int i;
-+ struct fullmesh_rem4 *rem4;
-+ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
-+
-+ mptcp_for_each_bit_set(fmp->rem4_bits, i) {
-+ rem4 = &fmp->remaddr4[i];
-+
-+ /* Address is already in the list --- continue */
-+ if (rem4->rem4_id == id &&
-+ rem4->addr.s_addr == addr->s_addr && rem4->port == port)
-+ return;
-+
-+ /* This may be the case when the peer is behind a NAT. It is
-+ * trying to JOIN, thus sending the JOIN with a certain ID.
-+ * However, the src_addr of the IP packet has been changed. We
-+ * update the addr in the list, because this is the address as
-+ * our box sees it.
-+ */
-+ if (rem4->rem4_id == id && rem4->addr.s_addr != addr->s_addr) {
-+ /* update the address */
-+ mptcp_debug("%s: updating old addr:%pI4 to addr %pI4 with id:%d\n",
-+ __func__, &rem4->addr.s_addr,
-+ &addr->s_addr, id);
-+ rem4->addr.s_addr = addr->s_addr;
-+ rem4->port = port;
-+ mpcb->list_rcvd = 1;
-+ return;
-+ }
-+ }
-+
-+ i = mptcp_find_free_index(fmp->rem4_bits);
-+ /* Do we already have the maximum number of local/remote addresses? */
-+ if (i < 0) {
-+ mptcp_debug("%s: At max num of remote addresses: %d --- not adding address: %pI4\n",
-+ __func__, MPTCP_MAX_ADDR, &addr->s_addr);
-+ return;
-+ }
-+
-+ rem4 = &fmp->remaddr4[i];
-+
-+ /* Address is not known yet, store it */
-+ rem4->addr.s_addr = addr->s_addr;
-+ rem4->port = port;
-+ rem4->bitfield = 0;
-+ rem4->retry_bitfield = 0;
-+ rem4->rem4_id = id;
-+ mpcb->list_rcvd = 1;
-+ fmp->rem4_bits |= (1 << i);
-+
-+ return;
-+}
-+
-+static void mptcp_addv6_raddr(struct mptcp_cb *mpcb,
-+ const struct in6_addr *addr,
-+ __be16 port, u8 id)
-+{
-+ int i;
-+ struct fullmesh_rem6 *rem6;
-+ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
-+
-+ mptcp_for_each_bit_set(fmp->rem6_bits, i) {
-+ rem6 = &fmp->remaddr6[i];
-+
-+ /* Address is already in the list --- continue */
-+ if (rem6->rem6_id == id &&
-+ ipv6_addr_equal(&rem6->addr, addr) && rem6->port == port)
-+ return;
-+
-+ /* This may be the case when the peer is behind a NAT. It is
-+ * trying to JOIN, thus sending the JOIN with a certain ID.
-+ * However, the src_addr of the IP packet has been changed. We
-+ * update the addr in the list, because this is the address as
-+ * our box sees it.
-+ */
-+ if (rem6->rem6_id == id) {
-+ /* update the address */
-+ mptcp_debug("%s: updating old addr: %pI6 to addr %pI6 with id:%d\n",
-+ __func__, &rem6->addr, addr, id);
-+ rem6->addr = *addr;
-+ rem6->port = port;
-+ mpcb->list_rcvd = 1;
-+ return;
-+ }
-+ }
-+
-+ i = mptcp_find_free_index(fmp->rem6_bits);
-+ /* Do we already have the maximum number of local/remote addresses? */
-+ if (i < 0) {
-+ mptcp_debug("%s: At max num of remote addresses: %d --- not adding address: %pI6\n",
-+ __func__, MPTCP_MAX_ADDR, addr);
-+ return;
-+ }
-+
-+ rem6 = &fmp->remaddr6[i];
-+
-+ /* Address is not known yet, store it */
-+ rem6->addr = *addr;
-+ rem6->port = port;
-+ rem6->bitfield = 0;
-+ rem6->retry_bitfield = 0;
-+ rem6->rem6_id = id;
-+ mpcb->list_rcvd = 1;
-+ fmp->rem6_bits |= (1 << i);
-+
-+ return;
-+}
-+
-+static void mptcp_v4_rem_raddress(struct mptcp_cb *mpcb, u8 id)
-+{
-+ int i;
-+ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
-+
-+ mptcp_for_each_bit_set(fmp->rem4_bits, i) {
-+ if (fmp->remaddr4[i].rem4_id == id) {
-+ /* remove address from bitfield */
-+ fmp->rem4_bits &= ~(1 << i);
-+
-+ break;
-+ }
-+ }
-+}
-+
-+static void mptcp_v6_rem_raddress(const struct mptcp_cb *mpcb, u8 id)
-+{
-+ int i;
-+ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
-+
-+ mptcp_for_each_bit_set(fmp->rem6_bits, i) {
-+ if (fmp->remaddr6[i].rem6_id == id) {
-+ /* remove address from bitfield */
-+ fmp->rem6_bits &= ~(1 << i);
-+
-+ break;
-+ }
-+ }
-+}
-+
-+/* Sets the bitfield of the remote-address field */
-+static void mptcp_v4_set_init_addr_bit(const struct mptcp_cb *mpcb,
-+ const struct in_addr *addr, u8 index)
-+{
-+ int i;
-+ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
-+
-+ mptcp_for_each_bit_set(fmp->rem4_bits, i) {
-+ if (fmp->remaddr4[i].addr.s_addr == addr->s_addr) {
-+ fmp->remaddr4[i].bitfield |= (1 << index);
-+ return;
-+ }
-+ }
-+}
-+
-+/* Sets the bitfield of the remote-address field */
-+static void mptcp_v6_set_init_addr_bit(struct mptcp_cb *mpcb,
-+ const struct in6_addr *addr, u8 index)
-+{
-+ int i;
-+ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
-+
-+ mptcp_for_each_bit_set(fmp->rem6_bits, i) {
-+ if (ipv6_addr_equal(&fmp->remaddr6[i].addr, addr)) {
-+ fmp->remaddr6[i].bitfield |= (1 << index);
-+ return;
-+ }
-+ }
-+}
-+
-+static void mptcp_set_init_addr_bit(struct mptcp_cb *mpcb,
-+ const union inet_addr *addr,
-+ sa_family_t family, u8 id)
-+{
-+ if (family == AF_INET)
-+ mptcp_v4_set_init_addr_bit(mpcb, &addr->in, id);
-+ else
-+ mptcp_v6_set_init_addr_bit(mpcb, &addr->in6, id);
-+}
-+
-+static void retry_subflow_worker(struct work_struct *work)
-+{
-+ struct delayed_work *delayed_work = container_of(work,
-+ struct delayed_work,
-+ work);
-+ struct fullmesh_priv *fmp = container_of(delayed_work,
-+ struct fullmesh_priv,
-+ subflow_retry_work);
-+ struct mptcp_cb *mpcb = fmp->mpcb;
-+ struct sock *meta_sk = mpcb->meta_sk;
-+ struct mptcp_loc_addr *mptcp_local;
-+ struct mptcp_fm_ns *fm_ns = fm_get_ns(sock_net(meta_sk));
-+ int iter = 0, i;
-+
-+ /* We need a local (stable) copy of the address-list. Really, it is not
-+ * such a big deal if the address-list is not 100% up-to-date.
-+ */
-+ rcu_read_lock_bh();
-+ mptcp_local = rcu_dereference_bh(fm_ns->local);
-+ mptcp_local = kmemdup(mptcp_local, sizeof(*mptcp_local), GFP_ATOMIC);
-+ rcu_read_unlock_bh();
-+
-+ if (!mptcp_local)
-+ return;
-+
-+next_subflow:
-+ if (iter) {
-+ release_sock(meta_sk);
-+ mutex_unlock(&mpcb->mpcb_mutex);
-+
-+ cond_resched();
-+ }
-+ mutex_lock(&mpcb->mpcb_mutex);
-+ lock_sock_nested(meta_sk, SINGLE_DEPTH_NESTING);
-+
-+ iter++;
-+
-+ if (sock_flag(meta_sk, SOCK_DEAD))
-+ goto exit;
-+
-+ mptcp_for_each_bit_set(fmp->rem4_bits, i) {
-+ struct fullmesh_rem4 *rem = &fmp->remaddr4[i];
-+ /* Do we need to retry establishing a subflow ? */
-+ if (rem->retry_bitfield) {
-+ int i = mptcp_find_free_index(~rem->retry_bitfield);
-+ struct mptcp_rem4 rem4;
-+
-+ rem->bitfield |= (1 << i);
-+ rem->retry_bitfield &= ~(1 << i);
-+
-+ rem4.addr = rem->addr;
-+ rem4.port = rem->port;
-+ rem4.rem4_id = rem->rem4_id;
-+
-+ mptcp_init4_subsockets(meta_sk, &mptcp_local->locaddr4[i], &rem4);
-+ goto next_subflow;
-+ }
-+ }
-+
-+#if IS_ENABLED(CONFIG_IPV6)
-+ mptcp_for_each_bit_set(fmp->rem6_bits, i) {
-+ struct fullmesh_rem6 *rem = &fmp->remaddr6[i];
-+
-+ /* Do we need to retry establishing a subflow ? */
-+ if (rem->retry_bitfield) {
-+ int i = mptcp_find_free_index(~rem->retry_bitfield);
-+ struct mptcp_rem6 rem6;
-+
-+ rem->bitfield |= (1 << i);
-+ rem->retry_bitfield &= ~(1 << i);
-+
-+ rem6.addr = rem->addr;
-+ rem6.port = rem->port;
-+ rem6.rem6_id = rem->rem6_id;
-+
-+ mptcp_init6_subsockets(meta_sk, &mptcp_local->locaddr6[i], &rem6);
-+ goto next_subflow;
-+ }
-+ }
-+#endif
-+
-+exit:
-+ kfree(mptcp_local);
-+ release_sock(meta_sk);
-+ mutex_unlock(&mpcb->mpcb_mutex);
-+ sock_put(meta_sk);
-+}
-+
-+/**
-+ * Create all new subflows by calling mptcp_initX_subsockets
-+ *
-+ * This function uses a goto to next_subflow to allow releasing the lock
-+ * between new subflows, giving other processes a chance to do some work
-+ * on the socket and potentially finish the communication.
-+ **/
-+static void create_subflow_worker(struct work_struct *work)
-+{
-+ struct fullmesh_priv *fmp = container_of(work, struct fullmesh_priv,
-+ subflow_work);
-+ struct mptcp_cb *mpcb = fmp->mpcb;
-+ struct sock *meta_sk = mpcb->meta_sk;
-+ struct mptcp_loc_addr *mptcp_local;
-+ const struct mptcp_fm_ns *fm_ns = fm_get_ns(sock_net(meta_sk));
-+ int iter = 0, retry = 0;
-+ int i;
-+
-+ /* We need a local (stable) copy of the address-list. Really, it is not
-+ * such a big deal if the address-list is not 100% up-to-date.
-+ */
-+ rcu_read_lock_bh();
-+ mptcp_local = rcu_dereference_bh(fm_ns->local);
-+ mptcp_local = kmemdup(mptcp_local, sizeof(*mptcp_local), GFP_ATOMIC);
-+ rcu_read_unlock_bh();
-+
-+ if (!mptcp_local)
-+ return;
-+
-+next_subflow:
-+ if (iter) {
-+ release_sock(meta_sk);
-+ mutex_unlock(&mpcb->mpcb_mutex);
-+
-+ cond_resched();
-+ }
-+ mutex_lock(&mpcb->mpcb_mutex);
-+ lock_sock_nested(meta_sk, SINGLE_DEPTH_NESTING);
-+
-+ iter++;
-+
-+ if (sock_flag(meta_sk, SOCK_DEAD))
-+ goto exit;
-+
-+ if (mpcb->master_sk &&
-+ !tcp_sk(mpcb->master_sk)->mptcp->fully_established)
-+ goto exit;
-+
-+ mptcp_for_each_bit_set(fmp->rem4_bits, i) {
-+ struct fullmesh_rem4 *rem;
-+ u8 remaining_bits;
-+
-+ rem = &fmp->remaddr4[i];
-+ remaining_bits = ~(rem->bitfield) & mptcp_local->loc4_bits;
-+
-+ /* Are there still combinations to handle? */
-+ if (remaining_bits) {
-+ int i = mptcp_find_free_index(~remaining_bits);
-+ struct mptcp_rem4 rem4;
-+
-+ rem->bitfield |= (1 << i);
-+
-+ rem4.addr = rem->addr;
-+ rem4.port = rem->port;
-+ rem4.rem4_id = rem->rem4_id;
-+
-+ /* If a route is not yet available then retry once */
-+ if (mptcp_init4_subsockets(meta_sk, &mptcp_local->locaddr4[i],
-+ &rem4) == -ENETUNREACH)
-+ retry = rem->retry_bitfield |= (1 << i);
-+ goto next_subflow;
-+ }
-+ }
-+
-+#if IS_ENABLED(CONFIG_IPV6)
-+ mptcp_for_each_bit_set(fmp->rem6_bits, i) {
-+ struct fullmesh_rem6 *rem;
-+ u8 remaining_bits;
-+
-+ rem = &fmp->remaddr6[i];
-+ remaining_bits = ~(rem->bitfield) & mptcp_local->loc6_bits;
-+
-+ /* Are there still combinations to handle? */
-+ if (remaining_bits) {
-+ int i = mptcp_find_free_index(~remaining_bits);
-+ struct mptcp_rem6 rem6;
-+
-+ rem->bitfield |= (1 << i);
-+
-+ rem6.addr = rem->addr;
-+ rem6.port = rem->port;
-+ rem6.rem6_id = rem->rem6_id;
-+
-+ /* If a route is not yet available then retry once */
-+ if (mptcp_init6_subsockets(meta_sk, &mptcp_local->locaddr6[i],
-+ &rem6) == -ENETUNREACH)
-+ retry = rem->retry_bitfield |= (1 << i);
-+ goto next_subflow;
-+ }
-+ }
-+#endif
-+
-+ if (retry && !delayed_work_pending(&fmp->subflow_retry_work)) {
-+ sock_hold(meta_sk);
-+ queue_delayed_work(mptcp_wq, &fmp->subflow_retry_work,
-+ msecs_to_jiffies(MPTCP_SUBFLOW_RETRY_DELAY));
-+ }
-+
-+exit:
-+ kfree(mptcp_local);
-+ release_sock(meta_sk);
-+ mutex_unlock(&mpcb->mpcb_mutex);
-+ sock_put(meta_sk);
-+}
-+
-+static void announce_remove_addr(u8 addr_id, struct sock *meta_sk)
-+{
-+ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
-+ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
-+ struct sock *sk = mptcp_select_ack_sock(meta_sk);
-+
-+ fmp->remove_addrs |= (1 << addr_id);
-+ mpcb->addr_signal = 1;
-+
-+ if (sk)
-+ tcp_send_ack(sk);
-+}
-+
-+static void update_addr_bitfields(struct sock *meta_sk,
-+ const struct mptcp_loc_addr *mptcp_local)
-+{
-+ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
-+ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
-+ int i;
-+
-+ /* The bits in announced_addrs_* always match with loc*_bits. So, a
-+ * simple & operation unsets the correct bits, because these go from
-+ * announced to non-announced.
-+ */
-+ fmp->announced_addrs_v4 &= mptcp_local->loc4_bits;
-+
-+ mptcp_for_each_bit_set(fmp->rem4_bits, i) {
-+ fmp->remaddr4[i].bitfield &= mptcp_local->loc4_bits;
-+ fmp->remaddr4[i].retry_bitfield &= mptcp_local->loc4_bits;
-+ }
-+
-+ fmp->announced_addrs_v6 &= mptcp_local->loc6_bits;
-+
-+ mptcp_for_each_bit_set(fmp->rem6_bits, i) {
-+ fmp->remaddr6[i].bitfield &= mptcp_local->loc6_bits;
-+ fmp->remaddr6[i].retry_bitfield &= mptcp_local->loc6_bits;
-+ }
-+}
-+
-+static int mptcp_find_address(const struct mptcp_loc_addr *mptcp_local,
-+ sa_family_t family, const union inet_addr *addr)
-+{
-+ int i;
-+ u8 loc_bits;
-+ bool found = false;
-+
-+ if (family == AF_INET)
-+ loc_bits = mptcp_local->loc4_bits;
-+ else
-+ loc_bits = mptcp_local->loc6_bits;
-+
-+ mptcp_for_each_bit_set(loc_bits, i) {
-+ if (family == AF_INET &&
-+ mptcp_local->locaddr4[i].addr.s_addr == addr->in.s_addr) {
-+ found = true;
-+ break;
-+ }
-+ if (family == AF_INET6 &&
-+ ipv6_addr_equal(&mptcp_local->locaddr6[i].addr,
-+ &addr->in6)) {
-+ found = true;
-+ break;
-+ }
-+ }
-+
-+ if (!found)
-+ return -1;
-+
-+ return i;
-+}
-+
-+static void mptcp_address_worker(struct work_struct *work)
-+{
-+ const struct delayed_work *delayed_work = container_of(work,
-+ struct delayed_work,
-+ work);
-+ struct mptcp_fm_ns *fm_ns = container_of(delayed_work,
-+ struct mptcp_fm_ns,
-+ address_worker);
-+ struct net *net = fm_ns->net;
-+ struct mptcp_addr_event *event = NULL;
-+ struct mptcp_loc_addr *mptcp_local, *old;
-+ int i, id = -1; /* id is used in the socket-code on a delete-event */
-+ bool success; /* Used to indicate if we succeeded handling the event */
-+
-+next_event:
-+ success = false;
-+ kfree(event);
-+
-+ /* First, let's dequeue an event from our event-list */
-+ rcu_read_lock_bh();
-+ spin_lock(&fm_ns->local_lock);
-+
-+ event = list_first_entry_or_null(&fm_ns->events,
-+ struct mptcp_addr_event, list);
-+ if (!event) {
-+ spin_unlock(&fm_ns->local_lock);
-+ rcu_read_unlock_bh();
-+ return;
-+ }
-+
-+ list_del(&event->list);
-+
-+ mptcp_local = rcu_dereference_bh(fm_ns->local);
-+
-+ if (event->code == MPTCP_EVENT_DEL) {
-+ id = mptcp_find_address(mptcp_local, event->family, &event->addr);
-+
-+ /* Not in the list - so we don't care */
-+ if (id < 0) {
-+ mptcp_debug("%s could not find id\n", __func__);
-+ goto duno;
-+ }
-+
-+ old = mptcp_local;
-+ mptcp_local = kmemdup(mptcp_local, sizeof(*mptcp_local),
-+ GFP_ATOMIC);
-+ if (!mptcp_local)
-+ goto duno;
-+
-+ if (event->family == AF_INET)
-+ mptcp_local->loc4_bits &= ~(1 << id);
-+ else
-+ mptcp_local->loc6_bits &= ~(1 << id);
-+
-+ rcu_assign_pointer(fm_ns->local, mptcp_local);
-+ kfree(old);
-+ } else {
-+ int i = mptcp_find_address(mptcp_local, event->family, &event->addr);
-+ int j = i;
-+
-+ if (j < 0) {
-+ /* Not in the list, so we have to find an empty slot */
-+ if (event->family == AF_INET)
-+ i = __mptcp_find_free_index(mptcp_local->loc4_bits,
-+ mptcp_local->next_v4_index);
-+ if (event->family == AF_INET6)
-+ i = __mptcp_find_free_index(mptcp_local->loc6_bits,
-+ mptcp_local->next_v6_index);
-+
-+ if (i < 0) {
-+ mptcp_debug("%s no more space\n", __func__);
-+ goto duno;
-+ }
-+
-+ /* It might have been a MOD-event. */
-+ event->code = MPTCP_EVENT_ADD;
-+ } else {
-+ /* Let's check if anything changes */
-+ if (event->family == AF_INET &&
-+ event->low_prio == mptcp_local->locaddr4[i].low_prio)
-+ goto duno;
-+
-+ if (event->family == AF_INET6 &&
-+ event->low_prio == mptcp_local->locaddr6[i].low_prio)
-+ goto duno;
-+ }
-+
-+ old = mptcp_local;
-+ mptcp_local = kmemdup(mptcp_local, sizeof(*mptcp_local),
-+ GFP_ATOMIC);
-+ if (!mptcp_local)
-+ goto duno;
-+
-+ if (event->family == AF_INET) {
-+ mptcp_local->locaddr4[i].addr.s_addr = event->addr.in.s_addr;
-+ mptcp_local->locaddr4[i].loc4_id = i + 1;
-+ mptcp_local->locaddr4[i].low_prio = event->low_prio;
-+ } else {
-+ mptcp_local->locaddr6[i].addr = event->addr.in6;
-+ mptcp_local->locaddr6[i].loc6_id = i + MPTCP_MAX_ADDR;
-+ mptcp_local->locaddr6[i].low_prio = event->low_prio;
-+ }
-+
-+ if (j < 0) {
-+ if (event->family == AF_INET) {
-+ mptcp_local->loc4_bits |= (1 << i);
-+ mptcp_local->next_v4_index = i + 1;
-+ } else {
-+ mptcp_local->loc6_bits |= (1 << i);
-+ mptcp_local->next_v6_index = i + 1;
-+ }
-+ }
-+
-+ rcu_assign_pointer(fm_ns->local, mptcp_local);
-+ kfree(old);
-+ }
-+ success = true;
-+
-+duno:
-+ spin_unlock(&fm_ns->local_lock);
-+ rcu_read_unlock_bh();
-+
-+ if (!success)
-+ goto next_event;
-+
-+ /* Now we iterate over the MPTCP-sockets and apply the event. */
-+ for (i = 0; i < MPTCP_HASH_SIZE; i++) {
-+ const struct hlist_nulls_node *node;
-+ struct tcp_sock *meta_tp;
-+
-+ rcu_read_lock_bh();
-+ hlist_nulls_for_each_entry_rcu(meta_tp, node, &tk_hashtable[i],
-+ tk_table) {
-+ struct mptcp_cb *mpcb = meta_tp->mpcb;
-+ struct sock *meta_sk = (struct sock *)meta_tp, *sk;
-+ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
-+ bool meta_v4 = meta_sk->sk_family == AF_INET;
-+
-+ if (sock_net(meta_sk) != net)
-+ continue;
-+
-+ if (meta_v4) {
-+ /* skip IPv6 events if meta is IPv4 */
-+ if (event->family == AF_INET6)
-+ continue;
-+ }
-+ /* skip IPv4 events if IPV6_V6ONLY is set */
-+ else if (event->family == AF_INET &&
-+ inet6_sk(meta_sk)->ipv6only)
-+ continue;
-+
-+ if (unlikely(!atomic_inc_not_zero(&meta_sk->sk_refcnt)))
-+ continue;
-+
-+ bh_lock_sock(meta_sk);
-+
-+ if (!mptcp(meta_tp) || !is_meta_sk(meta_sk) ||
-+ mpcb->infinite_mapping_snd ||
-+ mpcb->infinite_mapping_rcv ||
-+ mpcb->send_infinite_mapping)
-+ goto next;
-+
-+ /* The pm may have changed in the meantime */
-+ if (mpcb->pm_ops != &full_mesh)
-+ goto next;
-+
-+ if (sock_owned_by_user(meta_sk)) {
-+ if (!test_and_set_bit(MPTCP_PATH_MANAGER,
-+ &meta_tp->tsq_flags))
-+ sock_hold(meta_sk);
-+
-+ goto next;
-+ }
-+
-+ if (event->code == MPTCP_EVENT_ADD) {
-+ fmp->add_addr++;
-+ mpcb->addr_signal = 1;
-+
-+ sk = mptcp_select_ack_sock(meta_sk);
-+ if (sk)
-+ tcp_send_ack(sk);
-+
-+ full_mesh_create_subflows(meta_sk);
-+ }
-+
-+ if (event->code == MPTCP_EVENT_DEL) {
-+ struct sock *sk, *tmpsk;
-+ struct mptcp_loc_addr *mptcp_local;
-+ bool found = false;
-+
-+ mptcp_local = rcu_dereference_bh(fm_ns->local);
-+
-+ /* In any case, we need to update our bitfields */
-+ if (id >= 0)
-+ update_addr_bitfields(meta_sk, mptcp_local);
-+
-+ /* Look for the socket and remove it */
-+ mptcp_for_each_sk_safe(mpcb, sk, tmpsk) {
-+ if ((event->family == AF_INET6 &&
-+ (sk->sk_family == AF_INET ||
-+ mptcp_v6_is_v4_mapped(sk))) ||
-+ (event->family == AF_INET &&
-+ (sk->sk_family == AF_INET6 &&
-+ !mptcp_v6_is_v4_mapped(sk))))
-+ continue;
-+
-+ if (event->family == AF_INET &&
-+ (sk->sk_family == AF_INET ||
-+ mptcp_v6_is_v4_mapped(sk)) &&
-+ inet_sk(sk)->inet_saddr != event->addr.in.s_addr)
-+ continue;
-+
-+ if (event->family == AF_INET6 &&
-+ sk->sk_family == AF_INET6 &&
-+ !ipv6_addr_equal(&inet6_sk(sk)->saddr, &event->addr.in6))
-+ continue;
-+
-+ /* Reinject, so that pf = 1 and so we
-+ * won't select this one as the
-+ * ack-sock.
-+ */
-+ mptcp_reinject_data(sk, 0);
-+
-+ /* We announce the removal of this id */
-+ announce_remove_addr(tcp_sk(sk)->mptcp->loc_id, meta_sk);
-+
-+ mptcp_sub_force_close(sk);
-+ found = true;
-+ }
-+
-+ if (found)
-+ goto next;
-+
-+ /* The id may have been given by the event, matching on a
-+ * local address. It may not have matched any of the above
-+ * sockets because the client never created a subflow, so we
-+ * finally have to remove it here.
-+ */
-+ if (id > 0)
-+ announce_remove_addr(id, meta_sk);
-+ }
-+
-+ if (event->code == MPTCP_EVENT_MOD) {
-+ struct sock *sk;
-+
-+ mptcp_for_each_sk(mpcb, sk) {
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ if (event->family == AF_INET &&
-+ (sk->sk_family == AF_INET ||
-+ mptcp_v6_is_v4_mapped(sk)) &&
-+ inet_sk(sk)->inet_saddr == event->addr.in.s_addr) {
-+ if (event->low_prio != tp->mptcp->low_prio) {
-+ tp->mptcp->send_mp_prio = 1;
-+ tp->mptcp->low_prio = event->low_prio;
-+
-+ tcp_send_ack(sk);
-+ }
-+ }
-+
-+ if (event->family == AF_INET6 &&
-+ sk->sk_family == AF_INET6 &&
-+ !ipv6_addr_equal(&inet6_sk(sk)->saddr, &event->addr.in6)) {
-+ if (event->low_prio != tp->mptcp->low_prio) {
-+ tp->mptcp->send_mp_prio = 1;
-+ tp->mptcp->low_prio = event->low_prio;
-+
-+ tcp_send_ack(sk);
-+ }
-+ }
-+ }
-+ }
-+next:
-+ bh_unlock_sock(meta_sk);
-+ sock_put(meta_sk);
-+ }
-+ rcu_read_unlock_bh();
-+ }
-+ goto next_event;
-+}
-+
-+static struct mptcp_addr_event *lookup_similar_event(const struct net *net,
-+ const struct mptcp_addr_event *event)
-+{
-+ struct mptcp_addr_event *eventq;
-+ struct mptcp_fm_ns *fm_ns = fm_get_ns(net);
-+
-+ list_for_each_entry(eventq, &fm_ns->events, list) {
-+ if (eventq->family != event->family)
-+ continue;
-+ if (event->family == AF_INET) {
-+ if (eventq->addr.in.s_addr == event->addr.in.s_addr)
-+ return eventq;
-+ } else {
-+ if (ipv6_addr_equal(&eventq->addr.in6, &event->addr.in6))
-+ return eventq;
-+ }
-+ }
-+ return NULL;
-+}
-+
-+/* We already hold the net-namespace MPTCP-lock */
-+static void add_pm_event(struct net *net, const struct mptcp_addr_event *event)
-+{
-+ struct mptcp_addr_event *eventq = lookup_similar_event(net, event);
-+ struct mptcp_fm_ns *fm_ns = fm_get_ns(net);
-+
-+ if (eventq) {
-+ switch (event->code) {
-+ case MPTCP_EVENT_DEL:
-+ mptcp_debug("%s del old_code %u\n", __func__, eventq->code);
-+ list_del(&eventq->list);
-+ kfree(eventq);
-+ break;
-+ case MPTCP_EVENT_ADD:
-+ mptcp_debug("%s add old_code %u\n", __func__, eventq->code);
-+ eventq->low_prio = event->low_prio;
-+ eventq->code = MPTCP_EVENT_ADD;
-+ return;
-+ case MPTCP_EVENT_MOD:
-+ mptcp_debug("%s mod old_code %u\n", __func__, eventq->code);
-+ eventq->low_prio = event->low_prio;
-+ eventq->code = MPTCP_EVENT_MOD;
-+ return;
-+ }
-+ }
-+
-+ /* OK, we have to add the new address to the wait queue */
-+ eventq = kmemdup(event, sizeof(struct mptcp_addr_event), GFP_ATOMIC);
-+ if (!eventq)
-+ return;
-+
-+ list_add_tail(&eventq->list, &fm_ns->events);
-+
-+ /* Create work-queue */
-+ if (!delayed_work_pending(&fm_ns->address_worker))
-+ queue_delayed_work(mptcp_wq, &fm_ns->address_worker,
-+ msecs_to_jiffies(500));
-+}
-+
-+static void addr4_event_handler(const struct in_ifaddr *ifa, unsigned long event,
-+ struct net *net)
-+{
-+ const struct net_device *netdev = ifa->ifa_dev->dev;
-+ struct mptcp_fm_ns *fm_ns = fm_get_ns(net);
-+ struct mptcp_addr_event mpevent;
-+
-+ if (ifa->ifa_scope > RT_SCOPE_LINK ||
-+ ipv4_is_loopback(ifa->ifa_local))
-+ return;
-+
-+ spin_lock_bh(&fm_ns->local_lock);
-+
-+ mpevent.family = AF_INET;
-+ mpevent.addr.in.s_addr = ifa->ifa_local;
-+ mpevent.low_prio = (netdev->flags & IFF_MPBACKUP) ? 1 : 0;
-+
-+ if (event == NETDEV_DOWN || !netif_running(netdev) ||
-+ (netdev->flags & IFF_NOMULTIPATH) || !(netdev->flags & IFF_UP))
-+ mpevent.code = MPTCP_EVENT_DEL;
-+ else if (event == NETDEV_UP)
-+ mpevent.code = MPTCP_EVENT_ADD;
-+ else if (event == NETDEV_CHANGE)
-+ mpevent.code = MPTCP_EVENT_MOD;
-+
-+ mptcp_debug("%s created event for %pI4, code %u prio %u\n", __func__,
-+ &ifa->ifa_local, mpevent.code, mpevent.low_prio);
-+ add_pm_event(net, &mpevent);
-+
-+ spin_unlock_bh(&fm_ns->local_lock);
-+ return;
-+}
-+
-+/* React on IPv4-addr add/rem-events */
-+static int mptcp_pm_inetaddr_event(struct notifier_block *this,
-+ unsigned long event, void *ptr)
-+{
-+ const struct in_ifaddr *ifa = (struct in_ifaddr *)ptr;
-+ struct net *net = dev_net(ifa->ifa_dev->dev);
-+
-+ if (!(event == NETDEV_UP || event == NETDEV_DOWN ||
-+ event == NETDEV_CHANGE))
-+ return NOTIFY_DONE;
-+
-+ addr4_event_handler(ifa, event, net);
-+
-+ return NOTIFY_DONE;
-+}
-+
-+static struct notifier_block mptcp_pm_inetaddr_notifier = {
-+ .notifier_call = mptcp_pm_inetaddr_event,
-+};
-+
-+#if IS_ENABLED(CONFIG_IPV6)
-+
-+/* IPV6-related address/interface watchers */
-+struct mptcp_dad_data {
-+ struct timer_list timer;
-+ struct inet6_ifaddr *ifa;
-+};
-+
-+static void dad_callback(unsigned long arg);
-+static int inet6_addr_event(struct notifier_block *this,
-+ unsigned long event, void *ptr);
-+
-+static int ipv6_is_in_dad_state(const struct inet6_ifaddr *ifa)
-+{
-+ return (ifa->flags & IFA_F_TENTATIVE) &&
-+ ifa->state == INET6_IFADDR_STATE_DAD;
-+}
-+
-+static void dad_init_timer(struct mptcp_dad_data *data,
-+ struct inet6_ifaddr *ifa)
-+{
-+ data->ifa = ifa;
-+ data->timer.data = (unsigned long)data;
-+ data->timer.function = dad_callback;
-+ if (ifa->idev->cnf.rtr_solicit_delay)
-+ data->timer.expires = jiffies + ifa->idev->cnf.rtr_solicit_delay;
-+ else
-+ data->timer.expires = jiffies + (HZ/10);
-+}
-+
-+static void dad_callback(unsigned long arg)
-+{
-+ struct mptcp_dad_data *data = (struct mptcp_dad_data *)arg;
-+
-+ if (ipv6_is_in_dad_state(data->ifa)) {
-+ dad_init_timer(data, data->ifa);
-+ add_timer(&data->timer);
-+ } else {
-+ inet6_addr_event(NULL, NETDEV_UP, data->ifa);
-+ in6_ifa_put(data->ifa);
-+ kfree(data);
-+ }
-+}
-+
-+static inline void dad_setup_timer(struct inet6_ifaddr *ifa)
-+{
-+ struct mptcp_dad_data *data;
-+
-+ data = kmalloc(sizeof(*data), GFP_ATOMIC);
-+
-+ if (!data)
-+ return;
-+
-+ init_timer(&data->timer);
-+ dad_init_timer(data, ifa);
-+ add_timer(&data->timer);
-+ in6_ifa_hold(ifa);
-+}
-+
-+static void addr6_event_handler(const struct inet6_ifaddr *ifa, unsigned long event,
-+ struct net *net)
-+{
-+ const struct net_device *netdev = ifa->idev->dev;
-+ int addr_type = ipv6_addr_type(&ifa->addr);
-+ struct mptcp_fm_ns *fm_ns = fm_get_ns(net);
-+ struct mptcp_addr_event mpevent;
-+
-+ if (ifa->scope > RT_SCOPE_LINK ||
-+ addr_type == IPV6_ADDR_ANY ||
-+ (addr_type & IPV6_ADDR_LOOPBACK) ||
-+ (addr_type & IPV6_ADDR_LINKLOCAL))
-+ return;
-+
-+ spin_lock_bh(&fm_ns->local_lock);
-+
-+ mpevent.family = AF_INET6;
-+ mpevent.addr.in6 = ifa->addr;
-+ mpevent.low_prio = (netdev->flags & IFF_MPBACKUP) ? 1 : 0;
-+
-+ if (event == NETDEV_DOWN || !netif_running(netdev) ||
-+ (netdev->flags & IFF_NOMULTIPATH) || !(netdev->flags & IFF_UP))
-+ mpevent.code = MPTCP_EVENT_DEL;
-+ else if (event == NETDEV_UP)
-+ mpevent.code = MPTCP_EVENT_ADD;
-+ else if (event == NETDEV_CHANGE)
-+ mpevent.code = MPTCP_EVENT_MOD;
-+
-+ mptcp_debug("%s created event for %pI6, code %u prio %u\n", __func__,
-+ &ifa->addr, mpevent.code, mpevent.low_prio);
-+ add_pm_event(net, &mpevent);
-+
-+ spin_unlock_bh(&fm_ns->local_lock);
-+ return;
-+}
-+
-+/* React on IPv6-addr add/rem-events */
-+static int inet6_addr_event(struct notifier_block *this, unsigned long event,
-+ void *ptr)
-+{
-+ struct inet6_ifaddr *ifa6 = (struct inet6_ifaddr *)ptr;
-+ struct net *net = dev_net(ifa6->idev->dev);
-+
-+ if (!(event == NETDEV_UP || event == NETDEV_DOWN ||
-+ event == NETDEV_CHANGE))
-+ return NOTIFY_DONE;
-+
-+ if (ipv6_is_in_dad_state(ifa6))
-+ dad_setup_timer(ifa6);
-+ else
-+ addr6_event_handler(ifa6, event, net);
-+
-+ return NOTIFY_DONE;
-+}
-+
-+static struct notifier_block inet6_addr_notifier = {
-+ .notifier_call = inet6_addr_event,
-+};
-+
-+#endif
-+
-+/* React on ifup/down-events */
-+static int netdev_event(struct notifier_block *this, unsigned long event,
-+ void *ptr)
-+{
-+ const struct net_device *dev = netdev_notifier_info_to_dev(ptr);
-+ struct in_device *in_dev;
-+#if IS_ENABLED(CONFIG_IPV6)
-+ struct inet6_dev *in6_dev;
-+#endif
-+
-+ if (!(event == NETDEV_UP || event == NETDEV_DOWN ||
-+ event == NETDEV_CHANGE))
-+ return NOTIFY_DONE;
-+
-+ rcu_read_lock();
-+ in_dev = __in_dev_get_rtnl(dev);
-+
-+ if (in_dev) {
-+ for_ifa(in_dev) {
-+ mptcp_pm_inetaddr_event(NULL, event, ifa);
-+ } endfor_ifa(in_dev);
-+ }
-+
-+#if IS_ENABLED(CONFIG_IPV6)
-+ in6_dev = __in6_dev_get(dev);
-+
-+ if (in6_dev) {
-+ struct inet6_ifaddr *ifa6;
-+ list_for_each_entry(ifa6, &in6_dev->addr_list, if_list)
-+ inet6_addr_event(NULL, event, ifa6);
-+ }
-+#endif
-+
-+ rcu_read_unlock();
-+ return NOTIFY_DONE;
-+}
-+
-+static struct notifier_block mptcp_pm_netdev_notifier = {
-+ .notifier_call = netdev_event,
-+};
-+
-+static void full_mesh_add_raddr(struct mptcp_cb *mpcb,
-+ const union inet_addr *addr,
-+ sa_family_t family, __be16 port, u8 id)
-+{
-+ if (family == AF_INET)
-+ mptcp_addv4_raddr(mpcb, &addr->in, port, id);
-+ else
-+ mptcp_addv6_raddr(mpcb, &addr->in6, port, id);
-+}
-+
-+static void full_mesh_new_session(const struct sock *meta_sk)
-+{
-+ struct mptcp_loc_addr *mptcp_local;
-+ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
-+ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
-+ const struct mptcp_fm_ns *fm_ns = fm_get_ns(sock_net(meta_sk));
-+ int i, index;
-+ union inet_addr saddr, daddr;
-+ sa_family_t family;
-+ bool meta_v4 = meta_sk->sk_family == AF_INET;
-+
-+ /* Init local variables necessary for the rest */
-+ if (meta_sk->sk_family == AF_INET || mptcp_v6_is_v4_mapped(meta_sk)) {
-+ saddr.ip = inet_sk(meta_sk)->inet_saddr;
-+ daddr.ip = inet_sk(meta_sk)->inet_daddr;
-+ family = AF_INET;
-+#if IS_ENABLED(CONFIG_IPV6)
-+ } else {
-+ saddr.in6 = inet6_sk(meta_sk)->saddr;
-+ daddr.in6 = meta_sk->sk_v6_daddr;
-+ family = AF_INET6;
-+#endif
-+ }
-+
-+ rcu_read_lock();
-+ mptcp_local = rcu_dereference(fm_ns->local);
-+
-+ index = mptcp_find_address(mptcp_local, family, &saddr);
-+ if (index < 0)
-+ goto fallback;
-+
-+ full_mesh_add_raddr(mpcb, &daddr, family, 0, 0);
-+ mptcp_set_init_addr_bit(mpcb, &daddr, family, index);
-+
-+ /* Initialize workqueue-struct */
-+ INIT_WORK(&fmp->subflow_work, create_subflow_worker);
-+ INIT_DELAYED_WORK(&fmp->subflow_retry_work, retry_subflow_worker);
-+ fmp->mpcb = mpcb;
-+
-+ if (!meta_v4 && inet6_sk(meta_sk)->ipv6only)
-+ goto skip_ipv4;
-+
-+ /* Look for the address among the local addresses */
-+ mptcp_for_each_bit_set(mptcp_local->loc4_bits, i) {
-+ __be32 ifa_address = mptcp_local->locaddr4[i].addr.s_addr;
-+
-+ /* We do not need to announce the initial subflow's address again */
-+ if (family == AF_INET && saddr.ip == ifa_address)
-+ continue;
-+
-+ fmp->add_addr++;
-+ mpcb->addr_signal = 1;
-+ }
-+
-+skip_ipv4:
-+#if IS_ENABLED(CONFIG_IPV6)
-+ /* skip IPv6 addresses if meta-socket is IPv4 */
-+ if (meta_v4)
-+ goto skip_ipv6;
-+
-+ mptcp_for_each_bit_set(mptcp_local->loc6_bits, i) {
-+ const struct in6_addr *ifa6 = &mptcp_local->locaddr6[i].addr;
-+
-+ /* We do not need to announce the initial subflow's address again */
-+ if (family == AF_INET6 && ipv6_addr_equal(&saddr.in6, ifa6))
-+ continue;
-+
-+ fmp->add_addr++;
-+ mpcb->addr_signal = 1;
-+ }
-+
-+skip_ipv6:
-+#endif
-+
-+ rcu_read_unlock();
-+
-+ if (family == AF_INET)
-+ fmp->announced_addrs_v4 |= (1 << index);
-+ else
-+ fmp->announced_addrs_v6 |= (1 << index);
-+
-+ for (i = fmp->add_addr; i && fmp->add_addr; i--)
-+ tcp_send_ack(mpcb->master_sk);
-+
-+ return;
-+
-+fallback:
-+ rcu_read_unlock();
-+ mptcp_fallback_default(mpcb);
-+ return;
-+}
-+
-+static void full_mesh_create_subflows(struct sock *meta_sk)
-+{
-+ const struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
-+ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
-+
-+ if (mpcb->infinite_mapping_snd || mpcb->infinite_mapping_rcv ||
-+ mpcb->send_infinite_mapping ||
-+ mpcb->server_side || sock_flag(meta_sk, SOCK_DEAD))
-+ return;
-+
-+ if (mpcb->master_sk &&
-+ !tcp_sk(mpcb->master_sk)->mptcp->fully_established)
-+ return;
-+
-+ if (!work_pending(&fmp->subflow_work)) {
-+ sock_hold(meta_sk);
-+ queue_work(mptcp_wq, &fmp->subflow_work);
-+ }
-+}
-+
-+/* Called upon release_sock, if the socket was owned by the user during
-+ * a path-management event.
-+ */
-+static void full_mesh_release_sock(struct sock *meta_sk)
-+{
-+ struct mptcp_loc_addr *mptcp_local;
-+ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
-+ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
-+ const struct mptcp_fm_ns *fm_ns = fm_get_ns(sock_net(meta_sk));
-+ struct sock *sk, *tmpsk;
-+ bool meta_v4 = meta_sk->sk_family == AF_INET;
-+ int i;
-+
-+ rcu_read_lock();
-+ mptcp_local = rcu_dereference(fm_ns->local);
-+
-+ if (!meta_v4 && inet6_sk(meta_sk)->ipv6only)
-+ goto skip_ipv4;
-+
-+ /* First, detect modifications or additions */
-+ mptcp_for_each_bit_set(mptcp_local->loc4_bits, i) {
-+ struct in_addr ifa = mptcp_local->locaddr4[i].addr;
-+ bool found = false;
-+
-+ mptcp_for_each_sk(mpcb, sk) {
-+ struct tcp_sock *tp = tcp_sk(sk);
-+
-+ if (sk->sk_family == AF_INET6 &&
-+ !mptcp_v6_is_v4_mapped(sk))
-+ continue;
-+
-+ if (inet_sk(sk)->inet_saddr != ifa.s_addr)
-+ continue;
-+
-+ found = true;
-+
-+ if (mptcp_local->locaddr4[i].low_prio != tp->mptcp->low_prio) {
-+ tp->mptcp->send_mp_prio = 1;
-+ tp->mptcp->low_prio = mptcp_local->locaddr4[i].low_prio;
-+
-+ tcp_send_ack(sk);
-+ }
-+ }
-+
-+ if (!found) {
-+ fmp->add_addr++;
-+ mpcb->addr_signal = 1;
-+
-+ sk = mptcp_select_ack_sock(meta_sk);
-+ if (sk)
-+ tcp_send_ack(sk);
-+ full_mesh_create_subflows(meta_sk);
-+ }
-+ }
-+
-+skip_ipv4:
-+#if IS_ENABLED(CONFIG_IPV6)
-+ /* skip IPv6 addresses if meta-socket is IPv4 */
-+ if (meta_v4)
-+ goto removal;
-+
-+ mptcp_for_each_bit_set(mptcp_local->loc6_bits, i) {
-+ struct in6_addr ifa = mptcp_local->locaddr6[i].addr;
-+ bool found = false;
-+
-+ mptcp_for_each_sk(mpcb, sk) {
-+ struct tcp_sock *tp = tcp_sk(sk);
-+
-+ if (sk->sk_family == AF_INET ||
-+ mptcp_v6_is_v4_mapped(sk))
-+ continue;
-+
-+ if (!ipv6_addr_equal(&inet6_sk(sk)->saddr, &ifa))
-+ continue;
-+
-+ found = true;
-+
-+ if (mptcp_local->locaddr6[i].low_prio != tp->mptcp->low_prio) {
-+ tp->mptcp->send_mp_prio = 1;
-+ tp->mptcp->low_prio = mptcp_local->locaddr6[i].low_prio;
-+
-+ tcp_send_ack(sk);
-+ }
-+ }
-+
-+ if (!found) {
-+ fmp->add_addr++;
-+ mpcb->addr_signal = 1;
-+
-+ sk = mptcp_select_ack_sock(meta_sk);
-+ if (sk)
-+ tcp_send_ack(sk);
-+ full_mesh_create_subflows(meta_sk);
-+ }
-+ }
-+
-+removal:
-+#endif
-+
-+ /* Now, detect address-removals */
-+ mptcp_for_each_sk_safe(mpcb, sk, tmpsk) {
-+ bool shall_remove = true;
-+
-+ if (sk->sk_family == AF_INET || mptcp_v6_is_v4_mapped(sk)) {
-+ mptcp_for_each_bit_set(mptcp_local->loc4_bits, i) {
-+ if (inet_sk(sk)->inet_saddr == mptcp_local->locaddr4[i].addr.s_addr) {
-+ shall_remove = false;
-+ break;
-+ }
-+ }
-+ } else {
-+ mptcp_for_each_bit_set(mptcp_local->loc6_bits, i) {
-+ if (ipv6_addr_equal(&inet6_sk(sk)->saddr, &mptcp_local->locaddr6[i].addr)) {
-+ shall_remove = false;
-+ break;
-+ }
-+ }
-+ }
-+
-+ if (shall_remove) {
-+ /* Reinject, so that pf = 1 and so we
-+ * won't select this one as the
-+ * ack-sock.
-+ */
-+ mptcp_reinject_data(sk, 0);
-+
-+ announce_remove_addr(tcp_sk(sk)->mptcp->loc_id,
-+ meta_sk);
-+
-+ mptcp_sub_force_close(sk);
-+ }
-+ }
-+
-+ /* Just call it optimistically. It actually cannot do any harm */
-+ update_addr_bitfields(meta_sk, mptcp_local);
-+
-+ rcu_read_unlock();
-+}
-+
-+static int full_mesh_get_local_id(sa_family_t family, union inet_addr *addr,
-+ struct net *net, bool *low_prio)
-+{
-+ struct mptcp_loc_addr *mptcp_local;
-+ const struct mptcp_fm_ns *fm_ns = fm_get_ns(net);
-+ int index, id = -1;
-+
-+ /* Handle the backup-flows */
-+ rcu_read_lock();
-+ mptcp_local = rcu_dereference(fm_ns->local);
-+
-+ index = mptcp_find_address(mptcp_local, family, addr);
-+
-+ if (index != -1) {
-+ if (family == AF_INET) {
-+ id = mptcp_local->locaddr4[index].loc4_id;
-+ *low_prio = mptcp_local->locaddr4[index].low_prio;
-+ } else {
-+ id = mptcp_local->locaddr6[index].loc6_id;
-+ *low_prio = mptcp_local->locaddr6[index].low_prio;
-+ }
-+ }
-+
-+
-+ rcu_read_unlock();
-+
-+ return id;
-+}
-+
-+static void full_mesh_addr_signal(struct sock *sk, unsigned *size,
-+ struct tcp_out_options *opts,
-+ struct sk_buff *skb)
-+{
-+ const struct tcp_sock *tp = tcp_sk(sk);
-+ struct mptcp_cb *mpcb = tp->mpcb;
-+ struct sock *meta_sk = mpcb->meta_sk;
-+ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
-+ struct mptcp_loc_addr *mptcp_local;
-+ struct mptcp_fm_ns *fm_ns = fm_get_ns(sock_net(sk));
-+ int remove_addr_len;
-+ u8 unannouncedv4 = 0, unannouncedv6 = 0;
-+ bool meta_v4 = meta_sk->sk_family == AF_INET;
-+
-+ mpcb->addr_signal = 0;
-+
-+ if (likely(!fmp->add_addr))
-+ goto remove_addr;
-+
-+ rcu_read_lock();
-+ mptcp_local = rcu_dereference(fm_ns->local);
-+
-+ if (!meta_v4 && inet6_sk(meta_sk)->ipv6only)
-+ goto skip_ipv4;
-+
-+ /* IPv4 */
-+ unannouncedv4 = (~fmp->announced_addrs_v4) & mptcp_local->loc4_bits;
-+ if (unannouncedv4 &&
-+ MAX_TCP_OPTION_SPACE - *size >= MPTCP_SUB_LEN_ADD_ADDR4_ALIGN) {
-+ int ind = mptcp_find_free_index(~unannouncedv4);
-+
-+ opts->options |= OPTION_MPTCP;
-+ opts->mptcp_options |= OPTION_ADD_ADDR;
-+ opts->add_addr4.addr_id = mptcp_local->locaddr4[ind].loc4_id;
-+ opts->add_addr4.addr = mptcp_local->locaddr4[ind].addr;
-+ opts->add_addr_v4 = 1;
-+
-+ if (skb) {
-+ fmp->announced_addrs_v4 |= (1 << ind);
-+ fmp->add_addr--;
-+ }
-+ *size += MPTCP_SUB_LEN_ADD_ADDR4_ALIGN;
-+ }
-+
-+ if (meta_v4)
-+ goto skip_ipv6;
-+
-+skip_ipv4:
-+ /* IPv6 */
-+ unannouncedv6 = (~fmp->announced_addrs_v6) & mptcp_local->loc6_bits;
-+ if (unannouncedv6 &&
-+ MAX_TCP_OPTION_SPACE - *size >= MPTCP_SUB_LEN_ADD_ADDR6_ALIGN) {
-+ int ind = mptcp_find_free_index(~unannouncedv6);
-+
-+ opts->options |= OPTION_MPTCP;
-+ opts->mptcp_options |= OPTION_ADD_ADDR;
-+ opts->add_addr6.addr_id = mptcp_local->locaddr6[ind].loc6_id;
-+ opts->add_addr6.addr = mptcp_local->locaddr6[ind].addr;
-+ opts->add_addr_v6 = 1;
-+
-+ if (skb) {
-+ fmp->announced_addrs_v6 |= (1 << ind);
-+ fmp->add_addr--;
-+ }
-+ *size += MPTCP_SUB_LEN_ADD_ADDR6_ALIGN;
-+ }
-+
-+skip_ipv6:
-+ rcu_read_unlock();
-+
-+ if (!unannouncedv4 && !unannouncedv6 && skb)
-+ fmp->add_addr--;
-+
-+remove_addr:
-+ if (likely(!fmp->remove_addrs))
-+ goto exit;
-+
-+ remove_addr_len = mptcp_sub_len_remove_addr_align(fmp->remove_addrs);
-+ if (MAX_TCP_OPTION_SPACE - *size < remove_addr_len)
-+ goto exit;
-+
-+ opts->options |= OPTION_MPTCP;
-+ opts->mptcp_options |= OPTION_REMOVE_ADDR;
-+ opts->remove_addrs = fmp->remove_addrs;
-+ *size += remove_addr_len;
-+ if (skb)
-+ fmp->remove_addrs = 0;
-+
-+exit:
-+ mpcb->addr_signal = !!(fmp->add_addr || fmp->remove_addrs);
-+}
-+
-+static void full_mesh_rem_raddr(struct mptcp_cb *mpcb, u8 rem_id)
-+{
-+ mptcp_v4_rem_raddress(mpcb, rem_id);
-+ mptcp_v6_rem_raddress(mpcb, rem_id);
-+}
-+
-+/* Output /proc/net/mptcp_fullmesh */
-+static int mptcp_fm_seq_show(struct seq_file *seq, void *v)
-+{
-+ const struct net *net = seq->private;
-+ struct mptcp_loc_addr *mptcp_local;
-+ const struct mptcp_fm_ns *fm_ns = fm_get_ns(net);
-+ int i;
-+
-+ seq_printf(seq, "Index, Address-ID, Backup, IP-address\n");
-+
-+ rcu_read_lock_bh();
-+ mptcp_local = rcu_dereference(fm_ns->local);
-+
-+ seq_printf(seq, "IPv4, next v4-index: %u\n", mptcp_local->next_v4_index);
-+
-+ mptcp_for_each_bit_set(mptcp_local->loc4_bits, i) {
-+ struct mptcp_loc4 *loc4 = &mptcp_local->locaddr4[i];
-+
-+ seq_printf(seq, "%u, %u, %u, %pI4\n", i, loc4->loc4_id,
-+ loc4->low_prio, &loc4->addr);
-+ }
-+
-+ seq_printf(seq, "IPv6, next v6-index: %u\n", mptcp_local->next_v6_index);
-+
-+ mptcp_for_each_bit_set(mptcp_local->loc6_bits, i) {
-+ struct mptcp_loc6 *loc6 = &mptcp_local->locaddr6[i];
-+
-+ seq_printf(seq, "%u, %u, %u, %pI6\n", i, loc6->loc6_id,
-+ loc6->low_prio, &loc6->addr);
-+ }
-+ rcu_read_unlock_bh();
-+
-+ return 0;
-+}
-+
-+static int mptcp_fm_seq_open(struct inode *inode, struct file *file)
-+{
-+ return single_open_net(inode, file, mptcp_fm_seq_show);
-+}
-+
-+static const struct file_operations mptcp_fm_seq_fops = {
-+ .owner = THIS_MODULE,
-+ .open = mptcp_fm_seq_open,
-+ .read = seq_read,
-+ .llseek = seq_lseek,
-+ .release = single_release_net,
-+};
-+
-+static int mptcp_fm_init_net(struct net *net)
-+{
-+ struct mptcp_loc_addr *mptcp_local;
-+ struct mptcp_fm_ns *fm_ns;
-+ int err = 0;
-+
-+ fm_ns = kzalloc(sizeof(*fm_ns), GFP_KERNEL);
-+ if (!fm_ns)
-+ return -ENOBUFS;
-+
-+ mptcp_local = kzalloc(sizeof(*mptcp_local), GFP_KERNEL);
-+ if (!mptcp_local) {
-+ err = -ENOBUFS;
-+ goto err_mptcp_local;
-+ }
-+
-+ if (!proc_create("mptcp_fullmesh", S_IRUGO, net->proc_net,
-+ &mptcp_fm_seq_fops)) {
-+ err = -ENOMEM;
-+ goto err_seq_fops;
-+ }
-+
-+ mptcp_local->next_v4_index = 1;
-+
-+ rcu_assign_pointer(fm_ns->local, mptcp_local);
-+ INIT_DELAYED_WORK(&fm_ns->address_worker, mptcp_address_worker);
-+ INIT_LIST_HEAD(&fm_ns->events);
-+ spin_lock_init(&fm_ns->local_lock);
-+ fm_ns->net = net;
-+ net->mptcp.path_managers[MPTCP_PM_FULLMESH] = fm_ns;
-+
-+ return 0;
-+err_seq_fops:
-+ kfree(mptcp_local);
-+err_mptcp_local:
-+ kfree(fm_ns);
-+ return err;
-+}
-+
-+static void mptcp_fm_exit_net(struct net *net)
-+{
-+ struct mptcp_addr_event *eventq, *tmp;
-+ struct mptcp_fm_ns *fm_ns;
-+ struct mptcp_loc_addr *mptcp_local;
-+
-+ fm_ns = fm_get_ns(net);
-+ cancel_delayed_work_sync(&fm_ns->address_worker);
-+
-+ rcu_read_lock_bh();
-+
-+ mptcp_local = rcu_dereference_bh(fm_ns->local);
-+ kfree(mptcp_local);
-+
-+ spin_lock(&fm_ns->local_lock);
-+ list_for_each_entry_safe(eventq, tmp, &fm_ns->events, list) {
-+ list_del(&eventq->list);
-+ kfree(eventq);
-+ }
-+ spin_unlock(&fm_ns->local_lock);
-+
-+ rcu_read_unlock_bh();
-+
-+ remove_proc_entry("mptcp_fullmesh", net->proc_net);
-+
-+ kfree(fm_ns);
-+}
-+
-+static struct pernet_operations full_mesh_net_ops = {
-+ .init = mptcp_fm_init_net,
-+ .exit = mptcp_fm_exit_net,
-+};
-+
-+static struct mptcp_pm_ops full_mesh __read_mostly = {
-+ .new_session = full_mesh_new_session,
-+ .release_sock = full_mesh_release_sock,
-+ .fully_established = full_mesh_create_subflows,
-+ .new_remote_address = full_mesh_create_subflows,
-+ .get_local_id = full_mesh_get_local_id,
-+ .addr_signal = full_mesh_addr_signal,
-+ .add_raddr = full_mesh_add_raddr,
-+ .rem_raddr = full_mesh_rem_raddr,
-+ .name = "fullmesh",
-+ .owner = THIS_MODULE,
-+};
-+
-+/* General initialization of MPTCP_PM */
-+static int __init full_mesh_register(void)
-+{
-+ int ret;
-+
-+ BUILD_BUG_ON(sizeof(struct fullmesh_priv) > MPTCP_PM_SIZE);
-+
-+ ret = register_pernet_subsys(&full_mesh_net_ops);
-+ if (ret)
-+ goto out;
-+
-+ ret = register_inetaddr_notifier(&mptcp_pm_inetaddr_notifier);
-+ if (ret)
-+ goto err_reg_inetaddr;
-+ ret = register_netdevice_notifier(&mptcp_pm_netdev_notifier);
-+ if (ret)
-+ goto err_reg_netdev;
-+
-+#if IS_ENABLED(CONFIG_IPV6)
-+ ret = register_inet6addr_notifier(&inet6_addr_notifier);
-+ if (ret)
-+ goto err_reg_inet6addr;
-+#endif
-+
-+ ret = mptcp_register_path_manager(&full_mesh);
-+ if (ret)
-+ goto err_reg_pm;
-+
-+out:
-+ return ret;
-+
-+
-+err_reg_pm:
-+#if IS_ENABLED(CONFIG_IPV6)
-+ unregister_inet6addr_notifier(&inet6_addr_notifier);
-+err_reg_inet6addr:
-+#endif
-+ unregister_netdevice_notifier(&mptcp_pm_netdev_notifier);
-+err_reg_netdev:
-+ unregister_inetaddr_notifier(&mptcp_pm_inetaddr_notifier);
-+err_reg_inetaddr:
-+ unregister_pernet_subsys(&full_mesh_net_ops);
-+ goto out;
-+}
-+
-+static void full_mesh_unregister(void)
-+{
-+#if IS_ENABLED(CONFIG_IPV6)
-+ unregister_inet6addr_notifier(&inet6_addr_notifier);
-+#endif
-+ unregister_netdevice_notifier(&mptcp_pm_netdev_notifier);
-+ unregister_inetaddr_notifier(&mptcp_pm_inetaddr_notifier);
-+ unregister_pernet_subsys(&full_mesh_net_ops);
-+ mptcp_unregister_path_manager(&full_mesh);
-+}
-+
-+module_init(full_mesh_register);
-+module_exit(full_mesh_unregister);
-+
-+MODULE_AUTHOR("Christoph Paasch");
-+MODULE_LICENSE("GPL");
-+MODULE_DESCRIPTION("Full-Mesh MPTCP");
-+MODULE_VERSION("0.88");
-diff --git a/net/mptcp/mptcp_input.c b/net/mptcp/mptcp_input.c
-new file mode 100644
-index 000000000000..43704ccb639e
---- /dev/null
-+++ b/net/mptcp/mptcp_input.c
-@@ -0,0 +1,2405 @@
-+/*
-+ * MPTCP implementation - Sending side
-+ *
-+ * Initial Design & Implementation:
-+ * Sébastien Barré <sebastien.barre@uclouvain.be>
-+ *
-+ * Current Maintainer & Author:
-+ * Christoph Paasch <christoph.paasch@uclouvain.be>
-+ *
-+ * Additional authors:
-+ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
-+ * Gregory Detal <gregory.detal@uclouvain.be>
-+ * Fabien Duchêne <fabien.duchene@uclouvain.be>
-+ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
-+ * Lavkesh Lahngir <lavkesh51@gmail.com>
-+ * Andreas Ripke <ripke@neclab.eu>
-+ * Vlad Dogaru <vlad.dogaru@intel.com>
-+ * Octavian Purdila <octavian.purdila@intel.com>
-+ * John Ronan <jronan@tssg.org>
-+ * Catalin Nicutar <catalin.nicutar@gmail.com>
-+ * Brandon Heller <brandonh@stanford.edu>
-+ *
-+ *
-+ * This program is free software; you can redistribute it and/or
-+ * modify it under the terms of the GNU General Public License
-+ * as published by the Free Software Foundation; either version
-+ * 2 of the License, or (at your option) any later version.
-+ */
-+
-+#include <asm/unaligned.h>
-+
-+#include <net/mptcp.h>
-+#include <net/mptcp_v4.h>
-+#include <net/mptcp_v6.h>
-+
-+#include <linux/kconfig.h>
-+
-+/* is seq1 < seq2 ? */
-+static inline bool before64(const u64 seq1, const u64 seq2)
-+{
-+ return (s64)(seq1 - seq2) < 0;
-+}
-+
-+/* is seq1 > seq2 ? */
-+#define after64(seq1, seq2) before64(seq2, seq1)
-+
-+static inline void mptcp_become_fully_estab(struct sock *sk)
-+{
-+ tcp_sk(sk)->mptcp->fully_established = 1;
-+
-+ if (is_master_tp(tcp_sk(sk)) &&
-+ tcp_sk(sk)->mpcb->pm_ops->fully_established)
-+ tcp_sk(sk)->mpcb->pm_ops->fully_established(mptcp_meta_sk(sk));
-+}
-+
-+/* Similar to tcp_tso_acked without any memory accounting */
-+static inline int mptcp_tso_acked_reinject(const struct sock *meta_sk,
-+ struct sk_buff *skb)
-+{
-+ const struct tcp_sock *meta_tp = tcp_sk(meta_sk);
-+ u32 packets_acked, len;
-+
-+ BUG_ON(!after(TCP_SKB_CB(skb)->end_seq, meta_tp->snd_una));
-+
-+ packets_acked = tcp_skb_pcount(skb);
-+
-+ if (skb_unclone(skb, GFP_ATOMIC))
-+ return 0;
-+
-+ len = meta_tp->snd_una - TCP_SKB_CB(skb)->seq;
-+ __pskb_trim_head(skb, len);
-+
-+ TCP_SKB_CB(skb)->seq += len;
-+ skb->ip_summed = CHECKSUM_PARTIAL;
-+ skb->truesize -= len;
-+
-+ /* Any change of skb->len requires recalculation of tso factor. */
-+ if (tcp_skb_pcount(skb) > 1)
-+ tcp_set_skb_tso_segs(meta_sk, skb, tcp_skb_mss(skb));
-+ packets_acked -= tcp_skb_pcount(skb);
-+
-+ if (packets_acked) {
-+ BUG_ON(tcp_skb_pcount(skb) == 0);
-+ BUG_ON(!before(TCP_SKB_CB(skb)->seq, TCP_SKB_CB(skb)->end_seq));
-+ }
-+
-+ return packets_acked;
-+}
-+
-+/**
-+ * Cleans the meta-socket retransmission queue and the reinject-queue.
-+ * @sk must be the metasocket.
-+ */
-+static void mptcp_clean_rtx_queue(struct sock *meta_sk, u32 prior_snd_una)
-+{
-+ struct sk_buff *skb, *tmp;
-+ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
-+ struct mptcp_cb *mpcb = meta_tp->mpcb;
-+ bool acked = false;
-+ u32 acked_pcount;
-+
-+ while ((skb = tcp_write_queue_head(meta_sk)) &&
-+ skb != tcp_send_head(meta_sk)) {
-+ bool fully_acked = true;
-+
-+ if (before(meta_tp->snd_una, TCP_SKB_CB(skb)->end_seq)) {
-+ if (tcp_skb_pcount(skb) == 1 ||
-+ !after(meta_tp->snd_una, TCP_SKB_CB(skb)->seq))
-+ break;
-+
-+ acked_pcount = tcp_tso_acked(meta_sk, skb);
-+ if (!acked_pcount)
-+ break;
-+
-+ fully_acked = false;
-+ } else {
-+ acked_pcount = tcp_skb_pcount(skb);
-+ }
-+
-+ acked = true;
-+ meta_tp->packets_out -= acked_pcount;
-+ meta_tp->retrans_stamp = 0;
-+
-+ if (!fully_acked)
-+ break;
-+
-+ tcp_unlink_write_queue(skb, meta_sk);
-+
-+ if (mptcp_is_data_fin(skb)) {
-+ struct sock *sk_it;
-+
-+ /* DATA_FIN has been acknowledged - now we can close
-+ * the subflows
-+ */
-+ mptcp_for_each_sk(mpcb, sk_it) {
-+ unsigned long delay = 0;
-+
-+ /* If we are the passive closer, don't trigger
-+ * subflow-fin until the subflow has been finned
-+ * by the peer - thus we add a delay.
-+ */
-+ if (mpcb->passive_close &&
-+ sk_it->sk_state == TCP_ESTABLISHED)
-+ delay = inet_csk(sk_it)->icsk_rto << 3;
-+
-+ mptcp_sub_close(sk_it, delay);
-+ }
-+ }
-+ sk_wmem_free_skb(meta_sk, skb);
-+ }
-+ /* Remove acknowledged data from the reinject queue */
-+ skb_queue_walk_safe(&mpcb->reinject_queue, skb, tmp) {
-+ if (before(meta_tp->snd_una, TCP_SKB_CB(skb)->end_seq)) {
-+ if (tcp_skb_pcount(skb) == 1 ||
-+ !after(meta_tp->snd_una, TCP_SKB_CB(skb)->seq))
-+ break;
-+
-+ mptcp_tso_acked_reinject(meta_sk, skb);
-+ break;
-+ }
-+
-+ __skb_unlink(skb, &mpcb->reinject_queue);
-+ __kfree_skb(skb);
-+ }
-+
-+ if (likely(between(meta_tp->snd_up, prior_snd_una, meta_tp->snd_una)))
-+ meta_tp->snd_up = meta_tp->snd_una;
-+
-+ if (acked) {
-+ tcp_rearm_rto(meta_sk);
-+ /* Normally this is done in tcp_try_undo_loss - but MPTCP
-+ * does not call this function.
-+ */
-+ inet_csk(meta_sk)->icsk_retransmits = 0;
-+ }
-+}
-+
-+/* Inspired by tcp_rcv_state_process */
-+static int mptcp_rcv_state_process(struct sock *meta_sk, struct sock *sk,
-+ const struct sk_buff *skb, u32 data_seq,
-+ u16 data_len)
-+{
-+ struct tcp_sock *meta_tp = tcp_sk(meta_sk), *tp = tcp_sk(sk);
-+ const struct tcphdr *th = tcp_hdr(skb);
-+
-+ /* State-machine handling if FIN has been enqueued and he has
-+ * been acked (snd_una == write_seq) - it's important that this
-+ * here is after sk_wmem_free_skb because otherwise
-+ * sk_forward_alloc is wrong upon inet_csk_destroy_sock()
-+ */
-+ switch (meta_sk->sk_state) {
-+ case TCP_FIN_WAIT1: {
-+ struct dst_entry *dst;
-+ int tmo;
-+
-+ if (meta_tp->snd_una != meta_tp->write_seq)
-+ break;
-+
-+ tcp_set_state(meta_sk, TCP_FIN_WAIT2);
-+ meta_sk->sk_shutdown |= SEND_SHUTDOWN;
-+
-+ dst = __sk_dst_get(sk);
-+ if (dst)
-+ dst_confirm(dst);
-+
-+ if (!sock_flag(meta_sk, SOCK_DEAD)) {
-+ /* Wake up lingering close() */
-+ meta_sk->sk_state_change(meta_sk);
-+ break;
-+ }
-+
-+ if (meta_tp->linger2 < 0 ||
-+ (data_len &&
-+ after(data_seq + data_len - (mptcp_is_data_fin2(skb, tp) ? 1 : 0),
-+ meta_tp->rcv_nxt))) {
-+ mptcp_send_active_reset(meta_sk, GFP_ATOMIC);
-+ tcp_done(meta_sk);
-+ NET_INC_STATS_BH(sock_net(meta_sk), LINUX_MIB_TCPABORTONDATA);
-+ return 1;
-+ }
-+
-+ tmo = tcp_fin_time(meta_sk);
-+ if (tmo > TCP_TIMEWAIT_LEN) {
-+ inet_csk_reset_keepalive_timer(meta_sk, tmo - TCP_TIMEWAIT_LEN);
-+ } else if (mptcp_is_data_fin2(skb, tp) || sock_owned_by_user(meta_sk)) {
-+ /* Bad case. We could lose such FIN otherwise.
-+ * It is not a big problem, but it looks confusing
-+ * and not so rare event. We still can lose it now,
-+ * if it spins in bh_lock_sock(), but it is really
-+ * marginal case.
-+ */
-+ inet_csk_reset_keepalive_timer(meta_sk, tmo);
-+ } else {
-+ meta_tp->ops->time_wait(meta_sk, TCP_FIN_WAIT2, tmo);
-+ }
-+ break;
-+ }
-+ case TCP_CLOSING:
-+ case TCP_LAST_ACK:
-+ if (meta_tp->snd_una == meta_tp->write_seq) {
-+ tcp_done(meta_sk);
-+ return 1;
-+ }
-+ break;
-+ }
-+
-+ /* step 7: process the segment text */
-+ switch (meta_sk->sk_state) {
-+ case TCP_FIN_WAIT1:
-+ case TCP_FIN_WAIT2:
-+ /* RFC 793 says to queue data in these states,
-+ * RFC 1122 says we MUST send a reset.
-+ * BSD 4.4 also does reset.
-+ */
-+ if (meta_sk->sk_shutdown & RCV_SHUTDOWN) {
-+ if (TCP_SKB_CB(skb)->end_seq != TCP_SKB_CB(skb)->seq &&
-+ after(TCP_SKB_CB(skb)->end_seq - th->fin, tp->rcv_nxt) &&
-+ !mptcp_is_data_fin2(skb, tp)) {
-+ NET_INC_STATS_BH(sock_net(meta_sk), LINUX_MIB_TCPABORTONDATA);
-+ mptcp_send_active_reset(meta_sk, GFP_ATOMIC);
-+ tcp_reset(meta_sk);
-+ return 1;
-+ }
-+ }
-+ break;
-+ }
-+
-+ return 0;
-+}
-+
-+/**
-+ * @return:
-+ * i) 1: Everything's fine.
-+ * ii) -1: A reset has been sent on the subflow - csum-failure
-+ * iii) 0: csum-failure but no reset sent, because it's the last subflow.
-+ * Last packet should not be destroyed by the caller because it has
-+ * been done here.
-+ */
-+static int mptcp_verif_dss_csum(struct sock *sk)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ struct sk_buff *tmp, *tmp1, *last = NULL;
-+ __wsum csum_tcp = 0; /* cumulative checksum of pld + mptcp-header */
-+ int ans = 1, overflowed = 0, offset = 0, dss_csum_added = 0;
-+ int iter = 0;
-+
-+ skb_queue_walk_safe(&sk->sk_receive_queue, tmp, tmp1) {
-+ unsigned int csum_len;
-+
-+ if (before(tp->mptcp->map_subseq + tp->mptcp->map_data_len, TCP_SKB_CB(tmp)->end_seq))
-+ /* Mapping ends in the middle of the packet -
-+ * csum only these bytes
-+ */
-+ csum_len = tp->mptcp->map_subseq + tp->mptcp->map_data_len - TCP_SKB_CB(tmp)->seq;
-+ else
-+ csum_len = tmp->len;
-+
-+ offset = 0;
-+ if (overflowed) {
-+ char first_word[4];
-+ first_word[0] = 0;
-+ first_word[1] = 0;
-+ first_word[2] = 0;
-+ first_word[3] = *(tmp->data);
-+ csum_tcp = csum_partial(first_word, 4, csum_tcp);
-+ offset = 1;
-+ csum_len--;
-+ overflowed = 0;
-+ }
-+
-+ csum_tcp = skb_checksum(tmp, offset, csum_len, csum_tcp);
-+
-+ /* Was it an odd length? Then we have to merge the next byte
-+ * correctly (see above)
-+ */
-+ if (csum_len != (csum_len & (~1)))
-+ overflowed = 1;
-+
-+ if (mptcp_is_data_seq(tmp) && !dss_csum_added) {
-+ __be32 data_seq = htonl((u32)(tp->mptcp->map_data_seq >> 32));
-+
-+ /* If a 64-bit dss is present, we increase the offset
-+ * by 4 bytes, as the high-order 64-bits will be added
-+ * in the final csum_partial-call.
-+ */
-+ u32 offset = skb_transport_offset(tmp) +
-+ TCP_SKB_CB(tmp)->dss_off;
-+ if (TCP_SKB_CB(tmp)->mptcp_flags & MPTCPHDR_SEQ64_SET)
-+ offset += 4;
-+
-+ csum_tcp = skb_checksum(tmp, offset,
-+ MPTCP_SUB_LEN_SEQ_CSUM,
-+ csum_tcp);
-+
-+ csum_tcp = csum_partial(&data_seq,
-+ sizeof(data_seq), csum_tcp);
-+
-+ dss_csum_added = 1; /* Just do it once */
-+ }
-+ last = tmp;
-+ iter++;
-+
-+ if (!skb_queue_is_last(&sk->sk_receive_queue, tmp) &&
-+ !before(TCP_SKB_CB(tmp1)->seq,
-+ tp->mptcp->map_subseq + tp->mptcp->map_data_len))
-+ break;
-+ }
-+
-+ /* Now, checksum must be 0 */
-+ if (unlikely(csum_fold(csum_tcp))) {
-+ pr_err("%s csum is wrong: %#x data_seq %u dss_csum_added %d overflowed %d iterations %d\n",
-+ __func__, csum_fold(csum_tcp), TCP_SKB_CB(last)->seq,
-+ dss_csum_added, overflowed, iter);
-+
-+ tp->mptcp->send_mp_fail = 1;
-+
-+ /* map_data_seq is the data-seq number of the
-+ * mapping we are currently checking
-+ */
-+ tp->mpcb->csum_cutoff_seq = tp->mptcp->map_data_seq;
-+
-+ if (tp->mpcb->cnt_subflows > 1) {
-+ mptcp_send_reset(sk);
-+ ans = -1;
-+ } else {
-+ tp->mpcb->send_infinite_mapping = 1;
-+
-+ /* Need to purge the rcv-queue as it is no longer valid */
-+ while ((tmp = __skb_dequeue(&sk->sk_receive_queue)) != NULL) {
-+ tp->copied_seq = TCP_SKB_CB(tmp)->end_seq;
-+ kfree_skb(tmp);
-+ }
-+
-+ ans = 0;
-+ }
-+ }
-+
-+ return ans;
-+}
-+
-+static inline void mptcp_prepare_skb(struct sk_buff *skb,
-+ const struct sock *sk)
-+{
-+ const struct tcp_sock *tp = tcp_sk(sk);
-+ struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
-+ u32 inc = 0;
-+
-+ /* If skb is the end of this mapping (end is always at mapping-boundary
-+ * thanks to the splitting/trimming), then we need to increase
-+ * data-end-seq by 1 if this here is a data-fin.
-+ *
-+ * We need to do -1 because end_seq includes the subflow-FIN.
-+ */
-+ if (tp->mptcp->map_data_fin &&
-+ (tcb->end_seq - (tcp_hdr(skb)->fin ? 1 : 0)) ==
-+ (tp->mptcp->map_subseq + tp->mptcp->map_data_len)) {
-+ inc = 1;
-+
-+ /* We manually set the fin-flag if it is a data-fin, for easy
-+ * processing in tcp_recvmsg.
-+ */
-+ tcp_hdr(skb)->fin = 1;
-+ } else {
-+ /* We may have a subflow-fin with data but without data-fin */
-+ tcp_hdr(skb)->fin = 0;
-+ }
-+
-+ /* Adapt the data-seqs to the packet itself. We essentially transform
-+ * the dss-mapping to a per-packet granularity. This is necessary to
-+ * correctly handle overlapping mappings coming from different
-+ * subflows. Otherwise it would be a complete mess.
-+ */
-+ tcb->seq = ((u32)tp->mptcp->map_data_seq) + tcb->seq - tp->mptcp->map_subseq;
-+ tcb->end_seq = tcb->seq + skb->len + inc;
-+}
-+
-+/**
-+ * @return: 1 if the segment has been eaten and can be suppressed,
-+ * otherwise 0.
-+ */
-+static inline int mptcp_direct_copy(const struct sk_buff *skb,
-+ struct sock *meta_sk)
-+{
-+ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
-+ int chunk = min_t(unsigned int, skb->len, meta_tp->ucopy.len);
-+ int eaten = 0;
-+
-+ __set_current_state(TASK_RUNNING);
-+
-+ local_bh_enable();
-+ if (!skb_copy_datagram_iovec(skb, 0, meta_tp->ucopy.iov, chunk)) {
-+ meta_tp->ucopy.len -= chunk;
-+ meta_tp->copied_seq += chunk;
-+ eaten = (chunk == skb->len);
-+ tcp_rcv_space_adjust(meta_sk);
-+ }
-+ local_bh_disable();
-+ return eaten;
-+}
-+
-+static inline void mptcp_reset_mapping(struct tcp_sock *tp)
-+{
-+ tp->mptcp->map_data_len = 0;
-+ tp->mptcp->map_data_seq = 0;
-+ tp->mptcp->map_subseq = 0;
-+ tp->mptcp->map_data_fin = 0;
-+ tp->mptcp->mapping_present = 0;
-+}
-+
-+/* The DSS-mapping received on the sk only covers the second half of the skb
-+ * (cut at seq). We trim the head from the skb.
-+ * Data will be freed upon kfree().
-+ *
-+ * Inspired by tcp_trim_head().
-+ */
-+static void mptcp_skb_trim_head(struct sk_buff *skb, struct sock *sk, u32 seq)
-+{
-+ int len = seq - TCP_SKB_CB(skb)->seq;
-+ u32 new_seq = TCP_SKB_CB(skb)->seq + len;
-+
-+ if (len < skb_headlen(skb))
-+ __skb_pull(skb, len);
-+ else
-+ __pskb_trim_head(skb, len - skb_headlen(skb));
-+
-+ TCP_SKB_CB(skb)->seq = new_seq;
-+
-+ skb->truesize -= len;
-+ atomic_sub(len, &sk->sk_rmem_alloc);
-+ sk_mem_uncharge(sk, len);
-+}
-+
-+/* The DSS-mapping received on the sk only covers the first half of the skb
-+ * (cut at seq). We create a second skb (@return), and queue it in the rcv-queue
-+ * as further packets may resolve the mapping of the second half of data.
-+ *
-+ * Inspired by tcp_fragment().
-+ */
-+static int mptcp_skb_split_tail(struct sk_buff *skb, struct sock *sk, u32 seq)
-+{
-+ struct sk_buff *buff;
-+ int nsize;
-+ int nlen, len;
-+
-+ len = seq - TCP_SKB_CB(skb)->seq;
-+ nsize = skb_headlen(skb) - len + tcp_sk(sk)->tcp_header_len;
-+ if (nsize < 0)
-+ nsize = 0;
-+
-+ /* Get a new skb... force flag on. */
-+ buff = alloc_skb(nsize, GFP_ATOMIC);
-+ if (buff == NULL)
-+ return -ENOMEM;
-+
-+ skb_reserve(buff, tcp_sk(sk)->tcp_header_len);
-+ skb_reset_transport_header(buff);
-+
-+ tcp_hdr(buff)->fin = tcp_hdr(skb)->fin;
-+ tcp_hdr(skb)->fin = 0;
-+
-+ /* We absolutely need to call skb_set_owner_r before refreshing the
-+ * truesize of buff, otherwise the moved data will be accounted twice.
-+ */
-+ skb_set_owner_r(buff, sk);
-+ nlen = skb->len - len - nsize;
-+ buff->truesize += nlen;
-+ skb->truesize -= nlen;
-+
-+ /* Correct the sequence numbers. */
-+ TCP_SKB_CB(buff)->seq = TCP_SKB_CB(skb)->seq + len;
-+ TCP_SKB_CB(buff)->end_seq = TCP_SKB_CB(skb)->end_seq;
-+ TCP_SKB_CB(skb)->end_seq = TCP_SKB_CB(buff)->seq;
-+
-+ skb_split(skb, buff, len);
-+
-+ __skb_queue_after(&sk->sk_receive_queue, skb, buff);
-+
-+ return 0;
-+}
-+
-+/* @return: 0 everything is fine. Just continue processing
-+ * 1 subflow is broken stop everything
-+ * -1 this packet was broken - continue with the next one.
-+ */
-+static int mptcp_prevalidate_skb(struct sock *sk, struct sk_buff *skb)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+
-+ /* If we are in infinite mode, the subflow-fin is in fact a data-fin. */
-+ if (!skb->len && tcp_hdr(skb)->fin && !mptcp_is_data_fin(skb) &&
-+ !tp->mpcb->infinite_mapping_rcv) {
-+ /* Remove a pure subflow-fin from the queue and increase
-+ * copied_seq.
-+ */
-+ tp->copied_seq = TCP_SKB_CB(skb)->end_seq;
-+ __skb_unlink(skb, &sk->sk_receive_queue);
-+ __kfree_skb(skb);
-+ return -1;
-+ }
-+
-+ /* If we are not yet fully established and do not know the mapping for
-+ * this segment, this path has to fallback to infinite or be torn down.
-+ */
-+ if (!tp->mptcp->fully_established && !mptcp_is_data_seq(skb) &&
-+ !tp->mptcp->mapping_present && !tp->mpcb->infinite_mapping_rcv) {
-+ pr_err("%s %#x will fallback - pi %d from %pS, seq %u\n",
-+ __func__, tp->mpcb->mptcp_loc_token,
-+ tp->mptcp->path_index, __builtin_return_address(0),
-+ TCP_SKB_CB(skb)->seq);
-+
-+ if (!is_master_tp(tp)) {
-+ mptcp_send_reset(sk);
-+ return 1;
-+ }
-+
-+ tp->mpcb->infinite_mapping_snd = 1;
-+ tp->mpcb->infinite_mapping_rcv = 1;
-+ /* We do a seamless fallback and should not send an infinite mapping. */
-+ tp->mpcb->send_infinite_mapping = 0;
-+ tp->mptcp->fully_established = 1;
-+ }
-+
-+ /* Receiver-side becomes fully established when a whole rcv-window has
-+ * been received without the need to fallback due to the previous
-+ * condition.
-+ */
-+ if (!tp->mptcp->fully_established) {
-+ tp->mptcp->init_rcv_wnd -= skb->len;
-+ if (tp->mptcp->init_rcv_wnd < 0)
-+ mptcp_become_fully_estab(sk);
-+ }
-+
-+ return 0;
-+}
-+
-+/* @return: 0 everything is fine. Just continue processing
-+ * 1 subflow is broken stop everything
-+ * -1 this packet was broken - continue with the next one.
-+ */
-+static int mptcp_detect_mapping(struct sock *sk, struct sk_buff *skb)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk), *meta_tp = mptcp_meta_tp(tp);
-+ struct mptcp_cb *mpcb = tp->mpcb;
-+ struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
-+ u32 *ptr;
-+ u32 data_seq, sub_seq, data_len, tcp_end_seq;
-+
-+ /* If we are in infinite-mapping-mode, the subflow is guaranteed to be
-+ * in-order at the data-level. Thus data-seq-numbers can be inferred
-+ * from what is expected at the data-level.
-+ */
-+ if (mpcb->infinite_mapping_rcv) {
-+ tp->mptcp->map_data_seq = mptcp_get_rcv_nxt_64(meta_tp);
-+ tp->mptcp->map_subseq = tcb->seq;
-+ tp->mptcp->map_data_len = skb->len;
-+ tp->mptcp->map_data_fin = tcp_hdr(skb)->fin;
-+ tp->mptcp->mapping_present = 1;
-+ return 0;
-+ }
-+
-+ /* No mapping here? Exit - it is either already set or still on its way */
-+ if (!mptcp_is_data_seq(skb)) {
-+ /* Too many packets without a mapping - this subflow is broken */
-+ if (!tp->mptcp->mapping_present &&
-+ tp->rcv_nxt - tp->copied_seq > 65536) {
-+ mptcp_send_reset(sk);
-+ return 1;
-+ }
-+
-+ return 0;
-+ }
-+
-+ ptr = mptcp_skb_set_data_seq(skb, &data_seq, mpcb);
-+ ptr++;
-+ sub_seq = get_unaligned_be32(ptr) + tp->mptcp->rcv_isn;
-+ ptr++;
-+ data_len = get_unaligned_be16(ptr);
-+
-+ /* If it's an empty skb with DATA_FIN, sub_seq must get fixed.
-+ * The draft sets it to 0, but we really would like to have the
-+ * real value, to have an easy handling afterwards here in this
-+ * function.
-+ */
-+ if (mptcp_is_data_fin(skb) && skb->len == 0)
-+ sub_seq = TCP_SKB_CB(skb)->seq;
-+
-+ /* If there is already a mapping - we check if it maps with the current
-+ * one. If not - we reset.
-+ */
-+ if (tp->mptcp->mapping_present &&
-+ (data_seq != (u32)tp->mptcp->map_data_seq ||
-+ sub_seq != tp->mptcp->map_subseq ||
-+ data_len != tp->mptcp->map_data_len + tp->mptcp->map_data_fin ||
-+ mptcp_is_data_fin(skb) != tp->mptcp->map_data_fin)) {
-+ /* Mapping in packet is different from what we want */
-+ pr_err("%s Mappings do not match!\n", __func__);
-+ pr_err("%s dseq %u mdseq %u, sseq %u msseq %u dlen %u mdlen %u dfin %d mdfin %d\n",
-+ __func__, data_seq, (u32)tp->mptcp->map_data_seq,
-+ sub_seq, tp->mptcp->map_subseq, data_len,
-+ tp->mptcp->map_data_len, mptcp_is_data_fin(skb),
-+ tp->mptcp->map_data_fin);
-+ mptcp_send_reset(sk);
-+ return 1;
-+ }
-+
-+ /* If the previous check was good, the current mapping is valid and we exit. */
-+ if (tp->mptcp->mapping_present)
-+ return 0;
-+
-+ /* Mapping not yet set on this subflow - we set it here! */
-+
-+ if (!data_len) {
-+ mpcb->infinite_mapping_rcv = 1;
-+ tp->mptcp->fully_established = 1;
-+ /* We need to repeat mp_fail's until the sender fell
-+ * back to infinite-mapping - here we stop repeating it.
-+ */
-+ tp->mptcp->send_mp_fail = 0;
-+
-+ /* We have to fixup data_len - it must be the same as skb->len */
-+ data_len = skb->len + (mptcp_is_data_fin(skb) ? 1 : 0);
-+ sub_seq = tcb->seq;
-+
-+ /* TODO kill all other subflows than this one */
-+ /* data_seq and so on are set correctly */
-+
-+ /* At this point, the meta-ofo-queue has to be emptied,
-+ * as the following data is guaranteed to be in-order at
-+ * the data and subflow-level
-+ */
-+ mptcp_purge_ofo_queue(meta_tp);
-+ }
-+
-+ /* We are sending mp-fail's and thus are in fallback mode.
-+ * Ignore packets which do not announce the fallback and still
-+ * want to provide a mapping.
-+ */
-+ if (tp->mptcp->send_mp_fail) {
-+ tp->copied_seq = TCP_SKB_CB(skb)->end_seq;
-+ __skb_unlink(skb, &sk->sk_receive_queue);
-+ __kfree_skb(skb);
-+ return -1;
-+ }
-+
-+ /* FIN increased the mapping-length by 1 */
-+ if (mptcp_is_data_fin(skb))
-+ data_len--;
-+
-+ /* The subflow-sequences of the packet must be
-+ * (at least partially) part of the DSS-mapping's
-+ * subflow-sequence-space.
-+ *
-+ * Basically the mapping is not valid, if either of the
-+ * following conditions is true:
-+ *
-+ * 1. It's not a data_fin and
-+ * MPTCP-sub_seq >= TCP-end_seq
-+ *
-+ * 2. It's a data_fin and TCP-end_seq > TCP-seq and
-+ * MPTCP-sub_seq >= TCP-end_seq
-+ *
-+ * The previous two can be merged into:
-+ * TCP-end_seq > TCP-seq and MPTCP-sub_seq >= TCP-end_seq
-+ * Because if it's not a data-fin, TCP-end_seq > TCP-seq
-+ *
-+ * 3. It's a data_fin and skb->len == 0 and
-+ * MPTCP-sub_seq > TCP-end_seq
-+ *
-+ * 4. It's not a data_fin and TCP-end_seq > TCP-seq and
-+ * MPTCP-sub_seq + MPTCP-data_len <= TCP-seq
-+ *
-+ * 5. MPTCP-sub_seq is prior to what we already copied (copied_seq)
-+ */
-+
-+ /* subflow-fin is not part of the mapping - ignore it here ! */
-+ tcp_end_seq = tcb->end_seq - tcp_hdr(skb)->fin;
-+ if ((!before(sub_seq, tcb->end_seq) && after(tcp_end_seq, tcb->seq)) ||
-+ (mptcp_is_data_fin(skb) && skb->len == 0 && after(sub_seq, tcb->end_seq)) ||
-+ (!after(sub_seq + data_len, tcb->seq) && after(tcp_end_seq, tcb->seq)) ||
-+ before(sub_seq, tp->copied_seq)) {
-+ /* Subflow-sequences of packet is different from what is in the
-+ * packet's dss-mapping. The peer is misbehaving - reset
-+ */
-+ pr_err("%s Packet's mapping does not map to the DSS sub_seq %u "
-+ "end_seq %u, tcp_end_seq %u seq %u dfin %u len %u data_len %u "
-+ "copied_seq %u\n", __func__, sub_seq, tcb->end_seq, tcp_end_seq, tcb->seq, mptcp_is_data_fin(skb),
-+ skb->len, data_len, tp->copied_seq);
-+ mptcp_send_reset(sk);
-+ return 1;
-+ }
-+
-+ /* Did the DSS have 64-bit seqnums? */
-+ if (!(tcb->mptcp_flags & MPTCPHDR_SEQ64_SET)) {
-+ /* Wrapped around? */
-+ if (unlikely(after(data_seq, meta_tp->rcv_nxt) && data_seq < meta_tp->rcv_nxt)) {
-+ tp->mptcp->map_data_seq = mptcp_get_data_seq_64(mpcb, !mpcb->rcv_hiseq_index, data_seq);
-+ } else {
-+ /* Else, access the default high-order bits */
-+ tp->mptcp->map_data_seq = mptcp_get_data_seq_64(mpcb, mpcb->rcv_hiseq_index, data_seq);
-+ }
-+ } else {
-+ tp->mptcp->map_data_seq = mptcp_get_data_seq_64(mpcb, (tcb->mptcp_flags & MPTCPHDR_SEQ64_INDEX) ? 1 : 0, data_seq);
-+
-+ if (unlikely(tcb->mptcp_flags & MPTCPHDR_SEQ64_OFO)) {
-+ /* We make sure that the data_seq is invalid.
-+ * It will be dropped later.
-+ */
-+ tp->mptcp->map_data_seq += 0xFFFFFFFF;
-+ tp->mptcp->map_data_seq += 0xFFFFFFFF;
-+ }
-+ }
-+
-+ tp->mptcp->map_data_len = data_len;
-+ tp->mptcp->map_subseq = sub_seq;
-+ tp->mptcp->map_data_fin = mptcp_is_data_fin(skb) ? 1 : 0;
-+ tp->mptcp->mapping_present = 1;
-+
-+ return 0;
-+}
-+
-+/* Similar to tcp_sequence(...) */
-+static inline bool mptcp_sequence(const struct tcp_sock *meta_tp,
-+ u64 data_seq, u64 end_data_seq)
-+{
-+ const struct mptcp_cb *mpcb = meta_tp->mpcb;
-+ u64 rcv_wup64;
-+
-+ /* Wrap-around? */
-+ if (meta_tp->rcv_wup > meta_tp->rcv_nxt) {
-+ rcv_wup64 = ((u64)(mpcb->rcv_high_order[mpcb->rcv_hiseq_index] - 1) << 32) |
-+ meta_tp->rcv_wup;
-+ } else {
-+ rcv_wup64 = mptcp_get_data_seq_64(mpcb, mpcb->rcv_hiseq_index,
-+ meta_tp->rcv_wup);
-+ }
-+
-+ return !before64(end_data_seq, rcv_wup64) &&
-+ !after64(data_seq, mptcp_get_rcv_nxt_64(meta_tp) + tcp_receive_window(meta_tp));
-+}
-+
-+/* @return: 0 everything is fine. Just continue processing
-+ * -1 this packet was broken - continue with the next one.
-+ */
-+static int mptcp_validate_mapping(struct sock *sk, struct sk_buff *skb)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ struct sk_buff *tmp, *tmp1;
-+ u32 tcp_end_seq;
-+
-+ if (!tp->mptcp->mapping_present)
-+ return 0;
-+
-+ /* either, the new skb gave us the mapping and the first segment
-+ * in the sub-rcv-queue has to be trimmed ...
-+ */
-+ tmp = skb_peek(&sk->sk_receive_queue);
-+ if (before(TCP_SKB_CB(tmp)->seq, tp->mptcp->map_subseq) &&
-+ after(TCP_SKB_CB(tmp)->end_seq, tp->mptcp->map_subseq))
-+ mptcp_skb_trim_head(tmp, sk, tp->mptcp->map_subseq);
-+
-+ /* ... or the new skb (tail) has to be split at the end. */
-+ tcp_end_seq = TCP_SKB_CB(skb)->end_seq - (tcp_hdr(skb)->fin ? 1 : 0);
-+ if (after(tcp_end_seq, tp->mptcp->map_subseq + tp->mptcp->map_data_len)) {
-+ u32 seq = tp->mptcp->map_subseq + tp->mptcp->map_data_len;
-+ if (mptcp_skb_split_tail(skb, sk, seq)) { /* Allocation failed */
-+ /* TODO : maybe handle this here better.
-+ * We now just force meta-retransmission.
-+ */
-+ tp->copied_seq = TCP_SKB_CB(skb)->end_seq;
-+ __skb_unlink(skb, &sk->sk_receive_queue);
-+ __kfree_skb(skb);
-+ return -1;
-+ }
-+ }
-+
-+ /* Now, remove old sk_buff's from the receive-queue.
-+ * This may happen if the mapping has been lost for these segments and
-+ * the next mapping has already been received.
-+ */
-+ if (before(TCP_SKB_CB(skb_peek(&sk->sk_receive_queue))->seq, tp->mptcp->map_subseq)) {
-+ skb_queue_walk_safe(&sk->sk_receive_queue, tmp1, tmp) {
-+ if (!before(TCP_SKB_CB(tmp1)->seq, tp->mptcp->map_subseq))
-+ break;
-+
-+ tp->copied_seq = TCP_SKB_CB(tmp1)->end_seq;
-+ __skb_unlink(tmp1, &sk->sk_receive_queue);
-+
-+ /* Impossible that we could free the skb here, because its
-+ * mapping is known to be valid from the previous checks
-+ */
-+ __kfree_skb(tmp1);
-+ }
-+ }
-+
-+ return 0;
-+}
-+
-+/* @return: 0 everything is fine. Just continue processing
-+ * 1 subflow is broken stop everything
-+ * -1 this mapping has been put in the meta-receive-queue
-+ * -2 this mapping has been eaten by the application
-+ */
-+static int mptcp_queue_skb(struct sock *sk)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk), *meta_tp = mptcp_meta_tp(tp);
-+ struct sock *meta_sk = mptcp_meta_sk(sk);
-+ struct mptcp_cb *mpcb = tp->mpcb;
-+ struct sk_buff *tmp, *tmp1;
-+ u64 rcv_nxt64 = mptcp_get_rcv_nxt_64(meta_tp);
-+ bool data_queued = false;
-+
-+ /* Have we not yet received the full mapping? */
-+ if (!tp->mptcp->mapping_present ||
-+ before(tp->rcv_nxt, tp->mptcp->map_subseq + tp->mptcp->map_data_len))
-+ return 0;
-+
-+ /* Is this an overlapping mapping? rcv_nxt >= end_data_seq
-+ * OR
-+ * This mapping is out of window
-+ */
-+ if (!before64(rcv_nxt64, tp->mptcp->map_data_seq + tp->mptcp->map_data_len + tp->mptcp->map_data_fin) ||
-+ !mptcp_sequence(meta_tp, tp->mptcp->map_data_seq,
-+ tp->mptcp->map_data_seq + tp->mptcp->map_data_len + tp->mptcp->map_data_fin)) {
-+ skb_queue_walk_safe(&sk->sk_receive_queue, tmp1, tmp) {
-+ __skb_unlink(tmp1, &sk->sk_receive_queue);
-+ tp->copied_seq = TCP_SKB_CB(tmp1)->end_seq;
-+ __kfree_skb(tmp1);
-+
-+ if (!skb_queue_empty(&sk->sk_receive_queue) &&
-+ !before(TCP_SKB_CB(tmp)->seq,
-+ tp->mptcp->map_subseq + tp->mptcp->map_data_len))
-+ break;
-+ }
-+
-+ mptcp_reset_mapping(tp);
-+
-+ return -1;
-+ }
-+
-+ /* Record it, because we want to send our data_fin on the same path */
-+ if (tp->mptcp->map_data_fin) {
-+ mpcb->dfin_path_index = tp->mptcp->path_index;
-+ mpcb->dfin_combined = !!(sk->sk_shutdown & RCV_SHUTDOWN);
-+ }
-+
-+ /* Verify the checksum */
-+ if (mpcb->dss_csum && !mpcb->infinite_mapping_rcv) {
-+ int ret = mptcp_verif_dss_csum(sk);
-+
-+ if (ret <= 0) {
-+ mptcp_reset_mapping(tp);
-+ return 1;
-+ }
-+ }
-+
-+ if (before64(rcv_nxt64, tp->mptcp->map_data_seq)) {
-+ /* Seg's have to go to the meta-ofo-queue */
-+ skb_queue_walk_safe(&sk->sk_receive_queue, tmp1, tmp) {
-+ tp->copied_seq = TCP_SKB_CB(tmp1)->end_seq;
-+ mptcp_prepare_skb(tmp1, sk);
-+ __skb_unlink(tmp1, &sk->sk_receive_queue);
-+ /* MUST be done here, because fragstolen may be true later.
-+ * Then, kfree_skb_partial will not account the memory.
-+ */
-+ skb_orphan(tmp1);
-+
-+ if (!mpcb->in_time_wait) /* In time-wait, do not receive data */
-+ mptcp_add_meta_ofo_queue(meta_sk, tmp1, sk);
-+ else
-+ __kfree_skb(tmp1);
-+
-+ if (!skb_queue_empty(&sk->sk_receive_queue) &&
-+ !before(TCP_SKB_CB(tmp)->seq,
-+ tp->mptcp->map_subseq + tp->mptcp->map_data_len))
-+ break;
-+ }
-+ tcp_enter_quickack_mode(sk);
-+ } else {
-+ /* Ready for the meta-rcv-queue */
-+ skb_queue_walk_safe(&sk->sk_receive_queue, tmp1, tmp) {
-+ int eaten = 0;
-+ bool copied_early = false;
-+ bool fragstolen = false;
-+ u32 old_rcv_nxt = meta_tp->rcv_nxt;
-+
-+ tp->copied_seq = TCP_SKB_CB(tmp1)->end_seq;
-+ mptcp_prepare_skb(tmp1, sk);
-+ __skb_unlink(tmp1, &sk->sk_receive_queue);
-+ /* MUST be done here, because fragstolen may be true.
-+ * Then, kfree_skb_partial will not account the memory.
-+ */
-+ skb_orphan(tmp1);
-+
-+ /* This segment has already been received */
-+ if (!after(TCP_SKB_CB(tmp1)->end_seq, meta_tp->rcv_nxt)) {
-+ __kfree_skb(tmp1);
-+ goto next;
-+ }
-+
-+#ifdef CONFIG_NET_DMA
-+ if (TCP_SKB_CB(tmp1)->seq == meta_tp->rcv_nxt &&
-+ meta_tp->ucopy.task == current &&
-+ meta_tp->copied_seq == meta_tp->rcv_nxt &&
-+ tmp1->len <= meta_tp->ucopy.len &&
-+ sock_owned_by_user(meta_sk) &&
-+ tcp_dma_try_early_copy(meta_sk, tmp1, 0)) {
-+ copied_early = true;
-+ eaten = 1;
-+ }
-+#endif
-+
-+ /* Is direct copy possible ? */
-+ if (TCP_SKB_CB(tmp1)->seq == meta_tp->rcv_nxt &&
-+ meta_tp->ucopy.task == current &&
-+ meta_tp->copied_seq == meta_tp->rcv_nxt &&
-+ meta_tp->ucopy.len && sock_owned_by_user(meta_sk) &&
-+ !copied_early)
-+ eaten = mptcp_direct_copy(tmp1, meta_sk);
-+
-+ if (mpcb->in_time_wait) /* In time-wait, do not receive data */
-+ eaten = 1;
-+
-+ if (!eaten)
-+ eaten = tcp_queue_rcv(meta_sk, tmp1, 0, &fragstolen);
-+
-+ meta_tp->rcv_nxt = TCP_SKB_CB(tmp1)->end_seq;
-+ mptcp_check_rcvseq_wrap(meta_tp, old_rcv_nxt);
-+
-+#ifdef CONFIG_NET_DMA
-+ if (copied_early)
-+ meta_tp->cleanup_rbuf(meta_sk, tmp1->len);
-+#endif
-+
-+ if (tcp_hdr(tmp1)->fin && !mpcb->in_time_wait)
-+ mptcp_fin(meta_sk);
-+
-+ /* Check if this fills a gap in the ofo queue */
-+ if (!skb_queue_empty(&meta_tp->out_of_order_queue))
-+ mptcp_ofo_queue(meta_sk);
-+
-+#ifdef CONFIG_NET_DMA
-+ if (copied_early)
-+ __skb_queue_tail(&meta_sk->sk_async_wait_queue,
-+ tmp1);
-+ else
-+#endif
-+ if (eaten)
-+ kfree_skb_partial(tmp1, fragstolen);
-+
-+ data_queued = true;
-+next:
-+ if (!skb_queue_empty(&sk->sk_receive_queue) &&
-+ !before(TCP_SKB_CB(tmp)->seq,
-+ tp->mptcp->map_subseq + tp->mptcp->map_data_len))
-+ break;
-+ }
-+ }
-+
-+ inet_csk(meta_sk)->icsk_ack.lrcvtime = tcp_time_stamp;
-+ mptcp_reset_mapping(tp);
-+
-+ return data_queued ? -1 : -2;
-+}
-+
-+void mptcp_data_ready(struct sock *sk)
-+{
-+ struct sock *meta_sk = mptcp_meta_sk(sk);
-+ struct sk_buff *skb, *tmp;
-+ int queued = 0;
-+
-+ /* restart before the check, because mptcp_fin might have changed the
-+ * state.
-+ */
-+restart:
-+ /* If the meta cannot receive data, there is no point in pushing data.
-+ * If we are in time-wait, we may still be waiting for the final FIN.
-+ * So, we should proceed with the processing.
-+ */
-+ if (!mptcp_sk_can_recv(meta_sk) && !tcp_sk(sk)->mpcb->in_time_wait) {
-+ skb_queue_purge(&sk->sk_receive_queue);
-+ tcp_sk(sk)->copied_seq = tcp_sk(sk)->rcv_nxt;
-+ goto exit;
-+ }
-+
-+ /* Iterate over all segments, detect their mapping (if we don't have
-+ * one yet), validate them and push everything one level higher.
-+ */
-+ skb_queue_walk_safe(&sk->sk_receive_queue, skb, tmp) {
-+ int ret;
-+ /* Pre-validation - e.g., early fallback */
-+ ret = mptcp_prevalidate_skb(sk, skb);
-+ if (ret < 0)
-+ goto restart;
-+ else if (ret > 0)
-+ break;
-+
-+ /* Set the current mapping */
-+ ret = mptcp_detect_mapping(sk, skb);
-+ if (ret < 0)
-+ goto restart;
-+ else if (ret > 0)
-+ break;
-+
-+ /* Validation */
-+ if (mptcp_validate_mapping(sk, skb) < 0)
-+ goto restart;
-+
-+ /* Push a level higher */
-+ ret = mptcp_queue_skb(sk);
-+ if (ret < 0) {
-+ if (ret == -1)
-+ queued = ret;
-+ goto restart;
-+ } else if (ret == 0) {
-+ continue;
-+ } else { /* ret == 1 */
-+ break;
-+ }
-+ }
-+
-+exit:
-+ if (tcp_sk(sk)->close_it) {
-+ tcp_send_ack(sk);
-+ tcp_sk(sk)->ops->time_wait(sk, TCP_TIME_WAIT, 0);
-+ }
-+
-+ if (queued == -1 && !sock_flag(meta_sk, SOCK_DEAD))
-+ meta_sk->sk_data_ready(meta_sk);
-+}
-+
-+
-+int mptcp_check_req(struct sk_buff *skb, struct net *net)
-+{
-+ const struct tcphdr *th = tcp_hdr(skb);
-+ struct sock *meta_sk = NULL;
-+
-+ /* MPTCP structures not initialized */
-+ if (mptcp_init_failed)
-+ return 0;
-+
-+ if (skb->protocol == htons(ETH_P_IP))
-+ meta_sk = mptcp_v4_search_req(th->source, ip_hdr(skb)->saddr,
-+ ip_hdr(skb)->daddr, net);
-+#if IS_ENABLED(CONFIG_IPV6)
-+ else /* IPv6 */
-+ meta_sk = mptcp_v6_search_req(th->source, &ipv6_hdr(skb)->saddr,
-+ &ipv6_hdr(skb)->daddr, net);
-+#endif /* CONFIG_IPV6 */
-+
-+ if (!meta_sk)
-+ return 0;
-+
-+ TCP_SKB_CB(skb)->mptcp_flags |= MPTCPHDR_JOIN;
-+
-+ bh_lock_sock_nested(meta_sk);
-+ if (sock_owned_by_user(meta_sk)) {
-+ skb->sk = meta_sk;
-+ if (unlikely(sk_add_backlog(meta_sk, skb,
-+ meta_sk->sk_rcvbuf + meta_sk->sk_sndbuf))) {
-+ bh_unlock_sock(meta_sk);
-+ NET_INC_STATS_BH(net, LINUX_MIB_TCPBACKLOGDROP);
-+ sock_put(meta_sk); /* Taken by mptcp_search_req */
-+ kfree_skb(skb);
-+ return 1;
-+ }
-+ } else if (skb->protocol == htons(ETH_P_IP)) {
-+ tcp_v4_do_rcv(meta_sk, skb);
-+#if IS_ENABLED(CONFIG_IPV6)
-+ } else { /* IPv6 */
-+ tcp_v6_do_rcv(meta_sk, skb);
-+#endif /* CONFIG_IPV6 */
-+ }
-+ bh_unlock_sock(meta_sk);
-+ sock_put(meta_sk); /* Taken by mptcp_vX_search_req */
-+ return 1;
-+}
-+
-+struct mp_join *mptcp_find_join(const struct sk_buff *skb)
-+{
-+ const struct tcphdr *th = tcp_hdr(skb);
-+ unsigned char *ptr;
-+ int length = (th->doff * 4) - sizeof(struct tcphdr);
-+
-+ /* Jump through the options to check whether JOIN is there */
-+ ptr = (unsigned char *)(th + 1);
-+ while (length > 0) {
-+ int opcode = *ptr++;
-+ int opsize;
-+
-+ switch (opcode) {
-+ case TCPOPT_EOL:
-+ return NULL;
-+ case TCPOPT_NOP: /* Ref: RFC 793 section 3.1 */
-+ length--;
-+ continue;
-+ default:
-+ opsize = *ptr++;
-+ if (opsize < 2) /* "silly options" */
-+ return NULL;
-+ if (opsize > length)
-+ return NULL; /* don't parse partial options */
-+ if (opcode == TCPOPT_MPTCP &&
-+ ((struct mptcp_option *)(ptr - 2))->sub == MPTCP_SUB_JOIN) {
-+ return (struct mp_join *)(ptr - 2);
-+ }
-+ ptr += opsize - 2;
-+ length -= opsize;
-+ }
-+ }
-+ return NULL;
-+}
-+
-+int mptcp_lookup_join(struct sk_buff *skb, struct inet_timewait_sock *tw)
-+{
-+ const struct mptcp_cb *mpcb;
-+ struct sock *meta_sk;
-+ u32 token;
-+ bool meta_v4;
-+ struct mp_join *join_opt = mptcp_find_join(skb);
-+ if (!join_opt)
-+ return 0;
-+
-+ /* MPTCP structures were not initialized, so return error */
-+ if (mptcp_init_failed)
-+ return -1;
-+
-+ token = join_opt->u.syn.token;
-+ meta_sk = mptcp_hash_find(dev_net(skb_dst(skb)->dev), token);
-+ if (!meta_sk) {
-+ mptcp_debug("%s:mpcb not found:%x\n", __func__, token);
-+ return -1;
-+ }
-+
-+ meta_v4 = meta_sk->sk_family == AF_INET;
-+ if (meta_v4) {
-+ if (skb->protocol == htons(ETH_P_IPV6)) {
-+ mptcp_debug("SYN+MP_JOIN with IPV6 address on pure IPV4 meta\n");
-+ sock_put(meta_sk); /* Taken by mptcp_hash_find */
-+ return -1;
-+ }
-+ } else if (skb->protocol == htons(ETH_P_IP) &&
-+ inet6_sk(meta_sk)->ipv6only) {
-+ mptcp_debug("SYN+MP_JOIN with IPV4 address on IPV6_V6ONLY meta\n");
-+ sock_put(meta_sk); /* Taken by mptcp_hash_find */
-+ return -1;
-+ }
-+
-+ mpcb = tcp_sk(meta_sk)->mpcb;
-+ if (mpcb->infinite_mapping_rcv || mpcb->send_infinite_mapping) {
-+ /* We are in fallback-mode on the reception-side -
-+ * no new subflows!
-+ */
-+ sock_put(meta_sk); /* Taken by mptcp_hash_find */
-+ return -1;
-+ }
-+
-+ /* Coming from time-wait-sock processing in tcp_v4_rcv.
-+ * We have to deschedule it before continuing, because otherwise
-+ * mptcp_v4_do_rcv will hit again on it inside tcp_v4_hnd_req.
-+ */
-+ if (tw) {
-+ inet_twsk_deschedule(tw, &tcp_death_row);
-+ inet_twsk_put(tw);
-+ }
-+
-+ TCP_SKB_CB(skb)->mptcp_flags |= MPTCPHDR_JOIN;
-+ /* OK, this is a new syn/join, let's create a new open request and
-+ * send syn+ack
-+ */
-+ bh_lock_sock_nested(meta_sk);
-+ if (sock_owned_by_user(meta_sk)) {
-+ skb->sk = meta_sk;
-+ if (unlikely(sk_add_backlog(meta_sk, skb,
-+ meta_sk->sk_rcvbuf + meta_sk->sk_sndbuf))) {
-+ bh_unlock_sock(meta_sk);
-+ NET_INC_STATS_BH(sock_net(meta_sk),
-+ LINUX_MIB_TCPBACKLOGDROP);
-+ sock_put(meta_sk); /* Taken by mptcp_hash_find */
-+ kfree_skb(skb);
-+ return 1;
-+ }
-+ } else if (skb->protocol == htons(ETH_P_IP)) {
-+ tcp_v4_do_rcv(meta_sk, skb);
-+#if IS_ENABLED(CONFIG_IPV6)
-+ } else {
-+ tcp_v6_do_rcv(meta_sk, skb);
-+#endif /* CONFIG_IPV6 */
-+ }
-+ bh_unlock_sock(meta_sk);
-+ sock_put(meta_sk); /* Taken by mptcp_hash_find */
-+ return 1;
-+}
-+
-+int mptcp_do_join_short(struct sk_buff *skb,
-+ const struct mptcp_options_received *mopt,
-+ struct net *net)
-+{
-+ struct sock *meta_sk;
-+ u32 token;
-+ bool meta_v4;
-+
-+ token = mopt->mptcp_rem_token;
-+ meta_sk = mptcp_hash_find(net, token);
-+ if (!meta_sk) {
-+ mptcp_debug("%s:mpcb not found:%x\n", __func__, token);
-+ return -1;
-+ }
-+
-+ meta_v4 = meta_sk->sk_family == AF_INET;
-+ if (meta_v4) {
-+ if (skb->protocol == htons(ETH_P_IPV6)) {
-+ mptcp_debug("SYN+MP_JOIN with IPV6 address on pure IPV4 meta\n");
-+ sock_put(meta_sk); /* Taken by mptcp_hash_find */
-+ return -1;
-+ }
-+ } else if (skb->protocol == htons(ETH_P_IP) &&
-+ inet6_sk(meta_sk)->ipv6only) {
-+ mptcp_debug("SYN+MP_JOIN with IPV4 address on IPV6_V6ONLY meta\n");
-+ sock_put(meta_sk); /* Taken by mptcp_hash_find */
-+ return -1;
-+ }
-+
-+ TCP_SKB_CB(skb)->mptcp_flags |= MPTCPHDR_JOIN;
-+
-+ /* OK, this is a new syn/join, let's create a new open request and
-+ * send syn+ack
-+ */
-+ bh_lock_sock(meta_sk);
-+
-+ /* This check is also done in mptcp_vX_do_rcv. But there we cannot
-+ * call tcp_vX_send_reset, because we already hold two socket-locks
-+ * (the listener and the meta from above)
-+ *
-+ * And the send-reset will try to take yet another one (ip_send_reply).
-+ * Thus, we propagate the reset up to tcp_rcv_state_process.
-+ */
-+ if (tcp_sk(meta_sk)->mpcb->infinite_mapping_rcv ||
-+ tcp_sk(meta_sk)->mpcb->send_infinite_mapping ||
-+ meta_sk->sk_state == TCP_CLOSE || !tcp_sk(meta_sk)->inside_tk_table) {
-+ bh_unlock_sock(meta_sk);
-+ sock_put(meta_sk); /* Taken by mptcp_hash_find */
-+ return -1;
-+ }
-+
-+ if (sock_owned_by_user(meta_sk)) {
-+ skb->sk = meta_sk;
-+ if (unlikely(sk_add_backlog(meta_sk, skb,
-+ meta_sk->sk_rcvbuf + meta_sk->sk_sndbuf)))
-+ NET_INC_STATS_BH(net, LINUX_MIB_TCPBACKLOGDROP);
-+ else
-+ /* Must make sure that upper layers won't free the
-+ * skb if it is added to the backlog-queue.
-+ */
-+ skb_get(skb);
-+ } else {
-+ /* mptcp_v4_do_rcv tries to free the skb - we prevent this, as
-+ * the skb will finally be freed by tcp_v4_do_rcv (where we are
-+ * coming from)
-+ */
-+ skb_get(skb);
-+ if (skb->protocol == htons(ETH_P_IP)) {
-+ tcp_v4_do_rcv(meta_sk, skb);
-+#if IS_ENABLED(CONFIG_IPV6)
-+ } else { /* IPv6 */
-+ tcp_v6_do_rcv(meta_sk, skb);
-+#endif /* CONFIG_IPV6 */
-+ }
-+ }
-+
-+ bh_unlock_sock(meta_sk);
-+ sock_put(meta_sk); /* Taken by mptcp_hash_find */
-+ return 0;
-+}
-+
-+/**
-+ * The MPTCP equivalent of tcp_fin(). May only be called when the FIN
-+ * is a valid part of the data seqnum space; not before, while there
-+ * are still holes.
-+ */
-+void mptcp_fin(struct sock *meta_sk)
-+{
-+ struct sock *sk = NULL, *sk_it;
-+ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
-+ struct mptcp_cb *mpcb = meta_tp->mpcb;
-+
-+ mptcp_for_each_sk(mpcb, sk_it) {
-+ if (tcp_sk(sk_it)->mptcp->path_index == mpcb->dfin_path_index) {
-+ sk = sk_it;
-+ break;
-+ }
-+ }
-+
-+ if (!sk || sk->sk_state == TCP_CLOSE)
-+ sk = mptcp_select_ack_sock(meta_sk);
-+
-+ inet_csk_schedule_ack(sk);
-+
-+ meta_sk->sk_shutdown |= RCV_SHUTDOWN;
-+ sock_set_flag(meta_sk, SOCK_DONE);
-+
-+ switch (meta_sk->sk_state) {
-+ case TCP_SYN_RECV:
-+ case TCP_ESTABLISHED:
-+ /* Move to CLOSE_WAIT */
-+ tcp_set_state(meta_sk, TCP_CLOSE_WAIT);
-+ inet_csk(sk)->icsk_ack.pingpong = 1;
-+ break;
-+
-+ case TCP_CLOSE_WAIT:
-+ case TCP_CLOSING:
-+ /* Received a retransmission of the FIN, do
-+ * nothing.
-+ */
-+ break;
-+ case TCP_LAST_ACK:
-+ /* RFC793: Remain in the LAST-ACK state. */
-+ break;
-+
-+ case TCP_FIN_WAIT1:
-+ /* This case occurs when a simultaneous close
-+ * happens, we must ack the received FIN and
-+ * enter the CLOSING state.
-+ */
-+ tcp_send_ack(sk);
-+ tcp_set_state(meta_sk, TCP_CLOSING);
-+ break;
-+ case TCP_FIN_WAIT2:
-+ /* Received a FIN -- send ACK and enter TIME_WAIT. */
-+ tcp_send_ack(sk);
-+ meta_tp->ops->time_wait(meta_sk, TCP_TIME_WAIT, 0);
-+ break;
-+ default:
-+ /* Only TCP_LISTEN and TCP_CLOSE are left, in these
-+ * cases we should never reach this piece of code.
-+ */
-+ pr_err("%s: Impossible, meta_sk->sk_state=%d\n", __func__,
-+ meta_sk->sk_state);
-+ break;
-+ }
-+
-+ /* It _is_ possible, that we have something out-of-order _after_ FIN.
-+ * Probably, we should reset in this case. For now drop them.
-+ */
-+ mptcp_purge_ofo_queue(meta_tp);
-+ sk_mem_reclaim(meta_sk);
-+
-+ if (!sock_flag(meta_sk, SOCK_DEAD)) {
-+ meta_sk->sk_state_change(meta_sk);
-+
-+ /* Do not send POLL_HUP for half duplex close. */
-+ if (meta_sk->sk_shutdown == SHUTDOWN_MASK ||
-+ meta_sk->sk_state == TCP_CLOSE)
-+ sk_wake_async(meta_sk, SOCK_WAKE_WAITD, POLL_HUP);
-+ else
-+ sk_wake_async(meta_sk, SOCK_WAKE_WAITD, POLL_IN);
-+ }
-+
-+ return;
-+}
-+
-+static void mptcp_xmit_retransmit_queue(struct sock *meta_sk)
-+{
-+ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
-+ struct sk_buff *skb;
-+
-+ if (!meta_tp->packets_out)
-+ return;
-+
-+ tcp_for_write_queue(skb, meta_sk) {
-+ if (skb == tcp_send_head(meta_sk))
-+ break;
-+
-+ if (mptcp_retransmit_skb(meta_sk, skb))
-+ return;
-+
-+ if (skb == tcp_write_queue_head(meta_sk))
-+ inet_csk_reset_xmit_timer(meta_sk, ICSK_TIME_RETRANS,
-+ inet_csk(meta_sk)->icsk_rto,
-+ TCP_RTO_MAX);
-+ }
-+}
-+
-+/* Handle the DATA_ACK */
-+static void mptcp_data_ack(struct sock *sk, const struct sk_buff *skb)
-+{
-+ struct sock *meta_sk = mptcp_meta_sk(sk);
-+ struct tcp_sock *meta_tp = tcp_sk(meta_sk), *tp = tcp_sk(sk);
-+ struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
-+ u32 prior_snd_una = meta_tp->snd_una;
-+ int prior_packets;
-+ u32 nwin, data_ack, data_seq;
-+ u16 data_len = 0;
-+
-+ /* A valid packet came in - subflow is operational again */
-+ tp->pf = 0;
-+
-+ /* Even if there is no data-ack, we stop retransmitting.
-+ * Except if this is a SYN/ACK. Then it is just a retransmission
-+ */
-+ if (tp->mptcp->pre_established && !tcp_hdr(skb)->syn) {
-+ tp->mptcp->pre_established = 0;
-+ sk_stop_timer(sk, &tp->mptcp->mptcp_ack_timer);
-+ }
-+
-+ /* If we are in infinite mapping mode, rx_opt.data_ack has been
-+ * set by mptcp_clean_rtx_infinite.
-+ */
-+ if (!(tcb->mptcp_flags & MPTCPHDR_ACK) && !tp->mpcb->infinite_mapping_snd)
-+ goto exit;
-+
-+ data_ack = tp->mptcp->rx_opt.data_ack;
-+
-+ if (unlikely(!tp->mptcp->fully_established) &&
-+ tp->mptcp->snt_isn + 1 != TCP_SKB_CB(skb)->ack_seq)
-+ /* As soon as a subflow-data-ack (not acking syn, thus snt_isn + 1)
-+ * includes a data-ack, we are fully established
-+ */
-+ mptcp_become_fully_estab(sk);
-+
-+ /* Get the data_seq */
-+ if (mptcp_is_data_seq(skb)) {
-+ data_seq = tp->mptcp->rx_opt.data_seq;
-+ data_len = tp->mptcp->rx_opt.data_len;
-+ } else {
-+ data_seq = meta_tp->snd_wl1;
-+ }
-+
-+ /* If the ack is older than previous acks
-+ * then we can probably ignore it.
-+ */
-+ if (before(data_ack, prior_snd_una))
-+ goto exit;
-+
-+ /* If the ack includes data we haven't sent yet, discard
-+ * this segment (RFC793 Section 3.9).
-+ */
-+ if (after(data_ack, meta_tp->snd_nxt))
-+ goto exit;
-+
-+ /*** Now, update the window - inspired by tcp_ack_update_window ***/
-+ nwin = ntohs(tcp_hdr(skb)->window);
-+
-+ if (likely(!tcp_hdr(skb)->syn))
-+ nwin <<= tp->rx_opt.snd_wscale;
-+
-+ if (tcp_may_update_window(meta_tp, data_ack, data_seq, nwin)) {
-+ tcp_update_wl(meta_tp, data_seq);
-+
-+ /* Draft v09, Section 3.3.5:
-+ * [...] It should only update its local receive window values
-+ * when the largest sequence number allowed (i.e. DATA_ACK +
-+ * receive window) increases. [...]
-+ */
-+ if (meta_tp->snd_wnd != nwin &&
-+ !before(data_ack + nwin, tcp_wnd_end(meta_tp))) {
-+ meta_tp->snd_wnd = nwin;
-+
-+ if (nwin > meta_tp->max_window)
-+ meta_tp->max_window = nwin;
-+ }
-+ }
-+ /*** Done, update the window ***/
-+
-+ /* We passed data and got it acked, remove any soft error
-+ * log. Something worked...
-+ */
-+ sk->sk_err_soft = 0;
-+ inet_csk(meta_sk)->icsk_probes_out = 0;
-+ meta_tp->rcv_tstamp = tcp_time_stamp;
-+ prior_packets = meta_tp->packets_out;
-+ if (!prior_packets)
-+ goto no_queue;
-+
-+ meta_tp->snd_una = data_ack;
-+
-+ mptcp_clean_rtx_queue(meta_sk, prior_snd_una);
-+
-+ /* We are in loss-state, and something got acked, retransmit the whole
-+ * queue now!
-+ */
-+ if (inet_csk(meta_sk)->icsk_ca_state == TCP_CA_Loss &&
-+ after(data_ack, prior_snd_una)) {
-+ mptcp_xmit_retransmit_queue(meta_sk);
-+ inet_csk(meta_sk)->icsk_ca_state = TCP_CA_Open;
-+ }
-+
-+ /* Simplified version of tcp_new_space, because the snd-buffer
-+ * is handled by all the subflows.
-+ */
-+ if (sock_flag(meta_sk, SOCK_QUEUE_SHRUNK)) {
-+ sock_reset_flag(meta_sk, SOCK_QUEUE_SHRUNK);
-+ if (meta_sk->sk_socket &&
-+ test_bit(SOCK_NOSPACE, &meta_sk->sk_socket->flags))
-+ meta_sk->sk_write_space(meta_sk);
-+ }
-+
-+ if (meta_sk->sk_state != TCP_ESTABLISHED &&
-+ mptcp_rcv_state_process(meta_sk, sk, skb, data_seq, data_len))
-+ return;
-+
-+exit:
-+ mptcp_push_pending_frames(meta_sk);
-+
-+ return;
-+
-+no_queue:
-+ if (tcp_send_head(meta_sk))
-+ tcp_ack_probe(meta_sk);
-+
-+ mptcp_push_pending_frames(meta_sk);
-+
-+ return;
-+}
-+
-+void mptcp_clean_rtx_infinite(const struct sk_buff *skb, struct sock *sk)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk), *meta_tp = tcp_sk(mptcp_meta_sk(sk));
-+
-+ if (!tp->mpcb->infinite_mapping_snd)
-+ return;
-+
-+ /* The difference between both write_seq's represents the offset between
-+ * data-sequence and subflow-sequence. As we are infinite, this must
-+ * match.
-+ *
-+ * Thus, from this difference we can infer the meta snd_una.
-+ */
-+ tp->mptcp->rx_opt.data_ack = meta_tp->snd_nxt - tp->snd_nxt +
-+ tp->snd_una;
-+
-+ mptcp_data_ack(sk, skb);
-+}
-+
-+/**** static functions used by mptcp_parse_options */
-+
-+static void mptcp_send_reset_rem_id(const struct mptcp_cb *mpcb, u8 rem_id)
-+{
-+ struct sock *sk_it, *tmpsk;
-+
-+ mptcp_for_each_sk_safe(mpcb, sk_it, tmpsk) {
-+ if (tcp_sk(sk_it)->mptcp->rem_id == rem_id) {
-+ mptcp_reinject_data(sk_it, 0);
-+ sk_it->sk_err = ECONNRESET;
-+ if (tcp_need_reset(sk_it->sk_state))
-+ tcp_sk(sk_it)->ops->send_active_reset(sk_it,
-+ GFP_ATOMIC);
-+ mptcp_sub_force_close(sk_it);
-+ }
-+ }
-+}
-+
-+void mptcp_parse_options(const uint8_t *ptr, int opsize,
-+ struct mptcp_options_received *mopt,
-+ const struct sk_buff *skb)
-+{
-+ const struct mptcp_option *mp_opt = (struct mptcp_option *)ptr;
-+
-+ /* If the socket is mp-capable we would have a mopt. */
-+ if (!mopt)
-+ return;
-+
-+ switch (mp_opt->sub) {
-+ case MPTCP_SUB_CAPABLE:
-+ {
-+ const struct mp_capable *mpcapable = (struct mp_capable *)ptr;
-+
-+ if (opsize != MPTCP_SUB_LEN_CAPABLE_SYN &&
-+ opsize != MPTCP_SUB_LEN_CAPABLE_ACK) {
-+ mptcp_debug("%s: mp_capable: bad option size %d\n",
-+ __func__, opsize);
-+ break;
-+ }
-+
-+ if (!sysctl_mptcp_enabled)
-+ break;
-+
-+ /* We only support MPTCP version 0 */
-+ if (mpcapable->ver != 0)
-+ break;
-+
-+ /* MPTCP-RFC 6824:
-+ * "If receiving a message with the 'B' flag set to 1, and this
-+ * is not understood, then this SYN MUST be silently ignored;
-+ */
-+ if (mpcapable->b) {
-+ mopt->drop_me = 1;
-+ break;
-+ }
-+
-+ /* MPTCP-RFC 6824:
-+ * "An implementation that only supports this method MUST set
-+ * bit "H" to 1, and bits "C" through "G" to 0."
-+ */
-+ if (!mpcapable->h)
-+ break;
-+
-+ mopt->saw_mpc = 1;
-+ mopt->dss_csum = sysctl_mptcp_checksum || mpcapable->a;
-+
-+ if (opsize >= MPTCP_SUB_LEN_CAPABLE_SYN)
-+ mopt->mptcp_key = mpcapable->sender_key;
-+
-+ break;
-+ }
-+ case MPTCP_SUB_JOIN:
-+ {
-+ const struct mp_join *mpjoin = (struct mp_join *)ptr;
-+
-+ if (opsize != MPTCP_SUB_LEN_JOIN_SYN &&
-+ opsize != MPTCP_SUB_LEN_JOIN_SYNACK &&
-+ opsize != MPTCP_SUB_LEN_JOIN_ACK) {
-+ mptcp_debug("%s: mp_join: bad option size %d\n",
-+ __func__, opsize);
-+ break;
-+ }
-+
-+ /* saw_mpc must be set, because in tcp_check_req we assume that
-+ * it is set to support falling back to reg. TCP if a rexmitted
-+ * SYN has no MP_CAPABLE or MP_JOIN
-+ */
-+ switch (opsize) {
-+ case MPTCP_SUB_LEN_JOIN_SYN:
-+ mopt->is_mp_join = 1;
-+ mopt->saw_mpc = 1;
-+ mopt->low_prio = mpjoin->b;
-+ mopt->rem_id = mpjoin->addr_id;
-+ mopt->mptcp_rem_token = mpjoin->u.syn.token;
-+ mopt->mptcp_recv_nonce = mpjoin->u.syn.nonce;
-+ break;
-+ case MPTCP_SUB_LEN_JOIN_SYNACK:
-+ mopt->saw_mpc = 1;
-+ mopt->low_prio = mpjoin->b;
-+ mopt->rem_id = mpjoin->addr_id;
-+ mopt->mptcp_recv_tmac = mpjoin->u.synack.mac;
-+ mopt->mptcp_recv_nonce = mpjoin->u.synack.nonce;
-+ break;
-+ case MPTCP_SUB_LEN_JOIN_ACK:
-+ mopt->saw_mpc = 1;
-+ mopt->join_ack = 1;
-+ memcpy(mopt->mptcp_recv_mac, mpjoin->u.ack.mac, 20);
-+ break;
-+ }
-+ break;
-+ }
-+ case MPTCP_SUB_DSS:
-+ {
-+ const struct mp_dss *mdss = (struct mp_dss *)ptr;
-+ struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
-+
-+ /* We check opsize for the csum and non-csum case. We do this,
-+ * because the draft says that the csum SHOULD be ignored if
-+ * it has not been negotiated in the MP_CAPABLE but still is
-+ * present in the data.
-+ *
-+ * It will get ignored later in mptcp_queue_skb.
-+ */
-+ if (opsize != mptcp_sub_len_dss(mdss, 0) &&
-+ opsize != mptcp_sub_len_dss(mdss, 1)) {
-+ mptcp_debug("%s: mp_dss: bad option size %d\n",
-+ __func__, opsize);
-+ break;
-+ }
-+
-+ ptr += 4;
-+
-+ if (mdss->A) {
-+ tcb->mptcp_flags |= MPTCPHDR_ACK;
-+
-+ if (mdss->a) {
-+ mopt->data_ack = (u32) get_unaligned_be64(ptr);
-+ ptr += MPTCP_SUB_LEN_ACK_64;
-+ } else {
-+ mopt->data_ack = get_unaligned_be32(ptr);
-+ ptr += MPTCP_SUB_LEN_ACK;
-+ }
-+ }
-+
-+ tcb->dss_off = (ptr - skb_transport_header(skb));
-+
-+ if (mdss->M) {
-+ if (mdss->m) {
-+ u64 data_seq64 = get_unaligned_be64(ptr);
-+
-+ tcb->mptcp_flags |= MPTCPHDR_SEQ64_SET;
-+ mopt->data_seq = (u32) data_seq64;
-+
-+ ptr += 12; /* 64-bit dseq + subseq */
-+ } else {
-+ mopt->data_seq = get_unaligned_be32(ptr);
-+ ptr += 8; /* 32-bit dseq + subseq */
-+ }
-+ mopt->data_len = get_unaligned_be16(ptr);
-+
-+ tcb->mptcp_flags |= MPTCPHDR_SEQ;
-+
-+ /* Is a check-sum present? */
-+ if (opsize == mptcp_sub_len_dss(mdss, 1))
-+ tcb->mptcp_flags |= MPTCPHDR_DSS_CSUM;
-+
-+ /* DATA_FIN only possible with DSS-mapping */
-+ if (mdss->F)
-+ tcb->mptcp_flags |= MPTCPHDR_FIN;
-+ }
-+
-+ break;
-+ }
-+ case MPTCP_SUB_ADD_ADDR:
-+ {
-+#if IS_ENABLED(CONFIG_IPV6)
-+ const struct mp_add_addr *mpadd = (struct mp_add_addr *)ptr;
-+
-+ if ((mpadd->ipver == 4 && opsize != MPTCP_SUB_LEN_ADD_ADDR4 &&
-+ opsize != MPTCP_SUB_LEN_ADD_ADDR4 + 2) ||
-+ (mpadd->ipver == 6 && opsize != MPTCP_SUB_LEN_ADD_ADDR6 &&
-+ opsize != MPTCP_SUB_LEN_ADD_ADDR6 + 2)) {
-+#else
-+ if (opsize != MPTCP_SUB_LEN_ADD_ADDR4 &&
-+ opsize != MPTCP_SUB_LEN_ADD_ADDR4 + 2) {
-+#endif /* CONFIG_IPV6 */
-+ mptcp_debug("%s: mp_add_addr: bad option size %d\n",
-+ __func__, opsize);
-+ break;
-+ }
-+
-+ /* We have to manually parse the options if we got two of them. */
-+ if (mopt->saw_add_addr) {
-+ mopt->more_add_addr = 1;
-+ break;
-+ }
-+ mopt->saw_add_addr = 1;
-+ mopt->add_addr_ptr = ptr;
-+ break;
-+ }
-+ case MPTCP_SUB_REMOVE_ADDR:
-+ if ((opsize - MPTCP_SUB_LEN_REMOVE_ADDR) < 0) {
-+ mptcp_debug("%s: mp_remove_addr: bad option size %d\n",
-+ __func__, opsize);
-+ break;
-+ }
-+
-+ if (mopt->saw_rem_addr) {
-+ mopt->more_rem_addr = 1;
-+ break;
-+ }
-+ mopt->saw_rem_addr = 1;
-+ mopt->rem_addr_ptr = ptr;
-+ break;
-+ case MPTCP_SUB_PRIO:
-+ {
-+ const struct mp_prio *mpprio = (struct mp_prio *)ptr;
-+
-+ if (opsize != MPTCP_SUB_LEN_PRIO &&
-+ opsize != MPTCP_SUB_LEN_PRIO_ADDR) {
-+ mptcp_debug("%s: mp_prio: bad option size %d\n",
-+ __func__, opsize);
-+ break;
-+ }
-+
-+ mopt->saw_low_prio = 1;
-+ mopt->low_prio = mpprio->b;
-+
-+ if (opsize == MPTCP_SUB_LEN_PRIO_ADDR) {
-+ mopt->saw_low_prio = 2;
-+ mopt->prio_addr_id = mpprio->addr_id;
-+ }
-+ break;
-+ }
-+ case MPTCP_SUB_FAIL:
-+ if (opsize != MPTCP_SUB_LEN_FAIL) {
-+ mptcp_debug("%s: mp_fail: bad option size %d\n",
-+ __func__, opsize);
-+ break;
-+ }
-+ mopt->mp_fail = 1;
-+ break;
-+ case MPTCP_SUB_FCLOSE:
-+ if (opsize != MPTCP_SUB_LEN_FCLOSE) {
-+ mptcp_debug("%s: mp_fclose: bad option size %d\n",
-+ __func__, opsize);
-+ break;
-+ }
-+
-+ mopt->mp_fclose = 1;
-+ mopt->mptcp_key = ((struct mp_fclose *)ptr)->key;
-+
-+ break;
-+ default:
-+ mptcp_debug("%s: Received unknown subtype: %d\n",
-+ __func__, mp_opt->sub);
-+ break;
-+ }
-+}
-+
-+/** Parse only MPTCP options */
-+void tcp_parse_mptcp_options(const struct sk_buff *skb,
-+ struct mptcp_options_received *mopt)
-+{
-+ const struct tcphdr *th = tcp_hdr(skb);
-+ int length = (th->doff * 4) - sizeof(struct tcphdr);
-+ const unsigned char *ptr = (const unsigned char *)(th + 1);
-+
-+ while (length > 0) {
-+ int opcode = *ptr++;
-+ int opsize;
-+
-+ switch (opcode) {
-+ case TCPOPT_EOL:
-+ return;
-+ case TCPOPT_NOP: /* Ref: RFC 793 section 3.1 */
-+ length--;
-+ continue;
-+ default:
-+ opsize = *ptr++;
-+ if (opsize < 2) /* "silly options" */
-+ return;
-+ if (opsize > length)
-+ return; /* don't parse partial options */
-+ if (opcode == TCPOPT_MPTCP)
-+ mptcp_parse_options(ptr - 2, opsize, mopt, skb);
-+ }
-+ ptr += opsize - 2;
-+ length -= opsize;
-+ }
-+}
-+
-+int mptcp_check_rtt(const struct tcp_sock *tp, int time)
-+{
-+ struct mptcp_cb *mpcb = tp->mpcb;
-+ struct sock *sk;
-+ u32 rtt_max = 0;
-+
-+ /* In MPTCP, we take the max delay across all flows,
-+ * in order to take into account meta-reordering buffers.
-+ */
-+ mptcp_for_each_sk(mpcb, sk) {
-+ if (!mptcp_sk_can_recv(sk))
-+ continue;
-+
-+ if (rtt_max < tcp_sk(sk)->rcv_rtt_est.rtt)
-+ rtt_max = tcp_sk(sk)->rcv_rtt_est.rtt;
-+ }
-+ if (time < (rtt_max >> 3) || !rtt_max)
-+ return 1;
-+
-+ return 0;
-+}
-+
-+static void mptcp_handle_add_addr(const unsigned char *ptr, struct sock *sk)
-+{
-+ struct mp_add_addr *mpadd = (struct mp_add_addr *)ptr;
-+ struct mptcp_cb *mpcb = tcp_sk(sk)->mpcb;
-+ __be16 port = 0;
-+ union inet_addr addr;
-+ sa_family_t family;
-+
-+ if (mpadd->ipver == 4) {
-+ if (mpadd->len == MPTCP_SUB_LEN_ADD_ADDR4 + 2)
-+ port = mpadd->u.v4.port;
-+ family = AF_INET;
-+ addr.in = mpadd->u.v4.addr;
-+#if IS_ENABLED(CONFIG_IPV6)
-+ } else if (mpadd->ipver == 6) {
-+ if (mpadd->len == MPTCP_SUB_LEN_ADD_ADDR6 + 2)
-+ port = mpadd->u.v6.port;
-+ family = AF_INET6;
-+ addr.in6 = mpadd->u.v6.addr;
-+#endif /* CONFIG_IPV6 */
-+ } else {
-+ return;
-+ }
-+
-+ if (mpcb->pm_ops->add_raddr)
-+ mpcb->pm_ops->add_raddr(mpcb, &addr, family, port, mpadd->addr_id);
-+}
-+
-+static void mptcp_handle_rem_addr(const unsigned char *ptr, struct sock *sk)
-+{
-+ struct mp_remove_addr *mprem = (struct mp_remove_addr *)ptr;
-+ int i;
-+ u8 rem_id;
-+ struct mptcp_cb *mpcb = tcp_sk(sk)->mpcb;
-+
-+ for (i = 0; i <= mprem->len - MPTCP_SUB_LEN_REMOVE_ADDR; i++) {
-+ rem_id = (&mprem->addrs_id)[i];
-+
-+ if (mpcb->pm_ops->rem_raddr)
-+ mpcb->pm_ops->rem_raddr(mpcb, rem_id);
-+ mptcp_send_reset_rem_id(mpcb, rem_id);
-+ }
-+}
-+
-+static void mptcp_parse_addropt(const struct sk_buff *skb, struct sock *sk)
-+{
-+ struct tcphdr *th = tcp_hdr(skb);
-+ unsigned char *ptr;
-+ int length = (th->doff * 4) - sizeof(struct tcphdr);
-+
-+ /* Jump through the options to check whether ADD_ADDR is there */
-+ ptr = (unsigned char *)(th + 1);
-+ while (length > 0) {
-+ int opcode = *ptr++;
-+ int opsize;
-+
-+ switch (opcode) {
-+ case TCPOPT_EOL:
-+ return;
-+ case TCPOPT_NOP:
-+ length--;
-+ continue;
-+ default:
-+ opsize = *ptr++;
-+ if (opsize < 2)
-+ return;
-+ if (opsize > length)
-+ return; /* don't parse partial options */
-+ if (opcode == TCPOPT_MPTCP &&
-+ ((struct mptcp_option *)ptr)->sub == MPTCP_SUB_ADD_ADDR) {
-+#if IS_ENABLED(CONFIG_IPV6)
-+ struct mp_add_addr *mpadd = (struct mp_add_addr *)ptr;
-+ if ((mpadd->ipver == 4 && opsize != MPTCP_SUB_LEN_ADD_ADDR4 &&
-+ opsize != MPTCP_SUB_LEN_ADD_ADDR4 + 2) ||
-+ (mpadd->ipver == 6 && opsize != MPTCP_SUB_LEN_ADD_ADDR6 &&
-+ opsize != MPTCP_SUB_LEN_ADD_ADDR6 + 2))
-+#else
-+ if (opsize != MPTCP_SUB_LEN_ADD_ADDR4 &&
-+ opsize != MPTCP_SUB_LEN_ADD_ADDR4 + 2)
-+#endif /* CONFIG_IPV6 */
-+ goto cont;
-+
-+ mptcp_handle_add_addr(ptr, sk);
-+ }
-+ if (opcode == TCPOPT_MPTCP &&
-+ ((struct mptcp_option *)ptr)->sub == MPTCP_SUB_REMOVE_ADDR) {
-+ if ((opsize - MPTCP_SUB_LEN_REMOVE_ADDR) < 0)
-+ goto cont;
-+
-+ mptcp_handle_rem_addr(ptr, sk);
-+ }
-+cont:
-+ ptr += opsize - 2;
-+ length -= opsize;
-+ }
-+ }
-+ return;
-+}
-+
-+static inline int mptcp_mp_fail_rcvd(struct sock *sk, const struct tcphdr *th)
-+{
-+ struct mptcp_tcp_sock *mptcp = tcp_sk(sk)->mptcp;
-+ struct sock *meta_sk = mptcp_meta_sk(sk);
-+ struct mptcp_cb *mpcb = tcp_sk(sk)->mpcb;
-+
-+ if (unlikely(mptcp->rx_opt.mp_fail)) {
-+ mptcp->rx_opt.mp_fail = 0;
-+
-+ if (!th->rst && !mpcb->infinite_mapping_snd) {
-+ struct sock *sk_it;
-+
-+ mpcb->send_infinite_mapping = 1;
-+ /* We resend everything that has not been acknowledged */
-+ meta_sk->sk_send_head = tcp_write_queue_head(meta_sk);
-+
-+ /* We artificially restart the whole send-queue. Thus,
-+ * it is as if no packets are in flight
-+ */
-+ tcp_sk(meta_sk)->packets_out = 0;
-+
-+ /* If the snd_nxt already wrapped around, we have to
-+ * undo the wrapping, as we are restarting from snd_una
-+ * on.
-+ */
-+ if (tcp_sk(meta_sk)->snd_nxt < tcp_sk(meta_sk)->snd_una) {
-+ mpcb->snd_high_order[mpcb->snd_hiseq_index] -= 2;
-+ mpcb->snd_hiseq_index = mpcb->snd_hiseq_index ? 0 : 1;
-+ }
-+ tcp_sk(meta_sk)->snd_nxt = tcp_sk(meta_sk)->snd_una;
-+
-+ /* Trigger a sending on the meta. */
-+ mptcp_push_pending_frames(meta_sk);
-+
-+ mptcp_for_each_sk(mpcb, sk_it) {
-+ if (sk != sk_it)
-+ mptcp_sub_force_close(sk_it);
-+ }
-+ }
-+
-+ return 0;
-+ }
-+
-+ if (unlikely(mptcp->rx_opt.mp_fclose)) {
-+ struct sock *sk_it, *tmpsk;
-+
-+ mptcp->rx_opt.mp_fclose = 0;
-+ if (mptcp->rx_opt.mptcp_key != mpcb->mptcp_loc_key)
-+ return 0;
-+
-+ if (tcp_need_reset(sk->sk_state))
-+ tcp_sk(sk)->ops->send_active_reset(sk, GFP_ATOMIC);
-+
-+ mptcp_for_each_sk_safe(mpcb, sk_it, tmpsk)
-+ mptcp_sub_force_close(sk_it);
-+
-+ tcp_reset(meta_sk);
-+
-+ return 1;
-+ }
-+
-+ return 0;
-+}
-+
-+static inline void mptcp_path_array_check(struct sock *meta_sk)
-+{
-+ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
-+
-+ if (unlikely(mpcb->list_rcvd)) {
-+ mpcb->list_rcvd = 0;
-+ if (mpcb->pm_ops->new_remote_address)
-+ mpcb->pm_ops->new_remote_address(meta_sk);
-+ }
-+}
-+
-+int mptcp_handle_options(struct sock *sk, const struct tcphdr *th,
-+ const struct sk_buff *skb)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ struct mptcp_options_received *mopt = &tp->mptcp->rx_opt;
-+
-+ if (tp->mpcb->infinite_mapping_rcv || tp->mpcb->infinite_mapping_snd)
-+ return 0;
-+
-+ if (mptcp_mp_fail_rcvd(sk, th))
-+ return 1;
-+
-+ /* RFC 6824, Section 3.3:
-+ * If a checksum is not present when its use has been negotiated, the
-+ * receiver MUST close the subflow with a RST as it is considered broken.
-+ */
-+ if (mptcp_is_data_seq(skb) && tp->mpcb->dss_csum &&
-+ !(TCP_SKB_CB(skb)->mptcp_flags & MPTCPHDR_DSS_CSUM)) {
-+ if (tcp_need_reset(sk->sk_state))
-+ tp->ops->send_active_reset(sk, GFP_ATOMIC);
-+
-+ mptcp_sub_force_close(sk);
-+ return 1;
-+ }
-+
-+ /* We have to acknowledge retransmissions of the third
-+ * ack.
-+ */
-+ if (mopt->join_ack) {
-+ tcp_send_delayed_ack(sk);
-+ mopt->join_ack = 0;
-+ }
-+
-+ if (mopt->saw_add_addr || mopt->saw_rem_addr) {
-+ if (mopt->more_add_addr || mopt->more_rem_addr) {
-+ mptcp_parse_addropt(skb, sk);
-+ } else {
-+ if (mopt->saw_add_addr)
-+ mptcp_handle_add_addr(mopt->add_addr_ptr, sk);
-+ if (mopt->saw_rem_addr)
-+ mptcp_handle_rem_addr(mopt->rem_addr_ptr, sk);
-+ }
-+
-+ mopt->more_add_addr = 0;
-+ mopt->saw_add_addr = 0;
-+ mopt->more_rem_addr = 0;
-+ mopt->saw_rem_addr = 0;
-+ }
-+ if (mopt->saw_low_prio) {
-+ if (mopt->saw_low_prio == 1) {
-+ tp->mptcp->rcv_low_prio = mopt->low_prio;
-+ } else {
-+ struct sock *sk_it;
-+ mptcp_for_each_sk(tp->mpcb, sk_it) {
-+ struct mptcp_tcp_sock *mptcp = tcp_sk(sk_it)->mptcp;
-+ if (mptcp->rem_id == mopt->prio_addr_id)
-+ mptcp->rcv_low_prio = mopt->low_prio;
-+ }
-+ }
-+ mopt->saw_low_prio = 0;
-+ }
-+
-+ mptcp_data_ack(sk, skb);
-+
-+ mptcp_path_array_check(mptcp_meta_sk(sk));
-+ /* Socket may have been mp_killed by a REMOVE_ADDR */
-+ if (tp->mp_killed)
-+ return 1;
-+
-+ return 0;
-+}
-+
-+/* In case of fastopen, some data can already be in the write queue.
-+ * We need to update the sequence number of the segments as they
-+ * were initially TCP sequence numbers.
-+ */
-+static void mptcp_rcv_synsent_fastopen(struct sock *meta_sk)
-+{
-+ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
-+ struct tcp_sock *master_tp = tcp_sk(meta_tp->mpcb->master_sk);
-+ struct sk_buff *skb;
-+ u32 new_mapping = meta_tp->write_seq - master_tp->snd_una;
-+
-+ /* There should only be one skb in write queue: the data not
-+ * acknowledged in the SYN+ACK. In this case, we need to map
-+ * this data to data sequence numbers.
-+ */
-+ skb_queue_walk(&meta_sk->sk_write_queue, skb) {
-+ /* If the server only acknowledges partially the data sent in
-+ * the SYN, we need to trim the acknowledged part because
-+ * we don't want to retransmit this already received data.
-+ * When we reach this point, tcp_ack() has already cleaned up
-+ * fully acked segments. However, tcp trims partially acked
-+ * segments only when retransmitting. Since MPTCP comes into
-+ * play only now, we will fake an initial transmit, and
-+ * retransmit_skb() will not be called. The following fragment
-+ * comes from __tcp_retransmit_skb().
-+ */
-+ if (before(TCP_SKB_CB(skb)->seq, master_tp->snd_una)) {
-+ BUG_ON(before(TCP_SKB_CB(skb)->end_seq,
-+ master_tp->snd_una));
-+ /* tcp_trim_head can only return ENOMEM if skb is
-+ * cloned. It is not the case here (see
-+ * tcp_send_syn_data).
-+ */
-+ BUG_ON(tcp_trim_head(meta_sk, skb, master_tp->snd_una -
-+ TCP_SKB_CB(skb)->seq));
-+ }
-+
-+ TCP_SKB_CB(skb)->seq += new_mapping;
-+ TCP_SKB_CB(skb)->end_seq += new_mapping;
-+ }
-+
-+ /* We can advance write_seq by the number of bytes unacknowledged
-+ * and that were mapped in the previous loop.
-+ */
-+ meta_tp->write_seq += master_tp->write_seq - master_tp->snd_una;
-+
-+ /* The packets from the master_sk will be entailed to it later
-+ * Until that time, its write queue is empty, and
-+ * write_seq must align with snd_una
-+ */
-+ master_tp->snd_nxt = master_tp->write_seq = master_tp->snd_una;
-+ master_tp->packets_out = 0;
-+
-+ /* Although these data have been sent already over the subsk,
-+ * They have never been sent over the meta_sk, so we rewind
-+ * the send_head so that tcp considers it as an initial send
-+ * (instead of retransmit).
-+ */
-+ meta_sk->sk_send_head = tcp_write_queue_head(meta_sk);
-+}
-+
-+/* The skptr is needed, because if we become MPTCP-capable, we have to switch
-+ * from meta-socket to master-socket.
-+ *
-+ * @return: 1 - we want to reset this connection
-+ * 2 - we want to discard the received syn/ack
-+ * 0 - everything is fine - continue
-+ */
-+int mptcp_rcv_synsent_state_process(struct sock *sk, struct sock **skptr,
-+ const struct sk_buff *skb,
-+ const struct mptcp_options_received *mopt)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+
-+ if (mptcp(tp)) {
-+ u8 hash_mac_check[20];
-+ struct mptcp_cb *mpcb = tp->mpcb;
-+
-+ mptcp_hmac_sha1((u8 *)&mpcb->mptcp_rem_key,
-+ (u8 *)&mpcb->mptcp_loc_key,
-+ (u8 *)&tp->mptcp->rx_opt.mptcp_recv_nonce,
-+ (u8 *)&tp->mptcp->mptcp_loc_nonce,
-+ (u32 *)hash_mac_check);
-+ if (memcmp(hash_mac_check,
-+ (char *)&tp->mptcp->rx_opt.mptcp_recv_tmac, 8)) {
-+ mptcp_sub_force_close(sk);
-+ return 1;
-+ }
-+
-+ /* Set this flag in order to postpone data sending
-+ * until the 4th ack arrives.
-+ */
-+ tp->mptcp->pre_established = 1;
-+ tp->mptcp->rcv_low_prio = tp->mptcp->rx_opt.low_prio;
-+
-+ mptcp_hmac_sha1((u8 *)&mpcb->mptcp_loc_key,
-+ (u8 *)&mpcb->mptcp_rem_key,
-+ (u8 *)&tp->mptcp->mptcp_loc_nonce,
-+ (u8 *)&tp->mptcp->rx_opt.mptcp_recv_nonce,
-+ (u32 *)&tp->mptcp->sender_mac[0]);
-+
-+ } else if (mopt->saw_mpc) {
-+ struct sock *meta_sk = sk;
-+
-+ if (mptcp_create_master_sk(sk, mopt->mptcp_key,
-+ ntohs(tcp_hdr(skb)->window)))
-+ return 2;
-+
-+ sk = tcp_sk(sk)->mpcb->master_sk;
-+ *skptr = sk;
-+ tp = tcp_sk(sk);
-+
-+ /* If fastopen was used data might be in the send queue. We
-+ * need to update their sequence number to MPTCP-level seqno.
-+ * Note that it can happen in rare cases that fastopen_req is
-+ * NULL and syn_data is 0 but fastopen indeed occurred and
-+ * data has been queued in the write queue (but not sent).
-+ * Example of such rare cases: connect is non-blocking and
-+ * TFO is configured to work without cookies.
-+ */
-+ if (!skb_queue_empty(&meta_sk->sk_write_queue))
-+ mptcp_rcv_synsent_fastopen(meta_sk);
-+
-+ /* -1, because the SYN consumed 1 byte. In case of TFO, we
-+ * start the subflow-sequence number as if the data of the SYN
-+ * is not part of any mapping.
-+ */
-+ tp->mptcp->snt_isn = tp->snd_una - 1;
-+ tp->mpcb->dss_csum = mopt->dss_csum;
-+ tp->mptcp->include_mpc = 1;
-+
-+ /* Ensure that fastopen is handled at the meta-level. */
-+ tp->fastopen_req = NULL;
-+
-+ sk_set_socket(sk, mptcp_meta_sk(sk)->sk_socket);
-+ sk->sk_wq = mptcp_meta_sk(sk)->sk_wq;
-+
-+ /* hold in sk_clone_lock due to initialization to 2 */
-+ sock_put(sk);
-+ } else {
-+ tp->request_mptcp = 0;
-+
-+ if (tp->inside_tk_table)
-+ mptcp_hash_remove(tp);
-+ }
-+
-+ if (mptcp(tp))
-+ tp->mptcp->rcv_isn = TCP_SKB_CB(skb)->seq;
-+
-+ return 0;
-+}
-+
-+bool mptcp_should_expand_sndbuf(const struct sock *sk)
-+{
-+ const struct sock *sk_it;
-+ const struct sock *meta_sk = mptcp_meta_sk(sk);
-+ const struct tcp_sock *meta_tp = tcp_sk(meta_sk);
-+ int cnt_backups = 0;
-+ int backup_available = 0;
-+
-+ /* We circumvent this check in tcp_check_space, because we want to
-+ * always call sk_write_space. So, we reproduce the check here.
-+ */
-+ if (!meta_sk->sk_socket ||
-+ !test_bit(SOCK_NOSPACE, &meta_sk->sk_socket->flags))
-+ return false;
-+
-+ /* If the user specified a specific send buffer setting, do
-+ * not modify it.
-+ */
-+ if (meta_sk->sk_userlocks & SOCK_SNDBUF_LOCK)
-+ return false;
-+
-+ /* If we are under global TCP memory pressure, do not expand. */
-+ if (sk_under_memory_pressure(meta_sk))
-+ return false;
-+
-+ /* If we are under soft global TCP memory pressure, do not expand. */
-+ if (sk_memory_allocated(meta_sk) >= sk_prot_mem_limits(meta_sk, 0))
-+ return false;
-+
-+
-+ /* For MPTCP we look for a subsocket that could send data.
-+ * If we found one, then we update the send-buffer.
-+ */
-+ mptcp_for_each_sk(meta_tp->mpcb, sk_it) {
-+ struct tcp_sock *tp_it = tcp_sk(sk_it);
-+
-+ if (!mptcp_sk_can_send(sk_it))
-+ continue;
-+
-+ /* Backup-flows have to be counted - if there is no other
-+ * subflow we take the backup-flow into account.
-+ */
-+ if (tp_it->mptcp->rcv_low_prio || tp_it->mptcp->low_prio)
-+ cnt_backups++;
-+
-+ if (tp_it->packets_out < tp_it->snd_cwnd) {
-+ if (tp_it->mptcp->rcv_low_prio || tp_it->mptcp->low_prio) {
-+ backup_available = 1;
-+ continue;
-+ }
-+ return true;
-+ }
-+ }
-+
-+ /* Backup-flow is available for sending - update send-buffer */
-+ if (meta_tp->mpcb->cnt_established == cnt_backups && backup_available)
-+ return true;
-+ return false;
-+}
-+
-+void mptcp_init_buffer_space(struct sock *sk)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ struct sock *meta_sk = mptcp_meta_sk(sk);
-+ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
-+ int space;
-+
-+ tcp_init_buffer_space(sk);
-+
-+ if (is_master_tp(tp)) {
-+ meta_tp->rcvq_space.space = meta_tp->rcv_wnd;
-+ meta_tp->rcvq_space.time = tcp_time_stamp;
-+ meta_tp->rcvq_space.seq = meta_tp->copied_seq;
-+
-+ /* If there is only one subflow, we just use regular TCP
-+ * autotuning. User-locks are handled already by
-+ * tcp_init_buffer_space
-+ */
-+ meta_tp->window_clamp = tp->window_clamp;
-+ meta_tp->rcv_ssthresh = tp->rcv_ssthresh;
-+ meta_sk->sk_rcvbuf = sk->sk_rcvbuf;
-+ meta_sk->sk_sndbuf = sk->sk_sndbuf;
-+
-+ return;
-+ }
-+
-+ if (meta_sk->sk_userlocks & SOCK_RCVBUF_LOCK)
-+ goto snd_buf;
-+
-+ /* Adding a new subflow to the rcv-buffer space. We make a simple
-+ * addition, to give some space to allow traffic on the new subflow.
-+ * Autotuning will increase it further later on.
-+ */
-+ space = min(meta_sk->sk_rcvbuf + sk->sk_rcvbuf, sysctl_tcp_rmem[2]);
-+ if (space > meta_sk->sk_rcvbuf) {
-+ meta_tp->window_clamp += tp->window_clamp;
-+ meta_tp->rcv_ssthresh += tp->rcv_ssthresh;
-+ meta_sk->sk_rcvbuf = space;
-+ }
-+
-+snd_buf:
-+ if (meta_sk->sk_userlocks & SOCK_SNDBUF_LOCK)
-+ return;
-+
-+ /* Adding a new subflow to the send-buffer space. We make a simple
-+ * addition, to give some space to allow traffic on the new subflow.
-+ * Autotuning will increase it further later on.
-+ */
-+ space = min(meta_sk->sk_sndbuf + sk->sk_sndbuf, sysctl_tcp_wmem[2]);
-+ if (space > meta_sk->sk_sndbuf) {
-+ meta_sk->sk_sndbuf = space;
-+ meta_sk->sk_write_space(meta_sk);
-+ }
-+}
-+
-+void mptcp_tcp_set_rto(struct sock *sk)
-+{
-+ tcp_set_rto(sk);
-+ mptcp_set_rto(sk);
-+}
-diff --git a/net/mptcp/mptcp_ipv4.c b/net/mptcp/mptcp_ipv4.c
-new file mode 100644
-index 000000000000..1183d1305d35
---- /dev/null
-+++ b/net/mptcp/mptcp_ipv4.c
-@@ -0,0 +1,483 @@
-+/*
-+ * MPTCP implementation - IPv4-specific functions
-+ *
-+ * Initial Design & Implementation:
-+ * Sébastien Barré <sebastien.barre@uclouvain.be>
-+ *
-+ * Current Maintainer:
-+ * Christoph Paasch <christoph.paasch@uclouvain.be>
-+ *
-+ * Additional authors:
-+ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
-+ * Gregory Detal <gregory.detal@uclouvain.be>
-+ * Fabien Duchêne <fabien.duchene@uclouvain.be>
-+ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
-+ * Lavkesh Lahngir <lavkesh51@gmail.com>
-+ * Andreas Ripke <ripke@neclab.eu>
-+ * Vlad Dogaru <vlad.dogaru@intel.com>
-+ * Octavian Purdila <octavian.purdila@intel.com>
-+ * John Ronan <jronan@tssg.org>
-+ * Catalin Nicutar <catalin.nicutar@gmail.com>
-+ * Brandon Heller <brandonh@stanford.edu>
-+ *
-+ *
-+ * This program is free software; you can redistribute it and/or
-+ * modify it under the terms of the GNU General Public License
-+ * as published by the Free Software Foundation; either version
-+ * 2 of the License, or (at your option) any later version.
-+ */
-+
-+#include <linux/export.h>
-+#include <linux/ip.h>
-+#include <linux/list.h>
-+#include <linux/skbuff.h>
-+#include <linux/spinlock.h>
-+#include <linux/tcp.h>
-+
-+#include <net/inet_common.h>
-+#include <net/inet_connection_sock.h>
-+#include <net/mptcp.h>
-+#include <net/mptcp_v4.h>
-+#include <net/request_sock.h>
-+#include <net/tcp.h>
-+
-+u32 mptcp_v4_get_nonce(__be32 saddr, __be32 daddr, __be16 sport, __be16 dport)
-+{
-+ u32 hash[MD5_DIGEST_WORDS];
-+
-+ hash[0] = (__force u32)saddr;
-+ hash[1] = (__force u32)daddr;
-+ hash[2] = ((__force u16)sport << 16) + (__force u16)dport;
-+ hash[3] = mptcp_seed++;
-+
-+ md5_transform(hash, mptcp_secret);
-+
-+ return hash[0];
-+}
-+
-+u64 mptcp_v4_get_key(__be32 saddr, __be32 daddr, __be16 sport, __be16 dport)
-+{
-+ u32 hash[MD5_DIGEST_WORDS];
-+
-+ hash[0] = (__force u32)saddr;
-+ hash[1] = (__force u32)daddr;
-+ hash[2] = ((__force u16)sport << 16) + (__force u16)dport;
-+ hash[3] = mptcp_seed++;
-+
-+ md5_transform(hash, mptcp_secret);
-+
-+ return *((u64 *)hash);
-+}
-+
-+
-+static void mptcp_v4_reqsk_destructor(struct request_sock *req)
-+{
-+ mptcp_reqsk_destructor(req);
-+
-+ tcp_v4_reqsk_destructor(req);
-+}
-+
-+static int mptcp_v4_init_req(struct request_sock *req, struct sock *sk,
-+ struct sk_buff *skb)
-+{
-+ tcp_request_sock_ipv4_ops.init_req(req, sk, skb);
-+ mptcp_reqsk_init(req, skb);
-+
-+ return 0;
-+}
-+
-+static int mptcp_v4_join_init_req(struct request_sock *req, struct sock *sk,
-+ struct sk_buff *skb)
-+{
-+ struct mptcp_request_sock *mtreq = mptcp_rsk(req);
-+ struct mptcp_cb *mpcb = tcp_sk(sk)->mpcb;
-+ union inet_addr addr;
-+ int loc_id;
-+ bool low_prio = false;
-+
-+ /* We need to do this as early as possible. Because, if we fail later
-+ * (e.g., get_local_id), then reqsk_free tries to remove the
-+ * request-socket from the htb in mptcp_hash_request_remove as pprev
-+ * may be different from NULL.
-+ */
-+ mtreq->hash_entry.pprev = NULL;
-+
-+ tcp_request_sock_ipv4_ops.init_req(req, sk, skb);
-+
-+ mtreq->mptcp_loc_nonce = mptcp_v4_get_nonce(ip_hdr(skb)->saddr,
-+ ip_hdr(skb)->daddr,
-+ tcp_hdr(skb)->source,
-+ tcp_hdr(skb)->dest);
-+ addr.ip = inet_rsk(req)->ir_loc_addr;
-+ loc_id = mpcb->pm_ops->get_local_id(AF_INET, &addr, sock_net(sk), &low_prio);
-+ if (loc_id == -1)
-+ return -1;
-+ mtreq->loc_id = loc_id;
-+ mtreq->low_prio = low_prio;
-+
-+ mptcp_join_reqsk_init(mpcb, req, skb);
-+
-+ return 0;
-+}
-+
-+/* Similar to tcp_request_sock_ops */
-+struct request_sock_ops mptcp_request_sock_ops __read_mostly = {
-+ .family = PF_INET,
-+ .obj_size = sizeof(struct mptcp_request_sock),
-+ .rtx_syn_ack = tcp_rtx_synack,
-+ .send_ack = tcp_v4_reqsk_send_ack,
-+ .destructor = mptcp_v4_reqsk_destructor,
-+ .send_reset = tcp_v4_send_reset,
-+ .syn_ack_timeout = tcp_syn_ack_timeout,
-+};
-+
-+static void mptcp_v4_reqsk_queue_hash_add(struct sock *meta_sk,
-+ struct request_sock *req,
-+ const unsigned long timeout)
-+{
-+ const u32 h1 = inet_synq_hash(inet_rsk(req)->ir_rmt_addr,
-+ inet_rsk(req)->ir_rmt_port,
-+ 0, MPTCP_HASH_SIZE);
-+ /* We cannot call inet_csk_reqsk_queue_hash_add(), because we do not
-+ * want to reset the keepalive-timer (responsible for retransmitting
-+ * SYN/ACKs). We do not retransmit SYN/ACKs+MP_JOINs, because we cannot
-+ * overload the keepalive timer. Also, it's not a big deal, because the
-+ * third ACK of the MP_JOIN-handshake is sent in a reliable manner. So,
-+ * if the third ACK gets lost, the client will handle the retransmission
-+ * anyways. If our SYN/ACK gets lost, the client will retransmit the
-+ * SYN.
-+ */
-+ struct inet_connection_sock *meta_icsk = inet_csk(meta_sk);
-+ struct listen_sock *lopt = meta_icsk->icsk_accept_queue.listen_opt;
-+ const u32 h2 = inet_synq_hash(inet_rsk(req)->ir_rmt_addr,
-+ inet_rsk(req)->ir_rmt_port,
-+ lopt->hash_rnd, lopt->nr_table_entries);
-+
-+ reqsk_queue_hash_req(&meta_icsk->icsk_accept_queue, h2, req, timeout);
-+ if (reqsk_queue_added(&meta_icsk->icsk_accept_queue) == 0)
-+ mptcp_reset_synack_timer(meta_sk, timeout);
-+
-+ rcu_read_lock();
-+ spin_lock(&mptcp_reqsk_hlock);
-+ hlist_nulls_add_head_rcu(&mptcp_rsk(req)->hash_entry, &mptcp_reqsk_htb[h1]);
-+ spin_unlock(&mptcp_reqsk_hlock);
-+ rcu_read_unlock();
-+}
-+
-+/* Similar to tcp_v4_conn_request */
-+static int mptcp_v4_join_request(struct sock *meta_sk, struct sk_buff *skb)
-+{
-+ return tcp_conn_request(&mptcp_request_sock_ops,
-+ &mptcp_join_request_sock_ipv4_ops,
-+ meta_sk, skb);
-+}
-+
-+/* We only process join requests here. (either the SYN or the final ACK) */
-+int mptcp_v4_do_rcv(struct sock *meta_sk, struct sk_buff *skb)
-+{
-+ const struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
-+ struct sock *child, *rsk = NULL;
-+ int ret;
-+
-+ if (!(TCP_SKB_CB(skb)->mptcp_flags & MPTCPHDR_JOIN)) {
-+ struct tcphdr *th = tcp_hdr(skb);
-+ const struct iphdr *iph = ip_hdr(skb);
-+ struct sock *sk;
-+
-+ sk = inet_lookup_established(sock_net(meta_sk), &tcp_hashinfo,
-+ iph->saddr, th->source, iph->daddr,
-+ th->dest, inet_iif(skb));
-+
-+ if (!sk) {
-+ kfree_skb(skb);
-+ return 0;
-+ }
-+ if (is_meta_sk(sk)) {
-+ WARN("%s Did not find a sub-sk - did found the meta!\n", __func__);
-+ kfree_skb(skb);
-+ sock_put(sk);
-+ return 0;
-+ }
-+
-+ if (sk->sk_state == TCP_TIME_WAIT) {
-+ inet_twsk_put(inet_twsk(sk));
-+ kfree_skb(skb);
-+ return 0;
-+ }
-+
-+ ret = tcp_v4_do_rcv(sk, skb);
-+ sock_put(sk);
-+
-+ return ret;
-+ }
-+ TCP_SKB_CB(skb)->mptcp_flags = 0;
-+
-+ /* Has been removed from the tk-table. Thus, no new subflows.
-+ *
-+ * Check for close-state is necessary, because we may have been closed
-+ * without passing by mptcp_close().
-+ *
-+ * When falling back, no new subflows are allowed either.
-+ */
-+ if (meta_sk->sk_state == TCP_CLOSE || !tcp_sk(meta_sk)->inside_tk_table ||
-+ mpcb->infinite_mapping_rcv || mpcb->send_infinite_mapping)
-+ goto reset_and_discard;
-+
-+ child = tcp_v4_hnd_req(meta_sk, skb);
-+
-+ if (!child)
-+ goto discard;
-+
-+ if (child != meta_sk) {
-+ sock_rps_save_rxhash(child, skb);
-+ /* We don't call tcp_child_process here, because we hold
-+ * already the meta-sk-lock and are sure that it is not owned
-+ * by the user.
-+ */
-+ ret = tcp_rcv_state_process(child, skb, tcp_hdr(skb), skb->len);
-+ bh_unlock_sock(child);
-+ sock_put(child);
-+ if (ret) {
-+ rsk = child;
-+ goto reset_and_discard;
-+ }
-+ } else {
-+ if (tcp_hdr(skb)->syn) {
-+ mptcp_v4_join_request(meta_sk, skb);
-+ goto discard;
-+ }
-+ goto reset_and_discard;
-+ }
-+ return 0;
-+
-+reset_and_discard:
-+ if (reqsk_queue_len(&inet_csk(meta_sk)->icsk_accept_queue)) {
-+ const struct tcphdr *th = tcp_hdr(skb);
-+ const struct iphdr *iph = ip_hdr(skb);
-+ struct request_sock **prev, *req;
-+ /* If we end up here, it means we should not have matched on the
-+ * request-socket. But, because the request-sock queue is only
-+ * destroyed in mptcp_close, the socket may actually already be
-+ * in close-state (e.g., through shutdown()) while still having
-+ * pending request sockets.
-+ */
-+ req = inet_csk_search_req(meta_sk, &prev, th->source,
-+ iph->saddr, iph->daddr);
-+ if (req) {
-+ inet_csk_reqsk_queue_unlink(meta_sk, req, prev);
-+ reqsk_queue_removed(&inet_csk(meta_sk)->icsk_accept_queue,
-+ req);
-+ reqsk_free(req);
-+ }
-+ }
-+
-+ tcp_v4_send_reset(rsk, skb);
-+discard:
-+ kfree_skb(skb);
-+ return 0;
-+}
-+
-+/* After this, the ref count of the meta_sk associated with the request_sock
-+ * is incremented. Thus it is the responsibility of the caller
-+ * to call sock_put() when the reference is not needed anymore.
-+ */
-+struct sock *mptcp_v4_search_req(const __be16 rport, const __be32 raddr,
-+ const __be32 laddr, const struct net *net)
-+{
-+ const struct mptcp_request_sock *mtreq;
-+ struct sock *meta_sk = NULL;
-+ const struct hlist_nulls_node *node;
-+ const u32 hash = inet_synq_hash(raddr, rport, 0, MPTCP_HASH_SIZE);
-+
-+ rcu_read_lock();
-+begin:
-+ hlist_nulls_for_each_entry_rcu(mtreq, node, &mptcp_reqsk_htb[hash],
-+ hash_entry) {
-+ struct inet_request_sock *ireq = inet_rsk(rev_mptcp_rsk(mtreq));
-+ meta_sk = mtreq->mptcp_mpcb->meta_sk;
-+
-+ if (ireq->ir_rmt_port == rport &&
-+ ireq->ir_rmt_addr == raddr &&
-+ ireq->ir_loc_addr == laddr &&
-+ rev_mptcp_rsk(mtreq)->rsk_ops->family == AF_INET &&
-+ net_eq(net, sock_net(meta_sk)))
-+ goto found;
-+ meta_sk = NULL;
-+ }
-+ /* A request-socket is destroyed by RCU. So, it might have been recycled
-+ * and put into another hash-table list. So, after the lookup we may
-+ * end up in a different list. So, we may need to restart.
-+ *
-+ * See also the comment in __inet_lookup_established.
-+ */
-+ if (get_nulls_value(node) != hash + MPTCP_REQSK_NULLS_BASE)
-+ goto begin;
-+
-+found:
-+ if (meta_sk && unlikely(!atomic_inc_not_zero(&meta_sk->sk_refcnt)))
-+ meta_sk = NULL;
-+ rcu_read_unlock();
-+
-+ return meta_sk;
-+}
-+
-+/* Create a new IPv4 subflow.
-+ *
-+ * We are in user-context and meta-sock-lock is hold.
-+ */
-+int mptcp_init4_subsockets(struct sock *meta_sk, const struct mptcp_loc4 *loc,
-+ struct mptcp_rem4 *rem)
-+{
-+ struct tcp_sock *tp;
-+ struct sock *sk;
-+ struct sockaddr_in loc_in, rem_in;
-+ struct socket sock;
-+ int ret;
-+
-+ /** First, create and prepare the new socket */
-+
-+ sock.type = meta_sk->sk_socket->type;
-+ sock.state = SS_UNCONNECTED;
-+ sock.wq = meta_sk->sk_socket->wq;
-+ sock.file = meta_sk->sk_socket->file;
-+ sock.ops = NULL;
-+
-+ ret = inet_create(sock_net(meta_sk), &sock, IPPROTO_TCP, 1);
-+ if (unlikely(ret < 0)) {
-+ mptcp_debug("%s inet_create failed ret: %d\n", __func__, ret);
-+ return ret;
-+ }
-+
-+ sk = sock.sk;
-+ tp = tcp_sk(sk);
-+
-+ /* All subsockets need the MPTCP-lock-class */
-+ lockdep_set_class_and_name(&(sk)->sk_lock.slock, &meta_slock_key, "slock-AF_INET-MPTCP");
-+ lockdep_init_map(&(sk)->sk_lock.dep_map, "sk_lock-AF_INET-MPTCP", &meta_key, 0);
-+
-+ if (mptcp_add_sock(meta_sk, sk, loc->loc4_id, rem->rem4_id, GFP_KERNEL))
-+ goto error;
-+
-+ tp->mptcp->slave_sk = 1;
-+ tp->mptcp->low_prio = loc->low_prio;
-+
-+ /* Initializing the timer for an MPTCP subflow */
-+ setup_timer(&tp->mptcp->mptcp_ack_timer, mptcp_ack_handler, (unsigned long)sk);
-+
-+ /** Then, connect the socket to the peer */
-+ loc_in.sin_family = AF_INET;
-+ rem_in.sin_family = AF_INET;
-+ loc_in.sin_port = 0;
-+ if (rem->port)
-+ rem_in.sin_port = rem->port;
-+ else
-+ rem_in.sin_port = inet_sk(meta_sk)->inet_dport;
-+ loc_in.sin_addr = loc->addr;
-+ rem_in.sin_addr = rem->addr;
-+
-+ ret = sock.ops->bind(&sock, (struct sockaddr *)&loc_in, sizeof(struct sockaddr_in));
-+ if (ret < 0) {
-+ mptcp_debug("%s: MPTCP subsocket bind() failed, error %d\n",
-+ __func__, ret);
-+ goto error;
-+ }
-+
-+ mptcp_debug("%s: token %#x pi %d src_addr:%pI4:%d dst_addr:%pI4:%d\n",
-+ __func__, tcp_sk(meta_sk)->mpcb->mptcp_loc_token,
-+ tp->mptcp->path_index, &loc_in.sin_addr,
-+ ntohs(loc_in.sin_port), &rem_in.sin_addr,
-+ ntohs(rem_in.sin_port));
-+
-+ if (tcp_sk(meta_sk)->mpcb->pm_ops->init_subsocket_v4)
-+ tcp_sk(meta_sk)->mpcb->pm_ops->init_subsocket_v4(sk, rem->addr);
-+
-+ ret = sock.ops->connect(&sock, (struct sockaddr *)&rem_in,
-+ sizeof(struct sockaddr_in), O_NONBLOCK);
-+ if (ret < 0 && ret != -EINPROGRESS) {
-+ mptcp_debug("%s: MPTCP subsocket connect() failed, error %d\n",
-+ __func__, ret);
-+ goto error;
-+ }
-+
-+ sk_set_socket(sk, meta_sk->sk_socket);
-+ sk->sk_wq = meta_sk->sk_wq;
-+
-+ return 0;
-+
-+error:
-+ /* May happen if mptcp_add_sock fails first */
-+ if (!mptcp(tp)) {
-+ tcp_close(sk, 0);
-+ } else {
-+ local_bh_disable();
-+ mptcp_sub_force_close(sk);
-+ local_bh_enable();
-+ }
-+ return ret;
-+}
-+EXPORT_SYMBOL(mptcp_init4_subsockets);
-+
-+const struct inet_connection_sock_af_ops mptcp_v4_specific = {
-+ .queue_xmit = ip_queue_xmit,
-+ .send_check = tcp_v4_send_check,
-+ .rebuild_header = inet_sk_rebuild_header,
-+ .sk_rx_dst_set = inet_sk_rx_dst_set,
-+ .conn_request = mptcp_conn_request,
-+ .syn_recv_sock = tcp_v4_syn_recv_sock,
-+ .net_header_len = sizeof(struct iphdr),
-+ .setsockopt = ip_setsockopt,
-+ .getsockopt = ip_getsockopt,
-+ .addr2sockaddr = inet_csk_addr2sockaddr,
-+ .sockaddr_len = sizeof(struct sockaddr_in),
-+ .bind_conflict = inet_csk_bind_conflict,
-+#ifdef CONFIG_COMPAT
-+ .compat_setsockopt = compat_ip_setsockopt,
-+ .compat_getsockopt = compat_ip_getsockopt,
-+#endif
-+};
-+
-+struct tcp_request_sock_ops mptcp_request_sock_ipv4_ops;
-+struct tcp_request_sock_ops mptcp_join_request_sock_ipv4_ops;
-+
-+/* General initialization of IPv4 for MPTCP */
-+int mptcp_pm_v4_init(void)
-+{
-+ int ret = 0;
-+ struct request_sock_ops *ops = &mptcp_request_sock_ops;
-+
-+ mptcp_request_sock_ipv4_ops = tcp_request_sock_ipv4_ops;
-+ mptcp_request_sock_ipv4_ops.init_req = mptcp_v4_init_req;
-+
-+ mptcp_join_request_sock_ipv4_ops = tcp_request_sock_ipv4_ops;
-+ mptcp_join_request_sock_ipv4_ops.init_req = mptcp_v4_join_init_req;
-+ mptcp_join_request_sock_ipv4_ops.queue_hash_add = mptcp_v4_reqsk_queue_hash_add;
-+
-+ ops->slab_name = kasprintf(GFP_KERNEL, "request_sock_%s", "MPTCP");
-+ if (ops->slab_name == NULL) {
-+ ret = -ENOMEM;
-+ goto out;
-+ }
-+
-+ ops->slab = kmem_cache_create(ops->slab_name, ops->obj_size, 0,
-+ SLAB_DESTROY_BY_RCU|SLAB_HWCACHE_ALIGN,
-+ NULL);
-+
-+ if (ops->slab == NULL) {
-+ ret = -ENOMEM;
-+ goto err_reqsk_create;
-+ }
-+
-+out:
-+ return ret;
-+
-+err_reqsk_create:
-+ kfree(ops->slab_name);
-+ ops->slab_name = NULL;
-+ goto out;
-+}
-+
-+void mptcp_pm_v4_undo(void)
-+{
-+ kmem_cache_destroy(mptcp_request_sock_ops.slab);
-+ kfree(mptcp_request_sock_ops.slab_name);
-+}
-diff --git a/net/mptcp/mptcp_ipv6.c b/net/mptcp/mptcp_ipv6.c
-new file mode 100644
-index 000000000000..1036973aa855
---- /dev/null
-+++ b/net/mptcp/mptcp_ipv6.c
-@@ -0,0 +1,518 @@
-+/*
-+ * MPTCP implementation - IPv6-specific functions
-+ *
-+ * Initial Design & Implementation:
-+ * Sébastien Barré <sebastien.barre@uclouvain.be>
-+ *
-+ * Current Maintainer:
-+ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
-+ *
-+ * Additional authors:
-+ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
-+ * Gregory Detal <gregory.detal@uclouvain.be>
-+ * Fabien Duchêne <fabien.duchene@uclouvain.be>
-+ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
-+ * Lavkesh Lahngir <lavkesh51@gmail.com>
-+ * Andreas Ripke <ripke@neclab.eu>
-+ * Vlad Dogaru <vlad.dogaru@intel.com>
-+ * Octavian Purdila <octavian.purdila@intel.com>
-+ * John Ronan <jronan@tssg.org>
-+ * Catalin Nicutar <catalin.nicutar@gmail.com>
-+ * Brandon Heller <brandonh@stanford.edu>
-+ *
-+ *
-+ * This program is free software; you can redistribute it and/or
-+ * modify it under the terms of the GNU General Public License
-+ * as published by the Free Software Foundation; either version
-+ * 2 of the License, or (at your option) any later version.
-+ */
-+
-+#include <linux/export.h>
-+#include <linux/in6.h>
-+#include <linux/kernel.h>
-+
-+#include <net/addrconf.h>
-+#include <net/flow.h>
-+#include <net/inet6_connection_sock.h>
-+#include <net/inet6_hashtables.h>
-+#include <net/inet_common.h>
-+#include <net/ipv6.h>
-+#include <net/ip6_checksum.h>
-+#include <net/ip6_route.h>
-+#include <net/mptcp.h>
-+#include <net/mptcp_v6.h>
-+#include <net/tcp.h>
-+#include <net/transp_v6.h>
-+
-+__u32 mptcp_v6_get_nonce(const __be32 *saddr, const __be32 *daddr,
-+ __be16 sport, __be16 dport)
-+{
-+ u32 secret[MD5_MESSAGE_BYTES / 4];
-+ u32 hash[MD5_DIGEST_WORDS];
-+ u32 i;
-+
-+ memcpy(hash, saddr, 16);
-+ for (i = 0; i < 4; i++)
-+ secret[i] = mptcp_secret[i] + (__force u32)daddr[i];
-+ secret[4] = mptcp_secret[4] +
-+ (((__force u16)sport << 16) + (__force u16)dport);
-+ secret[5] = mptcp_seed++;
-+ for (i = 6; i < MD5_MESSAGE_BYTES / 4; i++)
-+ secret[i] = mptcp_secret[i];
-+
-+ md5_transform(hash, secret);
-+
-+ return hash[0];
-+}
-+
-+u64 mptcp_v6_get_key(const __be32 *saddr, const __be32 *daddr,
-+ __be16 sport, __be16 dport)
-+{
-+ u32 secret[MD5_MESSAGE_BYTES / 4];
-+ u32 hash[MD5_DIGEST_WORDS];
-+ u32 i;
-+
-+ memcpy(hash, saddr, 16);
-+ for (i = 0; i < 4; i++)
-+ secret[i] = mptcp_secret[i] + (__force u32)daddr[i];
-+ secret[4] = mptcp_secret[4] +
-+ (((__force u16)sport << 16) + (__force u16)dport);
-+ secret[5] = mptcp_seed++;
-+ for (i = 6; i < MD5_MESSAGE_BYTES / 4; i++)
-+ secret[i] = mptcp_secret[i];
-+
-+ md5_transform(hash, secret);
-+
-+ return *((u64 *)hash);
-+}
-+
-+static void mptcp_v6_reqsk_destructor(struct request_sock *req)
-+{
-+ mptcp_reqsk_destructor(req);
-+
-+ tcp_v6_reqsk_destructor(req);
-+}
-+
-+static int mptcp_v6_init_req(struct request_sock *req, struct sock *sk,
-+ struct sk_buff *skb)
-+{
-+ tcp_request_sock_ipv6_ops.init_req(req, sk, skb);
-+ mptcp_reqsk_init(req, skb);
-+
-+ return 0;
-+}
-+
-+static int mptcp_v6_join_init_req(struct request_sock *req, struct sock *sk,
-+ struct sk_buff *skb)
-+{
-+ struct mptcp_request_sock *mtreq = mptcp_rsk(req);
-+ struct mptcp_cb *mpcb = tcp_sk(sk)->mpcb;
-+ union inet_addr addr;
-+ int loc_id;
-+ bool low_prio = false;
-+
-+ /* We need to do this as early as possible. Because, if we fail later
-+ * (e.g., get_local_id), then reqsk_free tries to remove the
-+ * request-socket from the htb in mptcp_hash_request_remove as pprev
-+ * may be different from NULL.
-+ */
-+ mtreq->hash_entry.pprev = NULL;
-+
-+ tcp_request_sock_ipv6_ops.init_req(req, sk, skb);
-+
-+ mtreq->mptcp_loc_nonce = mptcp_v6_get_nonce(ipv6_hdr(skb)->saddr.s6_addr32,
-+ ipv6_hdr(skb)->daddr.s6_addr32,
-+ tcp_hdr(skb)->source,
-+ tcp_hdr(skb)->dest);
-+ addr.in6 = inet_rsk(req)->ir_v6_loc_addr;
-+ loc_id = mpcb->pm_ops->get_local_id(AF_INET6, &addr, sock_net(sk), &low_prio);
-+ if (loc_id == -1)
-+ return -1;
-+ mtreq->loc_id = loc_id;
-+ mtreq->low_prio = low_prio;
-+
-+ mptcp_join_reqsk_init(mpcb, req, skb);
-+
-+ return 0;
-+}
-+
-+/* Similar to tcp6_request_sock_ops */
-+struct request_sock_ops mptcp6_request_sock_ops __read_mostly = {
-+ .family = AF_INET6,
-+ .obj_size = sizeof(struct mptcp_request_sock),
-+ .rtx_syn_ack = tcp_v6_rtx_synack,
-+ .send_ack = tcp_v6_reqsk_send_ack,
-+ .destructor = mptcp_v6_reqsk_destructor,
-+ .send_reset = tcp_v6_send_reset,
-+ .syn_ack_timeout = tcp_syn_ack_timeout,
-+};
-+
-+static void mptcp_v6_reqsk_queue_hash_add(struct sock *meta_sk,
-+ struct request_sock *req,
-+ const unsigned long timeout)
-+{
-+ const u32 h1 = inet6_synq_hash(&inet_rsk(req)->ir_v6_rmt_addr,
-+ inet_rsk(req)->ir_rmt_port,
-+ 0, MPTCP_HASH_SIZE);
-+ /* We cannot call inet6_csk_reqsk_queue_hash_add(), because we do not
-+ * want to reset the keepalive-timer (responsible for retransmitting
-+ * SYN/ACKs). We do not retransmit SYN/ACKs+MP_JOINs, because we cannot
-+ * overload the keepalive timer. Also, it's not a big deal, because the
-+ * third ACK of the MP_JOIN-handshake is sent in a reliable manner. So,
-+ * if the third ACK gets lost, the client will handle the retransmission
-+ * anyways. If our SYN/ACK gets lost, the client will retransmit the
-+ * SYN.
-+ */
-+ struct inet_connection_sock *meta_icsk = inet_csk(meta_sk);
-+ struct listen_sock *lopt = meta_icsk->icsk_accept_queue.listen_opt;
-+ const u32 h2 = inet6_synq_hash(&inet_rsk(req)->ir_v6_rmt_addr,
-+ inet_rsk(req)->ir_rmt_port,
-+ lopt->hash_rnd, lopt->nr_table_entries);
-+
-+ reqsk_queue_hash_req(&meta_icsk->icsk_accept_queue, h2, req, timeout);
-+ if (reqsk_queue_added(&meta_icsk->icsk_accept_queue) == 0)
-+ mptcp_reset_synack_timer(meta_sk, timeout);
-+
-+ rcu_read_lock();
-+ spin_lock(&mptcp_reqsk_hlock);
-+ hlist_nulls_add_head_rcu(&mptcp_rsk(req)->hash_entry, &mptcp_reqsk_htb[h1]);
-+ spin_unlock(&mptcp_reqsk_hlock);
-+ rcu_read_unlock();
-+}
-+
-+static int mptcp_v6_join_request(struct sock *meta_sk, struct sk_buff *skb)
-+{
-+ return tcp_conn_request(&mptcp6_request_sock_ops,
-+ &mptcp_join_request_sock_ipv6_ops,
-+ meta_sk, skb);
-+}
-+
-+int mptcp_v6_do_rcv(struct sock *meta_sk, struct sk_buff *skb)
-+{
-+ const struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
-+ struct sock *child, *rsk = NULL;
-+ int ret;
-+
-+ if (!(TCP_SKB_CB(skb)->mptcp_flags & MPTCPHDR_JOIN)) {
-+ struct tcphdr *th = tcp_hdr(skb);
-+ const struct ipv6hdr *ip6h = ipv6_hdr(skb);
-+ struct sock *sk;
-+
-+ sk = __inet6_lookup_established(sock_net(meta_sk),
-+ &tcp_hashinfo,
-+ &ip6h->saddr, th->source,
-+ &ip6h->daddr, ntohs(th->dest),
-+ inet6_iif(skb));
-+
-+ if (!sk) {
-+ kfree_skb(skb);
-+ return 0;
-+ }
-+ if (is_meta_sk(sk)) {
-+ WARN("%s Did not find a sub-sk!\n", __func__);
-+ kfree_skb(skb);
-+ sock_put(sk);
-+ return 0;
-+ }
-+
-+ if (sk->sk_state == TCP_TIME_WAIT) {
-+ inet_twsk_put(inet_twsk(sk));
-+ kfree_skb(skb);
-+ return 0;
-+ }
-+
-+ ret = tcp_v6_do_rcv(sk, skb);
-+ sock_put(sk);
-+
-+ return ret;
-+ }
-+ TCP_SKB_CB(skb)->mptcp_flags = 0;
-+
-+ /* Has been removed from the tk-table. Thus, no new subflows.
-+ *
-+ * Check for close-state is necessary, because we may have been closed
-+ * without passing by mptcp_close().
-+ *
-+ * When falling back, no new subflows are allowed either.
-+ */
-+ if (meta_sk->sk_state == TCP_CLOSE || !tcp_sk(meta_sk)->inside_tk_table ||
-+ mpcb->infinite_mapping_rcv || mpcb->send_infinite_mapping)
-+ goto reset_and_discard;
-+
-+ child = tcp_v6_hnd_req(meta_sk, skb);
-+
-+ if (!child)
-+ goto discard;
-+
-+ if (child != meta_sk) {
-+ sock_rps_save_rxhash(child, skb);
-+ /* We don't call tcp_child_process here, because we hold
-+ * already the meta-sk-lock and are sure that it is not owned
-+ * by the user.
-+ */
-+ ret = tcp_rcv_state_process(child, skb, tcp_hdr(skb), skb->len);
-+ bh_unlock_sock(child);
-+ sock_put(child);
-+ if (ret) {
-+ rsk = child;
-+ goto reset_and_discard;
-+ }
-+ } else {
-+ if (tcp_hdr(skb)->syn) {
-+ mptcp_v6_join_request(meta_sk, skb);
-+ goto discard;
-+ }
-+ goto reset_and_discard;
-+ }
-+ return 0;
-+
-+reset_and_discard:
-+ if (reqsk_queue_len(&inet_csk(meta_sk)->icsk_accept_queue)) {
-+ const struct tcphdr *th = tcp_hdr(skb);
-+ struct request_sock **prev, *req;
-+ /* If we end up here, it means we should not have matched on the
-+ * request-socket. But, because the request-sock queue is only
-+ * destroyed in mptcp_close, the socket may actually already be
-+ * in close-state (e.g., through shutdown()) while still having
-+ * pending request sockets.
-+ */
-+ req = inet6_csk_search_req(meta_sk, &prev, th->source,
-+ &ipv6_hdr(skb)->saddr,
-+ &ipv6_hdr(skb)->daddr, inet6_iif(skb));
-+ if (req) {
-+ inet_csk_reqsk_queue_unlink(meta_sk, req, prev);
-+ reqsk_queue_removed(&inet_csk(meta_sk)->icsk_accept_queue,
-+ req);
-+ reqsk_free(req);
-+ }
-+ }
-+
-+ tcp_v6_send_reset(rsk, skb);
-+discard:
-+ kfree_skb(skb);
-+ return 0;
-+}
-+
-+/* After this, the ref count of the meta_sk associated with the request_sock
-+ * is incremented. Thus it is the responsibility of the caller
-+ * to call sock_put() when the reference is not needed anymore.
-+ */
-+struct sock *mptcp_v6_search_req(const __be16 rport, const struct in6_addr *raddr,
-+ const struct in6_addr *laddr, const struct net *net)
-+{
-+ const struct mptcp_request_sock *mtreq;
-+ struct sock *meta_sk = NULL;
-+ const struct hlist_nulls_node *node;
-+ const u32 hash = inet6_synq_hash(raddr, rport, 0, MPTCP_HASH_SIZE);
-+
-+ rcu_read_lock();
-+begin:
-+ hlist_nulls_for_each_entry_rcu(mtreq, node, &mptcp_reqsk_htb[hash],
-+ hash_entry) {
-+ struct inet_request_sock *treq = inet_rsk(rev_mptcp_rsk(mtreq));
-+ meta_sk = mtreq->mptcp_mpcb->meta_sk;
-+
-+ if (inet_rsk(rev_mptcp_rsk(mtreq))->ir_rmt_port == rport &&
-+ rev_mptcp_rsk(mtreq)->rsk_ops->family == AF_INET6 &&
-+ ipv6_addr_equal(&treq->ir_v6_rmt_addr, raddr) &&
-+ ipv6_addr_equal(&treq->ir_v6_loc_addr, laddr) &&
-+ net_eq(net, sock_net(meta_sk)))
-+ goto found;
-+ meta_sk = NULL;
-+ }
-+ /* A request-socket is destroyed by RCU. So, it might have been recycled
-+ * and put into another hash-table list. So, after the lookup we may
-+ * end up in a different list. So, we may need to restart.
-+ *
-+ * See also the comment in __inet_lookup_established.
-+ */
-+ if (get_nulls_value(node) != hash + MPTCP_REQSK_NULLS_BASE)
-+ goto begin;
-+
-+found:
-+ if (meta_sk && unlikely(!atomic_inc_not_zero(&meta_sk->sk_refcnt)))
-+ meta_sk = NULL;
-+ rcu_read_unlock();
-+
-+ return meta_sk;
-+}
-+
-+/* Create a new IPv6 subflow.
-+ *
-+ * We are in user-context and meta-sock-lock is hold.
-+ */
-+int mptcp_init6_subsockets(struct sock *meta_sk, const struct mptcp_loc6 *loc,
-+ struct mptcp_rem6 *rem)
-+{
-+ struct tcp_sock *tp;
-+ struct sock *sk;
-+ struct sockaddr_in6 loc_in, rem_in;
-+ struct socket sock;
-+ int ret;
-+
-+ /** First, create and prepare the new socket */
-+
-+ sock.type = meta_sk->sk_socket->type;
-+ sock.state = SS_UNCONNECTED;
-+ sock.wq = meta_sk->sk_socket->wq;
-+ sock.file = meta_sk->sk_socket->file;
-+ sock.ops = NULL;
-+
-+ ret = inet6_create(sock_net(meta_sk), &sock, IPPROTO_TCP, 1);
-+ if (unlikely(ret < 0)) {
-+ mptcp_debug("%s inet6_create failed ret: %d\n", __func__, ret);
-+ return ret;
-+ }
-+
-+ sk = sock.sk;
-+ tp = tcp_sk(sk);
-+
-+ /* All subsockets need the MPTCP-lock-class */
-+ lockdep_set_class_and_name(&(sk)->sk_lock.slock, &meta_slock_key, "slock-AF_INET-MPTCP");
-+ lockdep_init_map(&(sk)->sk_lock.dep_map, "sk_lock-AF_INET-MPTCP", &meta_key, 0);
-+
-+ if (mptcp_add_sock(meta_sk, sk, loc->loc6_id, rem->rem6_id, GFP_KERNEL))
-+ goto error;
-+
-+ tp->mptcp->slave_sk = 1;
-+ tp->mptcp->low_prio = loc->low_prio;
-+
-+ /* Initializing the timer for an MPTCP subflow */
-+ setup_timer(&tp->mptcp->mptcp_ack_timer, mptcp_ack_handler, (unsigned long)sk);
-+
-+ /** Then, connect the socket to the peer */
-+ loc_in.sin6_family = AF_INET6;
-+ rem_in.sin6_family = AF_INET6;
-+ loc_in.sin6_port = 0;
-+ if (rem->port)
-+ rem_in.sin6_port = rem->port;
-+ else
-+ rem_in.sin6_port = inet_sk(meta_sk)->inet_dport;
-+ loc_in.sin6_addr = loc->addr;
-+ rem_in.sin6_addr = rem->addr;
-+
-+ ret = sock.ops->bind(&sock, (struct sockaddr *)&loc_in, sizeof(struct sockaddr_in6));
-+ if (ret < 0) {
-+ mptcp_debug("%s: MPTCP subsocket bind()failed, error %d\n",
-+ __func__, ret);
-+ goto error;
-+ }
-+
-+ mptcp_debug("%s: token %#x pi %d src_addr:%pI6:%d dst_addr:%pI6:%d\n",
-+ __func__, tcp_sk(meta_sk)->mpcb->mptcp_loc_token,
-+ tp->mptcp->path_index, &loc_in.sin6_addr,
-+ ntohs(loc_in.sin6_port), &rem_in.sin6_addr,
-+ ntohs(rem_in.sin6_port));
-+
-+ if (tcp_sk(meta_sk)->mpcb->pm_ops->init_subsocket_v6)
-+ tcp_sk(meta_sk)->mpcb->pm_ops->init_subsocket_v6(sk, rem->addr);
-+
-+ ret = sock.ops->connect(&sock, (struct sockaddr *)&rem_in,
-+ sizeof(struct sockaddr_in6), O_NONBLOCK);
-+ if (ret < 0 && ret != -EINPROGRESS) {
-+ mptcp_debug("%s: MPTCP subsocket connect() failed, error %d\n",
-+ __func__, ret);
-+ goto error;
-+ }
-+
-+ sk_set_socket(sk, meta_sk->sk_socket);
-+ sk->sk_wq = meta_sk->sk_wq;
-+
-+ return 0;
-+
-+error:
-+ /* May happen if mptcp_add_sock fails first */
-+ if (!mptcp(tp)) {
-+ tcp_close(sk, 0);
-+ } else {
-+ local_bh_disable();
-+ mptcp_sub_force_close(sk);
-+ local_bh_enable();
-+ }
-+ return ret;
-+}
-+EXPORT_SYMBOL(mptcp_init6_subsockets);
-+
-+const struct inet_connection_sock_af_ops mptcp_v6_specific = {
-+ .queue_xmit = inet6_csk_xmit,
-+ .send_check = tcp_v6_send_check,
-+ .rebuild_header = inet6_sk_rebuild_header,
-+ .sk_rx_dst_set = inet6_sk_rx_dst_set,
-+ .conn_request = mptcp_conn_request,
-+ .syn_recv_sock = tcp_v6_syn_recv_sock,
-+ .net_header_len = sizeof(struct ipv6hdr),
-+ .net_frag_header_len = sizeof(struct frag_hdr),
-+ .setsockopt = ipv6_setsockopt,
-+ .getsockopt = ipv6_getsockopt,
-+ .addr2sockaddr = inet6_csk_addr2sockaddr,
-+ .sockaddr_len = sizeof(struct sockaddr_in6),
-+ .bind_conflict = inet6_csk_bind_conflict,
-+#ifdef CONFIG_COMPAT
-+ .compat_setsockopt = compat_ipv6_setsockopt,
-+ .compat_getsockopt = compat_ipv6_getsockopt,
-+#endif
-+};
-+
-+const struct inet_connection_sock_af_ops mptcp_v6_mapped = {
-+ .queue_xmit = ip_queue_xmit,
-+ .send_check = tcp_v4_send_check,
-+ .rebuild_header = inet_sk_rebuild_header,
-+ .sk_rx_dst_set = inet_sk_rx_dst_set,
-+ .conn_request = mptcp_conn_request,
-+ .syn_recv_sock = tcp_v6_syn_recv_sock,
-+ .net_header_len = sizeof(struct iphdr),
-+ .setsockopt = ipv6_setsockopt,
-+ .getsockopt = ipv6_getsockopt,
-+ .addr2sockaddr = inet6_csk_addr2sockaddr,
-+ .sockaddr_len = sizeof(struct sockaddr_in6),
-+ .bind_conflict = inet6_csk_bind_conflict,
-+#ifdef CONFIG_COMPAT
-+ .compat_setsockopt = compat_ipv6_setsockopt,
-+ .compat_getsockopt = compat_ipv6_getsockopt,
-+#endif
-+};
-+
-+struct tcp_request_sock_ops mptcp_request_sock_ipv6_ops;
-+struct tcp_request_sock_ops mptcp_join_request_sock_ipv6_ops;
-+
-+int mptcp_pm_v6_init(void)
-+{
-+ int ret = 0;
-+ struct request_sock_ops *ops = &mptcp6_request_sock_ops;
-+
-+ mptcp_request_sock_ipv6_ops = tcp_request_sock_ipv6_ops;
-+ mptcp_request_sock_ipv6_ops.init_req = mptcp_v6_init_req;
-+
-+ mptcp_join_request_sock_ipv6_ops = tcp_request_sock_ipv6_ops;
-+ mptcp_join_request_sock_ipv6_ops.init_req = mptcp_v6_join_init_req;
-+ mptcp_join_request_sock_ipv6_ops.queue_hash_add = mptcp_v6_reqsk_queue_hash_add;
-+
-+ ops->slab_name = kasprintf(GFP_KERNEL, "request_sock_%s", "MPTCP6");
-+ if (ops->slab_name == NULL) {
-+ ret = -ENOMEM;
-+ goto out;
-+ }
-+
-+ ops->slab = kmem_cache_create(ops->slab_name, ops->obj_size, 0,
-+ SLAB_DESTROY_BY_RCU|SLAB_HWCACHE_ALIGN,
-+ NULL);
-+
-+ if (ops->slab == NULL) {
-+ ret = -ENOMEM;
-+ goto err_reqsk_create;
-+ }
-+
-+out:
-+ return ret;
-+
-+err_reqsk_create:
-+ kfree(ops->slab_name);
-+ ops->slab_name = NULL;
-+ goto out;
-+}
-+
-+void mptcp_pm_v6_undo(void)
-+{
-+ kmem_cache_destroy(mptcp6_request_sock_ops.slab);
-+ kfree(mptcp6_request_sock_ops.slab_name);
-+}
-diff --git a/net/mptcp/mptcp_ndiffports.c b/net/mptcp/mptcp_ndiffports.c
-new file mode 100644
-index 000000000000..6f5087983175
---- /dev/null
-+++ b/net/mptcp/mptcp_ndiffports.c
-@@ -0,0 +1,161 @@
-+#include <linux/module.h>
-+
-+#include <net/mptcp.h>
-+#include <net/mptcp_v4.h>
-+
-+#if IS_ENABLED(CONFIG_IPV6)
-+#include <net/mptcp_v6.h>
-+#endif
-+
-+struct ndiffports_priv {
-+ /* Worker struct for subflow establishment */
-+ struct work_struct subflow_work;
-+
-+ struct mptcp_cb *mpcb;
-+};
-+
-+static int num_subflows __read_mostly = 2;
-+module_param(num_subflows, int, 0644);
-+MODULE_PARM_DESC(num_subflows, "choose the number of subflows per MPTCP connection");
-+
-+/**
-+ * Create all new subflows, by doing calls to mptcp_initX_subsockets
-+ *
-+ * This function uses a goto next_subflow, to allow releasing the lock between
-+ * new subflows and giving other processes a chance to do some work on the
-+ * socket and potentially finishing the communication.
-+ **/
-+static void create_subflow_worker(struct work_struct *work)
-+{
-+ const struct ndiffports_priv *pm_priv = container_of(work,
-+ struct ndiffports_priv,
-+ subflow_work);
-+ struct mptcp_cb *mpcb = pm_priv->mpcb;
-+ struct sock *meta_sk = mpcb->meta_sk;
-+ int iter = 0;
-+
-+next_subflow:
-+ if (iter) {
-+ release_sock(meta_sk);
-+ mutex_unlock(&mpcb->mpcb_mutex);
-+
-+ cond_resched();
-+ }
-+ mutex_lock(&mpcb->mpcb_mutex);
-+ lock_sock_nested(meta_sk, SINGLE_DEPTH_NESTING);
-+
-+ iter++;
-+
-+ if (sock_flag(meta_sk, SOCK_DEAD))
-+ goto exit;
-+
-+ if (mpcb->master_sk &&
-+ !tcp_sk(mpcb->master_sk)->mptcp->fully_established)
-+ goto exit;
-+
-+ if (num_subflows > iter && num_subflows > mpcb->cnt_subflows) {
-+ if (meta_sk->sk_family == AF_INET ||
-+ mptcp_v6_is_v4_mapped(meta_sk)) {
-+ struct mptcp_loc4 loc;
-+ struct mptcp_rem4 rem;
-+
-+ loc.addr.s_addr = inet_sk(meta_sk)->inet_saddr;
-+ loc.loc4_id = 0;
-+ loc.low_prio = 0;
-+
-+ rem.addr.s_addr = inet_sk(meta_sk)->inet_daddr;
-+ rem.port = inet_sk(meta_sk)->inet_dport;
-+ rem.rem4_id = 0; /* Default 0 */
-+
-+ mptcp_init4_subsockets(meta_sk, &loc, &rem);
-+ } else {
-+#if IS_ENABLED(CONFIG_IPV6)
-+ struct mptcp_loc6 loc;
-+ struct mptcp_rem6 rem;
-+
-+ loc.addr = inet6_sk(meta_sk)->saddr;
-+ loc.loc6_id = 0;
-+ loc.low_prio = 0;
-+
-+ rem.addr = meta_sk->sk_v6_daddr;
-+ rem.port = inet_sk(meta_sk)->inet_dport;
-+ rem.rem6_id = 0; /* Default 0 */
-+
-+ mptcp_init6_subsockets(meta_sk, &loc, &rem);
-+#endif
-+ }
-+ goto next_subflow;
-+ }
-+
-+exit:
-+ release_sock(meta_sk);
-+ mutex_unlock(&mpcb->mpcb_mutex);
-+ sock_put(meta_sk);
-+}
-+
-+static void ndiffports_new_session(const struct sock *meta_sk)
-+{
-+ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
-+ struct ndiffports_priv *fmp = (struct ndiffports_priv *)&mpcb->mptcp_pm[0];
-+
-+ /* Initialize workqueue-struct */
-+ INIT_WORK(&fmp->subflow_work, create_subflow_worker);
-+ fmp->mpcb = mpcb;
-+}
-+
-+static void ndiffports_create_subflows(struct sock *meta_sk)
-+{
-+ const struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
-+ struct ndiffports_priv *pm_priv = (struct ndiffports_priv *)&mpcb->mptcp_pm[0];
-+
-+ if (mpcb->infinite_mapping_snd || mpcb->infinite_mapping_rcv ||
-+ mpcb->send_infinite_mapping ||
-+ mpcb->server_side || sock_flag(meta_sk, SOCK_DEAD))
-+ return;
-+
-+ if (!work_pending(&pm_priv->subflow_work)) {
-+ sock_hold(meta_sk);
-+ queue_work(mptcp_wq, &pm_priv->subflow_work);
-+ }
-+}
-+
-+static int ndiffports_get_local_id(sa_family_t family, union inet_addr *addr,
-+ struct net *net, bool *low_prio)
-+{
-+ return 0;
-+}
-+
-+static struct mptcp_pm_ops ndiffports __read_mostly = {
-+ .new_session = ndiffports_new_session,
-+ .fully_established = ndiffports_create_subflows,
-+ .get_local_id = ndiffports_get_local_id,
-+ .name = "ndiffports",
-+ .owner = THIS_MODULE,
-+};
-+
-+/* General initialization of MPTCP_PM */
-+static int __init ndiffports_register(void)
-+{
-+ BUILD_BUG_ON(sizeof(struct ndiffports_priv) > MPTCP_PM_SIZE);
-+
-+ if (mptcp_register_path_manager(&ndiffports))
-+ goto exit;
-+
-+ return 0;
-+
-+exit:
-+ return -1;
-+}
-+
-+static void ndiffports_unregister(void)
-+{
-+ mptcp_unregister_path_manager(&ndiffports);
-+}
-+
-+module_init(ndiffports_register);
-+module_exit(ndiffports_unregister);
-+
-+MODULE_AUTHOR("Christoph Paasch");
-+MODULE_LICENSE("GPL");
-+MODULE_DESCRIPTION("NDIFF-PORTS MPTCP");
-+MODULE_VERSION("0.88");
-diff --git a/net/mptcp/mptcp_ofo_queue.c b/net/mptcp/mptcp_ofo_queue.c
-new file mode 100644
-index 000000000000..ec4e98622637
---- /dev/null
-+++ b/net/mptcp/mptcp_ofo_queue.c
-@@ -0,0 +1,295 @@
-+/*
-+ * MPTCP implementation - Fast algorithm for MPTCP meta-reordering
-+ *
-+ * Initial Design & Implementation:
-+ * Sébastien Barré <sebastien.barre@uclouvain.be>
-+ *
-+ * Current Maintainer & Author:
-+ * Christoph Paasch <christoph.paasch@uclouvain.be>
-+ *
-+ * Additional authors:
-+ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
-+ * Gregory Detal <gregory.detal@uclouvain.be>
-+ * Fabien Duchêne <fabien.duchene@uclouvain.be>
-+ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
-+ * Lavkesh Lahngir <lavkesh51@gmail.com>
-+ * Andreas Ripke <ripke@neclab.eu>
-+ * Vlad Dogaru <vlad.dogaru@intel.com>
-+ * Octavian Purdila <octavian.purdila@intel.com>
-+ * John Ronan <jronan@tssg.org>
-+ * Catalin Nicutar <catalin.nicutar@gmail.com>
-+ * Brandon Heller <brandonh@stanford.edu>
-+ *
-+ * This program is free software; you can redistribute it and/or
-+ * modify it under the terms of the GNU General Public License
-+ * as published by the Free Software Foundation; either version
-+ * 2 of the License, or (at your option) any later version.
-+ */
-+
-+#include <linux/skbuff.h>
-+#include <linux/slab.h>
-+#include <net/tcp.h>
-+#include <net/mptcp.h>
-+
-+void mptcp_remove_shortcuts(const struct mptcp_cb *mpcb,
-+ const struct sk_buff *skb)
-+{
-+ struct tcp_sock *tp;
-+
-+ mptcp_for_each_tp(mpcb, tp) {
-+ if (tp->mptcp->shortcut_ofoqueue == skb) {
-+ tp->mptcp->shortcut_ofoqueue = NULL;
-+ return;
-+ }
-+ }
-+}
-+
-+/* Does 'skb' fit after 'here' in the queue 'head'?
-+ * If yes, we queue it and return 1
-+ */
-+static int mptcp_ofo_queue_after(struct sk_buff_head *head,
-+ struct sk_buff *skb, struct sk_buff *here,
-+ const struct tcp_sock *tp)
-+{
-+ struct sock *meta_sk = tp->meta_sk;
-+ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
-+ u32 seq = TCP_SKB_CB(skb)->seq;
-+ u32 end_seq = TCP_SKB_CB(skb)->end_seq;
-+
-+ /* We want to queue skb after here, thus seq >= end_seq */
-+ if (before(seq, TCP_SKB_CB(here)->end_seq))
-+ return 0;
-+
-+ if (seq == TCP_SKB_CB(here)->end_seq) {
-+ bool fragstolen = false;
-+
-+ if (!tcp_try_coalesce(meta_sk, here, skb, &fragstolen)) {
-+ __skb_queue_after(&meta_tp->out_of_order_queue, here, skb);
-+ return 1;
-+ } else {
-+ kfree_skb_partial(skb, fragstolen);
-+ return -1;
-+ }
-+ }
-+
-+ /* If here is the last one, we can always queue it */
-+ if (skb_queue_is_last(head, here)) {
-+ __skb_queue_after(head, here, skb);
-+ return 1;
-+ } else {
-+ struct sk_buff *skb1 = skb_queue_next(head, here);
-+ /* It's not the last one, but does it fit between 'here' and
-+ * the one after 'here'? Thus, does end_seq <= after_here->seq
-+ */
-+ if (!after(end_seq, TCP_SKB_CB(skb1)->seq)) {
-+ __skb_queue_after(head, here, skb);
-+ return 1;
-+ }
-+ }
-+
-+ return 0;
-+}
-+
-+static void try_shortcut(struct sk_buff *shortcut, struct sk_buff *skb,
-+ struct sk_buff_head *head, struct tcp_sock *tp)
-+{
-+ struct sock *meta_sk = tp->meta_sk;
-+ struct tcp_sock *tp_it, *meta_tp = tcp_sk(meta_sk);
-+ struct mptcp_cb *mpcb = meta_tp->mpcb;
-+ struct sk_buff *skb1, *best_shortcut = NULL;
-+ u32 seq = TCP_SKB_CB(skb)->seq;
-+ u32 end_seq = TCP_SKB_CB(skb)->end_seq;
-+ u32 distance = 0xffffffff;
-+
-+ /* First, check the tp's shortcut */
-+ if (!shortcut) {
-+ if (skb_queue_empty(head)) {
-+ __skb_queue_head(head, skb);
-+ goto end;
-+ }
-+ } else {
-+ int ret = mptcp_ofo_queue_after(head, skb, shortcut, tp);
-+ /* Is the tp's shortcut a hit? If yes, we insert. */
-+
-+ if (ret) {
-+ skb = (ret > 0) ? skb : NULL;
-+ goto end;
-+ }
-+ }
-+
-+ /* Check the shortcuts of the other subsockets. */
-+ mptcp_for_each_tp(mpcb, tp_it) {
-+ shortcut = tp_it->mptcp->shortcut_ofoqueue;
-+ /* Can we queue it here? If yes, do so! */
-+ if (shortcut) {
-+ int ret = mptcp_ofo_queue_after(head, skb, shortcut, tp);
-+
-+ if (ret) {
-+ skb = (ret > 0) ? skb : NULL;
-+ goto end;
-+ }
-+ }
-+
-+ /* Could not queue it, check if we are close.
-+ * We are looking for a shortcut, close enough to seq to
-+ * set skb1 prematurely and thus improve the subsequent lookup,
-+ * which tries to find a skb1 so that skb1->seq <= seq.
-+ *
-+ * So, here we only take shortcuts, whose shortcut->seq > seq,
-+ * and minimize the distance between shortcut->seq and seq and
-+ * set best_shortcut to this one with the minimal distance.
-+ *
-+ * That way, the subsequent while-loop is shortest.
-+ */
-+ if (shortcut && after(TCP_SKB_CB(shortcut)->seq, seq)) {
-+ /* Are we closer than the current best shortcut? */
-+ if ((u32)(TCP_SKB_CB(shortcut)->seq - seq) < distance) {
-+ distance = (u32)(TCP_SKB_CB(shortcut)->seq - seq);
-+ best_shortcut = shortcut;
-+ }
-+ }
-+ }
-+
-+ if (best_shortcut)
-+ skb1 = best_shortcut;
-+ else
-+ skb1 = skb_peek_tail(head);
-+
-+ if (seq == TCP_SKB_CB(skb1)->end_seq) {
-+ bool fragstolen = false;
-+
-+ if (!tcp_try_coalesce(meta_sk, skb1, skb, &fragstolen)) {
-+ __skb_queue_after(&meta_tp->out_of_order_queue, skb1, skb);
-+ } else {
-+ kfree_skb_partial(skb, fragstolen);
-+ skb = NULL;
-+ }
-+
-+ goto end;
-+ }
-+
-+ /* Find the insertion point, starting from best_shortcut if available.
-+ *
-+ * Inspired from tcp_data_queue_ofo.
-+ */
-+ while (1) {
-+ /* skb1->seq <= seq */
-+ if (!after(TCP_SKB_CB(skb1)->seq, seq))
-+ break;
-+ if (skb_queue_is_first(head, skb1)) {
-+ skb1 = NULL;
-+ break;
-+ }
-+ skb1 = skb_queue_prev(head, skb1);
-+ }
-+
-+ /* Does skb overlap the previous one? */
-+ if (skb1 && before(seq, TCP_SKB_CB(skb1)->end_seq)) {
-+ if (!after(end_seq, TCP_SKB_CB(skb1)->end_seq)) {
-+ /* All the bits are present. */
-+ __kfree_skb(skb);
-+ skb = NULL;
-+ goto end;
-+ }
-+ if (seq == TCP_SKB_CB(skb1)->seq) {
-+ if (skb_queue_is_first(head, skb1))
-+ skb1 = NULL;
-+ else
-+ skb1 = skb_queue_prev(head, skb1);
-+ }
-+ }
-+ if (!skb1)
-+ __skb_queue_head(head, skb);
-+ else
-+ __skb_queue_after(head, skb1, skb);
-+
-+ /* And clean segments covered by new one as whole. */
-+ while (!skb_queue_is_last(head, skb)) {
-+ skb1 = skb_queue_next(head, skb);
-+
-+ if (!after(end_seq, TCP_SKB_CB(skb1)->seq))
-+ break;
-+
-+ __skb_unlink(skb1, head);
-+ mptcp_remove_shortcuts(mpcb, skb1);
-+ __kfree_skb(skb1);
-+ }
-+
-+end:
-+ if (skb) {
-+ skb_set_owner_r(skb, meta_sk);
-+ tp->mptcp->shortcut_ofoqueue = skb;
-+ }
-+
-+ return;
-+}
-+
-+/**
-+ * @sk: the subflow that received this skb.
-+ */
-+void mptcp_add_meta_ofo_queue(const struct sock *meta_sk, struct sk_buff *skb,
-+ struct sock *sk)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+
-+ try_shortcut(tp->mptcp->shortcut_ofoqueue, skb,
-+ &tcp_sk(meta_sk)->out_of_order_queue, tp);
-+}
-+
-+bool mptcp_prune_ofo_queue(struct sock *sk)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ bool res = false;
-+
-+ if (!skb_queue_empty(&tp->out_of_order_queue)) {
-+ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_OFOPRUNED);
-+ mptcp_purge_ofo_queue(tp);
-+
-+ /* No sack at the mptcp-level */
-+ sk_mem_reclaim(sk);
-+ res = true;
-+ }
-+
-+ return res;
-+}
-+
-+void mptcp_ofo_queue(struct sock *meta_sk)
-+{
-+ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
-+ struct sk_buff *skb;
-+
-+ while ((skb = skb_peek(&meta_tp->out_of_order_queue)) != NULL) {
-+ u32 old_rcv_nxt = meta_tp->rcv_nxt;
-+ if (after(TCP_SKB_CB(skb)->seq, meta_tp->rcv_nxt))
-+ break;
-+
-+ if (!after(TCP_SKB_CB(skb)->end_seq, meta_tp->rcv_nxt)) {
-+ __skb_unlink(skb, &meta_tp->out_of_order_queue);
-+ mptcp_remove_shortcuts(meta_tp->mpcb, skb);
-+ __kfree_skb(skb);
-+ continue;
-+ }
-+
-+ __skb_unlink(skb, &meta_tp->out_of_order_queue);
-+ mptcp_remove_shortcuts(meta_tp->mpcb, skb);
-+
-+ __skb_queue_tail(&meta_sk->sk_receive_queue, skb);
-+ meta_tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq;
-+ mptcp_check_rcvseq_wrap(meta_tp, old_rcv_nxt);
-+
-+ if (tcp_hdr(skb)->fin)
-+ mptcp_fin(meta_sk);
-+ }
-+}
-+
-+void mptcp_purge_ofo_queue(struct tcp_sock *meta_tp)
-+{
-+ struct sk_buff_head *head = &meta_tp->out_of_order_queue;
-+ struct sk_buff *skb, *tmp;
-+
-+ skb_queue_walk_safe(head, skb, tmp) {
-+ __skb_unlink(skb, head);
-+ mptcp_remove_shortcuts(meta_tp->mpcb, skb);
-+ kfree_skb(skb);
-+ }
-+}
-diff --git a/net/mptcp/mptcp_olia.c b/net/mptcp/mptcp_olia.c
-new file mode 100644
-index 000000000000..53f5c43bb488
---- /dev/null
-+++ b/net/mptcp/mptcp_olia.c
-@@ -0,0 +1,311 @@
-+/*
-+ * MPTCP implementation - OPPORTUNISTIC LINKED INCREASES CONGESTION CONTROL:
-+ *
-+ * Algorithm design:
-+ * Ramin Khalili <ramin.khalili@epfl.ch>
-+ * Nicolas Gast <nicolas.gast@epfl.ch>
-+ * Jean-Yves Le Boudec <jean-yves.leboudec@epfl.ch>
-+ *
-+ * Implementation:
-+ * Ramin Khalili <ramin.khalili@epfl.ch>
-+ *
-+ * Ported to the official MPTCP-kernel:
-+ * Christoph Paasch <christoph.paasch@uclouvain.be>
-+ *
-+ * This program is free software; you can redistribute it and/or
-+ * modify it under the terms of the GNU General Public License
-+ * as published by the Free Software Foundation; either version
-+ * 2 of the License, or (at your option) any later version.
-+ */
-+
-+
-+#include <net/tcp.h>
-+#include <net/mptcp.h>
-+
-+#include <linux/module.h>
-+
-+static int scale = 10;
-+
-+struct mptcp_olia {
-+ u32 mptcp_loss1;
-+ u32 mptcp_loss2;
-+ u32 mptcp_loss3;
-+ int epsilon_num;
-+ u32 epsilon_den;
-+ int mptcp_snd_cwnd_cnt;
-+};
-+
-+static inline int mptcp_olia_sk_can_send(const struct sock *sk)
-+{
-+ return mptcp_sk_can_send(sk) && tcp_sk(sk)->srtt_us;
-+}
-+
-+static inline u64 mptcp_olia_scale(u64 val, int scale)
-+{
-+ return (u64) val << scale;
-+}
-+
-+/* Take care of the artificial inflation of cwnd (see RFC 5681)
-+ * during the fast-retransmit phase
-+ */
-+static u32 mptcp_get_crt_cwnd(struct sock *sk)
-+{
-+ const struct inet_connection_sock *icsk = inet_csk(sk);
-+
-+ if (icsk->icsk_ca_state == TCP_CA_Recovery)
-+ return tcp_sk(sk)->snd_ssthresh;
-+ else
-+ return tcp_sk(sk)->snd_cwnd;
-+}
-+
-+/* return the denominator of the first term of the increase term */
-+static u64 mptcp_get_rate(const struct mptcp_cb *mpcb , u32 path_rtt)
-+{
-+ struct sock *sk;
-+ u64 rate = 1; /* We have to avoid a zero-rate because it is used as a divisor */
-+
-+ mptcp_for_each_sk(mpcb, sk) {
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ u64 scaled_num;
-+ u32 tmp_cwnd;
-+
-+ if (!mptcp_olia_sk_can_send(sk))
-+ continue;
-+
-+ tmp_cwnd = mptcp_get_crt_cwnd(sk);
-+ scaled_num = mptcp_olia_scale(tmp_cwnd, scale) * path_rtt;
-+ rate += div_u64(scaled_num , tp->srtt_us);
-+ }
-+ rate *= rate;
-+ return rate;
-+}
-+
-+/* find the maximum cwnd, used to find set M */
-+static u32 mptcp_get_max_cwnd(const struct mptcp_cb *mpcb)
-+{
-+ struct sock *sk;
-+ u32 best_cwnd = 0;
-+
-+ mptcp_for_each_sk(mpcb, sk) {
-+ u32 tmp_cwnd;
-+
-+ if (!mptcp_olia_sk_can_send(sk))
-+ continue;
-+
-+ tmp_cwnd = mptcp_get_crt_cwnd(sk);
-+ if (tmp_cwnd > best_cwnd)
-+ best_cwnd = tmp_cwnd;
-+ }
-+ return best_cwnd;
-+}
-+
-+static void mptcp_get_epsilon(const struct mptcp_cb *mpcb)
-+{
-+ struct mptcp_olia *ca;
-+ struct tcp_sock *tp;
-+ struct sock *sk;
-+ u64 tmp_int, tmp_rtt, best_int = 0, best_rtt = 1;
-+ u32 max_cwnd = 1, best_cwnd = 1, tmp_cwnd;
-+ u8 M = 0, B_not_M = 0;
-+
-+ /* TODO - integrate this in the following loop - we just want to iterate once */
-+
-+ max_cwnd = mptcp_get_max_cwnd(mpcb);
-+
-+ /* find the best path */
-+ mptcp_for_each_sk(mpcb, sk) {
-+ tp = tcp_sk(sk);
-+ ca = inet_csk_ca(sk);
-+
-+ if (!mptcp_olia_sk_can_send(sk))
-+ continue;
-+
-+ tmp_rtt = (u64)tp->srtt_us * tp->srtt_us;
-+ /* TODO - check here and rename variables */
-+ tmp_int = max(ca->mptcp_loss3 - ca->mptcp_loss2,
-+ ca->mptcp_loss2 - ca->mptcp_loss1);
-+
-+ tmp_cwnd = mptcp_get_crt_cwnd(sk);
-+ if ((u64)tmp_int * best_rtt >= (u64)best_int * tmp_rtt) {
-+ best_rtt = tmp_rtt;
-+ best_int = tmp_int;
-+ best_cwnd = tmp_cwnd;
-+ }
-+ }
-+
-+ /* TODO - integrate this here in mptcp_get_max_cwnd and in the previous loop */
-+ /* find the size of M and B_not_M */
-+ mptcp_for_each_sk(mpcb, sk) {
-+ tp = tcp_sk(sk);
-+ ca = inet_csk_ca(sk);
-+
-+ if (!mptcp_olia_sk_can_send(sk))
-+ continue;
-+
-+ tmp_cwnd = mptcp_get_crt_cwnd(sk);
-+ if (tmp_cwnd == max_cwnd) {
-+ M++;
-+ } else {
-+ tmp_rtt = (u64)tp->srtt_us * tp->srtt_us;
-+ tmp_int = max(ca->mptcp_loss3 - ca->mptcp_loss2,
-+ ca->mptcp_loss2 - ca->mptcp_loss1);
-+
-+ if ((u64)tmp_int * best_rtt == (u64)best_int * tmp_rtt)
-+ B_not_M++;
-+ }
-+ }
-+
-+ /* check if the path is in M or B_not_M and set the value of epsilon accordingly */
-+ mptcp_for_each_sk(mpcb, sk) {
-+ tp = tcp_sk(sk);
-+ ca = inet_csk_ca(sk);
-+
-+ if (!mptcp_olia_sk_can_send(sk))
-+ continue;
-+
-+ if (B_not_M == 0) {
-+ ca->epsilon_num = 0;
-+ ca->epsilon_den = 1;
-+ } else {
-+ tmp_rtt = (u64)tp->srtt_us * tp->srtt_us;
-+ tmp_int = max(ca->mptcp_loss3 - ca->mptcp_loss2,
-+ ca->mptcp_loss2 - ca->mptcp_loss1);
-+ tmp_cwnd = mptcp_get_crt_cwnd(sk);
-+
-+ if (tmp_cwnd < max_cwnd &&
-+ (u64)tmp_int * best_rtt == (u64)best_int * tmp_rtt) {
-+ ca->epsilon_num = 1;
-+ ca->epsilon_den = mpcb->cnt_established * B_not_M;
-+ } else if (tmp_cwnd == max_cwnd) {
-+ ca->epsilon_num = -1;
-+ ca->epsilon_den = mpcb->cnt_established * M;
-+ } else {
-+ ca->epsilon_num = 0;
-+ ca->epsilon_den = 1;
-+ }
-+ }
-+ }
-+}
-+
-+/* setting the initial values */
-+static void mptcp_olia_init(struct sock *sk)
-+{
-+ const struct tcp_sock *tp = tcp_sk(sk);
-+ struct mptcp_olia *ca = inet_csk_ca(sk);
-+
-+ if (mptcp(tp)) {
-+ ca->mptcp_loss1 = tp->snd_una;
-+ ca->mptcp_loss2 = tp->snd_una;
-+ ca->mptcp_loss3 = tp->snd_una;
-+ ca->mptcp_snd_cwnd_cnt = 0;
-+ ca->epsilon_num = 0;
-+ ca->epsilon_den = 1;
-+ }
-+}
-+
-+/* updating inter-loss distance and ssthresh */
-+static void mptcp_olia_set_state(struct sock *sk, u8 new_state)
-+{
-+ if (!mptcp(tcp_sk(sk)))
-+ return;
-+
-+ if (new_state == TCP_CA_Loss ||
-+ new_state == TCP_CA_Recovery || new_state == TCP_CA_CWR) {
-+ struct mptcp_olia *ca = inet_csk_ca(sk);
-+
-+ if (ca->mptcp_loss3 != ca->mptcp_loss2 &&
-+ !inet_csk(sk)->icsk_retransmits) {
-+ ca->mptcp_loss1 = ca->mptcp_loss2;
-+ ca->mptcp_loss2 = ca->mptcp_loss3;
-+ }
-+ }
-+}
-+
-+/* main algorithm */
-+static void mptcp_olia_cong_avoid(struct sock *sk, u32 ack, u32 acked)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ struct mptcp_olia *ca = inet_csk_ca(sk);
-+ const struct mptcp_cb *mpcb = tp->mpcb;
-+
-+ u64 inc_num, inc_den, rate, cwnd_scaled;
-+
-+ if (!mptcp(tp)) {
-+ tcp_reno_cong_avoid(sk, ack, acked);
-+ return;
-+ }
-+
-+ ca->mptcp_loss3 = tp->snd_una;
-+
-+ if (!tcp_is_cwnd_limited(sk))
-+ return;
-+
-+ /* slow start if it is in the safe area */
-+ if (tp->snd_cwnd <= tp->snd_ssthresh) {
-+ tcp_slow_start(tp, acked);
-+ return;
-+ }
-+
-+ mptcp_get_epsilon(mpcb);
-+ rate = mptcp_get_rate(mpcb, tp->srtt_us);
-+ cwnd_scaled = mptcp_olia_scale(tp->snd_cwnd, scale);
-+ inc_den = ca->epsilon_den * tp->snd_cwnd * rate ? : 1;
-+
-+ /* calculate the increasing term, scaling is used to reduce the rounding effect */
-+ if (ca->epsilon_num == -1) {
-+ if (ca->epsilon_den * cwnd_scaled * cwnd_scaled < rate) {
-+ inc_num = rate - ca->epsilon_den *
-+ cwnd_scaled * cwnd_scaled;
-+ ca->mptcp_snd_cwnd_cnt -= div64_u64(
-+ mptcp_olia_scale(inc_num , scale) , inc_den);
-+ } else {
-+ inc_num = ca->epsilon_den *
-+ cwnd_scaled * cwnd_scaled - rate;
-+ ca->mptcp_snd_cwnd_cnt += div64_u64(
-+ mptcp_olia_scale(inc_num , scale) , inc_den);
-+ }
-+ } else {
-+ inc_num = ca->epsilon_num * rate +
-+ ca->epsilon_den * cwnd_scaled * cwnd_scaled;
-+ ca->mptcp_snd_cwnd_cnt += div64_u64(
-+ mptcp_olia_scale(inc_num , scale) , inc_den);
-+ }
-+
-+
-+ if (ca->mptcp_snd_cwnd_cnt >= (1 << scale) - 1) {
-+ if (tp->snd_cwnd < tp->snd_cwnd_clamp)
-+ tp->snd_cwnd++;
-+ ca->mptcp_snd_cwnd_cnt = 0;
-+ } else if (ca->mptcp_snd_cwnd_cnt <= 0 - (1 << scale) + 1) {
-+ tp->snd_cwnd = max((int) 1 , (int) tp->snd_cwnd - 1);
-+ ca->mptcp_snd_cwnd_cnt = 0;
-+ }
-+}
-+
-+static struct tcp_congestion_ops mptcp_olia = {
-+ .init = mptcp_olia_init,
-+ .ssthresh = tcp_reno_ssthresh,
-+ .cong_avoid = mptcp_olia_cong_avoid,
-+ .set_state = mptcp_olia_set_state,
-+ .owner = THIS_MODULE,
-+ .name = "olia",
-+};
-+
-+static int __init mptcp_olia_register(void)
-+{
-+ BUILD_BUG_ON(sizeof(struct mptcp_olia) > ICSK_CA_PRIV_SIZE);
-+ return tcp_register_congestion_control(&mptcp_olia);
-+}
-+
-+static void __exit mptcp_olia_unregister(void)
-+{
-+ tcp_unregister_congestion_control(&mptcp_olia);
-+}
-+
-+module_init(mptcp_olia_register);
-+module_exit(mptcp_olia_unregister);
-+
-+MODULE_AUTHOR("Ramin Khalili, Nicolas Gast, Jean-Yves Le Boudec");
-+MODULE_LICENSE("GPL");
-+MODULE_DESCRIPTION("MPTCP COUPLED CONGESTION CONTROL");
-+MODULE_VERSION("0.1");
-diff --git a/net/mptcp/mptcp_output.c b/net/mptcp/mptcp_output.c
-new file mode 100644
-index 000000000000..400ea254c078
---- /dev/null
-+++ b/net/mptcp/mptcp_output.c
-@@ -0,0 +1,1743 @@
-+/*
-+ * MPTCP implementation - Sending side
-+ *
-+ * Initial Design & Implementation:
-+ * Sébastien Barré <sebastien.barre@uclouvain.be>
-+ *
-+ * Current Maintainer & Author:
-+ * Christoph Paasch <christoph.paasch@uclouvain.be>
-+ *
-+ * Additional authors:
-+ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
-+ * Gregory Detal <gregory.detal@uclouvain.be>
-+ * Fabien Duchêne <fabien.duchene@uclouvain.be>
-+ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
-+ * Lavkesh Lahngir <lavkesh51@gmail.com>
-+ * Andreas Ripke <ripke@neclab.eu>
-+ * Vlad Dogaru <vlad.dogaru@intel.com>
-+ * Octavian Purdila <octavian.purdila@intel.com>
-+ * John Ronan <jronan@tssg.org>
-+ * Catalin Nicutar <catalin.nicutar@gmail.com>
-+ * Brandon Heller <brandonh@stanford.edu>
-+ *
-+ *
-+ * This program is free software; you can redistribute it and/or
-+ * modify it under the terms of the GNU General Public License
-+ * as published by the Free Software Foundation; either version
-+ * 2 of the License, or (at your option) any later version.
-+ */
-+
-+#include <linux/kconfig.h>
-+#include <linux/skbuff.h>
-+#include <linux/tcp.h>
-+
-+#include <net/mptcp.h>
-+#include <net/mptcp_v4.h>
-+#include <net/mptcp_v6.h>
-+#include <net/sock.h>
-+
-+static const int mptcp_dss_len = MPTCP_SUB_LEN_DSS_ALIGN +
-+ MPTCP_SUB_LEN_ACK_ALIGN +
-+ MPTCP_SUB_LEN_SEQ_ALIGN;
-+
-+static inline int mptcp_sub_len_remove_addr(u16 bitfield)
-+{
-+ unsigned int c;
-+ for (c = 0; bitfield; c++)
-+ bitfield &= bitfield - 1;
-+ return MPTCP_SUB_LEN_REMOVE_ADDR + c - 1;
-+}
-+
-+int mptcp_sub_len_remove_addr_align(u16 bitfield)
-+{
-+ return ALIGN(mptcp_sub_len_remove_addr(bitfield), 4);
-+}
-+EXPORT_SYMBOL(mptcp_sub_len_remove_addr_align);
-+
-+/* get the data-seq and end-data-seq and store them again in the
-+ * tcp_skb_cb
-+ */
-+static int mptcp_reconstruct_mapping(struct sk_buff *skb)
-+{
-+ const struct mp_dss *mpdss = (struct mp_dss *)TCP_SKB_CB(skb)->dss;
-+ u32 *p32;
-+ u16 *p16;
-+
-+ if (!mpdss->M)
-+ return 1;
-+
-+ /* Move the pointer to the data-seq */
-+ p32 = (u32 *)mpdss;
-+ p32++;
-+ if (mpdss->A) {
-+ p32++;
-+ if (mpdss->a)
-+ p32++;
-+ }
-+
-+ TCP_SKB_CB(skb)->seq = ntohl(*p32);
-+
-+ /* Get the data_len to calculate the end_data_seq */
-+ p32++;
-+ p32++;
-+ p16 = (u16 *)p32;
-+ TCP_SKB_CB(skb)->end_seq = ntohs(*p16) + TCP_SKB_CB(skb)->seq;
-+
-+ return 0;
-+}
-+
-+static void mptcp_find_and_set_pathmask(const struct sock *meta_sk, struct sk_buff *skb)
-+{
-+ struct sk_buff *skb_it;
-+
-+ skb_it = tcp_write_queue_head(meta_sk);
-+
-+ tcp_for_write_queue_from(skb_it, meta_sk) {
-+ if (skb_it == tcp_send_head(meta_sk))
-+ break;
-+
-+ if (TCP_SKB_CB(skb_it)->seq == TCP_SKB_CB(skb)->seq) {
-+ TCP_SKB_CB(skb)->path_mask = TCP_SKB_CB(skb_it)->path_mask;
-+ break;
-+ }
-+ }
-+}
-+
-+/* Reinject data from one TCP subflow to the meta_sk. If sk == NULL, we are
-+ * coming from the meta-retransmit-timer
-+ */
-+static void __mptcp_reinject_data(struct sk_buff *orig_skb, struct sock *meta_sk,
-+ struct sock *sk, int clone_it)
-+{
-+ struct sk_buff *skb, *skb1;
-+ const struct tcp_sock *meta_tp = tcp_sk(meta_sk);
-+ struct mptcp_cb *mpcb = meta_tp->mpcb;
-+ u32 seq, end_seq;
-+
-+ if (clone_it) {
-+ /* pskb_copy is necessary here, because the TCP/IP-headers
-+ * will be changed when it's going to be reinjected on another
-+ * subflow.
-+ */
-+ skb = pskb_copy_for_clone(orig_skb, GFP_ATOMIC);
-+ } else {
-+ __skb_unlink(orig_skb, &sk->sk_write_queue);
-+ sock_set_flag(sk, SOCK_QUEUE_SHRUNK);
-+ sk->sk_wmem_queued -= orig_skb->truesize;
-+ sk_mem_uncharge(sk, orig_skb->truesize);
-+ skb = orig_skb;
-+ }
-+ if (unlikely(!skb))
-+ return;
-+
-+ if (sk && mptcp_reconstruct_mapping(skb)) {
-+ __kfree_skb(skb);
-+ return;
-+ }
-+
-+ skb->sk = meta_sk;
-+
-+ /* If it reached already the destination, we don't have to reinject it */
-+ if (!after(TCP_SKB_CB(skb)->end_seq, meta_tp->snd_una)) {
-+ __kfree_skb(skb);
-+ return;
-+ }
-+
-+ /* Only reinject segments that are fully covered by the mapping */
-+ if (skb->len + (mptcp_is_data_fin(skb) ? 1 : 0) !=
-+ TCP_SKB_CB(skb)->end_seq - TCP_SKB_CB(skb)->seq) {
-+ u32 seq = TCP_SKB_CB(skb)->seq;
-+ u32 end_seq = TCP_SKB_CB(skb)->end_seq;
-+
-+ __kfree_skb(skb);
-+
-+ /* Ok, now we have to look for the full mapping in the meta
-+ * send-queue :S
-+ */
-+ tcp_for_write_queue(skb, meta_sk) {
-+ /* Not yet at the mapping? */
-+ if (before(TCP_SKB_CB(skb)->seq, seq))
-+ continue;
-+ /* We have passed by the mapping */
-+ if (after(TCP_SKB_CB(skb)->end_seq, end_seq))
-+ return;
-+
-+ __mptcp_reinject_data(skb, meta_sk, NULL, 1);
-+ }
-+ return;
-+ }
-+
-+ /* Segment goes back to the MPTCP-layer. So, we need to zero the
-+ * path_mask/dss.
-+ */
-+ memset(TCP_SKB_CB(skb)->dss, 0 , mptcp_dss_len);
-+
-+ /* We need to find out the path-mask from the meta-write-queue
-+ * to properly select a subflow.
-+ */
-+ mptcp_find_and_set_pathmask(meta_sk, skb);
-+
-+ /* If it's empty, just add */
-+ if (skb_queue_empty(&mpcb->reinject_queue)) {
-+ skb_queue_head(&mpcb->reinject_queue, skb);
-+ return;
-+ }
-+
-+ /* Find place to insert skb - or even we can 'drop' it, as the
-+ * data is already covered by other skb's in the reinject-queue.
-+ *
-+ * This is inspired by code from tcp_data_queue.
-+ */
-+
-+ skb1 = skb_peek_tail(&mpcb->reinject_queue);
-+ seq = TCP_SKB_CB(skb)->seq;
-+ while (1) {
-+ if (!after(TCP_SKB_CB(skb1)->seq, seq))
-+ break;
-+ if (skb_queue_is_first(&mpcb->reinject_queue, skb1)) {
-+ skb1 = NULL;
-+ break;
-+ }
-+ skb1 = skb_queue_prev(&mpcb->reinject_queue, skb1);
-+ }
-+
-+ /* Does skb overlap the previous one? */
-+ end_seq = TCP_SKB_CB(skb)->end_seq;
-+ if (skb1 && before(seq, TCP_SKB_CB(skb1)->end_seq)) {
-+ if (!after(end_seq, TCP_SKB_CB(skb1)->end_seq)) {
-+ /* All the bits are present. Don't reinject */
-+ __kfree_skb(skb);
-+ return;
-+ }
-+ if (seq == TCP_SKB_CB(skb1)->seq) {
-+ if (skb_queue_is_first(&mpcb->reinject_queue, skb1))
-+ skb1 = NULL;
-+ else
-+ skb1 = skb_queue_prev(&mpcb->reinject_queue, skb1);
-+ }
-+ }
-+ if (!skb1)
-+ __skb_queue_head(&mpcb->reinject_queue, skb);
-+ else
-+ __skb_queue_after(&mpcb->reinject_queue, skb1, skb);
-+
-+ /* And clean segments covered by new one as whole. */
-+ while (!skb_queue_is_last(&mpcb->reinject_queue, skb)) {
-+ skb1 = skb_queue_next(&mpcb->reinject_queue, skb);
-+
-+ if (!after(end_seq, TCP_SKB_CB(skb1)->seq))
-+ break;
-+
-+ __skb_unlink(skb1, &mpcb->reinject_queue);
-+ __kfree_skb(skb1);
-+ }
-+ return;
-+}
-+
-+/* Inserts data into the reinject queue */
-+void mptcp_reinject_data(struct sock *sk, int clone_it)
-+{
-+ struct sk_buff *skb_it, *tmp;
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ struct sock *meta_sk = tp->meta_sk;
-+
-+ /* It has already been closed - there is really no point in reinjecting */
-+ if (meta_sk->sk_state == TCP_CLOSE)
-+ return;
-+
-+ skb_queue_walk_safe(&sk->sk_write_queue, skb_it, tmp) {
-+ struct tcp_skb_cb *tcb = TCP_SKB_CB(skb_it);
-+ /* Subflow syn's and fin's are not reinjected.
-+ *
-+ * As well as empty subflow-fins with a data-fin.
-+ * They are reinjected below (without the subflow-fin-flag)
-+ */
-+ if (tcb->tcp_flags & TCPHDR_SYN ||
-+ (tcb->tcp_flags & TCPHDR_FIN && !mptcp_is_data_fin(skb_it)) ||
-+ (tcb->tcp_flags & TCPHDR_FIN && mptcp_is_data_fin(skb_it) && !skb_it->len))
-+ continue;
-+
-+ __mptcp_reinject_data(skb_it, meta_sk, sk, clone_it);
-+ }
-+
-+ skb_it = tcp_write_queue_tail(meta_sk);
-+ /* If sk has sent the empty data-fin, we have to reinject it too. */
-+ if (skb_it && mptcp_is_data_fin(skb_it) && skb_it->len == 0 &&
-+ TCP_SKB_CB(skb_it)->path_mask & mptcp_pi_to_flag(tp->mptcp->path_index)) {
-+ __mptcp_reinject_data(skb_it, meta_sk, NULL, 1);
-+ }
-+
-+ mptcp_push_pending_frames(meta_sk);
-+
-+ tp->pf = 1;
-+}
-+EXPORT_SYMBOL(mptcp_reinject_data);
-+
-+static void mptcp_combine_dfin(const struct sk_buff *skb, const struct sock *meta_sk,
-+ struct sock *subsk)
-+{
-+ const struct tcp_sock *meta_tp = tcp_sk(meta_sk);
-+ struct mptcp_cb *mpcb = meta_tp->mpcb;
-+ struct sock *sk_it;
-+ int all_empty = 1, all_acked;
-+
-+ /* In infinite mapping we always try to combine */
-+ if (mpcb->infinite_mapping_snd && tcp_close_state(subsk)) {
-+ subsk->sk_shutdown |= SEND_SHUTDOWN;
-+ TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_FIN;
-+ return;
-+ }
-+
-+ /* Don't combine, if they didn't combine - otherwise we end up in
-+ * TIME_WAIT, even if our app is smart enough to avoid it
-+ */
-+ if (meta_sk->sk_shutdown & RCV_SHUTDOWN) {
-+ if (!mpcb->dfin_combined)
-+ return;
-+ }
-+
-+ /* If no other subflow has data to send, we can combine */
-+ mptcp_for_each_sk(mpcb, sk_it) {
-+ if (!mptcp_sk_can_send(sk_it))
-+ continue;
-+
-+ if (!tcp_write_queue_empty(sk_it))
-+ all_empty = 0;
-+ }
-+
-+ /* If all data has been DATA_ACKed, we can combine.
-+ * -1, because the data_fin consumed one byte
-+ */
-+ all_acked = (meta_tp->snd_una == (meta_tp->write_seq - 1));
-+
-+ if ((all_empty || all_acked) && tcp_close_state(subsk)) {
-+ subsk->sk_shutdown |= SEND_SHUTDOWN;
-+ TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_FIN;
-+ }
-+}
-+
-+static int mptcp_write_dss_mapping(const struct tcp_sock *tp, const struct sk_buff *skb,
-+ __be32 *ptr)
-+{
-+ const struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
-+ __be32 *start = ptr;
-+ __u16 data_len;
-+
-+ *ptr++ = htonl(tcb->seq); /* data_seq */
-+
-+ /* If it's a non-data DATA_FIN, we set subseq to 0 (draft v7) */
-+ if (mptcp_is_data_fin(skb) && skb->len == 0)
-+ *ptr++ = 0; /* subseq */
-+ else
-+ *ptr++ = htonl(tp->write_seq - tp->mptcp->snt_isn); /* subseq */
-+
-+ if (tcb->mptcp_flags & MPTCPHDR_INF)
-+ data_len = 0;
-+ else
-+ data_len = tcb->end_seq - tcb->seq;
-+
-+ if (tp->mpcb->dss_csum && data_len) {
-+ __be16 *p16 = (__be16 *)ptr;
-+ __be32 hdseq = mptcp_get_highorder_sndbits(skb, tp->mpcb);
-+ __wsum csum;
-+
-+ *ptr = htonl(((data_len) << 16) |
-+ (TCPOPT_EOL << 8) |
-+ (TCPOPT_EOL));
-+ csum = csum_partial(ptr - 2, 12, skb->csum);
-+ p16++;
-+ *p16++ = csum_fold(csum_partial(&hdseq, sizeof(hdseq), csum));
-+ } else {
-+ *ptr++ = htonl(((data_len) << 16) |
-+ (TCPOPT_NOP << 8) |
-+ (TCPOPT_NOP));
-+ }
-+
-+ return ptr - start;
-+}
-+
-+static int mptcp_write_dss_data_ack(const struct tcp_sock *tp, const struct sk_buff *skb,
-+ __be32 *ptr)
-+{
-+ struct mp_dss *mdss = (struct mp_dss *)ptr;
-+ __be32 *start = ptr;
-+
-+ mdss->kind = TCPOPT_MPTCP;
-+ mdss->sub = MPTCP_SUB_DSS;
-+ mdss->rsv1 = 0;
-+ mdss->rsv2 = 0;
-+ mdss->F = mptcp_is_data_fin(skb) ? 1 : 0;
-+ mdss->m = 0;
-+ mdss->M = mptcp_is_data_seq(skb) ? 1 : 0;
-+ mdss->a = 0;
-+ mdss->A = 1;
-+ mdss->len = mptcp_sub_len_dss(mdss, tp->mpcb->dss_csum);
-+ ptr++;
-+
-+ *ptr++ = htonl(mptcp_meta_tp(tp)->rcv_nxt);
-+
-+ return ptr - start;
-+}
-+
-+/* RFC6824 states that once a particular subflow mapping has been sent
-+ * out it must never be changed. However, packets may be split while
-+ * they are in the retransmission queue (due to SACK or ACKs) and that
-+ * arguably means that we would change the mapping (e.g. it splits it,
-+ * or sends out a subset of the initial mapping).
-+ *
-+ * Furthermore, the skb checksum is not always preserved across splits
-+ * (e.g. mptcp_fragment) which would mean that we need to recompute
-+ * the DSS checksum in this case.
-+ *
-+ * To avoid this we save the initial DSS mapping which allows us to
-+ * send the same DSS mapping even for fragmented retransmits.
-+ */
-+static void mptcp_save_dss_data_seq(const struct tcp_sock *tp, struct sk_buff *skb)
-+{
-+ struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
-+ __be32 *ptr = (__be32 *)tcb->dss;
-+
-+ tcb->mptcp_flags |= MPTCPHDR_SEQ;
-+
-+ ptr += mptcp_write_dss_data_ack(tp, skb, ptr);
-+ ptr += mptcp_write_dss_mapping(tp, skb, ptr);
-+}
-+
-+/* Write the saved DSS mapping to the header */
-+static int mptcp_write_dss_data_seq(const struct tcp_sock *tp, struct sk_buff *skb,
-+ __be32 *ptr)
-+{
-+ __be32 *start = ptr;
-+
-+ memcpy(ptr, TCP_SKB_CB(skb)->dss, mptcp_dss_len);
-+
-+ /* update the data_ack */
-+ start[1] = htonl(mptcp_meta_tp(tp)->rcv_nxt);
-+
-+ /* dss is in a union with inet_skb_parm and
-+ * the IP layer expects zeroed IPCB fields.
-+ */
-+ memset(TCP_SKB_CB(skb)->dss, 0 , mptcp_dss_len);
-+
-+ return mptcp_dss_len/sizeof(*ptr);
-+}
-+
-+static bool mptcp_skb_entail(struct sock *sk, struct sk_buff *skb, int reinject)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ const struct sock *meta_sk = mptcp_meta_sk(sk);
-+ const struct mptcp_cb *mpcb = tp->mpcb;
-+ struct tcp_skb_cb *tcb;
-+ struct sk_buff *subskb = NULL;
-+
-+ if (!reinject)
-+ TCP_SKB_CB(skb)->mptcp_flags |= (mpcb->snd_hiseq_index ?
-+ MPTCPHDR_SEQ64_INDEX : 0);
-+
-+ subskb = pskb_copy_for_clone(skb, GFP_ATOMIC);
-+ if (!subskb)
-+ return false;
-+
-+ /* At the subflow-level we need to call again tcp_init_tso_segs. We
-+ * force this, by setting gso_segs to 0. It has been set to 1 prior to
-+ * the call to mptcp_skb_entail.
-+ */
-+ skb_shinfo(subskb)->gso_segs = 0;
-+
-+ TCP_SKB_CB(skb)->path_mask |= mptcp_pi_to_flag(tp->mptcp->path_index);
-+
-+ if (!(sk->sk_route_caps & NETIF_F_ALL_CSUM) &&
-+ skb->ip_summed == CHECKSUM_PARTIAL) {
-+ subskb->csum = skb->csum = skb_checksum(skb, 0, skb->len, 0);
-+ subskb->ip_summed = skb->ip_summed = CHECKSUM_NONE;
-+ }
-+
-+ tcb = TCP_SKB_CB(subskb);
-+
-+ if (tp->mpcb->send_infinite_mapping &&
-+ !tp->mpcb->infinite_mapping_snd &&
-+ !before(tcb->seq, mptcp_meta_tp(tp)->snd_nxt)) {
-+ tp->mptcp->fully_established = 1;
-+ tp->mpcb->infinite_mapping_snd = 1;
-+ tp->mptcp->infinite_cutoff_seq = tp->write_seq;
-+ tcb->mptcp_flags |= MPTCPHDR_INF;
-+ }
-+
-+ if (mptcp_is_data_fin(subskb))
-+ mptcp_combine_dfin(subskb, meta_sk, sk);
-+
-+ mptcp_save_dss_data_seq(tp, subskb);
-+
-+ tcb->seq = tp->write_seq;
-+ tcb->sacked = 0; /* reset the sacked field: from the point of view
-+ * of this subflow, we are sending a brand new
-+ * segment
-+ */
-+ /* Take into account seg len */
-+ tp->write_seq += subskb->len + ((tcb->tcp_flags & TCPHDR_FIN) ? 1 : 0);
-+ tcb->end_seq = tp->write_seq;
-+
-+ /* If it's a non-payload DATA_FIN (also no subflow-fin), the
-+ * segment is not part of the subflow but on a meta-only-level.
-+ */
-+ if (!mptcp_is_data_fin(subskb) || tcb->end_seq != tcb->seq) {
-+ tcp_add_write_queue_tail(sk, subskb);
-+ sk->sk_wmem_queued += subskb->truesize;
-+ sk_mem_charge(sk, subskb->truesize);
-+ } else {
-+ int err;
-+
-+ /* Necessary to initialize for tcp_transmit_skb. mss of 1, as
-+ * skb->len = 0 will force tso_segs to 1.
-+ */
-+ tcp_init_tso_segs(sk, subskb, 1);
-+ /* Empty data-fins are sent immediately on the subflow */
-+ TCP_SKB_CB(subskb)->when = tcp_time_stamp;
-+ err = tcp_transmit_skb(sk, subskb, 1, GFP_ATOMIC);
-+
-+ /* It has not been queued, we can free it now. */
-+ kfree_skb(subskb);
-+
-+ if (err)
-+ return false;
-+ }
-+
-+ if (!tp->mptcp->fully_established) {
-+ tp->mptcp->second_packet = 1;
-+ tp->mptcp->last_end_data_seq = TCP_SKB_CB(skb)->end_seq;
-+ }
-+
-+ return true;
-+}
-+
-+/* Fragment an skb and update the mptcp meta-data. Due to reinject, we
-+ * might need to undo some operations done by tcp_fragment.
-+ */
-+static int mptcp_fragment(struct sock *meta_sk, struct sk_buff *skb, u32 len,
-+ gfp_t gfp, int reinject)
-+{
-+ int ret, diff, old_factor;
-+ struct sk_buff *buff;
-+ u8 flags;
-+
-+ if (skb_headlen(skb) < len)
-+ diff = skb->len - len;
-+ else
-+ diff = skb->data_len;
-+ old_factor = tcp_skb_pcount(skb);
-+
-+ /* The mss_now in tcp_fragment is used to set the tso_segs of the skb.
-+ * At the MPTCP-level we do not care about the absolute value. All we
-+ * care about is that it is set to 1 for accurate packets_out
-+ * accounting.
-+ */
-+ ret = tcp_fragment(meta_sk, skb, len, UINT_MAX, gfp);
-+ if (ret)
-+ return ret;
-+
-+ buff = skb->next;
-+
-+ flags = TCP_SKB_CB(skb)->mptcp_flags;
-+ TCP_SKB_CB(skb)->mptcp_flags = flags & ~(MPTCPHDR_FIN);
-+ TCP_SKB_CB(buff)->mptcp_flags = flags;
-+ TCP_SKB_CB(buff)->path_mask = TCP_SKB_CB(skb)->path_mask;
-+
-+ /* If reinject == 1, the buff will be added to the reinject
-+ * queue, which is currently not part of memory accounting. So
-+ * undo the changes done by tcp_fragment and update the
-+ * reinject queue. Also, undo changes to the packet counters.
-+ */
-+ if (reinject == 1) {
-+ int undo = buff->truesize - diff;
-+ meta_sk->sk_wmem_queued -= undo;
-+ sk_mem_uncharge(meta_sk, undo);
-+
-+ tcp_sk(meta_sk)->mpcb->reinject_queue.qlen++;
-+ meta_sk->sk_write_queue.qlen--;
-+
-+ if (!before(tcp_sk(meta_sk)->snd_nxt, TCP_SKB_CB(buff)->end_seq)) {
-+ undo = old_factor - tcp_skb_pcount(skb) -
-+ tcp_skb_pcount(buff);
-+ if (undo)
-+ tcp_adjust_pcount(meta_sk, skb, -undo);
-+ }
-+ }
-+
-+ return 0;
-+}
-+
-+/* Inspired by tcp_write_wakeup */
-+int mptcp_write_wakeup(struct sock *meta_sk)
-+{
-+ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
-+ struct sk_buff *skb;
-+ struct sock *sk_it;
-+ int ans = 0;
-+
-+ if (meta_sk->sk_state == TCP_CLOSE)
-+ return -1;
-+
-+ skb = tcp_send_head(meta_sk);
-+ if (skb &&
-+ before(TCP_SKB_CB(skb)->seq, tcp_wnd_end(meta_tp))) {
-+ unsigned int mss;
-+ unsigned int seg_size = tcp_wnd_end(meta_tp) - TCP_SKB_CB(skb)->seq;
-+ struct sock *subsk = meta_tp->mpcb->sched_ops->get_subflow(meta_sk, skb, true);
-+ struct tcp_sock *subtp;
-+ if (!subsk)
-+ goto window_probe;
-+ subtp = tcp_sk(subsk);
-+ mss = tcp_current_mss(subsk);
-+
-+ seg_size = min(tcp_wnd_end(meta_tp) - TCP_SKB_CB(skb)->seq,
-+ tcp_wnd_end(subtp) - subtp->write_seq);
-+
-+ if (before(meta_tp->pushed_seq, TCP_SKB_CB(skb)->end_seq))
-+ meta_tp->pushed_seq = TCP_SKB_CB(skb)->end_seq;
-+
-+ /* We are probing the opening of a window
-+ * but the window size is != 0
-+ * must have been a result of SWS avoidance (sender)
-+ */
-+ if (seg_size < TCP_SKB_CB(skb)->end_seq - TCP_SKB_CB(skb)->seq ||
-+ skb->len > mss) {
-+ seg_size = min(seg_size, mss);
-+ TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_PSH;
-+ if (mptcp_fragment(meta_sk, skb, seg_size,
-+ GFP_ATOMIC, 0))
-+ return -1;
-+ } else if (!tcp_skb_pcount(skb)) {
-+ /* see mptcp_write_xmit on why we use UINT_MAX */
-+ tcp_set_skb_tso_segs(meta_sk, skb, UINT_MAX);
-+ }
-+
-+ TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_PSH;
-+ if (!mptcp_skb_entail(subsk, skb, 0))
-+ return -1;
-+ TCP_SKB_CB(skb)->when = tcp_time_stamp;
-+
-+ mptcp_check_sndseq_wrap(meta_tp, TCP_SKB_CB(skb)->end_seq -
-+ TCP_SKB_CB(skb)->seq);
-+ tcp_event_new_data_sent(meta_sk, skb);
-+
-+ __tcp_push_pending_frames(subsk, mss, TCP_NAGLE_PUSH);
-+
-+ return 0;
-+ } else {
-+window_probe:
-+ if (between(meta_tp->snd_up, meta_tp->snd_una + 1,
-+ meta_tp->snd_una + 0xFFFF)) {
-+ mptcp_for_each_sk(meta_tp->mpcb, sk_it) {
-+ if (mptcp_sk_can_send_ack(sk_it))
-+ tcp_xmit_probe_skb(sk_it, 1);
-+ }
-+ }
-+
-+ /* At least one of the tcp_xmit_probe_skb calls has to succeed */
-+ mptcp_for_each_sk(meta_tp->mpcb, sk_it) {
-+ int ret;
-+
-+ if (!mptcp_sk_can_send_ack(sk_it))
-+ continue;
-+
-+ ret = tcp_xmit_probe_skb(sk_it, 0);
-+ if (unlikely(ret > 0))
-+ ans = ret;
-+ }
-+ return ans;
-+ }
-+}
-+
-+bool mptcp_write_xmit(struct sock *meta_sk, unsigned int mss_now, int nonagle,
-+ int push_one, gfp_t gfp)
-+{
-+ struct tcp_sock *meta_tp = tcp_sk(meta_sk), *subtp;
-+ struct sock *subsk = NULL;
-+ struct mptcp_cb *mpcb = meta_tp->mpcb;
-+ struct sk_buff *skb;
-+ unsigned int sent_pkts;
-+ int reinject = 0;
-+ unsigned int sublimit;
-+
-+ sent_pkts = 0;
-+
-+ while ((skb = mpcb->sched_ops->next_segment(meta_sk, &reinject, &subsk,
-+ &sublimit))) {
-+ unsigned int limit;
-+
-+ subtp = tcp_sk(subsk);
-+ mss_now = tcp_current_mss(subsk);
-+
-+ if (reinject == 1) {
-+ if (!after(TCP_SKB_CB(skb)->end_seq, meta_tp->snd_una)) {
-+ /* Segment already reached the peer, take the next one */
-+ __skb_unlink(skb, &mpcb->reinject_queue);
-+ __kfree_skb(skb);
-+ continue;
-+ }
-+ }
-+
-+ /* If the segment was cloned (e.g. a meta retransmission),
-+ * the header must be expanded/copied so that there is no
-+ * corruption of TSO information.
-+ */
-+ if (skb_unclone(skb, GFP_ATOMIC))
-+ break;
-+
-+ if (unlikely(!tcp_snd_wnd_test(meta_tp, skb, mss_now)))
-+ break;
-+
-+ /* Force tso_segs to 1 by using UINT_MAX.
-+ * We actually don't care about the exact number of segments
-+ * emitted on the subflow. We need just to set tso_segs, because
-+ * we still need an accurate packets_out count in
-+ * tcp_event_new_data_sent.
-+ */
-+ tcp_set_skb_tso_segs(meta_sk, skb, UINT_MAX);
-+
-+ /* Check for nagle, regardless of tso_segs. If the segment is
-+ * actually larger than mss_now (TSO segment), then
-+ * tcp_nagle_check will have partial == false and always trigger
-+ * the transmission.
-+ * tcp_write_xmit has a TSO-level nagle check which is not
-+ * subject to the MPTCP-level. It is based on the properties of
-+ * the subflow, not the MPTCP-level.
-+ */
-+ if (unlikely(!tcp_nagle_test(meta_tp, skb, mss_now,
-+ (tcp_skb_is_last(meta_sk, skb) ?
-+ nonagle : TCP_NAGLE_PUSH))))
-+ break;
-+
-+ limit = mss_now;
-+ /* skb->len > mss_now is the equivalent of tso_segs > 1 in
-+ * tcp_write_xmit. Otherwise split-point would return 0.
-+ */
-+ if (skb->len > mss_now && !tcp_urg_mode(meta_tp))
-+ /* We limit the size of the skb so that it fits into the
-+ * window. Call tcp_mss_split_point to avoid duplicating
-+ * code.
-+ * We really only care about fitting the skb into the
-+ * window. That's why we use UINT_MAX. If the skb does
-+ * not fit into the cwnd_quota or the NIC's max-segs
-+ * limitation, it will be split by the subflow's
-+ * tcp_write_xmit which does the appropriate call to
-+ * tcp_mss_split_point.
-+ */
-+ limit = tcp_mss_split_point(meta_sk, skb, mss_now,
-+ UINT_MAX / mss_now,
-+ nonagle);
-+
-+ if (sublimit)
-+ limit = min(limit, sublimit);
-+
-+ if (skb->len > limit &&
-+ unlikely(mptcp_fragment(meta_sk, skb, limit, gfp, reinject)))
-+ break;
-+
-+ if (!mptcp_skb_entail(subsk, skb, reinject))
-+ break;
-+ /* Nagle is handled at the MPTCP-layer, so
-+ * always push on the subflow
-+ */
-+ __tcp_push_pending_frames(subsk, mss_now, TCP_NAGLE_PUSH);
-+ TCP_SKB_CB(skb)->when = tcp_time_stamp;
-+
-+ if (!reinject) {
-+ mptcp_check_sndseq_wrap(meta_tp,
-+ TCP_SKB_CB(skb)->end_seq -
-+ TCP_SKB_CB(skb)->seq);
-+ tcp_event_new_data_sent(meta_sk, skb);
-+ }
-+
-+ tcp_minshall_update(meta_tp, mss_now, skb);
-+ sent_pkts += tcp_skb_pcount(skb);
-+
-+ if (reinject > 0) {
-+ __skb_unlink(skb, &mpcb->reinject_queue);
-+ kfree_skb(skb);
-+ }
-+
-+ if (push_one)
-+ break;
-+ }
-+
-+ return !meta_tp->packets_out && tcp_send_head(meta_sk);
-+}
-+
-+void mptcp_write_space(struct sock *sk)
-+{
-+ mptcp_push_pending_frames(mptcp_meta_sk(sk));
-+}
-+
-+u32 __mptcp_select_window(struct sock *sk)
-+{
-+ struct inet_connection_sock *icsk = inet_csk(sk);
-+ struct tcp_sock *tp = tcp_sk(sk), *meta_tp = mptcp_meta_tp(tp);
-+ int mss, free_space, full_space, window;
-+
-+ /* MSS for the peer's data. Previous versions used mss_clamp
-+ * here. I don't know if the value based on our guesses
-+ * of peer's MSS is better for the performance. It's more correct
-+ * but may be worse for the performance because of rcv_mss
-+ * fluctuations. --SAW 1998/11/1
-+ */
-+ mss = icsk->icsk_ack.rcv_mss;
-+ free_space = tcp_space(sk);
-+ full_space = min_t(int, meta_tp->window_clamp,
-+ tcp_full_space(sk));
-+
-+ if (mss > full_space)
-+ mss = full_space;
-+
-+ if (free_space < (full_space >> 1)) {
-+ icsk->icsk_ack.quick = 0;
-+
-+ if (tcp_memory_pressure)
-+ /* TODO this has to be adapted when we support different
-+ * MSS's among the subflows.
-+ */
-+ meta_tp->rcv_ssthresh = min(meta_tp->rcv_ssthresh,
-+ 4U * meta_tp->advmss);
-+
-+ if (free_space < mss)
-+ return 0;
-+ }
-+
-+ if (free_space > meta_tp->rcv_ssthresh)
-+ free_space = meta_tp->rcv_ssthresh;
-+
-+ /* Don't do rounding if we are using window scaling, since the
-+ * scaled window will not line up with the MSS boundary anyway.
-+ */
-+ window = meta_tp->rcv_wnd;
-+ if (tp->rx_opt.rcv_wscale) {
-+ window = free_space;
-+
-+ /* Advertise enough space so that it won't get scaled away.
-+ * Important case: prevent zero window announcement if
-+ * 1<<rcv_wscale > mss.
-+ */
-+ if (((window >> tp->rx_opt.rcv_wscale) << tp->
-+ rx_opt.rcv_wscale) != window)
-+ window = (((window >> tp->rx_opt.rcv_wscale) + 1)
-+ << tp->rx_opt.rcv_wscale);
-+ } else {
-+ /* Get the largest window that is a nice multiple of mss.
-+ * Window clamp already applied above.
-+ * If our current window offering is within 1 mss of the
-+ * free space we just keep it. This prevents the divide
-+ * and multiply from happening most of the time.
-+ * We also don't do any window rounding when the free space
-+ * is too small.
-+ */
-+ if (window <= free_space - mss || window > free_space)
-+ window = (free_space / mss) * mss;
-+ else if (mss == full_space &&
-+ free_space > window + (full_space >> 1))
-+ window = free_space;
-+ }
-+
-+ return window;
-+}
-+
-+void mptcp_syn_options(const struct sock *sk, struct tcp_out_options *opts,
-+ unsigned *remaining)
-+{
-+ const struct tcp_sock *tp = tcp_sk(sk);
-+
-+ opts->options |= OPTION_MPTCP;
-+ if (is_master_tp(tp)) {
-+ opts->mptcp_options |= OPTION_MP_CAPABLE | OPTION_TYPE_SYN;
-+ *remaining -= MPTCP_SUB_LEN_CAPABLE_SYN_ALIGN;
-+ opts->mp_capable.sender_key = tp->mptcp_loc_key;
-+ opts->dss_csum = !!sysctl_mptcp_checksum;
-+ } else {
-+ const struct mptcp_cb *mpcb = tp->mpcb;
-+
-+ opts->mptcp_options |= OPTION_MP_JOIN | OPTION_TYPE_SYN;
-+ *remaining -= MPTCP_SUB_LEN_JOIN_SYN_ALIGN;
-+ opts->mp_join_syns.token = mpcb->mptcp_rem_token;
-+ opts->mp_join_syns.low_prio = tp->mptcp->low_prio;
-+ opts->addr_id = tp->mptcp->loc_id;
-+ opts->mp_join_syns.sender_nonce = tp->mptcp->mptcp_loc_nonce;
-+ }
-+}
-+
-+void mptcp_synack_options(struct request_sock *req,
-+ struct tcp_out_options *opts, unsigned *remaining)
-+{
-+ struct mptcp_request_sock *mtreq;
-+ mtreq = mptcp_rsk(req);
-+
-+ opts->options |= OPTION_MPTCP;
-+ /* MPCB not yet set - thus it's a new MPTCP-session */
-+ if (!mtreq->is_sub) {
-+ opts->mptcp_options |= OPTION_MP_CAPABLE | OPTION_TYPE_SYNACK;
-+ opts->mp_capable.sender_key = mtreq->mptcp_loc_key;
-+ opts->dss_csum = !!sysctl_mptcp_checksum || mtreq->dss_csum;
-+ *remaining -= MPTCP_SUB_LEN_CAPABLE_SYN_ALIGN;
-+ } else {
-+ opts->mptcp_options |= OPTION_MP_JOIN | OPTION_TYPE_SYNACK;
-+ opts->mp_join_syns.sender_truncated_mac =
-+ mtreq->mptcp_hash_tmac;
-+ opts->mp_join_syns.sender_nonce = mtreq->mptcp_loc_nonce;
-+ opts->mp_join_syns.low_prio = mtreq->low_prio;
-+ opts->addr_id = mtreq->loc_id;
-+ *remaining -= MPTCP_SUB_LEN_JOIN_SYNACK_ALIGN;
-+ }
-+}
-+
-+void mptcp_established_options(struct sock *sk, struct sk_buff *skb,
-+ struct tcp_out_options *opts, unsigned *size)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ struct mptcp_cb *mpcb = tp->mpcb;
-+ const struct tcp_skb_cb *tcb = skb ? TCP_SKB_CB(skb) : NULL;
-+
-+ /* We are coming from tcp_current_mss with the meta_sk as an argument.
-+ * It does not make sense to check for the options, because when the
-+ * segment gets sent, another subflow will be chosen.
-+ */
-+ if (!skb && is_meta_sk(sk))
-+ return;
-+
-+ /* In fallback mp_fail-mode, we have to repeat it until the fallback
-+ * has been done by the sender
-+ */
-+ if (unlikely(tp->mptcp->send_mp_fail)) {
-+ opts->options |= OPTION_MPTCP;
-+ opts->mptcp_options |= OPTION_MP_FAIL;
-+ *size += MPTCP_SUB_LEN_FAIL;
-+ return;
-+ }
-+
-+ if (unlikely(tp->send_mp_fclose)) {
-+ opts->options |= OPTION_MPTCP;
-+ opts->mptcp_options |= OPTION_MP_FCLOSE;
-+ opts->mp_capable.receiver_key = mpcb->mptcp_rem_key;
-+ *size += MPTCP_SUB_LEN_FCLOSE_ALIGN;
-+ return;
-+ }
-+
-+ /* 1. If we are the sender of the infinite-mapping, we need the
-+ * MPTCPHDR_INF-flag, because a retransmission of the
-+ * infinite-announcement still needs the mptcp-option.
-+ *
-+ * We need infinite_cutoff_seq, because retransmissions from before
-+ * the infinite-cutoff-moment still need the MPTCP-signalling to stay
-+ * consistent.
-+ *
-+ * 2. If we are the receiver of the infinite-mapping, we always skip
-+ * mptcp-options, because acknowledgments from before the
-+ * infinite-mapping point have already been sent out.
-+ *
-+ * I know, the whole infinite-mapping stuff is ugly...
-+ *
-+ * TODO: Handle wrapped data-sequence numbers
-+ * (even if it's very unlikely)
-+ */
-+ if (unlikely(mpcb->infinite_mapping_snd) &&
-+ ((mpcb->send_infinite_mapping && tcb &&
-+ mptcp_is_data_seq(skb) &&
-+ !(tcb->mptcp_flags & MPTCPHDR_INF) &&
-+ !before(tcb->seq, tp->mptcp->infinite_cutoff_seq)) ||
-+ !mpcb->send_infinite_mapping))
-+ return;
-+
-+ if (unlikely(tp->mptcp->include_mpc)) {
-+ opts->options |= OPTION_MPTCP;
-+ opts->mptcp_options |= OPTION_MP_CAPABLE |
-+ OPTION_TYPE_ACK;
-+ *size += MPTCP_SUB_LEN_CAPABLE_ACK_ALIGN;
-+ opts->mp_capable.sender_key = mpcb->mptcp_loc_key;
-+ opts->mp_capable.receiver_key = mpcb->mptcp_rem_key;
-+ opts->dss_csum = mpcb->dss_csum;
-+
-+ if (skb)
-+ tp->mptcp->include_mpc = 0;
-+ }
-+ if (unlikely(tp->mptcp->pre_established)) {
-+ opts->options |= OPTION_MPTCP;
-+ opts->mptcp_options |= OPTION_MP_JOIN | OPTION_TYPE_ACK;
-+ *size += MPTCP_SUB_LEN_JOIN_ACK_ALIGN;
-+ }
-+
-+ if (!tp->mptcp->include_mpc && !tp->mptcp->pre_established) {
-+ opts->options |= OPTION_MPTCP;
-+ opts->mptcp_options |= OPTION_DATA_ACK;
-+ /* If !skb, we come from tcp_current_mss and thus we always
-+ * assume that the DSS-option will be set for the data-packet.
-+ */
-+ if (skb && !mptcp_is_data_seq(skb)) {
-+ *size += MPTCP_SUB_LEN_ACK_ALIGN;
-+ } else {
-+ /* It doesn't matter whether the csum is included or not. The
-+ * length will be either 10 or 12, and thus aligned = 12
-+ */
-+ *size += MPTCP_SUB_LEN_ACK_ALIGN +
-+ MPTCP_SUB_LEN_SEQ_ALIGN;
-+ }
-+
-+ *size += MPTCP_SUB_LEN_DSS_ALIGN;
-+ }
-+
-+ if (unlikely(mpcb->addr_signal) && mpcb->pm_ops->addr_signal)
-+ mpcb->pm_ops->addr_signal(sk, size, opts, skb);
-+
-+ if (unlikely(tp->mptcp->send_mp_prio) &&
-+ MAX_TCP_OPTION_SPACE - *size >= MPTCP_SUB_LEN_PRIO_ALIGN) {
-+ opts->options |= OPTION_MPTCP;
-+ opts->mptcp_options |= OPTION_MP_PRIO;
-+ if (skb)
-+ tp->mptcp->send_mp_prio = 0;
-+ *size += MPTCP_SUB_LEN_PRIO_ALIGN;
-+ }
-+
-+ return;
-+}
-+
-+u16 mptcp_select_window(struct sock *sk)
-+{
-+ u16 new_win = tcp_select_window(sk);
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ struct tcp_sock *meta_tp = mptcp_meta_tp(tp);
-+
-+ meta_tp->rcv_wnd = tp->rcv_wnd;
-+ meta_tp->rcv_wup = meta_tp->rcv_nxt;
-+
-+ return new_win;
-+}
-+
-+void mptcp_options_write(__be32 *ptr, struct tcp_sock *tp,
-+ const struct tcp_out_options *opts,
-+ struct sk_buff *skb)
-+{
-+ if (unlikely(OPTION_MP_CAPABLE & opts->mptcp_options)) {
-+ struct mp_capable *mpc = (struct mp_capable *)ptr;
-+
-+ mpc->kind = TCPOPT_MPTCP;
-+
-+ if ((OPTION_TYPE_SYN & opts->mptcp_options) ||
-+ (OPTION_TYPE_SYNACK & opts->mptcp_options)) {
-+ mpc->sender_key = opts->mp_capable.sender_key;
-+ mpc->len = MPTCP_SUB_LEN_CAPABLE_SYN;
-+ ptr += MPTCP_SUB_LEN_CAPABLE_SYN_ALIGN >> 2;
-+ } else if (OPTION_TYPE_ACK & opts->mptcp_options) {
-+ mpc->sender_key = opts->mp_capable.sender_key;
-+ mpc->receiver_key = opts->mp_capable.receiver_key;
-+ mpc->len = MPTCP_SUB_LEN_CAPABLE_ACK;
-+ ptr += MPTCP_SUB_LEN_CAPABLE_ACK_ALIGN >> 2;
-+ }
-+
-+ mpc->sub = MPTCP_SUB_CAPABLE;
-+ mpc->ver = 0;
-+ mpc->a = opts->dss_csum;
-+ mpc->b = 0;
-+ mpc->rsv = 0;
-+ mpc->h = 1;
-+ }
-+
-+ if (unlikely(OPTION_MP_JOIN & opts->mptcp_options)) {
-+ struct mp_join *mpj = (struct mp_join *)ptr;
-+
-+ mpj->kind = TCPOPT_MPTCP;
-+ mpj->sub = MPTCP_SUB_JOIN;
-+ mpj->rsv = 0;
-+
-+ if (OPTION_TYPE_SYN & opts->mptcp_options) {
-+ mpj->len = MPTCP_SUB_LEN_JOIN_SYN;
-+ mpj->u.syn.token = opts->mp_join_syns.token;
-+ mpj->u.syn.nonce = opts->mp_join_syns.sender_nonce;
-+ mpj->b = opts->mp_join_syns.low_prio;
-+ mpj->addr_id = opts->addr_id;
-+ ptr += MPTCP_SUB_LEN_JOIN_SYN_ALIGN >> 2;
-+ } else if (OPTION_TYPE_SYNACK & opts->mptcp_options) {
-+ mpj->len = MPTCP_SUB_LEN_JOIN_SYNACK;
-+ mpj->u.synack.mac =
-+ opts->mp_join_syns.sender_truncated_mac;
-+ mpj->u.synack.nonce = opts->mp_join_syns.sender_nonce;
-+ mpj->b = opts->mp_join_syns.low_prio;
-+ mpj->addr_id = opts->addr_id;
-+ ptr += MPTCP_SUB_LEN_JOIN_SYNACK_ALIGN >> 2;
-+ } else if (OPTION_TYPE_ACK & opts->mptcp_options) {
-+ mpj->len = MPTCP_SUB_LEN_JOIN_ACK;
-+ mpj->addr_id = 0; /* addr_id is rsv (RFC 6824, p. 21) */
-+ memcpy(mpj->u.ack.mac, &tp->mptcp->sender_mac[0], 20);
-+ ptr += MPTCP_SUB_LEN_JOIN_ACK_ALIGN >> 2;
-+ }
-+ }
-+ if (unlikely(OPTION_ADD_ADDR & opts->mptcp_options)) {
-+ struct mp_add_addr *mpadd = (struct mp_add_addr *)ptr;
-+
-+ mpadd->kind = TCPOPT_MPTCP;
-+ if (opts->add_addr_v4) {
-+ mpadd->len = MPTCP_SUB_LEN_ADD_ADDR4;
-+ mpadd->sub = MPTCP_SUB_ADD_ADDR;
-+ mpadd->ipver = 4;
-+ mpadd->addr_id = opts->add_addr4.addr_id;
-+ mpadd->u.v4.addr = opts->add_addr4.addr;
-+ ptr += MPTCP_SUB_LEN_ADD_ADDR4_ALIGN >> 2;
-+ } else if (opts->add_addr_v6) {
-+ mpadd->len = MPTCP_SUB_LEN_ADD_ADDR6;
-+ mpadd->sub = MPTCP_SUB_ADD_ADDR;
-+ mpadd->ipver = 6;
-+ mpadd->addr_id = opts->add_addr6.addr_id;
-+ memcpy(&mpadd->u.v6.addr, &opts->add_addr6.addr,
-+ sizeof(mpadd->u.v6.addr));
-+ ptr += MPTCP_SUB_LEN_ADD_ADDR6_ALIGN >> 2;
-+ }
-+ }
-+ if (unlikely(OPTION_REMOVE_ADDR & opts->mptcp_options)) {
-+ struct mp_remove_addr *mprem = (struct mp_remove_addr *)ptr;
-+ u8 *addrs_id;
-+ int id, len, len_align;
-+
-+ len = mptcp_sub_len_remove_addr(opts->remove_addrs);
-+ len_align = mptcp_sub_len_remove_addr_align(opts->remove_addrs);
-+
-+ mprem->kind = TCPOPT_MPTCP;
-+ mprem->len = len;
-+ mprem->sub = MPTCP_SUB_REMOVE_ADDR;
-+ mprem->rsv = 0;
-+ addrs_id = &mprem->addrs_id;
-+
-+ mptcp_for_each_bit_set(opts->remove_addrs, id)
-+ *(addrs_id++) = id;
-+
-+ /* Fill the rest with NOP's */
-+ if (len_align > len) {
-+ int i;
-+ for (i = 0; i < len_align - len; i++)
-+ *(addrs_id++) = TCPOPT_NOP;
-+ }
-+
-+ ptr += len_align >> 2;
-+ }
-+ if (unlikely(OPTION_MP_FAIL & opts->mptcp_options)) {
-+ struct mp_fail *mpfail = (struct mp_fail *)ptr;
-+
-+ mpfail->kind = TCPOPT_MPTCP;
-+ mpfail->len = MPTCP_SUB_LEN_FAIL;
-+ mpfail->sub = MPTCP_SUB_FAIL;
-+ mpfail->rsv1 = 0;
-+ mpfail->rsv2 = 0;
-+ mpfail->data_seq = htonll(tp->mpcb->csum_cutoff_seq);
-+
-+ ptr += MPTCP_SUB_LEN_FAIL_ALIGN >> 2;
-+ }
-+ if (unlikely(OPTION_MP_FCLOSE & opts->mptcp_options)) {
-+ struct mp_fclose *mpfclose = (struct mp_fclose *)ptr;
-+
-+ mpfclose->kind = TCPOPT_MPTCP;
-+ mpfclose->len = MPTCP_SUB_LEN_FCLOSE;
-+ mpfclose->sub = MPTCP_SUB_FCLOSE;
-+ mpfclose->rsv1 = 0;
-+ mpfclose->rsv2 = 0;
-+ mpfclose->key = opts->mp_capable.receiver_key;
-+
-+ ptr += MPTCP_SUB_LEN_FCLOSE_ALIGN >> 2;
-+ }
-+
-+ if (OPTION_DATA_ACK & opts->mptcp_options) {
-+ if (!mptcp_is_data_seq(skb))
-+ ptr += mptcp_write_dss_data_ack(tp, skb, ptr);
-+ else
-+ ptr += mptcp_write_dss_data_seq(tp, skb, ptr);
-+ }
-+ if (unlikely(OPTION_MP_PRIO & opts->mptcp_options)) {
-+ struct mp_prio *mpprio = (struct mp_prio *)ptr;
-+
-+ mpprio->kind = TCPOPT_MPTCP;
-+ mpprio->len = MPTCP_SUB_LEN_PRIO;
-+ mpprio->sub = MPTCP_SUB_PRIO;
-+ mpprio->rsv = 0;
-+ mpprio->b = tp->mptcp->low_prio;
-+ mpprio->addr_id = TCPOPT_NOP;
-+
-+ ptr += MPTCP_SUB_LEN_PRIO_ALIGN >> 2;
-+ }
-+}
-+
-+/* Sends the datafin */
-+void mptcp_send_fin(struct sock *meta_sk)
-+{
-+ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
-+ struct sk_buff *skb = tcp_write_queue_tail(meta_sk);
-+ int mss_now;
-+
-+ if ((1 << meta_sk->sk_state) & (TCPF_CLOSE_WAIT | TCPF_LAST_ACK))
-+ meta_tp->mpcb->passive_close = 1;
-+
-+ /* Optimization, tack on the FIN if we have a queue of
-+ * unsent frames. But be careful about outgoing SACKS
-+ * and IP options.
-+ */
-+ mss_now = mptcp_current_mss(meta_sk);
-+
-+ if (tcp_send_head(meta_sk) != NULL) {
-+ TCP_SKB_CB(skb)->mptcp_flags |= MPTCPHDR_FIN;
-+ TCP_SKB_CB(skb)->end_seq++;
-+ meta_tp->write_seq++;
-+ } else {
-+ /* Socket is locked, keep trying until memory is available. */
-+ for (;;) {
-+ skb = alloc_skb_fclone(MAX_TCP_HEADER,
-+ meta_sk->sk_allocation);
-+ if (skb)
-+ break;
-+ yield();
-+ }
-+ /* Reserve space for headers and prepare control bits. */
-+ skb_reserve(skb, MAX_TCP_HEADER);
-+
-+ tcp_init_nondata_skb(skb, meta_tp->write_seq, TCPHDR_ACK);
-+ TCP_SKB_CB(skb)->end_seq++;
-+ TCP_SKB_CB(skb)->mptcp_flags |= MPTCPHDR_FIN;
-+ tcp_queue_skb(meta_sk, skb);
-+ }
-+ __tcp_push_pending_frames(meta_sk, mss_now, TCP_NAGLE_OFF);
-+}
-+
-+void mptcp_send_active_reset(struct sock *meta_sk, gfp_t priority)
-+{
-+ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
-+ struct mptcp_cb *mpcb = meta_tp->mpcb;
-+ struct sock *sk = NULL, *sk_it = NULL, *tmpsk;
-+
-+ if (!mpcb->cnt_subflows)
-+ return;
-+
-+ WARN_ON(meta_tp->send_mp_fclose);
-+
-+ /* First - select a socket */
-+ sk = mptcp_select_ack_sock(meta_sk);
-+
-+ /* May happen if no subflow is in an appropriate state */
-+ if (!sk)
-+ return;
-+
-+ /* We are in infinite mode - just send a reset */
-+ if (mpcb->infinite_mapping_snd || mpcb->infinite_mapping_rcv) {
-+ sk->sk_err = ECONNRESET;
-+ if (tcp_need_reset(sk->sk_state))
-+ tcp_send_active_reset(sk, priority);
-+ mptcp_sub_force_close(sk);
-+ return;
-+ }
-+
-+
-+ tcp_sk(sk)->send_mp_fclose = 1;
-+ /* Reset all other subflows */
-+
-+ /* tcp_done must be handled with bh disabled */
-+ if (!in_serving_softirq())
-+ local_bh_disable();
-+
-+ mptcp_for_each_sk_safe(mpcb, sk_it, tmpsk) {
-+ if (tcp_sk(sk_it)->send_mp_fclose)
-+ continue;
-+
-+ sk_it->sk_err = ECONNRESET;
-+ if (tcp_need_reset(sk_it->sk_state))
-+ tcp_send_active_reset(sk_it, GFP_ATOMIC);
-+ mptcp_sub_force_close(sk_it);
-+ }
-+
-+ if (!in_serving_softirq())
-+ local_bh_enable();
-+
-+ tcp_send_ack(sk);
-+ inet_csk_reset_keepalive_timer(sk, inet_csk(sk)->icsk_rto);
-+
-+ meta_tp->send_mp_fclose = 1;
-+}
-+
-+static void mptcp_ack_retransmit_timer(struct sock *sk)
-+{
-+ struct sk_buff *skb;
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ struct inet_connection_sock *icsk = inet_csk(sk);
-+
-+ if (inet_csk(sk)->icsk_af_ops->rebuild_header(sk))
-+ goto out; /* Routing failure or similar */
-+
-+ if (!tp->retrans_stamp)
-+ tp->retrans_stamp = tcp_time_stamp ? : 1;
-+
-+ if (tcp_write_timeout(sk)) {
-+ tp->mptcp->pre_established = 0;
-+ sk_stop_timer(sk, &tp->mptcp->mptcp_ack_timer);
-+ tp->ops->send_active_reset(sk, GFP_ATOMIC);
-+ goto out;
-+ }
-+
-+ skb = alloc_skb(MAX_TCP_HEADER, GFP_ATOMIC);
-+ if (skb == NULL) {
-+ sk_reset_timer(sk, &tp->mptcp->mptcp_ack_timer,
-+ jiffies + icsk->icsk_rto);
-+ return;
-+ }
-+
-+ /* Reserve space for headers and prepare control bits */
-+ skb_reserve(skb, MAX_TCP_HEADER);
-+ tcp_init_nondata_skb(skb, tp->snd_una, TCPHDR_ACK);
-+
-+ TCP_SKB_CB(skb)->when = tcp_time_stamp;
-+ if (tcp_transmit_skb(sk, skb, 0, GFP_ATOMIC) > 0) {
-+ /* Retransmission failed because of local congestion,
-+ * do not backoff.
-+ */
-+ if (!icsk->icsk_retransmits)
-+ icsk->icsk_retransmits = 1;
-+ sk_reset_timer(sk, &tp->mptcp->mptcp_ack_timer,
-+ jiffies + icsk->icsk_rto);
-+ return;
-+ }
-+
-+
-+ icsk->icsk_retransmits++;
-+ icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX);
-+ sk_reset_timer(sk, &tp->mptcp->mptcp_ack_timer,
-+ jiffies + icsk->icsk_rto);
-+ if (retransmits_timed_out(sk, sysctl_tcp_retries1 + 1, 0, 0))
-+ __sk_dst_reset(sk);
-+
-+out:;
-+}
-+
-+void mptcp_ack_handler(unsigned long data)
-+{
-+ struct sock *sk = (struct sock *)data;
-+ struct sock *meta_sk = mptcp_meta_sk(sk);
-+
-+ bh_lock_sock(meta_sk);
-+ if (sock_owned_by_user(meta_sk)) {
-+ /* Try again later */
-+ sk_reset_timer(sk, &tcp_sk(sk)->mptcp->mptcp_ack_timer,
-+ jiffies + (HZ / 20));
-+ goto out_unlock;
-+ }
-+
-+ if (sk->sk_state == TCP_CLOSE)
-+ goto out_unlock;
-+ if (!tcp_sk(sk)->mptcp->pre_established)
-+ goto out_unlock;
-+
-+ mptcp_ack_retransmit_timer(sk);
-+
-+ sk_mem_reclaim(sk);
-+
-+out_unlock:
-+ bh_unlock_sock(meta_sk);
-+ sock_put(sk);
-+}
-+
-+/* Similar to tcp_retransmit_skb
-+ *
-+ * The diff is that we handle the retransmission-stats (retrans_stamp) at the
-+ * meta-level.
-+ */
-+int mptcp_retransmit_skb(struct sock *meta_sk, struct sk_buff *skb)
-+{
-+ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
-+ struct sock *subsk;
-+ unsigned int limit, mss_now;
-+ int err = -1;
-+
-+ /* Do not send more than we queued. 1/4 is reserved for possible
-+ * copying overhead: fragmentation, tunneling, mangling etc.
-+ *
-+ * This is a meta-retransmission thus we check on the meta-socket.
-+ */
-+ if (atomic_read(&meta_sk->sk_wmem_alloc) >
-+ min(meta_sk->sk_wmem_queued + (meta_sk->sk_wmem_queued >> 2), meta_sk->sk_sndbuf)) {
-+ return -EAGAIN;
-+ }
-+
-+ /* We need to make sure that the retransmitted segment can be sent on a
-+ * subflow right now. If it is too big, it needs to be fragmented.
-+ */
-+ subsk = meta_tp->mpcb->sched_ops->get_subflow(meta_sk, skb, false);
-+ if (!subsk) {
-+ /* We want to increase icsk_retransmits, thus return 0, so that
-+ * mptcp_retransmit_timer enters the desired branch.
-+ */
-+ err = 0;
-+ goto failed;
-+ }
-+ mss_now = tcp_current_mss(subsk);
-+
-+ /* If the segment was cloned (e.g. a meta retransmission), the header
-+ * must be expanded/copied so that there is no corruption of TSO
-+ * information.
-+ */
-+ if (skb_unclone(skb, GFP_ATOMIC)) {
-+ err = -ENOMEM;
-+ goto failed;
-+ }
-+
-+ /* Must have been set by mptcp_write_xmit before */
-+ BUG_ON(!tcp_skb_pcount(skb));
-+
-+ limit = mss_now;
-+ /* skb->len > mss_now is the equivalent of tso_segs > 1 in
-+ * tcp_write_xmit. Otherwise split-point would return 0.
-+ */
-+ if (skb->len > mss_now && !tcp_urg_mode(meta_tp))
-+ limit = tcp_mss_split_point(meta_sk, skb, mss_now,
-+ UINT_MAX / mss_now,
-+ TCP_NAGLE_OFF);
-+
-+ if (skb->len > limit &&
-+ unlikely(mptcp_fragment(meta_sk, skb, limit,
-+ GFP_ATOMIC, 0)))
-+ goto failed;
-+
-+ if (!mptcp_skb_entail(subsk, skb, -1))
-+ goto failed;
-+ TCP_SKB_CB(skb)->when = tcp_time_stamp;
-+
-+ /* Update global TCP statistics. */
-+ TCP_INC_STATS(sock_net(meta_sk), TCP_MIB_RETRANSSEGS);
-+
-+ /* Diff to tcp_retransmit_skb */
-+
-+ /* Save stamp of the first retransmit. */
-+ if (!meta_tp->retrans_stamp)
-+ meta_tp->retrans_stamp = TCP_SKB_CB(skb)->when;
-+
-+ __tcp_push_pending_frames(subsk, mss_now, TCP_NAGLE_PUSH);
-+
-+ return 0;
-+
-+failed:
-+ NET_INC_STATS_BH(sock_net(meta_sk), LINUX_MIB_TCPRETRANSFAIL);
-+ return err;
-+}
-+
-+/* Similar to tcp_retransmit_timer
-+ *
-+ * The diff is that we have to handle retransmissions of the FAST_CLOSE-message
-+ * and that we don't have an srtt estimation at the meta-level.
-+ */
-+void mptcp_retransmit_timer(struct sock *meta_sk)
-+{
-+ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
-+ struct mptcp_cb *mpcb = meta_tp->mpcb;
-+ struct inet_connection_sock *meta_icsk = inet_csk(meta_sk);
-+ int err;
-+
-+ /* In fallback, retransmission is handled at the subflow-level */
-+ if (!meta_tp->packets_out || mpcb->infinite_mapping_snd ||
-+ mpcb->send_infinite_mapping)
-+ return;
-+
-+ WARN_ON(tcp_write_queue_empty(meta_sk));
-+
-+ if (!meta_tp->snd_wnd && !sock_flag(meta_sk, SOCK_DEAD) &&
-+ !((1 << meta_sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV))) {
-+ /* Receiver dastardly shrinks window. Our retransmits
-+ * become zero probes, but we should not timeout this
-+ * connection. If the socket is an orphan, time it out,
-+ * we cannot allow such beasts to hang infinitely.
-+ */
-+ struct inet_sock *meta_inet = inet_sk(meta_sk);
-+ if (meta_sk->sk_family == AF_INET) {
-+ LIMIT_NETDEBUG(KERN_DEBUG "MPTCP: Peer %pI4:%u/%u unexpectedly shrunk window %u:%u (repaired)\n",
-+ &meta_inet->inet_daddr,
-+ ntohs(meta_inet->inet_dport),
-+ meta_inet->inet_num, meta_tp->snd_una,
-+ meta_tp->snd_nxt);
-+ }
-+#if IS_ENABLED(CONFIG_IPV6)
-+ else if (meta_sk->sk_family == AF_INET6) {
-+ LIMIT_NETDEBUG(KERN_DEBUG "MPTCP: Peer %pI6:%u/%u unexpectedly shrunk window %u:%u (repaired)\n",
-+ &meta_sk->sk_v6_daddr,
-+ ntohs(meta_inet->inet_dport),
-+ meta_inet->inet_num, meta_tp->snd_una,
-+ meta_tp->snd_nxt);
-+ }
-+#endif
-+ if (tcp_time_stamp - meta_tp->rcv_tstamp > TCP_RTO_MAX) {
-+ tcp_write_err(meta_sk);
-+ return;
-+ }
-+
-+ mptcp_retransmit_skb(meta_sk, tcp_write_queue_head(meta_sk));
-+ goto out_reset_timer;
-+ }
-+
-+ if (tcp_write_timeout(meta_sk))
-+ return;
-+
-+ if (meta_icsk->icsk_retransmits == 0)
-+ NET_INC_STATS_BH(sock_net(meta_sk), LINUX_MIB_TCPTIMEOUTS);
-+
-+ meta_icsk->icsk_ca_state = TCP_CA_Loss;
-+
-+ err = mptcp_retransmit_skb(meta_sk, tcp_write_queue_head(meta_sk));
-+ if (err > 0) {
-+ /* Retransmission failed because of local congestion,
-+ * do not backoff.
-+ */
-+ if (!meta_icsk->icsk_retransmits)
-+ meta_icsk->icsk_retransmits = 1;
-+ inet_csk_reset_xmit_timer(meta_sk, ICSK_TIME_RETRANS,
-+ min(meta_icsk->icsk_rto, TCP_RESOURCE_PROBE_INTERVAL),
-+ TCP_RTO_MAX);
-+ return;
-+ }
-+
-+ /* Increase the timeout each time we retransmit. Note that
-+ * we do not increase the rtt estimate. rto is initialized
-+ * from rtt, but increases here. Jacobson (SIGCOMM 88) suggests
-+ * that doubling rto each time is the least we can get away with.
-+ * In KA9Q, Karn uses this for the first few times, and then
-+ * goes to quadratic. netBSD doubles, but only goes up to *64,
-+ * and clamps at 1 to 64 sec afterwards. Note that 120 sec is
-+ * defined in the protocol as the maximum possible RTT. I guess
-+ * we'll have to use something other than TCP to talk to the
-+ * University of Mars.
-+ *
-+ * PAWS allows us longer timeouts and large windows, so once
-+ * implemented ftp to mars will work nicely. We will have to fix
-+ * the 120 second clamps though!
-+ */
-+ meta_icsk->icsk_backoff++;
-+ meta_icsk->icsk_retransmits++;
-+
-+out_reset_timer:
-+ /* If stream is thin, use linear timeouts. Since 'icsk_backoff' is
-+ * used to reset timer, set to 0. Recalculate 'icsk_rto' as this
-+ * might be increased if the stream oscillates between thin and thick,
-+ * thus the old value might already be too high compared to the value
-+ * set by 'tcp_set_rto' in tcp_input.c which resets the rto without
-+ * backoff. Limit to TCP_THIN_LINEAR_RETRIES before initiating
-+ * exponential backoff behaviour, to avoid hammering
-+ * linear-timeout retransmissions into a black hole.
-+ */
-+ if (meta_sk->sk_state == TCP_ESTABLISHED &&
-+ (meta_tp->thin_lto || sysctl_tcp_thin_linear_timeouts) &&
-+ tcp_stream_is_thin(meta_tp) &&
-+ meta_icsk->icsk_retransmits <= TCP_THIN_LINEAR_RETRIES) {
-+ meta_icsk->icsk_backoff = 0;
-+ /* We cannot do the same as in tcp_write_timer because the
-+ * srtt is not set here.
-+ */
-+ mptcp_set_rto(meta_sk);
-+ } else {
-+ /* Use normal (exponential) backoff */
-+ meta_icsk->icsk_rto = min(meta_icsk->icsk_rto << 1, TCP_RTO_MAX);
-+ }
-+ inet_csk_reset_xmit_timer(meta_sk, ICSK_TIME_RETRANS, meta_icsk->icsk_rto, TCP_RTO_MAX);
-+
-+ return;
-+}
-+
-+/* Modify values to an mptcp-level for the initial window of new subflows */
-+void mptcp_select_initial_window(int __space, __u32 mss, __u32 *rcv_wnd,
-+ __u32 *window_clamp, int wscale_ok,
-+ __u8 *rcv_wscale, __u32 init_rcv_wnd,
-+ const struct sock *sk)
-+{
-+ struct mptcp_cb *mpcb = tcp_sk(sk)->mpcb;
-+
-+ *window_clamp = mpcb->orig_window_clamp;
-+ __space = tcp_win_from_space(mpcb->orig_sk_rcvbuf);
-+
-+ tcp_select_initial_window(__space, mss, rcv_wnd, window_clamp,
-+ wscale_ok, rcv_wscale, init_rcv_wnd, sk);
-+}
-+
-+static inline u64 mptcp_calc_rate(const struct sock *meta_sk, unsigned int mss,
-+ unsigned int (*mss_cb)(struct sock *sk))
-+{
-+ struct sock *sk;
-+ u64 rate = 0;
-+
-+ mptcp_for_each_sk(tcp_sk(meta_sk)->mpcb, sk) {
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ int this_mss;
-+ u64 this_rate;
-+
-+ if (!mptcp_sk_can_send(sk))
-+ continue;
-+
-+ /* Do not consider subflows without an RTT estimate yet,
-+ * otherwise this_rate >>> rate.
-+ */
-+ if (unlikely(!tp->srtt_us))
-+ continue;
-+
-+ this_mss = mss_cb(sk);
-+
-+ /* If this_mss is smaller than mss, it means that a segment will
-+ * be split in two (or more) when pushed on this subflow. If
-+ * you consider that mss = 1428 and this_mss = 1420 then two
-+ * segments will be generated: a 1420-byte and 8-byte segment.
-+ * The latter will introduce a large overhead as for a single
-+ * data segment 2 slots will be used in the congestion window.
-+ * Therefore reducing by ~2 the potential throughput of this
-+ * subflow. Indeed, 1428 bytes will be sent while 2840 could have been
-+ * sent if mss == 1420 reducing the throughput by 2840 / 1428.
-+ *
-+ * The following algorithm takes this overhead into account
-+ * when computing the potential throughput that MPTCP can
-+ * achieve when generating mss-byte segments.
-+ *
-+ * The formula is the following:
-+ * \sum_{\forall sub} ratio * \frac{mss * cwnd_sub}{rtt_sub}
-+ * Where ratio is computed as follows:
-+ * \frac{mss}{\ceil{mss / mss_sub} * mss_sub}
-+ *
-+ * ratio gives the reduction factor of the theoretical
-+ * throughput a subflow can achieve if MPTCP uses a specific
-+ * MSS value.
-+ */
-+ this_rate = div64_u64((u64)mss * mss * (USEC_PER_SEC << 3) *
-+ max(tp->snd_cwnd, tp->packets_out),
-+ (u64)tp->srtt_us *
-+ DIV_ROUND_UP(mss, this_mss) * this_mss);
-+ rate += this_rate;
-+ }
-+
-+ return rate;
-+}
-+
-+static unsigned int __mptcp_current_mss(const struct sock *meta_sk,
-+ unsigned int (*mss_cb)(struct sock *sk))
-+{
-+ unsigned int mss = 0;
-+ u64 rate = 0;
-+ struct sock *sk;
-+
-+ mptcp_for_each_sk(tcp_sk(meta_sk)->mpcb, sk) {
-+ int this_mss;
-+ u64 this_rate;
-+
-+ if (!mptcp_sk_can_send(sk))
-+ continue;
-+
-+ this_mss = mss_cb(sk);
-+
-+ /* Same mss values will produce the same throughput. */
-+ if (this_mss == mss)
-+ continue;
-+
-+ /* See whether using this mss value can theoretically improve
-+ * performance.
-+ */
-+ this_rate = mptcp_calc_rate(meta_sk, this_mss, mss_cb);
-+ if (this_rate >= rate) {
-+ mss = this_mss;
-+ rate = this_rate;
-+ }
-+ }
-+
-+ return mss;
-+}
-+
-+unsigned int mptcp_current_mss(struct sock *meta_sk)
-+{
-+ unsigned int mss = __mptcp_current_mss(meta_sk, tcp_current_mss);
-+
-+ /* If no subflow is available, we take a default-mss from the
-+ * meta-socket.
-+ */
-+ return !mss ? tcp_current_mss(meta_sk) : mss;
-+}
-+
-+static unsigned int mptcp_select_size_mss(struct sock *sk)
-+{
-+ return tcp_sk(sk)->mss_cache;
-+}
-+
-+int mptcp_select_size(const struct sock *meta_sk, bool sg)
-+{
-+ unsigned int mss = __mptcp_current_mss(meta_sk, mptcp_select_size_mss);
-+
-+ if (sg) {
-+ if (mptcp_sk_can_gso(meta_sk)) {
-+ mss = SKB_WITH_OVERHEAD(2048 - MAX_TCP_HEADER);
-+ } else {
-+ int pgbreak = SKB_MAX_HEAD(MAX_TCP_HEADER);
-+
-+ if (mss >= pgbreak &&
-+ mss <= pgbreak + (MAX_SKB_FRAGS - 1) * PAGE_SIZE)
-+ mss = pgbreak;
-+ }
-+ }
-+
-+ return !mss ? tcp_sk(meta_sk)->mss_cache : mss;
-+}
-+
-+int mptcp_check_snd_buf(const struct tcp_sock *tp)
-+{
-+ const struct sock *sk;
-+ u32 rtt_max = tp->srtt_us;
-+ u64 bw_est;
-+
-+ if (!tp->srtt_us)
-+ return tp->reordering + 1;
-+
-+ mptcp_for_each_sk(tp->mpcb, sk) {
-+ if (!mptcp_sk_can_send(sk))
-+ continue;
-+
-+ if (rtt_max < tcp_sk(sk)->srtt_us)
-+ rtt_max = tcp_sk(sk)->srtt_us;
-+ }
-+
-+ bw_est = div64_u64(((u64)tp->snd_cwnd * rtt_max) << 16,
-+ (u64)tp->srtt_us);
-+
-+ return max_t(unsigned int, (u32)(bw_est >> 16),
-+ tp->reordering + 1);
-+}
-+
-+unsigned int mptcp_xmit_size_goal(const struct sock *meta_sk, u32 mss_now,
-+ int large_allowed)
-+{
-+ struct sock *sk;
-+ u32 xmit_size_goal = 0;
-+
-+ if (large_allowed && mptcp_sk_can_gso(meta_sk)) {
-+ mptcp_for_each_sk(tcp_sk(meta_sk)->mpcb, sk) {
-+ int this_size_goal;
-+
-+ if (!mptcp_sk_can_send(sk))
-+ continue;
-+
-+ this_size_goal = tcp_xmit_size_goal(sk, mss_now, 1);
-+ if (this_size_goal > xmit_size_goal)
-+ xmit_size_goal = this_size_goal;
-+ }
-+ }
-+
-+ return max(xmit_size_goal, mss_now);
-+}
-+
-+/* Similar to tcp_trim_head - but we correctly copy the DSS-option */
-+int mptcp_trim_head(struct sock *sk, struct sk_buff *skb, u32 len)
-+{
-+ if (skb_cloned(skb)) {
-+ if (pskb_expand_head(skb, 0, 0, GFP_ATOMIC))
-+ return -ENOMEM;
-+ }
-+
-+ __pskb_trim_head(skb, len);
-+
-+ TCP_SKB_CB(skb)->seq += len;
-+ skb->ip_summed = CHECKSUM_PARTIAL;
-+
-+ skb->truesize -= len;
-+ sk->sk_wmem_queued -= len;
-+ sk_mem_uncharge(sk, len);
-+ sock_set_flag(sk, SOCK_QUEUE_SHRUNK);
-+
-+ /* Any change of skb->len requires recalculation of tso factor. */
-+ if (tcp_skb_pcount(skb) > 1)
-+ tcp_set_skb_tso_segs(sk, skb, tcp_skb_mss(skb));
-+
-+ return 0;
-+}
-diff --git a/net/mptcp/mptcp_pm.c b/net/mptcp/mptcp_pm.c
-new file mode 100644
-index 000000000000..9542f950729f
---- /dev/null
-+++ b/net/mptcp/mptcp_pm.c
-@@ -0,0 +1,169 @@
-+/*
-+ * MPTCP implementation - MPTCP-subflow-management
-+ *
-+ * Initial Design & Implementation:
-+ * Sébastien Barré <sebastien.barre@uclouvain.be>
-+ *
-+ * Current Maintainer & Author:
-+ * Christoph Paasch <christoph.paasch@uclouvain.be>
-+ *
-+ * Additional authors:
-+ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
-+ * Gregory Detal <gregory.detal@uclouvain.be>
-+ * Fabien Duchêne <fabien.duchene@uclouvain.be>
-+ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
-+ * Lavkesh Lahngir <lavkesh51@gmail.com>
-+ * Andreas Ripke <ripke@neclab.eu>
-+ * Vlad Dogaru <vlad.dogaru@intel.com>
-+ * Octavian Purdila <octavian.purdila@intel.com>
-+ * John Ronan <jronan@tssg.org>
-+ * Catalin Nicutar <catalin.nicutar@gmail.com>
-+ * Brandon Heller <brandonh@stanford.edu>
-+ *
-+ *
-+ * This program is free software; you can redistribute it and/or
-+ * modify it under the terms of the GNU General Public License
-+ * as published by the Free Software Foundation; either version
-+ * 2 of the License, or (at your option) any later version.
-+ */
-+
-+
-+#include <linux/module.h>
-+#include <net/mptcp.h>
-+
-+static DEFINE_SPINLOCK(mptcp_pm_list_lock);
-+static LIST_HEAD(mptcp_pm_list);
-+
-+static int mptcp_default_id(sa_family_t family, union inet_addr *addr,
-+ struct net *net, bool *low_prio)
-+{
-+ return 0;
-+}
-+
-+struct mptcp_pm_ops mptcp_pm_default = {
-+ .get_local_id = mptcp_default_id, /* We do not care */
-+ .name = "default",
-+ .owner = THIS_MODULE,
-+};
-+
-+static struct mptcp_pm_ops *mptcp_pm_find(const char *name)
-+{
-+ struct mptcp_pm_ops *e;
-+
-+ list_for_each_entry_rcu(e, &mptcp_pm_list, list) {
-+ if (strcmp(e->name, name) == 0)
-+ return e;
-+ }
-+
-+ return NULL;
-+}
-+
-+int mptcp_register_path_manager(struct mptcp_pm_ops *pm)
-+{
-+ int ret = 0;
-+
-+ if (!pm->get_local_id)
-+ return -EINVAL;
-+
-+ spin_lock(&mptcp_pm_list_lock);
-+ if (mptcp_pm_find(pm->name)) {
-+ pr_notice("%s already registered\n", pm->name);
-+ ret = -EEXIST;
-+ } else {
-+ list_add_tail_rcu(&pm->list, &mptcp_pm_list);
-+ pr_info("%s registered\n", pm->name);
-+ }
-+ spin_unlock(&mptcp_pm_list_lock);
-+
-+ return ret;
-+}
-+EXPORT_SYMBOL_GPL(mptcp_register_path_manager);
-+
-+void mptcp_unregister_path_manager(struct mptcp_pm_ops *pm)
-+{
-+ spin_lock(&mptcp_pm_list_lock);
-+ list_del_rcu(&pm->list);
-+ spin_unlock(&mptcp_pm_list_lock);
-+}
-+EXPORT_SYMBOL_GPL(mptcp_unregister_path_manager);
-+
-+void mptcp_get_default_path_manager(char *name)
-+{
-+ struct mptcp_pm_ops *pm;
-+
-+ BUG_ON(list_empty(&mptcp_pm_list));
-+
-+ rcu_read_lock();
-+ pm = list_entry(mptcp_pm_list.next, struct mptcp_pm_ops, list);
-+ strncpy(name, pm->name, MPTCP_PM_NAME_MAX);
-+ rcu_read_unlock();
-+}
-+
-+int mptcp_set_default_path_manager(const char *name)
-+{
-+ struct mptcp_pm_ops *pm;
-+ int ret = -ENOENT;
-+
-+ spin_lock(&mptcp_pm_list_lock);
-+ pm = mptcp_pm_find(name);
-+#ifdef CONFIG_MODULES
-+ if (!pm && capable(CAP_NET_ADMIN)) {
-+ spin_unlock(&mptcp_pm_list_lock);
-+
-+ request_module("mptcp_%s", name);
-+ spin_lock(&mptcp_pm_list_lock);
-+ pm = mptcp_pm_find(name);
-+ }
-+#endif
-+
-+ if (pm) {
-+ list_move(&pm->list, &mptcp_pm_list);
-+ ret = 0;
-+ } else {
-+ pr_info("%s is not available\n", name);
-+ }
-+ spin_unlock(&mptcp_pm_list_lock);
-+
-+ return ret;
-+}
-+
-+void mptcp_init_path_manager(struct mptcp_cb *mpcb)
-+{
-+ struct mptcp_pm_ops *pm;
-+
-+ rcu_read_lock();
-+ list_for_each_entry_rcu(pm, &mptcp_pm_list, list) {
-+ if (try_module_get(pm->owner)) {
-+ mpcb->pm_ops = pm;
-+ break;
-+ }
-+ }
-+ rcu_read_unlock();
-+}
-+
-+/* Manage refcounts on socket close. */
-+void mptcp_cleanup_path_manager(struct mptcp_cb *mpcb)
-+{
-+ module_put(mpcb->pm_ops->owner);
-+}
-+
-+/* Fallback to the default path-manager. */
-+void mptcp_fallback_default(struct mptcp_cb *mpcb)
-+{
-+ struct mptcp_pm_ops *pm;
-+
-+ mptcp_cleanup_path_manager(mpcb);
-+ pm = mptcp_pm_find("default");
-+
-+ /* Cannot fail - it's the default module */
-+ try_module_get(pm->owner);
-+ mpcb->pm_ops = pm;
-+}
-+EXPORT_SYMBOL_GPL(mptcp_fallback_default);
-+
-+/* Set default value from kernel configuration at bootup */
-+static int __init mptcp_path_manager_default(void)
-+{
-+ return mptcp_set_default_path_manager(CONFIG_DEFAULT_MPTCP_PM);
-+}
-+late_initcall(mptcp_path_manager_default);
-diff --git a/net/mptcp/mptcp_rr.c b/net/mptcp/mptcp_rr.c
-new file mode 100644
-index 000000000000..93278f684069
---- /dev/null
-+++ b/net/mptcp/mptcp_rr.c
-@@ -0,0 +1,301 @@
-+/* MPTCP Scheduler module selector. Highly inspired by tcp_cong.c */
-+
-+#include <linux/module.h>
-+#include <net/mptcp.h>
-+
-+static unsigned char num_segments __read_mostly = 1;
-+module_param(num_segments, byte, 0644);
-+MODULE_PARM_DESC(num_segments, "The number of consecutive segments that are part of a burst");
-+
-+static bool cwnd_limited __read_mostly = 1;
-+module_param(cwnd_limited, bool, 0644);
-+MODULE_PARM_DESC(cwnd_limited, "if set to 1, the scheduler tries to fill the congestion-window on all subflows");
-+
-+struct rrsched_priv {
-+ unsigned char quota;
-+};
-+
-+static struct rrsched_priv *rrsched_get_priv(const struct tcp_sock *tp)
-+{
-+ return (struct rrsched_priv *)&tp->mptcp->mptcp_sched[0];
-+}
-+
-+/* Is the sub-socket sk available to send the skb? */
-+static bool mptcp_rr_is_available(const struct sock *sk, const struct sk_buff *skb,
-+ bool zero_wnd_test, bool cwnd_test)
-+{
-+ const struct tcp_sock *tp = tcp_sk(sk);
-+ unsigned int space, in_flight;
-+
-+ /* Set of states for which we are allowed to send data */
-+ if (!mptcp_sk_can_send(sk))
-+ return false;
-+
-+ /* We do not send data on this subflow unless it is
-+ * fully established, i.e. the 4th ack has been received.
-+ */
-+ if (tp->mptcp->pre_established)
-+ return false;
-+
-+ if (tp->pf)
-+ return false;
-+
-+ if (inet_csk(sk)->icsk_ca_state == TCP_CA_Loss) {
-+ /* If SACK is disabled, and we got a loss, TCP does not exit
-+ * the loss-state until something above high_seq has been acked.
-+ * (see tcp_try_undo_recovery)
-+ *
-+ * high_seq is the snd_nxt at the moment of the RTO. As soon
-+ * as we have an RTO, we won't push data on the subflow.
-+ * Thus, snd_una can never go beyond high_seq.
-+ */
-+ if (!tcp_is_reno(tp))
-+ return false;
-+ else if (tp->snd_una != tp->high_seq)
-+ return false;
-+ }
-+
-+ if (!tp->mptcp->fully_established) {
-+ /* Make sure that we send in-order data */
-+ if (skb && tp->mptcp->second_packet &&
-+ tp->mptcp->last_end_data_seq != TCP_SKB_CB(skb)->seq)
-+ return false;
-+ }
-+
-+ if (!cwnd_test)
-+ goto zero_wnd_test;
-+
-+ in_flight = tcp_packets_in_flight(tp);
-+ /* Not even a single spot in the cwnd */
-+ if (in_flight >= tp->snd_cwnd)
-+ return false;
-+
-+ /* Now, check if what is queued in the subflow's send-queue
-+ * already fills the cwnd.
-+ */
-+ space = (tp->snd_cwnd - in_flight) * tp->mss_cache;
-+
-+ if (tp->write_seq - tp->snd_nxt > space)
-+ return false;
-+
-+zero_wnd_test:
-+ if (zero_wnd_test && !before(tp->write_seq, tcp_wnd_end(tp)))
-+ return false;
-+
-+ return true;
-+}
-+
-+/* Are we not allowed to reinject this skb on tp? */
-+static int mptcp_rr_dont_reinject_skb(const struct tcp_sock *tp, const struct sk_buff *skb)
-+{
-+ /* If the skb has already been enqueued in this sk, try to find
-+ * another one.
-+ */
-+ return skb &&
-+ /* Has the skb already been enqueued into this subsocket? */
-+ mptcp_pi_to_flag(tp->mptcp->path_index) & TCP_SKB_CB(skb)->path_mask;
-+}
-+
-+/* We just look for any subflow that is available */
-+static struct sock *rr_get_available_subflow(struct sock *meta_sk,
-+ struct sk_buff *skb,
-+ bool zero_wnd_test)
-+{
-+ const struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
-+ struct sock *sk, *bestsk = NULL, *backupsk = NULL;
-+
-+ /* Answer data_fin on same subflow!!! */
-+ if (meta_sk->sk_shutdown & RCV_SHUTDOWN &&
-+ skb && mptcp_is_data_fin(skb)) {
-+ mptcp_for_each_sk(mpcb, sk) {
-+ if (tcp_sk(sk)->mptcp->path_index == mpcb->dfin_path_index &&
-+ mptcp_rr_is_available(sk, skb, zero_wnd_test, true))
-+ return sk;
-+ }
-+ }
-+
-+ /* First, find the best subflow */
-+ mptcp_for_each_sk(mpcb, sk) {
-+ struct tcp_sock *tp = tcp_sk(sk);
-+
-+ if (!mptcp_rr_is_available(sk, skb, zero_wnd_test, true))
-+ continue;
-+
-+ if (mptcp_rr_dont_reinject_skb(tp, skb)) {
-+ backupsk = sk;
-+ continue;
-+ }
-+
-+ bestsk = sk;
-+ }
-+
-+ if (bestsk) {
-+ sk = bestsk;
-+ } else if (backupsk) {
-+ /* It has been sent on all subflows once - let's give it a
-+ * chance again by restarting its pathmask.
-+ */
-+ if (skb)
-+ TCP_SKB_CB(skb)->path_mask = 0;
-+ sk = backupsk;
-+ }
-+
-+ return sk;
-+}
-+
-+/* Returns the next segment to be sent from the mptcp meta-queue.
-+ * (chooses the reinject queue if any segment is waiting in it, otherwise,
-+ * chooses the normal write queue).
-+ * Sets *@reinject to 1 if the returned segment comes from the
-+ * reinject queue. Sets it to 0 if it is the regular send-head of the meta-sk,
-+ * and sets it to -1 if it is a meta-level retransmission to optimize the
-+ * receive-buffer.
-+ */
-+static struct sk_buff *__mptcp_rr_next_segment(const struct sock *meta_sk, int *reinject)
-+{
-+ const struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
-+ struct sk_buff *skb = NULL;
-+
-+ *reinject = 0;
-+
-+ /* If we are in fallback-mode, just take from the meta-send-queue */
-+ if (mpcb->infinite_mapping_snd || mpcb->send_infinite_mapping)
-+ return tcp_send_head(meta_sk);
-+
-+ skb = skb_peek(&mpcb->reinject_queue);
-+
-+ if (skb)
-+ *reinject = 1;
-+ else
-+ skb = tcp_send_head(meta_sk);
-+ return skb;
-+}
-+
-+static struct sk_buff *mptcp_rr_next_segment(struct sock *meta_sk,
-+ int *reinject,
-+ struct sock **subsk,
-+ unsigned int *limit)
-+{
-+ const struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
-+ struct sock *sk_it, *choose_sk = NULL;
-+ struct sk_buff *skb = __mptcp_rr_next_segment(meta_sk, reinject);
-+ unsigned char split = num_segments;
-+ unsigned char iter = 0, full_subs = 0;
-+
-+ /* As we set it, we have to reset it as well. */
-+ *limit = 0;
-+
-+ if (!skb)
-+ return NULL;
-+
-+ if (*reinject) {
-+ *subsk = rr_get_available_subflow(meta_sk, skb, false);
-+ if (!*subsk)
-+ return NULL;
-+
-+ return skb;
-+ }
-+
-+retry:
-+
-+ /* First, we look for a subflow that is currently being used */
-+ mptcp_for_each_sk(mpcb, sk_it) {
-+ struct tcp_sock *tp_it = tcp_sk(sk_it);
-+ struct rrsched_priv *rsp = rrsched_get_priv(tp_it);
-+
-+ if (!mptcp_rr_is_available(sk_it, skb, false, cwnd_limited))
-+ continue;
-+
-+ iter++;
-+
-+ /* Is this subflow currently being used? */
-+ if (rsp->quota > 0 && rsp->quota < num_segments) {
-+ split = num_segments - rsp->quota;
-+ choose_sk = sk_it;
-+ goto found;
-+ }
-+
-+ /* Or, it's totally unused */
-+ if (!rsp->quota) {
-+ split = num_segments;
-+ choose_sk = sk_it;
-+ }
-+
-+ /* Or, it must then be fully used */
-+ if (rsp->quota == num_segments)
-+ full_subs++;
-+ }
-+
-+ /* All considered subflows have a full quota, and we considered at
-+ * least one.
-+ */
-+ if (iter && iter == full_subs) {
-+ /* So, we restart this round by setting quota to 0 and retry
-+ * to find a subflow.
-+ */
-+ mptcp_for_each_sk(mpcb, sk_it) {
-+ struct tcp_sock *tp_it = tcp_sk(sk_it);
-+ struct rrsched_priv *rsp = rrsched_get_priv(tp_it);
-+
-+ if (!mptcp_rr_is_available(sk_it, skb, false, cwnd_limited))
-+ continue;
-+
-+ rsp->quota = 0;
-+ }
-+
-+ goto retry;
-+ }
-+
-+found:
-+ if (choose_sk) {
-+ unsigned int mss_now;
-+ struct tcp_sock *choose_tp = tcp_sk(choose_sk);
-+ struct rrsched_priv *rsp = rrsched_get_priv(choose_tp);
-+
-+ if (!mptcp_rr_is_available(choose_sk, skb, false, true))
-+ return NULL;
-+
-+ *subsk = choose_sk;
-+ mss_now = tcp_current_mss(*subsk);
-+ *limit = split * mss_now;
-+
-+ if (skb->len > mss_now)
-+ rsp->quota += DIV_ROUND_UP(skb->len, mss_now);
-+ else
-+ rsp->quota++;
-+
-+ return skb;
-+ }
-+
-+ return NULL;
-+}
-+
-+static struct mptcp_sched_ops mptcp_sched_rr = {
-+ .get_subflow = rr_get_available_subflow,
-+ .next_segment = mptcp_rr_next_segment,
-+ .name = "roundrobin",
-+ .owner = THIS_MODULE,
-+};
-+
-+static int __init rr_register(void)
-+{
-+ BUILD_BUG_ON(sizeof(struct rrsched_priv) > MPTCP_SCHED_SIZE);
-+
-+ if (mptcp_register_scheduler(&mptcp_sched_rr))
-+ return -1;
-+
-+ return 0;
-+}
-+
-+static void rr_unregister(void)
-+{
-+ mptcp_unregister_scheduler(&mptcp_sched_rr);
-+}
-+
-+module_init(rr_register);
-+module_exit(rr_unregister);
-+
-+MODULE_AUTHOR("Christoph Paasch");
-+MODULE_LICENSE("GPL");
-+MODULE_DESCRIPTION("ROUNDROBIN MPTCP");
-+MODULE_VERSION("0.89");
-diff --git a/net/mptcp/mptcp_sched.c b/net/mptcp/mptcp_sched.c
-new file mode 100644
-index 000000000000..6c7ff4eceac1
---- /dev/null
-+++ b/net/mptcp/mptcp_sched.c
-@@ -0,0 +1,493 @@
-+/* MPTCP Scheduler module selector. Highly inspired by tcp_cong.c */
-+
-+#include <linux/module.h>
-+#include <net/mptcp.h>
-+
-+static DEFINE_SPINLOCK(mptcp_sched_list_lock);
-+static LIST_HEAD(mptcp_sched_list);
-+
-+struct defsched_priv {
-+ u32 last_rbuf_opti;
-+};
-+
-+static struct defsched_priv *defsched_get_priv(const struct tcp_sock *tp)
-+{
-+ return (struct defsched_priv *)&tp->mptcp->mptcp_sched[0];
-+}
-+
-+/* Is the sub-socket sk available to send the skb? */
-+static bool mptcp_is_available(struct sock *sk, const struct sk_buff *skb,
-+ bool zero_wnd_test)
-+{
-+ const struct tcp_sock *tp = tcp_sk(sk);
-+ unsigned int mss_now, space, in_flight;
-+
-+ /* Set of states for which we are allowed to send data */
-+ if (!mptcp_sk_can_send(sk))
-+ return false;
-+
-+ /* We do not send data on this subflow unless it is
-+ * fully established, i.e. the 4th ack has been received.
-+ */
-+ if (tp->mptcp->pre_established)
-+ return false;
-+
-+ if (tp->pf)
-+ return false;
-+
-+ if (inet_csk(sk)->icsk_ca_state == TCP_CA_Loss) {
-+ /* If SACK is disabled, and we got a loss, TCP does not exit
-+ * the loss-state until something above high_seq has been acked.
-+ * (see tcp_try_undo_recovery)
-+ *
-+ * high_seq is the snd_nxt at the moment of the RTO. As soon
-+ * as we have an RTO, we won't push data on the subflow.
-+ * Thus, snd_una can never go beyond high_seq.
-+ */
-+ if (!tcp_is_reno(tp))
-+ return false;
-+ else if (tp->snd_una != tp->high_seq)
-+ return false;
-+ }
-+
-+ if (!tp->mptcp->fully_established) {
-+ /* Make sure that we send in-order data */
-+ if (skb && tp->mptcp->second_packet &&
-+ tp->mptcp->last_end_data_seq != TCP_SKB_CB(skb)->seq)
-+ return false;
-+ }
-+
-+ /* If TSQ is already throttling us, do not send on this subflow. When
-+ * TSQ gets cleared the subflow becomes eligible again.
-+ */
-+ if (test_bit(TSQ_THROTTLED, &tp->tsq_flags))
-+ return false;
-+
-+ in_flight = tcp_packets_in_flight(tp);
-+ /* Not even a single spot in the cwnd */
-+ if (in_flight >= tp->snd_cwnd)
-+ return false;
-+
-+ /* Now, check if what is queued in the subflow's send-queue
-+ * already fills the cwnd.
-+ */
-+ space = (tp->snd_cwnd - in_flight) * tp->mss_cache;
-+
-+ if (tp->write_seq - tp->snd_nxt > space)
-+ return false;
-+
-+ if (zero_wnd_test && !before(tp->write_seq, tcp_wnd_end(tp)))
-+ return false;
-+
-+ mss_now = tcp_current_mss(sk);
-+
-+ /* Don't send on this subflow if we bypass the allowed send-window at
-+ * the per-subflow level. Similar to tcp_snd_wnd_test, but manually
-+ * calculated end_seq (because here at this point end_seq is still at
-+ * the meta-level).
-+ */
-+ if (skb && !zero_wnd_test &&
-+ after(tp->write_seq + min(skb->len, mss_now), tcp_wnd_end(tp)))
-+ return false;
-+
-+ return true;
-+}
-+
-+/* Are we not allowed to reinject this skb on tp? */
-+static int mptcp_dont_reinject_skb(const struct tcp_sock *tp, const struct sk_buff *skb)
-+{
-+ /* If the skb has already been enqueued in this sk, try to find
-+ * another one.
-+ */
-+ return skb &&
-+ /* Has the skb already been enqueued into this subsocket? */
-+ mptcp_pi_to_flag(tp->mptcp->path_index) & TCP_SKB_CB(skb)->path_mask;
-+}
-+
-+/* This is the scheduler. This function decides on which flow to send
-+ * a given MSS. If all subflows are found to be busy, NULL is returned
-+ * The flow is selected based on the shortest RTT.
-+ * If all paths have full cong windows, we simply return NULL.
-+ *
-+ * Additionally, this function is aware of the backup-subflows.
-+ */
-+static struct sock *get_available_subflow(struct sock *meta_sk,
-+ struct sk_buff *skb,
-+ bool zero_wnd_test)
-+{
-+ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
-+ struct sock *sk, *bestsk = NULL, *lowpriosk = NULL, *backupsk = NULL;
-+ u32 min_time_to_peer = 0xffffffff, lowprio_min_time_to_peer = 0xffffffff;
-+ int cnt_backups = 0;
-+
-+ /* if there is only one subflow, bypass the scheduling function */
-+ if (mpcb->cnt_subflows == 1) {
-+ bestsk = (struct sock *)mpcb->connection_list;
-+ if (!mptcp_is_available(bestsk, skb, zero_wnd_test))
-+ bestsk = NULL;
-+ return bestsk;
-+ }
-+
-+ /* Answer data_fin on same subflow!!! */
-+ if (meta_sk->sk_shutdown & RCV_SHUTDOWN &&
-+ skb && mptcp_is_data_fin(skb)) {
-+ mptcp_for_each_sk(mpcb, sk) {
-+ if (tcp_sk(sk)->mptcp->path_index == mpcb->dfin_path_index &&
-+ mptcp_is_available(sk, skb, zero_wnd_test))
-+ return sk;
-+ }
-+ }
-+
-+ /* First, find the best subflow */
-+ mptcp_for_each_sk(mpcb, sk) {
-+ struct tcp_sock *tp = tcp_sk(sk);
-+
-+ if (tp->mptcp->rcv_low_prio || tp->mptcp->low_prio)
-+ cnt_backups++;
-+
-+ if ((tp->mptcp->rcv_low_prio || tp->mptcp->low_prio) &&
-+ tp->srtt_us < lowprio_min_time_to_peer) {
-+ if (!mptcp_is_available(sk, skb, zero_wnd_test))
-+ continue;
-+
-+ if (mptcp_dont_reinject_skb(tp, skb)) {
-+ backupsk = sk;
-+ continue;
-+ }
-+
-+ lowprio_min_time_to_peer = tp->srtt_us;
-+ lowpriosk = sk;
-+ } else if (!(tp->mptcp->rcv_low_prio || tp->mptcp->low_prio) &&
-+ tp->srtt_us < min_time_to_peer) {
-+ if (!mptcp_is_available(sk, skb, zero_wnd_test))
-+ continue;
-+
-+ if (mptcp_dont_reinject_skb(tp, skb)) {
-+ backupsk = sk;
-+ continue;
-+ }
-+
-+ min_time_to_peer = tp->srtt_us;
-+ bestsk = sk;
-+ }
-+ }
-+
-+ if (mpcb->cnt_established == cnt_backups && lowpriosk) {
-+ sk = lowpriosk;
-+ } else if (bestsk) {
-+ sk = bestsk;
-+ } else if (backupsk) {
-+ /* It has been sent on all subflows once - let's give it a
-+ * chance again by restarting its pathmask.
-+ */
-+ if (skb)
-+ TCP_SKB_CB(skb)->path_mask = 0;
-+ sk = backupsk;
-+ }
-+
-+ return sk;
-+}
-+
-+static struct sk_buff *mptcp_rcv_buf_optimization(struct sock *sk, int penal)
-+{
-+ struct sock *meta_sk;
-+ const struct tcp_sock *tp = tcp_sk(sk);
-+ struct tcp_sock *tp_it;
-+ struct sk_buff *skb_head;
-+ struct defsched_priv *dsp = defsched_get_priv(tp);
-+
-+ if (tp->mpcb->cnt_subflows == 1)
-+ return NULL;
-+
-+ meta_sk = mptcp_meta_sk(sk);
-+ skb_head = tcp_write_queue_head(meta_sk);
-+
-+ if (!skb_head || skb_head == tcp_send_head(meta_sk))
-+ return NULL;
-+
-+ /* If penalization is optional (coming from mptcp_next_segment()) and
-+ * we are not send-buffer-limited, we do not penalize. The retransmission
-+ * is just an optimization to fix the idle-time due to the delay before
-+ * we wake up the application.
-+ */
-+ if (!penal && sk_stream_memory_free(meta_sk))
-+ goto retrans;
-+
-+ /* Only penalize again after an RTT has elapsed */
-+ if (tcp_time_stamp - dsp->last_rbuf_opti < usecs_to_jiffies(tp->srtt_us >> 3))
-+ goto retrans;
-+
-+ /* Half the cwnd of the slow flow */
-+ mptcp_for_each_tp(tp->mpcb, tp_it) {
-+ if (tp_it != tp &&
-+ TCP_SKB_CB(skb_head)->path_mask & mptcp_pi_to_flag(tp_it->mptcp->path_index)) {
-+ if (tp->srtt_us < tp_it->srtt_us && inet_csk((struct sock *)tp_it)->icsk_ca_state == TCP_CA_Open) {
-+ tp_it->snd_cwnd = max(tp_it->snd_cwnd >> 1U, 1U);
-+ if (tp_it->snd_ssthresh != TCP_INFINITE_SSTHRESH)
-+ tp_it->snd_ssthresh = max(tp_it->snd_ssthresh >> 1U, 2U);
-+
-+ dsp->last_rbuf_opti = tcp_time_stamp;
-+ }
-+ break;
-+ }
-+ }
-+
-+retrans:
-+
-+ /* Segment not yet injected into this path? Take it!!! */
-+ if (!(TCP_SKB_CB(skb_head)->path_mask & mptcp_pi_to_flag(tp->mptcp->path_index))) {
-+ bool do_retrans = false;
-+ mptcp_for_each_tp(tp->mpcb, tp_it) {
-+ if (tp_it != tp &&
-+ TCP_SKB_CB(skb_head)->path_mask & mptcp_pi_to_flag(tp_it->mptcp->path_index)) {
-+ if (tp_it->snd_cwnd <= 4) {
-+ do_retrans = true;
-+ break;
-+ }
-+
-+ if (4 * tp->srtt_us >= tp_it->srtt_us) {
-+ do_retrans = false;
-+ break;
-+ } else {
-+ do_retrans = true;
-+ }
-+ }
-+ }
-+
-+ if (do_retrans && mptcp_is_available(sk, skb_head, false))
-+ return skb_head;
-+ }
-+ return NULL;
-+}
-+
-+/* Returns the next segment to be sent from the mptcp meta-queue.
-+ * (chooses the reinject queue if any segment is waiting in it, otherwise,
-+ * chooses the normal write queue).
-+ * Sets *@reinject to 1 if the returned segment comes from the
-+ * reinject queue. Sets it to 0 if it is the regular send-head of the meta-sk,
-+ * and sets it to -1 if it is a meta-level retransmission to optimize the
-+ * receive-buffer.
-+ */
-+static struct sk_buff *__mptcp_next_segment(struct sock *meta_sk, int *reinject)
-+{
-+ const struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
-+ struct sk_buff *skb = NULL;
-+
-+ *reinject = 0;
-+
-+ /* If we are in fallback-mode, just take from the meta-send-queue */
-+ if (mpcb->infinite_mapping_snd || mpcb->send_infinite_mapping)
-+ return tcp_send_head(meta_sk);
-+
-+ skb = skb_peek(&mpcb->reinject_queue);
-+
-+ if (skb) {
-+ *reinject = 1;
-+ } else {
-+ skb = tcp_send_head(meta_sk);
-+
-+ if (!skb && meta_sk->sk_socket &&
-+ test_bit(SOCK_NOSPACE, &meta_sk->sk_socket->flags) &&
-+ sk_stream_wspace(meta_sk) < sk_stream_min_wspace(meta_sk)) {
-+ struct sock *subsk = get_available_subflow(meta_sk, NULL,
-+ false);
-+ if (!subsk)
-+ return NULL;
-+
-+ skb = mptcp_rcv_buf_optimization(subsk, 0);
-+ if (skb)
-+ *reinject = -1;
-+ }
-+ }
-+ return skb;
-+}
-+
-+static struct sk_buff *mptcp_next_segment(struct sock *meta_sk,
-+ int *reinject,
-+ struct sock **subsk,
-+ unsigned int *limit)
-+{
-+ struct sk_buff *skb = __mptcp_next_segment(meta_sk, reinject);
-+ unsigned int mss_now;
-+ struct tcp_sock *subtp;
-+ u16 gso_max_segs;
-+ u32 max_len, max_segs, window, needed;
-+
-+ /* As we set it, we have to reset it as well. */
-+ *limit = 0;
-+
-+ if (!skb)
-+ return NULL;
-+
-+ *subsk = get_available_subflow(meta_sk, skb, false);
-+ if (!*subsk)
-+ return NULL;
-+
-+ subtp = tcp_sk(*subsk);
-+ mss_now = tcp_current_mss(*subsk);
-+
-+ if (!*reinject && unlikely(!tcp_snd_wnd_test(tcp_sk(meta_sk), skb, mss_now))) {
-+ skb = mptcp_rcv_buf_optimization(*subsk, 1);
-+ if (skb)
-+ *reinject = -1;
-+ else
-+ return NULL;
-+ }
-+
-+ /* No splitting required, as we will only send one single segment */
-+ if (skb->len <= mss_now)
-+ return skb;
-+
-+ /* The following is similar to tcp_mss_split_point, but
-+ * we do not care about nagle, because we will use
-+ * TCP_NAGLE_PUSH anyway, which overrides this.
-+ *
-+ * So, we first limit according to the cwnd/gso-size and then according
-+ * to the subflow's window.
-+ */
-+
-+ gso_max_segs = (*subsk)->sk_gso_max_segs;
-+ if (!gso_max_segs) /* No gso supported on the subflow's NIC */
-+ gso_max_segs = 1;
-+ max_segs = min_t(unsigned int, tcp_cwnd_test(subtp, skb), gso_max_segs);
-+ if (!max_segs)
-+ return NULL;
-+
-+ max_len = mss_now * max_segs;
-+ window = tcp_wnd_end(subtp) - subtp->write_seq;
-+
-+ needed = min(skb->len, window);
-+ if (max_len <= skb->len)
-+ /* Take max_win, which is actually the cwnd/gso-size */
-+ *limit = max_len;
-+ else
-+ /* Or, take the window */
-+ *limit = needed;
-+
-+ return skb;
-+}
-+
-+static void defsched_init(struct sock *sk)
-+{
-+ struct defsched_priv *dsp = defsched_get_priv(tcp_sk(sk));
-+
-+ dsp->last_rbuf_opti = tcp_time_stamp;
-+}
-+
-+struct mptcp_sched_ops mptcp_sched_default = {
-+ .get_subflow = get_available_subflow,
-+ .next_segment = mptcp_next_segment,
-+ .init = defsched_init,
-+ .name = "default",
-+ .owner = THIS_MODULE,
-+};
-+
-+static struct mptcp_sched_ops *mptcp_sched_find(const char *name)
-+{
-+ struct mptcp_sched_ops *e;
-+
-+ list_for_each_entry_rcu(e, &mptcp_sched_list, list) {
-+ if (strcmp(e->name, name) == 0)
-+ return e;
-+ }
-+
-+ return NULL;
-+}
-+
-+int mptcp_register_scheduler(struct mptcp_sched_ops *sched)
-+{
-+ int ret = 0;
-+
-+ if (!sched->get_subflow || !sched->next_segment)
-+ return -EINVAL;
-+
-+ spin_lock(&mptcp_sched_list_lock);
-+ if (mptcp_sched_find(sched->name)) {
-+ pr_notice("%s already registered\n", sched->name);
-+ ret = -EEXIST;
-+ } else {
-+ list_add_tail_rcu(&sched->list, &mptcp_sched_list);
-+ pr_info("%s registered\n", sched->name);
-+ }
-+ spin_unlock(&mptcp_sched_list_lock);
-+
-+ return ret;
-+}
-+EXPORT_SYMBOL_GPL(mptcp_register_scheduler);
-+
-+void mptcp_unregister_scheduler(struct mptcp_sched_ops *sched)
-+{
-+ spin_lock(&mptcp_sched_list_lock);
-+ list_del_rcu(&sched->list);
-+ spin_unlock(&mptcp_sched_list_lock);
-+}
-+EXPORT_SYMBOL_GPL(mptcp_unregister_scheduler);
-+
-+void mptcp_get_default_scheduler(char *name)
-+{
-+ struct mptcp_sched_ops *sched;
-+
-+ BUG_ON(list_empty(&mptcp_sched_list));
-+
-+ rcu_read_lock();
-+ sched = list_entry(mptcp_sched_list.next, struct mptcp_sched_ops, list);
-+ strncpy(name, sched->name, MPTCP_SCHED_NAME_MAX);
-+ rcu_read_unlock();
-+}
-+
-+int mptcp_set_default_scheduler(const char *name)
-+{
-+ struct mptcp_sched_ops *sched;
-+ int ret = -ENOENT;
-+
-+ spin_lock(&mptcp_sched_list_lock);
-+ sched = mptcp_sched_find(name);
-+#ifdef CONFIG_MODULES
-+ if (!sched && capable(CAP_NET_ADMIN)) {
-+ spin_unlock(&mptcp_sched_list_lock);
-+
-+ request_module("mptcp_%s", name);
-+ spin_lock(&mptcp_sched_list_lock);
-+ sched = mptcp_sched_find(name);
-+ }
-+#endif
-+
-+ if (sched) {
-+ list_move(&sched->list, &mptcp_sched_list);
-+ ret = 0;
-+ } else {
-+ pr_info("%s is not available\n", name);
-+ }
-+ spin_unlock(&mptcp_sched_list_lock);
-+
-+ return ret;
-+}
-+
-+void mptcp_init_scheduler(struct mptcp_cb *mpcb)
-+{
-+ struct mptcp_sched_ops *sched;
-+
-+ rcu_read_lock();
-+ list_for_each_entry_rcu(sched, &mptcp_sched_list, list) {
-+ if (try_module_get(sched->owner)) {
-+ mpcb->sched_ops = sched;
-+ break;
-+ }
-+ }
-+ rcu_read_unlock();
-+}
-+
-+/* Manage refcounts on socket close. */
-+void mptcp_cleanup_scheduler(struct mptcp_cb *mpcb)
-+{
-+ module_put(mpcb->sched_ops->owner);
-+}
-+
-+/* Set default value from kernel configuration at bootup */
-+static int __init mptcp_scheduler_default(void)
-+{
-+ BUILD_BUG_ON(sizeof(struct defsched_priv) > MPTCP_SCHED_SIZE);
-+
-+ return mptcp_set_default_scheduler(CONFIG_DEFAULT_MPTCP_SCHED);
-+}
-+late_initcall(mptcp_scheduler_default);
-diff --git a/net/mptcp/mptcp_wvegas.c b/net/mptcp/mptcp_wvegas.c
-new file mode 100644
-index 000000000000..29ca1d868d17
---- /dev/null
-+++ b/net/mptcp/mptcp_wvegas.c
-@@ -0,0 +1,268 @@
-+/*
-+ * MPTCP implementation - WEIGHTED VEGAS
-+ *
-+ * Algorithm design:
-+ * Yu Cao <cyAnalyst@126.com>
-+ * Mingwei Xu <xmw@csnet1.cs.tsinghua.edu.cn>
-+ * Xiaoming Fu <fu@cs.uni-goettinggen.de>
-+ *
-+ * Implementation:
-+ * Yu Cao <cyAnalyst@126.com>
-+ * Enhuan Dong <deh13@mails.tsinghua.edu.cn>
-+ *
-+ * Ported to the official MPTCP-kernel:
-+ * Christoph Paasch <christoph.paasch@uclouvain.be>
-+ *
-+ * This program is free software; you can redistribute it and/or
-+ * modify it under the terms of the GNU General Public License
-+ * as published by the Free Software Foundation; either version
-+ * 2 of the License, or (at your option) any later version.
-+ */
-+
-+#include <linux/skbuff.h>
-+#include <net/tcp.h>
-+#include <net/mptcp.h>
-+#include <linux/module.h>
-+#include <linux/tcp.h>
-+
-+static int initial_alpha = 2;
-+static int total_alpha = 10;
-+static int gamma = 1;
-+
-+module_param(initial_alpha, int, 0644);
-+MODULE_PARM_DESC(initial_alpha, "initial alpha for all subflows");
-+module_param(total_alpha, int, 0644);
-+MODULE_PARM_DESC(total_alpha, "total alpha for all subflows");
-+module_param(gamma, int, 0644);
-+MODULE_PARM_DESC(gamma, "limit on increase (scale by 2)");
-+
-+#define MPTCP_WVEGAS_SCALE 16
-+
-+/* wVegas variables */
-+struct wvegas {
-+ u32 beg_snd_nxt; /* right edge during last RTT */
-+ u8 doing_wvegas_now;/* if true, do wvegas for this RTT */
-+
-+ u16 cnt_rtt; /* # of RTTs measured within last RTT */
-+ u32 sampled_rtt; /* cumulative RTTs measured within last RTT (in usec) */
-+ u32 base_rtt; /* the min of all wVegas RTT measurements seen (in usec) */
-+
-+ u64 instant_rate; /* cwnd / srtt_us, unit: pkts/us * 2^16 */
-+ u64 weight; /* the ratio of subflow's rate to the total rate, * 2^16 */
-+ int alpha; /* alpha for each subflows */
-+
-+ u32 queue_delay; /* queue delay*/
-+};
-+
-+
-+static inline u64 mptcp_wvegas_scale(u32 val, int scale)
-+{
-+ return (u64) val << scale;
-+}
-+
-+static void wvegas_enable(const struct sock *sk)
-+{
-+ const struct tcp_sock *tp = tcp_sk(sk);
-+ struct wvegas *wvegas = inet_csk_ca(sk);
-+
-+ wvegas->doing_wvegas_now = 1;
-+
-+ wvegas->beg_snd_nxt = tp->snd_nxt;
-+
-+ wvegas->cnt_rtt = 0;
-+ wvegas->sampled_rtt = 0;
-+
-+ wvegas->instant_rate = 0;
-+ wvegas->alpha = initial_alpha;
-+ wvegas->weight = mptcp_wvegas_scale(1, MPTCP_WVEGAS_SCALE);
-+
-+ wvegas->queue_delay = 0;
-+}
-+
-+static inline void wvegas_disable(const struct sock *sk)
-+{
-+ struct wvegas *wvegas = inet_csk_ca(sk);
-+
-+ wvegas->doing_wvegas_now = 0;
-+}
-+
-+static void mptcp_wvegas_init(struct sock *sk)
-+{
-+ struct wvegas *wvegas = inet_csk_ca(sk);
-+
-+ wvegas->base_rtt = 0x7fffffff;
-+ wvegas_enable(sk);
-+}
-+
-+static inline u64 mptcp_wvegas_rate(u32 cwnd, u32 rtt_us)
-+{
-+ return div_u64(mptcp_wvegas_scale(cwnd, MPTCP_WVEGAS_SCALE), rtt_us);
-+}
-+
-+static void mptcp_wvegas_pkts_acked(struct sock *sk, u32 cnt, s32 rtt_us)
-+{
-+ struct wvegas *wvegas = inet_csk_ca(sk);
-+ u32 vrtt;
-+
-+ if (rtt_us < 0)
-+ return;
-+
-+ vrtt = rtt_us + 1;
-+
-+ if (vrtt < wvegas->base_rtt)
-+ wvegas->base_rtt = vrtt;
-+
-+ wvegas->sampled_rtt += vrtt;
-+ wvegas->cnt_rtt++;
-+}
-+
-+static void mptcp_wvegas_state(struct sock *sk, u8 ca_state)
-+{
-+ if (ca_state == TCP_CA_Open)
-+ wvegas_enable(sk);
-+ else
-+ wvegas_disable(sk);
-+}
-+
-+static void mptcp_wvegas_cwnd_event(struct sock *sk, enum tcp_ca_event event)
-+{
-+ if (event == CA_EVENT_CWND_RESTART) {
-+ mptcp_wvegas_init(sk);
-+ } else if (event == CA_EVENT_LOSS) {
-+ struct wvegas *wvegas = inet_csk_ca(sk);
-+ wvegas->instant_rate = 0;
-+ }
-+}
-+
-+static inline u32 mptcp_wvegas_ssthresh(const struct tcp_sock *tp)
-+{
-+ return min(tp->snd_ssthresh, tp->snd_cwnd - 1);
-+}
-+
-+static u64 mptcp_wvegas_weight(const struct mptcp_cb *mpcb, const struct sock *sk)
-+{
-+ u64 total_rate = 0;
-+ struct sock *sub_sk;
-+ const struct wvegas *wvegas = inet_csk_ca(sk);
-+
-+ if (!mpcb)
-+ return wvegas->weight;
-+
-+
-+ mptcp_for_each_sk(mpcb, sub_sk) {
-+ struct wvegas *sub_wvegas = inet_csk_ca(sub_sk);
-+
-+ /* sampled_rtt is initialized by 0 */
-+ if (mptcp_sk_can_send(sub_sk) && (sub_wvegas->sampled_rtt > 0))
-+ total_rate += sub_wvegas->instant_rate;
-+ }
-+
-+ if (total_rate && wvegas->instant_rate)
-+ return div64_u64(mptcp_wvegas_scale(wvegas->instant_rate, MPTCP_WVEGAS_SCALE), total_rate);
-+ else
-+ return wvegas->weight;
-+}
-+
-+static void mptcp_wvegas_cong_avoid(struct sock *sk, u32 ack, u32 acked)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ struct wvegas *wvegas = inet_csk_ca(sk);
-+
-+ if (!wvegas->doing_wvegas_now) {
-+ tcp_reno_cong_avoid(sk, ack, acked);
-+ return;
-+ }
-+
-+ if (after(ack, wvegas->beg_snd_nxt)) {
-+ wvegas->beg_snd_nxt = tp->snd_nxt;
-+
-+ if (wvegas->cnt_rtt <= 2) {
-+ tcp_reno_cong_avoid(sk, ack, acked);
-+ } else {
-+ u32 rtt, diff, q_delay;
-+ u64 target_cwnd;
-+
-+ rtt = wvegas->sampled_rtt / wvegas->cnt_rtt;
-+ target_cwnd = div_u64(((u64)tp->snd_cwnd * wvegas->base_rtt), rtt);
-+
-+ diff = div_u64((u64)tp->snd_cwnd * (rtt - wvegas->base_rtt), rtt);
-+
-+ if (diff > gamma && tp->snd_cwnd <= tp->snd_ssthresh) {
-+ tp->snd_cwnd = min(tp->snd_cwnd, (u32)target_cwnd+1);
-+ tp->snd_ssthresh = mptcp_wvegas_ssthresh(tp);
-+
-+ } else if (tp->snd_cwnd <= tp->snd_ssthresh) {
-+ tcp_slow_start(tp, acked);
-+ } else {
-+ if (diff >= wvegas->alpha) {
-+ wvegas->instant_rate = mptcp_wvegas_rate(tp->snd_cwnd, rtt);
-+ wvegas->weight = mptcp_wvegas_weight(tp->mpcb, sk);
-+ wvegas->alpha = max(2U, (u32)((wvegas->weight * total_alpha) >> MPTCP_WVEGAS_SCALE));
-+ }
-+ if (diff > wvegas->alpha) {
-+ tp->snd_cwnd--;
-+ tp->snd_ssthresh = mptcp_wvegas_ssthresh(tp);
-+ } else if (diff < wvegas->alpha) {
-+ tp->snd_cwnd++;
-+ }
-+
-+ /* Try to drain link queue if needed*/
-+ q_delay = rtt - wvegas->base_rtt;
-+ if ((wvegas->queue_delay == 0) || (wvegas->queue_delay > q_delay))
-+ wvegas->queue_delay = q_delay;
-+
-+ if (q_delay >= 2 * wvegas->queue_delay) {
-+ u32 backoff_factor = div_u64(mptcp_wvegas_scale(wvegas->base_rtt, MPTCP_WVEGAS_SCALE), 2 * rtt);
-+ tp->snd_cwnd = ((u64)tp->snd_cwnd * backoff_factor) >> MPTCP_WVEGAS_SCALE;
-+ wvegas->queue_delay = 0;
-+ }
-+ }
-+
-+ if (tp->snd_cwnd < 2)
-+ tp->snd_cwnd = 2;
-+ else if (tp->snd_cwnd > tp->snd_cwnd_clamp)
-+ tp->snd_cwnd = tp->snd_cwnd_clamp;
-+
-+ tp->snd_ssthresh = tcp_current_ssthresh(sk);
-+ }
-+
-+ wvegas->cnt_rtt = 0;
-+ wvegas->sampled_rtt = 0;
-+ }
-+ /* Use normal slow start */
-+ else if (tp->snd_cwnd <= tp->snd_ssthresh)
-+ tcp_slow_start(tp, acked);
-+}
-+
-+
-+static struct tcp_congestion_ops mptcp_wvegas __read_mostly = {
-+ .init = mptcp_wvegas_init,
-+ .ssthresh = tcp_reno_ssthresh,
-+ .cong_avoid = mptcp_wvegas_cong_avoid,
-+ .pkts_acked = mptcp_wvegas_pkts_acked,
-+ .set_state = mptcp_wvegas_state,
-+ .cwnd_event = mptcp_wvegas_cwnd_event,
-+
-+ .owner = THIS_MODULE,
-+ .name = "wvegas",
-+};
-+
-+static int __init mptcp_wvegas_register(void)
-+{
-+ BUILD_BUG_ON(sizeof(struct wvegas) > ICSK_CA_PRIV_SIZE);
-+ tcp_register_congestion_control(&mptcp_wvegas);
-+ return 0;
-+}
-+
-+static void __exit mptcp_wvegas_unregister(void)
-+{
-+ tcp_unregister_congestion_control(&mptcp_wvegas);
-+}
-+
-+module_init(mptcp_wvegas_register);
-+module_exit(mptcp_wvegas_unregister);
-+
-+MODULE_AUTHOR("Yu Cao, Enhuan Dong");
-+MODULE_LICENSE("GPL");
-+MODULE_DESCRIPTION("MPTCP wVegas");
-+MODULE_VERSION("0.1");
^ permalink raw reply related [flat|nested] 26+ messages in thread
* [gentoo-commits] proj/linux-patches:3.16 commit in: /
@ 2014-10-07 1:28 Anthony G. Basile
0 siblings, 0 replies; 26+ messages in thread
From: Anthony G. Basile @ 2014-10-07 1:28 UTC (permalink / raw
To: gentoo-commits
commit: f0e24d581e380ceb5a563a6bc0a9e66ad077fe31
Author: Anthony G. Basile <blueness <AT> gentoo <DOT> org>
AuthorDate: Tue Oct 7 01:28:33 2014 +0000
Commit: Anthony G. Basile <blueness <AT> gentoo <DOT> org>
CommitDate: Tue Oct 7 01:28:33 2014 +0000
URL: http://sources.gentoo.org/gitweb/?p=proj/linux-patches.git;a=commit;h=f0e24d58
Add patch to support namespace user.pax.* on tmpfs, bug #470644
---
0000_README | 4 ++++
1500_XATTR_USER_PREFIX.patch | 54 ++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 58 insertions(+)
diff --git a/0000_README b/0000_README
index 3cc9441..25ca364 100644
--- a/0000_README
+++ b/0000_README
@@ -54,6 +54,10 @@ Patch: 1002_linux-3.16.3.patch
From: http://www.kernel.org
Desc: Linux 3.16.3
+Patch: 1500_XATTR_USER_PREFIX.patch
+From: https://bugs.gentoo.org/show_bug.cgi?id=470644
+Desc: Support for namespace user.pax.* on tmpfs.
+
Patch: 2400_kcopy-patch-for-infiniband-driver.patch
From: Alexey Shvetsov <alexxy@gentoo.org>
Desc: Zero copy for infiniband psm userspace driver
diff --git a/1500_XATTR_USER_PREFIX.patch b/1500_XATTR_USER_PREFIX.patch
new file mode 100644
index 0000000..cc15cd5
--- /dev/null
+++ b/1500_XATTR_USER_PREFIX.patch
@@ -0,0 +1,54 @@
+From: Anthony G. Basile <blueness@gentoo.org>
+
+This patch adds support for a restricted user-controlled namespace on
+tmpfs filesystem used to house PaX flags. The namespace must be of the
+form user.pax.* and its value cannot exceed a size of 8 bytes.
+
+This is needed even on all Gentoo systems so that XATTR_PAX flags
+are preserved for users who might build packages using portage on
+a tmpfs system with a non-hardened kernel and then switch to a
+hardened kernel with XATTR_PAX enabled.
+
+The namespace is added to any user with Extended Attribute support
+enabled for tmpfs. Users who do not enable xattrs will not have
+the XATTR_PAX flags preserved.
+
+diff --git a/include/uapi/linux/xattr.h b/include/uapi/linux/xattr.h
+index e4629b9..6958086 100644
+--- a/include/uapi/linux/xattr.h
++++ b/include/uapi/linux/xattr.h
+@@ -63,5 +63,9 @@
+ #define XATTR_POSIX_ACL_DEFAULT "posix_acl_default"
+ #define XATTR_NAME_POSIX_ACL_DEFAULT XATTR_SYSTEM_PREFIX XATTR_POSIX_ACL_DEFAULT
+
++/* User namespace */
++#define XATTR_PAX_PREFIX XATTR_USER_PREFIX "pax."
++#define XATTR_PAX_FLAGS_SUFFIX "flags"
++#define XATTR_NAME_PAX_FLAGS XATTR_PAX_PREFIX XATTR_PAX_FLAGS_SUFFIX
+
+ #endif /* _UAPI_LINUX_XATTR_H */
+diff --git a/mm/shmem.c b/mm/shmem.c
+index 1c44af7..f23bb1b 100644
+--- a/mm/shmem.c
++++ b/mm/shmem.c
+@@ -2201,6 +2201,7 @@ static const struct xattr_handler *shmem_xattr_handlers[] = {
+ static int shmem_xattr_validate(const char *name)
+ {
+ struct { const char *prefix; size_t len; } arr[] = {
++ { XATTR_USER_PREFIX, XATTR_USER_PREFIX_LEN},
+ { XATTR_SECURITY_PREFIX, XATTR_SECURITY_PREFIX_LEN },
+ { XATTR_TRUSTED_PREFIX, XATTR_TRUSTED_PREFIX_LEN }
+ };
+@@ -2256,6 +2257,12 @@ static int shmem_setxattr(struct dentry *dentry, const char *name,
+ if (err)
+ return err;
+
++ if (!strncmp(name, XATTR_USER_PREFIX, XATTR_USER_PREFIX_LEN)) {
++ if (strcmp(name, XATTR_NAME_PAX_FLAGS))
++ return -EOPNOTSUPP;
++ if (size > 8)
++ return -EINVAL;
++ }
+ return simple_xattr_set(&info->xattrs, name, value, size, flags);
+ }
+
* [gentoo-commits] proj/linux-patches:3.16 commit in: /
@ 2014-10-07 1:34 Anthony G. Basile
0 siblings, 0 replies; 26+ messages in thread
From: Anthony G. Basile @ 2014-10-07 1:34 UTC (permalink / raw
To: gentoo-commits
commit: 469245b0b190204e29f395ab73a0c3b5b2ab988f
Author: Anthony G. Basile <blueness <AT> gentoo <DOT> org>
AuthorDate: Tue Oct 7 01:28:33 2014 +0000
Commit: Anthony G. Basile <blueness <AT> gentoo <DOT> org>
CommitDate: Tue Oct 7 01:34:53 2014 +0000
URL: http://sources.gentoo.org/gitweb/?p=proj/linux-patches.git;a=commit;h=469245b0
Add patch to support namespace user.pax.* on tmpfs, bug #470644
---
0000_README | 4 ++++
1500_XATTR_USER_PREFIX.patch | 54 ++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 58 insertions(+)
diff --git a/0000_README b/0000_README
index 3cc9441..25ca364 100644
--- a/0000_README
+++ b/0000_README
@@ -54,6 +54,10 @@ Patch: 1002_linux-3.16.3.patch
From: http://www.kernel.org
Desc: Linux 3.16.3
+Patch: 1500_XATTR_USER_PREFIX.patch
+From: https://bugs.gentoo.org/show_bug.cgi?id=470644
+Desc: Support for namespace user.pax.* on tmpfs.
+
Patch: 2400_kcopy-patch-for-infiniband-driver.patch
From: Alexey Shvetsov <alexxy@gentoo.org>
Desc: Zero copy for infiniband psm userspace driver
diff --git a/1500_XATTR_USER_PREFIX.patch b/1500_XATTR_USER_PREFIX.patch
new file mode 100644
index 0000000..cc15cd5
--- /dev/null
+++ b/1500_XATTR_USER_PREFIX.patch
@@ -0,0 +1,54 @@
+From: Anthony G. Basile <blueness@gentoo.org>
+
+This patch adds support for a restricted user-controlled namespace on
+tmpfs filesystem used to house PaX flags. The namespace must be of the
+form user.pax.* and its value cannot exceed a size of 8 bytes.
+
+This is needed even on all Gentoo systems so that XATTR_PAX flags
+are preserved for users who might build packages using portage on
+a tmpfs system with a non-hardened kernel and then switch to a
+hardened kernel with XATTR_PAX enabled.
+
+The namespace is added to any user with Extended Attribute support
+enabled for tmpfs. Users who do not enable xattrs will not have
+the XATTR_PAX flags preserved.
+
+diff --git a/include/uapi/linux/xattr.h b/include/uapi/linux/xattr.h
+index e4629b9..6958086 100644
+--- a/include/uapi/linux/xattr.h
++++ b/include/uapi/linux/xattr.h
+@@ -63,5 +63,9 @@
+ #define XATTR_POSIX_ACL_DEFAULT "posix_acl_default"
+ #define XATTR_NAME_POSIX_ACL_DEFAULT XATTR_SYSTEM_PREFIX XATTR_POSIX_ACL_DEFAULT
+
++/* User namespace */
++#define XATTR_PAX_PREFIX XATTR_USER_PREFIX "pax."
++#define XATTR_PAX_FLAGS_SUFFIX "flags"
++#define XATTR_NAME_PAX_FLAGS XATTR_PAX_PREFIX XATTR_PAX_FLAGS_SUFFIX
+
+ #endif /* _UAPI_LINUX_XATTR_H */
+diff --git a/mm/shmem.c b/mm/shmem.c
+index 1c44af7..f23bb1b 100644
+--- a/mm/shmem.c
++++ b/mm/shmem.c
+@@ -2201,6 +2201,7 @@ static const struct xattr_handler *shmem_xattr_handlers[] = {
+ static int shmem_xattr_validate(const char *name)
+ {
+ struct { const char *prefix; size_t len; } arr[] = {
++ { XATTR_USER_PREFIX, XATTR_USER_PREFIX_LEN},
+ { XATTR_SECURITY_PREFIX, XATTR_SECURITY_PREFIX_LEN },
+ { XATTR_TRUSTED_PREFIX, XATTR_TRUSTED_PREFIX_LEN }
+ };
+@@ -2256,6 +2257,12 @@ static int shmem_setxattr(struct dentry *dentry, const char *name,
+ if (err)
+ return err;
+
++ if (!strncmp(name, XATTR_USER_PREFIX, XATTR_USER_PREFIX_LEN)) {
++ if (strcmp(name, XATTR_NAME_PAX_FLAGS))
++ return -EOPNOTSUPP;
++ if (size > 8)
++ return -EINVAL;
++ }
+ return simple_xattr_set(&info->xattrs, name, value, size, flags);
+ }
+
* [gentoo-commits] proj/linux-patches:3.16 commit in: /
@ 2014-10-09 19:54 Mike Pagano
0 siblings, 0 replies; 26+ messages in thread
From: Mike Pagano @ 2014-10-09 19:54 UTC (permalink / raw
To: gentoo-commits
commit: 5a7ae131b7b69198d892277ab46031299237a9a6
Author: Mike Pagano <mpagano <AT> gentoo <DOT> org>
AuthorDate: Thu Oct 9 19:54:07 2014 +0000
Commit: Mike Pagano <mpagano <AT> gentoo <DOT> org>
CommitDate: Thu Oct 9 19:54:07 2014 +0000
URL: http://sources.gentoo.org/gitweb/?p=proj/linux-patches.git;a=commit;h=5a7ae131
Linux patch 3.16.5
---
0000_README | 8 +
1004_linux-3.16.5.patch | 987 ++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 995 insertions(+)
diff --git a/0000_README b/0000_README
index 25ca364..ede03f9 100644
--- a/0000_README
+++ b/0000_README
@@ -54,6 +54,14 @@ Patch: 1002_linux-3.16.3.patch
From: http://www.kernel.org
Desc: Linux 3.16.3
+Patch: 1003_linux-3.16.4.patch
+From: http://www.kernel.org
+Desc: Linux 3.16.4
+
+Patch: 1004_linux-3.16.5.patch
+From: http://www.kernel.org
+Desc: Linux 3.16.5
+
Patch: 1500_XATTR_USER_PREFIX.patch
From: https://bugs.gentoo.org/show_bug.cgi?id=470644
Desc: Support for namespace user.pax.* on tmpfs.
diff --git a/1004_linux-3.16.5.patch b/1004_linux-3.16.5.patch
new file mode 100644
index 0000000..248afad
--- /dev/null
+++ b/1004_linux-3.16.5.patch
@@ -0,0 +1,987 @@
+diff --git a/Makefile b/Makefile
+index e75c75f0ec35..41efc3d9f2e0 100644
+--- a/Makefile
++++ b/Makefile
+@@ -1,6 +1,6 @@
+ VERSION = 3
+ PATCHLEVEL = 16
+-SUBLEVEL = 4
++SUBLEVEL = 5
+ EXTRAVERSION =
+ NAME = Museum of Fishiegoodies
+
+diff --git a/arch/ia64/pci/fixup.c b/arch/ia64/pci/fixup.c
+index 1fe9aa5068ea..fc505d58f078 100644
+--- a/arch/ia64/pci/fixup.c
++++ b/arch/ia64/pci/fixup.c
+@@ -6,6 +6,7 @@
+ #include <linux/pci.h>
+ #include <linux/init.h>
+ #include <linux/vgaarb.h>
++#include <linux/screen_info.h>
+
+ #include <asm/machvec.h>
+
+@@ -61,8 +62,7 @@ static void pci_fixup_video(struct pci_dev *pdev)
+ pci_read_config_word(pdev, PCI_COMMAND, &config);
+ if (config & (PCI_COMMAND_IO | PCI_COMMAND_MEMORY)) {
+ pdev->resource[PCI_ROM_RESOURCE].flags |= IORESOURCE_ROM_SHADOW;
+- dev_printk(KERN_DEBUG, &pdev->dev, "Boot video device\n");
+- vga_set_default_device(pdev);
++ dev_printk(KERN_DEBUG, &pdev->dev, "Video device with shadowed ROM\n");
+ }
+ }
+ }
+diff --git a/arch/x86/include/asm/vga.h b/arch/x86/include/asm/vga.h
+index 44282fbf7bf9..c4b9dc2f67c5 100644
+--- a/arch/x86/include/asm/vga.h
++++ b/arch/x86/include/asm/vga.h
+@@ -17,10 +17,4 @@
+ #define vga_readb(x) (*(x))
+ #define vga_writeb(x, y) (*(y) = (x))
+
+-#ifdef CONFIG_FB_EFI
+-#define __ARCH_HAS_VGA_DEFAULT_DEVICE
+-extern struct pci_dev *vga_default_device(void);
+-extern void vga_set_default_device(struct pci_dev *pdev);
+-#endif
+-
+ #endif /* _ASM_X86_VGA_H */
+diff --git a/arch/x86/pci/fixup.c b/arch/x86/pci/fixup.c
+index b5e60268d93f..9a2b7101ae8a 100644
+--- a/arch/x86/pci/fixup.c
++++ b/arch/x86/pci/fixup.c
+@@ -350,8 +350,7 @@ static void pci_fixup_video(struct pci_dev *pdev)
+ pci_read_config_word(pdev, PCI_COMMAND, &config);
+ if (config & (PCI_COMMAND_IO | PCI_COMMAND_MEMORY)) {
+ pdev->resource[PCI_ROM_RESOURCE].flags |= IORESOURCE_ROM_SHADOW;
+- dev_printk(KERN_DEBUG, &pdev->dev, "Boot video device\n");
+- vga_set_default_device(pdev);
++ dev_printk(KERN_DEBUG, &pdev->dev, "Video device with shadowed ROM\n");
+ }
+ }
+ }
+diff --git a/drivers/cpufreq/integrator-cpufreq.c b/drivers/cpufreq/integrator-cpufreq.c
+index e5122f1bfe78..302eb5c55d01 100644
+--- a/drivers/cpufreq/integrator-cpufreq.c
++++ b/drivers/cpufreq/integrator-cpufreq.c
+@@ -213,9 +213,9 @@ static int __init integrator_cpufreq_probe(struct platform_device *pdev)
+ return cpufreq_register_driver(&integrator_driver);
+ }
+
+-static void __exit integrator_cpufreq_remove(struct platform_device *pdev)
++static int __exit integrator_cpufreq_remove(struct platform_device *pdev)
+ {
+- cpufreq_unregister_driver(&integrator_driver);
++ return cpufreq_unregister_driver(&integrator_driver);
+ }
+
+ static const struct of_device_id integrator_cpufreq_match[] = {
+diff --git a/drivers/cpufreq/pcc-cpufreq.c b/drivers/cpufreq/pcc-cpufreq.c
+index 728a2d879499..4d2c8e861089 100644
+--- a/drivers/cpufreq/pcc-cpufreq.c
++++ b/drivers/cpufreq/pcc-cpufreq.c
+@@ -204,7 +204,6 @@ static int pcc_cpufreq_target(struct cpufreq_policy *policy,
+ u32 input_buffer;
+ int cpu;
+
+- spin_lock(&pcc_lock);
+ cpu = policy->cpu;
+ pcc_cpu_data = per_cpu_ptr(pcc_cpu_info, cpu);
+
+@@ -216,6 +215,7 @@ static int pcc_cpufreq_target(struct cpufreq_policy *policy,
+ freqs.old = policy->cur;
+ freqs.new = target_freq;
+ cpufreq_freq_transition_begin(policy, &freqs);
++ spin_lock(&pcc_lock);
+
+ input_buffer = 0x1 | (((target_freq * 100)
+ / (ioread32(&pcch_hdr->nominal) * 1000)) << 8);
+diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.c b/drivers/gpu/drm/i915/i915_gem_gtt.c
+index 8b3cde703364..8faabb95cd65 100644
+--- a/drivers/gpu/drm/i915/i915_gem_gtt.c
++++ b/drivers/gpu/drm/i915/i915_gem_gtt.c
+@@ -1297,6 +1297,16 @@ void i915_check_and_clear_faults(struct drm_device *dev)
+ POSTING_READ(RING_FAULT_REG(&dev_priv->ring[RCS]));
+ }
+
++static void i915_ggtt_flush(struct drm_i915_private *dev_priv)
++{
++ if (INTEL_INFO(dev_priv->dev)->gen < 6) {
++ intel_gtt_chipset_flush();
++ } else {
++ I915_WRITE(GFX_FLSH_CNTL_GEN6, GFX_FLSH_CNTL_EN);
++ POSTING_READ(GFX_FLSH_CNTL_GEN6);
++ }
++}
++
+ void i915_gem_suspend_gtt_mappings(struct drm_device *dev)
+ {
+ struct drm_i915_private *dev_priv = dev->dev_private;
+@@ -1313,6 +1323,8 @@ void i915_gem_suspend_gtt_mappings(struct drm_device *dev)
+ dev_priv->gtt.base.start,
+ dev_priv->gtt.base.total,
+ true);
++
++ i915_ggtt_flush(dev_priv);
+ }
+
+ void i915_gem_restore_gtt_mappings(struct drm_device *dev)
+@@ -1365,7 +1377,7 @@ void i915_gem_restore_gtt_mappings(struct drm_device *dev)
+ gen6_write_pdes(container_of(vm, struct i915_hw_ppgtt, base));
+ }
+
+- i915_gem_chipset_flush(dev);
++ i915_ggtt_flush(dev_priv);
+ }
+
+ int i915_gem_gtt_prepare_object(struct drm_i915_gem_object *obj)
+diff --git a/drivers/gpu/drm/i915/intel_opregion.c b/drivers/gpu/drm/i915/intel_opregion.c
+index 4f6b53998d79..b9135dc3fe5d 100644
+--- a/drivers/gpu/drm/i915/intel_opregion.c
++++ b/drivers/gpu/drm/i915/intel_opregion.c
+@@ -395,6 +395,16 @@ int intel_opregion_notify_adapter(struct drm_device *dev, pci_power_t state)
+ return -EINVAL;
+ }
+
++/*
++ * If the vendor backlight interface is not in use and ACPI backlight interface
++ * is broken, do not bother processing backlight change requests from firmware.
++ */
++static bool should_ignore_backlight_request(void)
++{
++ return acpi_video_backlight_support() &&
++ !acpi_video_verify_backlight_support();
++}
++
+ static u32 asle_set_backlight(struct drm_device *dev, u32 bclp)
+ {
+ struct drm_i915_private *dev_priv = dev->dev_private;
+@@ -403,11 +413,7 @@ static u32 asle_set_backlight(struct drm_device *dev, u32 bclp)
+
+ DRM_DEBUG_DRIVER("bclp = 0x%08x\n", bclp);
+
+- /*
+- * If the acpi_video interface is not supposed to be used, don't
+- * bother processing backlight level change requests from firmware.
+- */
+- if (!acpi_video_verify_backlight_support()) {
++ if (should_ignore_backlight_request()) {
+ DRM_DEBUG_KMS("opregion backlight request ignored\n");
+ return 0;
+ }
+diff --git a/drivers/gpu/vga/vgaarb.c b/drivers/gpu/vga/vgaarb.c
+index af0259708358..366641d0483f 100644
+--- a/drivers/gpu/vga/vgaarb.c
++++ b/drivers/gpu/vga/vgaarb.c
+@@ -41,6 +41,7 @@
+ #include <linux/poll.h>
+ #include <linux/miscdevice.h>
+ #include <linux/slab.h>
++#include <linux/screen_info.h>
+
+ #include <linux/uaccess.h>
+
+@@ -580,8 +581,11 @@ static bool vga_arbiter_add_pci_device(struct pci_dev *pdev)
+ */
+ #ifndef __ARCH_HAS_VGA_DEFAULT_DEVICE
+ if (vga_default == NULL &&
+- ((vgadev->owns & VGA_RSRC_LEGACY_MASK) == VGA_RSRC_LEGACY_MASK))
++ ((vgadev->owns & VGA_RSRC_LEGACY_MASK) == VGA_RSRC_LEGACY_MASK)) {
++ pr_info("vgaarb: setting as boot device: PCI:%s\n",
++ pci_name(pdev));
+ vga_set_default_device(pdev);
++ }
+ #endif
+
+ vga_arbiter_check_bridge_sharing(vgadev);
+@@ -1316,6 +1320,38 @@ static int __init vga_arb_device_init(void)
+ pr_info("vgaarb: loaded\n");
+
+ list_for_each_entry(vgadev, &vga_list, list) {
++#if defined(CONFIG_X86) || defined(CONFIG_IA64)
++ /* Override I/O based detection done by vga_arbiter_add_pci_device()
++ * as it may take the wrong device (e.g. on Apple system under EFI).
++ *
++ * Select the device owning the boot framebuffer if there is one.
++ */
++ resource_size_t start, end;
++ int i;
++
++ /* Does firmware framebuffer belong to us? */
++ for (i = 0; i < DEVICE_COUNT_RESOURCE; i++) {
++ if (!(pci_resource_flags(vgadev->pdev, i) & IORESOURCE_MEM))
++ continue;
++
++ start = pci_resource_start(vgadev->pdev, i);
++ end = pci_resource_end(vgadev->pdev, i);
++
++ if (!start || !end)
++ continue;
++
++ if (screen_info.lfb_base < start ||
++ (screen_info.lfb_base + screen_info.lfb_size) >= end)
++ continue;
++ if (!vga_default_device())
++ pr_info("vgaarb: setting as boot device: PCI:%s\n",
++ pci_name(vgadev->pdev));
++ else if (vgadev->pdev != vga_default_device())
++ pr_info("vgaarb: overriding boot device: PCI:%s\n",
++ pci_name(vgadev->pdev));
++ vga_set_default_device(vgadev->pdev);
++ }
++#endif
+ if (vgadev->bridge_has_one_vga)
+ pr_info("vgaarb: bridge control possible %s\n", pci_name(vgadev->pdev));
+ else
+diff --git a/drivers/i2c/busses/i2c-qup.c b/drivers/i2c/busses/i2c-qup.c
+index 2a5efb5b487c..eb47c98131ec 100644
+--- a/drivers/i2c/busses/i2c-qup.c
++++ b/drivers/i2c/busses/i2c-qup.c
+@@ -670,16 +670,20 @@ static int qup_i2c_probe(struct platform_device *pdev)
+ qup->adap.dev.of_node = pdev->dev.of_node;
+ strlcpy(qup->adap.name, "QUP I2C adapter", sizeof(qup->adap.name));
+
+- ret = i2c_add_adapter(&qup->adap);
+- if (ret)
+- goto fail;
+-
+ pm_runtime_set_autosuspend_delay(qup->dev, MSEC_PER_SEC);
+ pm_runtime_use_autosuspend(qup->dev);
+ pm_runtime_set_active(qup->dev);
+ pm_runtime_enable(qup->dev);
++
++ ret = i2c_add_adapter(&qup->adap);
++ if (ret)
++ goto fail_runtime;
++
+ return 0;
+
++fail_runtime:
++ pm_runtime_disable(qup->dev);
++ pm_runtime_set_suspended(qup->dev);
+ fail:
+ qup_i2c_disable_clocks(qup);
+ return ret;
+diff --git a/drivers/i2c/busses/i2c-rk3x.c b/drivers/i2c/busses/i2c-rk3x.c
+index 93cfc837200b..b38b0529946a 100644
+--- a/drivers/i2c/busses/i2c-rk3x.c
++++ b/drivers/i2c/busses/i2c-rk3x.c
+@@ -238,7 +238,7 @@ static void rk3x_i2c_fill_transmit_buf(struct rk3x_i2c *i2c)
+ for (i = 0; i < 8; ++i) {
+ val = 0;
+ for (j = 0; j < 4; ++j) {
+- if (i2c->processed == i2c->msg->len)
++ if ((i2c->processed == i2c->msg->len) && (cnt != 0))
+ break;
+
+ if (i2c->processed == 0 && cnt == 0)
+diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
+index 183588b11fc1..9f0fbecd1eb5 100644
+--- a/drivers/md/raid5.c
++++ b/drivers/md/raid5.c
+@@ -64,6 +64,10 @@
+ #define cpu_to_group(cpu) cpu_to_node(cpu)
+ #define ANY_GROUP NUMA_NO_NODE
+
++static bool devices_handle_discard_safely = false;
++module_param(devices_handle_discard_safely, bool, 0644);
++MODULE_PARM_DESC(devices_handle_discard_safely,
++ "Set to Y if all devices in each array reliably return zeroes on reads from discarded regions");
+ static struct workqueue_struct *raid5_wq;
+ /*
+ * Stripe cache
+@@ -6208,7 +6212,7 @@ static int run(struct mddev *mddev)
+ mddev->queue->limits.discard_granularity = stripe;
+ /*
+ * unaligned part of discard request will be ignored, so can't
+- * guarantee discard_zerors_data
++ * guarantee discard_zeroes_data
+ */
+ mddev->queue->limits.discard_zeroes_data = 0;
+
+@@ -6233,6 +6237,18 @@ static int run(struct mddev *mddev)
+ !bdev_get_queue(rdev->bdev)->
+ limits.discard_zeroes_data)
+ discard_supported = false;
++ /* Unfortunately, discard_zeroes_data is not currently
++ * a guarantee - just a hint. So we only allow DISCARD
++ * if the sysadmin has confirmed that only safe devices
++ * are in use by setting a module parameter.
++ */
++ if (!devices_handle_discard_safely) {
++ if (discard_supported) {
++ pr_info("md/raid456: discard support disabled due to uncertainty.\n");
++ pr_info("Set raid456.devices_handle_discard_safely=Y to override.\n");
++ }
++ discard_supported = false;
++ }
+ }
+
+ if (discard_supported &&
+diff --git a/drivers/media/v4l2-core/videobuf2-core.c b/drivers/media/v4l2-core/videobuf2-core.c
+index dcdceae30ab0..a946523772d6 100644
+--- a/drivers/media/v4l2-core/videobuf2-core.c
++++ b/drivers/media/v4l2-core/videobuf2-core.c
+@@ -967,6 +967,7 @@ static int __reqbufs(struct vb2_queue *q, struct v4l2_requestbuffers *req)
+ * to the userspace.
+ */
+ req->count = allocated_buffers;
++ q->waiting_for_buffers = !V4L2_TYPE_IS_OUTPUT(q->type);
+
+ return 0;
+ }
+@@ -1014,6 +1015,7 @@ static int __create_bufs(struct vb2_queue *q, struct v4l2_create_buffers *create
+ memset(q->plane_sizes, 0, sizeof(q->plane_sizes));
+ memset(q->alloc_ctx, 0, sizeof(q->alloc_ctx));
+ q->memory = create->memory;
++ q->waiting_for_buffers = !V4L2_TYPE_IS_OUTPUT(q->type);
+ }
+
+ num_buffers = min(create->count, VIDEO_MAX_FRAME - q->num_buffers);
+@@ -1812,6 +1814,7 @@ static int vb2_internal_qbuf(struct vb2_queue *q, struct v4l2_buffer *b)
+ */
+ list_add_tail(&vb->queued_entry, &q->queued_list);
+ q->queued_count++;
++ q->waiting_for_buffers = false;
+ vb->state = VB2_BUF_STATE_QUEUED;
+ if (V4L2_TYPE_IS_OUTPUT(q->type)) {
+ /*
+@@ -2244,6 +2247,7 @@ static int vb2_internal_streamoff(struct vb2_queue *q, enum v4l2_buf_type type)
+ * their normal dequeued state.
+ */
+ __vb2_queue_cancel(q);
++ q->waiting_for_buffers = !V4L2_TYPE_IS_OUTPUT(q->type);
+
+ dprintk(3, "successful\n");
+ return 0;
+@@ -2562,9 +2566,16 @@ unsigned int vb2_poll(struct vb2_queue *q, struct file *file, poll_table *wait)
+ }
+
+ /*
+- * There is nothing to wait for if no buffers have already been queued.
++ * There is nothing to wait for if the queue isn't streaming.
+ */
+- if (list_empty(&q->queued_list))
++ if (!vb2_is_streaming(q))
++ return res | POLLERR;
++ /*
++ * For compatibility with vb1: if QBUF hasn't been called yet, then
++ * return POLLERR as well. This only affects capture queues, output
++ * queues will always initialize waiting_for_buffers to false.
++ */
++ if (q->waiting_for_buffers)
+ return res | POLLERR;
+
+ if (list_empty(&q->done_list))
+diff --git a/drivers/usb/storage/uas-detect.h b/drivers/usb/storage/uas-detect.h
+index bb05b984d5f6..8a6f371ed6e7 100644
+--- a/drivers/usb/storage/uas-detect.h
++++ b/drivers/usb/storage/uas-detect.h
+@@ -9,32 +9,15 @@ static int uas_is_interface(struct usb_host_interface *intf)
+ intf->desc.bInterfaceProtocol == USB_PR_UAS);
+ }
+
+-static int uas_isnt_supported(struct usb_device *udev)
+-{
+- struct usb_hcd *hcd = bus_to_hcd(udev->bus);
+-
+- dev_warn(&udev->dev, "The driver for the USB controller %s does not "
+- "support scatter-gather which is\n",
+- hcd->driver->description);
+- dev_warn(&udev->dev, "required by the UAS driver. Please try an"
+- "alternative USB controller if you wish to use UAS.\n");
+- return -ENODEV;
+-}
+-
+ static int uas_find_uas_alt_setting(struct usb_interface *intf)
+ {
+ int i;
+- struct usb_device *udev = interface_to_usbdev(intf);
+- int sg_supported = udev->bus->sg_tablesize != 0;
+
+ for (i = 0; i < intf->num_altsetting; i++) {
+ struct usb_host_interface *alt = &intf->altsetting[i];
+
+- if (uas_is_interface(alt)) {
+- if (!sg_supported)
+- return uas_isnt_supported(udev);
++ if (uas_is_interface(alt))
+ return alt->desc.bAlternateSetting;
+- }
+ }
+
+ return -ENODEV;
+@@ -76,13 +59,6 @@ static int uas_use_uas_driver(struct usb_interface *intf,
+ unsigned long flags = id->driver_info;
+ int r, alt;
+
+- usb_stor_adjust_quirks(udev, &flags);
+-
+- if (flags & US_FL_IGNORE_UAS)
+- return 0;
+-
+- if (udev->speed >= USB_SPEED_SUPER && !hcd->can_do_streams)
+- return 0;
+
+ alt = uas_find_uas_alt_setting(intf);
+ if (alt < 0)
+@@ -92,5 +68,46 @@ static int uas_use_uas_driver(struct usb_interface *intf,
+ if (r < 0)
+ return 0;
+
++ /*
++ * ASM1051 and older ASM1053 devices have the same usb-id, and UAS is
++ * broken on the ASM1051, use the number of streams to differentiate.
++ * New ASM1053-s also support 32 streams, but have a different prod-id.
++ */
++ if (le16_to_cpu(udev->descriptor.idVendor) == 0x174c &&
++ le16_to_cpu(udev->descriptor.idProduct) == 0x55aa) {
++ if (udev->speed < USB_SPEED_SUPER) {
++ /* No streams info, assume ASM1051 */
++ flags |= US_FL_IGNORE_UAS;
++ } else if (usb_ss_max_streams(&eps[1]->ss_ep_comp) == 32) {
++ flags |= US_FL_IGNORE_UAS;
++ }
++ }
++
++ usb_stor_adjust_quirks(udev, &flags);
++
++ if (flags & US_FL_IGNORE_UAS) {
++ dev_warn(&udev->dev,
++ "UAS is blacklisted for this device, using usb-storage instead\n");
++ return 0;
++ }
++
++ if (udev->bus->sg_tablesize == 0) {
++ dev_warn(&udev->dev,
++ "The driver for the USB controller %s does not support scatter-gather which is\n",
++ hcd->driver->description);
++ dev_warn(&udev->dev,
++ "required by the UAS driver. Please try an other USB controller if you wish to use UAS.\n");
++ return 0;
++ }
++
++ if (udev->speed >= USB_SPEED_SUPER && !hcd->can_do_streams) {
++ dev_warn(&udev->dev,
++ "USB controller %s does not support streams, which are required by the UAS driver.\n",
++ hcd_to_bus(hcd)->bus_name);
++ dev_warn(&udev->dev,
++ "Please try an other USB controller if you wish to use UAS.\n");
++ return 0;
++ }
++
+ return 1;
+ }
+diff --git a/drivers/video/fbdev/efifb.c b/drivers/video/fbdev/efifb.c
+index ae9618ff6735..982f6abe6faf 100644
+--- a/drivers/video/fbdev/efifb.c
++++ b/drivers/video/fbdev/efifb.c
+@@ -19,8 +19,6 @@
+
+ static bool request_mem_succeeded = false;
+
+-static struct pci_dev *default_vga;
+-
+ static struct fb_var_screeninfo efifb_defined = {
+ .activate = FB_ACTIVATE_NOW,
+ .height = -1,
+@@ -84,23 +82,10 @@ static struct fb_ops efifb_ops = {
+ .fb_imageblit = cfb_imageblit,
+ };
+
+-struct pci_dev *vga_default_device(void)
+-{
+- return default_vga;
+-}
+-
+-EXPORT_SYMBOL_GPL(vga_default_device);
+-
+-void vga_set_default_device(struct pci_dev *pdev)
+-{
+- default_vga = pdev;
+-}
+-
+ static int efifb_setup(char *options)
+ {
+ char *this_opt;
+ int i;
+- struct pci_dev *dev = NULL;
+
+ if (options && *options) {
+ while ((this_opt = strsep(&options, ",")) != NULL) {
+@@ -126,30 +111,6 @@ static int efifb_setup(char *options)
+ }
+ }
+
+- for_each_pci_dev(dev) {
+- int i;
+-
+- if ((dev->class >> 8) != PCI_CLASS_DISPLAY_VGA)
+- continue;
+-
+- for (i=0; i < DEVICE_COUNT_RESOURCE; i++) {
+- resource_size_t start, end;
+-
+- if (!(pci_resource_flags(dev, i) & IORESOURCE_MEM))
+- continue;
+-
+- start = pci_resource_start(dev, i);
+- end = pci_resource_end(dev, i);
+-
+- if (!start || !end)
+- continue;
+-
+- if (screen_info.lfb_base >= start &&
+- (screen_info.lfb_base + screen_info.lfb_size) < end)
+- default_vga = dev;
+- }
+- }
+-
+ return 0;
+ }
+
+diff --git a/fs/cifs/smb1ops.c b/fs/cifs/smb1ops.c
+index 84ca0a4caaeb..e9ad8d37bb00 100644
+--- a/fs/cifs/smb1ops.c
++++ b/fs/cifs/smb1ops.c
+@@ -586,7 +586,7 @@ cifs_query_path_info(const unsigned int xid, struct cifs_tcon *tcon,
+ tmprc = CIFS_open(xid, &oparms, &oplock, NULL);
+ if (tmprc == -EOPNOTSUPP)
+ *symlink = true;
+- else
++ else if (tmprc == 0)
+ CIFSSMBClose(xid, tcon, fid.netfid);
+ }
+
+diff --git a/fs/cifs/smb2maperror.c b/fs/cifs/smb2maperror.c
+index a689514e260f..a491814cb2c0 100644
+--- a/fs/cifs/smb2maperror.c
++++ b/fs/cifs/smb2maperror.c
+@@ -256,6 +256,8 @@ static const struct status_to_posix_error smb2_error_map_table[] = {
+ {STATUS_DLL_MIGHT_BE_INCOMPATIBLE, -EIO,
+ "STATUS_DLL_MIGHT_BE_INCOMPATIBLE"},
+ {STATUS_STOPPED_ON_SYMLINK, -EOPNOTSUPP, "STATUS_STOPPED_ON_SYMLINK"},
++ {STATUS_IO_REPARSE_TAG_NOT_HANDLED, -EOPNOTSUPP,
++ "STATUS_REPARSE_NOT_HANDLED"},
+ {STATUS_DEVICE_REQUIRES_CLEANING, -EIO,
+ "STATUS_DEVICE_REQUIRES_CLEANING"},
+ {STATUS_DEVICE_DOOR_OPEN, -EIO, "STATUS_DEVICE_DOOR_OPEN"},
+diff --git a/fs/udf/inode.c b/fs/udf/inode.c
+index 236cd48184c2..a932f7740b51 100644
+--- a/fs/udf/inode.c
++++ b/fs/udf/inode.c
+@@ -1271,13 +1271,22 @@ update_time:
+ return 0;
+ }
+
++/*
++ * Maximum length of linked list formed by ICB hierarchy. The chosen number is
++ * arbitrary - just that we hopefully don't limit any real use of rewritten
++ * inode on write-once media but avoid looping for too long on corrupted media.
++ */
++#define UDF_MAX_ICB_NESTING 1024
++
+ static void __udf_read_inode(struct inode *inode)
+ {
+ struct buffer_head *bh = NULL;
+ struct fileEntry *fe;
+ uint16_t ident;
+ struct udf_inode_info *iinfo = UDF_I(inode);
++ unsigned int indirections = 0;
+
++reread:
+ /*
+ * Set defaults, but the inode is still incomplete!
+ * Note: get_new_inode() sets the following on a new inode:
+@@ -1314,28 +1323,26 @@ static void __udf_read_inode(struct inode *inode)
+ ibh = udf_read_ptagged(inode->i_sb, &iinfo->i_location, 1,
+ &ident);
+ if (ident == TAG_IDENT_IE && ibh) {
+- struct buffer_head *nbh = NULL;
+ struct kernel_lb_addr loc;
+ struct indirectEntry *ie;
+
+ ie = (struct indirectEntry *)ibh->b_data;
+ loc = lelb_to_cpu(ie->indirectICB.extLocation);
+
+- if (ie->indirectICB.extLength &&
+- (nbh = udf_read_ptagged(inode->i_sb, &loc, 0,
+- &ident))) {
+- if (ident == TAG_IDENT_FE ||
+- ident == TAG_IDENT_EFE) {
+- memcpy(&iinfo->i_location,
+- &loc,
+- sizeof(struct kernel_lb_addr));
+- brelse(bh);
+- brelse(ibh);
+- brelse(nbh);
+- __udf_read_inode(inode);
++ if (ie->indirectICB.extLength) {
++ brelse(bh);
++ brelse(ibh);
++ memcpy(&iinfo->i_location, &loc,
++ sizeof(struct kernel_lb_addr));
++ if (++indirections > UDF_MAX_ICB_NESTING) {
++ udf_err(inode->i_sb,
++ "too many ICBs in ICB hierarchy"
++ " (max %d supported)\n",
++ UDF_MAX_ICB_NESTING);
++ make_bad_inode(inode);
+ return;
+ }
+- brelse(nbh);
++ goto reread;
+ }
+ }
+ brelse(ibh);
+diff --git a/include/linux/jiffies.h b/include/linux/jiffies.h
+index 1f44466c1e9d..c367cbdf73ab 100644
+--- a/include/linux/jiffies.h
++++ b/include/linux/jiffies.h
+@@ -258,23 +258,11 @@ extern unsigned long preset_lpj;
+ #define SEC_JIFFIE_SC (32 - SHIFT_HZ)
+ #endif
+ #define NSEC_JIFFIE_SC (SEC_JIFFIE_SC + 29)
+-#define USEC_JIFFIE_SC (SEC_JIFFIE_SC + 19)
+ #define SEC_CONVERSION ((unsigned long)((((u64)NSEC_PER_SEC << SEC_JIFFIE_SC) +\
+ TICK_NSEC -1) / (u64)TICK_NSEC))
+
+ #define NSEC_CONVERSION ((unsigned long)((((u64)1 << NSEC_JIFFIE_SC) +\
+ TICK_NSEC -1) / (u64)TICK_NSEC))
+-#define USEC_CONVERSION \
+- ((unsigned long)((((u64)NSEC_PER_USEC << USEC_JIFFIE_SC) +\
+- TICK_NSEC -1) / (u64)TICK_NSEC))
+-/*
+- * USEC_ROUND is used in the timeval to jiffie conversion. See there
+- * for more details. It is the scaled resolution rounding value. Note
+- * that it is a 64-bit value. Since, when it is applied, we are already
+- * in jiffies (albit scaled), it is nothing but the bits we will shift
+- * off.
+- */
+-#define USEC_ROUND (u64)(((u64)1 << USEC_JIFFIE_SC) - 1)
+ /*
+ * The maximum jiffie value is (MAX_INT >> 1). Here we translate that
+ * into seconds. The 64-bit case will overflow if we are not careful,
+diff --git a/include/media/videobuf2-core.h b/include/media/videobuf2-core.h
+index 8fab6fa0dbfb..d6f010c17f4a 100644
+--- a/include/media/videobuf2-core.h
++++ b/include/media/videobuf2-core.h
+@@ -375,6 +375,9 @@ struct v4l2_fh;
+ * @streaming: current streaming state
+ * @start_streaming_called: start_streaming() was called successfully and we
+ * started streaming.
++ * @waiting_for_buffers: used in poll() to check if vb2 is still waiting for
++ * buffers. Only set for capture queues if qbuf has not yet been
++ * called since poll() needs to return POLLERR in that situation.
+ * @fileio: file io emulator internal data, used only if emulator is active
+ * @threadio: thread io internal data, used only if thread is active
+ */
+@@ -411,6 +414,7 @@ struct vb2_queue {
+
+ unsigned int streaming:1;
+ unsigned int start_streaming_called:1;
++ unsigned int waiting_for_buffers:1;
+
+ struct vb2_fileio_data *fileio;
+ struct vb2_threadio_data *threadio;
+diff --git a/init/Kconfig b/init/Kconfig
+index 9d76b99af1b9..35685a46e4da 100644
+--- a/init/Kconfig
++++ b/init/Kconfig
+@@ -1432,6 +1432,7 @@ config FUTEX
+
+ config HAVE_FUTEX_CMPXCHG
+ bool
++ depends on FUTEX
+ help
+ Architectures should select this if futex_atomic_cmpxchg_inatomic()
+ is implemented and always working. This removes a couple of runtime
+diff --git a/kernel/events/core.c b/kernel/events/core.c
+index f626c9f1f3c0..2065959042ea 100644
+--- a/kernel/events/core.c
++++ b/kernel/events/core.c
+@@ -7921,8 +7921,10 @@ int perf_event_init_task(struct task_struct *child)
+
+ for_each_task_context_nr(ctxn) {
+ ret = perf_event_init_context(child, ctxn);
+- if (ret)
++ if (ret) {
++ perf_event_free_task(child);
+ return ret;
++ }
+ }
+
+ return 0;
+diff --git a/kernel/fork.c b/kernel/fork.c
+index 6a13c46cd87d..b41958b0cb67 100644
+--- a/kernel/fork.c
++++ b/kernel/fork.c
+@@ -1326,7 +1326,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
+ goto bad_fork_cleanup_policy;
+ retval = audit_alloc(p);
+ if (retval)
+- goto bad_fork_cleanup_policy;
++ goto bad_fork_cleanup_perf;
+ /* copy all the process information */
+ retval = copy_semundo(clone_flags, p);
+ if (retval)
+@@ -1525,8 +1525,9 @@ bad_fork_cleanup_semundo:
+ exit_sem(p);
+ bad_fork_cleanup_audit:
+ audit_free(p);
+-bad_fork_cleanup_policy:
++bad_fork_cleanup_perf:
+ perf_event_free_task(p);
++bad_fork_cleanup_policy:
+ #ifdef CONFIG_NUMA
+ mpol_put(p->mempolicy);
+ bad_fork_cleanup_threadgroup_lock:
+diff --git a/kernel/time.c b/kernel/time.c
+index 7c7964c33ae7..3c49ab45f822 100644
+--- a/kernel/time.c
++++ b/kernel/time.c
+@@ -496,17 +496,20 @@ EXPORT_SYMBOL(usecs_to_jiffies);
+ * that a remainder subtract here would not do the right thing as the
+ * resolution values don't fall on second boundries. I.e. the line:
+ * nsec -= nsec % TICK_NSEC; is NOT a correct resolution rounding.
++ * Note that due to the small error in the multiplier here, this
++ * rounding is incorrect for sufficiently large values of tv_nsec, but
++ * well formed timespecs should have tv_nsec < NSEC_PER_SEC, so we're
++ * OK.
+ *
+ * Rather, we just shift the bits off the right.
+ *
+ * The >> (NSEC_JIFFIE_SC - SEC_JIFFIE_SC) converts the scaled nsec
+ * value to a scaled second value.
+ */
+-unsigned long
+-timespec_to_jiffies(const struct timespec *value)
++static unsigned long
++__timespec_to_jiffies(unsigned long sec, long nsec)
+ {
+- unsigned long sec = value->tv_sec;
+- long nsec = value->tv_nsec + TICK_NSEC - 1;
++ nsec = nsec + TICK_NSEC - 1;
+
+ if (sec >= MAX_SEC_IN_JIFFIES){
+ sec = MAX_SEC_IN_JIFFIES;
+@@ -517,6 +520,13 @@ timespec_to_jiffies(const struct timespec *value)
+ (NSEC_JIFFIE_SC - SEC_JIFFIE_SC))) >> SEC_JIFFIE_SC;
+
+ }
++
++unsigned long
++timespec_to_jiffies(const struct timespec *value)
++{
++ return __timespec_to_jiffies(value->tv_sec, value->tv_nsec);
++}
++
+ EXPORT_SYMBOL(timespec_to_jiffies);
+
+ void
+@@ -533,31 +543,27 @@ jiffies_to_timespec(const unsigned long jiffies, struct timespec *value)
+ }
+ EXPORT_SYMBOL(jiffies_to_timespec);
+
+-/* Same for "timeval"
++/*
++ * We could use a similar algorithm to timespec_to_jiffies (with a
++ * different multiplier for usec instead of nsec). But this has a
++ * problem with rounding: we can't exactly add TICK_NSEC - 1 to the
++ * usec value, since it's not necessarily integral.
+ *
+- * Well, almost. The problem here is that the real system resolution is
+- * in nanoseconds and the value being converted is in micro seconds.
+- * Also for some machines (those that use HZ = 1024, in-particular),
+- * there is a LARGE error in the tick size in microseconds.
+-
+- * The solution we use is to do the rounding AFTER we convert the
+- * microsecond part. Thus the USEC_ROUND, the bits to be shifted off.
+- * Instruction wise, this should cost only an additional add with carry
+- * instruction above the way it was done above.
++ * We could instead round in the intermediate scaled representation
++ * (i.e. in units of 1/2^(large scale) jiffies) but that's also
++ * perilous: the scaling introduces a small positive error, which
++ * combined with a division-rounding-upward (i.e. adding 2^(scale) - 1
++ * units to the intermediate before shifting) leads to accidental
++ * overflow and overestimates.
++ *
++ * At the cost of one additional multiplication by a constant, just
++ * use the timespec implementation.
+ */
+ unsigned long
+ timeval_to_jiffies(const struct timeval *value)
+ {
+- unsigned long sec = value->tv_sec;
+- long usec = value->tv_usec;
+-
+- if (sec >= MAX_SEC_IN_JIFFIES){
+- sec = MAX_SEC_IN_JIFFIES;
+- usec = 0;
+- }
+- return (((u64)sec * SEC_CONVERSION) +
+- (((u64)usec * USEC_CONVERSION + USEC_ROUND) >>
+- (USEC_JIFFIE_SC - SEC_JIFFIE_SC))) >> SEC_JIFFIE_SC;
++ return __timespec_to_jiffies(value->tv_sec,
++ value->tv_usec * NSEC_PER_USEC);
+ }
+ EXPORT_SYMBOL(timeval_to_jiffies);
+
+diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
+index 2ff0580d3dcd..51862982e1e9 100644
+--- a/kernel/trace/ring_buffer.c
++++ b/kernel/trace/ring_buffer.c
+@@ -3375,7 +3375,7 @@ static void rb_iter_reset(struct ring_buffer_iter *iter)
+ iter->head = cpu_buffer->reader_page->read;
+
+ iter->cache_reader_page = iter->head_page;
+- iter->cache_read = iter->head;
++ iter->cache_read = cpu_buffer->read;
+
+ if (iter->head)
+ iter->read_stamp = cpu_buffer->read_stamp;
+diff --git a/mm/huge_memory.c b/mm/huge_memory.c
+index 33514d88fef9..c9ef81e08e4a 100644
+--- a/mm/huge_memory.c
++++ b/mm/huge_memory.c
+@@ -1775,21 +1775,24 @@ static int __split_huge_page_map(struct page *page,
+ if (pmd) {
+ pgtable = pgtable_trans_huge_withdraw(mm, pmd);
+ pmd_populate(mm, &_pmd, pgtable);
++ if (pmd_write(*pmd))
++ BUG_ON(page_mapcount(page) != 1);
+
+ haddr = address;
+ for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
+ pte_t *pte, entry;
+ BUG_ON(PageCompound(page+i));
++ /*
++ * Note that pmd_numa is not transferred deliberately
++ * to avoid any possibility that pte_numa leaks to
++ * a PROT_NONE VMA by accident.
++ */
+ entry = mk_pte(page + i, vma->vm_page_prot);
+ entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+ if (!pmd_write(*pmd))
+ entry = pte_wrprotect(entry);
+- else
+- BUG_ON(page_mapcount(page) != 1);
+ if (!pmd_young(*pmd))
+ entry = pte_mkold(entry);
+- if (pmd_numa(*pmd))
+- entry = pte_mknuma(entry);
+ pte = pte_offset_map(&_pmd, haddr);
+ BUG_ON(!pte_none(*pte));
+ set_pte_at(mm, haddr, pte, entry);
+diff --git a/mm/memcontrol.c b/mm/memcontrol.c
+index 1f14a430c656..15fe66d83987 100644
+--- a/mm/memcontrol.c
++++ b/mm/memcontrol.c
+@@ -292,6 +292,9 @@ struct mem_cgroup {
+ /* vmpressure notifications */
+ struct vmpressure vmpressure;
+
++ /* css_online() has been completed */
++ int initialized;
++
+ /*
+ * the counter to account for mem+swap usage.
+ */
+@@ -1106,10 +1109,21 @@ skip_node:
+ * skipping css reference should be safe.
+ */
+ if (next_css) {
+- if ((next_css == &root->css) ||
+- ((next_css->flags & CSS_ONLINE) &&
+- css_tryget_online(next_css)))
+- return mem_cgroup_from_css(next_css);
++ struct mem_cgroup *memcg = mem_cgroup_from_css(next_css);
++
++ if (next_css == &root->css)
++ return memcg;
++
++ if (css_tryget_online(next_css)) {
++ /*
++ * Make sure the memcg is initialized:
++ * mem_cgroup_css_online() orders the the
++ * initialization against setting the flag.
++ */
++ if (smp_load_acquire(&memcg->initialized))
++ return memcg;
++ css_put(next_css);
++ }
+
+ prev_css = next_css;
+ goto skip_node;
+@@ -6277,6 +6291,7 @@ mem_cgroup_css_online(struct cgroup_subsys_state *css)
+ {
+ struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+ struct mem_cgroup *parent = mem_cgroup_from_css(css->parent);
++ int ret;
+
+ if (css->id > MEM_CGROUP_ID_MAX)
+ return -ENOSPC;
+@@ -6313,7 +6328,18 @@ mem_cgroup_css_online(struct cgroup_subsys_state *css)
+ }
+ mutex_unlock(&memcg_create_mutex);
+
+- return memcg_init_kmem(memcg, &memory_cgrp_subsys);
++ ret = memcg_init_kmem(memcg, &memory_cgrp_subsys);
++ if (ret)
++ return ret;
++
++ /*
++ * Make sure the memcg is initialized: mem_cgroup_iter()
++ * orders reading memcg->initialized against its callers
++ * reading the memcg members.
++ */
++ smp_store_release(&memcg->initialized, 1);
++
++ return 0;
+ }
+
+ /*
+diff --git a/mm/migrate.c b/mm/migrate.c
+index be6dbf995c0c..0bba97914af0 100644
+--- a/mm/migrate.c
++++ b/mm/migrate.c
+@@ -146,8 +146,11 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
+ pte = pte_mkold(mk_pte(new, vma->vm_page_prot));
+ if (pte_swp_soft_dirty(*ptep))
+ pte = pte_mksoft_dirty(pte);
++
++ /* Recheck VMA as permissions can change since migration started */
+ if (is_write_migration_entry(entry))
+- pte = pte_mkwrite(pte);
++ pte = maybe_mkwrite(pte, vma);
++
+ #ifdef CONFIG_HUGETLB_PAGE
+ if (PageHuge(new)) {
+ pte = pte_mkhuge(pte);
+diff --git a/sound/soc/codecs/ssm2602.c b/sound/soc/codecs/ssm2602.c
+index 97b0454eb346..eb1bb7414b8b 100644
+--- a/sound/soc/codecs/ssm2602.c
++++ b/sound/soc/codecs/ssm2602.c
+@@ -647,7 +647,7 @@ int ssm2602_probe(struct device *dev, enum ssm2602_type type,
+ return -ENOMEM;
+
+ dev_set_drvdata(dev, ssm2602);
+- ssm2602->type = SSM2602;
++ ssm2602->type = type;
+ ssm2602->regmap = regmap;
+
+ return snd_soc_register_codec(dev, &soc_codec_dev_ssm2602,
+diff --git a/sound/soc/soc-core.c b/sound/soc/soc-core.c
+index b87d7d882e6d..49acc989e452 100644
+--- a/sound/soc/soc-core.c
++++ b/sound/soc/soc-core.c
+@@ -3181,7 +3181,7 @@ int snd_soc_bytes_put(struct snd_kcontrol *kcontrol,
+ unsigned int val, mask;
+ void *data;
+
+- if (!component->regmap)
++ if (!component->regmap || !params->num_regs)
+ return -EINVAL;
+
+ len = params->num_regs * component->val_bytes;
* [gentoo-commits] proj/linux-patches:3.16 commit in: /
@ 2014-10-15 12:42 Mike Pagano
0 siblings, 0 replies; 26+ messages in thread
From: Mike Pagano @ 2014-10-15 12:42 UTC (permalink / raw
To: gentoo-commits
commit: 10da52d34f75c039a20e3e60cb9dc3e05bc1cbb7
Author: Mike Pagano <mpagano <AT> gentoo <DOT> org>
AuthorDate: Wed Oct 15 12:42:37 2014 +0000
Commit: Mike Pagano <mpagano <AT> gentoo <DOT> org>
CommitDate: Wed Oct 15 12:42:37 2014 +0000
URL: http://sources.gentoo.org/gitweb/?p=proj/linux-patches.git;a=commit;h=10da52d3
Linux patch 3.16.6
---
0000_README | 4 +
1005_linux-3.16.6.patch | 2652 +++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 2656 insertions(+)
diff --git a/0000_README b/0000_README
index ede03f9..a7526a7 100644
--- a/0000_README
+++ b/0000_README
@@ -62,6 +62,10 @@ Patch: 1004_linux-3.16.5.patch
From: http://www.kernel.org
Desc: Linux 3.16.5
+Patch: 1005_linux-3.16.6.patch
+From: http://www.kernel.org
+Desc: Linux 3.16.6
+
Patch: 1500_XATTR_USER_PREFIX.patch
From: https://bugs.gentoo.org/show_bug.cgi?id=470644
Desc: Support for namespace user.pax.* on tmpfs.
diff --git a/1005_linux-3.16.6.patch b/1005_linux-3.16.6.patch
new file mode 100644
index 0000000..422fde0
--- /dev/null
+++ b/1005_linux-3.16.6.patch
@@ -0,0 +1,2652 @@
+diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
+index f896f68a3ba3..c4da64b525b2 100644
+--- a/Documentation/kernel-parameters.txt
++++ b/Documentation/kernel-parameters.txt
+@@ -3459,6 +3459,8 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
+ READ_DISC_INFO command);
+ e = NO_READ_CAPACITY_16 (don't use
+ READ_CAPACITY_16 command);
++ f = NO_REPORT_OPCODES (don't use report opcodes
++ command, uas only);
+ h = CAPACITY_HEURISTICS (decrease the
+ reported device capacity by one
+ sector if the number is odd);
+@@ -3478,6 +3480,8 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
+ bogus residue values);
+ s = SINGLE_LUN (the device has only one
+ Logical Unit);
++ t = NO_ATA_1X (don't allow ATA(12) and ATA(16)
++ commands, uas only);
+ u = IGNORE_UAS (don't bind to the uas driver);
+ w = NO_WP_DETECT (don't test whether the
+ medium is write-protected).
+diff --git a/Makefile b/Makefile
+index 41efc3d9f2e0..5c4bc3fc18c0 100644
+--- a/Makefile
++++ b/Makefile
+@@ -1,6 +1,6 @@
+ VERSION = 3
+ PATCHLEVEL = 16
+-SUBLEVEL = 5
++SUBLEVEL = 6
+ EXTRAVERSION =
+ NAME = Museum of Fishiegoodies
+
+diff --git a/drivers/base/node.c b/drivers/base/node.c
+index 8f7ed9933a7c..40e4585f110a 100644
+--- a/drivers/base/node.c
++++ b/drivers/base/node.c
+@@ -603,7 +603,6 @@ void unregister_one_node(int nid)
+ return;
+
+ unregister_node(node_devices[nid]);
+- kfree(node_devices[nid]);
+ node_devices[nid] = NULL;
+ }
+
+diff --git a/drivers/crypto/caam/caamhash.c b/drivers/crypto/caam/caamhash.c
+index 0d9284ef96a8..42e41f3b5cf1 100644
+--- a/drivers/crypto/caam/caamhash.c
++++ b/drivers/crypto/caam/caamhash.c
+@@ -1338,9 +1338,9 @@ static int ahash_update_first(struct ahash_request *req)
+ struct device *jrdev = ctx->jrdev;
+ gfp_t flags = (req->base.flags & (CRYPTO_TFM_REQ_MAY_BACKLOG |
+ CRYPTO_TFM_REQ_MAY_SLEEP)) ? GFP_KERNEL : GFP_ATOMIC;
+- u8 *next_buf = state->buf_0 + state->current_buf *
+- CAAM_MAX_HASH_BLOCK_SIZE;
+- int *next_buflen = &state->buflen_0 + state->current_buf;
++ u8 *next_buf = state->current_buf ? state->buf_1 : state->buf_0;
++ int *next_buflen = state->current_buf ?
++ &state->buflen_1 : &state->buflen_0;
+ int to_hash;
+ u32 *sh_desc = ctx->sh_desc_update_first, *desc;
+ dma_addr_t ptr = ctx->sh_desc_update_first_dma;
+diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
+index 701f86cd5993..5f29c9a9a316 100644
+--- a/drivers/net/bonding/bond_main.c
++++ b/drivers/net/bonding/bond_main.c
+@@ -3667,8 +3667,14 @@ static int bond_xmit_roundrobin(struct sk_buff *skb, struct net_device *bond_dev
+ else
+ bond_xmit_slave_id(bond, skb, 0);
+ } else {
+- slave_id = bond_rr_gen_slave_id(bond);
+- bond_xmit_slave_id(bond, skb, slave_id % bond->slave_cnt);
++ int slave_cnt = ACCESS_ONCE(bond->slave_cnt);
++
++ if (likely(slave_cnt)) {
++ slave_id = bond_rr_gen_slave_id(bond);
++ bond_xmit_slave_id(bond, skb, slave_id % slave_cnt);
++ } else {
++ dev_kfree_skb_any(skb);
++ }
+ }
+
+ return NETDEV_TX_OK;
+@@ -3699,8 +3705,13 @@ static int bond_xmit_activebackup(struct sk_buff *skb, struct net_device *bond_d
+ static int bond_xmit_xor(struct sk_buff *skb, struct net_device *bond_dev)
+ {
+ struct bonding *bond = netdev_priv(bond_dev);
++ int slave_cnt = ACCESS_ONCE(bond->slave_cnt);
+
+- bond_xmit_slave_id(bond, skb, bond_xmit_hash(bond, skb) % bond->slave_cnt);
++ if (likely(slave_cnt))
++ bond_xmit_slave_id(bond, skb,
++ bond_xmit_hash(bond, skb) % slave_cnt);
++ else
++ dev_kfree_skb_any(skb);
+
+ return NETDEV_TX_OK;
+ }
+diff --git a/drivers/net/ethernet/broadcom/bcmsysport.c b/drivers/net/ethernet/broadcom/bcmsysport.c
+index 5776e503e4c5..6e4a6bddf56e 100644
+--- a/drivers/net/ethernet/broadcom/bcmsysport.c
++++ b/drivers/net/ethernet/broadcom/bcmsysport.c
+@@ -757,7 +757,8 @@ static irqreturn_t bcm_sysport_tx_isr(int irq, void *dev_id)
+ return IRQ_HANDLED;
+ }
+
+-static int bcm_sysport_insert_tsb(struct sk_buff *skb, struct net_device *dev)
++static struct sk_buff *bcm_sysport_insert_tsb(struct sk_buff *skb,
++ struct net_device *dev)
+ {
+ struct sk_buff *nskb;
+ struct bcm_tsb *tsb;
+@@ -773,7 +774,7 @@ static int bcm_sysport_insert_tsb(struct sk_buff *skb, struct net_device *dev)
+ if (!nskb) {
+ dev->stats.tx_errors++;
+ dev->stats.tx_dropped++;
+- return -ENOMEM;
++ return NULL;
+ }
+ skb = nskb;
+ }
+@@ -792,7 +793,7 @@ static int bcm_sysport_insert_tsb(struct sk_buff *skb, struct net_device *dev)
+ ip_proto = ipv6_hdr(skb)->nexthdr;
+ break;
+ default:
+- return 0;
++ return skb;
+ }
+
+ /* Get the checksum offset and the L4 (transport) offset */
+@@ -810,7 +811,7 @@ static int bcm_sysport_insert_tsb(struct sk_buff *skb, struct net_device *dev)
+ tsb->l4_ptr_dest_map = csum_info;
+ }
+
+- return 0;
++ return skb;
+ }
+
+ static netdev_tx_t bcm_sysport_xmit(struct sk_buff *skb,
+@@ -844,8 +845,8 @@ static netdev_tx_t bcm_sysport_xmit(struct sk_buff *skb,
+
+ /* Insert TSB and checksum infos */
+ if (priv->tsb_en) {
+- ret = bcm_sysport_insert_tsb(skb, dev);
+- if (ret) {
++ skb = bcm_sysport_insert_tsb(skb, dev);
++ if (!skb) {
+ ret = NETDEV_TX_OK;
+ goto out;
+ }
+diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
+index 6a8b1453a1b9..73cfb21899a7 100644
+--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
++++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
+@@ -10044,6 +10044,8 @@ static void bnx2x_prev_unload_close_mac(struct bnx2x *bp,
+ }
+
+ #define BNX2X_PREV_UNDI_PROD_ADDR(p) (BAR_TSTRORM_INTMEM + 0x1508 + ((p) << 4))
++#define BNX2X_PREV_UNDI_PROD_ADDR_H(f) (BAR_TSTRORM_INTMEM + \
++ 0x1848 + ((f) << 4))
+ #define BNX2X_PREV_UNDI_RCQ(val) ((val) & 0xffff)
+ #define BNX2X_PREV_UNDI_BD(val) ((val) >> 16 & 0xffff)
+ #define BNX2X_PREV_UNDI_PROD(rcq, bd) ((bd) << 16 | (rcq))
+@@ -10051,8 +10053,6 @@ static void bnx2x_prev_unload_close_mac(struct bnx2x *bp,
+ #define BCM_5710_UNDI_FW_MF_MAJOR (0x07)
+ #define BCM_5710_UNDI_FW_MF_MINOR (0x08)
+ #define BCM_5710_UNDI_FW_MF_VERS (0x05)
+-#define BNX2X_PREV_UNDI_MF_PORT(p) (BAR_TSTRORM_INTMEM + 0x150c + ((p) << 4))
+-#define BNX2X_PREV_UNDI_MF_FUNC(f) (BAR_TSTRORM_INTMEM + 0x184c + ((f) << 4))
+
+ static bool bnx2x_prev_is_after_undi(struct bnx2x *bp)
+ {
+@@ -10071,72 +10071,25 @@ static bool bnx2x_prev_is_after_undi(struct bnx2x *bp)
+ return false;
+ }
+
+-static bool bnx2x_prev_unload_undi_fw_supports_mf(struct bnx2x *bp)
+-{
+- u8 major, minor, version;
+- u32 fw;
+-
+- /* Must check that FW is loaded */
+- if (!(REG_RD(bp, MISC_REG_RESET_REG_1) &
+- MISC_REGISTERS_RESET_REG_1_RST_XSEM)) {
+- BNX2X_DEV_INFO("XSEM is reset - UNDI MF FW is not loaded\n");
+- return false;
+- }
+-
+- /* Read Currently loaded FW version */
+- fw = REG_RD(bp, XSEM_REG_PRAM);
+- major = fw & 0xff;
+- minor = (fw >> 0x8) & 0xff;
+- version = (fw >> 0x10) & 0xff;
+- BNX2X_DEV_INFO("Loaded FW: 0x%08x: Major 0x%02x Minor 0x%02x Version 0x%02x\n",
+- fw, major, minor, version);
+-
+- if (major > BCM_5710_UNDI_FW_MF_MAJOR)
+- return true;
+-
+- if ((major == BCM_5710_UNDI_FW_MF_MAJOR) &&
+- (minor > BCM_5710_UNDI_FW_MF_MINOR))
+- return true;
+-
+- if ((major == BCM_5710_UNDI_FW_MF_MAJOR) &&
+- (minor == BCM_5710_UNDI_FW_MF_MINOR) &&
+- (version >= BCM_5710_UNDI_FW_MF_VERS))
+- return true;
+-
+- return false;
+-}
+-
+-static void bnx2x_prev_unload_undi_mf(struct bnx2x *bp)
+-{
+- int i;
+-
+- /* Due to legacy (FW) code, the first function on each engine has a
+- * different offset macro from the rest of the functions.
+- * Setting this for all 8 functions is harmless regardless of whether
+- * this is actually a multi-function device.
+- */
+- for (i = 0; i < 2; i++)
+- REG_WR(bp, BNX2X_PREV_UNDI_MF_PORT(i), 1);
+-
+- for (i = 2; i < 8; i++)
+- REG_WR(bp, BNX2X_PREV_UNDI_MF_FUNC(i - 2), 1);
+-
+- BNX2X_DEV_INFO("UNDI FW (MF) set to discard\n");
+-}
+-
+-static void bnx2x_prev_unload_undi_inc(struct bnx2x *bp, u8 port, u8 inc)
++static void bnx2x_prev_unload_undi_inc(struct bnx2x *bp, u8 inc)
+ {
+ u16 rcq, bd;
+- u32 tmp_reg = REG_RD(bp, BNX2X_PREV_UNDI_PROD_ADDR(port));
++ u32 addr, tmp_reg;
+
++ if (BP_FUNC(bp) < 2)
++ addr = BNX2X_PREV_UNDI_PROD_ADDR(BP_PORT(bp));
++ else
++ addr = BNX2X_PREV_UNDI_PROD_ADDR_H(BP_FUNC(bp) - 2);
++
++ tmp_reg = REG_RD(bp, addr);
+ rcq = BNX2X_PREV_UNDI_RCQ(tmp_reg) + inc;
+ bd = BNX2X_PREV_UNDI_BD(tmp_reg) + inc;
+
+ tmp_reg = BNX2X_PREV_UNDI_PROD(rcq, bd);
+- REG_WR(bp, BNX2X_PREV_UNDI_PROD_ADDR(port), tmp_reg);
++ REG_WR(bp, addr, tmp_reg);
+
+- BNX2X_DEV_INFO("UNDI producer [%d] rings bd -> 0x%04x, rcq -> 0x%04x\n",
+- port, bd, rcq);
++ BNX2X_DEV_INFO("UNDI producer [%d/%d][%08x] rings bd -> 0x%04x, rcq -> 0x%04x\n",
++ BP_PORT(bp), BP_FUNC(bp), addr, bd, rcq);
+ }
+
+ static int bnx2x_prev_mcp_done(struct bnx2x *bp)
+@@ -10375,7 +10328,6 @@ static int bnx2x_prev_unload_common(struct bnx2x *bp)
+ /* Reset should be performed after BRB is emptied */
+ if (reset_reg & MISC_REGISTERS_RESET_REG_1_RST_BRB1) {
+ u32 timer_count = 1000;
+- bool need_write = true;
+
+ /* Close the MAC Rx to prevent BRB from filling up */
+ bnx2x_prev_unload_close_mac(bp, &mac_vals);
+@@ -10412,20 +10364,10 @@ static int bnx2x_prev_unload_common(struct bnx2x *bp)
+ else
+ timer_count--;
+
+- /* New UNDI FW supports MF and contains better
+- * cleaning methods - might be redundant but harmless.
+- */
+- if (bnx2x_prev_unload_undi_fw_supports_mf(bp)) {
+- if (need_write) {
+- bnx2x_prev_unload_undi_mf(bp);
+- need_write = false;
+- }
+- } else if (prev_undi) {
+- /* If UNDI resides in memory,
+- * manually increment it
+- */
+- bnx2x_prev_unload_undi_inc(bp, BP_PORT(bp), 1);
+- }
++ /* If UNDI resides in memory, manually increment it */
++ if (prev_undi)
++ bnx2x_prev_unload_undi_inc(bp, 1);
++
+ udelay(10);
+ }
+
+diff --git a/drivers/net/ethernet/broadcom/tg3.c b/drivers/net/ethernet/broadcom/tg3.c
+index a3dd5dc64f4c..8345c6523799 100644
+--- a/drivers/net/ethernet/broadcom/tg3.c
++++ b/drivers/net/ethernet/broadcom/tg3.c
+@@ -6918,7 +6918,8 @@ static int tg3_rx(struct tg3_napi *tnapi, int budget)
+ skb->protocol = eth_type_trans(skb, tp->dev);
+
+ if (len > (tp->dev->mtu + ETH_HLEN) &&
+- skb->protocol != htons(ETH_P_8021Q)) {
++ skb->protocol != htons(ETH_P_8021Q) &&
++ skb->protocol != htons(ETH_P_8021AD)) {
+ dev_kfree_skb_any(skb);
+ goto drop_it_no_recycle;
+ }
+@@ -7914,8 +7915,6 @@ static netdev_tx_t tg3_start_xmit(struct sk_buff *skb, struct net_device *dev)
+
+ entry = tnapi->tx_prod;
+ base_flags = 0;
+- if (skb->ip_summed == CHECKSUM_PARTIAL)
+- base_flags |= TXD_FLAG_TCPUDP_CSUM;
+
+ mss = skb_shinfo(skb)->gso_size;
+ if (mss) {
+@@ -7929,6 +7928,13 @@ static netdev_tx_t tg3_start_xmit(struct sk_buff *skb, struct net_device *dev)
+
+ hdr_len = skb_transport_offset(skb) + tcp_hdrlen(skb) - ETH_HLEN;
+
++ /* HW/FW can not correctly segment packets that have been
++ * vlan encapsulated.
++ */
++ if (skb->protocol == htons(ETH_P_8021Q) ||
++ skb->protocol == htons(ETH_P_8021AD))
++ return tg3_tso_bug(tp, tnapi, txq, skb);
++
+ if (!skb_is_gso_v6(skb)) {
+ if (unlikely((ETH_HLEN + hdr_len) > 80) &&
+ tg3_flag(tp, TSO_BUG))
+@@ -7979,6 +7985,17 @@ static netdev_tx_t tg3_start_xmit(struct sk_buff *skb, struct net_device *dev)
+ base_flags |= tsflags << 12;
+ }
+ }
++ } else if (skb->ip_summed == CHECKSUM_PARTIAL) {
++ /* HW/FW can not correctly checksum packets that have been
++ * vlan encapsulated.
++ */
++ if (skb->protocol == htons(ETH_P_8021Q) ||
++ skb->protocol == htons(ETH_P_8021AD)) {
++ if (skb_checksum_help(skb))
++ goto drop;
++ } else {
++ base_flags |= TXD_FLAG_TCPUDP_CSUM;
++ }
+ }
+
+ if (tg3_flag(tp, USE_JUMBO_BDFLAG) &&
+diff --git a/drivers/net/ethernet/cadence/macb.c b/drivers/net/ethernet/cadence/macb.c
+index e9daa072ebb4..45b13fda6bed 100644
+--- a/drivers/net/ethernet/cadence/macb.c
++++ b/drivers/net/ethernet/cadence/macb.c
+@@ -30,7 +30,6 @@
+ #include <linux/of_device.h>
+ #include <linux/of_mdio.h>
+ #include <linux/of_net.h>
+-#include <linux/pinctrl/consumer.h>
+
+ #include "macb.h"
+
+@@ -1803,7 +1802,6 @@ static int __init macb_probe(struct platform_device *pdev)
+ struct phy_device *phydev;
+ u32 config;
+ int err = -ENXIO;
+- struct pinctrl *pinctrl;
+ const char *mac;
+
+ regs = platform_get_resource(pdev, IORESOURCE_MEM, 0);
+@@ -1812,15 +1810,6 @@ static int __init macb_probe(struct platform_device *pdev)
+ goto err_out;
+ }
+
+- pinctrl = devm_pinctrl_get_select_default(&pdev->dev);
+- if (IS_ERR(pinctrl)) {
+- err = PTR_ERR(pinctrl);
+- if (err == -EPROBE_DEFER)
+- goto err_out;
+-
+- dev_warn(&pdev->dev, "No pinctrl provided\n");
+- }
+-
+ err = -ENOMEM;
+ dev = alloc_etherdev(sizeof(*bp));
+ if (!dev)
+diff --git a/drivers/net/ethernet/mellanox/mlx4/cmd.c b/drivers/net/ethernet/mellanox/mlx4/cmd.c
+index 5d940a26055c..c9d2988e364d 100644
+--- a/drivers/net/ethernet/mellanox/mlx4/cmd.c
++++ b/drivers/net/ethernet/mellanox/mlx4/cmd.c
+@@ -2380,6 +2380,22 @@ struct mlx4_slaves_pport mlx4_phys_to_slaves_pport_actv(
+ }
+ EXPORT_SYMBOL_GPL(mlx4_phys_to_slaves_pport_actv);
+
++static int mlx4_slaves_closest_port(struct mlx4_dev *dev, int slave, int port)
++{
++ struct mlx4_active_ports actv_ports = mlx4_get_active_ports(dev, slave);
++ int min_port = find_first_bit(actv_ports.ports, dev->caps.num_ports)
++ + 1;
++ int max_port = min_port +
++ bitmap_weight(actv_ports.ports, dev->caps.num_ports);
++
++ if (port < min_port)
++ port = min_port;
++ else if (port >= max_port)
++ port = max_port - 1;
++
++ return port;
++}
++
+ int mlx4_set_vf_mac(struct mlx4_dev *dev, int port, int vf, u64 mac)
+ {
+ struct mlx4_priv *priv = mlx4_priv(dev);
+@@ -2393,6 +2409,7 @@ int mlx4_set_vf_mac(struct mlx4_dev *dev, int port, int vf, u64 mac)
+ if (slave < 0)
+ return -EINVAL;
+
++ port = mlx4_slaves_closest_port(dev, slave, port);
+ s_info = &priv->mfunc.master.vf_admin[slave].vport[port];
+ s_info->mac = mac;
+ mlx4_info(dev, "default mac on vf %d port %d to %llX will take afect only after vf restart\n",
+@@ -2419,6 +2436,7 @@ int mlx4_set_vf_vlan(struct mlx4_dev *dev, int port, int vf, u16 vlan, u8 qos)
+ if (slave < 0)
+ return -EINVAL;
+
++ port = mlx4_slaves_closest_port(dev, slave, port);
+ vf_admin = &priv->mfunc.master.vf_admin[slave].vport[port];
+
+ if ((0 == vlan) && (0 == qos))
+@@ -2446,6 +2464,7 @@ bool mlx4_get_slave_default_vlan(struct mlx4_dev *dev, int port, int slave,
+ struct mlx4_priv *priv;
+
+ priv = mlx4_priv(dev);
++ port = mlx4_slaves_closest_port(dev, slave, port);
+ vp_oper = &priv->mfunc.master.vf_oper[slave].vport[port];
+
+ if (MLX4_VGT != vp_oper->state.default_vlan) {
+@@ -2473,6 +2492,7 @@ int mlx4_set_vf_spoofchk(struct mlx4_dev *dev, int port, int vf, bool setting)
+ if (slave < 0)
+ return -EINVAL;
+
++ port = mlx4_slaves_closest_port(dev, slave, port);
+ s_info = &priv->mfunc.master.vf_admin[slave].vport[port];
+ s_info->spoofchk = setting;
+
+@@ -2526,6 +2546,7 @@ int mlx4_set_vf_link_state(struct mlx4_dev *dev, int port, int vf, int link_stat
+ if (slave < 0)
+ return -EINVAL;
+
++ port = mlx4_slaves_closest_port(dev, slave, port);
+ switch (link_state) {
+ case IFLA_VF_LINK_STATE_AUTO:
+ /* get current link state */
+diff --git a/drivers/net/ethernet/mellanox/mlx4/main.c b/drivers/net/ethernet/mellanox/mlx4/main.c
+index 82ab427290c3..3bdc11e44ec3 100644
+--- a/drivers/net/ethernet/mellanox/mlx4/main.c
++++ b/drivers/net/ethernet/mellanox/mlx4/main.c
+@@ -78,13 +78,13 @@ MODULE_PARM_DESC(msi_x, "attempt to use MSI-X if nonzero");
+ #endif /* CONFIG_PCI_MSI */
+
+ static uint8_t num_vfs[3] = {0, 0, 0};
+-static int num_vfs_argc = 3;
++static int num_vfs_argc;
+ module_param_array(num_vfs, byte , &num_vfs_argc, 0444);
+ MODULE_PARM_DESC(num_vfs, "enable #num_vfs functions if num_vfs > 0\n"
+ "num_vfs=port1,port2,port1+2");
+
+ static uint8_t probe_vf[3] = {0, 0, 0};
+-static int probe_vfs_argc = 3;
++static int probe_vfs_argc;
+ module_param_array(probe_vf, byte, &probe_vfs_argc, 0444);
+ MODULE_PARM_DESC(probe_vf, "number of vfs to probe by pf driver (num_vfs > 0)\n"
+ "probe_vf=port1,port2,port1+2");
+diff --git a/drivers/net/ethernet/myricom/myri10ge/myri10ge.c b/drivers/net/ethernet/myricom/myri10ge/myri10ge.c
+index f3d5d79f1cd1..a173c985aa73 100644
+--- a/drivers/net/ethernet/myricom/myri10ge/myri10ge.c
++++ b/drivers/net/ethernet/myricom/myri10ge/myri10ge.c
+@@ -872,6 +872,10 @@ static int myri10ge_dma_test(struct myri10ge_priv *mgp, int test_type)
+ return -ENOMEM;
+ dmatest_bus = pci_map_page(mgp->pdev, dmatest_page, 0, PAGE_SIZE,
+ DMA_BIDIRECTIONAL);
++ if (unlikely(pci_dma_mapping_error(mgp->pdev, dmatest_bus))) {
++ __free_page(dmatest_page);
++ return -ENOMEM;
++ }
+
+ /* Run a small DMA test.
+ * The magic multipliers to the length tell the firmware
+@@ -1293,6 +1297,7 @@ myri10ge_alloc_rx_pages(struct myri10ge_priv *mgp, struct myri10ge_rx_buf *rx,
+ int bytes, int watchdog)
+ {
+ struct page *page;
++ dma_addr_t bus;
+ int idx;
+ #if MYRI10GE_ALLOC_SIZE > 4096
+ int end_offset;
+@@ -1317,11 +1322,21 @@ myri10ge_alloc_rx_pages(struct myri10ge_priv *mgp, struct myri10ge_rx_buf *rx,
+ rx->watchdog_needed = 1;
+ return;
+ }
++
++ bus = pci_map_page(mgp->pdev, page, 0,
++ MYRI10GE_ALLOC_SIZE,
++ PCI_DMA_FROMDEVICE);
++ if (unlikely(pci_dma_mapping_error(mgp->pdev, bus))) {
++ __free_pages(page, MYRI10GE_ALLOC_ORDER);
++ if (rx->fill_cnt - rx->cnt < 16)
++ rx->watchdog_needed = 1;
++ return;
++ }
++
+ rx->page = page;
+ rx->page_offset = 0;
+- rx->bus = pci_map_page(mgp->pdev, page, 0,
+- MYRI10GE_ALLOC_SIZE,
+- PCI_DMA_FROMDEVICE);
++ rx->bus = bus;
++
+ }
+ rx->info[idx].page = rx->page;
+ rx->info[idx].page_offset = rx->page_offset;
+@@ -2763,6 +2778,35 @@ myri10ge_submit_req(struct myri10ge_tx_buf *tx, struct mcp_kreq_ether_send *src,
+ mb();
+ }
+
++static void myri10ge_unmap_tx_dma(struct myri10ge_priv *mgp,
++ struct myri10ge_tx_buf *tx, int idx)
++{
++ unsigned int len;
++ int last_idx;
++
++ /* Free any DMA resources we've alloced and clear out the skb slot */
++ last_idx = (idx + 1) & tx->mask;
++ idx = tx->req & tx->mask;
++ do {
++ len = dma_unmap_len(&tx->info[idx], len);
++ if (len) {
++ if (tx->info[idx].skb != NULL)
++ pci_unmap_single(mgp->pdev,
++ dma_unmap_addr(&tx->info[idx],
++ bus), len,
++ PCI_DMA_TODEVICE);
++ else
++ pci_unmap_page(mgp->pdev,
++ dma_unmap_addr(&tx->info[idx],
++ bus), len,
++ PCI_DMA_TODEVICE);
++ dma_unmap_len_set(&tx->info[idx], len, 0);
++ tx->info[idx].skb = NULL;
++ }
++ idx = (idx + 1) & tx->mask;
++ } while (idx != last_idx);
++}
++
+ /*
+ * Transmit a packet. We need to split the packet so that a single
+ * segment does not cross myri10ge->tx_boundary, so this makes segment
+@@ -2786,7 +2830,7 @@ static netdev_tx_t myri10ge_xmit(struct sk_buff *skb,
+ u32 low;
+ __be32 high_swapped;
+ unsigned int len;
+- int idx, last_idx, avail, frag_cnt, frag_idx, count, mss, max_segments;
++ int idx, avail, frag_cnt, frag_idx, count, mss, max_segments;
+ u16 pseudo_hdr_offset, cksum_offset, queue;
+ int cum_len, seglen, boundary, rdma_count;
+ u8 flags, odd_flag;
+@@ -2883,9 +2927,12 @@ again:
+
+ /* map the skb for DMA */
+ len = skb_headlen(skb);
++ bus = pci_map_single(mgp->pdev, skb->data, len, PCI_DMA_TODEVICE);
++ if (unlikely(pci_dma_mapping_error(mgp->pdev, bus)))
++ goto drop;
++
+ idx = tx->req & tx->mask;
+ tx->info[idx].skb = skb;
+- bus = pci_map_single(mgp->pdev, skb->data, len, PCI_DMA_TODEVICE);
+ dma_unmap_addr_set(&tx->info[idx], bus, bus);
+ dma_unmap_len_set(&tx->info[idx], len, len);
+
+@@ -2984,12 +3031,16 @@ again:
+ break;
+
+ /* map next fragment for DMA */
+- idx = (count + tx->req) & tx->mask;
+ frag = &skb_shinfo(skb)->frags[frag_idx];
+ frag_idx++;
+ len = skb_frag_size(frag);
+ bus = skb_frag_dma_map(&mgp->pdev->dev, frag, 0, len,
+ DMA_TO_DEVICE);
++ if (unlikely(pci_dma_mapping_error(mgp->pdev, bus))) {
++ myri10ge_unmap_tx_dma(mgp, tx, idx);
++ goto drop;
++ }
++ idx = (count + tx->req) & tx->mask;
+ dma_unmap_addr_set(&tx->info[idx], bus, bus);
+ dma_unmap_len_set(&tx->info[idx], len, len);
+ }
+@@ -3020,31 +3071,8 @@ again:
+ return NETDEV_TX_OK;
+
+ abort_linearize:
+- /* Free any DMA resources we've alloced and clear out the skb
+- * slot so as to not trip up assertions, and to avoid a
+- * double-free if linearizing fails */
++ myri10ge_unmap_tx_dma(mgp, tx, idx);
+
+- last_idx = (idx + 1) & tx->mask;
+- idx = tx->req & tx->mask;
+- tx->info[idx].skb = NULL;
+- do {
+- len = dma_unmap_len(&tx->info[idx], len);
+- if (len) {
+- if (tx->info[idx].skb != NULL)
+- pci_unmap_single(mgp->pdev,
+- dma_unmap_addr(&tx->info[idx],
+- bus), len,
+- PCI_DMA_TODEVICE);
+- else
+- pci_unmap_page(mgp->pdev,
+- dma_unmap_addr(&tx->info[idx],
+- bus), len,
+- PCI_DMA_TODEVICE);
+- dma_unmap_len_set(&tx->info[idx], len, 0);
+- tx->info[idx].skb = NULL;
+- }
+- idx = (idx + 1) & tx->mask;
+- } while (idx != last_idx);
+ if (skb_is_gso(skb)) {
+ netdev_err(mgp->dev, "TSO but wanted to linearize?!?!?\n");
+ goto drop;
+diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
+index d97d5f39a04e..7edf976ecfa0 100644
+--- a/drivers/net/hyperv/netvsc.c
++++ b/drivers/net/hyperv/netvsc.c
+@@ -708,6 +708,7 @@ int netvsc_send(struct hv_device *device,
+ unsigned int section_index = NETVSC_INVALID_INDEX;
+ u32 msg_size = 0;
+ struct sk_buff *skb;
++ u16 q_idx = packet->q_idx;
+
+
+ net_device = get_outbound_net_device(device);
+@@ -772,24 +773,24 @@ int netvsc_send(struct hv_device *device,
+
+ if (ret == 0) {
+ atomic_inc(&net_device->num_outstanding_sends);
+- atomic_inc(&net_device->queue_sends[packet->q_idx]);
++ atomic_inc(&net_device->queue_sends[q_idx]);
+
+ if (hv_ringbuf_avail_percent(&out_channel->outbound) <
+ RING_AVAIL_PERCENT_LOWATER) {
+ netif_tx_stop_queue(netdev_get_tx_queue(
+- ndev, packet->q_idx));
++ ndev, q_idx));
+
+ if (atomic_read(&net_device->
+- queue_sends[packet->q_idx]) < 1)
++ queue_sends[q_idx]) < 1)
+ netif_tx_wake_queue(netdev_get_tx_queue(
+- ndev, packet->q_idx));
++ ndev, q_idx));
+ }
+ } else if (ret == -EAGAIN) {
+ netif_tx_stop_queue(netdev_get_tx_queue(
+- ndev, packet->q_idx));
+- if (atomic_read(&net_device->queue_sends[packet->q_idx]) < 1) {
++ ndev, q_idx));
++ if (atomic_read(&net_device->queue_sends[q_idx]) < 1) {
+ netif_tx_wake_queue(netdev_get_tx_queue(
+- ndev, packet->q_idx));
++ ndev, q_idx));
+ ret = -ENOSPC;
+ }
+ } else {
+diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
+index 4fd71b75e666..f15297201777 100644
+--- a/drivers/net/hyperv/netvsc_drv.c
++++ b/drivers/net/hyperv/netvsc_drv.c
+@@ -387,6 +387,7 @@ static int netvsc_start_xmit(struct sk_buff *skb, struct net_device *net)
+ int hdr_offset;
+ u32 net_trans_info;
+ u32 hash;
++ u32 skb_length = skb->len;
+
+
+ /* We will atmost need two pages to describe the rndis
+@@ -562,7 +563,7 @@ do_send:
+
+ drop:
+ if (ret == 0) {
+- net->stats.tx_bytes += skb->len;
++ net->stats.tx_bytes += skb_length;
+ net->stats.tx_packets++;
+ } else {
+ kfree(packet);
+diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
+index ef8a5c20236a..f3008e3cf118 100644
+--- a/drivers/net/macvlan.c
++++ b/drivers/net/macvlan.c
+@@ -36,6 +36,7 @@
+ #include <linux/netpoll.h>
+
+ #define MACVLAN_HASH_SIZE (1 << BITS_PER_BYTE)
++#define MACVLAN_BC_QUEUE_LEN 1000
+
+ struct macvlan_port {
+ struct net_device *dev;
+@@ -45,10 +46,9 @@ struct macvlan_port {
+ struct sk_buff_head bc_queue;
+ struct work_struct bc_work;
+ bool passthru;
++ int count;
+ };
+
+-#define MACVLAN_PORT_IS_EMPTY(port) list_empty(&port->vlans)
+-
+ struct macvlan_skb_cb {
+ const struct macvlan_dev *src;
+ };
+@@ -249,7 +249,7 @@ static void macvlan_broadcast_enqueue(struct macvlan_port *port,
+ goto err;
+
+ spin_lock(&port->bc_queue.lock);
+- if (skb_queue_len(&port->bc_queue) < skb->dev->tx_queue_len) {
++ if (skb_queue_len(&port->bc_queue) < MACVLAN_BC_QUEUE_LEN) {
+ __skb_queue_tail(&port->bc_queue, nskb);
+ err = 0;
+ }
+@@ -667,7 +667,8 @@ static void macvlan_uninit(struct net_device *dev)
+
+ free_percpu(vlan->pcpu_stats);
+
+- if (MACVLAN_PORT_IS_EMPTY(port))
++ port->count -= 1;
++ if (!port->count)
+ macvlan_port_destroy(port->dev);
+ }
+
+@@ -800,6 +801,7 @@ static netdev_features_t macvlan_fix_features(struct net_device *dev,
+ features,
+ mask);
+ features |= ALWAYS_ON_FEATURES;
++ features &= ~NETIF_F_NETNS_LOCAL;
+
+ return features;
+ }
+@@ -1020,12 +1022,13 @@ int macvlan_common_newlink(struct net *src_net, struct net_device *dev,
+ vlan->flags = nla_get_u16(data[IFLA_MACVLAN_FLAGS]);
+
+ if (vlan->mode == MACVLAN_MODE_PASSTHRU) {
+- if (!MACVLAN_PORT_IS_EMPTY(port))
++ if (port->count)
+ return -EINVAL;
+ port->passthru = true;
+ eth_hw_addr_inherit(dev, lowerdev);
+ }
+
++ port->count += 1;
+ err = register_netdevice(dev);
+ if (err < 0)
+ goto destroy_port;
+@@ -1043,7 +1046,8 @@ int macvlan_common_newlink(struct net *src_net, struct net_device *dev,
+ unregister_netdev:
+ unregister_netdevice(dev);
+ destroy_port:
+- if (MACVLAN_PORT_IS_EMPTY(port))
++ port->count -= 1;
++ if (!port->count)
+ macvlan_port_destroy(lowerdev);
+
+ return err;
+diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
+index 3381c4f91a8c..0c6adaaf898c 100644
+--- a/drivers/net/macvtap.c
++++ b/drivers/net/macvtap.c
+@@ -112,17 +112,15 @@ out:
+ return err;
+ }
+
++/* Requires RTNL */
+ static int macvtap_set_queue(struct net_device *dev, struct file *file,
+ struct macvtap_queue *q)
+ {
+ struct macvlan_dev *vlan = netdev_priv(dev);
+- int err = -EBUSY;
+
+- rtnl_lock();
+ if (vlan->numqueues == MAX_MACVTAP_QUEUES)
+- goto out;
++ return -EBUSY;
+
+- err = 0;
+ rcu_assign_pointer(q->vlan, vlan);
+ rcu_assign_pointer(vlan->taps[vlan->numvtaps], q);
+ sock_hold(&q->sk);
+@@ -136,9 +134,7 @@ static int macvtap_set_queue(struct net_device *dev, struct file *file,
+ vlan->numvtaps++;
+ vlan->numqueues++;
+
+-out:
+- rtnl_unlock();
+- return err;
++ return 0;
+ }
+
+ static int macvtap_disable_queue(struct macvtap_queue *q)
+@@ -454,11 +450,12 @@ static void macvtap_sock_destruct(struct sock *sk)
+ static int macvtap_open(struct inode *inode, struct file *file)
+ {
+ struct net *net = current->nsproxy->net_ns;
+- struct net_device *dev = dev_get_by_macvtap_minor(iminor(inode));
++ struct net_device *dev;
+ struct macvtap_queue *q;
+- int err;
++ int err = -ENODEV;
+
+- err = -ENODEV;
++ rtnl_lock();
++ dev = dev_get_by_macvtap_minor(iminor(inode));
+ if (!dev)
+ goto out;
+
+@@ -498,6 +495,7 @@ out:
+ if (dev)
+ dev_put(dev);
+
++ rtnl_unlock();
+ return err;
+ }
+
+diff --git a/drivers/net/phy/smsc.c b/drivers/net/phy/smsc.c
+index 180c49479c42..a4b08198fb9f 100644
+--- a/drivers/net/phy/smsc.c
++++ b/drivers/net/phy/smsc.c
+@@ -43,6 +43,22 @@ static int smsc_phy_ack_interrupt(struct phy_device *phydev)
+
+ static int smsc_phy_config_init(struct phy_device *phydev)
+ {
++ int rc = phy_read(phydev, MII_LAN83C185_CTRL_STATUS);
++
++ if (rc < 0)
++ return rc;
++
++ /* Enable energy detect mode for this SMSC Transceivers */
++ rc = phy_write(phydev, MII_LAN83C185_CTRL_STATUS,
++ rc | MII_LAN83C185_EDPWRDOWN);
++ if (rc < 0)
++ return rc;
++
++ return smsc_phy_ack_interrupt(phydev);
++}
++
++static int smsc_phy_reset(struct phy_device *phydev)
++{
+ int rc = phy_read(phydev, MII_LAN83C185_SPECIAL_MODES);
+ if (rc < 0)
+ return rc;
+@@ -66,18 +82,7 @@ static int smsc_phy_config_init(struct phy_device *phydev)
+ rc = phy_read(phydev, MII_BMCR);
+ } while (rc & BMCR_RESET);
+ }
+-
+- rc = phy_read(phydev, MII_LAN83C185_CTRL_STATUS);
+- if (rc < 0)
+- return rc;
+-
+- /* Enable energy detect mode for this SMSC Transceivers */
+- rc = phy_write(phydev, MII_LAN83C185_CTRL_STATUS,
+- rc | MII_LAN83C185_EDPWRDOWN);
+- if (rc < 0)
+- return rc;
+-
+- return smsc_phy_ack_interrupt (phydev);
++ return 0;
+ }
+
+ static int lan911x_config_init(struct phy_device *phydev)
+@@ -142,6 +147,7 @@ static struct phy_driver smsc_phy_driver[] = {
+ .config_aneg = genphy_config_aneg,
+ .read_status = genphy_read_status,
+ .config_init = smsc_phy_config_init,
++ .soft_reset = smsc_phy_reset,
+
+ /* IRQ related */
+ .ack_interrupt = smsc_phy_ack_interrupt,
+@@ -164,6 +170,7 @@ static struct phy_driver smsc_phy_driver[] = {
+ .config_aneg = genphy_config_aneg,
+ .read_status = genphy_read_status,
+ .config_init = smsc_phy_config_init,
++ .soft_reset = smsc_phy_reset,
+
+ /* IRQ related */
+ .ack_interrupt = smsc_phy_ack_interrupt,
+@@ -186,6 +193,7 @@ static struct phy_driver smsc_phy_driver[] = {
+ .config_aneg = genphy_config_aneg,
+ .read_status = genphy_read_status,
+ .config_init = smsc_phy_config_init,
++ .soft_reset = smsc_phy_reset,
+
+ /* IRQ related */
+ .ack_interrupt = smsc_phy_ack_interrupt,
+@@ -230,6 +238,7 @@ static struct phy_driver smsc_phy_driver[] = {
+ .config_aneg = genphy_config_aneg,
+ .read_status = lan87xx_read_status,
+ .config_init = smsc_phy_config_init,
++ .soft_reset = smsc_phy_reset,
+
+ /* IRQ related */
+ .ack_interrupt = smsc_phy_ack_interrupt,
+diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c
+index b4958c7ffa84..cb2a00e1d95a 100644
+--- a/drivers/net/team/team.c
++++ b/drivers/net/team/team.c
+@@ -647,7 +647,7 @@ static void team_notify_peers(struct team *team)
+ {
+ if (!team->notify_peers.count || !netif_running(team->dev))
+ return;
+- atomic_set(&team->notify_peers.count_pending, team->notify_peers.count);
++ atomic_add(team->notify_peers.count, &team->notify_peers.count_pending);
+ schedule_delayed_work(&team->notify_peers.dw, 0);
+ }
+
+@@ -687,7 +687,7 @@ static void team_mcast_rejoin(struct team *team)
+ {
+ if (!team->mcast_rejoin.count || !netif_running(team->dev))
+ return;
+- atomic_set(&team->mcast_rejoin.count_pending, team->mcast_rejoin.count);
++ atomic_add(team->mcast_rejoin.count, &team->mcast_rejoin.count_pending);
+ schedule_delayed_work(&team->mcast_rejoin.dw, 0);
+ }
+
+diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
+index 9f79192c9aa0..31a7ad0d7d5f 100644
+--- a/drivers/net/vxlan.c
++++ b/drivers/net/vxlan.c
+@@ -1325,7 +1325,7 @@ static int arp_reduce(struct net_device *dev, struct sk_buff *skb)
+ } else if (vxlan->flags & VXLAN_F_L3MISS) {
+ union vxlan_addr ipa = {
+ .sin.sin_addr.s_addr = tip,
+- .sa.sa_family = AF_INET,
++ .sin.sin_family = AF_INET,
+ };
+
+ vxlan_ip_miss(dev, &ipa);
+@@ -1486,7 +1486,7 @@ static int neigh_reduce(struct net_device *dev, struct sk_buff *skb)
+ } else if (vxlan->flags & VXLAN_F_L3MISS) {
+ union vxlan_addr ipa = {
+ .sin6.sin6_addr = msg->target,
+- .sa.sa_family = AF_INET6,
++ .sin6.sin6_family = AF_INET6,
+ };
+
+ vxlan_ip_miss(dev, &ipa);
+@@ -1519,7 +1519,7 @@ static bool route_shortcircuit(struct net_device *dev, struct sk_buff *skb)
+ if (!n && (vxlan->flags & VXLAN_F_L3MISS)) {
+ union vxlan_addr ipa = {
+ .sin.sin_addr.s_addr = pip->daddr,
+- .sa.sa_family = AF_INET,
++ .sin.sin_family = AF_INET,
+ };
+
+ vxlan_ip_miss(dev, &ipa);
+@@ -1540,7 +1540,7 @@ static bool route_shortcircuit(struct net_device *dev, struct sk_buff *skb)
+ if (!n && (vxlan->flags & VXLAN_F_L3MISS)) {
+ union vxlan_addr ipa = {
+ .sin6.sin6_addr = pip6->daddr,
+- .sa.sa_family = AF_INET6,
++ .sin6.sin6_family = AF_INET6,
+ };
+
+ vxlan_ip_miss(dev, &ipa);
+diff --git a/drivers/tty/serial/8250/8250_pci.c b/drivers/tty/serial/8250/8250_pci.c
+index 33137b3ba94d..370f6e46caf5 100644
+--- a/drivers/tty/serial/8250/8250_pci.c
++++ b/drivers/tty/serial/8250/8250_pci.c
+@@ -1790,6 +1790,7 @@ pci_wch_ch353_setup(struct serial_private *priv,
+ #define PCI_DEVICE_ID_COMMTECH_4222PCIE 0x0022
+ #define PCI_DEVICE_ID_BROADCOM_TRUMANAGE 0x160a
+ #define PCI_DEVICE_ID_AMCC_ADDIDATA_APCI7800 0x818e
++#define PCI_DEVICE_ID_INTEL_QRK_UART 0x0936
+
+ #define PCI_VENDOR_ID_SUNIX 0x1fd4
+ #define PCI_DEVICE_ID_SUNIX_1999 0x1999
+@@ -1900,6 +1901,13 @@ static struct pci_serial_quirk pci_serial_quirks[] __refdata = {
+ .subdevice = PCI_ANY_ID,
+ .setup = byt_serial_setup,
+ },
++ {
++ .vendor = PCI_VENDOR_ID_INTEL,
++ .device = PCI_DEVICE_ID_INTEL_QRK_UART,
++ .subvendor = PCI_ANY_ID,
++ .subdevice = PCI_ANY_ID,
++ .setup = pci_default_setup,
++ },
+ /*
+ * ITE
+ */
+@@ -2742,6 +2750,7 @@ enum pci_board_num_t {
+ pbn_ADDIDATA_PCIe_8_3906250,
+ pbn_ce4100_1_115200,
+ pbn_byt,
++ pbn_qrk,
+ pbn_omegapci,
+ pbn_NETMOS9900_2s_115200,
+ pbn_brcm_trumanage,
+@@ -3492,6 +3501,12 @@ static struct pciserial_board pci_boards[] = {
+ .uart_offset = 0x80,
+ .reg_shift = 2,
+ },
++ [pbn_qrk] = {
++ .flags = FL_BASE0,
++ .num_ports = 1,
++ .base_baud = 2764800,
++ .reg_shift = 2,
++ },
+ [pbn_omegapci] = {
+ .flags = FL_BASE0,
+ .num_ports = 8,
+@@ -5194,6 +5209,12 @@ static struct pci_device_id serial_pci_tbl[] = {
+ pbn_byt },
+
+ /*
++ * Intel Quark x1000
++ */
++ { PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_QRK_UART,
++ PCI_ANY_ID, PCI_ANY_ID, 0, 0,
++ pbn_qrk },
++ /*
+ * Cronyx Omega PCI
+ */
+ { PCI_VENDOR_ID_PLX, PCI_DEVICE_ID_PLX_CRONYX_OMEGA,
+diff --git a/drivers/usb/core/hub.c b/drivers/usb/core/hub.c
+index 50e854509f55..ba2a8f3b8059 100644
+--- a/drivers/usb/core/hub.c
++++ b/drivers/usb/core/hub.c
+@@ -1983,8 +1983,10 @@ void usb_set_device_state(struct usb_device *udev,
+ || new_state == USB_STATE_SUSPENDED)
+ ; /* No change to wakeup settings */
+ else if (new_state == USB_STATE_CONFIGURED)
+- wakeup = udev->actconfig->desc.bmAttributes
+- & USB_CONFIG_ATT_WAKEUP;
++ wakeup = (udev->quirks &
++ USB_QUIRK_IGNORE_REMOTE_WAKEUP) ? 0 :
++ udev->actconfig->desc.bmAttributes &
++ USB_CONFIG_ATT_WAKEUP;
+ else
+ wakeup = 0;
+ }
+diff --git a/drivers/usb/core/quirks.c b/drivers/usb/core/quirks.c
+index 739ee8e8bdfd..5144d11d032c 100644
+--- a/drivers/usb/core/quirks.c
++++ b/drivers/usb/core/quirks.c
+@@ -160,6 +160,10 @@ static const struct usb_device_id usb_interface_quirk_list[] = {
+ { USB_VENDOR_AND_INTERFACE_INFO(0x046d, USB_CLASS_VIDEO, 1, 0),
+ .driver_info = USB_QUIRK_RESET_RESUME },
+
++ /* ASUS Base Station(T100) */
++ { USB_DEVICE(0x0b05, 0x17e0), .driver_info =
++ USB_QUIRK_IGNORE_REMOTE_WAKEUP },
++
+ { } /* terminating entry must be last */
+ };
+
+diff --git a/drivers/usb/musb/musb_dsps.c b/drivers/usb/musb/musb_dsps.c
+index 09529f94e72d..6983e805147b 100644
+--- a/drivers/usb/musb/musb_dsps.c
++++ b/drivers/usb/musb/musb_dsps.c
+@@ -780,6 +780,7 @@ static int dsps_suspend(struct device *dev)
+ struct musb *musb = platform_get_drvdata(glue->musb);
+ void __iomem *mbase = musb->ctrl_base;
+
++ del_timer_sync(&glue->timer);
+ glue->context.control = dsps_readl(mbase, wrp->control);
+ glue->context.epintr = dsps_readl(mbase, wrp->epintr_set);
+ glue->context.coreintr = dsps_readl(mbase, wrp->coreintr_set);
+@@ -805,6 +806,7 @@ static int dsps_resume(struct device *dev)
+ dsps_writel(mbase, wrp->mode, glue->context.mode);
+ dsps_writel(mbase, wrp->tx_mode, glue->context.tx_mode);
+ dsps_writel(mbase, wrp->rx_mode, glue->context.rx_mode);
++ setup_timer(&glue->timer, otg_timer, (unsigned long) musb);
+
+ return 0;
+ }
+diff --git a/drivers/usb/serial/cp210x.c b/drivers/usb/serial/cp210x.c
+index 330df5ce435b..63b2af2a87c0 100644
+--- a/drivers/usb/serial/cp210x.c
++++ b/drivers/usb/serial/cp210x.c
+@@ -122,6 +122,7 @@ static const struct usb_device_id id_table[] = {
+ { USB_DEVICE(0x10C4, 0x8665) }, /* AC-Services OBD-IF */
+ { USB_DEVICE(0x10C4, 0x88A4) }, /* MMB Networks ZigBee USB Device */
+ { USB_DEVICE(0x10C4, 0x88A5) }, /* Planet Innovation Ingeni ZigBee USB Device */
++ { USB_DEVICE(0x10C4, 0x8946) }, /* Ketra N1 Wireless Interface */
+ { USB_DEVICE(0x10C4, 0xEA60) }, /* Silicon Labs factory default */
+ { USB_DEVICE(0x10C4, 0xEA61) }, /* Silicon Labs factory default */
+ { USB_DEVICE(0x10C4, 0xEA70) }, /* Silicon Labs factory default */
+@@ -155,6 +156,7 @@ static const struct usb_device_id id_table[] = {
+ { USB_DEVICE(0x1ADB, 0x0001) }, /* Schweitzer Engineering C662 Cable */
+ { USB_DEVICE(0x1B1C, 0x1C00) }, /* Corsair USB Dongle */
+ { USB_DEVICE(0x1BE3, 0x07A6) }, /* WAGO 750-923 USB Service Cable */
++ { USB_DEVICE(0x1D6F, 0x0010) }, /* Seluxit ApS RF Dongle */
+ { USB_DEVICE(0x1E29, 0x0102) }, /* Festo CPX-USB */
+ { USB_DEVICE(0x1E29, 0x0501) }, /* Festo CMSP */
+ { USB_DEVICE(0x1FB9, 0x0100) }, /* Lake Shore Model 121 Current Source */
+diff --git a/drivers/usb/storage/uas.c b/drivers/usb/storage/uas.c
+index 3f42785f653c..27136935fec3 100644
+--- a/drivers/usb/storage/uas.c
++++ b/drivers/usb/storage/uas.c
+@@ -28,6 +28,7 @@
+ #include <scsi/scsi_tcq.h>
+
+ #include "uas-detect.h"
++#include "scsiglue.h"
+
+ /*
+ * The r00-r01c specs define this version of the SENSE IU data structure.
+@@ -49,6 +50,7 @@ struct uas_dev_info {
+ struct usb_anchor cmd_urbs;
+ struct usb_anchor sense_urbs;
+ struct usb_anchor data_urbs;
++ unsigned long flags;
+ int qdepth, resetting;
+ struct response_iu response;
+ unsigned cmd_pipe, status_pipe, data_in_pipe, data_out_pipe;
+@@ -714,6 +716,15 @@ static int uas_queuecommand_lck(struct scsi_cmnd *cmnd,
+
+ BUILD_BUG_ON(sizeof(struct uas_cmd_info) > sizeof(struct scsi_pointer));
+
++ if ((devinfo->flags & US_FL_NO_ATA_1X) &&
++ (cmnd->cmnd[0] == ATA_12 || cmnd->cmnd[0] == ATA_16)) {
++ memcpy(cmnd->sense_buffer, usb_stor_sense_invalidCDB,
++ sizeof(usb_stor_sense_invalidCDB));
++ cmnd->result = SAM_STAT_CHECK_CONDITION;
++ cmnd->scsi_done(cmnd);
++ return 0;
++ }
++
+ spin_lock_irqsave(&devinfo->lock, flags);
+
+ if (devinfo->resetting) {
+@@ -950,6 +961,10 @@ static int uas_slave_alloc(struct scsi_device *sdev)
+ static int uas_slave_configure(struct scsi_device *sdev)
+ {
+ struct uas_dev_info *devinfo = sdev->hostdata;
++
++ if (devinfo->flags & US_FL_NO_REPORT_OPCODES)
++ sdev->no_report_opcodes = 1;
++
+ scsi_set_tag_type(sdev, MSG_ORDERED_TAG);
+ scsi_activate_tcq(sdev, devinfo->qdepth - 2);
+ return 0;
+@@ -1080,6 +1095,8 @@ static int uas_probe(struct usb_interface *intf, const struct usb_device_id *id)
+ devinfo->resetting = 0;
+ devinfo->running_task = 0;
+ devinfo->shutdown = 0;
++ devinfo->flags = id->driver_info;
++ usb_stor_adjust_quirks(udev, &devinfo->flags);
+ init_usb_anchor(&devinfo->cmd_urbs);
+ init_usb_anchor(&devinfo->sense_urbs);
+ init_usb_anchor(&devinfo->data_urbs);
+diff --git a/drivers/usb/storage/unusual_uas.h b/drivers/usb/storage/unusual_uas.h
+index 7244444df8ee..8511b54a65d9 100644
+--- a/drivers/usb/storage/unusual_uas.h
++++ b/drivers/usb/storage/unusual_uas.h
+@@ -40,13 +40,38 @@
+ * and don't forget to CC: the USB development list <linux-usb@vger.kernel.org>
+ */
+
+-/*
+- * This is an example entry for the US_FL_IGNORE_UAS flag. Once we have an
+- * actual entry using US_FL_IGNORE_UAS this entry should be removed.
+- *
+- * UNUSUAL_DEV( 0xabcd, 0x1234, 0x0100, 0x0100,
+- * "Example",
+- * "Storage with broken UAS",
+- * USB_SC_DEVICE, USB_PR_DEVICE, NULL,
+- * US_FL_IGNORE_UAS),
+- */
++/* https://bugzilla.kernel.org/show_bug.cgi?id=79511 */
++UNUSUAL_DEV(0x0bc2, 0x2312, 0x0000, 0x9999,
++ "Seagate",
++ "Expansion Desk",
++ USB_SC_DEVICE, USB_PR_DEVICE, NULL,
++ US_FL_NO_ATA_1X),
++
++/* https://bbs.archlinux.org/viewtopic.php?id=183190 */
++UNUSUAL_DEV(0x0bc2, 0x3312, 0x0000, 0x9999,
++ "Seagate",
++ "Expansion Desk",
++ USB_SC_DEVICE, USB_PR_DEVICE, NULL,
++ US_FL_NO_ATA_1X),
++
++/* https://bbs.archlinux.org/viewtopic.php?id=183190 */
++UNUSUAL_DEV(0x0bc2, 0xab20, 0x0000, 0x9999,
++ "Seagate",
++ "Backup+ BK",
++ USB_SC_DEVICE, USB_PR_DEVICE, NULL,
++ US_FL_NO_ATA_1X),
++
++/* Reported-by: Claudio Bizzarri <claudio.bizzarri@gmail.com> */
++UNUSUAL_DEV(0x152d, 0x0567, 0x0000, 0x9999,
++ "JMicron",
++ "JMS567",
++ USB_SC_DEVICE, USB_PR_DEVICE, NULL,
++ US_FL_NO_REPORT_OPCODES),
++
++/* Most ASM1051 based devices have issues with uas, blacklist them all */
++/* Reported-by: Hans de Goede <hdegoede@redhat.com> */
++UNUSUAL_DEV(0x174c, 0x5106, 0x0000, 0x9999,
++ "ASMedia",
++ "ASM1051",
++ USB_SC_DEVICE, USB_PR_DEVICE, NULL,
++ US_FL_IGNORE_UAS),
+diff --git a/drivers/usb/storage/usb.c b/drivers/usb/storage/usb.c
+index f1c96261a501..20c5bcc6d3df 100644
+--- a/drivers/usb/storage/usb.c
++++ b/drivers/usb/storage/usb.c
+@@ -476,7 +476,8 @@ void usb_stor_adjust_quirks(struct usb_device *udev, unsigned long *fflags)
+ US_FL_CAPACITY_OK | US_FL_IGNORE_RESIDUE |
+ US_FL_SINGLE_LUN | US_FL_NO_WP_DETECT |
+ US_FL_NO_READ_DISC_INFO | US_FL_NO_READ_CAPACITY_16 |
+- US_FL_INITIAL_READ10 | US_FL_WRITE_CACHE);
++ US_FL_INITIAL_READ10 | US_FL_WRITE_CACHE |
++ US_FL_NO_ATA_1X | US_FL_NO_REPORT_OPCODES);
+
+ p = quirks;
+ while (*p) {
+@@ -514,6 +515,9 @@ void usb_stor_adjust_quirks(struct usb_device *udev, unsigned long *fflags)
+ case 'e':
+ f |= US_FL_NO_READ_CAPACITY_16;
+ break;
++ case 'f':
++ f |= US_FL_NO_REPORT_OPCODES;
++ break;
+ case 'h':
+ f |= US_FL_CAPACITY_HEURISTICS;
+ break;
+@@ -541,6 +545,9 @@ void usb_stor_adjust_quirks(struct usb_device *udev, unsigned long *fflags)
+ case 's':
+ f |= US_FL_SINGLE_LUN;
+ break;
++ case 't':
++ f |= US_FL_NO_ATA_1X;
++ break;
+ case 'u':
+ f |= US_FL_IGNORE_UAS;
+ break;
+diff --git a/include/linux/if_vlan.h b/include/linux/if_vlan.h
+index 4967916fe4ac..d69f0577a319 100644
+--- a/include/linux/if_vlan.h
++++ b/include/linux/if_vlan.h
+@@ -187,7 +187,6 @@ vlan_dev_get_egress_qos_mask(struct net_device *dev, u32 skprio)
+ }
+
+ extern bool vlan_do_receive(struct sk_buff **skb);
+-extern struct sk_buff *vlan_untag(struct sk_buff *skb);
+
+ extern int vlan_vid_add(struct net_device *dev, __be16 proto, u16 vid);
+ extern void vlan_vid_del(struct net_device *dev, __be16 proto, u16 vid);
+@@ -241,11 +240,6 @@ static inline bool vlan_do_receive(struct sk_buff **skb)
+ return false;
+ }
+
+-static inline struct sk_buff *vlan_untag(struct sk_buff *skb)
+-{
+- return skb;
+-}
+-
+ static inline int vlan_vid_add(struct net_device *dev, __be16 proto, u16 vid)
+ {
+ return 0;
+diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
+index ec89301ada41..6bb6bd86b0dc 100644
+--- a/include/linux/skbuff.h
++++ b/include/linux/skbuff.h
+@@ -2549,6 +2549,7 @@ int skb_shift(struct sk_buff *tgt, struct sk_buff *skb, int shiftlen);
+ void skb_scrub_packet(struct sk_buff *skb, bool xnet);
+ unsigned int skb_gso_transport_seglen(const struct sk_buff *skb);
+ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features);
++struct sk_buff *skb_vlan_untag(struct sk_buff *skb);
+
+ struct skb_checksum_ops {
+ __wsum (*update)(const void *mem, int len, __wsum wsum);
+diff --git a/include/linux/usb/quirks.h b/include/linux/usb/quirks.h
+index 52f944dfe2fd..49587dc22f5d 100644
+--- a/include/linux/usb/quirks.h
++++ b/include/linux/usb/quirks.h
+@@ -30,4 +30,7 @@
+ descriptor */
+ #define USB_QUIRK_DELAY_INIT 0x00000040
+
++/* device generates spurious wakeup, ignore remote wakeup capability */
++#define USB_QUIRK_IGNORE_REMOTE_WAKEUP 0x00000200
++
+ #endif /* __LINUX_USB_QUIRKS_H */
+diff --git a/include/linux/usb_usual.h b/include/linux/usb_usual.h
+index 9b7de1b46437..a7f2604c5f25 100644
+--- a/include/linux/usb_usual.h
++++ b/include/linux/usb_usual.h
+@@ -73,6 +73,10 @@
+ /* Device advertises UAS but it is broken */ \
+ US_FLAG(BROKEN_FUA, 0x01000000) \
+ /* Cannot handle FUA in WRITE or READ CDBs */ \
++ US_FLAG(NO_ATA_1X, 0x02000000) \
++ /* Cannot handle ATA_12 or ATA_16 CDBs */ \
++ US_FLAG(NO_REPORT_OPCODES, 0x04000000) \
++ /* Cannot handle MI_REPORT_SUPPORTED_OPERATION_CODES */ \
+
+ #define US_FLAG(name, value) US_FL_##name = value ,
+ enum { US_DO_ALL_FLAGS };
+diff --git a/include/net/dst.h b/include/net/dst.h
+index 71c60f42be48..a8ae4e760778 100644
+--- a/include/net/dst.h
++++ b/include/net/dst.h
+@@ -480,6 +480,7 @@ void dst_init(void);
+ /* Flags for xfrm_lookup flags argument. */
+ enum {
+ XFRM_LOOKUP_ICMP = 1 << 0,
++ XFRM_LOOKUP_QUEUE = 1 << 1,
+ };
+
+ struct flowi;
+@@ -490,7 +491,16 @@ static inline struct dst_entry *xfrm_lookup(struct net *net,
+ int flags)
+ {
+ return dst_orig;
+-}
++}
++
++static inline struct dst_entry *xfrm_lookup_route(struct net *net,
++ struct dst_entry *dst_orig,
++ const struct flowi *fl,
++ struct sock *sk,
++ int flags)
++{
++ return dst_orig;
++}
+
+ static inline struct xfrm_state *dst_xfrm(const struct dst_entry *dst)
+ {
+@@ -502,6 +512,10 @@ struct dst_entry *xfrm_lookup(struct net *net, struct dst_entry *dst_orig,
+ const struct flowi *fl, struct sock *sk,
+ int flags);
+
++struct dst_entry *xfrm_lookup_route(struct net *net, struct dst_entry *dst_orig,
++ const struct flowi *fl, struct sock *sk,
++ int flags);
++
+ /* skb attached with this dst needs transformation if dst->xfrm is valid */
+ static inline struct xfrm_state *dst_xfrm(const struct dst_entry *dst)
+ {
+diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
+index 7a4313887568..5fbe6568c3cf 100644
+--- a/include/net/inet_connection_sock.h
++++ b/include/net/inet_connection_sock.h
+@@ -62,6 +62,7 @@ struct inet_connection_sock_af_ops {
+ void (*addr2sockaddr)(struct sock *sk, struct sockaddr *);
+ int (*bind_conflict)(const struct sock *sk,
+ const struct inet_bind_bucket *tb, bool relax);
++ void (*mtu_reduced)(struct sock *sk);
+ };
+
+ /** inet_connection_sock - INET connection oriented sock
+diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
+index 9bcb220bd4ad..cf485f9aa563 100644
+--- a/include/net/ip6_fib.h
++++ b/include/net/ip6_fib.h
+@@ -114,16 +114,13 @@ struct rt6_info {
+ u32 rt6i_flags;
+ struct rt6key rt6i_src;
+ struct rt6key rt6i_prefsrc;
+- u32 rt6i_metric;
+
+ struct inet6_dev *rt6i_idev;
+ unsigned long _rt6i_peer;
+
+- u32 rt6i_genid;
+-
++ u32 rt6i_metric;
+ /* more non-fragment space at head required */
+ unsigned short rt6i_nfheader_len;
+-
+ u8 rt6i_protocol;
+ };
+
+diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
+index 361d26077196..e0d64667a4b3 100644
+--- a/include/net/net_namespace.h
++++ b/include/net/net_namespace.h
+@@ -352,26 +352,12 @@ static inline void rt_genid_bump_ipv4(struct net *net)
+ atomic_inc(&net->ipv4.rt_genid);
+ }
+
+-#if IS_ENABLED(CONFIG_IPV6)
+-static inline int rt_genid_ipv6(struct net *net)
+-{
+- return atomic_read(&net->ipv6.rt_genid);
+-}
+-
+-static inline void rt_genid_bump_ipv6(struct net *net)
+-{
+- atomic_inc(&net->ipv6.rt_genid);
+-}
+-#else
+-static inline int rt_genid_ipv6(struct net *net)
+-{
+- return 0;
+-}
+-
++extern void (*__fib6_flush_trees)(struct net *net);
+ static inline void rt_genid_bump_ipv6(struct net *net)
+ {
++ if (__fib6_flush_trees)
++ __fib6_flush_trees(net);
+ }
+-#endif
+
+ #if IS_ENABLED(CONFIG_IEEE802154_6LOWPAN)
+ static inline struct netns_ieee802154_lowpan *
+diff --git a/include/net/sctp/command.h b/include/net/sctp/command.h
+index 4b7cd695e431..cfcbc3f627bd 100644
+--- a/include/net/sctp/command.h
++++ b/include/net/sctp/command.h
+@@ -115,7 +115,7 @@ typedef enum {
+ * analysis of the state functions, but in reality just taken from
+ * thin air in the hopes othat we don't trigger a kernel panic.
+ */
+-#define SCTP_MAX_NUM_COMMANDS 14
++#define SCTP_MAX_NUM_COMMANDS 20
+
+ typedef union {
+ __s32 i32;
+diff --git a/include/net/sock.h b/include/net/sock.h
+index 156350745700..6cc7944d65bf 100644
+--- a/include/net/sock.h
++++ b/include/net/sock.h
+@@ -971,7 +971,6 @@ struct proto {
+ struct sk_buff *skb);
+
+ void (*release_cb)(struct sock *sk);
+- void (*mtu_reduced)(struct sock *sk);
+
+ /* Keeping track of sk's, looking them up, and port selection methods. */
+ void (*hash)(struct sock *sk);
+diff --git a/include/net/tcp.h b/include/net/tcp.h
+index 7286db80e8b8..d587ff0f8828 100644
+--- a/include/net/tcp.h
++++ b/include/net/tcp.h
+@@ -448,6 +448,7 @@ const u8 *tcp_parse_md5sig_option(const struct tcphdr *th);
+ */
+
+ void tcp_v4_send_check(struct sock *sk, struct sk_buff *skb);
++void tcp_v4_mtu_reduced(struct sock *sk);
+ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb);
+ struct sock *tcp_create_openreq_child(struct sock *sk,
+ struct request_sock *req,
+@@ -718,8 +719,10 @@ struct tcp_skb_cb {
+ #define TCPCB_SACKED_RETRANS 0x02 /* SKB retransmitted */
+ #define TCPCB_LOST 0x04 /* SKB is lost */
+ #define TCPCB_TAGBITS 0x07 /* All tag bits */
++#define TCPCB_REPAIRED 0x10 /* SKB repaired (no skb_mstamp) */
+ #define TCPCB_EVER_RETRANS 0x80 /* Ever retransmitted frame */
+-#define TCPCB_RETRANS (TCPCB_SACKED_RETRANS|TCPCB_EVER_RETRANS)
++#define TCPCB_RETRANS (TCPCB_SACKED_RETRANS|TCPCB_EVER_RETRANS| \
++ TCPCB_REPAIRED)
+
+ __u8 ip_dsfield; /* IPv4 tos or IPv6 dsfield */
+ /* 1 byte hole */
+diff --git a/net/8021q/vlan_core.c b/net/8021q/vlan_core.c
+index 75d427763992..90cc2bdd4064 100644
+--- a/net/8021q/vlan_core.c
++++ b/net/8021q/vlan_core.c
+@@ -112,59 +112,6 @@ __be16 vlan_dev_vlan_proto(const struct net_device *dev)
+ }
+ EXPORT_SYMBOL(vlan_dev_vlan_proto);
+
+-static struct sk_buff *vlan_reorder_header(struct sk_buff *skb)
+-{
+- if (skb_cow(skb, skb_headroom(skb)) < 0) {
+- kfree_skb(skb);
+- return NULL;
+- }
+-
+- memmove(skb->data - ETH_HLEN, skb->data - VLAN_ETH_HLEN, 2 * ETH_ALEN);
+- skb->mac_header += VLAN_HLEN;
+- return skb;
+-}
+-
+-struct sk_buff *vlan_untag(struct sk_buff *skb)
+-{
+- struct vlan_hdr *vhdr;
+- u16 vlan_tci;
+-
+- if (unlikely(vlan_tx_tag_present(skb))) {
+- /* vlan_tci is already set-up so leave this for another time */
+- return skb;
+- }
+-
+- skb = skb_share_check(skb, GFP_ATOMIC);
+- if (unlikely(!skb))
+- goto err_free;
+-
+- if (unlikely(!pskb_may_pull(skb, VLAN_HLEN)))
+- goto err_free;
+-
+- vhdr = (struct vlan_hdr *) skb->data;
+- vlan_tci = ntohs(vhdr->h_vlan_TCI);
+- __vlan_hwaccel_put_tag(skb, skb->protocol, vlan_tci);
+-
+- skb_pull_rcsum(skb, VLAN_HLEN);
+- vlan_set_encap_proto(skb, vhdr);
+-
+- skb = vlan_reorder_header(skb);
+- if (unlikely(!skb))
+- goto err_free;
+-
+- skb_reset_network_header(skb);
+- skb_reset_transport_header(skb);
+- skb_reset_mac_len(skb);
+-
+- return skb;
+-
+-err_free:
+- kfree_skb(skb);
+- return NULL;
+-}
+-EXPORT_SYMBOL(vlan_untag);
+-
+-
+ /*
+ * vlan info and vid list
+ */
+diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
+index 23caf5b0309e..4fd47a1a0e9a 100644
+--- a/net/bridge/br_private.h
++++ b/net/bridge/br_private.h
+@@ -309,6 +309,9 @@ struct br_input_skb_cb {
+ int igmp;
+ int mrouters_only;
+ #endif
++#ifdef CONFIG_BRIDGE_VLAN_FILTERING
++ bool vlan_filtered;
++#endif
+ };
+
+ #define BR_INPUT_SKB_CB(__skb) ((struct br_input_skb_cb *)(__skb)->cb)
+diff --git a/net/bridge/br_vlan.c b/net/bridge/br_vlan.c
+index 2b2774fe0703..b03e884fba3e 100644
+--- a/net/bridge/br_vlan.c
++++ b/net/bridge/br_vlan.c
+@@ -127,7 +127,8 @@ struct sk_buff *br_handle_vlan(struct net_bridge *br,
+ {
+ u16 vid;
+
+- if (!br->vlan_enabled)
++ /* If this packet was not filtered at input, let it pass */
++ if (!BR_INPUT_SKB_CB(skb)->vlan_filtered)
+ goto out;
+
+ /* Vlan filter table must be configured at this point. The
+@@ -166,8 +167,10 @@ bool br_allowed_ingress(struct net_bridge *br, struct net_port_vlans *v,
+ /* If VLAN filtering is disabled on the bridge, all packets are
+ * permitted.
+ */
+- if (!br->vlan_enabled)
++ if (!br->vlan_enabled) {
++ BR_INPUT_SKB_CB(skb)->vlan_filtered = false;
+ return true;
++ }
+
+ /* If there are no vlan in the permitted list, all packets are
+ * rejected.
+@@ -175,6 +178,7 @@ bool br_allowed_ingress(struct net_bridge *br, struct net_port_vlans *v,
+ if (!v)
+ goto drop;
+
++ BR_INPUT_SKB_CB(skb)->vlan_filtered = true;
+ proto = br->vlan_proto;
+
+ /* If vlan tx offload is disabled on bridge device and frame was
+@@ -183,7 +187,7 @@ bool br_allowed_ingress(struct net_bridge *br, struct net_port_vlans *v,
+ */
+ if (unlikely(!vlan_tx_tag_present(skb) &&
+ skb->protocol == proto)) {
+- skb = vlan_untag(skb);
++ skb = skb_vlan_untag(skb);
+ if (unlikely(!skb))
+ return false;
+ }
+@@ -253,7 +257,8 @@ bool br_allowed_egress(struct net_bridge *br,
+ {
+ u16 vid;
+
+- if (!br->vlan_enabled)
++ /* If this packet was not filtered at input, let it pass */
++ if (!BR_INPUT_SKB_CB(skb)->vlan_filtered)
+ return true;
+
+ if (!v)
+@@ -272,6 +277,7 @@ bool br_should_learn(struct net_bridge_port *p, struct sk_buff *skb, u16 *vid)
+ struct net_bridge *br = p->br;
+ struct net_port_vlans *v;
+
++ /* If filtering was disabled at input, let it pass. */
+ if (!br->vlan_enabled)
+ return true;
+
+diff --git a/net/core/dev.c b/net/core/dev.c
+index 367a586d0c8a..2647b508eb4d 100644
+--- a/net/core/dev.c
++++ b/net/core/dev.c
+@@ -2576,13 +2576,19 @@ netdev_features_t netif_skb_features(struct sk_buff *skb)
+ return harmonize_features(skb, features);
+ }
+
+- features &= (skb->dev->vlan_features | NETIF_F_HW_VLAN_CTAG_TX |
+- NETIF_F_HW_VLAN_STAG_TX);
++ features = netdev_intersect_features(features,
++ skb->dev->vlan_features |
++ NETIF_F_HW_VLAN_CTAG_TX |
++ NETIF_F_HW_VLAN_STAG_TX);
+
+ if (protocol == htons(ETH_P_8021Q) || protocol == htons(ETH_P_8021AD))
+- features &= NETIF_F_SG | NETIF_F_HIGHDMA | NETIF_F_FRAGLIST |
+- NETIF_F_GEN_CSUM | NETIF_F_HW_VLAN_CTAG_TX |
+- NETIF_F_HW_VLAN_STAG_TX;
++ features = netdev_intersect_features(features,
++ NETIF_F_SG |
++ NETIF_F_HIGHDMA |
++ NETIF_F_FRAGLIST |
++ NETIF_F_GEN_CSUM |
++ NETIF_F_HW_VLAN_CTAG_TX |
++ NETIF_F_HW_VLAN_STAG_TX);
+
+ return harmonize_features(skb, features);
+ }
+@@ -3588,7 +3594,7 @@ another_round:
+
+ if (skb->protocol == cpu_to_be16(ETH_P_8021Q) ||
+ skb->protocol == cpu_to_be16(ETH_P_8021AD)) {
+- skb = vlan_untag(skb);
++ skb = skb_vlan_untag(skb);
+ if (unlikely(!skb))
+ goto unlock;
+ }
+diff --git a/net/core/filter.c b/net/core/filter.c
+index 1dbf6462f766..3139f966a178 100644
+--- a/net/core/filter.c
++++ b/net/core/filter.c
+@@ -1318,6 +1318,7 @@ static int sk_store_orig_filter(struct sk_filter *fp,
+ fkprog->filter = kmemdup(fp->insns, fsize, GFP_KERNEL);
+ if (!fkprog->filter) {
+ kfree(fp->orig_prog);
++ fp->orig_prog = NULL;
+ return -ENOMEM;
+ }
+
+diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
+index 1063996f8317..e0b5ca349049 100644
+--- a/net/core/rtnetlink.c
++++ b/net/core/rtnetlink.c
+@@ -799,7 +799,8 @@ static inline int rtnl_vfinfo_size(const struct net_device *dev,
+ (nla_total_size(sizeof(struct ifla_vf_mac)) +
+ nla_total_size(sizeof(struct ifla_vf_vlan)) +
+ nla_total_size(sizeof(struct ifla_vf_spoofchk)) +
+- nla_total_size(sizeof(struct ifla_vf_rate)));
++ nla_total_size(sizeof(struct ifla_vf_rate)) +
++ nla_total_size(sizeof(struct ifla_vf_link_state)));
+ return size;
+ } else
+ return 0;
+diff --git a/net/core/skbuff.c b/net/core/skbuff.c
+index 58ff88edbefd..f5f14d54d6a2 100644
+--- a/net/core/skbuff.c
++++ b/net/core/skbuff.c
+@@ -62,6 +62,7 @@
+ #include <linux/scatterlist.h>
+ #include <linux/errqueue.h>
+ #include <linux/prefetch.h>
++#include <linux/if_vlan.h>
+
+ #include <net/protocol.h>
+ #include <net/dst.h>
+@@ -3151,6 +3152,9 @@ int skb_gro_receive(struct sk_buff **head, struct sk_buff *skb)
+ NAPI_GRO_CB(skb)->free = NAPI_GRO_FREE_STOLEN_HEAD;
+ goto done;
+ }
++ /* switch back to head shinfo */
++ pinfo = skb_shinfo(p);
++
+ if (pinfo->frag_list)
+ goto merge;
+ if (skb_gro_len(p) != pinfo->gso_size)
+@@ -3959,3 +3963,55 @@ unsigned int skb_gso_transport_seglen(const struct sk_buff *skb)
+ return shinfo->gso_size;
+ }
+ EXPORT_SYMBOL_GPL(skb_gso_transport_seglen);
++
++static struct sk_buff *skb_reorder_vlan_header(struct sk_buff *skb)
++{
++ if (skb_cow(skb, skb_headroom(skb)) < 0) {
++ kfree_skb(skb);
++ return NULL;
++ }
++
++ memmove(skb->data - ETH_HLEN, skb->data - VLAN_ETH_HLEN, 2 * ETH_ALEN);
++ skb->mac_header += VLAN_HLEN;
++ return skb;
++}
++
++struct sk_buff *skb_vlan_untag(struct sk_buff *skb)
++{
++ struct vlan_hdr *vhdr;
++ u16 vlan_tci;
++
++ if (unlikely(vlan_tx_tag_present(skb))) {
++ /* vlan_tci is already set-up so leave this for another time */
++ return skb;
++ }
++
++ skb = skb_share_check(skb, GFP_ATOMIC);
++ if (unlikely(!skb))
++ goto err_free;
++
++ if (unlikely(!pskb_may_pull(skb, VLAN_HLEN)))
++ goto err_free;
++
++ vhdr = (struct vlan_hdr *)skb->data;
++ vlan_tci = ntohs(vhdr->h_vlan_TCI);
++ __vlan_hwaccel_put_tag(skb, skb->protocol, vlan_tci);
++
++ skb_pull_rcsum(skb, VLAN_HLEN);
++ vlan_set_encap_proto(skb, vhdr);
++
++ skb = skb_reorder_vlan_header(skb);
++ if (unlikely(!skb))
++ goto err_free;
++
++ skb_reset_network_header(skb);
++ skb_reset_transport_header(skb);
++ skb_reset_mac_len(skb);
++
++ return skb;
++
++err_free:
++ kfree_skb(skb);
++ return NULL;
++}
++EXPORT_SYMBOL(skb_vlan_untag);
+diff --git a/net/ipv4/ip_tunnel.c b/net/ipv4/ip_tunnel.c
+index 45920d928341..6c2719373bc5 100644
+--- a/net/ipv4/ip_tunnel.c
++++ b/net/ipv4/ip_tunnel.c
+@@ -764,9 +764,14 @@ int ip_tunnel_ioctl(struct net_device *dev, struct ip_tunnel_parm *p, int cmd)
+
+ t = ip_tunnel_find(itn, p, itn->fb_tunnel_dev->type);
+
+- if (!t && (cmd == SIOCADDTUNNEL)) {
+- t = ip_tunnel_create(net, itn, p);
+- err = PTR_ERR_OR_ZERO(t);
++ if (cmd == SIOCADDTUNNEL) {
++ if (!t) {
++ t = ip_tunnel_create(net, itn, p);
++ err = PTR_ERR_OR_ZERO(t);
++ break;
++ }
++
++ err = -EEXIST;
+ break;
+ }
+ if (dev != itn->fb_tunnel_dev && cmd == SIOCCHGTUNNEL) {
+diff --git a/net/ipv4/route.c b/net/ipv4/route.c
+index 190199851c9a..4b340c30a037 100644
+--- a/net/ipv4/route.c
++++ b/net/ipv4/route.c
+@@ -2267,9 +2267,9 @@ struct rtable *ip_route_output_flow(struct net *net, struct flowi4 *flp4,
+ return rt;
+
+ if (flp4->flowi4_proto)
+- rt = (struct rtable *) xfrm_lookup(net, &rt->dst,
+- flowi4_to_flowi(flp4),
+- sk, 0);
++ rt = (struct rtable *)xfrm_lookup_route(net, &rt->dst,
++ flowi4_to_flowi(flp4),
++ sk, 0);
+
+ return rt;
+ }
+diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
+index 9d2118e5fbc7..0717f45b5171 100644
+--- a/net/ipv4/tcp.c
++++ b/net/ipv4/tcp.c
+@@ -1175,13 +1175,6 @@ new_segment:
+ goto wait_for_memory;
+
+ /*
+- * All packets are restored as if they have
+- * already been sent.
+- */
+- if (tp->repair)
+- TCP_SKB_CB(skb)->when = tcp_time_stamp;
+-
+- /*
+ * Check whether we can use HW checksum.
+ */
+ if (sk->sk_route_caps & NETIF_F_ALL_CSUM)
+@@ -1190,6 +1183,13 @@ new_segment:
+ skb_entail(sk, skb);
+ copy = size_goal;
+ max = size_goal;
++
++ /* All packets are restored as if they have
++ * already been sent. skb_mstamp isn't set to
++ * avoid wrong rtt estimation.
++ */
++ if (tp->repair)
++ TCP_SKB_CB(skb)->sacked |= TCPCB_REPAIRED;
+ }
+
+ /* Try to append data to the end of skb. */
+diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
+index 40639c288dc2..a1bbebb03490 100644
+--- a/net/ipv4/tcp_input.c
++++ b/net/ipv4/tcp_input.c
+@@ -2680,7 +2680,6 @@ static void tcp_enter_recovery(struct sock *sk, bool ece_ack)
+ */
+ static void tcp_process_loss(struct sock *sk, int flag, bool is_dupack)
+ {
+- struct inet_connection_sock *icsk = inet_csk(sk);
+ struct tcp_sock *tp = tcp_sk(sk);
+ bool recovered = !before(tp->snd_una, tp->high_seq);
+
+@@ -2706,12 +2705,9 @@ static void tcp_process_loss(struct sock *sk, int flag, bool is_dupack)
+
+ if (recovered) {
+ /* F-RTO RFC5682 sec 3.1 step 2.a and 1st part of step 3.a */
+- icsk->icsk_retransmits = 0;
+ tcp_try_undo_recovery(sk);
+ return;
+ }
+- if (flag & FLAG_DATA_ACKED)
+- icsk->icsk_retransmits = 0;
+ if (tcp_is_reno(tp)) {
+ /* A Reno DUPACK means new data in F-RTO step 2.b above are
+ * delivered. Lower inflight to clock out (re)tranmissions.
+@@ -3393,8 +3389,10 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
+ icsk->icsk_pending == ICSK_TIME_LOSS_PROBE)
+ tcp_rearm_rto(sk);
+
+- if (after(ack, prior_snd_una))
++ if (after(ack, prior_snd_una)) {
+ flag |= FLAG_SND_UNA_ADVANCED;
++ icsk->icsk_retransmits = 0;
++ }
+
+ prior_fackets = tp->fackets_out;
+
+diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
+index 77cccda1ad0c..f63c524de5d9 100644
+--- a/net/ipv4/tcp_ipv4.c
++++ b/net/ipv4/tcp_ipv4.c
+@@ -269,7 +269,7 @@ EXPORT_SYMBOL(tcp_v4_connect);
+ * It can be called through tcp_release_cb() if socket was owned by user
+ * at the time tcp_v4_err() was called to handle ICMP message.
+ */
+-static void tcp_v4_mtu_reduced(struct sock *sk)
++void tcp_v4_mtu_reduced(struct sock *sk)
+ {
+ struct dst_entry *dst;
+ struct inet_sock *inet = inet_sk(sk);
+@@ -300,6 +300,7 @@ static void tcp_v4_mtu_reduced(struct sock *sk)
+ tcp_simple_retransmit(sk);
+ } /* else let the usual retransmit timer handle it */
+ }
++EXPORT_SYMBOL(tcp_v4_mtu_reduced);
+
+ static void do_redirect(struct sk_buff *skb, struct sock *sk)
+ {
+@@ -1880,6 +1881,7 @@ const struct inet_connection_sock_af_ops ipv4_specific = {
+ .compat_setsockopt = compat_ip_setsockopt,
+ .compat_getsockopt = compat_ip_getsockopt,
+ #endif
++ .mtu_reduced = tcp_v4_mtu_reduced,
+ };
+ EXPORT_SYMBOL(ipv4_specific);
+
+@@ -2499,7 +2501,6 @@ struct proto tcp_prot = {
+ .sendpage = tcp_sendpage,
+ .backlog_rcv = tcp_v4_do_rcv,
+ .release_cb = tcp_release_cb,
+- .mtu_reduced = tcp_v4_mtu_reduced,
+ .hash = inet_hash,
+ .unhash = inet_unhash,
+ .get_port = inet_csk_get_port,
+diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
+index 179b51e6bda3..4e4932b5079b 100644
+--- a/net/ipv4/tcp_output.c
++++ b/net/ipv4/tcp_output.c
+@@ -800,7 +800,7 @@ void tcp_release_cb(struct sock *sk)
+ __sock_put(sk);
+ }
+ if (flags & (1UL << TCP_MTU_REDUCED_DEFERRED)) {
+- sk->sk_prot->mtu_reduced(sk);
++ inet_csk(sk)->icsk_af_ops->mtu_reduced(sk);
+ __sock_put(sk);
+ }
+ }
+@@ -1916,8 +1916,11 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
+ tso_segs = tcp_init_tso_segs(sk, skb, mss_now);
+ BUG_ON(!tso_segs);
+
+- if (unlikely(tp->repair) && tp->repair_queue == TCP_SEND_QUEUE)
++ if (unlikely(tp->repair) && tp->repair_queue == TCP_SEND_QUEUE) {
++ /* "when" is used as a start point for the retransmit timer */
++ TCP_SKB_CB(skb)->when = tcp_time_stamp;
+ goto repair; /* Skip network transmission */
++ }
+
+ cwnd_quota = tcp_cwnd_test(tp, skb);
+ if (!cwnd_quota) {
+diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
+index 5667b3003af9..4a9a34954923 100644
+--- a/net/ipv6/addrconf.c
++++ b/net/ipv6/addrconf.c
+@@ -1679,14 +1679,12 @@ void addrconf_dad_failure(struct inet6_ifaddr *ifp)
+ addrconf_mod_dad_work(ifp, 0);
+ }
+
+-/* Join to solicited addr multicast group. */
+-
++/* Join to solicited addr multicast group.
++ * caller must hold RTNL */
+ void addrconf_join_solict(struct net_device *dev, const struct in6_addr *addr)
+ {
+ struct in6_addr maddr;
+
+- ASSERT_RTNL();
+-
+ if (dev->flags&(IFF_LOOPBACK|IFF_NOARP))
+ return;
+
+@@ -1694,12 +1692,11 @@ void addrconf_join_solict(struct net_device *dev, const struct in6_addr *addr)
+ ipv6_dev_mc_inc(dev, &maddr);
+ }
+
++/* caller must hold RTNL */
+ void addrconf_leave_solict(struct inet6_dev *idev, const struct in6_addr *addr)
+ {
+ struct in6_addr maddr;
+
+- ASSERT_RTNL();
+-
+ if (idev->dev->flags&(IFF_LOOPBACK|IFF_NOARP))
+ return;
+
+@@ -1707,12 +1704,11 @@ void addrconf_leave_solict(struct inet6_dev *idev, const struct in6_addr *addr)
+ __ipv6_dev_mc_dec(idev, &maddr);
+ }
+
++/* caller must hold RTNL */
+ static void addrconf_join_anycast(struct inet6_ifaddr *ifp)
+ {
+ struct in6_addr addr;
+
+- ASSERT_RTNL();
+-
+ if (ifp->prefix_len >= 127) /* RFC 6164 */
+ return;
+ ipv6_addr_prefix(&addr, &ifp->addr, ifp->prefix_len);
+@@ -1721,12 +1717,11 @@ static void addrconf_join_anycast(struct inet6_ifaddr *ifp)
+ ipv6_dev_ac_inc(ifp->idev->dev, &addr);
+ }
+
++/* caller must hold RTNL */
+ static void addrconf_leave_anycast(struct inet6_ifaddr *ifp)
+ {
+ struct in6_addr addr;
+
+- ASSERT_RTNL();
+-
+ if (ifp->prefix_len >= 127) /* RFC 6164 */
+ return;
+ ipv6_addr_prefix(&addr, &ifp->addr, ifp->prefix_len);
+@@ -4751,10 +4746,11 @@ static void __ipv6_ifa_notify(int event, struct inet6_ifaddr *ifp)
+
+ if (ip6_del_rt(ifp->rt))
+ dst_free(&ifp->rt->dst);
++
++ rt_genid_bump_ipv6(net);
+ break;
+ }
+ atomic_inc(&net->ipv6.dev_addr_genid);
+- rt_genid_bump_ipv6(net);
+ }
+
+ static void ipv6_ifa_notify(int event, struct inet6_ifaddr *ifp)
+diff --git a/net/ipv6/addrconf_core.c b/net/ipv6/addrconf_core.c
+index e6960457f625..98cc4cd570e2 100644
+--- a/net/ipv6/addrconf_core.c
++++ b/net/ipv6/addrconf_core.c
+@@ -8,6 +8,13 @@
+ #include <net/addrconf.h>
+ #include <net/ip.h>
+
++/* if ipv6 module registers this function is used by xfrm to force all
++ * sockets to relookup their nodes - this is fairly expensive, be
++ * careful
++ */
++void (*__fib6_flush_trees)(struct net *);
++EXPORT_SYMBOL(__fib6_flush_trees);
++
+ #define IPV6_ADDR_SCOPE_TYPE(scope) ((scope) << 16)
+
+ static inline unsigned int ipv6_addr_scope2type(unsigned int scope)
+diff --git a/net/ipv6/anycast.c b/net/ipv6/anycast.c
+index 210183244689..ff2de7d9d8e6 100644
+--- a/net/ipv6/anycast.c
++++ b/net/ipv6/anycast.c
+@@ -77,6 +77,7 @@ int ipv6_sock_ac_join(struct sock *sk, int ifindex, const struct in6_addr *addr)
+ pac->acl_next = NULL;
+ pac->acl_addr = *addr;
+
++ rtnl_lock();
+ rcu_read_lock();
+ if (ifindex == 0) {
+ struct rt6_info *rt;
+@@ -137,6 +138,7 @@ int ipv6_sock_ac_join(struct sock *sk, int ifindex, const struct in6_addr *addr)
+
+ error:
+ rcu_read_unlock();
++ rtnl_unlock();
+ if (pac)
+ sock_kfree_s(sk, pac, sizeof(*pac));
+ return err;
+@@ -171,11 +173,13 @@ int ipv6_sock_ac_drop(struct sock *sk, int ifindex, const struct in6_addr *addr)
+
+ spin_unlock_bh(&ipv6_sk_ac_lock);
+
++ rtnl_lock();
+ rcu_read_lock();
+ dev = dev_get_by_index_rcu(net, pac->acl_ifindex);
+ if (dev)
+ ipv6_dev_ac_dec(dev, &pac->acl_addr);
+ rcu_read_unlock();
++ rtnl_unlock();
+
+ sock_kfree_s(sk, pac, sizeof(*pac));
+ return 0;
+@@ -198,6 +202,7 @@ void ipv6_sock_ac_close(struct sock *sk)
+ spin_unlock_bh(&ipv6_sk_ac_lock);
+
+ prev_index = 0;
++ rtnl_lock();
+ rcu_read_lock();
+ while (pac) {
+ struct ipv6_ac_socklist *next = pac->acl_next;
+@@ -212,6 +217,7 @@ void ipv6_sock_ac_close(struct sock *sk)
+ pac = next;
+ }
+ rcu_read_unlock();
++ rtnl_unlock();
+ }
+
+ static void aca_put(struct ifacaddr6 *ac)
+@@ -233,6 +239,8 @@ int ipv6_dev_ac_inc(struct net_device *dev, const struct in6_addr *addr)
+ struct rt6_info *rt;
+ int err;
+
++ ASSERT_RTNL();
++
+ idev = in6_dev_get(dev);
+
+ if (idev == NULL)
+@@ -302,6 +310,8 @@ int __ipv6_dev_ac_dec(struct inet6_dev *idev, const struct in6_addr *addr)
+ {
+ struct ifacaddr6 *aca, *prev_aca;
+
++ ASSERT_RTNL();
++
+ write_lock_bh(&idev->lock);
+ prev_aca = NULL;
+ for (aca = idev->ac_list; aca; aca = aca->aca_next) {
+diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
+index cb4459bd1d29..97b9fa8de377 100644
+--- a/net/ipv6/ip6_fib.c
++++ b/net/ipv6/ip6_fib.c
+@@ -643,7 +643,7 @@ static int fib6_commit_metrics(struct dst_entry *dst,
+ if (dst->flags & DST_HOST) {
+ mp = dst_metrics_write_ptr(dst);
+ } else {
+- mp = kzalloc(sizeof(u32) * RTAX_MAX, GFP_KERNEL);
++ mp = kzalloc(sizeof(u32) * RTAX_MAX, GFP_ATOMIC);
+ if (!mp)
+ return -ENOMEM;
+ dst_init_metrics(dst, mp, 0);
+@@ -1605,6 +1605,24 @@ static void fib6_prune_clones(struct net *net, struct fib6_node *fn)
+ fib6_clean_tree(net, fn, fib6_prune_clone, 1, NULL);
+ }
+
++static int fib6_update_sernum(struct rt6_info *rt, void *arg)
++{
++ __u32 sernum = *(__u32 *)arg;
++
++ if (rt->rt6i_node &&
++ rt->rt6i_node->fn_sernum != sernum)
++ rt->rt6i_node->fn_sernum = sernum;
++
++ return 0;
++}
++
++static void fib6_flush_trees(struct net *net)
++{
++ __u32 new_sernum = fib6_new_sernum();
++
++ fib6_clean_all(net, fib6_update_sernum, &new_sernum);
++}
++
+ /*
+ * Garbage collection
+ */
+@@ -1788,6 +1806,8 @@ int __init fib6_init(void)
+ NULL);
+ if (ret)
+ goto out_unregister_subsys;
++
++ __fib6_flush_trees = fib6_flush_trees;
+ out:
+ return ret;
+
+diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
+index 3873181ed856..43bc1fc24621 100644
+--- a/net/ipv6/ip6_gre.c
++++ b/net/ipv6/ip6_gre.c
+@@ -778,7 +778,7 @@ static inline int ip6gre_xmit_ipv4(struct sk_buff *skb, struct net_device *dev)
+ encap_limit = t->parms.encap_limit;
+
+ memcpy(&fl6, &t->fl.u.ip6, sizeof(fl6));
+- fl6.flowi6_proto = IPPROTO_IPIP;
++ fl6.flowi6_proto = IPPROTO_GRE;
+
+ dsfield = ipv4_get_dsfield(iph);
+
+@@ -828,7 +828,7 @@ static inline int ip6gre_xmit_ipv6(struct sk_buff *skb, struct net_device *dev)
+ encap_limit = t->parms.encap_limit;
+
+ memcpy(&fl6, &t->fl.u.ip6, sizeof(fl6));
+- fl6.flowi6_proto = IPPROTO_IPV6;
++ fl6.flowi6_proto = IPPROTO_GRE;
+
+ dsfield = ipv6_get_dsfield(ipv6h);
+ if (t->parms.flags & IP6_TNL_F_USE_ORIG_TCLASS)
+diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
+index 45702b8cd141..59345af6d3a7 100644
+--- a/net/ipv6/ip6_output.c
++++ b/net/ipv6/ip6_output.c
+@@ -1008,7 +1008,7 @@ struct dst_entry *ip6_dst_lookup_flow(struct sock *sk, struct flowi6 *fl6,
+ if (final_dst)
+ fl6->daddr = *final_dst;
+
+- return xfrm_lookup(sock_net(sk), dst, flowi6_to_flowi(fl6), sk, 0);
++ return xfrm_lookup_route(sock_net(sk), dst, flowi6_to_flowi(fl6), sk, 0);
+ }
+ EXPORT_SYMBOL_GPL(ip6_dst_lookup_flow);
+
+@@ -1040,7 +1040,7 @@ struct dst_entry *ip6_sk_dst_lookup_flow(struct sock *sk, struct flowi6 *fl6,
+ if (final_dst)
+ fl6->daddr = *final_dst;
+
+- return xfrm_lookup(sock_net(sk), dst, flowi6_to_flowi(fl6), sk, 0);
++ return xfrm_lookup_route(sock_net(sk), dst, flowi6_to_flowi(fl6), sk, 0);
+ }
+ EXPORT_SYMBOL_GPL(ip6_sk_dst_lookup_flow);
+
+diff --git a/net/ipv6/mcast.c b/net/ipv6/mcast.c
+index 617f0958e164..a23b655a7627 100644
+--- a/net/ipv6/mcast.c
++++ b/net/ipv6/mcast.c
+@@ -172,6 +172,7 @@ int ipv6_sock_mc_join(struct sock *sk, int ifindex, const struct in6_addr *addr)
+ mc_lst->next = NULL;
+ mc_lst->addr = *addr;
+
++ rtnl_lock();
+ rcu_read_lock();
+ if (ifindex == 0) {
+ struct rt6_info *rt;
+@@ -185,6 +186,7 @@ int ipv6_sock_mc_join(struct sock *sk, int ifindex, const struct in6_addr *addr)
+
+ if (dev == NULL) {
+ rcu_read_unlock();
++ rtnl_unlock();
+ sock_kfree_s(sk, mc_lst, sizeof(*mc_lst));
+ return -ENODEV;
+ }
+@@ -202,6 +204,7 @@ int ipv6_sock_mc_join(struct sock *sk, int ifindex, const struct in6_addr *addr)
+
+ if (err) {
+ rcu_read_unlock();
++ rtnl_unlock();
+ sock_kfree_s(sk, mc_lst, sizeof(*mc_lst));
+ return err;
+ }
+@@ -212,6 +215,7 @@ int ipv6_sock_mc_join(struct sock *sk, int ifindex, const struct in6_addr *addr)
+ spin_unlock(&ipv6_sk_mc_lock);
+
+ rcu_read_unlock();
++ rtnl_unlock();
+
+ return 0;
+ }
+@@ -229,6 +233,7 @@ int ipv6_sock_mc_drop(struct sock *sk, int ifindex, const struct in6_addr *addr)
+ if (!ipv6_addr_is_multicast(addr))
+ return -EINVAL;
+
++ rtnl_lock();
+ spin_lock(&ipv6_sk_mc_lock);
+ for (lnk = &np->ipv6_mc_list;
+ (mc_lst = rcu_dereference_protected(*lnk,
+@@ -252,12 +257,15 @@ int ipv6_sock_mc_drop(struct sock *sk, int ifindex, const struct in6_addr *addr)
+ } else
+ (void) ip6_mc_leave_src(sk, mc_lst, NULL);
+ rcu_read_unlock();
++ rtnl_unlock();
++
+ atomic_sub(sizeof(*mc_lst), &sk->sk_omem_alloc);
+ kfree_rcu(mc_lst, rcu);
+ return 0;
+ }
+ }
+ spin_unlock(&ipv6_sk_mc_lock);
++ rtnl_unlock();
+
+ return -EADDRNOTAVAIL;
+ }
+@@ -302,6 +310,7 @@ void ipv6_sock_mc_close(struct sock *sk)
+ if (!rcu_access_pointer(np->ipv6_mc_list))
+ return;
+
++ rtnl_lock();
+ spin_lock(&ipv6_sk_mc_lock);
+ while ((mc_lst = rcu_dereference_protected(np->ipv6_mc_list,
+ lockdep_is_held(&ipv6_sk_mc_lock))) != NULL) {
+@@ -328,6 +337,7 @@ void ipv6_sock_mc_close(struct sock *sk)
+ spin_lock(&ipv6_sk_mc_lock);
+ }
+ spin_unlock(&ipv6_sk_mc_lock);
++ rtnl_unlock();
+ }
+
+ int ip6_mc_source(int add, int omode, struct sock *sk,
+@@ -845,6 +855,8 @@ int ipv6_dev_mc_inc(struct net_device *dev, const struct in6_addr *addr)
+ struct ifmcaddr6 *mc;
+ struct inet6_dev *idev;
+
++ ASSERT_RTNL();
++
+ /* we need to take a reference on idev */
+ idev = in6_dev_get(dev);
+
+@@ -916,6 +928,8 @@ int __ipv6_dev_mc_dec(struct inet6_dev *idev, const struct in6_addr *addr)
+ {
+ struct ifmcaddr6 *ma, **map;
+
++ ASSERT_RTNL();
++
+ write_lock_bh(&idev->lock);
+ for (map = &idev->mc_list; (ma=*map) != NULL; map = &ma->next) {
+ if (ipv6_addr_equal(&ma->mca_addr, addr)) {
+diff --git a/net/ipv6/route.c b/net/ipv6/route.c
+index f23fbd28a501..bafde82324c5 100644
+--- a/net/ipv6/route.c
++++ b/net/ipv6/route.c
+@@ -314,7 +314,6 @@ static inline struct rt6_info *ip6_dst_alloc(struct net *net,
+
+ memset(dst + 1, 0, sizeof(*rt) - sizeof(*dst));
+ rt6_init_peer(rt, table ? &table->tb6_peers : net->ipv6.peers);
+- rt->rt6i_genid = rt_genid_ipv6(net);
+ INIT_LIST_HEAD(&rt->rt6i_siblings);
+ }
+ return rt;
+@@ -1098,9 +1097,6 @@ static struct dst_entry *ip6_dst_check(struct dst_entry *dst, u32 cookie)
+ * DST_OBSOLETE_FORCE_CHK which forces validation calls down
+ * into this function always.
+ */
+- if (rt->rt6i_genid != rt_genid_ipv6(dev_net(rt->dst.dev)))
+- return NULL;
+-
+ if (!rt->rt6i_node || (rt->rt6i_node->fn_sernum != cookie))
+ return NULL;
+
+diff --git a/net/ipv6/sit.c b/net/ipv6/sit.c
+index 4f408176dc64..9906535ce9de 100644
+--- a/net/ipv6/sit.c
++++ b/net/ipv6/sit.c
+@@ -101,19 +101,19 @@ static struct ip_tunnel *ipip6_tunnel_lookup(struct net *net,
+ for_each_ip_tunnel_rcu(t, sitn->tunnels_r_l[h0 ^ h1]) {
+ if (local == t->parms.iph.saddr &&
+ remote == t->parms.iph.daddr &&
+- (!dev || !t->parms.link || dev->iflink == t->parms.link) &&
++ (!dev || !t->parms.link || dev->ifindex == t->parms.link) &&
+ (t->dev->flags & IFF_UP))
+ return t;
+ }
+ for_each_ip_tunnel_rcu(t, sitn->tunnels_r[h0]) {
+ if (remote == t->parms.iph.daddr &&
+- (!dev || !t->parms.link || dev->iflink == t->parms.link) &&
++ (!dev || !t->parms.link || dev->ifindex == t->parms.link) &&
+ (t->dev->flags & IFF_UP))
+ return t;
+ }
+ for_each_ip_tunnel_rcu(t, sitn->tunnels_l[h1]) {
+ if (local == t->parms.iph.saddr &&
+- (!dev || !t->parms.link || dev->iflink == t->parms.link) &&
++ (!dev || !t->parms.link || dev->ifindex == t->parms.link) &&
+ (t->dev->flags & IFF_UP))
+ return t;
+ }
+diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
+index 229239ad96b1..cb5125c5328d 100644
+--- a/net/ipv6/tcp_ipv6.c
++++ b/net/ipv6/tcp_ipv6.c
+@@ -1681,6 +1681,7 @@ static const struct inet_connection_sock_af_ops ipv6_specific = {
+ .compat_setsockopt = compat_ipv6_setsockopt,
+ .compat_getsockopt = compat_ipv6_getsockopt,
+ #endif
++ .mtu_reduced = tcp_v6_mtu_reduced,
+ };
+
+ #ifdef CONFIG_TCP_MD5SIG
+@@ -1711,6 +1712,7 @@ static const struct inet_connection_sock_af_ops ipv6_mapped = {
+ .compat_setsockopt = compat_ipv6_setsockopt,
+ .compat_getsockopt = compat_ipv6_getsockopt,
+ #endif
++ .mtu_reduced = tcp_v4_mtu_reduced,
+ };
+
+ #ifdef CONFIG_TCP_MD5SIG
+@@ -1950,7 +1952,6 @@ struct proto tcpv6_prot = {
+ .sendpage = tcp_sendpage,
+ .backlog_rcv = tcp_v6_do_rcv,
+ .release_cb = tcp_release_cb,
+- .mtu_reduced = tcp_v6_mtu_reduced,
+ .hash = tcp_v6_hash,
+ .unhash = inet_unhash,
+ .get_port = inet_csk_get_port,
+diff --git a/net/l2tp/l2tp_ppp.c b/net/l2tp/l2tp_ppp.c
+index 13752d96275e..b704a9356208 100644
+--- a/net/l2tp/l2tp_ppp.c
++++ b/net/l2tp/l2tp_ppp.c
+@@ -755,7 +755,8 @@ static int pppol2tp_connect(struct socket *sock, struct sockaddr *uservaddr,
+ /* If PMTU discovery was enabled, use the MTU that was discovered */
+ dst = sk_dst_get(tunnel->sock);
+ if (dst != NULL) {
+- u32 pmtu = dst_mtu(__sk_dst_get(tunnel->sock));
++ u32 pmtu = dst_mtu(dst);
++
+ if (pmtu != 0)
+ session->mtu = session->mru = pmtu -
+ PPPOL2TP_HEADER_OVERHEAD;
+diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
+index e6fac7e3db52..48fc607a211e 100644
+--- a/net/netlink/af_netlink.c
++++ b/net/netlink/af_netlink.c
+@@ -205,7 +205,7 @@ static int __netlink_deliver_tap_skb(struct sk_buff *skb,
+ nskb->protocol = htons((u16) sk->sk_protocol);
+ nskb->pkt_type = netlink_is_kernel(sk) ?
+ PACKET_KERNEL : PACKET_USER;
+-
++ skb_reset_network_header(nskb);
+ ret = dev_queue_xmit(nskb);
+ if (unlikely(ret > 0))
+ ret = net_xmit_errno(ret);
+diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
+index e70d8b18e962..10736e6b192b 100644
+--- a/net/openvswitch/actions.c
++++ b/net/openvswitch/actions.c
+@@ -42,6 +42,9 @@ static int do_execute_actions(struct datapath *dp, struct sk_buff *skb,
+
+ static int make_writable(struct sk_buff *skb, int write_len)
+ {
++ if (!pskb_may_pull(skb, write_len))
++ return -ENOMEM;
++
+ if (!skb_cloned(skb) || skb_clone_writable(skb, write_len))
+ return 0;
+
+@@ -70,6 +73,8 @@ static int __pop_vlan_tci(struct sk_buff *skb, __be16 *current_tci)
+
+ vlan_set_encap_proto(skb, vhdr);
+ skb->mac_header += VLAN_HLEN;
++ if (skb_network_offset(skb) < ETH_HLEN)
++ skb_set_network_header(skb, ETH_HLEN);
+ skb_reset_mac_len(skb);
+
+ return 0;
+diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
+index b85c67ccb797..3eb786fd3f22 100644
+--- a/net/packet/af_packet.c
++++ b/net/packet/af_packet.c
+@@ -636,6 +636,7 @@ static void init_prb_bdqc(struct packet_sock *po,
+ p1->tov_in_jiffies = msecs_to_jiffies(p1->retire_blk_tov);
+ p1->blk_sizeof_priv = req_u->req3.tp_sizeof_priv;
+
++ p1->max_frame_len = p1->kblk_size - BLK_PLUS_PRIV(p1->blk_sizeof_priv);
+ prb_init_ft_ops(p1, req_u);
+ prb_setup_retire_blk_timer(po, tx_ring);
+ prb_open_block(p1, pbd);
+@@ -1946,6 +1947,18 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
+ if ((int)snaplen < 0)
+ snaplen = 0;
+ }
++ } else if (unlikely(macoff + snaplen >
++ GET_PBDQC_FROM_RB(&po->rx_ring)->max_frame_len)) {
++ u32 nval;
++
++ nval = GET_PBDQC_FROM_RB(&po->rx_ring)->max_frame_len - macoff;
++ pr_err_once("tpacket_rcv: packet too big, clamped from %u to %u. macoff=%u\n",
++ snaplen, nval, macoff);
++ snaplen = nval;
++ if (unlikely((int)snaplen < 0)) {
++ snaplen = 0;
++ macoff = GET_PBDQC_FROM_RB(&po->rx_ring)->max_frame_len;
++ }
+ }
+ spin_lock(&sk->sk_receive_queue.lock);
+ h.raw = packet_current_rx_frame(po, skb,
+@@ -3789,6 +3802,10 @@ static int packet_set_ring(struct sock *sk, union tpacket_req_u *req_u,
+ goto out;
+ if (unlikely(req->tp_block_size & (PAGE_SIZE - 1)))
+ goto out;
++ if (po->tp_version >= TPACKET_V3 &&
++ (int)(req->tp_block_size -
++ BLK_PLUS_PRIV(req_u->req3.tp_sizeof_priv)) <= 0)
++ goto out;
+ if (unlikely(req->tp_frame_size < po->tp_hdrlen +
+ po->tp_reserve))
+ goto out;
+diff --git a/net/packet/internal.h b/net/packet/internal.h
+index eb9580a6b25f..cdddf6a30399 100644
+--- a/net/packet/internal.h
++++ b/net/packet/internal.h
+@@ -29,6 +29,7 @@ struct tpacket_kbdq_core {
+ char *pkblk_start;
+ char *pkblk_end;
+ int kblk_size;
++ unsigned int max_frame_len;
+ unsigned int knum_blocks;
+ uint64_t knxt_seq_num;
+ char *prev;
+diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
+index 45527e6b52db..3b2617aa6bcd 100644
+--- a/net/sched/cls_api.c
++++ b/net/sched/cls_api.c
+@@ -549,6 +549,7 @@ void tcf_exts_change(struct tcf_proto *tp, struct tcf_exts *dst,
+ tcf_tree_lock(tp);
+ list_splice_init(&dst->actions, &tmp);
+ list_splice(&src->actions, &dst->actions);
++ dst->type = src->type;
+ tcf_tree_unlock(tp);
+ tcf_action_destroy(&tmp, TCA_ACT_UNBIND);
+ #endif
+diff --git a/net/sctp/sm_statefuns.c b/net/sctp/sm_statefuns.c
+index 5170a1ff95a1..7194fe8589b0 100644
+--- a/net/sctp/sm_statefuns.c
++++ b/net/sctp/sm_statefuns.c
+@@ -1775,9 +1775,22 @@ static sctp_disposition_t sctp_sf_do_dupcook_a(struct net *net,
+ /* Update the content of current association. */
+ sctp_add_cmd_sf(commands, SCTP_CMD_UPDATE_ASSOC, SCTP_ASOC(new_asoc));
+ sctp_add_cmd_sf(commands, SCTP_CMD_EVENT_ULP, SCTP_ULPEVENT(ev));
+- sctp_add_cmd_sf(commands, SCTP_CMD_NEW_STATE,
+- SCTP_STATE(SCTP_STATE_ESTABLISHED));
+- sctp_add_cmd_sf(commands, SCTP_CMD_REPLY, SCTP_CHUNK(repl));
++ if (sctp_state(asoc, SHUTDOWN_PENDING) &&
++ (sctp_sstate(asoc->base.sk, CLOSING) ||
++ sock_flag(asoc->base.sk, SOCK_DEAD))) {
++ /* if we're currently in SHUTDOWN_PENDING, but the socket
++ * has been closed by user, don't transition to ESTABLISHED.
++ * Instead trigger SHUTDOWN bundled with COOKIE_ACK.
++ */
++ sctp_add_cmd_sf(commands, SCTP_CMD_REPLY, SCTP_CHUNK(repl));
++ return sctp_sf_do_9_2_start_shutdown(net, ep, asoc,
++ SCTP_ST_CHUNK(0), NULL,
++ commands);
++ } else {
++ sctp_add_cmd_sf(commands, SCTP_CMD_NEW_STATE,
++ SCTP_STATE(SCTP_STATE_ESTABLISHED));
++ sctp_add_cmd_sf(commands, SCTP_CMD_REPLY, SCTP_CHUNK(repl));
++ }
+ return SCTP_DISPOSITION_CONSUME;
+
+ nomem_ev:
+diff --git a/net/tipc/port.h b/net/tipc/port.h
+index cf4ca5b1d9a4..3f34cac07a2c 100644
+--- a/net/tipc/port.h
++++ b/net/tipc/port.h
+@@ -229,9 +229,12 @@ static inline int tipc_port_importance(struct tipc_port *port)
+ return msg_importance(&port->phdr);
+ }
+
+-static inline void tipc_port_set_importance(struct tipc_port *port, int imp)
++static inline int tipc_port_set_importance(struct tipc_port *port, int imp)
+ {
++ if (imp > TIPC_CRITICAL_IMPORTANCE)
++ return -EINVAL;
+ msg_set_importance(&port->phdr, (u32)imp);
++ return 0;
+ }
+
+ #endif
+diff --git a/net/tipc/socket.c b/net/tipc/socket.c
+index ef0475568f9e..4093fd81edd5 100644
+--- a/net/tipc/socket.c
++++ b/net/tipc/socket.c
+@@ -1841,7 +1841,7 @@ static int tipc_setsockopt(struct socket *sock, int lvl, int opt,
+
+ switch (opt) {
+ case TIPC_IMPORTANCE:
+- tipc_port_set_importance(port, value);
++ res = tipc_port_set_importance(port, value);
+ break;
+ case TIPC_SRC_DROPPABLE:
+ if (sock->type != SOCK_STREAM)
+diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
+index 0525d78ba328..93e755b97486 100644
+--- a/net/xfrm/xfrm_policy.c
++++ b/net/xfrm/xfrm_policy.c
+@@ -39,6 +39,11 @@
+ #define XFRM_QUEUE_TMO_MAX ((unsigned)(60*HZ))
+ #define XFRM_MAX_QUEUE_LEN 100
+
++struct xfrm_flo {
++ struct dst_entry *dst_orig;
++ u8 flags;
++};
++
+ static DEFINE_SPINLOCK(xfrm_policy_afinfo_lock);
+ static struct xfrm_policy_afinfo __rcu *xfrm_policy_afinfo[NPROTO]
+ __read_mostly;
+@@ -1877,13 +1882,14 @@ static int xdst_queue_output(struct sock *sk, struct sk_buff *skb)
+ }
+
+ static struct xfrm_dst *xfrm_create_dummy_bundle(struct net *net,
+- struct dst_entry *dst,
++ struct xfrm_flo *xflo,
+ const struct flowi *fl,
+ int num_xfrms,
+ u16 family)
+ {
+ int err;
+ struct net_device *dev;
++ struct dst_entry *dst;
+ struct dst_entry *dst1;
+ struct xfrm_dst *xdst;
+
+@@ -1891,9 +1897,12 @@ static struct xfrm_dst *xfrm_create_dummy_bundle(struct net *net,
+ if (IS_ERR(xdst))
+ return xdst;
+
+- if (net->xfrm.sysctl_larval_drop || num_xfrms <= 0)
++ if (!(xflo->flags & XFRM_LOOKUP_QUEUE) ||
++ net->xfrm.sysctl_larval_drop ||
++ num_xfrms <= 0)
+ return xdst;
+
++ dst = xflo->dst_orig;
+ dst1 = &xdst->u.dst;
+ dst_hold(dst);
+ xdst->route = dst;
+@@ -1935,7 +1944,7 @@ static struct flow_cache_object *
+ xfrm_bundle_lookup(struct net *net, const struct flowi *fl, u16 family, u8 dir,
+ struct flow_cache_object *oldflo, void *ctx)
+ {
+- struct dst_entry *dst_orig = (struct dst_entry *)ctx;
++ struct xfrm_flo *xflo = (struct xfrm_flo *)ctx;
+ struct xfrm_policy *pols[XFRM_POLICY_TYPE_MAX];
+ struct xfrm_dst *xdst, *new_xdst;
+ int num_pols = 0, num_xfrms = 0, i, err, pol_dead;
+@@ -1976,7 +1985,8 @@ xfrm_bundle_lookup(struct net *net, const struct flowi *fl, u16 family, u8 dir,
+ goto make_dummy_bundle;
+ }
+
+- new_xdst = xfrm_resolve_and_create_bundle(pols, num_pols, fl, family, dst_orig);
++ new_xdst = xfrm_resolve_and_create_bundle(pols, num_pols, fl, family,
++ xflo->dst_orig);
+ if (IS_ERR(new_xdst)) {
+ err = PTR_ERR(new_xdst);
+ if (err != -EAGAIN)
+@@ -2010,7 +2020,7 @@ make_dummy_bundle:
+ /* We found policies, but there's no bundles to instantiate:
+ * either because the policy blocks, has no transformations or
+ * we could not build template (no xfrm_states).*/
+- xdst = xfrm_create_dummy_bundle(net, dst_orig, fl, num_xfrms, family);
++ xdst = xfrm_create_dummy_bundle(net, xflo, fl, num_xfrms, family);
+ if (IS_ERR(xdst)) {
+ xfrm_pols_put(pols, num_pols);
+ return ERR_CAST(xdst);
+@@ -2104,13 +2114,18 @@ struct dst_entry *xfrm_lookup(struct net *net, struct dst_entry *dst_orig,
+ }
+
+ if (xdst == NULL) {
++ struct xfrm_flo xflo;
++
++ xflo.dst_orig = dst_orig;
++ xflo.flags = flags;
++
+ /* To accelerate a bit... */
+ if ((dst_orig->flags & DST_NOXFRM) ||
+ !net->xfrm.policy_count[XFRM_POLICY_OUT])
+ goto nopol;
+
+ flo = flow_cache_lookup(net, fl, family, dir,
+- xfrm_bundle_lookup, dst_orig);
++ xfrm_bundle_lookup, &xflo);
+ if (flo == NULL)
+ goto nopol;
+ if (IS_ERR(flo)) {
+@@ -2138,7 +2153,7 @@ struct dst_entry *xfrm_lookup(struct net *net, struct dst_entry *dst_orig,
+ xfrm_pols_put(pols, drop_pols);
+ XFRM_INC_STATS(net, LINUX_MIB_XFRMOUTNOSTATES);
+
+- return make_blackhole(net, family, dst_orig);
++ return ERR_PTR(-EREMOTE);
+ }
+
+ err = -EAGAIN;
+@@ -2195,6 +2210,23 @@ dropdst:
+ }
+ EXPORT_SYMBOL(xfrm_lookup);
+
++/* Callers of xfrm_lookup_route() must ensure a call to dst_output().
++ * Otherwise we may send out blackholed packets.
++ */
++struct dst_entry *xfrm_lookup_route(struct net *net, struct dst_entry *dst_orig,
++ const struct flowi *fl,
++ struct sock *sk, int flags)
++{
++ struct dst_entry *dst = xfrm_lookup(net, dst_orig, fl, sk,
++ flags | XFRM_LOOKUP_QUEUE);
++
++ if (IS_ERR(dst) && PTR_ERR(dst) == -EREMOTE)
++ return make_blackhole(net, dst_orig->ops->family, dst_orig);
++
++ return dst;
++}
++EXPORT_SYMBOL(xfrm_lookup_route);
++
+ static inline int
+ xfrm_secpath_reject(int idx, struct sk_buff *skb, const struct flowi *fl)
+ {
+@@ -2460,7 +2492,7 @@ int __xfrm_route_forward(struct sk_buff *skb, unsigned short family)
+
+ skb_dst_force(skb);
+
+- dst = xfrm_lookup(net, skb_dst(skb), &fl, NULL, 0);
++ dst = xfrm_lookup(net, skb_dst(skb), &fl, NULL, XFRM_LOOKUP_QUEUE);
+ if (IS_ERR(dst)) {
+ res = 0;
+ dst = NULL;
* [gentoo-commits] proj/linux-patches:3.16 commit in: /
@ 2014-10-30 19:29 Mike Pagano
0 siblings, 0 replies; 26+ messages in thread
From: Mike Pagano @ 2014-10-30 19:29 UTC (permalink / raw
To: gentoo-commits
commit: 5ca4fd40116dd22e8caab91c470be1860fe0141d
Author: Mike Pagano <mpagano <AT> gentoo <DOT> org>
AuthorDate: Thu Oct 30 19:29:00 2014 +0000
Commit: Mike Pagano <mpagano <AT> gentoo <DOT> org>
CommitDate: Thu Oct 30 19:29:00 2014 +0000
URL: http://sources.gentoo.org/gitweb/?p=proj/linux-patches.git;a=commit;h=5ca4fd40
Linux patch 3.16.7
---
0000_README | 4 +
1006_linux-3.16.7.patch | 6873 +++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 6877 insertions(+)
diff --git a/0000_README b/0000_README
index a7526a7..9bf3b17 100644
--- a/0000_README
+++ b/0000_README
@@ -66,6 +66,10 @@ Patch: 1005_linux-3.16.6.patch
From: http://www.kernel.org
Desc: Linux 3.16.6
+Patch: 1006_linux-3.16.7.patch
+From: http://www.kernel.org
+Desc: Linux 3.16.7
+
Patch: 1500_XATTR_USER_PREFIX.patch
From: https://bugs.gentoo.org/show_bug.cgi?id=470644
Desc: Support for namespace user.pax.* on tmpfs.
diff --git a/1006_linux-3.16.7.patch b/1006_linux-3.16.7.patch
new file mode 100644
index 0000000..9776e1b
--- /dev/null
+++ b/1006_linux-3.16.7.patch
@@ -0,0 +1,6873 @@
+diff --git a/Documentation/lzo.txt b/Documentation/lzo.txt
+new file mode 100644
+index 000000000000..ea45dd3901e3
+--- /dev/null
++++ b/Documentation/lzo.txt
+@@ -0,0 +1,164 @@
++
++LZO stream format as understood by Linux's LZO decompressor
++===========================================================
++
++Introduction
++
++ This is not a specification. No specification seems to be publicly available
++ for the LZO stream format. This document describes what input format the LZO
++ decompressor as implemented in the Linux kernel understands. The file under
++ analysis is lib/lzo/lzo1x_decompress_safe.c. No analysis was made of the
++ compressor or of any other implementation, though it seems likely that
++ the format matches the standard one. The purpose of this document is to
++ better understand what the code does in order to propose more efficient fixes
++ for future bug reports.
++
++Description
++
++ The stream is composed of a series of instructions, operands, and data. The
++ instructions consist of a few bits representing an opcode, and bits forming
++ the operands for the instruction, whose size and position depend on the
++ opcode and on the number of literals copied by the previous instruction. The
++ operands are used to indicate :
++
++ - a distance when copying data from the dictionary (past output buffer)
++ - a length (number of bytes to copy from dictionary)
++ - the number of literals to copy, which is retained in variable "state"
++ as a piece of information for the next instructions.
++
++ Depending on the opcode and operands, extra data may optionally follow. This
++ extra data can complement the operand (e.g. a length or a distance encoded
++ over a larger range), or be a literal to be copied to the output buffer.
++
++ The first byte of the block follows a different encoding from other bytes; it
++ seems to be optimized for literal use only, since there is no dictionary yet
++ prior to that byte.
++
++ Lengths are always encoded with a variable size, starting with a small number
++ of bits in the operand. If the number of bits isn't enough to represent the
++ length, up to 255 may be added in increments by consuming more bytes with a
++ rate of at most 255 per extra byte (thus the compression ratio cannot exceed
++ around 255:1). The variable length encoding using #bits is always the same :
++
++ length = byte & ((1 << #bits) - 1)
++ if (!length) {
++ length = ((1 << #bits) - 1)
++ length += 255*(number of zero bytes)
++ length += first-non-zero-byte
++ }
++ length += constant (generally 2 or 3)
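As a sanity check, the variable-length scheme above can be sketched in a few lines of Python. This is a hypothetical helper, not kernel code; the `bits` and `constant` values depend on the opcode as described in the text.

```python
def decode_length(byte, bits, data, pos, constant):
    """Decode an LZO variable-size length field.

    `byte` is the instruction byte, `bits` the number of low bits that
    hold the length, `data`/`pos` the bytes that follow in the stream,
    `constant` the opcode-specific offset (generally 2 or 3).
    Returns (length, new_pos).
    """
    length = byte & ((1 << bits) - 1)
    if length == 0:
        length = (1 << bits) - 1
        # each zero byte consumed adds 255 to the length
        while data[pos] == 0:
            length += 255
            pos += 1
        length += data[pos]  # the first non-zero byte ends the run
        pos += 1
    return length + constant, pos
```

With 5 length bits and a constant of 2, the byte 0x25 decodes directly to 7, while a zero operand falls back to consuming extra stream bytes, which is why the ratio cannot exceed roughly 255:1.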
++
++ For references to the dictionary, distances are relative to the output
++ pointer. Distances are encoded using very few bits belonging to certain
++ ranges, resulting in multiple copy instructions using different encodings.
++ Certain encodings involve one extra byte, others involve two extra bytes
++ forming a little-endian 16-bit quantity (marked LE16 below).
++
++ After any instruction except the large literal copy, 0, 1, 2 or 3 literals
++ are copied before starting the next instruction. The number of literals that
++ were copied may change the meaning and behaviour of the next instruction. In
++ practice, only one instruction needs to know whether 0, less than 4, or more
++ literals were copied. This is the information stored in the <state> variable
++ in this implementation. This number of immediate literals to be copied is
++ generally encoded in the last two bits of the instruction but may also be
++ taken from the last two bits of an extra operand (eg: distance).
++
++ End of stream is declared when a block copy of distance 0 is seen. Only one
++ instruction may encode this distance (0001HLLL); it takes one LE16 operand
++ for the distance, thus requiring 3 bytes.
++
++ IMPORTANT NOTE : in the code some length checks are missing because certain
++ instructions are called under the assumption that a certain number of bytes
++ follow because it has already been guaranteed before parsing the instructions.
++ They just have to "refill" this credit if they consume extra bytes. This is
++ an implementation design choice independent of the algorithm or encoding.
++
++Byte sequences
++
++ First byte encoding :
++
++ 0..17 : follow regular instruction encoding, see below. It is worth
++ noting that codes 16 and 17 will represent a block copy from
++ the dictionary, which is empty, and that they will always be
++ invalid at this point.
++
++ 18..21 : copy 0..3 literals
++ state = (byte - 17) = 0..3 [ copy <state> literals ]
++ skip byte
++
++ 22..255 : copy literal string
++ length = (byte - 17) = 4..238
++ state = 4 [ don't copy extra literals ]
++ skip byte
++
++ Instruction encoding :
++
++ 0 0 0 0 X X X X (0..15)
++ Depends on the number of literals copied by the last instruction.
++ If the last instruction did not copy any literals (state == 0), this
++ encoding will be a copy of 4 or more literals, and must be interpreted
++ like this :
++
++ 0 0 0 0 L L L L (0..15) : copy long literal string
++ length = 3 + (L ?: 15 + (zero_bytes * 255) + non_zero_byte)
++ state = 4 (no extra literals are copied)
++
++ If the last instruction copied between 1 and 3 literals (encoded in
++ the instruction's opcode or distance), the instruction is a copy of a
++ 2-byte block from the dictionary within a 1kB distance. It is worth
++ noting that this instruction provides little savings since it uses 2
++ bytes to encode a copy of 2 other bytes but it encodes the number of
++ following literals for free. It must be interpreted like this :
++
++ 0 0 0 0 D D S S (0..15) : copy 2 bytes from <= 1kB distance
++ length = 2
++ state = S (copy S literals after this block)
++ Always followed by exactly one byte : H H H H H H H H
++ distance = (H << 2) + D + 1
++
++ If the last instruction copied 4 or more literals (as detected by
++ state == 4), the instruction becomes a copy of a 3-byte block from the
++ dictionary from a 2..3kB distance, and must be interpreted like this :
++
++ 0 0 0 0 D D S S (0..15) : copy 3 bytes from 2..3 kB distance
++ length = 3
++ state = S (copy S literals after this block)
++ Always followed by exactly one byte : H H H H H H H H
++ distance = (H << 2) + D + 2049
++
++ 0 0 0 1 H L L L (16..31)
++ Copy of a block within 16..48kB distance (preferably less than 10B)
++ length = 2 + (L ?: 7 + (zero_bytes * 255) + non_zero_byte)
++ Always followed by exactly one LE16 : D D D D D D D D : D D D D D D S S
++ distance = 16384 + (H << 14) + D
++ state = S (copy S literals after this block)
++ End of stream is reached if distance == 16384
++
++ 0 0 1 L L L L L (32..63)
++ Copy of small block within 16kB distance (preferably less than 34B)
++ length = 2 + (L ?: 31 + (zero_bytes * 255) + non_zero_byte)
++ Always followed by exactly one LE16 : D D D D D D D D : D D D D D D S S
++ distance = D + 1
++ state = S (copy S literals after this block)
++
++ 0 1 L D D D S S (64..127)
++ Copy 3-4 bytes from block within 2kB distance
++ state = S (copy S literals after this block)
++ length = 3 + L
++ Always followed by exactly one byte : H H H H H H H H
++ distance = (H << 3) + D + 1
++
++ 1 L L D D D S S (128..255)
++ Copy 5-8 bytes from block within 2kB distance
++ state = S (copy S literals after this block)
++ length = 5 + L
++ Always followed by exactly one byte : H H H H H H H H
++ distance = (H << 3) + D + 1
++
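The first-byte rules at the top of the table can be restated as a small dispatcher. This is a hedged sketch that mirrors the formulas given in the text, not the kernel decompressor; the function and label names are made up for illustration.

```python
def classify_first_byte(byte):
    """Map the first byte of an LZO stream to its role, following the
    byte - 17 formulas given in the text. Returns (kind, value)."""
    if byte <= 17:
        # regular instruction encoding; 16 and 17 would reference the
        # (still empty) dictionary and are invalid here
        return ("instruction", None)
    if byte <= 21:
        return ("copy-literals", byte - 17)   # short literal copy
    return ("literal-run", byte - 17)         # long literal string
```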
++Authors
++
++ This document was written by Willy Tarreau <w@1wt.eu> on 2014/07/19 during an
++ analysis of the decompression code available in Linux 3.16-rc5. The code is
++ tricky, it is possible that this document contains mistakes or that a few
++ corner cases were overlooked. In any case, please report any doubts, fixes, or
++ proposed updates to the author(s) so that the document can be updated.
+diff --git a/Documentation/virtual/kvm/mmu.txt b/Documentation/virtual/kvm/mmu.txt
+index 290894176142..53838d9c6295 100644
+--- a/Documentation/virtual/kvm/mmu.txt
++++ b/Documentation/virtual/kvm/mmu.txt
+@@ -425,6 +425,20 @@ fault through the slow path.
+ Since only 19 bits are used to store generation-number on mmio spte, all
+ pages are zapped when there is an overflow.
+
++Unfortunately, a single memory access might access kvm_memslots(kvm) multiple
++times; the last access happens when the generation number is retrieved and
++stored into the MMIO spte. Thus, the MMIO spte might be created based on
++out-of-date information, but with an up-to-date generation number.
++
++To avoid this, the generation number is incremented again after synchronize_srcu
++returns; thus, the low bit of kvm_memslots(kvm)->generation is only 1 during a
++memslot update, while some SRCU readers might be using the old copy. We do not
++want to use MMIO sptes created with an odd generation number, and we can do
++this without losing a bit in the MMIO spte. The low bit of the generation
++is not stored in MMIO spte, and presumed zero when it is extracted out of the
++spte. If KVM is unlucky and creates an MMIO spte while the low bit is 1,
++the next access to the spte will always be a cache miss.
++
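The low-bit trick can be modelled in a few lines. This is a toy illustration under the assumptions above, not the KVM implementation.

```python
def spte_gen_at_creation(gen):
    # The low bit of the generation is not stored in the MMIO spte,
    # and is presumed zero when extracted back out.
    return gen & ~1

def spte_matches(stored_gen, current_gen):
    # A spte created while the low bit was 1 (memslot update in
    # progress) can therefore never match a later generation check:
    # the access misses and takes the slow path.
    return stored_gen == current_gen
```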
+
+ Further reading
+ ===============
+diff --git a/Makefile b/Makefile
+index 5c4bc3fc18c0..29ba21cde7c0 100644
+--- a/Makefile
++++ b/Makefile
+@@ -1,6 +1,6 @@
+ VERSION = 3
+ PATCHLEVEL = 16
+-SUBLEVEL = 6
++SUBLEVEL = 7
+ EXTRAVERSION =
+ NAME = Museum of Fishiegoodies
+
+diff --git a/arch/arm/boot/dts/Makefile b/arch/arm/boot/dts/Makefile
+index adb5ed9e269e..c04db0ae0895 100644
+--- a/arch/arm/boot/dts/Makefile
++++ b/arch/arm/boot/dts/Makefile
+@@ -137,8 +137,8 @@ kirkwood := \
+ kirkwood-openrd-client.dtb \
+ kirkwood-openrd-ultimate.dtb \
+ kirkwood-rd88f6192.dtb \
+- kirkwood-rd88f6281-a0.dtb \
+- kirkwood-rd88f6281-a1.dtb \
++ kirkwood-rd88f6281-z0.dtb \
++ kirkwood-rd88f6281-a.dtb \
+ kirkwood-rs212.dtb \
+ kirkwood-rs409.dtb \
+ kirkwood-rs411.dtb \
+diff --git a/arch/arm/boot/dts/armada-370-netgear-rn102.dts b/arch/arm/boot/dts/armada-370-netgear-rn102.dts
+index d6d572e5af32..285524fb915e 100644
+--- a/arch/arm/boot/dts/armada-370-netgear-rn102.dts
++++ b/arch/arm/boot/dts/armada-370-netgear-rn102.dts
+@@ -143,6 +143,10 @@
+ marvell,nand-enable-arbiter;
+ nand-on-flash-bbt;
+
++ /* Use Hardware BCH ECC */
++ nand-ecc-strength = <4>;
++ nand-ecc-step-size = <512>;
++
+ partition@0 {
+ label = "u-boot";
+ reg = <0x0000000 0x180000>; /* 1.5MB */
+diff --git a/arch/arm/boot/dts/armada-370-netgear-rn104.dts b/arch/arm/boot/dts/armada-370-netgear-rn104.dts
+index c5fe8b5dcdc7..4ec1ce561d34 100644
+--- a/arch/arm/boot/dts/armada-370-netgear-rn104.dts
++++ b/arch/arm/boot/dts/armada-370-netgear-rn104.dts
+@@ -145,6 +145,10 @@
+ marvell,nand-enable-arbiter;
+ nand-on-flash-bbt;
+
++ /* Use Hardware BCH ECC */
++ nand-ecc-strength = <4>;
++ nand-ecc-step-size = <512>;
++
+ partition@0 {
+ label = "u-boot";
+ reg = <0x0000000 0x180000>; /* 1.5MB */
+diff --git a/arch/arm/boot/dts/armada-xp-netgear-rn2120.dts b/arch/arm/boot/dts/armada-xp-netgear-rn2120.dts
+index 0cf999abc4ed..c5ed85a70ed9 100644
+--- a/arch/arm/boot/dts/armada-xp-netgear-rn2120.dts
++++ b/arch/arm/boot/dts/armada-xp-netgear-rn2120.dts
+@@ -223,6 +223,10 @@
+ marvell,nand-enable-arbiter;
+ nand-on-flash-bbt;
+
++ /* Use Hardware BCH ECC */
++ nand-ecc-strength = <4>;
++ nand-ecc-step-size = <512>;
++
+ partition@0 {
+ label = "u-boot";
+ reg = <0x0000000 0x180000>; /* 1.5MB */
+diff --git a/arch/arm/boot/dts/at91sam9263.dtsi b/arch/arm/boot/dts/at91sam9263.dtsi
+index fece8665fb63..b8f234bf7de8 100644
+--- a/arch/arm/boot/dts/at91sam9263.dtsi
++++ b/arch/arm/boot/dts/at91sam9263.dtsi
+@@ -535,6 +535,7 @@
+ compatible = "atmel,hsmci";
+ reg = <0xfff80000 0x600>;
+ interrupts = <10 IRQ_TYPE_LEVEL_HIGH 0>;
++ pinctrl-names = "default";
+ #address-cells = <1>;
+ #size-cells = <0>;
+ status = "disabled";
+@@ -544,6 +545,7 @@
+ compatible = "atmel,hsmci";
+ reg = <0xfff84000 0x600>;
+ interrupts = <11 IRQ_TYPE_LEVEL_HIGH 0>;
++ pinctrl-names = "default";
+ #address-cells = <1>;
+ #size-cells = <0>;
+ status = "disabled";
+diff --git a/arch/arm/boot/dts/imx28-evk.dts b/arch/arm/boot/dts/imx28-evk.dts
+index e4cc44c98585..41a983405e7d 100644
+--- a/arch/arm/boot/dts/imx28-evk.dts
++++ b/arch/arm/boot/dts/imx28-evk.dts
+@@ -193,7 +193,6 @@
+ i2c0: i2c@80058000 {
+ pinctrl-names = "default";
+ pinctrl-0 = <&i2c0_pins_a>;
+- clock-frequency = <400000>;
+ status = "okay";
+
+ sgtl5000: codec@0a {
+diff --git a/arch/arm/boot/dts/kirkwood-mv88f6281gtw-ge.dts b/arch/arm/boot/dts/kirkwood-mv88f6281gtw-ge.dts
+index 8f76d28759a3..f82827d6fcff 100644
+--- a/arch/arm/boot/dts/kirkwood-mv88f6281gtw-ge.dts
++++ b/arch/arm/boot/dts/kirkwood-mv88f6281gtw-ge.dts
+@@ -123,11 +123,11 @@
+
+ dsa@0 {
+ compatible = "marvell,dsa";
+- #address-cells = <2>;
++ #address-cells = <1>;
+ #size-cells = <0>;
+
+- dsa,ethernet = <ð0>;
+- dsa,mii-bus = <ðphy0>;
++ dsa,ethernet = <ð0port>;
++ dsa,mii-bus = <&mdio>;
+
+ switch@0 {
+ #address-cells = <1>;
+@@ -169,17 +169,13 @@
+
+ &mdio {
+ status = "okay";
+-
+- ethphy0: ethernet-phy@ff {
+- reg = <0xff>; /* No phy attached */
+- speed = <1000>;
+- duplex = <1>;
+- };
+ };
+
+ ð0 {
+ status = "okay";
++
+ ethernet0-port@0 {
+- phy-handle = <ðphy0>;
++ speed = <1000>;
++ duplex = <1>;
+ };
+ };
+diff --git a/arch/arm/boot/dts/kirkwood-rd88f6281-a.dts b/arch/arm/boot/dts/kirkwood-rd88f6281-a.dts
+new file mode 100644
+index 000000000000..f2e08b3b33ea
+--- /dev/null
++++ b/arch/arm/boot/dts/kirkwood-rd88f6281-a.dts
+@@ -0,0 +1,43 @@
++/*
++ * Marvell RD88F6181 A Board description
++ *
++ * Andrew Lunn <andrew@lunn.ch>
++ *
++ * This file is licensed under the terms of the GNU General Public
++ * License version 2. This program is licensed "as is" without any
++ * warranty of any kind, whether express or implied.
++ *
++ * This file contains the definitions for the board with the A0 or
++ * higher stepping of the SoC. The ethernet switch does not have a
++ * "wan" port.
++ */
++
++/dts-v1/;
++#include "kirkwood-rd88f6281.dtsi"
++
++/ {
++ model = "Marvell RD88f6281 Reference design, with A0 or higher SoC";
++ compatible = "marvell,rd88f6281-a", "marvell,rd88f6281","marvell,kirkwood-88f6281", "marvell,kirkwood";
++
++ dsa@0 {
++ switch@0 {
++ reg = <10 0>; /* MDIO address 10, switch 0 in tree */
++ };
++ };
++};
++
++&mdio {
++ status = "okay";
++
++ ethphy1: ethernet-phy@11 {
++ reg = <11>;
++ };
++};
++
++ð1 {
++ status = "okay";
++
++ ethernet1-port@0 {
++ phy-handle = <ðphy1>;
++ };
++};
+diff --git a/arch/arm/boot/dts/kirkwood-rd88f6281-a0.dts b/arch/arm/boot/dts/kirkwood-rd88f6281-a0.dts
+deleted file mode 100644
+index a803bbb70bc8..000000000000
+--- a/arch/arm/boot/dts/kirkwood-rd88f6281-a0.dts
++++ /dev/null
+@@ -1,26 +0,0 @@
+-/*
+- * Marvell RD88F6181 A0 Board descrition
+- *
+- * Andrew Lunn <andrew@lunn.ch>
+- *
+- * This file is licensed under the terms of the GNU General Public
+- * License version 2. This program is licensed "as is" without any
+- * warranty of any kind, whether express or implied.
+- *
+- * This file contains the definitions for the board with the A0 variant of
+- * the SoC. The ethernet switch does not have a "wan" port.
+- */
+-
+-/dts-v1/;
+-#include "kirkwood-rd88f6281.dtsi"
+-
+-/ {
+- model = "Marvell RD88f6281 Reference design, with A0 SoC";
+- compatible = "marvell,rd88f6281-a0", "marvell,rd88f6281","marvell,kirkwood-88f6281", "marvell,kirkwood";
+-
+- dsa@0 {
+- switch@0 {
+- reg = <10 0>; /* MDIO address 10, switch 0 in tree */
+- };
+- };
+-};
+\ No newline at end of file
+diff --git a/arch/arm/boot/dts/kirkwood-rd88f6281-a1.dts b/arch/arm/boot/dts/kirkwood-rd88f6281-a1.dts
+deleted file mode 100644
+index baeebbf1d8c7..000000000000
+--- a/arch/arm/boot/dts/kirkwood-rd88f6281-a1.dts
++++ /dev/null
+@@ -1,31 +0,0 @@
+-/*
+- * Marvell RD88F6181 A1 Board descrition
+- *
+- * Andrew Lunn <andrew@lunn.ch>
+- *
+- * This file is licensed under the terms of the GNU General Public
+- * License version 2. This program is licensed "as is" without any
+- * warranty of any kind, whether express or implied.
+- *
+- * This file contains the definitions for the board with the A1 variant of
+- * the SoC. The ethernet switch has a "wan" port.
+- */
+-
+-/dts-v1/;
+-
+-#include "kirkwood-rd88f6281.dtsi"
+-
+-/ {
+- model = "Marvell RD88f6281 Reference design, with A1 SoC";
+- compatible = "marvell,rd88f6281-a1", "marvell,rd88f6281","marvell,kirkwood-88f6281", "marvell,kirkwood";
+-
+- dsa@0 {
+- switch@0 {
+- reg = <0 0>; /* MDIO address 0, switch 0 in tree */
+- port@4 {
+- reg = <4>;
+- label = "wan";
+- };
+- };
+- };
+-};
+\ No newline at end of file
+diff --git a/arch/arm/boot/dts/kirkwood-rd88f6281-z0.dts b/arch/arm/boot/dts/kirkwood-rd88f6281-z0.dts
+new file mode 100644
+index 000000000000..f4272b64ed7f
+--- /dev/null
++++ b/arch/arm/boot/dts/kirkwood-rd88f6281-z0.dts
+@@ -0,0 +1,35 @@
++/*
++ * Marvell RD88F6181 Z0 stepping description
++ *
++ * Andrew Lunn <andrew@lunn.ch>
++ *
++ * This file is licensed under the terms of the GNU General Public
++ * License version 2. This program is licensed "as is" without any
++ * warranty of any kind, whether express or implied.
++ *
++ * This file contains the definitions for the board using the Z0
++ * stepping of the SoC. The ethernet switch has a "wan" port.
++*/
++
++/dts-v1/;
++
++#include "kirkwood-rd88f6281.dtsi"
++
++/ {
++ model = "Marvell RD88f6281 Reference design, with Z0 SoC";
++ compatible = "marvell,rd88f6281-z0", "marvell,rd88f6281","marvell,kirkwood-88f6281", "marvell,kirkwood";
++
++ dsa@0 {
++ switch@0 {
++ reg = <0 0>; /* MDIO address 0, switch 0 in tree */
++ port@4 {
++ reg = <4>;
++ label = "wan";
++ };
++ };
++ };
++};
++
++ð1 {
++ status = "disabled";
++};
+diff --git a/arch/arm/boot/dts/kirkwood-rd88f6281.dtsi b/arch/arm/boot/dts/kirkwood-rd88f6281.dtsi
+index 26cf0e0ccefd..d195e884b3b5 100644
+--- a/arch/arm/boot/dts/kirkwood-rd88f6281.dtsi
++++ b/arch/arm/boot/dts/kirkwood-rd88f6281.dtsi
+@@ -37,7 +37,6 @@
+
+ ocp@f1000000 {
+ pinctrl: pin-controller@10000 {
+- pinctrl-0 = <&pmx_sdio_cd>;
+ pinctrl-names = "default";
+
+ pmx_sdio_cd: pmx-sdio-cd {
+@@ -69,8 +68,8 @@
+ #address-cells = <2>;
+ #size-cells = <0>;
+
+- dsa,ethernet = <ð0>;
+- dsa,mii-bus = <ðphy1>;
++ dsa,ethernet = <ð0port>;
++ dsa,mii-bus = <&mdio>;
+
+ switch@0 {
+ #address-cells = <1>;
+@@ -119,35 +118,19 @@
+ };
+
+ partition@300000 {
+- label = "data";
++ label = "rootfs";
+ reg = <0x0300000 0x500000>;
+ };
+ };
+
+ &mdio {
+ status = "okay";
+-
+- ethphy0: ethernet-phy@0 {
+- reg = <0>;
+- };
+-
+- ethphy1: ethernet-phy@ff {
+- reg = <0xff>; /* No PHY attached */
+- speed = <1000>;
+- duple = <1>;
+- };
+ };
+
+ ð0 {
+ status = "okay";
+ ethernet0-port@0 {
+- phy-handle = <ðphy0>;
+- };
+-};
+-
+-ð1 {
+- status = "okay";
+- ethernet1-port@0 {
+- phy-handle = <ðphy1>;
++ speed = <1000>;
++ duplex = <1>;
+ };
+ };
+diff --git a/arch/arm/boot/dts/kirkwood.dtsi b/arch/arm/boot/dts/kirkwood.dtsi
+index afc640cd80c5..464f09a1a4a5 100644
+--- a/arch/arm/boot/dts/kirkwood.dtsi
++++ b/arch/arm/boot/dts/kirkwood.dtsi
+@@ -309,7 +309,7 @@
+ marvell,tx-checksum-limit = <1600>;
+ status = "disabled";
+
+- ethernet0-port@0 {
++ eth0port: ethernet0-port@0 {
+ compatible = "marvell,kirkwood-eth-port";
+ reg = <0>;
+ interrupts = <11>;
+@@ -342,7 +342,7 @@
+ pinctrl-names = "default";
+ status = "disabled";
+
+- ethernet1-port@0 {
++ eth1port: ethernet1-port@0 {
+ compatible = "marvell,kirkwood-eth-port";
+ reg = <0>;
+ interrupts = <15>;
+diff --git a/arch/arm/boot/dts/sama5d3_can.dtsi b/arch/arm/boot/dts/sama5d3_can.dtsi
+index a0775851cce5..eaf41451ad0c 100644
+--- a/arch/arm/boot/dts/sama5d3_can.dtsi
++++ b/arch/arm/boot/dts/sama5d3_can.dtsi
+@@ -40,7 +40,7 @@
+ atmel,clk-output-range = <0 66000000>;
+ };
+
+- can1_clk: can0_clk {
++ can1_clk: can1_clk {
+ #clock-cells = <0>;
+ reg = <41>;
+ atmel,clk-output-range = <0 66000000>;
+diff --git a/arch/arm/mach-at91/clock.c b/arch/arm/mach-at91/clock.c
+index 034529d801b2..d66f102c352a 100644
+--- a/arch/arm/mach-at91/clock.c
++++ b/arch/arm/mach-at91/clock.c
+@@ -962,6 +962,7 @@ static int __init at91_clock_reset(void)
+ }
+
+ at91_pmc_write(AT91_PMC_SCDR, scdr);
++ at91_pmc_write(AT91_PMC_PCDR, pcdr);
+ if (cpu_is_sama5d3())
+ at91_pmc_write(AT91_PMC_PCDR1, pcdr1);
+
+diff --git a/arch/arm64/include/asm/compat.h b/arch/arm64/include/asm/compat.h
+index 253e33bc94fb..56de5aadede2 100644
+--- a/arch/arm64/include/asm/compat.h
++++ b/arch/arm64/include/asm/compat.h
+@@ -37,8 +37,8 @@ typedef s32 compat_ssize_t;
+ typedef s32 compat_time_t;
+ typedef s32 compat_clock_t;
+ typedef s32 compat_pid_t;
+-typedef u32 __compat_uid_t;
+-typedef u32 __compat_gid_t;
++typedef u16 __compat_uid_t;
++typedef u16 __compat_gid_t;
+ typedef u16 __compat_uid16_t;
+ typedef u16 __compat_gid16_t;
+ typedef u32 __compat_uid32_t;
+diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
+index 9ce04ba6bcb0..8993a69099c7 100644
+--- a/arch/arm64/kernel/entry.S
++++ b/arch/arm64/kernel/entry.S
+@@ -298,7 +298,6 @@ el1_dbg:
+ mrs x0, far_el1
+ mov x2, sp // struct pt_regs
+ bl do_debug_exception
+- enable_dbg
+ kernel_exit 1
+ el1_inv:
+ // TODO: add support for undefined instructions in kernel mode
+diff --git a/arch/m68k/mm/hwtest.c b/arch/m68k/mm/hwtest.c
+index 2c7dde3c6430..2a5259fd23eb 100644
+--- a/arch/m68k/mm/hwtest.c
++++ b/arch/m68k/mm/hwtest.c
+@@ -28,9 +28,11 @@
+ int hwreg_present( volatile void *regp )
+ {
+ int ret = 0;
++ unsigned long flags;
+ long save_sp, save_vbr;
+ long tmp_vectors[3];
+
++ local_irq_save(flags);
+ __asm__ __volatile__
+ ( "movec %/vbr,%2\n\t"
+ "movel #Lberr1,%4@(8)\n\t"
+@@ -46,6 +48,7 @@ int hwreg_present( volatile void *regp )
+ : "=&d" (ret), "=&r" (save_sp), "=&r" (save_vbr)
+ : "a" (regp), "a" (tmp_vectors)
+ );
++ local_irq_restore(flags);
+
+ return( ret );
+ }
+@@ -58,9 +61,11 @@ EXPORT_SYMBOL(hwreg_present);
+ int hwreg_write( volatile void *regp, unsigned short val )
+ {
+ int ret;
++ unsigned long flags;
+ long save_sp, save_vbr;
+ long tmp_vectors[3];
+
++ local_irq_save(flags);
+ __asm__ __volatile__
+ ( "movec %/vbr,%2\n\t"
+ "movel #Lberr2,%4@(8)\n\t"
+@@ -78,6 +83,7 @@ int hwreg_write( volatile void *regp, unsigned short val )
+ : "=&d" (ret), "=&r" (save_sp), "=&r" (save_vbr)
+ : "a" (regp), "a" (tmp_vectors), "g" (val)
+ );
++ local_irq_restore(flags);
+
+ return( ret );
+ }
+diff --git a/arch/powerpc/kernel/eeh_pe.c b/arch/powerpc/kernel/eeh_pe.c
+index 94802d267022..b20f9d63a664 100644
+--- a/arch/powerpc/kernel/eeh_pe.c
++++ b/arch/powerpc/kernel/eeh_pe.c
+@@ -570,6 +570,8 @@ static void *__eeh_pe_state_clear(void *data, void *flag)
+ {
+ struct eeh_pe *pe = (struct eeh_pe *)data;
+ int state = *((int *)flag);
++ struct eeh_dev *edev, *tmp;
++ struct pci_dev *pdev;
+
+ /* Keep the state of permanently removed PE intact */
+ if ((pe->freeze_count > EEH_MAX_ALLOWED_FREEZES) &&
+@@ -578,9 +580,22 @@ static void *__eeh_pe_state_clear(void *data, void *flag)
+
+ pe->state &= ~state;
+
+- /* Clear check count since last isolation */
+- if (state & EEH_PE_ISOLATED)
+- pe->check_count = 0;
++ /*
++ * Special treatment on clearing isolated state. Clear
++ * check count since last isolation and put all affected
++ * devices to normal state.
++ */
++ if (!(state & EEH_PE_ISOLATED))
++ return NULL;
++
++ pe->check_count = 0;
++ eeh_pe_for_each_dev(pe, edev, tmp) {
++ pdev = eeh_dev_to_pci_dev(edev);
++ if (!pdev)
++ continue;
++
++ pdev->error_state = pci_channel_io_normal;
++ }
+
+ return NULL;
+ }
+diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
+index 4642d6a4d356..de1ec54a2a57 100644
+--- a/arch/powerpc/platforms/pseries/iommu.c
++++ b/arch/powerpc/platforms/pseries/iommu.c
+@@ -329,16 +329,16 @@ struct direct_window {
+
+ /* Dynamic DMA Window support */
+ struct ddw_query_response {
+- __be32 windows_available;
+- __be32 largest_available_block;
+- __be32 page_size;
+- __be32 migration_capable;
++ u32 windows_available;
++ u32 largest_available_block;
++ u32 page_size;
++ u32 migration_capable;
+ };
+
+ struct ddw_create_response {
+- __be32 liobn;
+- __be32 addr_hi;
+- __be32 addr_lo;
++ u32 liobn;
++ u32 addr_hi;
++ u32 addr_lo;
+ };
+
+ static LIST_HEAD(direct_window_list);
+@@ -725,16 +725,18 @@ static void remove_ddw(struct device_node *np, bool remove_prop)
+ {
+ struct dynamic_dma_window_prop *dwp;
+ struct property *win64;
+- const u32 *ddw_avail;
++ u32 ddw_avail[3];
+ u64 liobn;
+- int len, ret = 0;
++ int ret = 0;
++
++ ret = of_property_read_u32_array(np, "ibm,ddw-applicable",
++ &ddw_avail[0], 3);
+
+- ddw_avail = of_get_property(np, "ibm,ddw-applicable", &len);
+ win64 = of_find_property(np, DIRECT64_PROPNAME, NULL);
+ if (!win64)
+ return;
+
+- if (!ddw_avail || len < 3 * sizeof(u32) || win64->length < sizeof(*dwp))
++ if (ret || win64->length < sizeof(*dwp))
+ goto delprop;
+
+ dwp = win64->value;
+@@ -872,8 +874,9 @@ static int create_ddw(struct pci_dev *dev, const u32 *ddw_avail,
+
+ do {
+ /* extra outputs are LIOBN and dma-addr (hi, lo) */
+- ret = rtas_call(ddw_avail[1], 5, 4, (u32 *)create, cfg_addr,
+- BUID_HI(buid), BUID_LO(buid), page_shift, window_shift);
++ ret = rtas_call(ddw_avail[1], 5, 4, (u32 *)create,
++ cfg_addr, BUID_HI(buid), BUID_LO(buid),
++ page_shift, window_shift);
+ } while (rtas_busy_delay(ret));
+ dev_info(&dev->dev,
+ "ibm,create-pe-dma-window(%x) %x %x %x %x %x returned %d "
+@@ -910,7 +913,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
+ int page_shift;
+ u64 dma_addr, max_addr;
+ struct device_node *dn;
+- const u32 *uninitialized_var(ddw_avail);
++ u32 ddw_avail[3];
+ struct direct_window *window;
+ struct property *win64;
+ struct dynamic_dma_window_prop *ddwprop;
+@@ -942,8 +945,9 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
+ * for the given node in that order.
+ * the property is actually in the parent, not the PE
+ */
+- ddw_avail = of_get_property(pdn, "ibm,ddw-applicable", &len);
+- if (!ddw_avail || len < 3 * sizeof(u32))
++ ret = of_property_read_u32_array(pdn, "ibm,ddw-applicable",
++ &ddw_avail[0], 3);
++ if (ret)
+ goto out_failed;
+
+ /*
+@@ -966,11 +970,11 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
+ dev_dbg(&dev->dev, "no free dynamic windows");
+ goto out_failed;
+ }
+- if (be32_to_cpu(query.page_size) & 4) {
++ if (query.page_size & 4) {
+ page_shift = 24; /* 16MB */
+- } else if (be32_to_cpu(query.page_size) & 2) {
++ } else if (query.page_size & 2) {
+ page_shift = 16; /* 64kB */
+- } else if (be32_to_cpu(query.page_size) & 1) {
++ } else if (query.page_size & 1) {
+ page_shift = 12; /* 4kB */
+ } else {
+ dev_dbg(&dev->dev, "no supported direct page size in mask %x",
+@@ -980,7 +984,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
+ /* verify the window * number of ptes will map the partition */
+ /* check largest block * page size > max memory hotplug addr */
+ max_addr = memory_hotplug_max();
+- if (be32_to_cpu(query.largest_available_block) < (max_addr >> page_shift)) {
++ if (query.largest_available_block < (max_addr >> page_shift)) {
+ dev_dbg(&dev->dev, "can't map partiton max 0x%llx with %u "
+ "%llu-sized pages\n", max_addr, query.largest_available_block,
+ 1ULL << page_shift);
+@@ -1006,8 +1010,9 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
+ if (ret != 0)
+ goto out_free_prop;
+
+- ddwprop->liobn = create.liobn;
+- ddwprop->dma_base = cpu_to_be64(of_read_number(&create.addr_hi, 2));
++ ddwprop->liobn = cpu_to_be32(create.liobn);
++ ddwprop->dma_base = cpu_to_be64(((u64)create.addr_hi << 32) |
++ create.addr_lo);
+ ddwprop->tce_shift = cpu_to_be32(page_shift);
+ ddwprop->window_shift = cpu_to_be32(len);
+
+@@ -1039,7 +1044,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
+ list_add(&window->list, &direct_window_list);
+ spin_unlock(&direct_window_list_lock);
+
+- dma_addr = of_read_number(&create.addr_hi, 2);
++ dma_addr = be64_to_cpu(ddwprop->dma_base);
+ goto out_unlock;
+
+ out_free_window:
+diff --git a/arch/s390/kvm/interrupt.c b/arch/s390/kvm/interrupt.c
+index 90c8de22a2a0..5d5ebd400162 100644
+--- a/arch/s390/kvm/interrupt.c
++++ b/arch/s390/kvm/interrupt.c
+@@ -85,6 +85,7 @@ static int __interrupt_is_deliverable(struct kvm_vcpu *vcpu,
+ return 0;
+ if (vcpu->arch.sie_block->gcr[0] & 0x2000ul)
+ return 1;
++ return 0;
+ case KVM_S390_INT_EMERGENCY:
+ if (psw_extint_disabled(vcpu))
+ return 0;
+diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig
+index 407c87d9879a..db7d3bf4357e 100644
+--- a/arch/sparc/Kconfig
++++ b/arch/sparc/Kconfig
+@@ -67,6 +67,7 @@ config SPARC64
+ select HAVE_SYSCALL_TRACEPOINTS
+ select HAVE_CONTEXT_TRACKING
+ select HAVE_DEBUG_KMEMLEAK
++ select SPARSE_IRQ
+ select RTC_DRV_CMOS
+ select RTC_DRV_BQ4802
+ select RTC_DRV_SUN4V
+diff --git a/arch/sparc/include/asm/hypervisor.h b/arch/sparc/include/asm/hypervisor.h
+index 94b39caea3eb..4f6725ff4c33 100644
+--- a/arch/sparc/include/asm/hypervisor.h
++++ b/arch/sparc/include/asm/hypervisor.h
+@@ -2947,6 +2947,16 @@ unsigned long sun4v_vt_set_perfreg(unsigned long reg_num,
+ unsigned long reg_val);
+ #endif
+
++#define HV_FAST_T5_GET_PERFREG 0x1a8
++#define HV_FAST_T5_SET_PERFREG 0x1a9
++
++#ifndef __ASSEMBLY__
++unsigned long sun4v_t5_get_perfreg(unsigned long reg_num,
++ unsigned long *reg_val);
++unsigned long sun4v_t5_set_perfreg(unsigned long reg_num,
++ unsigned long reg_val);
++#endif
++
+ /* Function numbers for HV_CORE_TRAP. */
+ #define HV_CORE_SET_VER 0x00
+ #define HV_CORE_PUTCHAR 0x01
+@@ -2978,6 +2988,7 @@ unsigned long sun4v_vt_set_perfreg(unsigned long reg_num,
+ #define HV_GRP_VF_CPU 0x0205
+ #define HV_GRP_KT_CPU 0x0209
+ #define HV_GRP_VT_CPU 0x020c
++#define HV_GRP_T5_CPU 0x0211
+ #define HV_GRP_DIAG 0x0300
+
+ #ifndef __ASSEMBLY__
+diff --git a/arch/sparc/include/asm/irq_64.h b/arch/sparc/include/asm/irq_64.h
+index 91d219381306..3f70f900e834 100644
+--- a/arch/sparc/include/asm/irq_64.h
++++ b/arch/sparc/include/asm/irq_64.h
+@@ -37,7 +37,7 @@
+ *
+ * ino_bucket->irq allocation is made during {sun4v_,}build_irq().
+ */
+-#define NR_IRQS 255
++#define NR_IRQS (2048)
+
+ void irq_install_pre_handler(int irq,
+ void (*func)(unsigned int, void *, void *),
+@@ -57,11 +57,8 @@ unsigned int sun4u_build_msi(u32 portid, unsigned int *irq_p,
+ unsigned long iclr_base);
+ void sun4u_destroy_msi(unsigned int irq);
+
+-unsigned char irq_alloc(unsigned int dev_handle,
+- unsigned int dev_ino);
+-#ifdef CONFIG_PCI_MSI
++unsigned int irq_alloc(unsigned int dev_handle, unsigned int dev_ino);
+ void irq_free(unsigned int irq);
+-#endif
+
+ void __init init_IRQ(void);
+ void fixup_irqs(void);
+diff --git a/arch/sparc/include/asm/ldc.h b/arch/sparc/include/asm/ldc.h
+index c8c67f621f4f..58ab64de25d2 100644
+--- a/arch/sparc/include/asm/ldc.h
++++ b/arch/sparc/include/asm/ldc.h
+@@ -53,13 +53,14 @@ struct ldc_channel;
+ /* Allocate state for a channel. */
+ struct ldc_channel *ldc_alloc(unsigned long id,
+ const struct ldc_channel_config *cfgp,
+- void *event_arg);
++ void *event_arg,
++ const char *name);
+
+ /* Shut down and free state for a channel. */
+ void ldc_free(struct ldc_channel *lp);
+
+ /* Register TX and RX queues of the link with the hypervisor. */
+-int ldc_bind(struct ldc_channel *lp, const char *name);
++int ldc_bind(struct ldc_channel *lp);
+
+ /* For non-RAW protocols we need to complete a handshake before
+ * communication can proceed. ldc_connect() does that, if the
+diff --git a/arch/sparc/include/asm/oplib_64.h b/arch/sparc/include/asm/oplib_64.h
+index f34682430fcf..2e3a4add8591 100644
+--- a/arch/sparc/include/asm/oplib_64.h
++++ b/arch/sparc/include/asm/oplib_64.h
+@@ -62,7 +62,8 @@ struct linux_mem_p1275 {
+ /* You must call prom_init() before using any of the library services,
+ * preferably as early as possible. Pass it the romvec pointer.
+ */
+-void prom_init(void *cif_handler, void *cif_stack);
++void prom_init(void *cif_handler);
++void prom_init_report(void);
+
+ /* Boot argument acquisition, returns the boot command line string. */
+ char *prom_getbootargs(void);
+diff --git a/arch/sparc/include/asm/page_64.h b/arch/sparc/include/asm/page_64.h
+index bf109984a032..8c2a8c937540 100644
+--- a/arch/sparc/include/asm/page_64.h
++++ b/arch/sparc/include/asm/page_64.h
+@@ -57,18 +57,21 @@ void copy_user_page(void *to, void *from, unsigned long vaddr, struct page *topa
+ typedef struct { unsigned long pte; } pte_t;
+ typedef struct { unsigned long iopte; } iopte_t;
+ typedef struct { unsigned long pmd; } pmd_t;
++typedef struct { unsigned long pud; } pud_t;
+ typedef struct { unsigned long pgd; } pgd_t;
+ typedef struct { unsigned long pgprot; } pgprot_t;
+
+ #define pte_val(x) ((x).pte)
+ #define iopte_val(x) ((x).iopte)
+ #define pmd_val(x) ((x).pmd)
++#define pud_val(x) ((x).pud)
+ #define pgd_val(x) ((x).pgd)
+ #define pgprot_val(x) ((x).pgprot)
+
+ #define __pte(x) ((pte_t) { (x) } )
+ #define __iopte(x) ((iopte_t) { (x) } )
+ #define __pmd(x) ((pmd_t) { (x) } )
++#define __pud(x) ((pud_t) { (x) } )
+ #define __pgd(x) ((pgd_t) { (x) } )
+ #define __pgprot(x) ((pgprot_t) { (x) } )
+
+@@ -77,18 +80,21 @@ typedef struct { unsigned long pgprot; } pgprot_t;
+ typedef unsigned long pte_t;
+ typedef unsigned long iopte_t;
+ typedef unsigned long pmd_t;
++typedef unsigned long pud_t;
+ typedef unsigned long pgd_t;
+ typedef unsigned long pgprot_t;
+
+ #define pte_val(x) (x)
+ #define iopte_val(x) (x)
+ #define pmd_val(x) (x)
++#define pud_val(x) (x)
+ #define pgd_val(x) (x)
+ #define pgprot_val(x) (x)
+
+ #define __pte(x) (x)
+ #define __iopte(x) (x)
+ #define __pmd(x) (x)
++#define __pud(x) (x)
+ #define __pgd(x) (x)
+ #define __pgprot(x) (x)
+
+@@ -96,21 +102,14 @@ typedef unsigned long pgprot_t;
+
+ typedef pte_t *pgtable_t;
+
+-/* These two values define the virtual address space range in which we
+- * must forbid 64-bit user processes from making mappings. It used to
+- * represent precisely the virtual address space hole present in most
+- * early sparc64 chips including UltraSPARC-I. But now it also is
+- * further constrained by the limits of our page tables, which is
+- * 43-bits of virtual address.
+- */
+-#define SPARC64_VA_HOLE_TOP _AC(0xfffffc0000000000,UL)
+-#define SPARC64_VA_HOLE_BOTTOM _AC(0x0000040000000000,UL)
++extern unsigned long sparc64_va_hole_top;
++extern unsigned long sparc64_va_hole_bottom;
+
+ /* The next two defines specify the actual exclusion region we
+ * enforce, wherein we use a 4GB red zone on each side of the VA hole.
+ */
+-#define VA_EXCLUDE_START (SPARC64_VA_HOLE_BOTTOM - (1UL << 32UL))
+-#define VA_EXCLUDE_END (SPARC64_VA_HOLE_TOP + (1UL << 32UL))
++#define VA_EXCLUDE_START (sparc64_va_hole_bottom - (1UL << 32UL))
++#define VA_EXCLUDE_END (sparc64_va_hole_top + (1UL << 32UL))
+
+ #define TASK_UNMAPPED_BASE (test_thread_flag(TIF_32BIT) ? \
+ _AC(0x0000000070000000,UL) : \
+@@ -118,20 +117,16 @@ typedef pte_t *pgtable_t;
+
+ #include <asm-generic/memory_model.h>
+
+-#define PAGE_OFFSET_BY_BITS(X) (-(_AC(1,UL) << (X)))
+ extern unsigned long PAGE_OFFSET;
+
+ #endif /* !(__ASSEMBLY__) */
+
+-/* The maximum number of physical memory address bits we support, this
+- * is used to size various tables used to manage kernel TLB misses and
+- * also the sparsemem code.
++/* The maximum number of physical memory address bits we support. The
++ * largest value we can support is whatever "KPGD_SHIFT + KPTE_BITS"
++ * evaluates to.
+ */
+-#define MAX_PHYS_ADDRESS_BITS 47
++#define MAX_PHYS_ADDRESS_BITS 53
+
+-/* These two shift counts are used when indexing sparc64_valid_addr_bitmap
+- * and kpte_linear_bitmap.
+- */
+ #define ILOG2_4MB 22
+ #define ILOG2_256MB 28
+
+diff --git a/arch/sparc/include/asm/pgalloc_64.h b/arch/sparc/include/asm/pgalloc_64.h
+index 39a7ac49b00c..5e3187185b4a 100644
+--- a/arch/sparc/include/asm/pgalloc_64.h
++++ b/arch/sparc/include/asm/pgalloc_64.h
+@@ -15,6 +15,13 @@
+
+ extern struct kmem_cache *pgtable_cache;
+
++static inline void __pgd_populate(pgd_t *pgd, pud_t *pud)
++{
++ pgd_set(pgd, pud);
++}
++
++#define pgd_populate(MM, PGD, PUD) __pgd_populate(PGD, PUD)
++
+ static inline pgd_t *pgd_alloc(struct mm_struct *mm)
+ {
+ return kmem_cache_alloc(pgtable_cache, GFP_KERNEL);
+@@ -25,7 +32,23 @@ static inline void pgd_free(struct mm_struct *mm, pgd_t *pgd)
+ kmem_cache_free(pgtable_cache, pgd);
+ }
+
+-#define pud_populate(MM, PUD, PMD) pud_set(PUD, PMD)
++static inline void __pud_populate(pud_t *pud, pmd_t *pmd)
++{
++ pud_set(pud, pmd);
++}
++
++#define pud_populate(MM, PUD, PMD) __pud_populate(PUD, PMD)
++
++static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr)
++{
++ return kmem_cache_alloc(pgtable_cache,
++ GFP_KERNEL|__GFP_REPEAT);
++}
++
++static inline void pud_free(struct mm_struct *mm, pud_t *pud)
++{
++ kmem_cache_free(pgtable_cache, pud);
++}
+
+ static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr)
+ {
+@@ -91,4 +114,7 @@ static inline void __pte_free_tlb(struct mmu_gather *tlb, pte_t *pte,
+ #define __pmd_free_tlb(tlb, pmd, addr) \
+ pgtable_free_tlb(tlb, pmd, false)
+
++#define __pud_free_tlb(tlb, pud, addr) \
++ pgtable_free_tlb(tlb, pud, false)
++
+ #endif /* _SPARC64_PGALLOC_H */
+diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
+index 3770bf5c6e1b..bfeb626085ac 100644
+--- a/arch/sparc/include/asm/pgtable_64.h
++++ b/arch/sparc/include/asm/pgtable_64.h
+@@ -20,8 +20,6 @@
+ #include <asm/page.h>
+ #include <asm/processor.h>
+
+-#include <asm-generic/pgtable-nopud.h>
+-
+ /* The kernel image occupies 0x4000000 to 0x6000000 (4MB --> 96MB).
+ * The page copy blockops can use 0x6000000 to 0x8000000.
+ * The 8K TSB is mapped in the 0x8000000 to 0x8400000 range.
+@@ -42,10 +40,7 @@
+ #define LOW_OBP_ADDRESS _AC(0x00000000f0000000,UL)
+ #define HI_OBP_ADDRESS _AC(0x0000000100000000,UL)
+ #define VMALLOC_START _AC(0x0000000100000000,UL)
+-#define VMALLOC_END _AC(0x0000010000000000,UL)
+-#define VMEMMAP_BASE _AC(0x0000010000000000,UL)
+-
+-#define vmemmap ((struct page *)VMEMMAP_BASE)
++#define VMEMMAP_BASE VMALLOC_END
+
+ /* PMD_SHIFT determines the size of the area a second-level page
+ * table can map
+@@ -55,13 +50,25 @@
+ #define PMD_MASK (~(PMD_SIZE-1))
+ #define PMD_BITS (PAGE_SHIFT - 3)
+
+-/* PGDIR_SHIFT determines what a third-level page table entry can map */
+-#define PGDIR_SHIFT (PAGE_SHIFT + (PAGE_SHIFT-3) + PMD_BITS)
++/* PUD_SHIFT determines the size of the area a third-level page
++ * table can map
++ */
++#define PUD_SHIFT (PMD_SHIFT + PMD_BITS)
++#define PUD_SIZE (_AC(1,UL) << PUD_SHIFT)
++#define PUD_MASK (~(PUD_SIZE-1))
++#define PUD_BITS (PAGE_SHIFT - 3)
++
++/* PGDIR_SHIFT determines what a fourth-level page table entry can map */
++#define PGDIR_SHIFT (PUD_SHIFT + PUD_BITS)
+ #define PGDIR_SIZE (_AC(1,UL) << PGDIR_SHIFT)
+ #define PGDIR_MASK (~(PGDIR_SIZE-1))
+ #define PGDIR_BITS (PAGE_SHIFT - 3)
+
+-#if (PGDIR_SHIFT + PGDIR_BITS) != 43
++#if (MAX_PHYS_ADDRESS_BITS > PGDIR_SHIFT + PGDIR_BITS)
++#error MAX_PHYS_ADDRESS_BITS exceeds what kernel page tables can support
++#endif
++
++#if (PGDIR_SHIFT + PGDIR_BITS) != 53
+ #error Page table parameters do not cover virtual address space properly.
+ #endif
+
+@@ -71,28 +78,18 @@
+
+ #ifndef __ASSEMBLY__
+
+-#include <linux/sched.h>
+-
+-extern unsigned long sparc64_valid_addr_bitmap[];
++extern unsigned long VMALLOC_END;
+
+-/* Needs to be defined here and not in linux/mm.h, as it is arch dependent */
+-static inline bool __kern_addr_valid(unsigned long paddr)
+-{
+- if ((paddr >> MAX_PHYS_ADDRESS_BITS) != 0UL)
+- return false;
+- return test_bit(paddr >> ILOG2_4MB, sparc64_valid_addr_bitmap);
+-}
++#define vmemmap ((struct page *)VMEMMAP_BASE)
+
+-static inline bool kern_addr_valid(unsigned long addr)
+-{
+- unsigned long paddr = __pa(addr);
++#include <linux/sched.h>
+
+- return __kern_addr_valid(paddr);
+-}
++bool kern_addr_valid(unsigned long addr);
+
+ /* Entries per page directory level. */
+ #define PTRS_PER_PTE (1UL << (PAGE_SHIFT-3))
+ #define PTRS_PER_PMD (1UL << PMD_BITS)
++#define PTRS_PER_PUD (1UL << PUD_BITS)
+ #define PTRS_PER_PGD (1UL << PGDIR_BITS)
+
+ /* Kernel has a separate 44bit address space. */
+@@ -101,6 +98,9 @@ static inline bool kern_addr_valid(unsigned long addr)
+ #define pmd_ERROR(e) \
+ pr_err("%s:%d: bad pmd %p(%016lx) seen at (%pS)\n", \
+ __FILE__, __LINE__, &(e), pmd_val(e), __builtin_return_address(0))
++#define pud_ERROR(e) \
++ pr_err("%s:%d: bad pud %p(%016lx) seen at (%pS)\n", \
++ __FILE__, __LINE__, &(e), pud_val(e), __builtin_return_address(0))
+ #define pgd_ERROR(e) \
+ pr_err("%s:%d: bad pgd %p(%016lx) seen at (%pS)\n", \
+ __FILE__, __LINE__, &(e), pgd_val(e), __builtin_return_address(0))
+@@ -112,6 +112,7 @@ static inline bool kern_addr_valid(unsigned long addr)
+ #define _PAGE_R _AC(0x8000000000000000,UL) /* Keep ref bit uptodate*/
+ #define _PAGE_SPECIAL _AC(0x0200000000000000,UL) /* Special page */
+ #define _PAGE_PMD_HUGE _AC(0x0100000000000000,UL) /* Huge page */
++#define _PAGE_PUD_HUGE _PAGE_PMD_HUGE
+
+ /* Advertise support for _PAGE_SPECIAL */
+ #define __HAVE_ARCH_PTE_SPECIAL
+@@ -658,26 +659,26 @@ static inline unsigned long pmd_large(pmd_t pmd)
+ return pte_val(pte) & _PAGE_PMD_HUGE;
+ }
+
+-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+-static inline unsigned long pmd_young(pmd_t pmd)
++static inline unsigned long pmd_pfn(pmd_t pmd)
+ {
+ pte_t pte = __pte(pmd_val(pmd));
+
+- return pte_young(pte);
++ return pte_pfn(pte);
+ }
+
+-static inline unsigned long pmd_write(pmd_t pmd)
++#ifdef CONFIG_TRANSPARENT_HUGEPAGE
++static inline unsigned long pmd_young(pmd_t pmd)
+ {
+ pte_t pte = __pte(pmd_val(pmd));
+
+- return pte_write(pte);
++ return pte_young(pte);
+ }
+
+-static inline unsigned long pmd_pfn(pmd_t pmd)
++static inline unsigned long pmd_write(pmd_t pmd)
+ {
+ pte_t pte = __pte(pmd_val(pmd));
+
+- return pte_pfn(pte);
++ return pte_write(pte);
+ }
+
+ static inline unsigned long pmd_trans_huge(pmd_t pmd)
+@@ -771,13 +772,15 @@ static inline int pmd_present(pmd_t pmd)
+ * the top bits outside of the range of any physical address size we
+ * support are clear as well. We also validate the physical itself.
+ */
+-#define pmd_bad(pmd) ((pmd_val(pmd) & ~PAGE_MASK) || \
+- !__kern_addr_valid(pmd_val(pmd)))
++#define pmd_bad(pmd) (pmd_val(pmd) & ~PAGE_MASK)
+
+ #define pud_none(pud) (!pud_val(pud))
+
+-#define pud_bad(pud) ((pud_val(pud) & ~PAGE_MASK) || \
+- !__kern_addr_valid(pud_val(pud)))
++#define pud_bad(pud) (pud_val(pud) & ~PAGE_MASK)
++
++#define pgd_none(pgd) (!pgd_val(pgd))
++
++#define pgd_bad(pgd) (pgd_val(pgd) & ~PAGE_MASK)
+
+ #ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ void set_pmd_at(struct mm_struct *mm, unsigned long addr,
+@@ -815,10 +818,31 @@ static inline unsigned long __pmd_page(pmd_t pmd)
+ #define pmd_clear(pmdp) (pmd_val(*(pmdp)) = 0UL)
+ #define pud_present(pud) (pud_val(pud) != 0U)
+ #define pud_clear(pudp) (pud_val(*(pudp)) = 0UL)
++#define pgd_page_vaddr(pgd) \
++ ((unsigned long) __va(pgd_val(pgd)))
++#define pgd_present(pgd) (pgd_val(pgd) != 0U)
++#define pgd_clear(pgdp) (pgd_val(*(pgd)) = 0UL)
++
++static inline unsigned long pud_large(pud_t pud)
++{
++ pte_t pte = __pte(pud_val(pud));
++
++ return pte_val(pte) & _PAGE_PMD_HUGE;
++}
++
++static inline unsigned long pud_pfn(pud_t pud)
++{
++ pte_t pte = __pte(pud_val(pud));
++
++ return pte_pfn(pte);
++}
+
+ /* Same in both SUN4V and SUN4U. */
+ #define pte_none(pte) (!pte_val(pte))
+
++#define pgd_set(pgdp, pudp) \
++ (pgd_val(*(pgdp)) = (__pa((unsigned long) (pudp))))
++
+ /* to find an entry in a page-table-directory. */
+ #define pgd_index(address) (((address) >> PGDIR_SHIFT) & (PTRS_PER_PGD - 1))
+ #define pgd_offset(mm, address) ((mm)->pgd + pgd_index(address))
+@@ -826,6 +850,11 @@ static inline unsigned long __pmd_page(pmd_t pmd)
+ /* to find an entry in a kernel page-table-directory */
+ #define pgd_offset_k(address) pgd_offset(&init_mm, address)
+
++/* Find an entry in the third-level page table.. */
++#define pud_index(address) (((address) >> PUD_SHIFT) & (PTRS_PER_PUD - 1))
++#define pud_offset(pgdp, address) \
++ ((pud_t *) pgd_page_vaddr(*(pgdp)) + pud_index(address))
++
+ /* Find an entry in the second-level page table.. */
+ #define pmd_offset(pudp, address) \
+ ((pmd_t *) pud_page_vaddr(*(pudp)) + \
+@@ -898,7 +927,6 @@ static inline void __set_pte_at(struct mm_struct *mm, unsigned long addr,
+ #endif
+
+ extern pgd_t swapper_pg_dir[PTRS_PER_PGD];
+-extern pmd_t swapper_low_pmd_dir[PTRS_PER_PMD];
+
+ void paging_init(void);
+ unsigned long find_ecache_flush_span(unsigned long size);
+diff --git a/arch/sparc/include/asm/setup.h b/arch/sparc/include/asm/setup.h
+index f5fffd84d0dd..29d64b1758ed 100644
+--- a/arch/sparc/include/asm/setup.h
++++ b/arch/sparc/include/asm/setup.h
+@@ -48,6 +48,8 @@ unsigned long safe_compute_effective_address(struct pt_regs *, unsigned int);
+ #endif
+
+ #ifdef CONFIG_SPARC64
++void __init start_early_boot(void);
++
+ /* unaligned_64.c */
+ int handle_ldf_stq(u32 insn, struct pt_regs *regs);
+ void handle_ld_nf(u32 insn, struct pt_regs *regs);
+diff --git a/arch/sparc/include/asm/spitfire.h b/arch/sparc/include/asm/spitfire.h
+index 3fc58691dbd0..56f933816144 100644
+--- a/arch/sparc/include/asm/spitfire.h
++++ b/arch/sparc/include/asm/spitfire.h
+@@ -45,6 +45,8 @@
+ #define SUN4V_CHIP_NIAGARA3 0x03
+ #define SUN4V_CHIP_NIAGARA4 0x04
+ #define SUN4V_CHIP_NIAGARA5 0x05
++#define SUN4V_CHIP_SPARC_M6 0x06
++#define SUN4V_CHIP_SPARC_M7 0x07
+ #define SUN4V_CHIP_SPARC64X 0x8a
+ #define SUN4V_CHIP_UNKNOWN 0xff
+
+diff --git a/arch/sparc/include/asm/thread_info_64.h b/arch/sparc/include/asm/thread_info_64.h
+index a5f01ac6d0f1..cc6275c931a5 100644
+--- a/arch/sparc/include/asm/thread_info_64.h
++++ b/arch/sparc/include/asm/thread_info_64.h
+@@ -63,7 +63,8 @@ struct thread_info {
+ struct pt_regs *kern_una_regs;
+ unsigned int kern_una_insn;
+
+- unsigned long fpregs[0] __attribute__ ((aligned(64)));
++ unsigned long fpregs[(7 * 256) / sizeof(unsigned long)]
++ __attribute__ ((aligned(64)));
+ };
+
+ #endif /* !(__ASSEMBLY__) */
+@@ -102,6 +103,7 @@ struct thread_info {
+ #define FAULT_CODE_ITLB 0x04 /* Miss happened in I-TLB */
+ #define FAULT_CODE_WINFIXUP 0x08 /* Miss happened during spill/fill */
+ #define FAULT_CODE_BLKCOMMIT 0x10 /* Use blk-commit ASI in copy_page */
++#define FAULT_CODE_BAD_RA 0x20 /* Bad RA for sun4v */
+
+ #if PAGE_SHIFT == 13
+ #define THREAD_SIZE (2*PAGE_SIZE)
+diff --git a/arch/sparc/include/asm/tsb.h b/arch/sparc/include/asm/tsb.h
+index 90916f955cac..ecb49cfa3be9 100644
+--- a/arch/sparc/include/asm/tsb.h
++++ b/arch/sparc/include/asm/tsb.h
+@@ -133,9 +133,24 @@ extern struct tsb_phys_patch_entry __tsb_phys_patch, __tsb_phys_patch_end;
+ sub TSB, 0x8, TSB; \
+ TSB_STORE(TSB, TAG);
+
+- /* Do a kernel page table walk. Leaves physical PTE pointer in
+- * REG1. Jumps to FAIL_LABEL on early page table walk termination.
+- * VADDR will not be clobbered, but REG2 will.
++ /* Do a kernel page table walk. Leaves valid PTE value in
++ * REG1. Jumps to FAIL_LABEL on early page table walk
++ * termination. VADDR will not be clobbered, but REG2 will.
++ *
++ * There are two masks we must apply to propagate bits from
++ * the virtual address into the PTE physical address field
++ * when dealing with huge pages. This is because the page
++ * table boundaries do not match the huge page size(s) the
++ * hardware supports.
++ *
++ * In these cases we propagate the bits that are below the
++ * page table level where we saw the huge page mapping, but
++ * are still within the relevant physical bits for the huge
++ * page size in question. So for PMD mappings (which fall on
++ * bit 23, for 8MB per PMD) we must propagate bit 22 for a
++ * 4MB huge page. For huge PUDs (which fall on bit 33, for
++ * 8GB per PUD), we have to accommodate 256MB and 2GB huge
++ * pages. So for those we propagate bits 32 to 28.
+ */
+ #define KERN_PGTABLE_WALK(VADDR, REG1, REG2, FAIL_LABEL) \
+ sethi %hi(swapper_pg_dir), REG1; \
+@@ -145,15 +160,40 @@ extern struct tsb_phys_patch_entry __tsb_phys_patch, __tsb_phys_patch_end;
+ andn REG2, 0x7, REG2; \
+ ldx [REG1 + REG2], REG1; \
+ brz,pn REG1, FAIL_LABEL; \
+- sllx VADDR, 64 - (PMD_SHIFT + PMD_BITS), REG2; \
++ sllx VADDR, 64 - (PUD_SHIFT + PUD_BITS), REG2; \
+ srlx REG2, 64 - PAGE_SHIFT, REG2; \
+ andn REG2, 0x7, REG2; \
+ ldxa [REG1 + REG2] ASI_PHYS_USE_EC, REG1; \
+ brz,pn REG1, FAIL_LABEL; \
+- sllx VADDR, 64 - PMD_SHIFT, REG2; \
++ sethi %uhi(_PAGE_PUD_HUGE), REG2; \
++ brz,pn REG1, FAIL_LABEL; \
++ sllx REG2, 32, REG2; \
++ andcc REG1, REG2, %g0; \
++ sethi %hi(0xf8000000), REG2; \
++ bne,pt %xcc, 697f; \
++ sllx REG2, 1, REG2; \
++ sllx VADDR, 64 - (PMD_SHIFT + PMD_BITS), REG2; \
+ srlx REG2, 64 - PAGE_SHIFT, REG2; \
+ andn REG2, 0x7, REG2; \
+- add REG1, REG2, REG1;
++ ldxa [REG1 + REG2] ASI_PHYS_USE_EC, REG1; \
++ sethi %uhi(_PAGE_PMD_HUGE), REG2; \
++ brz,pn REG1, FAIL_LABEL; \
++ sllx REG2, 32, REG2; \
++ andcc REG1, REG2, %g0; \
++ be,pn %xcc, 698f; \
++ sethi %hi(0x400000), REG2; \
++697: brgez,pn REG1, FAIL_LABEL; \
++ andn REG1, REG2, REG1; \
++ and VADDR, REG2, REG2; \
++ ba,pt %xcc, 699f; \
++ or REG1, REG2, REG1; \
++698: sllx VADDR, 64 - PMD_SHIFT, REG2; \
++ srlx REG2, 64 - PAGE_SHIFT, REG2; \
++ andn REG2, 0x7, REG2; \
++ ldxa [REG1 + REG2] ASI_PHYS_USE_EC, REG1; \
++ brgez,pn REG1, FAIL_LABEL; \
++ nop; \
++699:
+
+ /* PMD has been loaded into REG1, interpret the value, seeing
+ * if it is a HUGE PMD or a normal one. If it is not valid
+@@ -198,6 +238,11 @@ extern struct tsb_phys_patch_entry __tsb_phys_patch, __tsb_phys_patch_end;
+ andn REG2, 0x7, REG2; \
+ ldxa [PHYS_PGD + REG2] ASI_PHYS_USE_EC, REG1; \
+ brz,pn REG1, FAIL_LABEL; \
++ sllx VADDR, 64 - (PUD_SHIFT + PUD_BITS), REG2; \
++ srlx REG2, 64 - PAGE_SHIFT, REG2; \
++ andn REG2, 0x7, REG2; \
++ ldxa [REG1 + REG2] ASI_PHYS_USE_EC, REG1; \
++ brz,pn REG1, FAIL_LABEL; \
+ sllx VADDR, 64 - (PMD_SHIFT + PMD_BITS), REG2; \
+ srlx REG2, 64 - PAGE_SHIFT, REG2; \
+ andn REG2, 0x7, REG2; \
+@@ -246,8 +291,6 @@ extern struct tsb_phys_patch_entry __tsb_phys_patch, __tsb_phys_patch_end;
+ (KERNEL_TSB_SIZE_BYTES / 16)
+ #define KERNEL_TSB4M_NENTRIES 4096
+
+-#define KTSB_PHYS_SHIFT 15
+-
+ /* Do a kernel TSB lookup at tl>0 on VADDR+TAG, branch to OK_LABEL
+ * on TSB hit. REG1, REG2, REG3, and REG4 are used as temporaries
+ * and the found TTE will be left in REG1. REG3 and REG4 must
+@@ -256,17 +299,15 @@ extern struct tsb_phys_patch_entry __tsb_phys_patch, __tsb_phys_patch_end;
+ * VADDR and TAG will be preserved and not clobbered by this macro.
+ */
+ #define KERN_TSB_LOOKUP_TL1(VADDR, TAG, REG1, REG2, REG3, REG4, OK_LABEL) \
+-661: sethi %hi(swapper_tsb), REG1; \
+- or REG1, %lo(swapper_tsb), REG1; \
++661: sethi %uhi(swapper_tsb), REG1; \
++ sethi %hi(swapper_tsb), REG2; \
++ or REG1, %ulo(swapper_tsb), REG1; \
++ or REG2, %lo(swapper_tsb), REG2; \
+ .section .swapper_tsb_phys_patch, "ax"; \
+ .word 661b; \
+ .previous; \
+-661: nop; \
+- .section .tsb_ldquad_phys_patch, "ax"; \
+- .word 661b; \
+- sllx REG1, KTSB_PHYS_SHIFT, REG1; \
+- sllx REG1, KTSB_PHYS_SHIFT, REG1; \
+- .previous; \
++ sllx REG1, 32, REG1; \
++ or REG1, REG2, REG1; \
+ srlx VADDR, PAGE_SHIFT, REG2; \
+ and REG2, (KERNEL_TSB_NENTRIES - 1), REG2; \
+ sllx REG2, 4, REG2; \
+@@ -281,17 +322,15 @@ extern struct tsb_phys_patch_entry __tsb_phys_patch, __tsb_phys_patch_end;
+ * we can make use of that for the index computation.
+ */
+ #define KERN_TSB4M_LOOKUP_TL1(TAG, REG1, REG2, REG3, REG4, OK_LABEL) \
+-661: sethi %hi(swapper_4m_tsb), REG1; \
+- or REG1, %lo(swapper_4m_tsb), REG1; \
++661: sethi %uhi(swapper_4m_tsb), REG1; \
++ sethi %hi(swapper_4m_tsb), REG2; \
++ or REG1, %ulo(swapper_4m_tsb), REG1; \
++ or REG2, %lo(swapper_4m_tsb), REG2; \
+ .section .swapper_4m_tsb_phys_patch, "ax"; \
+ .word 661b; \
+ .previous; \
+-661: nop; \
+- .section .tsb_ldquad_phys_patch, "ax"; \
+- .word 661b; \
+- sllx REG1, KTSB_PHYS_SHIFT, REG1; \
+- sllx REG1, KTSB_PHYS_SHIFT, REG1; \
+- .previous; \
++ sllx REG1, 32, REG1; \
++ or REG1, REG2, REG1; \
+ and TAG, (KERNEL_TSB4M_NENTRIES - 1), REG2; \
+ sllx REG2, 4, REG2; \
+ add REG1, REG2, REG2; \
+diff --git a/arch/sparc/include/asm/visasm.h b/arch/sparc/include/asm/visasm.h
+index b26673759283..1f0aa2024e94 100644
+--- a/arch/sparc/include/asm/visasm.h
++++ b/arch/sparc/include/asm/visasm.h
+@@ -39,6 +39,14 @@
+ 297: wr %o5, FPRS_FEF, %fprs; \
+ 298:
+
++#define VISEntryHalfFast(fail_label) \
++ rd %fprs, %o5; \
++ andcc %o5, FPRS_FEF, %g0; \
++ be,pt %icc, 297f; \
++ nop; \
++ ba,a,pt %xcc, fail_label; \
++297: wr %o5, FPRS_FEF, %fprs;
++
+ #define VISExitHalf \
+ wr %o5, 0, %fprs;
+
+diff --git a/arch/sparc/kernel/cpu.c b/arch/sparc/kernel/cpu.c
+index 82a3a71c451e..dfad8b1aea9f 100644
+--- a/arch/sparc/kernel/cpu.c
++++ b/arch/sparc/kernel/cpu.c
+@@ -494,6 +494,18 @@ static void __init sun4v_cpu_probe(void)
+ sparc_pmu_type = "niagara5";
+ break;
+
++ case SUN4V_CHIP_SPARC_M6:
++ sparc_cpu_type = "SPARC-M6";
++ sparc_fpu_type = "SPARC-M6 integrated FPU";
++ sparc_pmu_type = "sparc-m6";
++ break;
++
++ case SUN4V_CHIP_SPARC_M7:
++ sparc_cpu_type = "SPARC-M7";
++ sparc_fpu_type = "SPARC-M7 integrated FPU";
++ sparc_pmu_type = "sparc-m7";
++ break;
++
+ case SUN4V_CHIP_SPARC64X:
+ sparc_cpu_type = "SPARC64-X";
+ sparc_fpu_type = "SPARC64-X integrated FPU";
+diff --git a/arch/sparc/kernel/cpumap.c b/arch/sparc/kernel/cpumap.c
+index de1c844dfabc..e69ec0e3f155 100644
+--- a/arch/sparc/kernel/cpumap.c
++++ b/arch/sparc/kernel/cpumap.c
+@@ -326,6 +326,8 @@ static int iterate_cpu(struct cpuinfo_tree *t, unsigned int root_index)
+ case SUN4V_CHIP_NIAGARA3:
+ case SUN4V_CHIP_NIAGARA4:
+ case SUN4V_CHIP_NIAGARA5:
++ case SUN4V_CHIP_SPARC_M6:
++ case SUN4V_CHIP_SPARC_M7:
+ case SUN4V_CHIP_SPARC64X:
+ rover_inc_table = niagara_iterate_method;
+ break;
+diff --git a/arch/sparc/kernel/ds.c b/arch/sparc/kernel/ds.c
+index dff60abbea01..f87a55d77094 100644
+--- a/arch/sparc/kernel/ds.c
++++ b/arch/sparc/kernel/ds.c
+@@ -1200,14 +1200,14 @@ static int ds_probe(struct vio_dev *vdev, const struct vio_device_id *id)
+ ds_cfg.tx_irq = vdev->tx_irq;
+ ds_cfg.rx_irq = vdev->rx_irq;
+
+- lp = ldc_alloc(vdev->channel_id, &ds_cfg, dp);
++ lp = ldc_alloc(vdev->channel_id, &ds_cfg, dp, "DS");
+ if (IS_ERR(lp)) {
+ err = PTR_ERR(lp);
+ goto out_free_ds_states;
+ }
+ dp->lp = lp;
+
+- err = ldc_bind(lp, "DS");
++ err = ldc_bind(lp);
+ if (err)
+ goto out_free_ldc;
+
+diff --git a/arch/sparc/kernel/dtlb_prot.S b/arch/sparc/kernel/dtlb_prot.S
+index b2c2c5be281c..d668ca149e64 100644
+--- a/arch/sparc/kernel/dtlb_prot.S
++++ b/arch/sparc/kernel/dtlb_prot.S
+@@ -24,11 +24,11 @@
+ mov TLB_TAG_ACCESS, %g4 ! For reload of vaddr
+
+ /* PROT ** ICACHE line 2: More real fault processing */
++ ldxa [%g4] ASI_DMMU, %g5 ! Put tagaccess in %g5
+ bgu,pn %xcc, winfix_trampoline ! Yes, perform winfixup
+- ldxa [%g4] ASI_DMMU, %g5 ! Put tagaccess in %g5
+- ba,pt %xcc, sparc64_realfault_common ! Nope, normal fault
+ mov FAULT_CODE_DTLB | FAULT_CODE_WRITE, %g4
+- nop
++ ba,pt %xcc, sparc64_realfault_common ! Nope, normal fault
++ nop
+ nop
+ nop
+ nop
+diff --git a/arch/sparc/kernel/entry.h b/arch/sparc/kernel/entry.h
+index ebaba6167dd4..88d322b67fac 100644
+--- a/arch/sparc/kernel/entry.h
++++ b/arch/sparc/kernel/entry.h
+@@ -65,13 +65,10 @@ struct pause_patch_entry {
+ extern struct pause_patch_entry __pause_3insn_patch,
+ __pause_3insn_patch_end;
+
+-void __init per_cpu_patch(void);
+ void sun4v_patch_1insn_range(struct sun4v_1insn_patch_entry *,
+ struct sun4v_1insn_patch_entry *);
+ void sun4v_patch_2insn_range(struct sun4v_2insn_patch_entry *,
+ struct sun4v_2insn_patch_entry *);
+-void __init sun4v_patch(void);
+-void __init boot_cpu_id_too_large(int cpu);
+ extern unsigned int dcache_parity_tl1_occurred;
+ extern unsigned int icache_parity_tl1_occurred;
+
+diff --git a/arch/sparc/kernel/head_64.S b/arch/sparc/kernel/head_64.S
+index 452f04fe8da6..3d61fcae7ee3 100644
+--- a/arch/sparc/kernel/head_64.S
++++ b/arch/sparc/kernel/head_64.S
+@@ -427,6 +427,12 @@ sun4v_chip_type:
+ cmp %g2, '5'
+ be,pt %xcc, 5f
+ mov SUN4V_CHIP_NIAGARA5, %g4
++ cmp %g2, '6'
++ be,pt %xcc, 5f
++ mov SUN4V_CHIP_SPARC_M6, %g4
++ cmp %g2, '7'
++ be,pt %xcc, 5f
++ mov SUN4V_CHIP_SPARC_M7, %g4
+ ba,pt %xcc, 49f
+ nop
+
+@@ -585,6 +591,12 @@ niagara_tlb_fixup:
+ cmp %g1, SUN4V_CHIP_NIAGARA5
+ be,pt %xcc, niagara4_patch
+ nop
++ cmp %g1, SUN4V_CHIP_SPARC_M6
++ be,pt %xcc, niagara4_patch
++ nop
++ cmp %g1, SUN4V_CHIP_SPARC_M7
++ be,pt %xcc, niagara4_patch
++ nop
+
+ call generic_patch_copyops
+ nop
+@@ -660,14 +672,12 @@ tlb_fixup_done:
+ sethi %hi(init_thread_union), %g6
+ or %g6, %lo(init_thread_union), %g6
+ ldx [%g6 + TI_TASK], %g4
+- mov %sp, %l6
+
+ wr %g0, ASI_P, %asi
+ mov 1, %g1
+ sllx %g1, THREAD_SHIFT, %g1
+ sub %g1, (STACKFRAME_SZ + STACK_BIAS), %g1
+ add %g6, %g1, %sp
+- mov 0, %fp
+
+ /* Set per-cpu pointer initially to zero, this makes
+ * the boot-cpu use the in-kernel-image per-cpu areas
+@@ -694,44 +704,14 @@ tlb_fixup_done:
+ nop
+ #endif
+
+- mov %l6, %o1 ! OpenPROM stack
+ call prom_init
+ mov %l7, %o0 ! OpenPROM cif handler
+
+- /* Initialize current_thread_info()->cpu as early as possible.
+- * In order to do that accurately we have to patch up the get_cpuid()
+- * assembler sequences. And that, in turn, requires that we know
+- * if we are on a Starfire box or not. While we're here, patch up
+- * the sun4v sequences as well.
++ /* To create a one-register-window buffer between the kernel's
++ * initial stack and the last stack frame we use from the firmware,
++ * do the rest of the boot from a C helper function.
+ */
+- call check_if_starfire
+- nop
+- call per_cpu_patch
+- nop
+- call sun4v_patch
+- nop
+-
+-#ifdef CONFIG_SMP
+- call hard_smp_processor_id
+- nop
+- cmp %o0, NR_CPUS
+- blu,pt %xcc, 1f
+- nop
+- call boot_cpu_id_too_large
+- nop
+- /* Not reached... */
+-
+-1:
+-#else
+- mov 0, %o0
+-#endif
+- sth %o0, [%g6 + TI_CPU]
+-
+- call prom_init_report
+- nop
+-
+- /* Off we go.... */
+- call start_kernel
++ call start_early_boot
+ nop
+ /* Not reached... */
+
+diff --git a/arch/sparc/kernel/hvapi.c b/arch/sparc/kernel/hvapi.c
+index c0a2de0fd624..5c55145bfbf0 100644
+--- a/arch/sparc/kernel/hvapi.c
++++ b/arch/sparc/kernel/hvapi.c
+@@ -46,6 +46,7 @@ static struct api_info api_table[] = {
+ { .group = HV_GRP_VF_CPU, },
+ { .group = HV_GRP_KT_CPU, },
+ { .group = HV_GRP_VT_CPU, },
++ { .group = HV_GRP_T5_CPU, },
+ { .group = HV_GRP_DIAG, .flags = FLAG_PRE_API },
+ };
+
+diff --git a/arch/sparc/kernel/hvcalls.S b/arch/sparc/kernel/hvcalls.S
+index f3ab509b76a8..caedf8320416 100644
+--- a/arch/sparc/kernel/hvcalls.S
++++ b/arch/sparc/kernel/hvcalls.S
+@@ -821,3 +821,19 @@ ENTRY(sun4v_vt_set_perfreg)
+ retl
+ nop
+ ENDPROC(sun4v_vt_set_perfreg)
++
++ENTRY(sun4v_t5_get_perfreg)
++ mov %o1, %o4
++ mov HV_FAST_T5_GET_PERFREG, %o5
++ ta HV_FAST_TRAP
++ stx %o1, [%o4]
++ retl
++ nop
++ENDPROC(sun4v_t5_get_perfreg)
++
++ENTRY(sun4v_t5_set_perfreg)
++ mov HV_FAST_T5_SET_PERFREG, %o5
++ ta HV_FAST_TRAP
++ retl
++ nop
++ENDPROC(sun4v_t5_set_perfreg)
+diff --git a/arch/sparc/kernel/hvtramp.S b/arch/sparc/kernel/hvtramp.S
+index b7ddcdd1dea9..cdbfec299f2f 100644
+--- a/arch/sparc/kernel/hvtramp.S
++++ b/arch/sparc/kernel/hvtramp.S
+@@ -109,7 +109,6 @@ hv_cpu_startup:
+ sllx %g5, THREAD_SHIFT, %g5
+ sub %g5, (STACKFRAME_SZ + STACK_BIAS), %g5
+ add %g6, %g5, %sp
+- mov 0, %fp
+
+ call init_irqwork_curcpu
+ nop
+diff --git a/arch/sparc/kernel/ioport.c b/arch/sparc/kernel/ioport.c
+index 7f08ec8a7c68..28fed53b13a0 100644
+--- a/arch/sparc/kernel/ioport.c
++++ b/arch/sparc/kernel/ioport.c
+@@ -278,7 +278,8 @@ static void *sbus_alloc_coherent(struct device *dev, size_t len,
+ }
+
+ order = get_order(len_total);
+- if ((va = __get_free_pages(GFP_KERNEL|__GFP_COMP, order)) == 0)
++ va = __get_free_pages(gfp, order);
++ if (va == 0)
+ goto err_nopages;
+
+ if ((res = kzalloc(sizeof(struct resource), GFP_KERNEL)) == NULL)
+@@ -443,7 +444,7 @@ static void *pci32_alloc_coherent(struct device *dev, size_t len,
+ }
+
+ order = get_order(len_total);
+- va = (void *) __get_free_pages(GFP_KERNEL, order);
++ va = (void *) __get_free_pages(gfp, order);
+ if (va == NULL) {
+ printk("pci_alloc_consistent: no %ld pages\n", len_total>>PAGE_SHIFT);
+ goto err_nopages;
+diff --git a/arch/sparc/kernel/irq_64.c b/arch/sparc/kernel/irq_64.c
+index 666193f4e8bb..4033c23bdfa6 100644
+--- a/arch/sparc/kernel/irq_64.c
++++ b/arch/sparc/kernel/irq_64.c
+@@ -47,8 +47,6 @@
+ #include "cpumap.h"
+ #include "kstack.h"
+
+-#define NUM_IVECS (IMAP_INR + 1)
+-
+ struct ino_bucket *ivector_table;
+ unsigned long ivector_table_pa;
+
+@@ -107,55 +105,196 @@ static void bucket_set_irq(unsigned long bucket_pa, unsigned int irq)
+
+ #define irq_work_pa(__cpu) &(trap_block[(__cpu)].irq_worklist_pa)
+
+-static struct {
+- unsigned int dev_handle;
+- unsigned int dev_ino;
+- unsigned int in_use;
+-} irq_table[NR_IRQS];
+-static DEFINE_SPINLOCK(irq_alloc_lock);
++static unsigned long hvirq_major __initdata;
++static int __init early_hvirq_major(char *p)
++{
++ int rc = kstrtoul(p, 10, &hvirq_major);
++
++ return rc;
++}
++early_param("hvirq", early_hvirq_major);
++
++static int hv_irq_version;
++
++/* Major version 2.0 of HV_GRP_INTR added support for the VIRQ cookie
++ * based interfaces, but:
++ *
++ * 1) Several OSs, Solaris and Linux included, use them even when only
++ * negotiating version 1.0 (or failing to negotiate at all). So the
++ * hypervisor has a workaround that provides the VIRQ interfaces even
+ * when only version 1.0 of the API is in use.
++ *
++ * 2) Second, and more importantly, with major version 2.0 these VIRQ
++ * interfaces only were actually hooked up for LDC interrupts, even
++ * though the Hypervisor specification clearly stated:
++ *
++ * The new interrupt API functions will be available to a guest
++ * when it negotiates version 2.0 in the interrupt API group 0x2. When
++ * a guest negotiates version 2.0, all interrupt sources will only
++ * support using the cookie interface, and any attempt to use the
++ * version 1.0 interrupt APIs numbered 0xa0 to 0xa6 will result in the
++ * ENOTSUPPORTED error being returned.
++ *
++ * with an emphasis on "all interrupt sources".
++ *
++ * To correct this, major version 3.0 was created which does actually
++ * support VIRQs for all interrupt sources (not just LDC devices). So
+ * if we want to move completely over to the cookie based VIRQs we must
++ * negotiate major version 3.0 or later of HV_GRP_INTR.
++ */
++static bool sun4v_cookie_only_virqs(void)
++{
++ if (hv_irq_version >= 3)
++ return true;
++ return false;
++}
+
+-unsigned char irq_alloc(unsigned int dev_handle, unsigned int dev_ino)
++static void __init irq_init_hv(void)
+ {
+- unsigned long flags;
+- unsigned char ent;
++ unsigned long hv_error, major, minor = 0;
++
++ if (tlb_type != hypervisor)
++ return;
+
+- BUILD_BUG_ON(NR_IRQS >= 256);
++ if (hvirq_major)
++ major = hvirq_major;
++ else
++ major = 3;
+
+- spin_lock_irqsave(&irq_alloc_lock, flags);
++ hv_error = sun4v_hvapi_register(HV_GRP_INTR, major, &minor);
++ if (!hv_error)
++ hv_irq_version = major;
++ else
++ hv_irq_version = 1;
+
+- for (ent = 1; ent < NR_IRQS; ent++) {
+- if (!irq_table[ent].in_use)
++ pr_info("SUN4V: Using IRQ API major %d, cookie only virqs %s\n",
++ hv_irq_version,
++ sun4v_cookie_only_virqs() ? "enabled" : "disabled");
++}
++
++/* This function is for the timer interrupt. */
++int __init arch_probe_nr_irqs(void)
++{
++ return 1;
++}
++
++#define DEFAULT_NUM_IVECS (0xfffU)
++static unsigned int nr_ivec = DEFAULT_NUM_IVECS;
++#define NUM_IVECS (nr_ivec)
++
++static unsigned int __init size_nr_ivec(void)
++{
++ if (tlb_type == hypervisor) {
++ switch (sun4v_chip_type) {
++ /* Athena's devhandle|devino is large. */
++ case SUN4V_CHIP_SPARC64X:
++ nr_ivec = 0xffff;
+ break;
++ }
+ }
+- if (ent >= NR_IRQS) {
+- printk(KERN_ERR "IRQ: Out of virtual IRQs.\n");
+- ent = 0;
+- } else {
+- irq_table[ent].dev_handle = dev_handle;
+- irq_table[ent].dev_ino = dev_ino;
+- irq_table[ent].in_use = 1;
+- }
++ return nr_ivec;
++}
++
++struct irq_handler_data {
++ union {
++ struct {
++ unsigned int dev_handle;
++ unsigned int dev_ino;
++ };
++ unsigned long sysino;
++ };
++ struct ino_bucket bucket;
++ unsigned long iclr;
++ unsigned long imap;
++};
++
++static inline unsigned int irq_data_to_handle(struct irq_data *data)
++{
++ struct irq_handler_data *ihd = data->handler_data;
++
++ return ihd->dev_handle;
++}
++
++static inline unsigned int irq_data_to_ino(struct irq_data *data)
++{
++ struct irq_handler_data *ihd = data->handler_data;
+
+- spin_unlock_irqrestore(&irq_alloc_lock, flags);
++ return ihd->dev_ino;
++}
++
++static inline unsigned long irq_data_to_sysino(struct irq_data *data)
++{
++ struct irq_handler_data *ihd = data->handler_data;
+
+- return ent;
++ return ihd->sysino;
+ }
+
+-#ifdef CONFIG_PCI_MSI
+ void irq_free(unsigned int irq)
+ {
+- unsigned long flags;
++ void *data = irq_get_handler_data(irq);
+
+- if (irq >= NR_IRQS)
+- return;
++ kfree(data);
++ irq_set_handler_data(irq, NULL);
++ irq_free_descs(irq, 1);
++}
+
+- spin_lock_irqsave(&irq_alloc_lock, flags);
++unsigned int irq_alloc(unsigned int dev_handle, unsigned int dev_ino)
++{
++ int irq;
+
+- irq_table[irq].in_use = 0;
++ irq = __irq_alloc_descs(-1, 1, 1, numa_node_id(), NULL);
++ if (irq <= 0)
++ goto out;
+
+- spin_unlock_irqrestore(&irq_alloc_lock, flags);
++ return irq;
++out:
++ return 0;
++}
++
++static unsigned int cookie_exists(u32 devhandle, unsigned int devino)
++{
++ unsigned long hv_err, cookie;
++ struct ino_bucket *bucket;
++ unsigned int irq = 0U;
++
++ hv_err = sun4v_vintr_get_cookie(devhandle, devino, &cookie);
++ if (hv_err) {
++ pr_err("HV get cookie failed hv_err = %ld\n", hv_err);
++ goto out;
++ }
++
++ if (cookie & ((1UL << 63UL))) {
++ cookie = ~cookie;
++ bucket = (struct ino_bucket *) __va(cookie);
++ irq = bucket->__irq;
++ }
++out:
++ return irq;
++}
++
++static unsigned int sysino_exists(u32 devhandle, unsigned int devino)
++{
++ unsigned long sysino = sun4v_devino_to_sysino(devhandle, devino);
++ struct ino_bucket *bucket;
++ unsigned int irq;
++
++ bucket = &ivector_table[sysino];
++ irq = bucket_get_irq(__pa(bucket));
++
++ return irq;
++}
++
++void ack_bad_irq(unsigned int irq)
++{
++ pr_crit("BAD IRQ ack %d\n", irq);
++}
++
++void irq_install_pre_handler(int irq,
++ void (*func)(unsigned int, void *, void *),
++ void *arg1, void *arg2)
++{
++ pr_warn("IRQ pre handler NOT supported.\n");
+ }
+-#endif
+
+ /*
+ * /proc/interrupts printing:
+@@ -206,15 +345,6 @@ static unsigned int sun4u_compute_tid(unsigned long imap, unsigned long cpuid)
+ return tid;
+ }
+
+-struct irq_handler_data {
+- unsigned long iclr;
+- unsigned long imap;
+-
+- void (*pre_handler)(unsigned int, void *, void *);
+- void *arg1;
+- void *arg2;
+-};
+-
+ #ifdef CONFIG_SMP
+ static int irq_choose_cpu(unsigned int irq, const struct cpumask *affinity)
+ {
+@@ -316,8 +446,8 @@ static void sun4u_irq_eoi(struct irq_data *data)
+
+ static void sun4v_irq_enable(struct irq_data *data)
+ {
+- unsigned int ino = irq_table[data->irq].dev_ino;
+ unsigned long cpuid = irq_choose_cpu(data->irq, data->affinity);
++ unsigned int ino = irq_data_to_sysino(data);
+ int err;
+
+ err = sun4v_intr_settarget(ino, cpuid);
+@@ -337,8 +467,8 @@ static void sun4v_irq_enable(struct irq_data *data)
+ static int sun4v_set_affinity(struct irq_data *data,
+ const struct cpumask *mask, bool force)
+ {
+- unsigned int ino = irq_table[data->irq].dev_ino;
+ unsigned long cpuid = irq_choose_cpu(data->irq, mask);
++ unsigned int ino = irq_data_to_sysino(data);
+ int err;
+
+ err = sun4v_intr_settarget(ino, cpuid);
+@@ -351,7 +481,7 @@ static int sun4v_set_affinity(struct irq_data *data,
+
+ static void sun4v_irq_disable(struct irq_data *data)
+ {
+- unsigned int ino = irq_table[data->irq].dev_ino;
++ unsigned int ino = irq_data_to_sysino(data);
+ int err;
+
+ err = sun4v_intr_setenabled(ino, HV_INTR_DISABLED);
+@@ -362,7 +492,7 @@ static void sun4v_irq_disable(struct irq_data *data)
+
+ static void sun4v_irq_eoi(struct irq_data *data)
+ {
+- unsigned int ino = irq_table[data->irq].dev_ino;
++ unsigned int ino = irq_data_to_sysino(data);
+ int err;
+
+ err = sun4v_intr_setstate(ino, HV_INTR_STATE_IDLE);
+@@ -373,14 +503,13 @@ static void sun4v_irq_eoi(struct irq_data *data)
+
+ static void sun4v_virq_enable(struct irq_data *data)
+ {
+- unsigned long cpuid, dev_handle, dev_ino;
++ unsigned long dev_handle = irq_data_to_handle(data);
++ unsigned long dev_ino = irq_data_to_ino(data);
++ unsigned long cpuid;
+ int err;
+
+ cpuid = irq_choose_cpu(data->irq, data->affinity);
+
+- dev_handle = irq_table[data->irq].dev_handle;
+- dev_ino = irq_table[data->irq].dev_ino;
+-
+ err = sun4v_vintr_set_target(dev_handle, dev_ino, cpuid);
+ if (err != HV_EOK)
+ printk(KERN_ERR "sun4v_vintr_set_target(%lx,%lx,%lu): "
+@@ -403,14 +532,13 @@ static void sun4v_virq_enable(struct irq_data *data)
+ static int sun4v_virt_set_affinity(struct irq_data *data,
+ const struct cpumask *mask, bool force)
+ {
+- unsigned long cpuid, dev_handle, dev_ino;
++ unsigned long dev_handle = irq_data_to_handle(data);
++ unsigned long dev_ino = irq_data_to_ino(data);
++ unsigned long cpuid;
+ int err;
+
+ cpuid = irq_choose_cpu(data->irq, mask);
+
+- dev_handle = irq_table[data->irq].dev_handle;
+- dev_ino = irq_table[data->irq].dev_ino;
+-
+ err = sun4v_vintr_set_target(dev_handle, dev_ino, cpuid);
+ if (err != HV_EOK)
+ printk(KERN_ERR "sun4v_vintr_set_target(%lx,%lx,%lu): "
+@@ -422,11 +550,10 @@ static int sun4v_virt_set_affinity(struct irq_data *data,
+
+ static void sun4v_virq_disable(struct irq_data *data)
+ {
+- unsigned long dev_handle, dev_ino;
++ unsigned long dev_handle = irq_data_to_handle(data);
++ unsigned long dev_ino = irq_data_to_ino(data);
+ int err;
+
+- dev_handle = irq_table[data->irq].dev_handle;
+- dev_ino = irq_table[data->irq].dev_ino;
+
+ err = sun4v_vintr_set_valid(dev_handle, dev_ino,
+ HV_INTR_DISABLED);
+@@ -438,12 +565,10 @@ static void sun4v_virq_disable(struct irq_data *data)
+
+ static void sun4v_virq_eoi(struct irq_data *data)
+ {
+- unsigned long dev_handle, dev_ino;
++ unsigned long dev_handle = irq_data_to_handle(data);
++ unsigned long dev_ino = irq_data_to_ino(data);
+ int err;
+
+- dev_handle = irq_table[data->irq].dev_handle;
+- dev_ino = irq_table[data->irq].dev_ino;
+-
+ err = sun4v_vintr_set_state(dev_handle, dev_ino,
+ HV_INTR_STATE_IDLE);
+ if (err != HV_EOK)
+@@ -479,31 +604,10 @@ static struct irq_chip sun4v_virq = {
+ .flags = IRQCHIP_EOI_IF_HANDLED,
+ };
+
+-static void pre_flow_handler(struct irq_data *d)
+-{
+- struct irq_handler_data *handler_data = irq_data_get_irq_handler_data(d);
+- unsigned int ino = irq_table[d->irq].dev_ino;
+-
+- handler_data->pre_handler(ino, handler_data->arg1, handler_data->arg2);
+-}
+-
+-void irq_install_pre_handler(int irq,
+- void (*func)(unsigned int, void *, void *),
+- void *arg1, void *arg2)
+-{
+- struct irq_handler_data *handler_data = irq_get_handler_data(irq);
+-
+- handler_data->pre_handler = func;
+- handler_data->arg1 = arg1;
+- handler_data->arg2 = arg2;
+-
+- __irq_set_preflow_handler(irq, pre_flow_handler);
+-}
+-
+ unsigned int build_irq(int inofixup, unsigned long iclr, unsigned long imap)
+ {
+- struct ino_bucket *bucket;
+ struct irq_handler_data *handler_data;
++ struct ino_bucket *bucket;
+ unsigned int irq;
+ int ino;
+
+@@ -537,119 +641,166 @@ out:
+ return irq;
+ }
+
+-static unsigned int sun4v_build_common(unsigned long sysino,
+- struct irq_chip *chip)
++static unsigned int sun4v_build_common(u32 devhandle, unsigned int devino,
++ void (*handler_data_init)(struct irq_handler_data *data,
++ u32 devhandle, unsigned int devino),
++ struct irq_chip *chip)
+ {
+- struct ino_bucket *bucket;
+- struct irq_handler_data *handler_data;
++ struct irq_handler_data *data;
+ unsigned int irq;
+
+- BUG_ON(tlb_type != hypervisor);
++ irq = irq_alloc(devhandle, devino);
++ if (!irq)
++ goto out;
+
+- bucket = &ivector_table[sysino];
+- irq = bucket_get_irq(__pa(bucket));
+- if (!irq) {
+- irq = irq_alloc(0, sysino);
+- bucket_set_irq(__pa(bucket), irq);
+- irq_set_chip_and_handler_name(irq, chip, handle_fasteoi_irq,
+- "IVEC");
++ data = kzalloc(sizeof(struct irq_handler_data), GFP_ATOMIC);
++ if (unlikely(!data)) {
++ pr_err("IRQ handler data allocation failed.\n");
++ irq_free(irq);
++ irq = 0;
++ goto out;
+ }
+
+- handler_data = irq_get_handler_data(irq);
+- if (unlikely(handler_data))
+- goto out;
++ irq_set_handler_data(irq, data);
++ handler_data_init(data, devhandle, devino);
++ irq_set_chip_and_handler_name(irq, chip, handle_fasteoi_irq, "IVEC");
++ data->imap = ~0UL;
++ data->iclr = ~0UL;
++out:
++ return irq;
++}
+
+- handler_data = kzalloc(sizeof(struct irq_handler_data), GFP_ATOMIC);
+- if (unlikely(!handler_data)) {
+- prom_printf("IRQ: kzalloc(irq_handler_data) failed.\n");
+- prom_halt();
+- }
+- irq_set_handler_data(irq, handler_data);
++static unsigned long cookie_assign(unsigned int irq, u32 devhandle,
++ unsigned int devino)
++{
++ struct irq_handler_data *ihd = irq_get_handler_data(irq);
++ unsigned long hv_error, cookie;
+
+- /* Catch accidental accesses to these things. IMAP/ICLR handling
+- * is done by hypervisor calls on sun4v platforms, not by direct
+- * register accesses.
++ /* handler_irq needs to find the irq. The cookie is seen as signed in
++ * sun4v_dev_mondo and treated as a non-ivector_table delivery.
+ */
+- handler_data->imap = ~0UL;
+- handler_data->iclr = ~0UL;
++ ihd->bucket.__irq = irq;
++ cookie = ~__pa(&ihd->bucket);
+
+-out:
+- return irq;
++ hv_error = sun4v_vintr_set_cookie(devhandle, devino, cookie);
++ if (hv_error)
++ pr_err("HV vintr set cookie failed = %ld\n", hv_error);
++
++ return hv_error;
+ }
+
+-unsigned int sun4v_build_irq(u32 devhandle, unsigned int devino)
++static void cookie_handler_data(struct irq_handler_data *data,
++ u32 devhandle, unsigned int devino)
+ {
+- unsigned long sysino = sun4v_devino_to_sysino(devhandle, devino);
++ data->dev_handle = devhandle;
++ data->dev_ino = devino;
++}
+
+- return sun4v_build_common(sysino, &sun4v_irq);
++static unsigned int cookie_build_irq(u32 devhandle, unsigned int devino,
++ struct irq_chip *chip)
++{
++ unsigned long hv_error;
++ unsigned int irq;
++
++ irq = sun4v_build_common(devhandle, devino, cookie_handler_data, chip);
++
++ hv_error = cookie_assign(irq, devhandle, devino);
++ if (hv_error) {
++ irq_free(irq);
++ irq = 0;
++ }
++
++ return irq;
+ }
+
+-unsigned int sun4v_build_virq(u32 devhandle, unsigned int devino)
++static unsigned int sun4v_build_cookie(u32 devhandle, unsigned int devino)
+ {
+- struct irq_handler_data *handler_data;
+- unsigned long hv_err, cookie;
+- struct ino_bucket *bucket;
+ unsigned int irq;
+
+- bucket = kzalloc(sizeof(struct ino_bucket), GFP_ATOMIC);
+- if (unlikely(!bucket))
+- return 0;
++ irq = cookie_exists(devhandle, devino);
++ if (irq)
++ goto out;
+
+- /* The only reference we store to the IRQ bucket is
+- * by physical address which kmemleak can't see, tell
+- * it that this object explicitly is not a leak and
+- * should be scanned.
+- */
+- kmemleak_not_leak(bucket);
++ irq = cookie_build_irq(devhandle, devino, &sun4v_virq);
+
+- __flush_dcache_range((unsigned long) bucket,
+- ((unsigned long) bucket +
+- sizeof(struct ino_bucket)));
++out:
++ return irq;
++}
+
+- irq = irq_alloc(devhandle, devino);
++static void sysino_set_bucket(unsigned int irq)
++{
++ struct irq_handler_data *ihd = irq_get_handler_data(irq);
++ struct ino_bucket *bucket;
++ unsigned long sysino;
++
++ sysino = sun4v_devino_to_sysino(ihd->dev_handle, ihd->dev_ino);
++ BUG_ON(sysino >= nr_ivec);
++ bucket = &ivector_table[sysino];
+ bucket_set_irq(__pa(bucket), irq);
++}
+
+- irq_set_chip_and_handler_name(irq, &sun4v_virq, handle_fasteoi_irq,
+- "IVEC");
++static void sysino_handler_data(struct irq_handler_data *data,
++ u32 devhandle, unsigned int devino)
++{
++ unsigned long sysino;
+
+- handler_data = kzalloc(sizeof(struct irq_handler_data), GFP_ATOMIC);
+- if (unlikely(!handler_data))
+- return 0;
++ sysino = sun4v_devino_to_sysino(devhandle, devino);
++ data->sysino = sysino;
++}
+
+- /* In order to make the LDC channel startup sequence easier,
+- * especially wrt. locking, we do not let request_irq() enable
+- * the interrupt.
+- */
+- irq_set_status_flags(irq, IRQ_NOAUTOEN);
+- irq_set_handler_data(irq, handler_data);
++static unsigned int sysino_build_irq(u32 devhandle, unsigned int devino,
++ struct irq_chip *chip)
++{
++ unsigned int irq;
+
+- /* Catch accidental accesses to these things. IMAP/ICLR handling
+- * is done by hypervisor calls on sun4v platforms, not by direct
+- * register accesses.
+- */
+- handler_data->imap = ~0UL;
+- handler_data->iclr = ~0UL;
++ irq = sun4v_build_common(devhandle, devino, sysino_handler_data, chip);
++ if (!irq)
++ goto out;
+
+- cookie = ~__pa(bucket);
+- hv_err = sun4v_vintr_set_cookie(devhandle, devino, cookie);
+- if (hv_err) {
+- prom_printf("IRQ: Fatal, cannot set cookie for [%x:%x] "
+- "err=%lu\n", devhandle, devino, hv_err);
+- prom_halt();
+- }
++ sysino_set_bucket(irq);
++out:
++ return irq;
++}
+
++static int sun4v_build_sysino(u32 devhandle, unsigned int devino)
++{
++ int irq;
++
++ irq = sysino_exists(devhandle, devino);
++ if (irq)
++ goto out;
++
++ irq = sysino_build_irq(devhandle, devino, &sun4v_irq);
++out:
+ return irq;
+ }
+
+-void ack_bad_irq(unsigned int irq)
++unsigned int sun4v_build_irq(u32 devhandle, unsigned int devino)
+ {
+- unsigned int ino = irq_table[irq].dev_ino;
++ unsigned int irq;
+
+- if (!ino)
+- ino = 0xdeadbeef;
++ if (sun4v_cookie_only_virqs())
++ irq = sun4v_build_cookie(devhandle, devino);
++ else
++ irq = sun4v_build_sysino(devhandle, devino);
+
+- printk(KERN_CRIT "Unexpected IRQ from ino[%x] irq[%u]\n",
+- ino, irq);
++ return irq;
++}
++
++unsigned int sun4v_build_virq(u32 devhandle, unsigned int devino)
++{
++ int irq;
++
++ irq = cookie_build_irq(devhandle, devino, &sun4v_virq);
++ if (!irq)
++ goto out;
++
++ /* This is borrowed from the original function.
++ */
++ irq_set_status_flags(irq, IRQ_NOAUTOEN);
++
++out:
++ return irq;
+ }
+
+ void *hardirq_stack[NR_CPUS];
+@@ -720,9 +871,12 @@ void fixup_irqs(void)
+
+ for (irq = 0; irq < NR_IRQS; irq++) {
+ struct irq_desc *desc = irq_to_desc(irq);
+- struct irq_data *data = irq_desc_get_irq_data(desc);
++ struct irq_data *data;
+ unsigned long flags;
+
++ if (!desc)
++ continue;
++ data = irq_desc_get_irq_data(desc);
+ raw_spin_lock_irqsave(&desc->lock, flags);
+ if (desc->action && !irqd_is_per_cpu(data)) {
+ if (data->chip->irq_set_affinity)
+@@ -922,16 +1076,22 @@ static struct irqaction timer_irq_action = {
+ .name = "timer",
+ };
+
+-/* Only invoked on boot processor. */
+-void __init init_IRQ(void)
++static void __init irq_ivector_init(void)
+ {
+- unsigned long size;
++ unsigned long size, order;
++ unsigned int ivecs;
+
+- map_prom_timers();
+- kill_prom_timer();
++ /* If we are doing cookie only VIRQs then we do not need the ivector
++ * table to process interrupts.
++ */
++ if (sun4v_cookie_only_virqs())
++ return;
+
+- size = sizeof(struct ino_bucket) * NUM_IVECS;
+- ivector_table = kzalloc(size, GFP_KERNEL);
++ ivecs = size_nr_ivec();
++ size = sizeof(struct ino_bucket) * ivecs;
++ order = get_order(size);
++ ivector_table = (struct ino_bucket *)
++ __get_free_pages(GFP_KERNEL | __GFP_ZERO, order);
+ if (!ivector_table) {
+ prom_printf("Fatal error, cannot allocate ivector_table\n");
+ prom_halt();
+@@ -940,6 +1100,15 @@ void __init init_IRQ(void)
+ ((unsigned long) ivector_table) + size);
+
+ ivector_table_pa = __pa(ivector_table);
++}
++
++/* Only invoked on boot processor. */
++void __init init_IRQ(void)
++{
++ irq_init_hv();
++ irq_ivector_init();
++ map_prom_timers();
++ kill_prom_timer();
+
+ if (tlb_type == hypervisor)
+ sun4v_init_mondo_queues();
+diff --git a/arch/sparc/kernel/ktlb.S b/arch/sparc/kernel/ktlb.S
+index 605d49204580..ef0d8e9e1210 100644
+--- a/arch/sparc/kernel/ktlb.S
++++ b/arch/sparc/kernel/ktlb.S
+@@ -47,14 +47,6 @@ kvmap_itlb_vmalloc_addr:
+ KERN_PGTABLE_WALK(%g4, %g5, %g2, kvmap_itlb_longpath)
+
+ TSB_LOCK_TAG(%g1, %g2, %g7)
+-
+- /* Load and check PTE. */
+- ldxa [%g5] ASI_PHYS_USE_EC, %g5
+- mov 1, %g7
+- sllx %g7, TSB_TAG_INVALID_BIT, %g7
+- brgez,a,pn %g5, kvmap_itlb_longpath
+- TSB_STORE(%g1, %g7)
+-
+ TSB_WRITE(%g1, %g5, %g6)
+
+ /* fallthrough to TLB load */
+@@ -118,6 +110,12 @@ kvmap_dtlb_obp:
+ ba,pt %xcc, kvmap_dtlb_load
+ nop
+
++kvmap_linear_early:
++ sethi %hi(kern_linear_pte_xor), %g7
++ ldx [%g7 + %lo(kern_linear_pte_xor)], %g2
++ ba,pt %xcc, kvmap_dtlb_tsb4m_load
++ xor %g2, %g4, %g5
++
+ .align 32
+ kvmap_dtlb_tsb4m_load:
+ TSB_LOCK_TAG(%g1, %g2, %g7)
+@@ -146,105 +144,17 @@ kvmap_dtlb_4v:
+ /* Correct TAG_TARGET is already in %g6, check 4mb TSB. */
+ KERN_TSB4M_LOOKUP_TL1(%g6, %g5, %g1, %g2, %g3, kvmap_dtlb_load)
+ #endif
+- /* TSB entry address left in %g1, lookup linear PTE.
+- * Must preserve %g1 and %g6 (TAG).
+- */
+-kvmap_dtlb_tsb4m_miss:
+- /* Clear the PAGE_OFFSET top virtual bits, shift
+- * down to get PFN, and make sure PFN is in range.
+- */
+-661: sllx %g4, 0, %g5
+- .section .page_offset_shift_patch, "ax"
+- .word 661b
+- .previous
+-
+- /* Check to see if we know about valid memory at the 4MB
+- * chunk this physical address will reside within.
++ /* Linear mapping TSB lookup failed. Fallthrough to kernel
++ * page table based lookup.
+ */
+-661: srlx %g5, MAX_PHYS_ADDRESS_BITS, %g2
+- .section .page_offset_shift_patch, "ax"
+- .word 661b
+- .previous
+-
+- brnz,pn %g2, kvmap_dtlb_longpath
+- nop
+-
+- /* This unconditional branch and delay-slot nop gets patched
+- * by the sethi sequence once the bitmap is properly setup.
+- */
+- .globl valid_addr_bitmap_insn
+-valid_addr_bitmap_insn:
+- ba,pt %xcc, 2f
+- nop
+- .subsection 2
+- .globl valid_addr_bitmap_patch
+-valid_addr_bitmap_patch:
+- sethi %hi(sparc64_valid_addr_bitmap), %g7
+- or %g7, %lo(sparc64_valid_addr_bitmap), %g7
+- .previous
+-
+-661: srlx %g5, ILOG2_4MB, %g2
+- .section .page_offset_shift_patch, "ax"
+- .word 661b
+- .previous
+-
+- srlx %g2, 6, %g5
+- and %g2, 63, %g2
+- sllx %g5, 3, %g5
+- ldx [%g7 + %g5], %g5
+- mov 1, %g7
+- sllx %g7, %g2, %g7
+- andcc %g5, %g7, %g0
+- be,pn %xcc, kvmap_dtlb_longpath
+-
+-2: sethi %hi(kpte_linear_bitmap), %g2
+-
+- /* Get the 256MB physical address index. */
+-661: sllx %g4, 0, %g5
+- .section .page_offset_shift_patch, "ax"
+- .word 661b
+- .previous
+-
+- or %g2, %lo(kpte_linear_bitmap), %g2
+-
+-661: srlx %g5, ILOG2_256MB, %g5
+- .section .page_offset_shift_patch, "ax"
+- .word 661b
+- .previous
+-
+- and %g5, (32 - 1), %g7
+-
+- /* Divide by 32 to get the offset into the bitmask. */
+- srlx %g5, 5, %g5
+- add %g7, %g7, %g7
+- sllx %g5, 3, %g5
+-
+- /* kern_linear_pte_xor[(mask >> shift) & 3)] */
+- ldx [%g2 + %g5], %g2
+- srlx %g2, %g7, %g7
+- sethi %hi(kern_linear_pte_xor), %g5
+- and %g7, 3, %g7
+- or %g5, %lo(kern_linear_pte_xor), %g5
+- sllx %g7, 3, %g7
+- ldx [%g5 + %g7], %g2
+-
+ .globl kvmap_linear_patch
+ kvmap_linear_patch:
+- ba,pt %xcc, kvmap_dtlb_tsb4m_load
+- xor %g2, %g4, %g5
++ ba,a,pt %xcc, kvmap_linear_early
+
+ kvmap_dtlb_vmalloc_addr:
+ KERN_PGTABLE_WALK(%g4, %g5, %g2, kvmap_dtlb_longpath)
+
+ TSB_LOCK_TAG(%g1, %g2, %g7)
+-
+- /* Load and check PTE. */
+- ldxa [%g5] ASI_PHYS_USE_EC, %g5
+- mov 1, %g7
+- sllx %g7, TSB_TAG_INVALID_BIT, %g7
+- brgez,a,pn %g5, kvmap_dtlb_longpath
+- TSB_STORE(%g1, %g7)
+-
+ TSB_WRITE(%g1, %g5, %g6)
+
+ /* fallthrough to TLB load */
+@@ -276,13 +186,8 @@ kvmap_dtlb_load:
+
+ #ifdef CONFIG_SPARSEMEM_VMEMMAP
+ kvmap_vmemmap:
+- sub %g4, %g5, %g5
+- srlx %g5, ILOG2_4MB, %g5
+- sethi %hi(vmemmap_table), %g1
+- sllx %g5, 3, %g5
+- or %g1, %lo(vmemmap_table), %g1
+- ba,pt %xcc, kvmap_dtlb_load
+- ldx [%g1 + %g5], %g5
++ KERN_PGTABLE_WALK(%g4, %g5, %g2, kvmap_dtlb_longpath)
++ ba,a,pt %xcc, kvmap_dtlb_load
+ #endif
+
+ kvmap_dtlb_nonlinear:
+@@ -294,8 +199,8 @@ kvmap_dtlb_nonlinear:
+
+ #ifdef CONFIG_SPARSEMEM_VMEMMAP
+ /* Do not use the TSB for vmemmap. */
+- mov (VMEMMAP_BASE >> 40), %g5
+- sllx %g5, 40, %g5
++ sethi %hi(VMEMMAP_BASE), %g5
++ ldx [%g5 + %lo(VMEMMAP_BASE)], %g5
+ cmp %g4,%g5
+ bgeu,pn %xcc, kvmap_vmemmap
+ nop
+@@ -307,8 +212,8 @@ kvmap_dtlb_tsbmiss:
+ sethi %hi(MODULES_VADDR), %g5
+ cmp %g4, %g5
+ blu,pn %xcc, kvmap_dtlb_longpath
+- mov (VMALLOC_END >> 40), %g5
+- sllx %g5, 40, %g5
++ sethi %hi(VMALLOC_END), %g5
++ ldx [%g5 + %lo(VMALLOC_END)], %g5
+ cmp %g4, %g5
+ bgeu,pn %xcc, kvmap_dtlb_longpath
+ nop
+diff --git a/arch/sparc/kernel/ldc.c b/arch/sparc/kernel/ldc.c
+index 66dacd56bb10..27bb55485472 100644
+--- a/arch/sparc/kernel/ldc.c
++++ b/arch/sparc/kernel/ldc.c
+@@ -1078,7 +1078,8 @@ static void ldc_iommu_release(struct ldc_channel *lp)
+
+ struct ldc_channel *ldc_alloc(unsigned long id,
+ const struct ldc_channel_config *cfgp,
+- void *event_arg)
++ void *event_arg,
++ const char *name)
+ {
+ struct ldc_channel *lp;
+ const struct ldc_mode_ops *mops;
+@@ -1093,6 +1094,8 @@ struct ldc_channel *ldc_alloc(unsigned long id,
+ err = -EINVAL;
+ if (!cfgp)
+ goto out_err;
++ if (!name)
++ goto out_err;
+
+ switch (cfgp->mode) {
+ case LDC_MODE_RAW:
+@@ -1185,6 +1188,21 @@ struct ldc_channel *ldc_alloc(unsigned long id,
+
+ INIT_HLIST_HEAD(&lp->mh_list);
+
++ snprintf(lp->rx_irq_name, LDC_IRQ_NAME_MAX, "%s RX", name);
++ snprintf(lp->tx_irq_name, LDC_IRQ_NAME_MAX, "%s TX", name);
++
++ err = request_irq(lp->cfg.rx_irq, ldc_rx, 0,
++ lp->rx_irq_name, lp);
++ if (err)
++ goto out_free_txq;
++
++ err = request_irq(lp->cfg.tx_irq, ldc_tx, 0,
++ lp->tx_irq_name, lp);
++ if (err) {
++ free_irq(lp->cfg.rx_irq, lp);
++ goto out_free_txq;
++ }
++
+ return lp;
+
+ out_free_txq:
+@@ -1237,31 +1255,14 @@ EXPORT_SYMBOL(ldc_free);
+ * state. This does not initiate a handshake, ldc_connect() does
+ * that.
+ */
+-int ldc_bind(struct ldc_channel *lp, const char *name)
++int ldc_bind(struct ldc_channel *lp)
+ {
+ unsigned long hv_err, flags;
+ int err = -EINVAL;
+
+- if (!name ||
+- (lp->state != LDC_STATE_INIT))
++ if (lp->state != LDC_STATE_INIT)
+ return -EINVAL;
+
+- snprintf(lp->rx_irq_name, LDC_IRQ_NAME_MAX, "%s RX", name);
+- snprintf(lp->tx_irq_name, LDC_IRQ_NAME_MAX, "%s TX", name);
+-
+- err = request_irq(lp->cfg.rx_irq, ldc_rx, 0,
+- lp->rx_irq_name, lp);
+- if (err)
+- return err;
+-
+- err = request_irq(lp->cfg.tx_irq, ldc_tx, 0,
+- lp->tx_irq_name, lp);
+- if (err) {
+- free_irq(lp->cfg.rx_irq, lp);
+- return err;
+- }
+-
+-
+ spin_lock_irqsave(&lp->lock, flags);
+
+ enable_irq(lp->cfg.rx_irq);
+diff --git a/arch/sparc/kernel/nmi.c b/arch/sparc/kernel/nmi.c
+index 337094556916..5b1151dcba13 100644
+--- a/arch/sparc/kernel/nmi.c
++++ b/arch/sparc/kernel/nmi.c
+@@ -130,7 +130,6 @@ static inline unsigned int get_nmi_count(int cpu)
+
+ static __init void nmi_cpu_busy(void *data)
+ {
+- local_irq_enable_in_hardirq();
+ while (endflag == 0)
+ mb();
+ }
+diff --git a/arch/sparc/kernel/pcr.c b/arch/sparc/kernel/pcr.c
+index 269af58497aa..7e967c8018c8 100644
+--- a/arch/sparc/kernel/pcr.c
++++ b/arch/sparc/kernel/pcr.c
+@@ -191,12 +191,41 @@ static const struct pcr_ops n4_pcr_ops = {
+ .pcr_nmi_disable = PCR_N4_PICNPT,
+ };
+
++static u64 n5_pcr_read(unsigned long reg_num)
++{
++ unsigned long val;
++
++ (void) sun4v_t5_get_perfreg(reg_num, &val);
++
++ return val;
++}
++
++static void n5_pcr_write(unsigned long reg_num, u64 val)
++{
++ (void) sun4v_t5_set_perfreg(reg_num, val);
++}
++
++static const struct pcr_ops n5_pcr_ops = {
++ .read_pcr = n5_pcr_read,
++ .write_pcr = n5_pcr_write,
++ .read_pic = n4_pic_read,
++ .write_pic = n4_pic_write,
++ .nmi_picl_value = n4_picl_value,
++ .pcr_nmi_enable = (PCR_N4_PICNPT | PCR_N4_STRACE |
++ PCR_N4_UTRACE | PCR_N4_TOE |
++ (26 << PCR_N4_SL_SHIFT)),
++ .pcr_nmi_disable = PCR_N4_PICNPT,
++};
++
++
+ static unsigned long perf_hsvc_group;
+ static unsigned long perf_hsvc_major;
+ static unsigned long perf_hsvc_minor;
+
+ static int __init register_perf_hsvc(void)
+ {
++ unsigned long hverror;
++
+ if (tlb_type == hypervisor) {
+ switch (sun4v_chip_type) {
+ case SUN4V_CHIP_NIAGARA1:
+@@ -215,6 +244,10 @@ static int __init register_perf_hsvc(void)
+ perf_hsvc_group = HV_GRP_VT_CPU;
+ break;
+
++ case SUN4V_CHIP_NIAGARA5:
++ perf_hsvc_group = HV_GRP_T5_CPU;
++ break;
++
+ default:
+ return -ENODEV;
+ }
+@@ -222,10 +255,12 @@ static int __init register_perf_hsvc(void)
+
+ perf_hsvc_major = 1;
+ perf_hsvc_minor = 0;
+- if (sun4v_hvapi_register(perf_hsvc_group,
+- perf_hsvc_major,
+- &perf_hsvc_minor)) {
+- printk("perfmon: Could not register hvapi.\n");
++ hverror = sun4v_hvapi_register(perf_hsvc_group,
++ perf_hsvc_major,
++ &perf_hsvc_minor);
++ if (hverror) {
++ pr_err("perfmon: Could not register hvapi(0x%lx).\n",
++ hverror);
+ return -ENODEV;
+ }
+ }
+@@ -254,6 +289,10 @@ static int __init setup_sun4v_pcr_ops(void)
+ pcr_ops = &n4_pcr_ops;
+ break;
+
++ case SUN4V_CHIP_NIAGARA5:
++ pcr_ops = &n5_pcr_ops;
++ break;
++
+ default:
+ ret = -ENODEV;
+ break;
+diff --git a/arch/sparc/kernel/perf_event.c b/arch/sparc/kernel/perf_event.c
+index 8efd33753ad3..c9759ad3f34a 100644
+--- a/arch/sparc/kernel/perf_event.c
++++ b/arch/sparc/kernel/perf_event.c
+@@ -1662,7 +1662,8 @@ static bool __init supported_pmu(void)
+ sparc_pmu = &niagara2_pmu;
+ return true;
+ }
+- if (!strcmp(sparc_pmu_type, "niagara4")) {
++ if (!strcmp(sparc_pmu_type, "niagara4") ||
++ !strcmp(sparc_pmu_type, "niagara5")) {
+ sparc_pmu = &niagara4_pmu;
+ return true;
+ }
+@@ -1671,9 +1672,12 @@ static bool __init supported_pmu(void)
+
+ static int __init init_hw_perf_events(void)
+ {
++ int err;
++
+ pr_info("Performance events: ");
+
+- if (!supported_pmu()) {
++ err = pcr_arch_init();
++ if (err || !supported_pmu()) {
+ pr_cont("No support for PMU type '%s'\n", sparc_pmu_type);
+ return 0;
+ }
+@@ -1685,7 +1689,7 @@ static int __init init_hw_perf_events(void)
+
+ return 0;
+ }
+-early_initcall(init_hw_perf_events);
++pure_initcall(init_hw_perf_events);
+
+ void perf_callchain_kernel(struct perf_callchain_entry *entry,
+ struct pt_regs *regs)
+diff --git a/arch/sparc/kernel/process_64.c b/arch/sparc/kernel/process_64.c
+index 027e09986194..0be7bf978cb1 100644
+--- a/arch/sparc/kernel/process_64.c
++++ b/arch/sparc/kernel/process_64.c
+@@ -312,6 +312,9 @@ static void __global_pmu_self(int this_cpu)
+ struct global_pmu_snapshot *pp;
+ int i, num;
+
++ if (!pcr_ops)
++ return;
++
+ pp = &global_cpu_snapshot[this_cpu].pmu;
+
+ num = 1;
+diff --git a/arch/sparc/kernel/setup_64.c b/arch/sparc/kernel/setup_64.c
+index 3fdb455e3318..61a519808cb7 100644
+--- a/arch/sparc/kernel/setup_64.c
++++ b/arch/sparc/kernel/setup_64.c
+@@ -30,6 +30,7 @@
+ #include <linux/cpu.h>
+ #include <linux/initrd.h>
+ #include <linux/module.h>
++#include <linux/start_kernel.h>
+
+ #include <asm/io.h>
+ #include <asm/processor.h>
+@@ -174,7 +175,7 @@ char reboot_command[COMMAND_LINE_SIZE];
+
+ static struct pt_regs fake_swapper_regs = { { 0, }, 0, 0, 0, 0 };
+
+-void __init per_cpu_patch(void)
++static void __init per_cpu_patch(void)
+ {
+ struct cpuid_patch_entry *p;
+ unsigned long ver;
+@@ -266,7 +267,7 @@ void sun4v_patch_2insn_range(struct sun4v_2insn_patch_entry *start,
+ }
+ }
+
+-void __init sun4v_patch(void)
++static void __init sun4v_patch(void)
+ {
+ extern void sun4v_hvapi_init(void);
+
+@@ -335,14 +336,25 @@ static void __init pause_patch(void)
+ }
+ }
+
+-#ifdef CONFIG_SMP
+-void __init boot_cpu_id_too_large(int cpu)
++void __init start_early_boot(void)
+ {
+- prom_printf("Serious problem, boot cpu id (%d) >= NR_CPUS (%d)\n",
+- cpu, NR_CPUS);
+- prom_halt();
++ int cpu;
++
++ check_if_starfire();
++ per_cpu_patch();
++ sun4v_patch();
++
++ cpu = hard_smp_processor_id();
++ if (cpu >= NR_CPUS) {
++ prom_printf("Serious problem, boot cpu id (%d) >= NR_CPUS (%d)\n",
++ cpu, NR_CPUS);
++ prom_halt();
++ }
++ current_thread_info()->cpu = cpu;
++
++ prom_init_report();
++ start_kernel();
+ }
+-#endif
+
+ /* On Ultra, we support all of the v8 capabilities. */
+ unsigned long sparc64_elf_hwcap = (HWCAP_SPARC_FLUSH | HWCAP_SPARC_STBAR |
+@@ -500,12 +512,16 @@ static void __init init_sparc64_elf_hwcap(void)
+ sun4v_chip_type == SUN4V_CHIP_NIAGARA3 ||
+ sun4v_chip_type == SUN4V_CHIP_NIAGARA4 ||
+ sun4v_chip_type == SUN4V_CHIP_NIAGARA5 ||
++ sun4v_chip_type == SUN4V_CHIP_SPARC_M6 ||
++ sun4v_chip_type == SUN4V_CHIP_SPARC_M7 ||
+ sun4v_chip_type == SUN4V_CHIP_SPARC64X)
+ cap |= HWCAP_SPARC_BLKINIT;
+ if (sun4v_chip_type == SUN4V_CHIP_NIAGARA2 ||
+ sun4v_chip_type == SUN4V_CHIP_NIAGARA3 ||
+ sun4v_chip_type == SUN4V_CHIP_NIAGARA4 ||
+ sun4v_chip_type == SUN4V_CHIP_NIAGARA5 ||
++ sun4v_chip_type == SUN4V_CHIP_SPARC_M6 ||
++ sun4v_chip_type == SUN4V_CHIP_SPARC_M7 ||
+ sun4v_chip_type == SUN4V_CHIP_SPARC64X)
+ cap |= HWCAP_SPARC_N2;
+ }
+@@ -533,6 +549,8 @@ static void __init init_sparc64_elf_hwcap(void)
+ sun4v_chip_type == SUN4V_CHIP_NIAGARA3 ||
+ sun4v_chip_type == SUN4V_CHIP_NIAGARA4 ||
+ sun4v_chip_type == SUN4V_CHIP_NIAGARA5 ||
++ sun4v_chip_type == SUN4V_CHIP_SPARC_M6 ||
++ sun4v_chip_type == SUN4V_CHIP_SPARC_M7 ||
+ sun4v_chip_type == SUN4V_CHIP_SPARC64X)
+ cap |= (AV_SPARC_VIS | AV_SPARC_VIS2 |
+ AV_SPARC_ASI_BLK_INIT |
+@@ -540,6 +558,8 @@ static void __init init_sparc64_elf_hwcap(void)
+ if (sun4v_chip_type == SUN4V_CHIP_NIAGARA3 ||
+ sun4v_chip_type == SUN4V_CHIP_NIAGARA4 ||
+ sun4v_chip_type == SUN4V_CHIP_NIAGARA5 ||
++ sun4v_chip_type == SUN4V_CHIP_SPARC_M6 ||
++ sun4v_chip_type == SUN4V_CHIP_SPARC_M7 ||
+ sun4v_chip_type == SUN4V_CHIP_SPARC64X)
+ cap |= (AV_SPARC_VIS3 | AV_SPARC_HPC |
+ AV_SPARC_FMAF);
+diff --git a/arch/sparc/kernel/smp_64.c b/arch/sparc/kernel/smp_64.c
+index 41aa2478f3ca..c9300bfaee5a 100644
+--- a/arch/sparc/kernel/smp_64.c
++++ b/arch/sparc/kernel/smp_64.c
+@@ -1383,7 +1383,6 @@ void __cpu_die(unsigned int cpu)
+
+ void __init smp_cpus_done(unsigned int max_cpus)
+ {
+- pcr_arch_init();
+ }
+
+ void smp_send_reschedule(int cpu)
+@@ -1468,6 +1467,13 @@ static void __init pcpu_populate_pte(unsigned long addr)
+ pud_t *pud;
+ pmd_t *pmd;
+
++ if (pgd_none(*pgd)) {
++ pud_t *new;
++
++ new = __alloc_bootmem(PAGE_SIZE, PAGE_SIZE, PAGE_SIZE);
++ pgd_populate(&init_mm, pgd, new);
++ }
++
+ pud = pud_offset(pgd, addr);
+ if (pud_none(*pud)) {
+ pmd_t *new;
+diff --git a/arch/sparc/kernel/sun4v_tlb_miss.S b/arch/sparc/kernel/sun4v_tlb_miss.S
+index e0c09bf85610..6179e19bc9b9 100644
+--- a/arch/sparc/kernel/sun4v_tlb_miss.S
++++ b/arch/sparc/kernel/sun4v_tlb_miss.S
+@@ -195,6 +195,11 @@ sun4v_tsb_miss_common:
+ ldx [%g2 + TRAP_PER_CPU_PGD_PADDR], %g7
+
+ sun4v_itlb_error:
++ rdpr %tl, %g1
++ cmp %g1, 1
++ ble,pt %icc, sun4v_bad_ra
++ or %g0, FAULT_CODE_BAD_RA | FAULT_CODE_ITLB, %g1
++
+ sethi %hi(sun4v_err_itlb_vaddr), %g1
+ stx %g4, [%g1 + %lo(sun4v_err_itlb_vaddr)]
+ sethi %hi(sun4v_err_itlb_ctx), %g1
+@@ -206,15 +211,10 @@ sun4v_itlb_error:
+ sethi %hi(sun4v_err_itlb_error), %g1
+ stx %o0, [%g1 + %lo(sun4v_err_itlb_error)]
+
++ sethi %hi(1f), %g7
+ rdpr %tl, %g4
+- cmp %g4, 1
+- ble,pt %icc, 1f
+- sethi %hi(2f), %g7
+ ba,pt %xcc, etraptl1
+- or %g7, %lo(2f), %g7
+-
+-1: ba,pt %xcc, etrap
+-2: or %g7, %lo(2b), %g7
++1: or %g7, %lo(1f), %g7
+ mov %l4, %o1
+ call sun4v_itlb_error_report
+ add %sp, PTREGS_OFF, %o0
+@@ -222,6 +222,11 @@ sun4v_itlb_error:
+ /* NOTREACHED */
+
+ sun4v_dtlb_error:
++ rdpr %tl, %g1
++ cmp %g1, 1
++ ble,pt %icc, sun4v_bad_ra
++ or %g0, FAULT_CODE_BAD_RA | FAULT_CODE_DTLB, %g1
++
+ sethi %hi(sun4v_err_dtlb_vaddr), %g1
+ stx %g4, [%g1 + %lo(sun4v_err_dtlb_vaddr)]
+ sethi %hi(sun4v_err_dtlb_ctx), %g1
+@@ -233,21 +238,23 @@ sun4v_dtlb_error:
+ sethi %hi(sun4v_err_dtlb_error), %g1
+ stx %o0, [%g1 + %lo(sun4v_err_dtlb_error)]
+
++ sethi %hi(1f), %g7
+ rdpr %tl, %g4
+- cmp %g4, 1
+- ble,pt %icc, 1f
+- sethi %hi(2f), %g7
+ ba,pt %xcc, etraptl1
+- or %g7, %lo(2f), %g7
+-
+-1: ba,pt %xcc, etrap
+-2: or %g7, %lo(2b), %g7
++1: or %g7, %lo(1f), %g7
+ mov %l4, %o1
+ call sun4v_dtlb_error_report
+ add %sp, PTREGS_OFF, %o0
+
+ /* NOTREACHED */
+
++sun4v_bad_ra:
++ or %g0, %g4, %g5
++ ba,pt %xcc, sparc64_realfault_common
++ or %g1, %g0, %g4
++
++ /* NOTREACHED */
++
+ /* Instruction Access Exception, tl0. */
+ sun4v_iacc:
+ ldxa [%g0] ASI_SCRATCHPAD, %g2
+diff --git a/arch/sparc/kernel/trampoline_64.S b/arch/sparc/kernel/trampoline_64.S
+index 737f8cbc7d56..88ede1d53b4c 100644
+--- a/arch/sparc/kernel/trampoline_64.S
++++ b/arch/sparc/kernel/trampoline_64.S
+@@ -109,10 +109,13 @@ startup_continue:
+ brnz,pn %g1, 1b
+ nop
+
+- sethi %hi(p1275buf), %g2
+- or %g2, %lo(p1275buf), %g2
+- ldx [%g2 + 0x10], %l2
+- add %l2, -(192 + 128), %sp
++ /* Get onto temporary stack which will be in the locked
++ * kernel image.
++ */
++ sethi %hi(tramp_stack), %g1
++ or %g1, %lo(tramp_stack), %g1
++ add %g1, TRAMP_STACK_SIZE, %g1
++ sub %g1, STACKFRAME_SZ + STACK_BIAS + 256, %sp
+ flushw
+
+ /* Setup the loop variables:
+@@ -394,7 +397,6 @@ after_lock_tlb:
+ sllx %g5, THREAD_SHIFT, %g5
+ sub %g5, (STACKFRAME_SZ + STACK_BIAS), %g5
+ add %g6, %g5, %sp
+- mov 0, %fp
+
+ rdpr %pstate, %o1
+ or %o1, PSTATE_IE, %o1
+diff --git a/arch/sparc/kernel/traps_64.c b/arch/sparc/kernel/traps_64.c
+index fb6640ec8557..981a769b9558 100644
+--- a/arch/sparc/kernel/traps_64.c
++++ b/arch/sparc/kernel/traps_64.c
+@@ -2104,6 +2104,11 @@ void sun4v_nonresum_overflow(struct pt_regs *regs)
+ atomic_inc(&sun4v_nonresum_oflow_cnt);
+ }
+
++static void sun4v_tlb_error(struct pt_regs *regs)
++{
++ die_if_kernel("TLB/TSB error", regs);
++}
++
+ unsigned long sun4v_err_itlb_vaddr;
+ unsigned long sun4v_err_itlb_ctx;
+ unsigned long sun4v_err_itlb_pte;
+@@ -2111,8 +2116,7 @@ unsigned long sun4v_err_itlb_error;
+
+ void sun4v_itlb_error_report(struct pt_regs *regs, int tl)
+ {
+- if (tl > 1)
+- dump_tl1_traplog((struct tl1_traplog *)(regs + 1));
++ dump_tl1_traplog((struct tl1_traplog *)(regs + 1));
+
+ printk(KERN_EMERG "SUN4V-ITLB: Error at TPC[%lx], tl %d\n",
+ regs->tpc, tl);
+@@ -2125,7 +2129,7 @@ void sun4v_itlb_error_report(struct pt_regs *regs, int tl)
+ sun4v_err_itlb_vaddr, sun4v_err_itlb_ctx,
+ sun4v_err_itlb_pte, sun4v_err_itlb_error);
+
+- prom_halt();
++ sun4v_tlb_error(regs);
+ }
+
+ unsigned long sun4v_err_dtlb_vaddr;
+@@ -2135,8 +2139,7 @@ unsigned long sun4v_err_dtlb_error;
+
+ void sun4v_dtlb_error_report(struct pt_regs *regs, int tl)
+ {
+- if (tl > 1)
+- dump_tl1_traplog((struct tl1_traplog *)(regs + 1));
++ dump_tl1_traplog((struct tl1_traplog *)(regs + 1));
+
+ printk(KERN_EMERG "SUN4V-DTLB: Error at TPC[%lx], tl %d\n",
+ regs->tpc, tl);
+@@ -2149,7 +2152,7 @@ void sun4v_dtlb_error_report(struct pt_regs *regs, int tl)
+ sun4v_err_dtlb_vaddr, sun4v_err_dtlb_ctx,
+ sun4v_err_dtlb_pte, sun4v_err_dtlb_error);
+
+- prom_halt();
++ sun4v_tlb_error(regs);
+ }
+
+ void hypervisor_tlbop_error(unsigned long err, unsigned long op)
+diff --git a/arch/sparc/kernel/tsb.S b/arch/sparc/kernel/tsb.S
+index 14158d40ba76..be98685c14c6 100644
+--- a/arch/sparc/kernel/tsb.S
++++ b/arch/sparc/kernel/tsb.S
+@@ -162,10 +162,10 @@ tsb_miss_page_table_walk_sun4v_fastpath:
+ nop
+ .previous
+
+- rdpr %tl, %g3
+- cmp %g3, 1
++ rdpr %tl, %g7
++ cmp %g7, 1
+ bne,pn %xcc, winfix_trampoline
+- nop
++ mov %g3, %g4
+ ba,pt %xcc, etrap
+ rd %pc, %g7
+ call hugetlb_setup
+diff --git a/arch/sparc/kernel/viohs.c b/arch/sparc/kernel/viohs.c
+index f8e7dd53e1c7..9c5fbd0b8a04 100644
+--- a/arch/sparc/kernel/viohs.c
++++ b/arch/sparc/kernel/viohs.c
+@@ -714,7 +714,7 @@ int vio_ldc_alloc(struct vio_driver_state *vio,
+ cfg.tx_irq = vio->vdev->tx_irq;
+ cfg.rx_irq = vio->vdev->rx_irq;
+
+- lp = ldc_alloc(vio->vdev->channel_id, &cfg, event_arg);
++ lp = ldc_alloc(vio->vdev->channel_id, &cfg, event_arg, vio->name);
+ if (IS_ERR(lp))
+ return PTR_ERR(lp);
+
+@@ -746,7 +746,7 @@ void vio_port_up(struct vio_driver_state *vio)
+
+ err = 0;
+ if (state == LDC_STATE_INIT) {
+- err = ldc_bind(vio->lp, vio->name);
++ err = ldc_bind(vio->lp);
+ if (err)
+ printk(KERN_WARNING "%s: Port %lu bind failed, "
+ "err=%d\n",
+diff --git a/arch/sparc/kernel/vmlinux.lds.S b/arch/sparc/kernel/vmlinux.lds.S
+index 932ff90fd760..09243057cb0b 100644
+--- a/arch/sparc/kernel/vmlinux.lds.S
++++ b/arch/sparc/kernel/vmlinux.lds.S
+@@ -35,8 +35,9 @@ jiffies = jiffies_64;
+
+ SECTIONS
+ {
+- /* swapper_low_pmd_dir is sparc64 only */
+- swapper_low_pmd_dir = 0x0000000000402000;
++#ifdef CONFIG_SPARC64
++ swapper_pg_dir = 0x0000000000402000;
++#endif
+ . = INITIAL_ADDRESS;
+ .text TEXTSTART :
+ {
+@@ -122,11 +123,6 @@ SECTIONS
+ *(.swapper_4m_tsb_phys_patch)
+ __swapper_4m_tsb_phys_patch_end = .;
+ }
+- .page_offset_shift_patch : {
+- __page_offset_shift_patch = .;
+- *(.page_offset_shift_patch)
+- __page_offset_shift_patch_end = .;
+- }
+ .popc_3insn_patch : {
+ __popc_3insn_patch = .;
+ *(.popc_3insn_patch)
+diff --git a/arch/sparc/lib/NG4memcpy.S b/arch/sparc/lib/NG4memcpy.S
+index 9cf2ee01cee3..140527a20e7d 100644
+--- a/arch/sparc/lib/NG4memcpy.S
++++ b/arch/sparc/lib/NG4memcpy.S
+@@ -41,6 +41,10 @@
+ #endif
+ #endif
+
++#if !defined(EX_LD) && !defined(EX_ST)
++#define NON_USER_COPY
++#endif
++
+ #ifndef EX_LD
+ #define EX_LD(x) x
+ #endif
+@@ -197,9 +201,13 @@ FUNC_NAME: /* %o0=dst, %o1=src, %o2=len */
+ mov EX_RETVAL(%o3), %o0
+
+ .Llarge_src_unaligned:
++#ifdef NON_USER_COPY
++ VISEntryHalfFast(.Lmedium_vis_entry_fail)
++#else
++ VISEntryHalf
++#endif
+ andn %o2, 0x3f, %o4
+ sub %o2, %o4, %o2
+- VISEntryHalf
+ alignaddr %o1, %g0, %g1
+ add %o1, %o4, %o1
+ EX_LD(LOAD(ldd, %g1 + 0x00, %f0))
+@@ -240,6 +248,10 @@ FUNC_NAME: /* %o0=dst, %o1=src, %o2=len */
+ nop
+ ba,a,pt %icc, .Lmedium_unaligned
+
++#ifdef NON_USER_COPY
++.Lmedium_vis_entry_fail:
++ or %o0, %o1, %g2
++#endif
+ .Lmedium:
+ LOAD(prefetch, %o1 + 0x40, #n_reads_strong)
+ andcc %g2, 0x7, %g0
+diff --git a/arch/sparc/lib/memset.S b/arch/sparc/lib/memset.S
+index 99c017be8719..f75e6906df14 100644
+--- a/arch/sparc/lib/memset.S
++++ b/arch/sparc/lib/memset.S
+@@ -3,8 +3,9 @@
+ * Copyright (C) 1996,1997 Jakub Jelinek (jj@sunsite.mff.cuni.cz)
+ * Copyright (C) 1996 David S. Miller (davem@caip.rutgers.edu)
+ *
+- * Returns 0, if ok, and number of bytes not yet set if exception
+- * occurs and we were called as clear_user.
++ * Calls to memset returns initial %o0. Calls to bzero returns 0, if ok, and
++ * number of bytes not yet set if exception occurs and we were called as
++ * clear_user.
+ */
+
+ #include <asm/ptrace.h>
+@@ -65,6 +66,8 @@ __bzero_begin:
+ .globl __memset_start, __memset_end
+ __memset_start:
+ memset:
++ mov %o0, %g1
++ mov 1, %g4
+ and %o1, 0xff, %g3
+ sll %g3, 8, %g2
+ or %g3, %g2, %g3
+@@ -89,6 +92,7 @@ memset:
+ sub %o0, %o2, %o0
+
+ __bzero:
++ clr %g4
+ mov %g0, %g3
+ 1:
+ cmp %o1, 7
+@@ -151,8 +155,8 @@ __bzero:
+ bne,a 8f
+ EX(stb %g3, [%o0], and %o1, 1)
+ 8:
+- retl
+- clr %o0
++ b 0f
++ nop
+ 7:
+ be 13b
+ orcc %o1, 0, %g0
+@@ -164,6 +168,12 @@ __bzero:
+ bne 8b
+ EX(stb %g3, [%o0 - 1], add %o1, 1)
+ 0:
++ andcc %g4, 1, %g0
++ be 5f
++ nop
++ retl
++ mov %g1, %o0
++5:
+ retl
+ clr %o0
+ __memset_end:
+diff --git a/arch/sparc/mm/fault_64.c b/arch/sparc/mm/fault_64.c
+index 587cd0565128..18fcd7167095 100644
+--- a/arch/sparc/mm/fault_64.c
++++ b/arch/sparc/mm/fault_64.c
+@@ -346,6 +346,9 @@ retry:
+ down_read(&mm->mmap_sem);
+ }
+
++ if (fault_code & FAULT_CODE_BAD_RA)
++ goto do_sigbus;
++
+ vma = find_vma(mm, address);
+ if (!vma)
+ goto bad_area;
+diff --git a/arch/sparc/mm/gup.c b/arch/sparc/mm/gup.c
+index 1aed0432c64b..ae6ce383d4df 100644
+--- a/arch/sparc/mm/gup.c
++++ b/arch/sparc/mm/gup.c
+@@ -160,6 +160,36 @@ static int gup_pud_range(pgd_t pgd, unsigned long addr, unsigned long end,
+ return 1;
+ }
+
++int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
++ struct page **pages)
++{
++ struct mm_struct *mm = current->mm;
++ unsigned long addr, len, end;
++ unsigned long next, flags;
++ pgd_t *pgdp;
++ int nr = 0;
++
++ start &= PAGE_MASK;
++ addr = start;
++ len = (unsigned long) nr_pages << PAGE_SHIFT;
++ end = start + len;
++
++ local_irq_save(flags);
++ pgdp = pgd_offset(mm, addr);
++ do {
++ pgd_t pgd = *pgdp;
++
++ next = pgd_addr_end(addr, end);
++ if (pgd_none(pgd))
++ break;
++ if (!gup_pud_range(pgd, addr, next, write, pages, &nr))
++ break;
++ } while (pgdp++, addr = next, addr != end);
++ local_irq_restore(flags);
++
++ return nr;
++}
++
+ int get_user_pages_fast(unsigned long start, int nr_pages, int write,
+ struct page **pages)
+ {
+diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
+index 2cfb0f25e0ed..bbb9371f519b 100644
+--- a/arch/sparc/mm/init_64.c
++++ b/arch/sparc/mm/init_64.c
+@@ -74,7 +74,6 @@ unsigned long kern_linear_pte_xor[4] __read_mostly;
+ * 'cpu' properties, but we need to have this table setup before the
+ * MDESC is initialized.
+ */
+-unsigned long kpte_linear_bitmap[KPTE_BITMAP_BYTES / sizeof(unsigned long)];
+
+ #ifndef CONFIG_DEBUG_PAGEALLOC
+ /* A special kernel TSB for 4MB, 256MB, 2GB and 16GB linear mappings.
+@@ -83,10 +82,11 @@ unsigned long kpte_linear_bitmap[KPTE_BITMAP_BYTES / sizeof(unsigned long)];
+ */
+ extern struct tsb swapper_4m_tsb[KERNEL_TSB4M_NENTRIES];
+ #endif
++extern struct tsb swapper_tsb[KERNEL_TSB_NENTRIES];
+
+ static unsigned long cpu_pgsz_mask;
+
+-#define MAX_BANKS 32
++#define MAX_BANKS 1024
+
+ static struct linux_prom64_registers pavail[MAX_BANKS];
+ static int pavail_ents;
+@@ -164,10 +164,6 @@ static void __init read_obp_memory(const char *property,
+ cmp_p64, NULL);
+ }
+
+-unsigned long sparc64_valid_addr_bitmap[VALID_ADDR_BITMAP_BYTES /
+- sizeof(unsigned long)];
+-EXPORT_SYMBOL(sparc64_valid_addr_bitmap);
+-
+ /* Kernel physical address base and size in bytes. */
+ unsigned long kern_base __read_mostly;
+ unsigned long kern_size __read_mostly;
+@@ -839,7 +835,10 @@ static int find_node(unsigned long addr)
+ if ((addr & p->mask) == p->val)
+ return i;
+ }
+- return -1;
++ /* The following condition has been observed on LDOM guests.*/
++ WARN_ONCE(1, "find_node: A physical address doesn't match a NUMA node"
++ " rule. Some physical memory will be owned by node 0.");
++ return 0;
+ }
+
+ static u64 memblock_nid_range(u64 start, u64 end, int *nid)
+@@ -1365,9 +1364,144 @@ static unsigned long __init bootmem_init(unsigned long phys_base)
+ static struct linux_prom64_registers pall[MAX_BANKS] __initdata;
+ static int pall_ents __initdata;
+
+-#ifdef CONFIG_DEBUG_PAGEALLOC
++static unsigned long max_phys_bits = 40;
++
++bool kern_addr_valid(unsigned long addr)
++{
++ pgd_t *pgd;
++ pud_t *pud;
++ pmd_t *pmd;
++ pte_t *pte;
++
++ if ((long)addr < 0L) {
++ unsigned long pa = __pa(addr);
++
++ if ((addr >> max_phys_bits) != 0UL)
++ return false;
++
++ return pfn_valid(pa >> PAGE_SHIFT);
++ }
++
++ if (addr >= (unsigned long) KERNBASE &&
++ addr < (unsigned long)&_end)
++ return true;
++
++ pgd = pgd_offset_k(addr);
++ if (pgd_none(*pgd))
++ return 0;
++
++ pud = pud_offset(pgd, addr);
++ if (pud_none(*pud))
++ return 0;
++
++ if (pud_large(*pud))
++ return pfn_valid(pud_pfn(*pud));
++
++ pmd = pmd_offset(pud, addr);
++ if (pmd_none(*pmd))
++ return 0;
++
++ if (pmd_large(*pmd))
++ return pfn_valid(pmd_pfn(*pmd));
++
++ pte = pte_offset_kernel(pmd, addr);
++ if (pte_none(*pte))
++ return 0;
++
++ return pfn_valid(pte_pfn(*pte));
++}
++EXPORT_SYMBOL(kern_addr_valid);
++
++static unsigned long __ref kernel_map_hugepud(unsigned long vstart,
++ unsigned long vend,
++ pud_t *pud)
++{
++ const unsigned long mask16gb = (1UL << 34) - 1UL;
++ u64 pte_val = vstart;
++
++ /* Each PUD is 8GB */
++ if ((vstart & mask16gb) ||
++ (vend - vstart <= mask16gb)) {
++ pte_val ^= kern_linear_pte_xor[2];
++ pud_val(*pud) = pte_val | _PAGE_PUD_HUGE;
++
++ return vstart + PUD_SIZE;
++ }
++
++ pte_val ^= kern_linear_pte_xor[3];
++ pte_val |= _PAGE_PUD_HUGE;
++
++ vend = vstart + mask16gb + 1UL;
++ while (vstart < vend) {
++ pud_val(*pud) = pte_val;
++
++ pte_val += PUD_SIZE;
++ vstart += PUD_SIZE;
++ pud++;
++ }
++ return vstart;
++}
++
++static bool kernel_can_map_hugepud(unsigned long vstart, unsigned long vend,
++ bool guard)
++{
++ if (guard && !(vstart & ~PUD_MASK) && (vend - vstart) >= PUD_SIZE)
++ return true;
++
++ return false;
++}
++
++static unsigned long __ref kernel_map_hugepmd(unsigned long vstart,
++ unsigned long vend,
++ pmd_t *pmd)
++{
++ const unsigned long mask256mb = (1UL << 28) - 1UL;
++ const unsigned long mask2gb = (1UL << 31) - 1UL;
++ u64 pte_val = vstart;
++
++ /* Each PMD is 8MB */
++ if ((vstart & mask256mb) ||
++ (vend - vstart <= mask256mb)) {
++ pte_val ^= kern_linear_pte_xor[0];
++ pmd_val(*pmd) = pte_val | _PAGE_PMD_HUGE;
++
++ return vstart + PMD_SIZE;
++ }
++
++ if ((vstart & mask2gb) ||
++ (vend - vstart <= mask2gb)) {
++ pte_val ^= kern_linear_pte_xor[1];
++ pte_val |= _PAGE_PMD_HUGE;
++ vend = vstart + mask256mb + 1UL;
++ } else {
++ pte_val ^= kern_linear_pte_xor[2];
++ pte_val |= _PAGE_PMD_HUGE;
++ vend = vstart + mask2gb + 1UL;
++ }
++
++ while (vstart < vend) {
++ pmd_val(*pmd) = pte_val;
++
++ pte_val += PMD_SIZE;
++ vstart += PMD_SIZE;
++ pmd++;
++ }
++
++ return vstart;
++}
++
++static bool kernel_can_map_hugepmd(unsigned long vstart, unsigned long vend,
++ bool guard)
++{
++ if (guard && !(vstart & ~PMD_MASK) && (vend - vstart) >= PMD_SIZE)
++ return true;
++
++ return false;
++}
++
+ static unsigned long __ref kernel_map_range(unsigned long pstart,
+- unsigned long pend, pgprot_t prot)
++ unsigned long pend, pgprot_t prot,
++ bool use_huge)
+ {
+ unsigned long vstart = PAGE_OFFSET + pstart;
+ unsigned long vend = PAGE_OFFSET + pend;
+@@ -1386,19 +1520,34 @@ static unsigned long __ref kernel_map_range(unsigned long pstart,
+ pmd_t *pmd;
+ pte_t *pte;
+
++ if (pgd_none(*pgd)) {
++ pud_t *new;
++
++ new = __alloc_bootmem(PAGE_SIZE, PAGE_SIZE, PAGE_SIZE);
++ alloc_bytes += PAGE_SIZE;
++ pgd_populate(&init_mm, pgd, new);
++ }
+ pud = pud_offset(pgd, vstart);
+ if (pud_none(*pud)) {
+ pmd_t *new;
+
++ if (kernel_can_map_hugepud(vstart, vend, use_huge)) {
++ vstart = kernel_map_hugepud(vstart, vend, pud);
++ continue;
++ }
+ new = __alloc_bootmem(PAGE_SIZE, PAGE_SIZE, PAGE_SIZE);
+ alloc_bytes += PAGE_SIZE;
+ pud_populate(&init_mm, pud, new);
+ }
+
+ pmd = pmd_offset(pud, vstart);
+- if (!pmd_present(*pmd)) {
++ if (pmd_none(*pmd)) {
+ pte_t *new;
+
++ if (kernel_can_map_hugepmd(vstart, vend, use_huge)) {
++ vstart = kernel_map_hugepmd(vstart, vend, pmd);
++ continue;
++ }
+ new = __alloc_bootmem(PAGE_SIZE, PAGE_SIZE, PAGE_SIZE);
+ alloc_bytes += PAGE_SIZE;
+ pmd_populate_kernel(&init_mm, pmd, new);
+@@ -1421,100 +1570,34 @@ static unsigned long __ref kernel_map_range(unsigned long pstart,
+ return alloc_bytes;
+ }
+
+-extern unsigned int kvmap_linear_patch[1];
+-#endif /* CONFIG_DEBUG_PAGEALLOC */
+-
+-static void __init kpte_set_val(unsigned long index, unsigned long val)
++static void __init flush_all_kernel_tsbs(void)
+ {
+- unsigned long *ptr = kpte_linear_bitmap;
+-
+- val <<= ((index % (BITS_PER_LONG / 2)) * 2);
+- ptr += (index / (BITS_PER_LONG / 2));
+-
+- *ptr |= val;
+-}
+-
+-static const unsigned long kpte_shift_min = 28; /* 256MB */
+-static const unsigned long kpte_shift_max = 34; /* 16GB */
+-static const unsigned long kpte_shift_incr = 3;
+-
+-static unsigned long kpte_mark_using_shift(unsigned long start, unsigned long end,
+- unsigned long shift)
+-{
+- unsigned long size = (1UL << shift);
+- unsigned long mask = (size - 1UL);
+- unsigned long remains = end - start;
+- unsigned long val;
+-
+- if (remains < size || (start & mask))
+- return start;
+-
+- /* VAL maps:
+- *
+- * shift 28 --> kern_linear_pte_xor index 1
+- * shift 31 --> kern_linear_pte_xor index 2
+- * shift 34 --> kern_linear_pte_xor index 3
+- */
+- val = ((shift - kpte_shift_min) / kpte_shift_incr) + 1;
+-
+- remains &= ~mask;
+- if (shift != kpte_shift_max)
+- remains = size;
+-
+- while (remains) {
+- unsigned long index = start >> kpte_shift_min;
++ int i;
+
+- kpte_set_val(index, val);
++ for (i = 0; i < KERNEL_TSB_NENTRIES; i++) {
++ struct tsb *ent = &swapper_tsb[i];
+
+- start += 1UL << kpte_shift_min;
+- remains -= 1UL << kpte_shift_min;
++ ent->tag = (1UL << TSB_TAG_INVALID_BIT);
+ }
++#ifndef CONFIG_DEBUG_PAGEALLOC
++ for (i = 0; i < KERNEL_TSB4M_NENTRIES; i++) {
++ struct tsb *ent = &swapper_4m_tsb[i];
+
+- return start;
+-}
+-
+-static void __init mark_kpte_bitmap(unsigned long start, unsigned long end)
+-{
+- unsigned long smallest_size, smallest_mask;
+- unsigned long s;
+-
+- smallest_size = (1UL << kpte_shift_min);
+- smallest_mask = (smallest_size - 1UL);
+-
+- while (start < end) {
+- unsigned long orig_start = start;
+-
+- for (s = kpte_shift_max; s >= kpte_shift_min; s -= kpte_shift_incr) {
+- start = kpte_mark_using_shift(start, end, s);
+-
+- if (start != orig_start)
+- break;
+- }
+-
+- if (start == orig_start)
+- start = (start + smallest_size) & ~smallest_mask;
++ ent->tag = (1UL << TSB_TAG_INVALID_BIT);
+ }
++#endif
+ }
+
+-static void __init init_kpte_bitmap(void)
+-{
+- unsigned long i;
+-
+- for (i = 0; i < pall_ents; i++) {
+- unsigned long phys_start, phys_end;
+-
+- phys_start = pall[i].phys_addr;
+- phys_end = phys_start + pall[i].reg_size;
+-
+- mark_kpte_bitmap(phys_start, phys_end);
+- }
+-}
++extern unsigned int kvmap_linear_patch[1];
+
+ static void __init kernel_physical_mapping_init(void)
+ {
+-#ifdef CONFIG_DEBUG_PAGEALLOC
+ unsigned long i, mem_alloced = 0UL;
++ bool use_huge = true;
+
++#ifdef CONFIG_DEBUG_PAGEALLOC
++ use_huge = false;
++#endif
+ for (i = 0; i < pall_ents; i++) {
+ unsigned long phys_start, phys_end;
+
+@@ -1522,7 +1605,7 @@ static void __init kernel_physical_mapping_init(void)
+ phys_end = phys_start + pall[i].reg_size;
+
+ mem_alloced += kernel_map_range(phys_start, phys_end,
+- PAGE_KERNEL);
++ PAGE_KERNEL, use_huge);
+ }
+
+ printk("Allocated %ld bytes for kernel page tables.\n",
+@@ -1531,8 +1614,9 @@ static void __init kernel_physical_mapping_init(void)
+ kvmap_linear_patch[0] = 0x01000000; /* nop */
+ flushi(&kvmap_linear_patch[0]);
+
++ flush_all_kernel_tsbs();
++
+ __flush_tlb_all();
+-#endif
+ }
+
+ #ifdef CONFIG_DEBUG_PAGEALLOC
+@@ -1542,7 +1626,7 @@ void kernel_map_pages(struct page *page, int numpages, int enable)
+ unsigned long phys_end = phys_start + (numpages * PAGE_SIZE);
+
+ kernel_map_range(phys_start, phys_end,
+- (enable ? PAGE_KERNEL : __pgprot(0)));
++ (enable ? PAGE_KERNEL : __pgprot(0)), false);
+
+ flush_tsb_kernel_range(PAGE_OFFSET + phys_start,
+ PAGE_OFFSET + phys_end);
+@@ -1570,76 +1654,56 @@ unsigned long __init find_ecache_flush_span(unsigned long size)
+ unsigned long PAGE_OFFSET;
+ EXPORT_SYMBOL(PAGE_OFFSET);
+
+-static void __init page_offset_shift_patch_one(unsigned int *insn, unsigned long phys_bits)
+-{
+- unsigned long final_shift;
+- unsigned int val = *insn;
+- unsigned int cnt;
+-
+- /* We are patching in ilog2(max_supported_phys_address), and
+- * we are doing so in a manner similar to a relocation addend.
+- * That is, we are adding the shift value to whatever value
+- * is in the shift instruction count field already.
+- */
+- cnt = (val & 0x3f);
+- val &= ~0x3f;
+-
+- /* If we are trying to shift >= 64 bits, clear the destination
+- * register. This can happen when phys_bits ends up being equal
+- * to MAX_PHYS_ADDRESS_BITS.
+- */
+- final_shift = (cnt + (64 - phys_bits));
+- if (final_shift >= 64) {
+- unsigned int rd = (val >> 25) & 0x1f;
+-
+- val = 0x80100000 | (rd << 25);
+- } else {
+- val |= final_shift;
+- }
+- *insn = val;
+-
+- __asm__ __volatile__("flush %0"
+- : /* no outputs */
+- : "r" (insn));
+-}
+-
+-static void __init page_offset_shift_patch(unsigned long phys_bits)
+-{
+- extern unsigned int __page_offset_shift_patch;
+- extern unsigned int __page_offset_shift_patch_end;
+- unsigned int *p;
+-
+- p = &__page_offset_shift_patch;
+- while (p < &__page_offset_shift_patch_end) {
+- unsigned int *insn = (unsigned int *)(unsigned long)*p;
++unsigned long VMALLOC_END = 0x0000010000000000UL;
++EXPORT_SYMBOL(VMALLOC_END);
+
+- page_offset_shift_patch_one(insn, phys_bits);
+-
+- p++;
+- }
+-}
++unsigned long sparc64_va_hole_top = 0xfffff80000000000UL;
++unsigned long sparc64_va_hole_bottom = 0x0000080000000000UL;
+
+ static void __init setup_page_offset(void)
+ {
+- unsigned long max_phys_bits = 40;
+-
+ if (tlb_type == cheetah || tlb_type == cheetah_plus) {
++ /* Cheetah/Panther support a full 64-bit virtual
++ * address, so we can use all that our page tables
++ * support.
++ */
++ sparc64_va_hole_top = 0xfff0000000000000UL;
++ sparc64_va_hole_bottom = 0x0010000000000000UL;
++
+ max_phys_bits = 42;
+ } else if (tlb_type == hypervisor) {
+ switch (sun4v_chip_type) {
+ case SUN4V_CHIP_NIAGARA1:
+ case SUN4V_CHIP_NIAGARA2:
++ /* T1 and T2 support 48-bit virtual addresses. */
++ sparc64_va_hole_top = 0xffff800000000000UL;
++ sparc64_va_hole_bottom = 0x0000800000000000UL;
++
+ max_phys_bits = 39;
+ break;
+ case SUN4V_CHIP_NIAGARA3:
++ /* T3 supports 48-bit virtual addresses. */
++ sparc64_va_hole_top = 0xffff800000000000UL;
++ sparc64_va_hole_bottom = 0x0000800000000000UL;
++
+ max_phys_bits = 43;
+ break;
+ case SUN4V_CHIP_NIAGARA4:
+ case SUN4V_CHIP_NIAGARA5:
+ case SUN4V_CHIP_SPARC64X:
+- default:
++ case SUN4V_CHIP_SPARC_M6:
++ /* T4 and later support 52-bit virtual addresses. */
++ sparc64_va_hole_top = 0xfff8000000000000UL;
++ sparc64_va_hole_bottom = 0x0008000000000000UL;
+ max_phys_bits = 47;
+ break;
++ case SUN4V_CHIP_SPARC_M7:
++ default:
++ /* M7 and later support 52-bit virtual addresses. */
++ sparc64_va_hole_top = 0xfff8000000000000UL;
++ sparc64_va_hole_bottom = 0x0008000000000000UL;
++ max_phys_bits = 49;
++ break;
+ }
+ }
+
+@@ -1649,12 +1713,16 @@ static void __init setup_page_offset(void)
+ prom_halt();
+ }
+
+- PAGE_OFFSET = PAGE_OFFSET_BY_BITS(max_phys_bits);
++ PAGE_OFFSET = sparc64_va_hole_top;
++ VMALLOC_END = ((sparc64_va_hole_bottom >> 1) +
++ (sparc64_va_hole_bottom >> 2));
+
+- pr_info("PAGE_OFFSET is 0x%016lx (max_phys_bits == %lu)\n",
++ pr_info("MM: PAGE_OFFSET is 0x%016lx (max_phys_bits == %lu)\n",
+ PAGE_OFFSET, max_phys_bits);
+-
+- page_offset_shift_patch(max_phys_bits);
++ pr_info("MM: VMALLOC [0x%016lx --> 0x%016lx]\n",
++ VMALLOC_START, VMALLOC_END);
++ pr_info("MM: VMEMMAP [0x%016lx --> 0x%016lx]\n",
++ VMEMMAP_BASE, VMEMMAP_BASE << 1);
+ }
+
+ static void __init tsb_phys_patch(void)
+@@ -1699,21 +1767,42 @@ static void __init tsb_phys_patch(void)
+ #define NUM_KTSB_DESCR 1
+ #endif
+ static struct hv_tsb_descr ktsb_descr[NUM_KTSB_DESCR];
+-extern struct tsb swapper_tsb[KERNEL_TSB_NENTRIES];
++
++/* The swapper TSBs are loaded with a base sequence of:
++ *
++ * sethi %uhi(SYMBOL), REG1
++ * sethi %hi(SYMBOL), REG2
++ * or REG1, %ulo(SYMBOL), REG1
++ * or REG2, %lo(SYMBOL), REG2
++ * sllx REG1, 32, REG1
++ * or REG1, REG2, REG1
++ *
++ * When we use physical addressing for the TSB accesses, we patch the
++ * first four instructions in the above sequence.
++ */
+
+ static void patch_one_ktsb_phys(unsigned int *start, unsigned int *end, unsigned long pa)
+ {
+- pa >>= KTSB_PHYS_SHIFT;
++ unsigned long high_bits, low_bits;
++
++ high_bits = (pa >> 32) & 0xffffffff;
++ low_bits = (pa >> 0) & 0xffffffff;
+
+ while (start < end) {
+ unsigned int *ia = (unsigned int *)(unsigned long)*start;
+
+- ia[0] = (ia[0] & ~0x3fffff) | (pa >> 10);
++ ia[0] = (ia[0] & ~0x3fffff) | (high_bits >> 10);
+ __asm__ __volatile__("flush %0" : : "r" (ia));
+
+- ia[1] = (ia[1] & ~0x3ff) | (pa & 0x3ff);
++ ia[1] = (ia[1] & ~0x3fffff) | (low_bits >> 10);
+ __asm__ __volatile__("flush %0" : : "r" (ia + 1));
+
++ ia[2] = (ia[2] & ~0x1fff) | (high_bits & 0x3ff);
++ __asm__ __volatile__("flush %0" : : "r" (ia + 2));
++
++ ia[3] = (ia[3] & ~0x1fff) | (low_bits & 0x3ff);
++ __asm__ __volatile__("flush %0" : : "r" (ia + 3));
++
+ start++;
+ }
+ }
+@@ -1852,7 +1941,6 @@ static void __init sun4v_linear_pte_xor_finalize(void)
+ /* paging_init() sets up the page tables */
+
+ static unsigned long last_valid_pfn;
+-pgd_t swapper_pg_dir[PTRS_PER_PGD];
+
+ static void sun4u_pgprot_init(void);
+ static void sun4v_pgprot_init(void);
+@@ -1955,16 +2043,10 @@ void __init paging_init(void)
+ */
+ init_mm.pgd += ((shift) / (sizeof(pgd_t)));
+
+- memset(swapper_low_pmd_dir, 0, sizeof(swapper_low_pmd_dir));
++ memset(swapper_pg_dir, 0, sizeof(swapper_pg_dir));
+
+- /* Now can init the kernel/bad page tables. */
+- pud_set(pud_offset(&swapper_pg_dir[0], 0),
+- swapper_low_pmd_dir + (shift / sizeof(pgd_t)));
+-
+ inherit_prom_mappings();
+
+- init_kpte_bitmap();
+-
+ /* Ok, we can use our TLB miss and window trap handlers safely. */
+ setup_tba();
+
+@@ -2071,70 +2153,6 @@ int page_in_phys_avail(unsigned long paddr)
+ return 0;
+ }
+
+-static struct linux_prom64_registers pavail_rescan[MAX_BANKS] __initdata;
+-static int pavail_rescan_ents __initdata;
+-
+-/* Certain OBP calls, such as fetching "available" properties, can
+- * claim physical memory. So, along with initializing the valid
+- * address bitmap, what we do here is refetch the physical available
+- * memory list again, and make sure it provides at least as much
+- * memory as 'pavail' does.
+- */
+-static void __init setup_valid_addr_bitmap_from_pavail(unsigned long *bitmap)
+-{
+- int i;
+-
+- read_obp_memory("available", &pavail_rescan[0], &pavail_rescan_ents);
+-
+- for (i = 0; i < pavail_ents; i++) {
+- unsigned long old_start, old_end;
+-
+- old_start = pavail[i].phys_addr;
+- old_end = old_start + pavail[i].reg_size;
+- while (old_start < old_end) {
+- int n;
+-
+- for (n = 0; n < pavail_rescan_ents; n++) {
+- unsigned long new_start, new_end;
+-
+- new_start = pavail_rescan[n].phys_addr;
+- new_end = new_start +
+- pavail_rescan[n].reg_size;
+-
+- if (new_start <= old_start &&
+- new_end >= (old_start + PAGE_SIZE)) {
+- set_bit(old_start >> ILOG2_4MB, bitmap);
+- goto do_next_page;
+- }
+- }
+-
+- prom_printf("mem_init: Lost memory in pavail\n");
+- prom_printf("mem_init: OLD start[%lx] size[%lx]\n",
+- pavail[i].phys_addr,
+- pavail[i].reg_size);
+- prom_printf("mem_init: NEW start[%lx] size[%lx]\n",
+- pavail_rescan[i].phys_addr,
+- pavail_rescan[i].reg_size);
+- prom_printf("mem_init: Cannot continue, aborting.\n");
+- prom_halt();
+-
+- do_next_page:
+- old_start += PAGE_SIZE;
+- }
+- }
+-}
+-
+-static void __init patch_tlb_miss_handler_bitmap(void)
+-{
+- extern unsigned int valid_addr_bitmap_insn[];
+- extern unsigned int valid_addr_bitmap_patch[];
+-
+- valid_addr_bitmap_insn[1] = valid_addr_bitmap_patch[1];
+- mb();
+- valid_addr_bitmap_insn[0] = valid_addr_bitmap_patch[0];
+- flushi(&valid_addr_bitmap_insn[0]);
+-}
+-
+ static void __init register_page_bootmem_info(void)
+ {
+ #ifdef CONFIG_NEED_MULTIPLE_NODES
+@@ -2147,18 +2165,6 @@ static void __init register_page_bootmem_info(void)
+ }
+ void __init mem_init(void)
+ {
+- unsigned long addr, last;
+-
+- addr = PAGE_OFFSET + kern_base;
+- last = PAGE_ALIGN(kern_size) + addr;
+- while (addr < last) {
+- set_bit(__pa(addr) >> ILOG2_4MB, sparc64_valid_addr_bitmap);
+- addr += PAGE_SIZE;
+- }
+-
+- setup_valid_addr_bitmap_from_pavail(sparc64_valid_addr_bitmap);
+- patch_tlb_miss_handler_bitmap();
+-
+ high_memory = __va(last_valid_pfn << PAGE_SHIFT);
+
+ register_page_bootmem_info();
+@@ -2248,18 +2254,9 @@ unsigned long _PAGE_CACHE __read_mostly;
+ EXPORT_SYMBOL(_PAGE_CACHE);
+
+ #ifdef CONFIG_SPARSEMEM_VMEMMAP
+-unsigned long vmemmap_table[VMEMMAP_SIZE];
+-
+-static long __meminitdata addr_start, addr_end;
+-static int __meminitdata node_start;
+-
+ int __meminit vmemmap_populate(unsigned long vstart, unsigned long vend,
+ int node)
+ {
+- unsigned long phys_start = (vstart - VMEMMAP_BASE);
+- unsigned long phys_end = (vend - VMEMMAP_BASE);
+- unsigned long addr = phys_start & VMEMMAP_CHUNK_MASK;
+- unsigned long end = VMEMMAP_ALIGN(phys_end);
+ unsigned long pte_base;
+
+ pte_base = (_PAGE_VALID | _PAGE_SZ4MB_4U |
+@@ -2270,47 +2267,52 @@ int __meminit vmemmap_populate(unsigned long vstart, unsigned long vend,
+ _PAGE_CP_4V | _PAGE_CV_4V |
+ _PAGE_P_4V | _PAGE_W_4V);
+
+- for (; addr < end; addr += VMEMMAP_CHUNK) {
+- unsigned long *vmem_pp =
+- vmemmap_table + (addr >> VMEMMAP_CHUNK_SHIFT);
+- void *block;
++ pte_base |= _PAGE_PMD_HUGE;
+
+- if (!(*vmem_pp & _PAGE_VALID)) {
+- block = vmemmap_alloc_block(1UL << ILOG2_4MB, node);
+- if (!block)
++ vstart = vstart & PMD_MASK;
++ vend = ALIGN(vend, PMD_SIZE);
++ for (; vstart < vend; vstart += PMD_SIZE) {
++ pgd_t *pgd = pgd_offset_k(vstart);
++ unsigned long pte;
++ pud_t *pud;
++ pmd_t *pmd;
++
++ if (pgd_none(*pgd)) {
++ pud_t *new = vmemmap_alloc_block(PAGE_SIZE, node);
++
++ if (!new)
+ return -ENOMEM;
++ pgd_populate(&init_mm, pgd, new);
++ }
+
+- *vmem_pp = pte_base | __pa(block);
++ pud = pud_offset(pgd, vstart);
++ if (pud_none(*pud)) {
++ pmd_t *new = vmemmap_alloc_block(PAGE_SIZE, node);
+
+- /* check to see if we have contiguous blocks */
+- if (addr_end != addr || node_start != node) {
+- if (addr_start)
+- printk(KERN_DEBUG " [%lx-%lx] on node %d\n",
+- addr_start, addr_end-1, node_start);
+- addr_start = addr;
+- node_start = node;
+- }
+- addr_end = addr + VMEMMAP_CHUNK;
++ if (!new)
++ return -ENOMEM;
++ pud_populate(&init_mm, pud, new);
+ }
+- }
+- return 0;
+-}
+
+-void __meminit vmemmap_populate_print_last(void)
+-{
+- if (addr_start) {
+- printk(KERN_DEBUG " [%lx-%lx] on node %d\n",
+- addr_start, addr_end-1, node_start);
+- addr_start = 0;
+- addr_end = 0;
+- node_start = 0;
++ pmd = pmd_offset(pud, vstart);
++
++ pte = pmd_val(*pmd);
++ if (!(pte & _PAGE_VALID)) {
++ void *block = vmemmap_alloc_block(PMD_SIZE, node);
++
++ if (!block)
++ return -ENOMEM;
++
++ pmd_val(*pmd) = pte_base | __pa(block);
++ }
+ }
++
++ return 0;
+ }
+
+ void vmemmap_free(unsigned long start, unsigned long end)
+ {
+ }
+-
+ #endif /* CONFIG_SPARSEMEM_VMEMMAP */
+
+ static void prot_init_common(unsigned long page_none,
+@@ -2722,8 +2724,8 @@ void flush_tlb_kernel_range(unsigned long start, unsigned long end)
+ do_flush_tlb_kernel_range(start, LOW_OBP_ADDRESS);
+ }
+ if (end > HI_OBP_ADDRESS) {
+- flush_tsb_kernel_range(end, HI_OBP_ADDRESS);
+- do_flush_tlb_kernel_range(end, HI_OBP_ADDRESS);
++ flush_tsb_kernel_range(HI_OBP_ADDRESS, end);
++ do_flush_tlb_kernel_range(HI_OBP_ADDRESS, end);
+ }
+ } else {
+ flush_tsb_kernel_range(start, end);
+diff --git a/arch/sparc/mm/init_64.h b/arch/sparc/mm/init_64.h
+index 0668b364f44d..a4c09603b05c 100644
+--- a/arch/sparc/mm/init_64.h
++++ b/arch/sparc/mm/init_64.h
+@@ -8,15 +8,8 @@
+ */
+
+ #define MAX_PHYS_ADDRESS (1UL << MAX_PHYS_ADDRESS_BITS)
+-#define KPTE_BITMAP_CHUNK_SZ (256UL * 1024UL * 1024UL)
+-#define KPTE_BITMAP_BYTES \
+- ((MAX_PHYS_ADDRESS / KPTE_BITMAP_CHUNK_SZ) / 4)
+-#define VALID_ADDR_BITMAP_CHUNK_SZ (4UL * 1024UL * 1024UL)
+-#define VALID_ADDR_BITMAP_BYTES \
+- ((MAX_PHYS_ADDRESS / VALID_ADDR_BITMAP_CHUNK_SZ) / 8)
+
+ extern unsigned long kern_linear_pte_xor[4];
+-extern unsigned long kpte_linear_bitmap[KPTE_BITMAP_BYTES / sizeof(unsigned long)];
+ extern unsigned int sparc64_highest_unlocked_tlb_ent;
+ extern unsigned long sparc64_kern_pri_context;
+ extern unsigned long sparc64_kern_pri_nuc_bits;
+@@ -38,15 +31,4 @@ extern unsigned long kern_locked_tte_data;
+
+ void prom_world(int enter);
+
+-#ifdef CONFIG_SPARSEMEM_VMEMMAP
+-#define VMEMMAP_CHUNK_SHIFT 22
+-#define VMEMMAP_CHUNK (1UL << VMEMMAP_CHUNK_SHIFT)
+-#define VMEMMAP_CHUNK_MASK ~(VMEMMAP_CHUNK - 1UL)
+-#define VMEMMAP_ALIGN(x) (((x)+VMEMMAP_CHUNK-1UL)&VMEMMAP_CHUNK_MASK)
+-
+-#define VMEMMAP_SIZE ((((1UL << MAX_PHYSADDR_BITS) >> PAGE_SHIFT) * \
+- sizeof(struct page)) >> VMEMMAP_CHUNK_SHIFT)
+-extern unsigned long vmemmap_table[VMEMMAP_SIZE];
+-#endif
+-
+ #endif /* _SPARC64_MM_INIT_H */
+diff --git a/arch/sparc/net/bpf_jit_asm.S b/arch/sparc/net/bpf_jit_asm.S
+index 9d016c7017f7..8c83f4b8eb15 100644
+--- a/arch/sparc/net/bpf_jit_asm.S
++++ b/arch/sparc/net/bpf_jit_asm.S
+@@ -6,10 +6,12 @@
+ #define SAVE_SZ 176
+ #define SCRATCH_OFF STACK_BIAS + 128
+ #define BE_PTR(label) be,pn %xcc, label
++#define SIGN_EXTEND(reg) sra reg, 0, reg
+ #else
+ #define SAVE_SZ 96
+ #define SCRATCH_OFF 72
+ #define BE_PTR(label) be label
++#define SIGN_EXTEND(reg)
+ #endif
+
+ #define SKF_MAX_NEG_OFF (-0x200000) /* SKF_LL_OFF from filter.h */
+@@ -135,6 +137,7 @@ bpf_slow_path_byte_msh:
+ save %sp, -SAVE_SZ, %sp; \
+ mov %i0, %o0; \
+ mov r_OFF, %o1; \
++ SIGN_EXTEND(%o1); \
+ call bpf_internal_load_pointer_neg_helper; \
+ mov (LEN), %o2; \
+ mov %o0, r_TMP; \
+diff --git a/arch/sparc/net/bpf_jit_comp.c b/arch/sparc/net/bpf_jit_comp.c
+index 892a102671ad..8d4152f94c5a 100644
+--- a/arch/sparc/net/bpf_jit_comp.c
++++ b/arch/sparc/net/bpf_jit_comp.c
+@@ -184,7 +184,7 @@ do { \
+ */
+ #define emit_alu_K(OPCODE, K) \
+ do { \
+- if (K) { \
++ if (K || OPCODE == AND || OPCODE == MUL) { \
+ unsigned int _insn = OPCODE; \
+ _insn |= RS1(r_A) | RD(r_A); \
+ if (is_simm13(K)) { \
+@@ -234,12 +234,18 @@ do { BUILD_BUG_ON(FIELD_SIZEOF(STRUCT, FIELD) != sizeof(u8)); \
+ __emit_load8(BASE, STRUCT, FIELD, DEST); \
+ } while (0)
+
+-#define emit_ldmem(OFF, DEST) \
+-do { *prog++ = LD32I | RS1(FP) | S13(-(OFF)) | RD(DEST); \
++#ifdef CONFIG_SPARC64
++#define BIAS (STACK_BIAS - 4)
++#else
++#define BIAS (-4)
++#endif
++
++#define emit_ldmem(OFF, DEST) \
++do { *prog++ = LD32I | RS1(SP) | S13(BIAS - (OFF)) | RD(DEST); \
+ } while (0)
+
+-#define emit_stmem(OFF, SRC) \
+-do { *prog++ = LD32I | RS1(FP) | S13(-(OFF)) | RD(SRC); \
++#define emit_stmem(OFF, SRC) \
++do { *prog++ = ST32I | RS1(SP) | S13(BIAS - (OFF)) | RD(SRC); \
+ } while (0)
+
+ #ifdef CONFIG_SMP
+@@ -615,10 +621,11 @@ void bpf_jit_compile(struct sk_filter *fp)
+ case BPF_ANC | SKF_AD_VLAN_TAG:
+ case BPF_ANC | SKF_AD_VLAN_TAG_PRESENT:
+ emit_skb_load16(vlan_tci, r_A);
+- if (code == (BPF_ANC | SKF_AD_VLAN_TAG)) {
+- emit_andi(r_A, VLAN_VID_MASK, r_A);
++ if (code != (BPF_ANC | SKF_AD_VLAN_TAG)) {
++ emit_alu_K(SRL, 12);
++ emit_andi(r_A, 1, r_A);
+ } else {
+- emit_loadimm(VLAN_TAG_PRESENT, r_TMP);
++ emit_loadimm(~VLAN_TAG_PRESENT, r_TMP);
+ emit_and(r_A, r_TMP, r_A);
+ }
+ break;
+@@ -630,15 +637,19 @@ void bpf_jit_compile(struct sk_filter *fp)
+ emit_loadimm(K, r_X);
+ break;
+ case BPF_LD | BPF_MEM:
++ seen |= SEEN_MEM;
+ emit_ldmem(K * 4, r_A);
+ break;
+ case BPF_LDX | BPF_MEM:
++ seen |= SEEN_MEM | SEEN_XREG;
+ emit_ldmem(K * 4, r_X);
+ break;
+ case BPF_ST:
++ seen |= SEEN_MEM;
+ emit_stmem(K * 4, r_A);
+ break;
+ case BPF_STX:
++ seen |= SEEN_MEM | SEEN_XREG;
+ emit_stmem(K * 4, r_X);
+ break;
+
+diff --git a/arch/sparc/power/hibernate_asm.S b/arch/sparc/power/hibernate_asm.S
+index 79942166df84..d7d9017dcb15 100644
+--- a/arch/sparc/power/hibernate_asm.S
++++ b/arch/sparc/power/hibernate_asm.S
+@@ -54,8 +54,8 @@ ENTRY(swsusp_arch_resume)
+ nop
+
+ /* Write PAGE_OFFSET to %g7 */
+- sethi %uhi(PAGE_OFFSET), %g7
+- sllx %g7, 32, %g7
++ sethi %hi(PAGE_OFFSET), %g7
++ ldx [%g7 + %lo(PAGE_OFFSET)], %g7
+
+ setuw (PAGE_SIZE-8), %g3
+
+diff --git a/arch/sparc/prom/bootstr_64.c b/arch/sparc/prom/bootstr_64.c
+index ab9ccc63b388..7149e77714a4 100644
+--- a/arch/sparc/prom/bootstr_64.c
++++ b/arch/sparc/prom/bootstr_64.c
+@@ -14,7 +14,10 @@
+ * the .bss section or it will break things.
+ */
+
+-#define BARG_LEN 256
++/* We limit BARG_LEN to 1024 because this is the size of the
++ * 'barg_out' command line buffer in the SILO bootloader.
++ */
++#define BARG_LEN 1024
+ struct {
+ int bootstr_len;
+ int bootstr_valid;
+diff --git a/arch/sparc/prom/cif.S b/arch/sparc/prom/cif.S
+index 9c86b4b7d429..8050f381f518 100644
+--- a/arch/sparc/prom/cif.S
++++ b/arch/sparc/prom/cif.S
+@@ -11,11 +11,10 @@
+ .text
+ .globl prom_cif_direct
+ prom_cif_direct:
++ save %sp, -192, %sp
+ sethi %hi(p1275buf), %o1
+ or %o1, %lo(p1275buf), %o1
+- ldx [%o1 + 0x0010], %o2 ! prom_cif_stack
+- save %o2, -192, %sp
+- ldx [%i1 + 0x0008], %l2 ! prom_cif_handler
++ ldx [%o1 + 0x0008], %l2 ! prom_cif_handler
+ mov %g4, %l0
+ mov %g5, %l1
+ mov %g6, %l3
+diff --git a/arch/sparc/prom/init_64.c b/arch/sparc/prom/init_64.c
+index d95db755828f..110b0d78b864 100644
+--- a/arch/sparc/prom/init_64.c
++++ b/arch/sparc/prom/init_64.c
+@@ -26,13 +26,13 @@ phandle prom_chosen_node;
+ * It gets passed the pointer to the PROM vector.
+ */
+
+-extern void prom_cif_init(void *, void *);
++extern void prom_cif_init(void *);
+
+-void __init prom_init(void *cif_handler, void *cif_stack)
++void __init prom_init(void *cif_handler)
+ {
+ phandle node;
+
+- prom_cif_init(cif_handler, cif_stack);
++ prom_cif_init(cif_handler);
+
+ prom_chosen_node = prom_finddevice(prom_chosen_path);
+ if (!prom_chosen_node || (s32)prom_chosen_node == -1)
+diff --git a/arch/sparc/prom/p1275.c b/arch/sparc/prom/p1275.c
+index e58b81726319..545d8bb79b65 100644
+--- a/arch/sparc/prom/p1275.c
++++ b/arch/sparc/prom/p1275.c
+@@ -9,6 +9,7 @@
+ #include <linux/smp.h>
+ #include <linux/string.h>
+ #include <linux/spinlock.h>
++#include <linux/irqflags.h>
+
+ #include <asm/openprom.h>
+ #include <asm/oplib.h>
+@@ -19,7 +20,6 @@
+ struct {
+ long prom_callback; /* 0x00 */
+ void (*prom_cif_handler)(long *); /* 0x08 */
+- unsigned long prom_cif_stack; /* 0x10 */
+ } p1275buf;
+
+ extern void prom_world(int);
+@@ -36,8 +36,8 @@ void p1275_cmd_direct(unsigned long *args)
+ {
+ unsigned long flags;
+
+- raw_local_save_flags(flags);
+- raw_local_irq_restore((unsigned long)PIL_NMI);
++ local_save_flags(flags);
++ local_irq_restore((unsigned long)PIL_NMI);
+ raw_spin_lock(&prom_entry_lock);
+
+ prom_world(1);
+@@ -45,11 +45,10 @@ void p1275_cmd_direct(unsigned long *args)
+ prom_world(0);
+
+ raw_spin_unlock(&prom_entry_lock);
+- raw_local_irq_restore(flags);
++ local_irq_restore(flags);
+ }
+
+ void prom_cif_init(void *cif_handler, void *cif_stack)
+ {
+ p1275buf.prom_cif_handler = (void (*)(long *))cif_handler;
+- p1275buf.prom_cif_stack = (unsigned long)cif_stack;
+ }
+diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
+index 9f83c171ac18..db1ce1e90a5b 100644
+--- a/arch/x86/include/asm/kvm_host.h
++++ b/arch/x86/include/asm/kvm_host.h
+@@ -479,6 +479,7 @@ struct kvm_vcpu_arch {
+ u64 mmio_gva;
+ unsigned access;
+ gfn_t mmio_gfn;
++ u64 mmio_gen;
+
+ struct kvm_pmu pmu;
+
+diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
+index f9e4fdd3b877..21337cd58b6b 100644
+--- a/arch/x86/kernel/cpu/intel.c
++++ b/arch/x86/kernel/cpu/intel.c
+@@ -144,6 +144,21 @@ static void early_init_intel(struct cpuinfo_x86 *c)
+ setup_clear_cpu_cap(X86_FEATURE_ERMS);
+ }
+ }
++
++ /*
++ * Intel Quark Core DevMan_001.pdf section 6.4.11
++ * "The operating system also is required to invalidate (i.e., flush)
++ * the TLB when any changes are made to any of the page table entries.
++ * The operating system must reload CR3 to cause the TLB to be flushed"
++ *
++ * As a result cpu_has_pge() in arch/x86/include/asm/tlbflush.h should
++ * be false so that __flush_tlb_all() causes CR3 instead of CR4.PGE
++ * to be modified
++ */
++ if (c->x86 == 5 && c->x86_model == 9) {
++ pr_info("Disabling PGE capability bit\n");
++ setup_clear_cpu_cap(X86_FEATURE_PGE);
++ }
+ }
+
+ #ifdef CONFIG_X86_32
+diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
+index 931467881da7..1cd2a5fbde07 100644
+--- a/arch/x86/kvm/mmu.c
++++ b/arch/x86/kvm/mmu.c
+@@ -199,16 +199,20 @@ void kvm_mmu_set_mmio_spte_mask(u64 mmio_mask)
+ EXPORT_SYMBOL_GPL(kvm_mmu_set_mmio_spte_mask);
+
+ /*
+- * spte bits of bit 3 ~ bit 11 are used as low 9 bits of generation number,
+- * the bits of bits 52 ~ bit 61 are used as high 10 bits of generation
+- * number.
++ * the low bit of the generation number is always presumed to be zero.
++ * This disables mmio caching during memslot updates. The concept is
++ * similar to a seqcount but instead of retrying the access we just punt
++ * and ignore the cache.
++ *
++ * spte bits 3-11 are used as bits 1-9 of the generation number,
++ * the bits 52-61 are used as bits 10-19 of the generation number.
+ */
+-#define MMIO_SPTE_GEN_LOW_SHIFT 3
++#define MMIO_SPTE_GEN_LOW_SHIFT 2
+ #define MMIO_SPTE_GEN_HIGH_SHIFT 52
+
+-#define MMIO_GEN_SHIFT 19
+-#define MMIO_GEN_LOW_SHIFT 9
+-#define MMIO_GEN_LOW_MASK ((1 << MMIO_GEN_LOW_SHIFT) - 1)
++#define MMIO_GEN_SHIFT 20
++#define MMIO_GEN_LOW_SHIFT 10
++#define MMIO_GEN_LOW_MASK ((1 << MMIO_GEN_LOW_SHIFT) - 2)
+ #define MMIO_GEN_MASK ((1 << MMIO_GEN_SHIFT) - 1)
+ #define MMIO_MAX_GEN ((1 << MMIO_GEN_SHIFT) - 1)
+
+@@ -236,12 +240,7 @@ static unsigned int get_mmio_spte_generation(u64 spte)
+
+ static unsigned int kvm_current_mmio_generation(struct kvm *kvm)
+ {
+- /*
+- * Init kvm generation close to MMIO_MAX_GEN to easily test the
+- * code of handling generation number wrap-around.
+- */
+- return (kvm_memslots(kvm)->generation +
+- MMIO_MAX_GEN - 150) & MMIO_GEN_MASK;
++ return kvm_memslots(kvm)->generation & MMIO_GEN_MASK;
+ }
+
+ static void mark_mmio_spte(struct kvm *kvm, u64 *sptep, u64 gfn,
+@@ -3163,7 +3162,7 @@ static void mmu_sync_roots(struct kvm_vcpu *vcpu)
+ if (!VALID_PAGE(vcpu->arch.mmu.root_hpa))
+ return;
+
+- vcpu_clear_mmio_info(vcpu, ~0ul);
++ vcpu_clear_mmio_info(vcpu, MMIO_GVA_ANY);
+ kvm_mmu_audit(vcpu, AUDIT_PRE_SYNC);
+ if (vcpu->arch.mmu.root_level == PT64_ROOT_LEVEL) {
+ hpa_t root = vcpu->arch.mmu.root_hpa;
+@@ -4433,7 +4432,7 @@ void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm)
+ * The very rare case: if the generation-number is round,
+ * zap all shadow pages.
+ */
+- if (unlikely(kvm_current_mmio_generation(kvm) >= MMIO_MAX_GEN)) {
++ if (unlikely(kvm_current_mmio_generation(kvm) == 0)) {
+ printk_ratelimited(KERN_INFO "kvm: zapping shadow pages for mmio generation wraparound\n");
+ kvm_mmu_invalidate_zap_all_pages(kvm);
+ }
+diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
+index 801332edefc3..6c437ed00dcf 100644
+--- a/arch/x86/kvm/vmx.c
++++ b/arch/x86/kvm/vmx.c
+@@ -450,6 +450,7 @@ struct vcpu_vmx {
+ int gs_ldt_reload_needed;
+ int fs_reload_needed;
+ u64 msr_host_bndcfgs;
++ unsigned long vmcs_host_cr4; /* May not match real cr4 */
+ } host_state;
+ struct {
+ int vm86_active;
+@@ -4218,11 +4219,16 @@ static void vmx_set_constant_host_state(struct vcpu_vmx *vmx)
+ u32 low32, high32;
+ unsigned long tmpl;
+ struct desc_ptr dt;
++ unsigned long cr4;
+
+ vmcs_writel(HOST_CR0, read_cr0() & ~X86_CR0_TS); /* 22.2.3 */
+- vmcs_writel(HOST_CR4, read_cr4()); /* 22.2.3, 22.2.5 */
+ vmcs_writel(HOST_CR3, read_cr3()); /* 22.2.3 FIXME: shadow tables */
+
++ /* Save the most likely value for this task's CR4 in the VMCS. */
++ cr4 = read_cr4();
++ vmcs_writel(HOST_CR4, cr4); /* 22.2.3, 22.2.5 */
++ vmx->host_state.vmcs_host_cr4 = cr4;
++
+ vmcs_write16(HOST_CS_SELECTOR, __KERNEL_CS); /* 22.2.4 */
+ #ifdef CONFIG_X86_64
+ /*
+@@ -7336,7 +7342,7 @@ static void atomic_switch_perf_msrs(struct vcpu_vmx *vmx)
+ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
+ {
+ struct vcpu_vmx *vmx = to_vmx(vcpu);
+- unsigned long debugctlmsr;
++ unsigned long debugctlmsr, cr4;
+
+ /* Record the guest's net vcpu time for enforced NMI injections. */
+ if (unlikely(!cpu_has_virtual_nmis() && vmx->soft_vnmi_blocked))
+@@ -7357,6 +7363,12 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
+ if (test_bit(VCPU_REGS_RIP, (unsigned long *)&vcpu->arch.regs_dirty))
+ vmcs_writel(GUEST_RIP, vcpu->arch.regs[VCPU_REGS_RIP]);
+
++ cr4 = read_cr4();
++ if (unlikely(cr4 != vmx->host_state.vmcs_host_cr4)) {
++ vmcs_writel(HOST_CR4, cr4);
++ vmx->host_state.vmcs_host_cr4 = cr4;
++ }
++
+ /* When single-stepping over STI and MOV SS, we must clear the
+ * corresponding interruptibility bits in the guest state. Otherwise
+ * vmentry fails as it then expects bit 14 (BS) in pending debug
+diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
+index 8c97bac9a895..b0b17e6f0431 100644
+--- a/arch/x86/kvm/x86.h
++++ b/arch/x86/kvm/x86.h
+@@ -78,15 +78,23 @@ static inline void vcpu_cache_mmio_info(struct kvm_vcpu *vcpu,
+ vcpu->arch.mmio_gva = gva & PAGE_MASK;
+ vcpu->arch.access = access;
+ vcpu->arch.mmio_gfn = gfn;
++ vcpu->arch.mmio_gen = kvm_memslots(vcpu->kvm)->generation;
++}
++
++static inline bool vcpu_match_mmio_gen(struct kvm_vcpu *vcpu)
++{
++ return vcpu->arch.mmio_gen == kvm_memslots(vcpu->kvm)->generation;
+ }
+
+ /*
+- * Clear the mmio cache info for the given gva,
+- * specially, if gva is ~0ul, we clear all mmio cache info.
++ * Clear the mmio cache info for the given gva. If gva is MMIO_GVA_ANY, we
++ * clear all mmio cache info.
+ */
++#define MMIO_GVA_ANY (~(gva_t)0)
++
+ static inline void vcpu_clear_mmio_info(struct kvm_vcpu *vcpu, gva_t gva)
+ {
+- if (gva != (~0ul) && vcpu->arch.mmio_gva != (gva & PAGE_MASK))
++ if (gva != MMIO_GVA_ANY && vcpu->arch.mmio_gva != (gva & PAGE_MASK))
+ return;
+
+ vcpu->arch.mmio_gva = 0;
+@@ -94,7 +102,8 @@ static inline void vcpu_clear_mmio_info(struct kvm_vcpu *vcpu, gva_t gva)
+
+ static inline bool vcpu_match_mmio_gva(struct kvm_vcpu *vcpu, unsigned long gva)
+ {
+- if (vcpu->arch.mmio_gva && vcpu->arch.mmio_gva == (gva & PAGE_MASK))
++ if (vcpu_match_mmio_gen(vcpu) && vcpu->arch.mmio_gva &&
++ vcpu->arch.mmio_gva == (gva & PAGE_MASK))
+ return true;
+
+ return false;
+@@ -102,7 +111,8 @@ static inline bool vcpu_match_mmio_gva(struct kvm_vcpu *vcpu, unsigned long gva)
+
+ static inline bool vcpu_match_mmio_gpa(struct kvm_vcpu *vcpu, gpa_t gpa)
+ {
+- if (vcpu->arch.mmio_gfn && vcpu->arch.mmio_gfn == gpa >> PAGE_SHIFT)
++ if (vcpu_match_mmio_gen(vcpu) && vcpu->arch.mmio_gfn &&
++ vcpu->arch.mmio_gfn == gpa >> PAGE_SHIFT)
+ return true;
+
+ return false;
+diff --git a/crypto/async_tx/async_xor.c b/crypto/async_tx/async_xor.c
+index 3c562f5a60bb..e1bce26cd4f9 100644
+--- a/crypto/async_tx/async_xor.c
++++ b/crypto/async_tx/async_xor.c
+@@ -78,8 +78,6 @@ do_async_xor(struct dma_chan *chan, struct dmaengine_unmap_data *unmap,
+ tx = dma->device_prep_dma_xor(chan, dma_dest, src_list,
+ xor_src_cnt, unmap->len,
+ dma_flags);
+- src_list[0] = tmp;
+-
+
+ if (unlikely(!tx))
+ async_tx_quiesce(&submit->depend_tx);
+@@ -92,6 +90,7 @@ do_async_xor(struct dma_chan *chan, struct dmaengine_unmap_data *unmap,
+ xor_src_cnt, unmap->len,
+ dma_flags);
+ }
++ src_list[0] = tmp;
+
+ dma_set_unmap(tx, unmap);
+ async_tx_submit(chan, tx, submit);
+diff --git a/drivers/base/firmware_class.c b/drivers/base/firmware_class.c
+index d276e33880be..2a1d1ae5c11d 100644
+--- a/drivers/base/firmware_class.c
++++ b/drivers/base/firmware_class.c
+@@ -1086,6 +1086,9 @@ _request_firmware(const struct firmware **firmware_p, const char *name,
+ if (!firmware_p)
+ return -EINVAL;
+
++ if (!name || name[0] == '\0')
++ return -EINVAL;
++
+ ret = _request_firmware_prepare(&fw, name, device);
+ if (ret <= 0) /* error or already assigned */
+ goto out;
+diff --git a/drivers/base/regmap/regmap-debugfs.c b/drivers/base/regmap/regmap-debugfs.c
+index 65ea7b256b3e..a3530dadb163 100644
+--- a/drivers/base/regmap/regmap-debugfs.c
++++ b/drivers/base/regmap/regmap-debugfs.c
+@@ -473,6 +473,7 @@ void regmap_debugfs_init(struct regmap *map, const char *name)
+ {
+ struct rb_node *next;
+ struct regmap_range_node *range_node;
++ const char *devname = "dummy";
+
+ /* If we don't have the debugfs root yet, postpone init */
+ if (!regmap_debugfs_root) {
+@@ -491,12 +492,15 @@ void regmap_debugfs_init(struct regmap *map, const char *name)
+ INIT_LIST_HEAD(&map->debugfs_off_cache);
+ mutex_init(&map->cache_lock);
+
++ if (map->dev)
++ devname = dev_name(map->dev);
++
+ if (name) {
+ map->debugfs_name = kasprintf(GFP_KERNEL, "%s-%s",
+- dev_name(map->dev), name);
++ devname, name);
+ name = map->debugfs_name;
+ } else {
+- name = dev_name(map->dev);
++ name = devname;
+ }
+
+ map->debugfs = debugfs_create_dir(name, regmap_debugfs_root);
+diff --git a/drivers/base/regmap/regmap.c b/drivers/base/regmap/regmap.c
+index 283644e5d31f..8cda01590ed2 100644
+--- a/drivers/base/regmap/regmap.c
++++ b/drivers/base/regmap/regmap.c
+@@ -1395,7 +1395,7 @@ int _regmap_write(struct regmap *map, unsigned int reg,
+ }
+
+ #ifdef LOG_DEVICE
+- if (strcmp(dev_name(map->dev), LOG_DEVICE) == 0)
++ if (map->dev && strcmp(dev_name(map->dev), LOG_DEVICE) == 0)
+ dev_info(map->dev, "%x <= %x\n", reg, val);
+ #endif
+
+@@ -1646,6 +1646,9 @@ out:
+ } else {
+ void *wval;
+
++ if (!val_count)
++ return -EINVAL;
++
+ wval = kmemdup(val, val_count * val_bytes, GFP_KERNEL);
+ if (!wval) {
+ dev_err(map->dev, "Error in memory allocation\n");
+@@ -2045,7 +2048,7 @@ static int _regmap_read(struct regmap *map, unsigned int reg,
+ ret = map->reg_read(context, reg, val);
+ if (ret == 0) {
+ #ifdef LOG_DEVICE
+- if (strcmp(dev_name(map->dev), LOG_DEVICE) == 0)
++ if (map->dev && strcmp(dev_name(map->dev), LOG_DEVICE) == 0)
+ dev_info(map->dev, "%x => %x\n", reg, *val);
+ #endif
+
+diff --git a/drivers/bluetooth/btusb.c b/drivers/bluetooth/btusb.c
+index 6250fc2fb93a..0489a946e68d 100644
+--- a/drivers/bluetooth/btusb.c
++++ b/drivers/bluetooth/btusb.c
+@@ -317,6 +317,9 @@ static void btusb_intr_complete(struct urb *urb)
+ BT_ERR("%s corrupted event packet", hdev->name);
+ hdev->stat.err_rx++;
+ }
++ } else if (urb->status == -ENOENT) {
++ /* Avoid suspend failed when usb_kill_urb */
++ return;
+ }
+
+ if (!test_bit(BTUSB_INTR_RUNNING, &data->flags))
+@@ -405,6 +408,9 @@ static void btusb_bulk_complete(struct urb *urb)
+ BT_ERR("%s corrupted ACL packet", hdev->name);
+ hdev->stat.err_rx++;
+ }
++ } else if (urb->status == -ENOENT) {
++ /* Avoid suspend failed when usb_kill_urb */
++ return;
+ }
+
+ if (!test_bit(BTUSB_BULK_RUNNING, &data->flags))
+@@ -499,6 +505,9 @@ static void btusb_isoc_complete(struct urb *urb)
+ hdev->stat.err_rx++;
+ }
+ }
++ } else if (urb->status == -ENOENT) {
++ /* Avoid suspend failed when usb_kill_urb */
++ return;
+ }
+
+ if (!test_bit(BTUSB_ISOC_RUNNING, &data->flags))
+diff --git a/drivers/bluetooth/hci_h5.c b/drivers/bluetooth/hci_h5.c
+index fede8ca7147c..5d9148f8a506 100644
+--- a/drivers/bluetooth/hci_h5.c
++++ b/drivers/bluetooth/hci_h5.c
+@@ -237,7 +237,7 @@ static void h5_pkt_cull(struct h5 *h5)
+ break;
+
+ to_remove--;
+- seq = (seq - 1) % 8;
++ seq = (seq - 1) & 0x07;
+ }
+
+ if (seq != h5->rx_ack)
+diff --git a/drivers/edac/mpc85xx_edac.c b/drivers/edac/mpc85xx_edac.c
+index f4aec2e6ef56..7d3742edbaa2 100644
+--- a/drivers/edac/mpc85xx_edac.c
++++ b/drivers/edac/mpc85xx_edac.c
+@@ -633,7 +633,7 @@ static int mpc85xx_l2_err_probe(struct platform_device *op)
+ if (edac_op_state == EDAC_OPSTATE_INT) {
+ pdata->irq = irq_of_parse_and_map(op->dev.of_node, 0);
+ res = devm_request_irq(&op->dev, pdata->irq,
+- mpc85xx_l2_isr, 0,
++ mpc85xx_l2_isr, IRQF_SHARED,
+ "[EDAC] L2 err", edac_dev);
+ if (res < 0) {
+ printk(KERN_ERR
+diff --git a/drivers/hid/hid-rmi.c b/drivers/hid/hid-rmi.c
+index 578bbe65902b..54966ca9e503 100644
+--- a/drivers/hid/hid-rmi.c
++++ b/drivers/hid/hid-rmi.c
+@@ -320,10 +320,7 @@ static int rmi_f11_input_event(struct hid_device *hdev, u8 irq, u8 *data,
+ int offset;
+ int i;
+
+- if (size < hdata->f11.report_size)
+- return 0;
+-
+- if (!(irq & hdata->f11.irq_mask))
++ if (!(irq & hdata->f11.irq_mask) || size <= 0)
+ return 0;
+
+ offset = (hdata->max_fingers >> 2) + 1;
+@@ -332,9 +329,19 @@ static int rmi_f11_input_event(struct hid_device *hdev, u8 irq, u8 *data,
+ int fs_bit_position = (i & 0x3) << 1;
+ int finger_state = (data[fs_byte_position] >> fs_bit_position) &
+ 0x03;
++ int position = offset + 5 * i;
++
++ if (position + 5 > size) {
++ /* partial report, go on with what we received */
++ printk_once(KERN_WARNING
++ "%s %s: Detected incomplete finger report. Finger reports may occasionally get dropped on this platform.\n",
++ dev_driver_string(&hdev->dev),
++ dev_name(&hdev->dev));
++ hid_dbg(hdev, "Incomplete finger report\n");
++ break;
++ }
+
+- rmi_f11_process_touch(hdata, i, finger_state,
+- &data[offset + 5 * i]);
++ rmi_f11_process_touch(hdata, i, finger_state, &data[position]);
+ }
+ input_mt_sync_frame(hdata->input);
+ input_sync(hdata->input);
+@@ -352,6 +359,11 @@ static int rmi_f30_input_event(struct hid_device *hdev, u8 irq, u8 *data,
+ if (!(irq & hdata->f30.irq_mask))
+ return 0;
+
++ if (size < (int)hdata->f30.report_size) {
++ hid_warn(hdev, "Click Button pressed, but the click data is missing\n");
++ return 0;
++ }
++
+ for (i = 0; i < hdata->gpio_led_count; i++) {
+ if (test_bit(i, &hdata->button_mask)) {
+ value = (data[i / 8] >> (i & 0x07)) & BIT(0);
+@@ -412,9 +424,29 @@ static int rmi_read_data_event(struct hid_device *hdev, u8 *data, int size)
+ return 1;
+ }
+
++static int rmi_check_sanity(struct hid_device *hdev, u8 *data, int size)
++{
++ int valid_size = size;
++ /*
++ * On the Dell XPS 13 9333, the bus sometimes get confused and fills
++ * the report with a sentinel value "ff". Synaptics told us that such
++	 * behavior does not come from the touchpad itself, so we filter out
++ * such reports here.
++ */
++
++ while ((data[valid_size - 1] == 0xff) && valid_size > 0)
++ valid_size--;
++
++ return valid_size;
++}
++
+ static int rmi_raw_event(struct hid_device *hdev,
+ struct hid_report *report, u8 *data, int size)
+ {
++ size = rmi_check_sanity(hdev, data, size);
++ if (size < 2)
++ return 0;
++
+ switch (data[0]) {
+ case RMI_READ_DATA_REPORT_ID:
+ return rmi_read_data_event(hdev, data, size);
+diff --git a/drivers/hv/channel.c b/drivers/hv/channel.c
+index 284cf66489f4..bec55ed2917a 100644
+--- a/drivers/hv/channel.c
++++ b/drivers/hv/channel.c
+@@ -165,8 +165,10 @@ int vmbus_open(struct vmbus_channel *newchannel, u32 send_ringbuffer_size,
+ ret = vmbus_post_msg(open_msg,
+ sizeof(struct vmbus_channel_open_channel));
+
+- if (ret != 0)
++ if (ret != 0) {
++ err = ret;
+ goto error1;
++ }
+
+ t = wait_for_completion_timeout(&open_info->waitevent, 5*HZ);
+ if (t == 0) {
+@@ -363,7 +365,6 @@ int vmbus_establish_gpadl(struct vmbus_channel *channel, void *kbuffer,
+ u32 next_gpadl_handle;
+ unsigned long flags;
+ int ret = 0;
+- int t;
+
+ next_gpadl_handle = atomic_read(&vmbus_connection.next_gpadl_handle);
+ atomic_inc(&vmbus_connection.next_gpadl_handle);
+@@ -410,9 +411,7 @@ int vmbus_establish_gpadl(struct vmbus_channel *channel, void *kbuffer,
+
+ }
+ }
+- t = wait_for_completion_timeout(&msginfo->waitevent, 5*HZ);
+- BUG_ON(t == 0);
+-
++ wait_for_completion(&msginfo->waitevent);
+
+ /* At this point, we received the gpadl created msg */
+ *gpadl_handle = gpadlmsg->gpadl;
+@@ -435,7 +434,7 @@ int vmbus_teardown_gpadl(struct vmbus_channel *channel, u32 gpadl_handle)
+ struct vmbus_channel_gpadl_teardown *msg;
+ struct vmbus_channel_msginfo *info;
+ unsigned long flags;
+- int ret, t;
++ int ret;
+
+ info = kmalloc(sizeof(*info) +
+ sizeof(struct vmbus_channel_gpadl_teardown), GFP_KERNEL);
+@@ -457,11 +456,12 @@ int vmbus_teardown_gpadl(struct vmbus_channel *channel, u32 gpadl_handle)
+ ret = vmbus_post_msg(msg,
+ sizeof(struct vmbus_channel_gpadl_teardown));
+
+- BUG_ON(ret != 0);
+- t = wait_for_completion_timeout(&info->waitevent, 5*HZ);
+- BUG_ON(t == 0);
++ if (ret)
++ goto post_msg_err;
++
++ wait_for_completion(&info->waitevent);
+
+- /* Received a torndown response */
++post_msg_err:
+ spin_lock_irqsave(&vmbus_connection.channelmsg_lock, flags);
+ list_del(&info->msglistentry);
+ spin_unlock_irqrestore(&vmbus_connection.channelmsg_lock, flags);
+@@ -478,7 +478,7 @@ static void reset_channel_cb(void *arg)
+ channel->onchannel_callback = NULL;
+ }
+
+-static void vmbus_close_internal(struct vmbus_channel *channel)
++static int vmbus_close_internal(struct vmbus_channel *channel)
+ {
+ struct vmbus_channel_close_channel *msg;
+ int ret;
+@@ -501,11 +501,28 @@ static void vmbus_close_internal(struct vmbus_channel *channel)
+
+ ret = vmbus_post_msg(msg, sizeof(struct vmbus_channel_close_channel));
+
+- BUG_ON(ret != 0);
++ if (ret) {
++ pr_err("Close failed: close post msg return is %d\n", ret);
++ /*
++ * If we failed to post the close msg,
++ * it is perhaps better to leak memory.
++ */
++ return ret;
++ }
++
+ /* Tear down the gpadl for the channel's ring buffer */
+- if (channel->ringbuffer_gpadlhandle)
+- vmbus_teardown_gpadl(channel,
+- channel->ringbuffer_gpadlhandle);
++ if (channel->ringbuffer_gpadlhandle) {
++ ret = vmbus_teardown_gpadl(channel,
++ channel->ringbuffer_gpadlhandle);
++ if (ret) {
++ pr_err("Close failed: teardown gpadl return %d\n", ret);
++ /*
++ * If we failed to teardown gpadl,
++ * it is perhaps better to leak memory.
++ */
++ return ret;
++ }
++ }
+
+ /* Cleanup the ring buffers for this channel */
+ hv_ringbuffer_cleanup(&channel->outbound);
+@@ -514,7 +531,7 @@ static void vmbus_close_internal(struct vmbus_channel *channel)
+ free_pages((unsigned long)channel->ringbuffer_pages,
+ get_order(channel->ringbuffer_pagecount * PAGE_SIZE));
+
+-
++ return ret;
+ }
+
+ /*
+diff --git a/drivers/hv/connection.c b/drivers/hv/connection.c
+index ae22e3c1fc4c..e206619b946e 100644
+--- a/drivers/hv/connection.c
++++ b/drivers/hv/connection.c
+@@ -427,10 +427,21 @@ int vmbus_post_msg(void *buffer, size_t buflen)
+ * insufficient resources. Retry the operation a couple of
+ * times before giving up.
+ */
+- while (retries < 3) {
+- ret = hv_post_message(conn_id, 1, buffer, buflen);
+- if (ret != HV_STATUS_INSUFFICIENT_BUFFERS)
++ while (retries < 10) {
++ ret = hv_post_message(conn_id, 1, buffer, buflen);
++
++ switch (ret) {
++ case HV_STATUS_INSUFFICIENT_BUFFERS:
++ ret = -ENOMEM;
++ case -ENOMEM:
++ break;
++ case HV_STATUS_SUCCESS:
+ return ret;
++ default:
++ pr_err("hv_post_msg() failed; error code:%d\n", ret);
++ return -EINVAL;
++ }
++
+ retries++;
+ msleep(100);
+ }
+diff --git a/drivers/hv/hv.c b/drivers/hv/hv.c
+index edfc8488cb03..3e4235c7a47f 100644
+--- a/drivers/hv/hv.c
++++ b/drivers/hv/hv.c
+@@ -138,6 +138,8 @@ int hv_init(void)
+ memset(hv_context.synic_event_page, 0, sizeof(void *) * NR_CPUS);
+ memset(hv_context.synic_message_page, 0,
+ sizeof(void *) * NR_CPUS);
++ memset(hv_context.post_msg_page, 0,
++ sizeof(void *) * NR_CPUS);
+ memset(hv_context.vp_index, 0,
+ sizeof(int) * NR_CPUS);
+ memset(hv_context.event_dpc, 0,
+@@ -217,26 +219,18 @@ int hv_post_message(union hv_connection_id connection_id,
+ enum hv_message_type message_type,
+ void *payload, size_t payload_size)
+ {
+- struct aligned_input {
+- u64 alignment8;
+- struct hv_input_post_message msg;
+- };
+
+ struct hv_input_post_message *aligned_msg;
+ u16 status;
+- unsigned long addr;
+
+ if (payload_size > HV_MESSAGE_PAYLOAD_BYTE_COUNT)
+ return -EMSGSIZE;
+
+- addr = (unsigned long)kmalloc(sizeof(struct aligned_input), GFP_ATOMIC);
+- if (!addr)
+- return -ENOMEM;
+-
+ aligned_msg = (struct hv_input_post_message *)
+- (ALIGN(addr, HV_HYPERCALL_PARAM_ALIGN));
++ hv_context.post_msg_page[get_cpu()];
+
+ aligned_msg->connectionid = connection_id;
++ aligned_msg->reserved = 0;
+ aligned_msg->message_type = message_type;
+ aligned_msg->payload_size = payload_size;
+ memcpy((void *)aligned_msg->payload, payload, payload_size);
+@@ -244,8 +238,7 @@ int hv_post_message(union hv_connection_id connection_id,
+ status = do_hypercall(HVCALL_POST_MESSAGE, aligned_msg, NULL)
+ & 0xFFFF;
+
+- kfree((void *)addr);
+-
++ put_cpu();
+ return status;
+ }
+
+@@ -294,6 +287,14 @@ int hv_synic_alloc(void)
+ pr_err("Unable to allocate SYNIC event page\n");
+ goto err;
+ }
++
++ hv_context.post_msg_page[cpu] =
++ (void *)get_zeroed_page(GFP_ATOMIC);
++
++ if (hv_context.post_msg_page[cpu] == NULL) {
++ pr_err("Unable to allocate post msg page\n");
++ goto err;
++ }
+ }
+
+ return 0;
+@@ -308,6 +309,8 @@ static void hv_synic_free_cpu(int cpu)
+ free_page((unsigned long)hv_context.synic_event_page[cpu]);
+ if (hv_context.synic_message_page[cpu])
+ free_page((unsigned long)hv_context.synic_message_page[cpu]);
++ if (hv_context.post_msg_page[cpu])
++ free_page((unsigned long)hv_context.post_msg_page[cpu]);
+ }
+
+ void hv_synic_free(void)
+diff --git a/drivers/hv/hyperv_vmbus.h b/drivers/hv/hyperv_vmbus.h
+index 22b750749a39..c386d8dc7223 100644
+--- a/drivers/hv/hyperv_vmbus.h
++++ b/drivers/hv/hyperv_vmbus.h
+@@ -515,6 +515,10 @@ struct hv_context {
+ * per-cpu list of the channels based on their CPU affinity.
+ */
+ struct list_head percpu_list[NR_CPUS];
++ /*
++ * buffer to post messages to the host.
++ */
++ void *post_msg_page[NR_CPUS];
+ };
+
+ extern struct hv_context hv_context;
+diff --git a/drivers/message/fusion/mptspi.c b/drivers/message/fusion/mptspi.c
+index 49d11338294b..2fb90e2825c3 100644
+--- a/drivers/message/fusion/mptspi.c
++++ b/drivers/message/fusion/mptspi.c
+@@ -1420,6 +1420,11 @@ mptspi_probe(struct pci_dev *pdev, const struct pci_device_id *id)
+ goto out_mptspi_probe;
+ }
+
++ /* VMWare emulation doesn't properly implement WRITE_SAME
++ */
++ if (pdev->subsystem_vendor == 0x15AD)
++ sh->no_write_same = 1;
++
+ spin_lock_irqsave(&ioc->FreeQlock, flags);
+
+ /* Attach the SCSI Host to the IOC structure
+diff --git a/drivers/misc/mei/bus.c b/drivers/misc/mei/bus.c
+index 0e993ef28b94..8fd9466266b6 100644
+--- a/drivers/misc/mei/bus.c
++++ b/drivers/misc/mei/bus.c
+@@ -70,7 +70,7 @@ static int mei_cl_device_probe(struct device *dev)
+
+ dev_dbg(dev, "Device probe\n");
+
+- strncpy(id.name, dev_name(dev), sizeof(id.name));
++ strlcpy(id.name, dev_name(dev), sizeof(id.name));
+
+ return driver->probe(device, &id);
+ }
+diff --git a/drivers/net/wireless/ath/ath9k/ar5008_phy.c b/drivers/net/wireless/ath/ath9k/ar5008_phy.c
+index 00fb8badbacc..3b3e91057a4c 100644
+--- a/drivers/net/wireless/ath/ath9k/ar5008_phy.c
++++ b/drivers/net/wireless/ath/ath9k/ar5008_phy.c
+@@ -1004,9 +1004,11 @@ static bool ar5008_hw_ani_control_new(struct ath_hw *ah,
+ case ATH9K_ANI_FIRSTEP_LEVEL:{
+ u32 level = param;
+
+- value = level;
++ value = level * 2;
+ REG_RMW_FIELD(ah, AR_PHY_FIND_SIG,
+ AR_PHY_FIND_SIG_FIRSTEP, value);
++ REG_RMW_FIELD(ah, AR_PHY_FIND_SIG_LOW,
++ AR_PHY_FIND_SIG_FIRSTEP_LOW, value);
+
+ if (level != aniState->firstepLevel) {
+ ath_dbg(common, ANI,
+diff --git a/drivers/net/wireless/iwlwifi/mvm/constants.h b/drivers/net/wireless/iwlwifi/mvm/constants.h
+index 51685693af2e..cb4c06cead2d 100644
+--- a/drivers/net/wireless/iwlwifi/mvm/constants.h
++++ b/drivers/net/wireless/iwlwifi/mvm/constants.h
+@@ -80,7 +80,7 @@
+ #define IWL_MVM_WOWLAN_PS_SNOOZE_WINDOW 25
+ #define IWL_MVM_LOWLAT_QUOTA_MIN_PERCENT 64
+ #define IWL_MVM_BT_COEX_SYNC2SCO 1
+-#define IWL_MVM_BT_COEX_CORUNNING 1
++#define IWL_MVM_BT_COEX_CORUNNING 0
+ #define IWL_MVM_BT_COEX_MPLUT 1
+
+ #endif /* __MVM_CONSTANTS_H */
+diff --git a/drivers/net/wireless/iwlwifi/pcie/drv.c b/drivers/net/wireless/iwlwifi/pcie/drv.c
+index 98950e45c7b0..78eaa4875bd7 100644
+--- a/drivers/net/wireless/iwlwifi/pcie/drv.c
++++ b/drivers/net/wireless/iwlwifi/pcie/drv.c
+@@ -273,6 +273,8 @@ static DEFINE_PCI_DEVICE_TABLE(iwl_hw_card_ids) = {
+ {IWL_PCI_DEVICE(0x08B1, 0x4070, iwl7260_2ac_cfg)},
+ {IWL_PCI_DEVICE(0x08B1, 0x4072, iwl7260_2ac_cfg)},
+ {IWL_PCI_DEVICE(0x08B1, 0x4170, iwl7260_2ac_cfg)},
++ {IWL_PCI_DEVICE(0x08B1, 0x4C60, iwl7260_2ac_cfg)},
++ {IWL_PCI_DEVICE(0x08B1, 0x4C70, iwl7260_2ac_cfg)},
+ {IWL_PCI_DEVICE(0x08B1, 0x4060, iwl7260_2n_cfg)},
+ {IWL_PCI_DEVICE(0x08B1, 0x406A, iwl7260_2n_cfg)},
+ {IWL_PCI_DEVICE(0x08B1, 0x4160, iwl7260_2n_cfg)},
+@@ -316,6 +318,8 @@ static DEFINE_PCI_DEVICE_TABLE(iwl_hw_card_ids) = {
+ {IWL_PCI_DEVICE(0x08B1, 0xC770, iwl7260_2ac_cfg)},
+ {IWL_PCI_DEVICE(0x08B1, 0xC760, iwl7260_2n_cfg)},
+ {IWL_PCI_DEVICE(0x08B2, 0xC270, iwl7260_2ac_cfg)},
++ {IWL_PCI_DEVICE(0x08B1, 0xCC70, iwl7260_2ac_cfg)},
++ {IWL_PCI_DEVICE(0x08B1, 0xCC60, iwl7260_2ac_cfg)},
+ {IWL_PCI_DEVICE(0x08B2, 0xC272, iwl7260_2ac_cfg)},
+ {IWL_PCI_DEVICE(0x08B2, 0xC260, iwl7260_2n_cfg)},
+ {IWL_PCI_DEVICE(0x08B2, 0xC26A, iwl7260_n_cfg)},
+diff --git a/drivers/net/wireless/rt2x00/rt2800.h b/drivers/net/wireless/rt2x00/rt2800.h
+index a394a9a95919..7cf6081a05a1 100644
+--- a/drivers/net/wireless/rt2x00/rt2800.h
++++ b/drivers/net/wireless/rt2x00/rt2800.h
+@@ -2039,7 +2039,7 @@ struct mac_iveiv_entry {
+ * 2 - drop tx power by 12dBm,
+ * 3 - increase tx power by 6dBm
+ */
+-#define BBP1_TX_POWER_CTRL FIELD8(0x07)
++#define BBP1_TX_POWER_CTRL FIELD8(0x03)
+ #define BBP1_TX_ANTENNA FIELD8(0x18)
+
+ /*
+diff --git a/drivers/pci/host/pci-mvebu.c b/drivers/pci/host/pci-mvebu.c
+index ce23e0f076b6..db5abef6cec0 100644
+--- a/drivers/pci/host/pci-mvebu.c
++++ b/drivers/pci/host/pci-mvebu.c
+@@ -873,7 +873,7 @@ static int mvebu_get_tgt_attr(struct device_node *np, int devfn,
+ rangesz = pna + na + ns;
+ nranges = rlen / sizeof(__be32) / rangesz;
+
+- for (i = 0; i < nranges; i++) {
++ for (i = 0; i < nranges; i++, range += rangesz) {
+ u32 flags = of_read_number(range, 1);
+ u32 slot = of_read_number(range + 1, 1);
+ u64 cpuaddr = of_read_number(range + na, pna);
+@@ -883,14 +883,14 @@ static int mvebu_get_tgt_attr(struct device_node *np, int devfn,
+ rtype = IORESOURCE_IO;
+ else if (DT_FLAGS_TO_TYPE(flags) == DT_TYPE_MEM32)
+ rtype = IORESOURCE_MEM;
++ else
++ continue;
+
+ if (slot == PCI_SLOT(devfn) && type == rtype) {
+ *tgt = DT_CPUADDR_TO_TARGET(cpuaddr);
+ *attr = DT_CPUADDR_TO_ATTR(cpuaddr);
+ return 0;
+ }
+-
+- range += rangesz;
+ }
+
+ return -ENOENT;
+diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
+index 9ff0a901ecf7..76ef7914c9aa 100644
+--- a/drivers/pci/pci-sysfs.c
++++ b/drivers/pci/pci-sysfs.c
+@@ -177,7 +177,7 @@ static ssize_t modalias_show(struct device *dev, struct device_attribute *attr,
+ {
+ struct pci_dev *pci_dev = to_pci_dev(dev);
+
+- return sprintf(buf, "pci:v%08Xd%08Xsv%08Xsd%08Xbc%02Xsc%02Xi%02x\n",
++ return sprintf(buf, "pci:v%08Xd%08Xsv%08Xsd%08Xbc%02Xsc%02Xi%02X\n",
+ pci_dev->vendor, pci_dev->device,
+ pci_dev->subsystem_vendor, pci_dev->subsystem_device,
+ (u8)(pci_dev->class >> 16), (u8)(pci_dev->class >> 8),
+diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
+index d0f69269eb6c..cc09b14b8ac1 100644
+--- a/drivers/pci/quirks.c
++++ b/drivers/pci/quirks.c
+@@ -24,6 +24,7 @@
+ #include <linux/ioport.h>
+ #include <linux/sched.h>
+ #include <linux/ktime.h>
++#include <linux/mm.h>
+ #include <asm/dma.h> /* isa_dma_bridge_buggy */
+ #include "pci.h"
+
+@@ -287,6 +288,25 @@ static void quirk_citrine(struct pci_dev *dev)
+ }
+ DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_IBM, PCI_DEVICE_ID_IBM_CITRINE, quirk_citrine);
+
++/* On IBM Crocodile ipr SAS adapters, expand BAR to system page size */
++static void quirk_extend_bar_to_page(struct pci_dev *dev)
++{
++ int i;
++
++ for (i = 0; i < PCI_STD_RESOURCE_END; i++) {
++ struct resource *r = &dev->resource[i];
++
++ if (r->flags & IORESOURCE_MEM && resource_size(r) < PAGE_SIZE) {
++ r->end = PAGE_SIZE - 1;
++ r->start = 0;
++ r->flags |= IORESOURCE_UNSET;
++ dev_info(&dev->dev, "expanded BAR %d to page size: %pR\n",
++ i, r);
++ }
++ }
++}
++DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_IBM, 0x034a, quirk_extend_bar_to_page);
++
+ /*
+ * S3 868 and 968 chips report region size equal to 32M, but they decode 64M.
+ * If it's needed, re-allocate the region.
+diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
+index a5a63ecfb628..a70b8715a315 100644
+--- a/drivers/pci/setup-bus.c
++++ b/drivers/pci/setup-bus.c
+@@ -1652,7 +1652,7 @@ void pci_assign_unassigned_bridge_resources(struct pci_dev *bridge)
+ struct pci_dev_resource *fail_res;
+ int retval;
+ unsigned long type_mask = IORESOURCE_IO | IORESOURCE_MEM |
+- IORESOURCE_PREFETCH;
++ IORESOURCE_PREFETCH | IORESOURCE_MEM_64;
+
+ again:
+ __pci_bus_size_bridges(parent, &add_list);
+diff --git a/drivers/regulator/ltc3589.c b/drivers/regulator/ltc3589.c
+index c8105182b8b8..bef5842d0777 100644
+--- a/drivers/regulator/ltc3589.c
++++ b/drivers/regulator/ltc3589.c
+@@ -372,6 +372,7 @@ static bool ltc3589_volatile_reg(struct device *dev, unsigned int reg)
+ switch (reg) {
+ case LTC3589_IRQSTAT:
+ case LTC3589_PGSTAT:
++ case LTC3589_VCCR:
+ return true;
+ }
+ return false;
+diff --git a/drivers/rtc/rtc-cmos.c b/drivers/rtc/rtc-cmos.c
+index b0e4a3eb33c7..5b2e76159b41 100644
+--- a/drivers/rtc/rtc-cmos.c
++++ b/drivers/rtc/rtc-cmos.c
+@@ -856,7 +856,7 @@ static void __exit cmos_do_remove(struct device *dev)
+ cmos->dev = NULL;
+ }
+
+-#ifdef CONFIG_PM_SLEEP
++#ifdef CONFIG_PM
+
+ static int cmos_suspend(struct device *dev)
+ {
+@@ -907,6 +907,8 @@ static inline int cmos_poweroff(struct device *dev)
+ return cmos_suspend(dev);
+ }
+
++#ifdef CONFIG_PM_SLEEP
++
+ static int cmos_resume(struct device *dev)
+ {
+ struct cmos_rtc *cmos = dev_get_drvdata(dev);
+@@ -954,6 +956,7 @@ static int cmos_resume(struct device *dev)
+ return 0;
+ }
+
++#endif
+ #else
+
+ static inline int cmos_poweroff(struct device *dev)
+diff --git a/drivers/scsi/be2iscsi/be_mgmt.c b/drivers/scsi/be2iscsi/be_mgmt.c
+index 07934b0b9ee1..accceb57ddbc 100644
+--- a/drivers/scsi/be2iscsi/be_mgmt.c
++++ b/drivers/scsi/be2iscsi/be_mgmt.c
+@@ -944,17 +944,20 @@ mgmt_static_ip_modify(struct beiscsi_hba *phba,
+
+ if (ip_action == IP_ACTION_ADD) {
+ memcpy(req->ip_params.ip_record.ip_addr.addr, ip_param->value,
+- ip_param->len);
++ sizeof(req->ip_params.ip_record.ip_addr.addr));
+
+ if (subnet_param)
+ memcpy(req->ip_params.ip_record.ip_addr.subnet_mask,
+- subnet_param->value, subnet_param->len);
++ subnet_param->value,
++ sizeof(req->ip_params.ip_record.ip_addr.subnet_mask));
+ } else {
+ memcpy(req->ip_params.ip_record.ip_addr.addr,
+- if_info->ip_addr.addr, ip_param->len);
++ if_info->ip_addr.addr,
++ sizeof(req->ip_params.ip_record.ip_addr.addr));
+
+ memcpy(req->ip_params.ip_record.ip_addr.subnet_mask,
+- if_info->ip_addr.subnet_mask, ip_param->len);
++ if_info->ip_addr.subnet_mask,
++ sizeof(req->ip_params.ip_record.ip_addr.subnet_mask));
+ }
+
+ rc = mgmt_exec_nonemb_cmd(phba, &nonemb_cmd, NULL, 0);
+@@ -982,7 +985,7 @@ static int mgmt_modify_gateway(struct beiscsi_hba *phba, uint8_t *gt_addr,
+ req->action = gtway_action;
+ req->ip_addr.ip_type = BE2_IPV4;
+
+- memcpy(req->ip_addr.addr, gt_addr, param_len);
++ memcpy(req->ip_addr.addr, gt_addr, sizeof(req->ip_addr.addr));
+
+ return mgmt_exec_nonemb_cmd(phba, &nonemb_cmd, NULL, 0);
+ }
+diff --git a/drivers/scsi/qla2xxx/qla_os.c b/drivers/scsi/qla2xxx/qla_os.c
+index d96bfb55e57b..5072251cdb8b 100644
+--- a/drivers/scsi/qla2xxx/qla_os.c
++++ b/drivers/scsi/qla2xxx/qla_os.c
+@@ -3111,10 +3111,8 @@ qla2x00_unmap_iobases(struct qla_hw_data *ha)
+ }
+
+ static void
+-qla2x00_clear_drv_active(scsi_qla_host_t *vha)
++qla2x00_clear_drv_active(struct qla_hw_data *ha)
+ {
+- struct qla_hw_data *ha = vha->hw;
+-
+ if (IS_QLA8044(ha)) {
+ qla8044_idc_lock(ha);
+ qla8044_clear_drv_active(ha);
+@@ -3185,7 +3183,7 @@ qla2x00_remove_one(struct pci_dev *pdev)
+
+ scsi_host_put(base_vha->host);
+
+- qla2x00_clear_drv_active(base_vha);
++ qla2x00_clear_drv_active(ha);
+
+ qla2x00_unmap_iobases(ha);
+
+diff --git a/drivers/scsi/qla2xxx/qla_target.c b/drivers/scsi/qla2xxx/qla_target.c
+index e632e14180cf..bcc449a0c3a7 100644
+--- a/drivers/scsi/qla2xxx/qla_target.c
++++ b/drivers/scsi/qla2xxx/qla_target.c
+@@ -1431,12 +1431,10 @@ static inline void qlt_unmap_sg(struct scsi_qla_host *vha,
+ static int qlt_check_reserve_free_req(struct scsi_qla_host *vha,
+ uint32_t req_cnt)
+ {
+- struct qla_hw_data *ha = vha->hw;
+- device_reg_t __iomem *reg = ha->iobase;
+ uint32_t cnt;
+
+ if (vha->req->cnt < (req_cnt + 2)) {
+-		cnt = (uint16_t)RD_REG_DWORD(&reg->isp24.req_q_out);
++ cnt = (uint16_t)RD_REG_DWORD(vha->req->req_q_out);
+
+ ql_dbg(ql_dbg_tgt, vha, 0xe00a,
+ "Request ring circled: cnt=%d, vha->->ring_index=%d, "
+@@ -3277,6 +3275,7 @@ static int qlt_handle_cmd_for_atio(struct scsi_qla_host *vha,
+ return -ENOMEM;
+
+ memcpy(&op->atio, atio, sizeof(*atio));
++ op->vha = vha;
+ INIT_WORK(&op->work, qlt_create_sess_from_atio);
+ queue_work(qla_tgt_wq, &op->work);
+ return 0;
+diff --git a/drivers/spi/spi-dw-mid.c b/drivers/spi/spi-dw-mid.c
+index 6d207afec8cb..a4c45ea8f688 100644
+--- a/drivers/spi/spi-dw-mid.c
++++ b/drivers/spi/spi-dw-mid.c
+@@ -89,7 +89,13 @@ err_exit:
+
+ static void mid_spi_dma_exit(struct dw_spi *dws)
+ {
++ if (!dws->dma_inited)
++ return;
++
++ dmaengine_terminate_all(dws->txchan);
+ dma_release_channel(dws->txchan);
++
++ dmaengine_terminate_all(dws->rxchan);
+ dma_release_channel(dws->rxchan);
+ }
+
+@@ -136,7 +142,7 @@ static int mid_spi_dma_transfer(struct dw_spi *dws, int cs_change)
+ txconf.dst_addr = dws->dma_addr;
+ txconf.dst_maxburst = LNW_DMA_MSIZE_16;
+ txconf.src_addr_width = DMA_SLAVE_BUSWIDTH_4_BYTES;
+- txconf.dst_addr_width = DMA_SLAVE_BUSWIDTH_2_BYTES;
++ txconf.dst_addr_width = dws->dma_width;
+ txconf.device_fc = false;
+
+ txchan->device->device_control(txchan, DMA_SLAVE_CONFIG,
+@@ -159,7 +165,7 @@ static int mid_spi_dma_transfer(struct dw_spi *dws, int cs_change)
+ rxconf.src_addr = dws->dma_addr;
+ rxconf.src_maxburst = LNW_DMA_MSIZE_16;
+ rxconf.dst_addr_width = DMA_SLAVE_BUSWIDTH_4_BYTES;
+- rxconf.src_addr_width = DMA_SLAVE_BUSWIDTH_2_BYTES;
++ rxconf.src_addr_width = dws->dma_width;
+ rxconf.device_fc = false;
+
+ rxchan->device->device_control(rxchan, DMA_SLAVE_CONFIG,
+diff --git a/drivers/tty/serial/omap-serial.c b/drivers/tty/serial/omap-serial.c
+index d017cec8a34a..e454b7c2ecd9 100644
+--- a/drivers/tty/serial/omap-serial.c
++++ b/drivers/tty/serial/omap-serial.c
+@@ -254,8 +254,16 @@ serial_omap_baud_is_mode16(struct uart_port *port, unsigned int baud)
+ {
+ unsigned int n13 = port->uartclk / (13 * baud);
+ unsigned int n16 = port->uartclk / (16 * baud);
+- int baudAbsDiff13 = baud - (port->uartclk / (13 * n13));
+- int baudAbsDiff16 = baud - (port->uartclk / (16 * n16));
++ int baudAbsDiff13;
++ int baudAbsDiff16;
++
++ if (n13 == 0)
++ n13 = 1;
++ if (n16 == 0)
++ n16 = 1;
++
++ baudAbsDiff13 = baud - (port->uartclk / (13 * n13));
++ baudAbsDiff16 = baud - (port->uartclk / (16 * n16));
+ if (baudAbsDiff13 < 0)
+ baudAbsDiff13 = -baudAbsDiff13;
+ if (baudAbsDiff16 < 0)
+diff --git a/drivers/usb/gadget/Kconfig b/drivers/usb/gadget/Kconfig
+index ba18e9c110cc..77ad6a944129 100644
+--- a/drivers/usb/gadget/Kconfig
++++ b/drivers/usb/gadget/Kconfig
+@@ -438,7 +438,7 @@ config USB_GOKU
+ gadget drivers to also be dynamically linked.
+
+ config USB_EG20T
+- tristate "Intel EG20T PCH/LAPIS Semiconductor IOH(ML7213/ML7831) UDC"
++ tristate "Intel QUARK X1000/EG20T PCH/LAPIS Semiconductor IOH(ML7213/ML7831) UDC"
+ depends on PCI
+ help
+ This is a USB device driver for EG20T PCH.
+@@ -459,6 +459,7 @@ config USB_EG20T
+ ML7213/ML7831 is companion chip for Intel Atom E6xx series.
+ ML7213/ML7831 is completely compatible for Intel EG20T PCH.
+
++ This driver can be used with Intel's Quark X1000 SOC platform
+ #
+ # LAST -- dummy/emulated controller
+ #
+diff --git a/drivers/usb/gadget/pch_udc.c b/drivers/usb/gadget/pch_udc.c
+index eb8c3bedb57a..460d953c91b6 100644
+--- a/drivers/usb/gadget/pch_udc.c
++++ b/drivers/usb/gadget/pch_udc.c
+@@ -343,6 +343,7 @@ struct pch_vbus_gpio_data {
+ * @setup_data: Received setup data
+ * @phys_addr: of device memory
+ * @base_addr: for mapped device memory
++ * @bar: Indicates which PCI BAR for USB regs
+ * @irq: IRQ line for the device
+ * @cfg_data: current cfg, intf, and alt in use
+ * @vbus_gpio: GPIO informaton for detecting VBUS
+@@ -370,14 +371,17 @@ struct pch_udc_dev {
+ struct usb_ctrlrequest setup_data;
+ unsigned long phys_addr;
+ void __iomem *base_addr;
++ unsigned bar;
+ unsigned irq;
+ struct pch_udc_cfg_data cfg_data;
+ struct pch_vbus_gpio_data vbus_gpio;
+ };
+ #define to_pch_udc(g) (container_of((g), struct pch_udc_dev, gadget))
+
++#define PCH_UDC_PCI_BAR_QUARK_X1000 0
+ #define PCH_UDC_PCI_BAR 1
+ #define PCI_DEVICE_ID_INTEL_EG20T_UDC 0x8808
++#define PCI_DEVICE_ID_INTEL_QUARK_X1000_UDC 0x0939
+ #define PCI_VENDOR_ID_ROHM 0x10DB
+ #define PCI_DEVICE_ID_ML7213_IOH_UDC 0x801D
+ #define PCI_DEVICE_ID_ML7831_IOH_UDC 0x8808
+@@ -3076,7 +3080,7 @@ static void pch_udc_remove(struct pci_dev *pdev)
+ iounmap(dev->base_addr);
+ if (dev->mem_region)
+ release_mem_region(dev->phys_addr,
+- pci_resource_len(pdev, PCH_UDC_PCI_BAR));
++ pci_resource_len(pdev, dev->bar));
+ if (dev->active)
+ pci_disable_device(pdev);
+ kfree(dev);
+@@ -3144,9 +3148,15 @@ static int pch_udc_probe(struct pci_dev *pdev,
+ dev->active = 1;
+ pci_set_drvdata(pdev, dev);
+
++ /* Determine BAR based on PCI ID */
++ if (id->device == PCI_DEVICE_ID_INTEL_QUARK_X1000_UDC)
++ dev->bar = PCH_UDC_PCI_BAR_QUARK_X1000;
++ else
++ dev->bar = PCH_UDC_PCI_BAR;
++
+ /* PCI resource allocation */
+- resource = pci_resource_start(pdev, 1);
+- len = pci_resource_len(pdev, 1);
++ resource = pci_resource_start(pdev, dev->bar);
++ len = pci_resource_len(pdev, dev->bar);
+
+ if (!request_mem_region(resource, len, KBUILD_MODNAME)) {
+ dev_err(&pdev->dev, "%s: pci device used already\n", __func__);
+@@ -3212,6 +3222,12 @@ finished:
+
+ static const struct pci_device_id pch_udc_pcidev_id[] = {
+ {
++ PCI_DEVICE(PCI_VENDOR_ID_INTEL,
++ PCI_DEVICE_ID_INTEL_QUARK_X1000_UDC),
++ .class = (PCI_CLASS_SERIAL_USB << 8) | 0xfe,
++ .class_mask = 0xffffffff,
++ },
++ {
+ PCI_DEVICE(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_EG20T_UDC),
+ .class = (PCI_CLASS_SERIAL_USB << 8) | 0xfe,
+ .class_mask = 0xffffffff,
+diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
+index eea26e1b2fda..d738ff8ab81c 100644
+--- a/fs/btrfs/dev-replace.c
++++ b/fs/btrfs/dev-replace.c
+@@ -567,6 +567,8 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
+ btrfs_kobj_rm_device(fs_info, src_device);
+ btrfs_kobj_add_device(fs_info, tgt_device);
+
++ btrfs_dev_replace_unlock(dev_replace);
++
+ btrfs_rm_dev_replace_blocked(fs_info);
+
+ btrfs_rm_dev_replace_srcdev(fs_info, src_device);
+@@ -580,7 +582,6 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
+ * superblock is scratched out so that it is no longer marked to
+ * belong to this filesystem.
+ */
+- btrfs_dev_replace_unlock(dev_replace);
+ mutex_unlock(&root->fs_info->fs_devices->device_list_mutex);
+ mutex_unlock(&root->fs_info->chunk_mutex);
+
+diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
+index 8edb9fcc38d5..feff017a47d9 100644
+--- a/fs/btrfs/extent-tree.c
++++ b/fs/btrfs/extent-tree.c
+@@ -4508,7 +4508,13 @@ again:
+ space_info->flush = 1;
+ } else if (!ret && space_info->flags & BTRFS_BLOCK_GROUP_METADATA) {
+ used += orig_bytes;
+- if (need_do_async_reclaim(space_info, root->fs_info, used) &&
++ /*
++ * We will do the space reservation dance during log replay,
++ * which means we won't have fs_info->fs_root set, so don't do
++ * the async reclaim as we will panic.
++ */
++ if (!root->fs_info->log_root_recovering &&
++ need_do_async_reclaim(space_info, root->fs_info, used) &&
+ !work_busy(&root->fs_info->async_reclaim_work))
+ queue_work(system_unbound_wq,
+ &root->fs_info->async_reclaim_work);
+diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
+index ab1fd668020d..2a15294f1683 100644
+--- a/fs/btrfs/file.c
++++ b/fs/btrfs/file.c
+@@ -2622,23 +2622,28 @@ static int find_desired_extent(struct inode *inode, loff_t *offset, int whence)
+ struct btrfs_root *root = BTRFS_I(inode)->root;
+ struct extent_map *em = NULL;
+ struct extent_state *cached_state = NULL;
+- u64 lockstart = *offset;
+- u64 lockend = i_size_read(inode);
+- u64 start = *offset;
+- u64 len = i_size_read(inode);
++ u64 lockstart;
++ u64 lockend;
++ u64 start;
++ u64 len;
+ int ret = 0;
+
+- lockend = max_t(u64, root->sectorsize, lockend);
++ if (inode->i_size == 0)
++ return -ENXIO;
++
++ /*
++ * *offset can be negative, in this case we start finding DATA/HOLE from
++ * the very start of the file.
++ */
++ start = max_t(loff_t, 0, *offset);
++
++ lockstart = round_down(start, root->sectorsize);
++ lockend = round_up(i_size_read(inode), root->sectorsize);
+ if (lockend <= lockstart)
+ lockend = lockstart + root->sectorsize;
+-
+ lockend--;
+ len = lockend - lockstart + 1;
+
+- len = max_t(u64, len, root->sectorsize);
+- if (inode->i_size == 0)
+- return -ENXIO;
+-
+ lock_extent_bits(&BTRFS_I(inode)->io_tree, lockstart, lockend, 0,
+ &cached_state);
+
+diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
+index c6cd34e699d0..a3a8dee8030f 100644
+--- a/fs/btrfs/inode.c
++++ b/fs/btrfs/inode.c
+@@ -3656,7 +3656,8 @@ noinline int btrfs_update_inode(struct btrfs_trans_handle *trans,
+ * without delay
+ */
+ if (!btrfs_is_free_space_inode(inode)
+- && root->root_key.objectid != BTRFS_DATA_RELOC_TREE_OBJECTID) {
++ && root->root_key.objectid != BTRFS_DATA_RELOC_TREE_OBJECTID
++ && !root->fs_info->log_root_recovering) {
+ btrfs_update_root_times(trans, root);
+
+ ret = btrfs_delayed_update_inode(trans, root, inode);
+diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
+index 47aceb494d1d..4e395f3f251d 100644
+--- a/fs/btrfs/ioctl.c
++++ b/fs/btrfs/ioctl.c
+@@ -332,6 +332,9 @@ static int btrfs_ioctl_setflags(struct file *file, void __user *arg)
+ goto out_drop;
+
+ } else {
++ ret = btrfs_set_prop(inode, "btrfs.compression", NULL, 0, 0);
++ if (ret && ret != -ENODATA)
++ goto out_drop;
+ ip->flags &= ~(BTRFS_INODE_COMPRESS | BTRFS_INODE_NOCOMPRESS);
+ }
+
+@@ -5309,6 +5312,12 @@ long btrfs_ioctl(struct file *file, unsigned int
+ if (ret)
+ return ret;
+ ret = btrfs_sync_fs(file->f_dentry->d_sb, 1);
++ /*
++ * The transaction thread may want to do more work,
++		 * namely it pokes the cleaner kthread that will start
++ * processing uncleaned subvols.
++ */
++ wake_up_process(root->fs_info->transaction_kthread);
+ return ret;
+ }
+ case BTRFS_IOC_START_SYNC:
+diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
+index 65245a07275b..56fe6ec409ac 100644
+--- a/fs/btrfs/relocation.c
++++ b/fs/btrfs/relocation.c
+@@ -736,7 +736,8 @@ again:
+ err = ret;
+ goto out;
+ }
+- BUG_ON(!ret || !path1->slots[0]);
++ ASSERT(ret);
++ ASSERT(path1->slots[0]);
+
+ path1->slots[0]--;
+
+@@ -746,10 +747,10 @@ again:
+ * the backref was added previously when processing
+ * backref of type BTRFS_TREE_BLOCK_REF_KEY
+ */
+- BUG_ON(!list_is_singular(&cur->upper));
++ ASSERT(list_is_singular(&cur->upper));
+ edge = list_entry(cur->upper.next, struct backref_edge,
+ list[LOWER]);
+- BUG_ON(!list_empty(&edge->list[UPPER]));
++ ASSERT(list_empty(&edge->list[UPPER]));
+ exist = edge->node[UPPER];
+ /*
+ * add the upper level block to pending list if we need
+@@ -831,7 +832,7 @@ again:
+ cur->cowonly = 1;
+ }
+ #else
+- BUG_ON(key.type == BTRFS_EXTENT_REF_V0_KEY);
++ ASSERT(key.type != BTRFS_EXTENT_REF_V0_KEY);
+ if (key.type == BTRFS_SHARED_BLOCK_REF_KEY) {
+ #endif
+ if (key.objectid == key.offset) {
+@@ -840,7 +841,7 @@ again:
+ * backref of this type.
+ */
+ root = find_reloc_root(rc, cur->bytenr);
+- BUG_ON(!root);
++ ASSERT(root);
+ cur->root = root;
+ break;
+ }
+@@ -868,7 +869,7 @@ again:
+ } else {
+ upper = rb_entry(rb_node, struct backref_node,
+ rb_node);
+- BUG_ON(!upper->checked);
++ ASSERT(upper->checked);
+ INIT_LIST_HEAD(&edge->list[UPPER]);
+ }
+ list_add_tail(&edge->list[LOWER], &cur->upper);
+@@ -892,7 +893,7 @@ again:
+
+ if (btrfs_root_level(&root->root_item) == cur->level) {
+ /* tree root */
+- BUG_ON(btrfs_root_bytenr(&root->root_item) !=
++ ASSERT(btrfs_root_bytenr(&root->root_item) ==
+ cur->bytenr);
+ if (should_ignore_root(root))
+ list_add(&cur->list, &useless);
+@@ -927,7 +928,7 @@ again:
+ need_check = true;
+ for (; level < BTRFS_MAX_LEVEL; level++) {
+ if (!path2->nodes[level]) {
+- BUG_ON(btrfs_root_bytenr(&root->root_item) !=
++ ASSERT(btrfs_root_bytenr(&root->root_item) ==
+ lower->bytenr);
+ if (should_ignore_root(root))
+ list_add(&lower->list, &useless);
+@@ -977,12 +978,15 @@ again:
+ need_check = false;
+ list_add_tail(&edge->list[UPPER],
+ &list);
+- } else
++ } else {
++ if (upper->checked)
++ need_check = true;
+ INIT_LIST_HEAD(&edge->list[UPPER]);
++ }
+ } else {
+ upper = rb_entry(rb_node, struct backref_node,
+ rb_node);
+- BUG_ON(!upper->checked);
++ ASSERT(upper->checked);
+ INIT_LIST_HEAD(&edge->list[UPPER]);
+ if (!upper->owner)
+ upper->owner = btrfs_header_owner(eb);
+@@ -1026,7 +1030,7 @@ next:
+ * everything goes well, connect backref nodes and insert backref nodes
+ * into the cache.
+ */
+- BUG_ON(!node->checked);
++ ASSERT(node->checked);
+ cowonly = node->cowonly;
+ if (!cowonly) {
+ rb_node = tree_insert(&cache->rb_root, node->bytenr,
+@@ -1062,8 +1066,21 @@ next:
+ continue;
+ }
+
+- BUG_ON(!upper->checked);
+- BUG_ON(cowonly != upper->cowonly);
++ if (!upper->checked) {
++ /*
++ * Still want to blow up for developers since this is a
++ * logic bug.
++ */
++ ASSERT(0);
++ err = -EINVAL;
++ goto out;
++ }
++ if (cowonly != upper->cowonly) {
++ ASSERT(0);
++ err = -EINVAL;
++ goto out;
++ }
++
+ if (!cowonly) {
+ rb_node = tree_insert(&cache->rb_root, upper->bytenr,
+ &upper->rb_node);
+@@ -1086,7 +1103,7 @@ next:
+ while (!list_empty(&useless)) {
+ upper = list_entry(useless.next, struct backref_node, list);
+ list_del_init(&upper->list);
+- BUG_ON(!list_empty(&upper->upper));
++ ASSERT(list_empty(&upper->upper));
+ if (upper == node)
+ node = NULL;
+ if (upper->lowest) {
+@@ -1119,29 +1136,45 @@ out:
+ if (err) {
+ while (!list_empty(&useless)) {
+ lower = list_entry(useless.next,
+- struct backref_node, upper);
+- list_del_init(&lower->upper);
++ struct backref_node, list);
++ list_del_init(&lower->list);
+ }
+- upper = node;
+- INIT_LIST_HEAD(&list);
+- while (upper) {
+- if (RB_EMPTY_NODE(&upper->rb_node)) {
+- list_splice_tail(&upper->upper, &list);
+- free_backref_node(cache, upper);
+- }
+-
+- if (list_empty(&list))
+- break;
+-
+- edge = list_entry(list.next, struct backref_edge,
+- list[LOWER]);
++ while (!list_empty(&list)) {
++ edge = list_first_entry(&list, struct backref_edge,
++ list[UPPER]);
++ list_del(&edge->list[UPPER]);
+ list_del(&edge->list[LOWER]);
++ lower = edge->node[LOWER];
+ upper = edge->node[UPPER];
+ free_backref_edge(cache, edge);
++
++ /*
++ * Lower is no longer linked to any upper backref nodes
++ * and isn't in the cache, we can free it ourselves.
++ */
++ if (list_empty(&lower->upper) &&
++ RB_EMPTY_NODE(&lower->rb_node))
++ list_add(&lower->list, &useless);
++
++ if (!RB_EMPTY_NODE(&upper->rb_node))
++ continue;
++
++			/* Add this guy's upper edges to the list to process */
++ list_for_each_entry(edge, &upper->upper, list[LOWER])
++ list_add_tail(&edge->list[UPPER], &list);
++ if (list_empty(&upper->upper))
++ list_add(&upper->list, &useless);
++ }
++
++ while (!list_empty(&useless)) {
++ lower = list_entry(useless.next,
++ struct backref_node, list);
++ list_del_init(&lower->list);
++ free_backref_node(cache, lower);
+ }
+ return ERR_PTR(err);
+ }
+- BUG_ON(node && node->detached);
++ ASSERT(!node || !node->detached);
+ return node;
+ }
+
+diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
+index d89c6d3542ca..98a25df1c430 100644
+--- a/fs/btrfs/transaction.c
++++ b/fs/btrfs/transaction.c
+@@ -609,7 +609,6 @@ int btrfs_wait_for_commit(struct btrfs_root *root, u64 transid)
+ if (transid <= root->fs_info->last_trans_committed)
+ goto out;
+
+- ret = -EINVAL;
+ /* find specified transaction */
+ spin_lock(&root->fs_info->trans_lock);
+ list_for_each_entry(t, &root->fs_info->trans_list, list) {
+@@ -625,9 +624,16 @@ int btrfs_wait_for_commit(struct btrfs_root *root, u64 transid)
+ }
+ }
+ spin_unlock(&root->fs_info->trans_lock);
+- /* The specified transaction doesn't exist */
+- if (!cur_trans)
++
++ /*
++ * The specified transaction doesn't exist, or we
++ * raced with btrfs_commit_transaction
++ */
++ if (!cur_trans) {
++ if (transid > root->fs_info->last_trans_committed)
++ ret = -EINVAL;
+ goto out;
++ }
+ } else {
+ /* find newest transaction that is committing | committed */
+ spin_lock(&root->fs_info->trans_lock);
+diff --git a/fs/ecryptfs/inode.c b/fs/ecryptfs/inode.c
+index d4a9431ec73c..57ee4c53b4f8 100644
+--- a/fs/ecryptfs/inode.c
++++ b/fs/ecryptfs/inode.c
+@@ -1039,7 +1039,7 @@ ecryptfs_setxattr(struct dentry *dentry, const char *name, const void *value,
+ }
+
+ rc = vfs_setxattr(lower_dentry, name, value, size, flags);
+- if (!rc)
++ if (!rc && dentry->d_inode)
+ fsstack_copy_attr_all(dentry->d_inode, lower_dentry->d_inode);
+ out:
+ return rc;
+diff --git a/fs/namespace.c b/fs/namespace.c
+index 140d17705683..e544a0680a7c 100644
+--- a/fs/namespace.c
++++ b/fs/namespace.c
+@@ -1374,6 +1374,8 @@ static int do_umount(struct mount *mnt, int flags)
+ * Special case for "unmounting" root ...
+ * we just try to remount it readonly.
+ */
++ if (!capable(CAP_SYS_ADMIN))
++ return -EPERM;
+ down_write(&sb->s_umount);
+ if (!(sb->s_flags & MS_RDONLY))
+ retval = do_remount_sb(sb, MS_RDONLY, NULL, 0);
+diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
+index 3275e94538e7..43fd8c557fe9 100644
+--- a/fs/nfs/nfs4proc.c
++++ b/fs/nfs/nfs4proc.c
+@@ -7242,7 +7242,7 @@ static int nfs41_proc_async_sequence(struct nfs_client *clp, struct rpc_cred *cr
+ int ret = 0;
+
+ if ((renew_flags & NFS4_RENEW_TIMEOUT) == 0)
+- return 0;
++ return -EAGAIN;
+ task = _nfs41_proc_sequence(clp, cred, false);
+ if (IS_ERR(task))
+ ret = PTR_ERR(task);
+diff --git a/fs/nfs/nfs4renewd.c b/fs/nfs/nfs4renewd.c
+index 1720d32ffa54..e1ba58c3d1ad 100644
+--- a/fs/nfs/nfs4renewd.c
++++ b/fs/nfs/nfs4renewd.c
+@@ -88,10 +88,18 @@ nfs4_renew_state(struct work_struct *work)
+ }
+ nfs_expire_all_delegations(clp);
+ } else {
++ int ret;
++
+ /* Queue an asynchronous RENEW. */
+- ops->sched_state_renewal(clp, cred, renew_flags);
++ ret = ops->sched_state_renewal(clp, cred, renew_flags);
+ put_rpccred(cred);
+- goto out_exp;
++ switch (ret) {
++ default:
++ goto out_exp;
++ case -EAGAIN:
++ case -ENOMEM:
++ break;
++ }
+ }
+ } else {
+ dprintk("%s: failed to call renewd. Reason: lease not expired \n",
+diff --git a/fs/nfs/nfs4state.c b/fs/nfs/nfs4state.c
+index 848f6853c59e..db7792c30462 100644
+--- a/fs/nfs/nfs4state.c
++++ b/fs/nfs/nfs4state.c
+@@ -1732,7 +1732,8 @@ restart:
+ if (status < 0) {
+ set_bit(ops->owner_flag_bit, &sp->so_flags);
+ nfs4_put_state_owner(sp);
+- return nfs4_recovery_handle_error(clp, status);
++ status = nfs4_recovery_handle_error(clp, status);
++ return (status != 0) ? status : -EAGAIN;
+ }
+
+ nfs4_put_state_owner(sp);
+@@ -1741,7 +1742,7 @@ restart:
+ spin_unlock(&clp->cl_lock);
+ }
+ rcu_read_unlock();
+- return status;
++ return 0;
+ }
+
+ static int nfs4_check_lease(struct nfs_client *clp)
+@@ -1788,7 +1789,6 @@ static int nfs4_handle_reclaim_lease_error(struct nfs_client *clp, int status)
+ break;
+ case -NFS4ERR_STALE_CLIENTID:
+ clear_bit(NFS4CLNT_LEASE_CONFIRM, &clp->cl_state);
+- nfs4_state_clear_reclaim_reboot(clp);
+ nfs4_state_start_reclaim_reboot(clp);
+ break;
+ case -NFS4ERR_CLID_INUSE:
+@@ -2372,6 +2372,7 @@ static void nfs4_state_manager(struct nfs_client *clp)
+ status = nfs4_check_lease(clp);
+ if (status < 0)
+ goto out_error;
++ continue;
+ }
+
+ if (test_and_clear_bit(NFS4CLNT_MOVED, &clp->cl_state)) {
+@@ -2393,14 +2394,11 @@ static void nfs4_state_manager(struct nfs_client *clp)
+ section = "reclaim reboot";
+ status = nfs4_do_reclaim(clp,
+ clp->cl_mvops->reboot_recovery_ops);
+- if (test_bit(NFS4CLNT_LEASE_EXPIRED, &clp->cl_state) ||
+- test_bit(NFS4CLNT_SESSION_RESET, &clp->cl_state))
+- continue;
+- nfs4_state_end_reclaim_reboot(clp);
+- if (test_bit(NFS4CLNT_RECLAIM_NOGRACE, &clp->cl_state))
++ if (status == -EAGAIN)
+ continue;
+ if (status < 0)
+ goto out_error;
++ nfs4_state_end_reclaim_reboot(clp);
+ }
+
+ /* Now recover expired state... */
+@@ -2408,9 +2406,7 @@ static void nfs4_state_manager(struct nfs_client *clp)
+ section = "reclaim nograce";
+ status = nfs4_do_reclaim(clp,
+ clp->cl_mvops->nograce_recovery_ops);
+- if (test_bit(NFS4CLNT_LEASE_EXPIRED, &clp->cl_state) ||
+- test_bit(NFS4CLNT_SESSION_RESET, &clp->cl_state) ||
+- test_bit(NFS4CLNT_RECLAIM_REBOOT, &clp->cl_state))
++ if (status == -EAGAIN)
+ continue;
+ if (status < 0)
+ goto out_error;
+diff --git a/fs/nfs/pagelist.c b/fs/nfs/pagelist.c
+index 34136ff5abf0..3a9c34a0f898 100644
+--- a/fs/nfs/pagelist.c
++++ b/fs/nfs/pagelist.c
+@@ -527,7 +527,8 @@ EXPORT_SYMBOL_GPL(nfs_pgio_header_free);
+ */
+ void nfs_pgio_data_destroy(struct nfs_pgio_header *hdr)
+ {
+- put_nfs_open_context(hdr->args.context);
++ if (hdr->args.context)
++ put_nfs_open_context(hdr->args.context);
+ if (hdr->page_array.pagevec != hdr->page_array.page_array)
+ kfree(hdr->page_array.pagevec);
+ }
+@@ -753,12 +754,11 @@ int nfs_generic_pgio(struct nfs_pageio_descriptor *desc,
+ nfs_list_remove_request(req);
+ nfs_list_add_request(req, &hdr->pages);
+
+- if (WARN_ON_ONCE(pageused >= pagecount))
+- return nfs_pgio_error(desc, hdr);
+-
+ if (!last_page || last_page != req->wb_page) {
+- *pages++ = last_page = req->wb_page;
+ pageused++;
++ if (pageused > pagecount)
++ break;
++ *pages++ = last_page = req->wb_page;
+ }
+ }
+ if (WARN_ON_ONCE(pageused != pagecount))
+diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
+index 1d5103dfc203..96338175a2fe 100644
+--- a/fs/nfsd/nfs4xdr.c
++++ b/fs/nfsd/nfs4xdr.c
+@@ -1675,6 +1675,14 @@ nfsd4_decode_compound(struct nfsd4_compoundargs *argp)
+ readbytes += nfsd4_max_reply(argp->rqstp, op);
+ } else
+ max_reply += nfsd4_max_reply(argp->rqstp, op);
++ /*
++ * OP_LOCK may return a conflicting lock. (Special case
++ * because it will just skip encoding this if it runs
++ * out of xdr buffer space, and it is the only operation
++ * that behaves this way.)
++ */
++ if (op->opnum == OP_LOCK)
++ max_reply += NFS4_OPAQUE_LIMIT;
+
+ if (op->status) {
+ argp->opcnt = i+1;
+diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
+index 2685bc9ea2c9..ec50a8385b13 100644
+--- a/fs/notify/fanotify/fanotify_user.c
++++ b/fs/notify/fanotify/fanotify_user.c
+@@ -78,7 +78,7 @@ static int create_fd(struct fsnotify_group *group,
+
+ pr_debug("%s: group=%p event=%p\n", __func__, group, event);
+
+- client_fd = get_unused_fd();
++ client_fd = get_unused_fd_flags(group->fanotify_data.f_flags);
+ if (client_fd < 0)
+ return client_fd;
+
+diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
+index 02614349690d..4ff074bc2a7d 100644
+--- a/fs/xfs/xfs_aops.c
++++ b/fs/xfs/xfs_aops.c
+@@ -434,10 +434,22 @@ xfs_start_page_writeback(
+ {
+ ASSERT(PageLocked(page));
+ ASSERT(!PageWriteback(page));
+- if (clear_dirty)
++
++ /*
++ * if the page was not fully cleaned, we need to ensure that the higher
++ * layers come back to it correctly. That means we need to keep the page
++ * dirty, and for WB_SYNC_ALL writeback we need to ensure the
++ * PAGECACHE_TAG_TOWRITE index mark is not removed so another attempt to
++ * write this page in this writeback sweep will be made.
++ */
++ if (clear_dirty) {
+ clear_page_dirty_for_io(page);
+- set_page_writeback(page);
++ set_page_writeback(page);
++ } else
++ set_page_writeback_keepwrite(page);
++
+ unlock_page(page);
++
+ /* If no buffers on the page are to be written, finish it here */
+ if (!buffers)
+ end_page_writeback(page);
+diff --git a/include/linux/compiler-gcc5.h b/include/linux/compiler-gcc5.h
+new file mode 100644
+index 000000000000..cdd1cc202d51
+--- /dev/null
++++ b/include/linux/compiler-gcc5.h
+@@ -0,0 +1,66 @@
++#ifndef __LINUX_COMPILER_H
++#error "Please don't include <linux/compiler-gcc5.h> directly, include <linux/compiler.h> instead."
++#endif
++
++#define __used __attribute__((__used__))
++#define __must_check __attribute__((warn_unused_result))
++#define __compiler_offsetof(a, b) __builtin_offsetof(a, b)
++
++/* Mark functions as cold. gcc will assume any path leading to a call
++ to them will be unlikely. This means a lot of manual unlikely()s
++ are unnecessary now for any paths leading to the usual suspects
++ like BUG(), printk(), panic() etc. [but let's keep them for now for
++ older compilers]
++
++ Early snapshots of gcc 4.3 don't support this and we can't detect this
++ in the preprocessor, but we can live with this because they're unreleased.
++ Maketime probing would be overkill here.
++
++ gcc also has a __attribute__((__hot__)) to move hot functions into
++ a special section, but I don't see any sense in this right now in
++ the kernel context */
++#define __cold __attribute__((__cold__))
++
++#define __UNIQUE_ID(prefix) __PASTE(__PASTE(__UNIQUE_ID_, prefix), __COUNTER__)
++
++#ifndef __CHECKER__
++# define __compiletime_warning(message) __attribute__((warning(message)))
++# define __compiletime_error(message) __attribute__((error(message)))
++#endif /* __CHECKER__ */
++
++/*
++ * Mark a position in code as unreachable. This can be used to
++ * suppress control flow warnings after asm blocks that transfer
++ * control elsewhere.
++ *
++ * Early snapshots of gcc 4.5 don't support this and we can't detect
++ * this in the preprocessor, but we can live with this because they're
++ * unreleased. Really, we need to have autoconf for the kernel.
++ */
++#define unreachable() __builtin_unreachable()
++
++/* Mark a function definition as prohibited from being cloned. */
++#define __noclone __attribute__((__noclone__))
++
++/*
++ * Tell the optimizer that something else uses this function or variable.
++ */
++#define __visible __attribute__((externally_visible))
++
++/*
++ * GCC 'asm goto' miscompiles certain code sequences:
++ *
++ * http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58670
++ *
++ * Work it around via a compiler barrier quirk suggested by Jakub Jelinek.
++ * Fixed in GCC 4.8.2 and later versions.
++ *
++ * (asm goto is automatically volatile - the naming reflects this.)
++ */
++#define asm_volatile_goto(x...) do { asm goto(x); asm (""); } while (0)
++
++#ifdef CONFIG_ARCH_USE_BUILTIN_BSWAP
++#define __HAVE_BUILTIN_BSWAP32__
++#define __HAVE_BUILTIN_BSWAP64__
++#define __HAVE_BUILTIN_BSWAP16__
++#endif /* CONFIG_ARCH_USE_BUILTIN_BSWAP */
+diff --git a/include/linux/pci_ids.h b/include/linux/pci_ids.h
+index 7fa31731c854..83a76633c03e 100644
+--- a/include/linux/pci_ids.h
++++ b/include/linux/pci_ids.h
+@@ -2555,6 +2555,7 @@
+ #define PCI_DEVICE_ID_INTEL_MFD_EMMC0 0x0823
+ #define PCI_DEVICE_ID_INTEL_MFD_EMMC1 0x0824
+ #define PCI_DEVICE_ID_INTEL_MRST_SD2 0x084F
++#define PCI_DEVICE_ID_INTEL_QUARK_X1000_ILB 0x095E
+ #define PCI_DEVICE_ID_INTEL_I960 0x0960
+ #define PCI_DEVICE_ID_INTEL_I960RM 0x0962
+ #define PCI_DEVICE_ID_INTEL_CENTERTON_ILB 0x0c60
+diff --git a/include/linux/sched.h b/include/linux/sched.h
+index 0376b054a0d0..c5cc872b351d 100644
+--- a/include/linux/sched.h
++++ b/include/linux/sched.h
+@@ -1947,11 +1947,13 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
+ #define tsk_used_math(p) ((p)->flags & PF_USED_MATH)
+ #define used_math() tsk_used_math(current)
+
+-/* __GFP_IO isn't allowed if PF_MEMALLOC_NOIO is set in current->flags */
++/* __GFP_IO isn't allowed if PF_MEMALLOC_NOIO is set in current->flags
++ * __GFP_FS is also cleared as it implies __GFP_IO.
++ */
+ static inline gfp_t memalloc_noio_flags(gfp_t flags)
+ {
+ if (unlikely(current->flags & PF_MEMALLOC_NOIO))
+- flags &= ~__GFP_IO;
++ flags &= ~(__GFP_IO | __GFP_FS);
+ return flags;
+ }
+
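The sched.h hunk above widens the PF_MEMALLOC_NOIO mask: because `__GFP_FS` implies `__GFP_IO`, both bits must be stripped in a NOIO allocation context. A minimal standalone sketch of that masking follows; the flag values and the `noio_flags`/`in_noio_context` names are illustrative stand-ins, not the kernel's real GFP bit layout or API.

```c
#include <assert.h>

/* Illustrative flag values only -- NOT the kernel's real GFP bit layout. */
#define GFP_IO     0x40u
#define GFP_FS     0x80u
#define GFP_KERNEL (GFP_IO | GFP_FS)

/* Sketch of the fixed memalloc_noio_flags(): in a NOIO allocation
 * context both IO and FS are stripped, because FS implies IO. */
static unsigned int noio_flags(unsigned int flags, int in_noio_context)
{
	if (in_noio_context)
		flags &= ~(GFP_IO | GFP_FS);
	return flags;
}
```

Masking only `__GFP_IO`, as the old code did, would still let an `__GFP_FS` allocation recurse into the filesystem, which is exactly what the NOIO context is meant to prevent.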
+diff --git a/include/uapi/linux/hyperv.h b/include/uapi/linux/hyperv.h
+index 78e4a86030dd..0a8e6badb29b 100644
+--- a/include/uapi/linux/hyperv.h
++++ b/include/uapi/linux/hyperv.h
+@@ -137,7 +137,7 @@ struct hv_do_fcopy {
+ __u64 offset;
+ __u32 size;
+ __u8 data[DATA_FRAGMENT];
+-};
++} __attribute__((packed));
+
+ /*
+ * An implementation of HyperV key value pair (KVP) functionality for Linux.
+diff --git a/kernel/futex.c b/kernel/futex.c
+index c20fb395a672..c5909b46af98 100644
+--- a/kernel/futex.c
++++ b/kernel/futex.c
+@@ -343,6 +343,8 @@ static void get_futex_key_refs(union futex_key *key)
+ case FUT_OFF_MMSHARED:
+ futex_get_mm(key); /* implies MB (B) */
+ break;
++ default:
++ smp_mb(); /* explicit MB (B) */
+ }
+ }
+
+diff --git a/lib/lzo/lzo1x_decompress_safe.c b/lib/lzo/lzo1x_decompress_safe.c
+index 8563081e8da3..a1c387f6afba 100644
+--- a/lib/lzo/lzo1x_decompress_safe.c
++++ b/lib/lzo/lzo1x_decompress_safe.c
+@@ -19,31 +19,21 @@
+ #include <linux/lzo.h>
+ #include "lzodefs.h"
+
+-#define HAVE_IP(t, x) \
+- (((size_t)(ip_end - ip) >= (size_t)(t + x)) && \
+- (((t + x) >= t) && ((t + x) >= x)))
++#define HAVE_IP(x) ((size_t)(ip_end - ip) >= (size_t)(x))
++#define HAVE_OP(x) ((size_t)(op_end - op) >= (size_t)(x))
++#define NEED_IP(x) if (!HAVE_IP(x)) goto input_overrun
++#define NEED_OP(x) if (!HAVE_OP(x)) goto output_overrun
++#define TEST_LB(m_pos) if ((m_pos) < out) goto lookbehind_overrun
+
+-#define HAVE_OP(t, x) \
+- (((size_t)(op_end - op) >= (size_t)(t + x)) && \
+- (((t + x) >= t) && ((t + x) >= x)))
+-
+-#define NEED_IP(t, x) \
+- do { \
+- if (!HAVE_IP(t, x)) \
+- goto input_overrun; \
+- } while (0)
+-
+-#define NEED_OP(t, x) \
+- do { \
+- if (!HAVE_OP(t, x)) \
+- goto output_overrun; \
+- } while (0)
+-
+-#define TEST_LB(m_pos) \
+- do { \
+- if ((m_pos) < out) \
+- goto lookbehind_overrun; \
+- } while (0)
++/* This MAX_255_COUNT is the maximum number of times we can add 255 to a base
++ * count without overflowing an integer. The multiply will overflow when
++ * multiplying 255 by more than MAXINT/255. The sum will overflow earlier
++ * depending on the base count. Since the base count is taken from a u8
++ * and a few bits, it is safe to assume that it will always be lower than
++ * or equal to 2*255, thus we can always prevent any overflow by accepting
++ * two less 255 steps. See Documentation/lzo.txt for more information.
++ */
++#define MAX_255_COUNT ((((size_t)~0) / 255) - 2)
+
+ int lzo1x_decompress_safe(const unsigned char *in, size_t in_len,
+ unsigned char *out, size_t *out_len)
+@@ -75,17 +65,24 @@ int lzo1x_decompress_safe(const unsigned char *in, size_t in_len,
+ if (t < 16) {
+ if (likely(state == 0)) {
+ if (unlikely(t == 0)) {
++ size_t offset;
++ const unsigned char *ip_last = ip;
++
+ while (unlikely(*ip == 0)) {
+- t += 255;
+ ip++;
+- NEED_IP(1, 0);
++ NEED_IP(1);
+ }
+- t += 15 + *ip++;
++ offset = ip - ip_last;
++ if (unlikely(offset > MAX_255_COUNT))
++ return LZO_E_ERROR;
++
++ offset = (offset << 8) - offset;
++ t += offset + 15 + *ip++;
+ }
+ t += 3;
+ copy_literal_run:
+ #if defined(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS)
+- if (likely(HAVE_IP(t, 15) && HAVE_OP(t, 15))) {
++ if (likely(HAVE_IP(t + 15) && HAVE_OP(t + 15))) {
+ const unsigned char *ie = ip + t;
+ unsigned char *oe = op + t;
+ do {
+@@ -101,8 +98,8 @@ copy_literal_run:
+ } else
+ #endif
+ {
+- NEED_OP(t, 0);
+- NEED_IP(t, 3);
++ NEED_OP(t);
++ NEED_IP(t + 3);
+ do {
+ *op++ = *ip++;
+ } while (--t > 0);
+@@ -115,7 +112,7 @@ copy_literal_run:
+ m_pos -= t >> 2;
+ m_pos -= *ip++ << 2;
+ TEST_LB(m_pos);
+- NEED_OP(2, 0);
++ NEED_OP(2);
+ op[0] = m_pos[0];
+ op[1] = m_pos[1];
+ op += 2;
+@@ -136,13 +133,20 @@ copy_literal_run:
+ } else if (t >= 32) {
+ t = (t & 31) + (3 - 1);
+ if (unlikely(t == 2)) {
++ size_t offset;
++ const unsigned char *ip_last = ip;
++
+ while (unlikely(*ip == 0)) {
+- t += 255;
+ ip++;
+- NEED_IP(1, 0);
++ NEED_IP(1);
+ }
+- t += 31 + *ip++;
+- NEED_IP(2, 0);
++ offset = ip - ip_last;
++ if (unlikely(offset > MAX_255_COUNT))
++ return LZO_E_ERROR;
++
++ offset = (offset << 8) - offset;
++ t += offset + 31 + *ip++;
++ NEED_IP(2);
+ }
+ m_pos = op - 1;
+ next = get_unaligned_le16(ip);
+@@ -154,13 +158,20 @@ copy_literal_run:
+ m_pos -= (t & 8) << 11;
+ t = (t & 7) + (3 - 1);
+ if (unlikely(t == 2)) {
++ size_t offset;
++ const unsigned char *ip_last = ip;
++
+ while (unlikely(*ip == 0)) {
+- t += 255;
+ ip++;
+- NEED_IP(1, 0);
++ NEED_IP(1);
+ }
+- t += 7 + *ip++;
+- NEED_IP(2, 0);
++ offset = ip - ip_last;
++ if (unlikely(offset > MAX_255_COUNT))
++ return LZO_E_ERROR;
++
++ offset = (offset << 8) - offset;
++ t += offset + 7 + *ip++;
++ NEED_IP(2);
+ }
+ next = get_unaligned_le16(ip);
+ ip += 2;
+@@ -174,7 +185,7 @@ copy_literal_run:
+ #if defined(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS)
+ if (op - m_pos >= 8) {
+ unsigned char *oe = op + t;
+- if (likely(HAVE_OP(t, 15))) {
++ if (likely(HAVE_OP(t + 15))) {
+ do {
+ COPY8(op, m_pos);
+ op += 8;
+@@ -184,7 +195,7 @@ copy_literal_run:
+ m_pos += 8;
+ } while (op < oe);
+ op = oe;
+- if (HAVE_IP(6, 0)) {
++ if (HAVE_IP(6)) {
+ state = next;
+ COPY4(op, ip);
+ op += next;
+@@ -192,7 +203,7 @@ copy_literal_run:
+ continue;
+ }
+ } else {
+- NEED_OP(t, 0);
++ NEED_OP(t);
+ do {
+ *op++ = *m_pos++;
+ } while (op < oe);
+@@ -201,7 +212,7 @@ copy_literal_run:
+ #endif
+ {
+ unsigned char *oe = op + t;
+- NEED_OP(t, 0);
++ NEED_OP(t);
+ op[0] = m_pos[0];
+ op[1] = m_pos[1];
+ op += 2;
+@@ -214,15 +225,15 @@ match_next:
+ state = next;
+ t = next;
+ #if defined(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS)
+- if (likely(HAVE_IP(6, 0) && HAVE_OP(4, 0))) {
++ if (likely(HAVE_IP(6) && HAVE_OP(4))) {
+ COPY4(op, ip);
+ op += t;
+ ip += t;
+ } else
+ #endif
+ {
+- NEED_IP(t, 3);
+- NEED_OP(t, 0);
++ NEED_IP(t + 3);
++ NEED_OP(t);
+ while (t > 0) {
+ *op++ = *ip++;
+ t--;
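The lzo1x hunks above replace the per-byte `t += 255` accumulation with counting the run of zero bytes and converting it to a length in one multiply, rejected early via `MAX_255_COUNT` so the multiply cannot overflow. The sketch below models that pattern outside the kernel; `run_length` and its return convention are invented for illustration and are not the decompressor's real interface.

```c
#include <assert.h>
#include <stddef.h>

/* Same guard as the patch: the largest number of 255-steps that fits in
 * a size_t without overflow, minus two to cover the base count. */
#define MAX_255_COUNT ((((size_t)~0) / 255) - 2)

/* Count leading zero bytes and fold them into a length the way the
 * patched decompressor does: offset * 255 == (offset << 8) - offset.
 * Returns bytes consumed (including the terminating non-zero byte),
 * or -1 on input overrun / would-be overflow. */
static int run_length(const unsigned char *ip, size_t avail, size_t *t)
{
	const unsigned char *ip_last = ip;
	size_t offset;

	while (avail && *ip == 0) {
		ip++;
		avail--;
	}
	if (!avail)
		return -1;               /* input overrun */
	offset = (size_t)(ip - ip_last);
	if (offset > MAX_255_COUNT)
		return -1;               /* reject before multiplying */
	*t = (offset << 8) - offset;     /* offset * 255, overflow-free */
	return (int)(offset + 1);
}
```

The pre-patch loop accumulated `t += 255` once per zero byte, so a crafted stream of zeros could wrap `t` and defeat the bounds checks; counting first and bounding the count closes that hole.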
+diff --git a/net/bluetooth/l2cap_core.c b/net/bluetooth/l2cap_core.c
+index 323f23cd2c37..84c0a21c1cda 100644
+--- a/net/bluetooth/l2cap_core.c
++++ b/net/bluetooth/l2cap_core.c
+@@ -2400,12 +2400,8 @@ static int l2cap_segment_le_sdu(struct l2cap_chan *chan,
+
+ BT_DBG("chan %p, msg %p, len %zu", chan, msg, len);
+
+- pdu_len = chan->conn->mtu - L2CAP_HDR_SIZE;
+-
+- pdu_len = min_t(size_t, pdu_len, chan->remote_mps);
+-
+ sdu_len = len;
+- pdu_len -= L2CAP_SDULEN_SIZE;
++ pdu_len = chan->remote_mps - L2CAP_SDULEN_SIZE;
+
+ while (len > 0) {
+ if (len <= pdu_len)
+diff --git a/net/bluetooth/smp.c b/net/bluetooth/smp.c
+index e33a982161c1..7b7f3de79db9 100644
+--- a/net/bluetooth/smp.c
++++ b/net/bluetooth/smp.c
+@@ -432,8 +432,11 @@ static int tk_request(struct l2cap_conn *conn, u8 remote_oob, u8 auth,
+ }
+
+ /* Not Just Works/Confirm results in MITM Authentication */
+- if (method != JUST_CFM)
++ if (method != JUST_CFM) {
+ set_bit(SMP_FLAG_MITM_AUTH, &smp->flags);
++ if (hcon->pending_sec_level < BT_SECURITY_HIGH)
++ hcon->pending_sec_level = BT_SECURITY_HIGH;
++ }
+
+ /* If both devices have Keyoard-Display I/O, the master
+ * Confirms and the slave Enters the passkey.
+diff --git a/security/integrity/ima/ima_appraise.c b/security/integrity/ima/ima_appraise.c
+index d3113d4aaa3c..bd8cef5b67e4 100644
+--- a/security/integrity/ima/ima_appraise.c
++++ b/security/integrity/ima/ima_appraise.c
+@@ -194,8 +194,11 @@ int ima_appraise_measurement(int func, struct integrity_iint_cache *iint,
+ goto out;
+
+ cause = "missing-hash";
+- status =
+- (inode->i_size == 0) ? INTEGRITY_PASS : INTEGRITY_NOLABEL;
++ status = INTEGRITY_NOLABEL;
++ if (inode->i_size == 0) {
++ iint->flags |= IMA_NEW_FILE;
++ status = INTEGRITY_PASS;
++ }
+ goto out;
+ }
+
+diff --git a/security/integrity/ima/ima_crypto.c b/security/integrity/ima/ima_crypto.c
+index ccd0ac8fa9a0..b126a78d5763 100644
+--- a/security/integrity/ima/ima_crypto.c
++++ b/security/integrity/ima/ima_crypto.c
+@@ -40,19 +40,19 @@ static int ima_kernel_read(struct file *file, loff_t offset,
+ {
+ mm_segment_t old_fs;
+ char __user *buf = addr;
+- ssize_t ret;
++ ssize_t ret = -EINVAL;
+
+ if (!(file->f_mode & FMODE_READ))
+ return -EBADF;
+- if (!file->f_op->read && !file->f_op->aio_read)
+- return -EINVAL;
+
+ old_fs = get_fs();
+ set_fs(get_ds());
+ if (file->f_op->read)
+ ret = file->f_op->read(file, buf, count, &offset);
+- else
++ else if (file->f_op->aio_read)
+ ret = do_sync_read(file, buf, count, &offset);
++ else if (file->f_op->read_iter)
++ ret = new_sync_read(file, buf, count, &offset);
+ set_fs(old_fs);
+ return ret;
+ }
+diff --git a/security/integrity/ima/ima_main.c b/security/integrity/ima/ima_main.c
+index 09baa335ebc7..e7745a07146d 100644
+--- a/security/integrity/ima/ima_main.c
++++ b/security/integrity/ima/ima_main.c
+@@ -128,11 +128,13 @@ static void ima_check_last_writer(struct integrity_iint_cache *iint,
+ return;
+
+ mutex_lock(&inode->i_mutex);
+- if (atomic_read(&inode->i_writecount) == 1 &&
+- iint->version != inode->i_version) {
+- iint->flags &= ~IMA_DONE_MASK;
+- if (iint->flags & IMA_APPRAISE)
+- ima_update_xattr(iint, file);
++ if (atomic_read(&inode->i_writecount) == 1) {
++ if ((iint->version != inode->i_version) ||
++ (iint->flags & IMA_NEW_FILE)) {
++ iint->flags &= ~(IMA_DONE_MASK | IMA_NEW_FILE);
++ if (iint->flags & IMA_APPRAISE)
++ ima_update_xattr(iint, file);
++ }
+ }
+ mutex_unlock(&inode->i_mutex);
+ }
+diff --git a/security/integrity/integrity.h b/security/integrity/integrity.h
+index 33c0a70f6b15..2f8715d77a5a 100644
+--- a/security/integrity/integrity.h
++++ b/security/integrity/integrity.h
+@@ -31,6 +31,7 @@
+ #define IMA_DIGSIG 0x01000000
+ #define IMA_DIGSIG_REQUIRED 0x02000000
+ #define IMA_PERMIT_DIRECTIO 0x04000000
++#define IMA_NEW_FILE 0x08000000
+
+ #define IMA_DO_MASK (IMA_MEASURE | IMA_APPRAISE | IMA_AUDIT | \
+ IMA_APPRAISE_SUBMASK)
+diff --git a/sound/core/pcm_native.c b/sound/core/pcm_native.c
+index b653ab001fba..39c572806d0d 100644
+--- a/sound/core/pcm_native.c
++++ b/sound/core/pcm_native.c
+@@ -3190,7 +3190,7 @@ static const struct vm_operations_struct snd_pcm_vm_ops_data_fault = {
+
+ #ifndef ARCH_HAS_DMA_MMAP_COHERENT
+ /* This should be defined / handled globally! */
+-#ifdef CONFIG_ARM
++#if defined(CONFIG_ARM) || defined(CONFIG_ARM64)
+ #define ARCH_HAS_DMA_MMAP_COHERENT
+ #endif
+ #endif
+diff --git a/sound/firewire/bebob/bebob_terratec.c b/sound/firewire/bebob/bebob_terratec.c
+index eef8ea7d9b97..0e4c0bfc463b 100644
+--- a/sound/firewire/bebob/bebob_terratec.c
++++ b/sound/firewire/bebob/bebob_terratec.c
+@@ -17,10 +17,10 @@ phase88_rack_clk_src_get(struct snd_bebob *bebob, unsigned int *id)
+ unsigned int enable_ext, enable_word;
+ int err;
+
+- err = avc_audio_get_selector(bebob->unit, 0, 0, &enable_ext);
++ err = avc_audio_get_selector(bebob->unit, 0, 9, &enable_ext);
+ if (err < 0)
+ goto end;
+- err = avc_audio_get_selector(bebob->unit, 0, 0, &enable_word);
++ err = avc_audio_get_selector(bebob->unit, 0, 8, &enable_word);
+ if (err < 0)
+ goto end;
+
+diff --git a/sound/pci/emu10k1/emu10k1_callback.c b/sound/pci/emu10k1/emu10k1_callback.c
+index 3f3ef38d9b6e..874cd76c7b7f 100644
+--- a/sound/pci/emu10k1/emu10k1_callback.c
++++ b/sound/pci/emu10k1/emu10k1_callback.c
+@@ -85,6 +85,8 @@ snd_emu10k1_ops_setup(struct snd_emux *emux)
+ * get more voice for pcm
+ *
+ * terminate most inactive voice and give it as a pcm voice.
++ *
++ * voice_lock is already held.
+ */
+ int
+ snd_emu10k1_synth_get_voice(struct snd_emu10k1 *hw)
+@@ -92,12 +94,10 @@ snd_emu10k1_synth_get_voice(struct snd_emu10k1 *hw)
+ struct snd_emux *emu;
+ struct snd_emux_voice *vp;
+ struct best_voice best[V_END];
+- unsigned long flags;
+ int i;
+
+ emu = hw->synth;
+
+- spin_lock_irqsave(&emu->voice_lock, flags);
+ lookup_voices(emu, hw, best, 1); /* no OFF voices */
+ for (i = 0; i < V_END; i++) {
+ if (best[i].voice >= 0) {
+@@ -113,11 +113,9 @@ snd_emu10k1_synth_get_voice(struct snd_emu10k1 *hw)
+ vp->emu->num_voices--;
+ vp->ch = -1;
+ vp->state = SNDRV_EMUX_ST_OFF;
+- spin_unlock_irqrestore(&emu->voice_lock, flags);
+ return ch;
+ }
+ }
+- spin_unlock_irqrestore(&emu->voice_lock, flags);
+
+ /* not found */
+ return -ENOMEM;
+diff --git a/sound/pci/hda/hda_local.h b/sound/pci/hda/hda_local.h
+index 4e2d4863daa1..cb06a553b9d9 100644
+--- a/sound/pci/hda/hda_local.h
++++ b/sound/pci/hda/hda_local.h
+@@ -424,7 +424,7 @@ struct snd_hda_pin_quirk {
+ .subvendor = _subvendor,\
+ .name = _name,\
+ .value = _value,\
+- .pins = (const struct hda_pintbl[]) { _pins } \
++ .pins = (const struct hda_pintbl[]) { _pins, {0, 0}} \
+ }
+ #else
+
+@@ -432,7 +432,7 @@ struct snd_hda_pin_quirk {
+ { .codec = _codec,\
+ .subvendor = _subvendor,\
+ .value = _value,\
+- .pins = (const struct hda_pintbl[]) { _pins } \
++ .pins = (const struct hda_pintbl[]) { _pins, {0, 0}} \
+ }
+
+ #endif
+diff --git a/sound/pci/hda/patch_hdmi.c b/sound/pci/hda/patch_hdmi.c
+index ba4ca52072ff..ddd825bce575 100644
+--- a/sound/pci/hda/patch_hdmi.c
++++ b/sound/pci/hda/patch_hdmi.c
+@@ -1574,19 +1574,22 @@ static bool hdmi_present_sense(struct hdmi_spec_per_pin *per_pin, int repoll)
+ }
+ }
+
+- if (pin_eld->eld_valid && !eld->eld_valid) {
+- update_eld = true;
++ if (pin_eld->eld_valid != eld->eld_valid)
+ eld_changed = true;
+- }
++
++ if (pin_eld->eld_valid && !eld->eld_valid)
++ update_eld = true;
++
+ if (update_eld) {
+ bool old_eld_valid = pin_eld->eld_valid;
+ pin_eld->eld_valid = eld->eld_valid;
+- eld_changed = pin_eld->eld_size != eld->eld_size ||
++ if (pin_eld->eld_size != eld->eld_size ||
+ memcmp(pin_eld->eld_buffer, eld->eld_buffer,
+- eld->eld_size) != 0;
+- if (eld_changed)
++ eld->eld_size) != 0) {
+ memcpy(pin_eld->eld_buffer, eld->eld_buffer,
+ eld->eld_size);
++ eld_changed = true;
++ }
+ pin_eld->eld_size = eld->eld_size;
+ pin_eld->info = eld->info;
+
+diff --git a/sound/pci/hda/patch_realtek.c b/sound/pci/hda/patch_realtek.c
+index 88e4623d4f97..c8bf72832731 100644
+--- a/sound/pci/hda/patch_realtek.c
++++ b/sound/pci/hda/patch_realtek.c
+@@ -3103,6 +3103,9 @@ static void alc283_shutup(struct hda_codec *codec)
+
+ alc_write_coef_idx(codec, 0x43, 0x9004);
+
++ /*depop hp during suspend*/
++ alc_write_coef_idx(codec, 0x06, 0x2100);
++
+ snd_hda_codec_write(codec, hp_pin, 0,
+ AC_VERB_SET_AMP_GAIN_MUTE, AMP_OUT_MUTE);
+
+@@ -5575,9 +5578,9 @@ static void alc662_led_gpio1_mute_hook(void *private_data, int enabled)
+ unsigned int oldval = spec->gpio_led;
+
+ if (enabled)
+- spec->gpio_led &= ~0x01;
+- else
+ spec->gpio_led |= 0x01;
++ else
++ spec->gpio_led &= ~0x01;
+ if (spec->gpio_led != oldval)
+ snd_hda_codec_write(codec, 0x01, 0, AC_VERB_SET_GPIO_DATA,
+ spec->gpio_led);
+diff --git a/sound/usb/quirks-table.h b/sound/usb/quirks-table.h
+index 223c47b33ba3..c657752a420c 100644
+--- a/sound/usb/quirks-table.h
++++ b/sound/usb/quirks-table.h
+@@ -385,6 +385,36 @@ YAMAHA_DEVICE(0x105d, NULL),
+ }
+ },
+ {
++ USB_DEVICE(0x0499, 0x1509),
++ .driver_info = (unsigned long) & (const struct snd_usb_audio_quirk) {
++ /* .vendor_name = "Yamaha", */
++ /* .product_name = "Steinberg UR22", */
++ .ifnum = QUIRK_ANY_INTERFACE,
++ .type = QUIRK_COMPOSITE,
++ .data = (const struct snd_usb_audio_quirk[]) {
++ {
++ .ifnum = 1,
++ .type = QUIRK_AUDIO_STANDARD_INTERFACE
++ },
++ {
++ .ifnum = 2,
++ .type = QUIRK_AUDIO_STANDARD_INTERFACE
++ },
++ {
++ .ifnum = 3,
++ .type = QUIRK_MIDI_YAMAHA
++ },
++ {
++ .ifnum = 4,
++ .type = QUIRK_IGNORE_INTERFACE
++ },
++ {
++ .ifnum = -1
++ }
++ }
++ }
++},
++{
+ USB_DEVICE(0x0499, 0x150a),
+ .driver_info = (unsigned long) & (const struct snd_usb_audio_quirk) {
+ /* .vendor_name = "Yamaha", */
+diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
+index 4b6c01b477f9..438851c2a797 100644
+--- a/virt/kvm/kvm_main.c
++++ b/virt/kvm/kvm_main.c
+@@ -52,6 +52,7 @@
+
+ #include <asm/processor.h>
+ #include <asm/io.h>
++#include <asm/ioctl.h>
+ #include <asm/uaccess.h>
+ #include <asm/pgtable.h>
+
+@@ -95,8 +96,6 @@ static int hardware_enable_all(void);
+ static void hardware_disable_all(void);
+
+ static void kvm_io_bus_destroy(struct kvm_io_bus *bus);
+-static void update_memslots(struct kvm_memslots *slots,
+- struct kvm_memory_slot *new, u64 last_generation);
+
+ static void kvm_release_pfn_dirty(pfn_t pfn);
+ static void mark_page_dirty_in_slot(struct kvm *kvm,
+@@ -474,6 +473,13 @@ static struct kvm *kvm_create_vm(unsigned long type)
+ kvm->memslots = kzalloc(sizeof(struct kvm_memslots), GFP_KERNEL);
+ if (!kvm->memslots)
+ goto out_err_no_srcu;
++
++ /*
++ * Init kvm generation close to the maximum to easily test the
++ * code of handling generation number wrap-around.
++ */
++ kvm->memslots->generation = -150;
++
+ kvm_init_memslots_id(kvm);
+ if (init_srcu_struct(&kvm->srcu))
+ goto out_err_no_srcu;
+@@ -685,8 +691,7 @@ static void sort_memslots(struct kvm_memslots *slots)
+ }
+
+ static void update_memslots(struct kvm_memslots *slots,
+- struct kvm_memory_slot *new,
+- u64 last_generation)
++ struct kvm_memory_slot *new)
+ {
+ if (new) {
+ int id = new->id;
+@@ -697,8 +702,6 @@ static void update_memslots(struct kvm_memslots *slots,
+ if (new->npages != npages)
+ sort_memslots(slots);
+ }
+-
+- slots->generation = last_generation + 1;
+ }
+
+ static int check_memory_region_flags(struct kvm_userspace_memory_region *mem)
+@@ -720,10 +723,24 @@ static struct kvm_memslots *install_new_memslots(struct kvm *kvm,
+ {
+ struct kvm_memslots *old_memslots = kvm->memslots;
+
+- update_memslots(slots, new, kvm->memslots->generation);
++ /*
++ * Set the low bit in the generation, which disables SPTE caching
++ * until the end of synchronize_srcu_expedited.
++ */
++ WARN_ON(old_memslots->generation & 1);
++ slots->generation = old_memslots->generation + 1;
++
++ update_memslots(slots, new);
+ rcu_assign_pointer(kvm->memslots, slots);
+ synchronize_srcu_expedited(&kvm->srcu);
+
++ /*
++ * Increment the new memslot generation a second time. This prevents
++ * vm exits that race with memslot updates from caching a memslot
++ * generation that will (potentially) be valid forever.
++ */
++ slots->generation++;
++
+ kvm_arch_memslots_updated(kvm);
+
+ return old_memslots;
+@@ -1973,6 +1990,9 @@ static long kvm_vcpu_ioctl(struct file *filp,
+ if (vcpu->kvm->mm != current->mm)
+ return -EIO;
+
++ if (unlikely(_IOC_TYPE(ioctl) != KVMIO))
++ return -EINVAL;
++
+ #if defined(CONFIG_S390) || defined(CONFIG_PPC) || defined(CONFIG_MIPS)
+ /*
+ * Special cases: vcpu ioctls that are asynchronous to vcpu execution,
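The `install_new_memslots` hunk above uses the low bit of the memslot generation as an "update in progress" marker: the generation is made odd before the SRCU grace period and bumped again afterwards, so a VM exit that races with the update can never cache a generation that stays valid. A toy model of that sequence, assuming nothing beyond the arithmetic shown in the hunk (the SRCU machinery is elided as a comment):

```c
#include <assert.h>

struct memslots { unsigned long long generation; };

/* Toy model of the patched install_new_memslots() generation dance. */
static void install(struct memslots *old, struct memslots *new_slots)
{
	assert((old->generation & 1) == 0);          /* WARN_ON in the patch */
	new_slots->generation = old->generation + 1; /* odd: caching disabled */
	/* ... rcu_assign_pointer() + synchronize_srcu_expedited() here ... */
	new_slots->generation++;                     /* even again, +2 overall */
}
```

Any lookup that observed the odd intermediate value sees a generation that no later memslots object will ever carry, which is what makes the stale-cache race benign.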
* [gentoo-commits] proj/linux-patches:3.16 commit in: /
@ 2014-11-29 18:05 Mike Pagano
0 siblings, 0 replies; 26+ messages in thread
From: Mike Pagano @ 2014-11-29 18:05 UTC (permalink / raw
To: gentoo-commits
commit: 41cf3e1a269f2ff1d94992251fbc4e65e0c35417
Author: Mike Pagano <mpagano <AT> gentoo <DOT> org>
AuthorDate: Sat Nov 29 18:03:46 2014 +0000
Commit: Mike Pagano <mpagano <AT> gentoo <DOT> org>
CommitDate: Sat Nov 29 18:03:46 2014 +0000
URL: http://sources.gentoo.org/gitweb/?p=proj/linux-patches.git;a=commit;h=41cf3e1a
Bump BFQ patchset to v7r6-3.16
---
...-cgroups-kconfig-build-bits-for-v7r6-3.16.patch | 6 +-
...ck-introduce-the-v7r6-I-O-sched-for-3.17.patch1 | 421 ++++++++++++++++++---
...add-Early-Queue-Merge-EQM-v7r6-for-3.16.0.patch | 194 ++++++----
3 files changed, 474 insertions(+), 147 deletions(-)
diff --git a/5001_BFQ-1-block-cgroups-kconfig-build-bits-for-v7r5-3.16.patch b/5001_BFQ-1-block-cgroups-kconfig-build-bits-for-v7r6-3.16.patch
similarity index 97%
rename from 5001_BFQ-1-block-cgroups-kconfig-build-bits-for-v7r5-3.16.patch
rename to 5001_BFQ-1-block-cgroups-kconfig-build-bits-for-v7r6-3.16.patch
index 088bd05..7f6a5f4 100644
--- a/5001_BFQ-1-block-cgroups-kconfig-build-bits-for-v7r5-3.16.patch
+++ b/5001_BFQ-1-block-cgroups-kconfig-build-bits-for-v7r6-3.16.patch
@@ -1,7 +1,7 @@
-From 6519e5beef1063a86d3fc917cff2592cb599e824 Mon Sep 17 00:00:00 2001
+From 92ef290b97a50b9d60eb928166413140cd7a4802 Mon Sep 17 00:00:00 2001
From: Paolo Valente <paolo.valente@unimore.it>
Date: Thu, 22 May 2014 11:59:35 +0200
-Subject: [PATCH 1/3] block: cgroups, kconfig, build bits for BFQ-v7r5-3.16
+Subject: [PATCH 1/3] block: cgroups, kconfig, build bits for BFQ-v7r6-3.16
Update Kconfig.iosched and do the related Makefile changes to include
kernel configuration options for BFQ. Also add the bfqio controller
@@ -100,5 +100,5 @@ index 98c4f9b..13b010d 100644
SUBSYS(perf_event)
#endif
--
-2.0.3
+2.1.2
diff --git a/5002_BFQ-2-block-introduce-the-v7r5-I-O-sched-for-3.16.patch1 b/5002_BFQ-2-block-introduce-the-v7r6-I-O-sched-for-3.17.patch1
similarity index 92%
rename from 5002_BFQ-2-block-introduce-the-v7r5-I-O-sched-for-3.16.patch1
rename to 5002_BFQ-2-block-introduce-the-v7r6-I-O-sched-for-3.17.patch1
index 6f630ba..7ae3298 100644
--- a/5002_BFQ-2-block-introduce-the-v7r5-I-O-sched-for-3.16.patch1
+++ b/5002_BFQ-2-block-introduce-the-v7r6-I-O-sched-for-3.17.patch1
@@ -1,9 +1,9 @@
-From c56e6c5db41f7137d3e0b38063ef0c944eec1898 Mon Sep 17 00:00:00 2001
+From e4fcd78909604194d930e38874a9313090b80348 Mon Sep 17 00:00:00 2001
From: Paolo Valente <paolo.valente@unimore.it>
Date: Thu, 9 May 2013 19:10:02 +0200
-Subject: [PATCH 2/3] block: introduce the BFQ-v7r5 I/O sched for 3.16
+Subject: [PATCH 2/3] block: introduce the BFQ-v7r6 I/O sched for 3.16
-Add the BFQ-v7r5 I/O scheduler to 3.16.
+Add the BFQ-v7r6 I/O scheduler to 3.16.
The general structure is borrowed from CFQ, as much of the code for
handling I/O contexts. Over time, several useful features have been
ported from CFQ as well (details in the changelog in README.BFQ). A
@@ -56,12 +56,12 @@ until it expires.
Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
---
- block/bfq-cgroup.c | 930 +++++++++++++
+ block/bfq-cgroup.c | 930 ++++++++++++
block/bfq-ioc.c | 36 +
- block/bfq-iosched.c | 3617 +++++++++++++++++++++++++++++++++++++++++++++++++++
- block/bfq-sched.c | 1207 +++++++++++++++++
- block/bfq.h | 742 +++++++++++
- 5 files changed, 6532 insertions(+)
+ block/bfq-iosched.c | 3887 +++++++++++++++++++++++++++++++++++++++++++++++++++
+ block/bfq-sched.c | 1207 ++++++++++++++++
+ block/bfq.h | 773 ++++++++++
+ 5 files changed, 6833 insertions(+)
create mode 100644 block/bfq-cgroup.c
create mode 100644 block/bfq-ioc.c
create mode 100644 block/bfq-iosched.c
@@ -1048,10 +1048,10 @@ index 0000000..7f6b000
+}
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
new file mode 100644
-index 0000000..0a0891b
+index 0000000..b919b03
--- /dev/null
+++ b/block/bfq-iosched.c
-@@ -0,0 +1,3617 @@
+@@ -0,0 +1,3887 @@
+/*
+ * Budget Fair Queueing (BFQ) disk scheduler.
+ *
@@ -1625,6 +1625,220 @@ index 0000000..0a0891b
+ return dur;
+}
+
++/* Empty burst list and add just bfqq (see comments to bfq_handle_burst) */
++static inline void bfq_reset_burst_list(struct bfq_data *bfqd,
++ struct bfq_queue *bfqq)
++{
++ struct bfq_queue *item;
++ struct hlist_node *n;
++
++ hlist_for_each_entry_safe(item, n, &bfqd->burst_list, burst_list_node)
++ hlist_del_init(&item->burst_list_node);
++ hlist_add_head(&bfqq->burst_list_node, &bfqd->burst_list);
++ bfqd->burst_size = 1;
++}
++
++/* Add bfqq to the list of queues in current burst (see bfq_handle_burst) */
++static void bfq_add_to_burst(struct bfq_data *bfqd, struct bfq_queue *bfqq)
++{
++ /* Increment burst size to take into account also bfqq */
++ bfqd->burst_size++;
++
++ if (bfqd->burst_size == bfqd->bfq_large_burst_thresh) {
++ struct bfq_queue *pos, *bfqq_item;
++ struct hlist_node *n;
++
++ /*
++ * Enough queues have been activated shortly after each
++ * other to consider this burst as large.
++ */
++ bfqd->large_burst = true;
++
++ /*
++ * We can now mark all queues in the burst list as
++ * belonging to a large burst.
++ */
++ hlist_for_each_entry(bfqq_item, &bfqd->burst_list,
++ burst_list_node)
++ bfq_mark_bfqq_in_large_burst(bfqq_item);
++ bfq_mark_bfqq_in_large_burst(bfqq);
++
++ /*
++ * From now on, and until the current burst finishes, any
++ * new queue being activated shortly after the last queue
++ * was inserted in the burst can be immediately marked as
++ * belonging to a large burst. So the burst list is not
++ * needed any more. Remove it.
++ */
++ hlist_for_each_entry_safe(pos, n, &bfqd->burst_list,
++ burst_list_node)
++ hlist_del_init(&pos->burst_list_node);
++ } else /* burst not yet large: add bfqq to the burst list */
++ hlist_add_head(&bfqq->burst_list_node, &bfqd->burst_list);
++}
++
++/*
++ * If many queues happen to become active shortly after each other, then,
++ * to help the processes associated to these queues get their job done as
++ * soon as possible, it is usually better to not grant either weight-raising
++ * or device idling to these queues. In this comment we describe, firstly,
++ * the reasons why this fact holds, and, secondly, the next function, which
++ * implements the main steps needed to properly mark these queues so that
++ * they can then be treated in a different way.
++ *
++ * As for the terminology, we say that a queue becomes active, i.e.,
++ * switches from idle to backlogged, either when it is created (as a
++ * consequence of the arrival of an I/O request), or, if already existing,
++ * when a new request for the queue arrives while the queue is idle.
++ * Bursts of activations, i.e., activations of different queues occurring
++ * shortly after each other, are typically caused by services or applications
++ * that spawn or reactivate many parallel threads/processes. Examples are
++ * systemd during boot or git grep.
++ *
++ * These services or applications benefit mostly from a high throughput:
++ * the quicker the requests of the activated queues are cumulatively served,
++ * the sooner the target job of these queues gets completed. As a consequence,
++ * weight-raising any of these queues, which also implies idling the device
++ * for it, is almost always counterproductive: in most cases it just lowers
++ * throughput.
++ *
++ * On the other hand, a burst of activations may be also caused by the start
++ * of an application that does not consist in a lot of parallel I/O-bound
++ * threads. In fact, with a complex application, the burst may be just a
++ * consequence of the fact that several processes need to be executed to
++ * start-up the application. To start an application as quickly as possible,
++ * the best thing to do is to privilege the I/O related to the application
++ * with respect to all other I/O. Therefore, the best strategy to start as
++ * quickly as possible an application that causes a burst of activations is
++ * to weight-raise all the queues activated during the burst. This is the
++ * exact opposite of the best strategy for the other type of bursts.
++ *
++ * In the end, to take the best action for each of the two cases, the two
++ * types of bursts need to be distinguished. Fortunately, this seems
++ * relatively easy to do, by looking at the sizes of the bursts. In
++ * particular, we found a threshold such that bursts with a larger size
++ * than that threshold are apparently caused only by services or commands
++ * such as systemd or git grep. For brevity, hereafter we call just 'large'
++ * these bursts. BFQ *does not* weight-raise queues whose activations occur
++ * in a large burst. In addition, for each of these queues BFQ performs or
++ * does not perform idling depending on which choice boosts the throughput
++ * most. The exact choice depends on the device and request pattern at
++ * hand.
++ *
++ * Turning back to the next function, it implements all the steps needed
++ * to detect the occurrence of a large burst and to properly mark all the
++ * queues belonging to it (so that they can then be treated in a different
++ * way). This goal is achieved by maintaining a special "burst list" that
++ * holds, temporarily, the queues that belong to the burst in progress. The
++ * list is then used to mark these queues as belonging to a large burst if
++ * the burst does become large. The main steps are the following.
++ *
++ * . when the very first queue is activated, the queue is inserted into the
++ * list (as it could be the first queue in a possible burst)
++ *
++ * . if the current burst has not yet become large, and a queue Q that does
++ * not yet belong to the burst is activated shortly after the last time
++ * at which a new queue entered the burst list, then the function appends
++ * Q to the burst list
++ *
++ * . if, as a consequence of the previous step, the burst size reaches
++ * the large-burst threshold, then
++ *
++ * . all the queues in the burst list are marked as belonging to a
++ * large burst
++ *
++ * . the burst list is deleted; in fact, the burst list already served
++ * its purpose (keeping temporarily track of the queues in a burst,
++ * so as to be able to mark them as belonging to a large burst in the
++ * previous sub-step), and now is not needed any more
++ *
++ * . the device enters a large-burst mode
++ *
++ * . if a queue Q that does not belong to the burst is activated while
++ * the device is in large-burst mode and shortly after the last time
++ * at which a queue either entered the burst list or was marked as
++ * belonging to the current large burst, then Q is immediately marked
++ * as belonging to a large burst.
++ *
++ * . if a queue Q that does not belong to the burst is activated a while
++ * later, i.e., not shortly after, than the last time at which a queue
++ * either entered the burst list or was marked as belonging to the
++ * current large burst, then the current burst is deemed as finished and:
++ *
++ * . the large-burst mode is reset if set
++ *
++ * . the burst list is emptied
++ *
++ * . Q is inserted in the burst list, as Q may be the first queue
++ * in a possible new burst (then the burst list contains just Q
++ * after this step).
++ */
++static void bfq_handle_burst(struct bfq_data *bfqd, struct bfq_queue *bfqq,
++ bool idle_for_long_time)
++{
++ /*
++ * If bfqq happened to be activated in a burst, but has been idle
++ * for at least as long as an interactive queue, then we assume
++ * that, in the overall I/O initiated in the burst, the I/O
++ * associated to bfqq is finished. So bfqq does not need to be
++ * treated as a queue belonging to a burst anymore. Accordingly,
++ * we reset bfqq's in_large_burst flag if set, and remove bfqq
++ * from the burst list if it's there. We do not decrement instead
++ * burst_size, because the fact that bfqq does not need to belong
++ * to the burst list any more does not invalidate the fact that
++ * bfqq may have been activated during the current burst.
++ */
++ if (idle_for_long_time) {
++ hlist_del_init(&bfqq->burst_list_node);
++ bfq_clear_bfqq_in_large_burst(bfqq);
++ }
++
++ /*
++ * If bfqq is already in the burst list or is part of a large
++ * burst, then there is nothing else to do.
++ */
++ if (!hlist_unhashed(&bfqq->burst_list_node) ||
++ bfq_bfqq_in_large_burst(bfqq))
++ return;
++
++ /*
++ * If bfqq's activation happens late enough, then the current
++ * burst is finished, and related data structures must be reset.
++ *
++ * In this respect, consider the special case where bfqq is the very
++ * first queue being activated. In this case, last_ins_in_burst is
++ * not yet significant when we get here. But it is easy to verify
++ * that, whether or not the following condition is true, bfqq will
++ * end up being inserted into the burst list. In particular the
++ * list will happen to contain only bfqq. And this is exactly what
++ * has to happen, as bfqq may be the first queue in a possible
++ * burst.
++ */
++ if (time_is_before_jiffies(bfqd->last_ins_in_burst +
++ bfqd->bfq_burst_interval)) {
++ bfqd->large_burst = false;
++ bfq_reset_burst_list(bfqd, bfqq);
++ return;
++ }
++
++ /*
++ * If we get here, then bfqq is being activated shortly after the
++ * last queue. So, if the current burst is also large, we can mark
++ * bfqq as belonging to this large burst immediately.
++ */
++ if (bfqd->large_burst) {
++ bfq_mark_bfqq_in_large_burst(bfqq);
++ return;
++ }
++
++ /*
++ * If we get here, then a large-burst state has not yet been
++ * reached, but bfqq is being activated shortly after the last
++ * queue. Then we add bfqq to the burst.
++ */
++ bfq_add_to_burst(bfqd, bfqq);
++}
++
+static void bfq_add_request(struct request *rq)
+{
+ struct bfq_queue *bfqq = RQ_BFQQ(rq);
@@ -1632,7 +1846,7 @@ index 0000000..0a0891b
+ struct bfq_data *bfqd = bfqq->bfqd;
+ struct request *next_rq, *prev;
+ unsigned long old_wr_coeff = bfqq->wr_coeff;
-+ int idle_for_long_time = 0;
++ bool interactive = false;
+
+ bfq_log_bfqq(bfqd, bfqq, "add_request %d", rq_is_sync(rq));
+ bfqq->queued[rq_is_sync(rq)]++;
@@ -1655,11 +1869,35 @@ index 0000000..0a0891b
+ bfq_rq_pos_tree_add(bfqd, bfqq);
+
+ if (!bfq_bfqq_busy(bfqq)) {
-+ int soft_rt = bfqd->bfq_wr_max_softrt_rate > 0 &&
++ bool soft_rt,
++ idle_for_long_time = time_is_before_jiffies(
++ bfqq->budget_timeout +
++ bfqd->bfq_wr_min_idle_time);
++
++ if (bfq_bfqq_sync(bfqq)) {
++ bool already_in_burst =
++ !hlist_unhashed(&bfqq->burst_list_node) ||
++ bfq_bfqq_in_large_burst(bfqq);
++ bfq_handle_burst(bfqd, bfqq, idle_for_long_time);
++ /*
++ * If bfqq was not already in the current burst,
++ * then, at this point, bfqq either has been
++ * added to the current burst or has caused the
++ * current burst to terminate. In particular, in
++ * the second case, bfqq has become the first
++ * queue in a possible new burst.
++ * In both cases last_ins_in_burst needs to be
++ * moved forward.
++ */
++ if (!already_in_burst)
++ bfqd->last_ins_in_burst = jiffies;
++ }
++
++ soft_rt = bfqd->bfq_wr_max_softrt_rate > 0 &&
++ !bfq_bfqq_in_large_burst(bfqq) &&
+ time_is_before_jiffies(bfqq->soft_rt_next_start);
-+ idle_for_long_time = time_is_before_jiffies(
-+ bfqq->budget_timeout +
-+ bfqd->bfq_wr_min_idle_time);
++ interactive = !bfq_bfqq_in_large_burst(bfqq) &&
++ idle_for_long_time;
+ entity->budget = max_t(unsigned long, bfqq->max_budget,
+ bfq_serv_to_charge(next_rq, bfqq));
+
@@ -1682,9 +1920,9 @@ index 0000000..0a0891b
+ * If the queue is not being boosted and has been idle
+ * for enough time, start a weight-raising period
+ */
-+ if (old_wr_coeff == 1 && (idle_for_long_time || soft_rt)) {
++ if (old_wr_coeff == 1 && (interactive || soft_rt)) {
+ bfqq->wr_coeff = bfqd->bfq_wr_coeff;
-+ if (idle_for_long_time)
++ if (interactive)
+ bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
+ else
+ bfqq->wr_cur_max_time =
@@ -1694,11 +1932,12 @@ index 0000000..0a0891b
+ jiffies,
+ jiffies_to_msecs(bfqq->wr_cur_max_time));
+ } else if (old_wr_coeff > 1) {
-+ if (idle_for_long_time)
++ if (interactive)
+ bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
-+ else if (bfqq->wr_cur_max_time ==
-+ bfqd->bfq_wr_rt_max_time &&
-+ !soft_rt) {
++ else if (bfq_bfqq_in_large_burst(bfqq) ||
++ (bfqq->wr_cur_max_time ==
++ bfqd->bfq_wr_rt_max_time &&
++ !soft_rt)) {
+ bfqq->wr_coeff = 1;
+ bfq_log_bfqq(bfqd, bfqq,
+ "wrais ending at %lu, rais_max_time %u",
@@ -1787,8 +2026,7 @@ index 0000000..0a0891b
+ }
+
+ if (bfqd->low_latency &&
-+ (old_wr_coeff == 1 || bfqq->wr_coeff == 1 ||
-+ idle_for_long_time))
++ (old_wr_coeff == 1 || bfqq->wr_coeff == 1 || interactive))
+ bfqq->last_wr_start_finish = jiffies;
+}
+
@@ -2291,9 +2529,7 @@ index 0000000..0a0891b
+ return rq;
+}
+
-+/*
-+ * Must be called with the queue_lock held.
-+ */
++/* Must be called with the queue_lock held. */
+static int bfqq_process_refs(struct bfq_queue *bfqq)
+{
+ int process_refs, io_refs;
@@ -2896,16 +3132,26 @@ index 0000000..0a0891b
+ * long comment, we try to briefly describe all the details and motivations
+ * behind the components of this logical expression.
+ *
-+ * First, the expression may be true only for sync queues. Besides, if
-+ * bfqq is also being weight-raised, then the expression always evaluates
-+ * to true, as device idling is instrumental for preserving low-latency
-+ * guarantees (see [1]). Otherwise, the expression evaluates to true only
-+ * if bfqq has a non-null idle window and at least one of the following
-+ * two conditions holds. The first condition is that the device is not
-+ * performing NCQ, because idling the device most certainly boosts the
-+ * throughput if this condition holds and bfqq has been granted a non-null
-+ * idle window. The second compound condition is made of the logical AND of
-+ * two components.
++ * First, the expression is false if bfqq is not sync, or if: bfqq happened
++ * to become active during a large burst of queue activations, and the
++ * pattern of requests bfqq contains boosts the throughput if bfqq is
++ * expired. In fact, queues that became active during a large burst benefit
++ * only from throughput, as discussed in the comments to bfq_handle_burst.
++ * In this respect, expiring bfqq certainly boosts the throughput on NCQ-
++ * capable flash-based devices, whereas, on rotational devices, it boosts
++ * the throughput only if bfqq contains random requests.
++ *
++ * On the opposite end, if (a) bfqq is sync, (b) the above burst-related
++ * condition does not hold, and (c) bfqq is being weight-raised, then the
++ * expression always evaluates to true, as device idling is instrumental
++ * for preserving low-latency guarantees (see [1]). If, instead, conditions
++ * (a) and (b) do hold, but (c) does not, then the expression evaluates to
++ * true only if: (1) bfqq is I/O-bound and has a non-null idle window, and
++ * (2) at least one of the following two conditions holds.
++ * The first condition is that the device is not performing NCQ, because
++ * idling the device most certainly boosts the throughput if this condition
++ * holds and bfqq is I/O-bound and has been granted a non-null idle window.
++ * The second compound condition is made of the logical AND of two components.
+ *
+ * The first component is true only if there is no weight-raised busy
+ * queue. This guarantees that the device is not idled for a sync non-
@@ -3022,6 +3268,12 @@ index 0000000..0a0891b
+#define cond_for_seeky_on_ncq_hdd (bfq_bfqq_constantly_seeky(bfqq) && \
+ bfqd->busy_in_flight_queues == \
+ bfqd->const_seeky_busy_in_flight_queues)
++
++#define cond_for_expiring_in_burst (bfq_bfqq_in_large_burst(bfqq) && \
++ bfqd->hw_tag && \
++ (blk_queue_nonrot(bfqd->queue) || \
++ bfq_bfqq_constantly_seeky(bfqq)))
++
+/*
+ * Condition for expiring a non-weight-raised queue (and hence not idling
+ * the device).
@@ -3033,9 +3285,9 @@ index 0000000..0a0891b
+ cond_for_seeky_on_ncq_hdd))))
+
+ return bfq_bfqq_sync(bfqq) &&
-+ (bfq_bfqq_IO_bound(bfqq) || bfqq->wr_coeff > 1) &&
++ !cond_for_expiring_in_burst &&
+ (bfqq->wr_coeff > 1 ||
-+ (bfq_bfqq_idle_window(bfqq) &&
++ (bfq_bfqq_IO_bound(bfqq) && bfq_bfqq_idle_window(bfqq) &&
+ !cond_for_expiring_non_wr)
+ );
+}
@@ -3179,10 +3431,12 @@ index 0000000..0a0891b
+ if (entity->ioprio_changed)
+ bfq_log_bfqq(bfqd, bfqq, "WARN: pending prio change");
+ /*
-+ * If too much time has elapsed from the beginning
-+ * of this weight-raising, stop it.
++ * If the queue was activated in a burst, or
++ * too much time has elapsed from the beginning
++ * of this weight-raising, then end weight raising.
+ */
-+ if (time_is_before_jiffies(bfqq->last_wr_start_finish +
++ if (bfq_bfqq_in_large_burst(bfqq) ||
++ time_is_before_jiffies(bfqq->last_wr_start_finish +
+ bfqq->wr_cur_max_time)) {
+ bfqq->last_wr_start_finish = jiffies;
+ bfq_log_bfqq(bfqd, bfqq,
@@ -3387,6 +3641,17 @@ index 0000000..0a0891b
+ BUG_ON(bfq_bfqq_busy(bfqq));
+ BUG_ON(bfqd->in_service_queue == bfqq);
+
++ if (bfq_bfqq_sync(bfqq))
++ /*
++ * The fact that this queue is being destroyed does not
++ * invalidate the fact that this queue may have been
++ * activated during the current burst. As a consequence,
++ * although the queue does not exist anymore, and hence
++ * needs to be removed from the burst list if there,
++ * the burst size has not to be decremented.
++ */
++ hlist_del_init(&bfqq->burst_list_node);
++
+ bfq_log_bfqq(bfqd, bfqq, "put_queue: %p freed", bfqq);
+
+ kmem_cache_free(bfq_pool, bfqq);
@@ -3540,6 +3805,7 @@ index 0000000..0a0891b
+{
+ RB_CLEAR_NODE(&bfqq->entity.rb_node);
+ INIT_LIST_HEAD(&bfqq->fifo);
++ INIT_HLIST_NODE(&bfqq->burst_list_node);
+
+ atomic_set(&bfqq->ref, 0);
+ bfqq->bfqd = bfqd;
@@ -4298,6 +4564,7 @@ index 0000000..0a0891b
+
+ INIT_LIST_HEAD(&bfqd->active_list);
+ INIT_LIST_HEAD(&bfqd->idle_list);
++ INIT_HLIST_HEAD(&bfqd->burst_list);
+
+ bfqd->hw_tag = -1;
+
@@ -4318,6 +4585,9 @@ index 0000000..0a0891b
+ bfqd->bfq_failed_cooperations = 7000;
+ bfqd->bfq_requests_within_timer = 120;
+
++ bfqd->bfq_large_burst_thresh = 11;
++ bfqd->bfq_burst_interval = msecs_to_jiffies(500);
++
+ bfqd->low_latency = true;
+
+ bfqd->bfq_wr_coeff = 20;
@@ -4653,7 +4923,7 @@ index 0000000..0a0891b
+ device_speed_thresh[1] = (R_fast[1] + R_slow[1]) / 2;
+
+ elv_register(&iosched_bfq);
-+ pr_info("BFQ I/O-scheduler version: v7r5");
++ pr_info("BFQ I/O-scheduler version: v7r6");
+
+ return 0;
+}
@@ -5884,12 +6154,12 @@ index 0000000..c4831b7
+}
diff --git a/block/bfq.h b/block/bfq.h
new file mode 100644
-index 0000000..a83e69d
+index 0000000..0378c86
--- /dev/null
+++ b/block/bfq.h
-@@ -0,0 +1,742 @@
+@@ -0,0 +1,773 @@
+/*
-+ * BFQ-v7r5 for 3.16.0: data structures and common functions prototypes.
++ * BFQ-v7r6 for 3.16.0: data structures and common functions prototypes.
+ *
+ * Based on ideas and code from CFQ:
+ * Copyright (C) 2003 Jens Axboe <axboe@kernel.dk>
@@ -6086,6 +6356,7 @@ index 0000000..a83e69d
+ * @dispatched: number of requests on the dispatch list or inside driver.
+ * @flags: status flags.
+ * @bfqq_list: node for active/idle bfqq list inside our bfqd.
++ * @burst_list_node: node for the device's burst list.
+ * @seek_samples: number of seeks sampled
+ * @seek_total: sum of the distances of the seeks sampled
+ * @seek_mean: mean seek distance
@@ -6146,6 +6417,8 @@ index 0000000..a83e69d
+
+ struct list_head bfqq_list;
+
++ struct hlist_node burst_list_node;
++
+ unsigned int seek_samples;
+ u64 seek_total;
+ sector_t seek_mean;
@@ -6298,22 +6571,38 @@ index 0000000..a83e69d
+ * again idling to a queue which was marked as
+ * non-I/O-bound (see the definition of the
+ * IO_bound flag for further details).
-+ * @bfq_wr_coeff: Maximum factor by which the weight of a weight-raised
-+ * queue is multiplied
-+ * @bfq_wr_max_time: maximum duration of a weight-raising period (jiffies)
-+ * @bfq_wr_rt_max_time: maximum duration for soft real-time processes
++ * @last_ins_in_burst: last time at which a queue entered the current
++ * burst of queues being activated shortly after
++ * each other; for more details about this and the
++ * following parameters related to a burst of
++ * activations, see the comments to the function
++ * @bfq_handle_burst.
++ * @bfq_burst_interval: reference time interval used to decide whether a
++ * queue has been activated shortly after
++ * @last_ins_in_burst.
++ * @burst_size: number of queues in the current burst of queue activations.
++ * @bfq_large_burst_thresh: maximum burst size above which the current
++ * queue-activation burst is deemed as 'large'.
++ * @large_burst: true if a large queue-activation burst is in progress.
++ * @burst_list: head of the burst list (as for the above fields, more details
++ * in the comments to the function bfq_handle_burst).
++ * @low_latency: if set to true, low-latency heuristics are enabled.
++ * @bfq_wr_coeff: maximum factor by which the weight of a weight-raised
++ * queue is multiplied.
++ * @bfq_wr_max_time: maximum duration of a weight-raising period (jiffies).
++ * @bfq_wr_rt_max_time: maximum duration for soft real-time processes.
+ * @bfq_wr_min_idle_time: minimum idle period after which weight-raising
-+ * may be reactivated for a queue (in jiffies)
++ * may be reactivated for a queue (in jiffies).
+ * @bfq_wr_min_inter_arr_async: minimum period between request arrivals
+ * after which weight-raising may be
+ * reactivated for an already busy queue
-+ * (in jiffies)
++ * (in jiffies).
+ * @bfq_wr_max_softrt_rate: max service-rate for a soft real-time queue,
-+ * sectors per seconds
++ * sectors per seconds.
+ * @RT_prod: cached value of the product R*T used for computing the maximum
-+ * duration of the weight raising automatically
-+ * @device_speed: device-speed class for the low-latency heuristic
-+ * @oom_bfqq: fallback dummy bfqq for extreme OOM conditions
++ * duration of the weight raising automatically.
++ * @device_speed: device-speed class for the low-latency heuristic.
++ * @oom_bfqq: fallback dummy bfqq for extreme OOM conditions.
+ *
+ * All the fields are protected by the @queue lock.
+ */
@@ -6377,6 +6666,13 @@ index 0000000..a83e69d
+ unsigned int bfq_failed_cooperations;
+ unsigned int bfq_requests_within_timer;
+
++ unsigned long last_ins_in_burst;
++ unsigned long bfq_burst_interval;
++ int burst_size;
++ unsigned long bfq_large_burst_thresh;
++ bool large_burst;
++ struct hlist_head burst_list;
++
+ bool low_latency;
+
+ /* parameters of the low_latency heuristics */
@@ -6406,6 +6702,10 @@ index 0000000..a83e69d
+ * having consumed at most 2/10 of
+ * its budget
+ */
++ BFQ_BFQQ_FLAG_in_large_burst, /*
++ * bfqq activated in a large burst,
++ * see comments to bfq_handle_burst.
++ */
+ BFQ_BFQQ_FLAG_constantly_seeky, /*
+ * bfqq has proved to be slow and
+ * seeky until budget timeout
@@ -6441,6 +6741,7 @@ index 0000000..a83e69d
+BFQ_BFQQ_FNS(sync);
+BFQ_BFQQ_FNS(budget_new);
+BFQ_BFQQ_FNS(IO_bound);
++BFQ_BFQQ_FNS(in_large_burst);
+BFQ_BFQQ_FNS(constantly_seeky);
+BFQ_BFQQ_FNS(coop);
+BFQ_BFQQ_FNS(split_coop);
@@ -6561,15 +6862,15 @@ index 0000000..a83e69d
+}
+
+static inline struct bfq_queue *bic_to_bfqq(struct bfq_io_cq *bic,
-+ int is_sync)
++ bool is_sync)
+{
-+ return bic->bfqq[!!is_sync];
++ return bic->bfqq[is_sync];
+}
+
+static inline void bic_set_bfqq(struct bfq_io_cq *bic,
-+ struct bfq_queue *bfqq, int is_sync)
++ struct bfq_queue *bfqq, bool is_sync)
+{
-+ bic->bfqq[!!is_sync] = bfqq;
++ bic->bfqq[is_sync] = bfqq;
+}
+
+static inline struct bfq_data *bic_to_bfqd(struct bfq_io_cq *bic)
@@ -6631,5 +6932,5 @@ index 0000000..a83e69d
+
+#endif /* _BFQ_H */
--
-2.0.3
+2.1.2
diff --git a/5003_BFQ-3-block-add-Early-Queue-Merge-EQM-v7r5-for-3.16.0.patch b/5003_BFQ-3-block-add-Early-Queue-Merge-EQM-v7r6-for-3.16.0.patch
similarity index 87%
rename from 5003_BFQ-3-block-add-Early-Queue-Merge-EQM-v7r5-for-3.16.0.patch
rename to 5003_BFQ-3-block-add-Early-Queue-Merge-EQM-v7r6-for-3.16.0.patch
index e606f5d..53e7c76 100644
--- a/5003_BFQ-3-block-add-Early-Queue-Merge-EQM-v7r5-for-3.16.0.patch
+++ b/5003_BFQ-3-block-add-Early-Queue-Merge-EQM-v7r6-for-3.16.0.patch
@@ -1,7 +1,7 @@
-From 5b290be286aa74051b4b77a216032b771ceadd23 Mon Sep 17 00:00:00 2001
+From 5428334e0390ccad40fa21dd046eb163025a4f74 Mon Sep 17 00:00:00 2001
From: Mauro Andreolini <mauro.andreolini@unimore.it>
-Date: Wed, 18 Jun 2014 17:38:07 +0200
-Subject: [PATCH 3/3] block, bfq: add Early Queue Merge (EQM) to BFQ-v7r5 for
+Date: Sun, 19 Oct 2014 01:15:59 +0200
+Subject: [PATCH 3/3] block, bfq: add Early Queue Merge (EQM) to BFQ-v7r6 for
3.16.0
A set of processes may happen to perform interleaved reads, i.e.,requests
@@ -34,13 +34,13 @@ Signed-off-by: Mauro Andreolini <mauro.andreolini@unimore.it>
Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
---
- block/bfq-iosched.c | 736 ++++++++++++++++++++++++++++++++++++----------------
+ block/bfq-iosched.c | 743 +++++++++++++++++++++++++++++++++++++---------------
block/bfq-sched.c | 28 --
- block/bfq.h | 46 +++-
- 3 files changed, 556 insertions(+), 254 deletions(-)
+ block/bfq.h | 54 +++-
+ 3 files changed, 573 insertions(+), 252 deletions(-)
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
-index 0a0891b..d1d8e67 100644
+index b919b03..bbfb4e1 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -571,6 +571,57 @@ static inline unsigned int bfq_wr_duration(struct bfq_data *bfqd)
@@ -64,7 +64,9 @@ index 0a0891b..d1d8e67 100644
+ bfq_mark_bfqq_IO_bound(bfqq);
+ else
+ bfq_clear_bfqq_IO_bound(bfqq);
++ /* Assuming that the flag in_large_burst is already correctly set */
+ if (bic->wr_time_left && bfqq->bfqd->low_latency &&
++ !bfq_bfqq_in_large_burst(bfqq) &&
+ bic->cooperations < bfqq->bfqd->bfq_coop_thresh) {
+ /*
+ * Start a weight raising period with the duration given by
@@ -85,9 +87,7 @@ index 0a0891b..d1d8e67 100644
+ bic->wr_time_left = 0;
+}
+
-+/*
-+ * Must be called with the queue_lock held.
-+ */
++/* Must be called with the queue_lock held. */
+static int bfqq_process_refs(struct bfq_queue *bfqq)
+{
+ int process_refs, io_refs;
@@ -98,23 +98,35 @@ index 0a0891b..d1d8e67 100644
+ return process_refs;
+}
+
- static void bfq_add_request(struct request *rq)
- {
- struct bfq_queue *bfqq = RQ_BFQQ(rq);
-@@ -602,8 +653,11 @@ static void bfq_add_request(struct request *rq)
+ /* Empty burst list and add just bfqq (see comments to bfq_handle_burst) */
+ static inline void bfq_reset_burst_list(struct bfq_data *bfqd,
+ struct bfq_queue *bfqq)
+@@ -815,7 +866,7 @@ static void bfq_add_request(struct request *rq)
+ bfq_rq_pos_tree_add(bfqd, bfqq);
if (!bfq_bfqq_busy(bfqq)) {
- int soft_rt = bfqd->bfq_wr_max_softrt_rate > 0 &&
-+ bfq_bfqq_cooperations(bfqq) < bfqd->bfq_coop_thresh &&
+- bool soft_rt,
++ bool soft_rt, coop_or_in_burst,
+ idle_for_long_time = time_is_before_jiffies(
+ bfqq->budget_timeout +
+ bfqd->bfq_wr_min_idle_time);
+@@ -839,11 +890,12 @@ static void bfq_add_request(struct request *rq)
+ bfqd->last_ins_in_burst = jiffies;
+ }
+
++ coop_or_in_burst = bfq_bfqq_in_large_burst(bfqq) ||
++ bfq_bfqq_cooperations(bfqq) >= bfqd->bfq_coop_thresh;
+ soft_rt = bfqd->bfq_wr_max_softrt_rate > 0 &&
+- !bfq_bfqq_in_large_burst(bfqq) &&
++ !coop_or_in_burst &&
time_is_before_jiffies(bfqq->soft_rt_next_start);
-- idle_for_long_time = time_is_before_jiffies(
-+ idle_for_long_time = bfq_bfqq_cooperations(bfqq) <
-+ bfqd->bfq_coop_thresh &&
-+ time_is_before_jiffies(
- bfqq->budget_timeout +
- bfqd->bfq_wr_min_idle_time);
+- interactive = !bfq_bfqq_in_large_burst(bfqq) &&
+- idle_for_long_time;
++ interactive = !coop_or_in_burst && idle_for_long_time;
entity->budget = max_t(unsigned long, bfqq->max_budget,
-@@ -624,11 +678,20 @@ static void bfq_add_request(struct request *rq)
+ bfq_serv_to_charge(next_rq, bfqq));
+
+@@ -862,11 +914,20 @@ static void bfq_add_request(struct request *rq)
if (!bfqd->low_latency)
goto add_bfqq_busy;
@@ -132,28 +144,22 @@ index 0a0891b..d1d8e67 100644
+ * requests have not been redirected to a shared queue)
+ * start a weight-raising period.
*/
-- if (old_wr_coeff == 1 && (idle_for_long_time || soft_rt)) {
-+ if (old_wr_coeff == 1 && (idle_for_long_time || soft_rt) &&
+- if (old_wr_coeff == 1 && (interactive || soft_rt)) {
++ if (old_wr_coeff == 1 && (interactive || soft_rt) &&
+ (!bfq_bfqq_sync(bfqq) || bfqq->bic != NULL)) {
bfqq->wr_coeff = bfqd->bfq_wr_coeff;
- if (idle_for_long_time)
+ if (interactive)
bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
-@@ -642,9 +705,11 @@ static void bfq_add_request(struct request *rq)
+@@ -880,7 +941,7 @@ static void bfq_add_request(struct request *rq)
} else if (old_wr_coeff > 1) {
- if (idle_for_long_time)
+ if (interactive)
bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
-- else if (bfqq->wr_cur_max_time ==
-- bfqd->bfq_wr_rt_max_time &&
-- !soft_rt) {
-+ else if (bfq_bfqq_cooperations(bfqq) >=
-+ bfqd->bfq_coop_thresh ||
-+ (bfqq->wr_cur_max_time ==
-+ bfqd->bfq_wr_rt_max_time &&
-+ !soft_rt)) {
- bfqq->wr_coeff = 1;
- bfq_log_bfqq(bfqd, bfqq,
- "wrais ending at %lu, rais_max_time %u",
-@@ -660,18 +725,18 @@ static void bfq_add_request(struct request *rq)
+- else if (bfq_bfqq_in_large_burst(bfqq) ||
++ else if (coop_or_in_burst ||
+ (bfqq->wr_cur_max_time ==
+ bfqd->bfq_wr_rt_max_time &&
+ !soft_rt)) {
+@@ -899,18 +960,18 @@ static void bfq_add_request(struct request *rq)
/*
*
* The remaining weight-raising time is lower
@@ -184,7 +190,7 @@ index 0a0891b..d1d8e67 100644
*
* In addition, the application is now meeting
* the requirements for being deemed soft rt.
-@@ -706,6 +771,7 @@ static void bfq_add_request(struct request *rq)
+@@ -945,6 +1006,7 @@ static void bfq_add_request(struct request *rq)
bfqd->bfq_wr_rt_max_time;
}
}
@@ -192,7 +198,7 @@ index 0a0891b..d1d8e67 100644
if (old_wr_coeff != bfqq->wr_coeff)
entity->ioprio_changed = 1;
add_bfqq_busy:
-@@ -918,90 +984,35 @@ static void bfq_end_wr(struct bfq_data *bfqd)
+@@ -1156,90 +1218,35 @@ static void bfq_end_wr(struct bfq_data *bfqd)
spin_unlock_irq(bfqd->queue->queue_lock);
}
@@ -297,7 +303,7 @@ index 0a0891b..d1d8e67 100644
if (RB_EMPTY_ROOT(root))
return NULL;
-@@ -1020,7 +1031,7 @@ static struct bfq_queue *bfqq_close(struct bfq_data *bfqd)
+@@ -1258,7 +1265,7 @@ static struct bfq_queue *bfqq_close(struct bfq_data *bfqd)
* next_request position).
*/
__bfqq = rb_entry(parent, struct bfq_queue, pos_node);
@@ -306,7 +312,7 @@ index 0a0891b..d1d8e67 100644
return __bfqq;
if (blk_rq_pos(__bfqq->next_rq) < sector)
-@@ -1031,7 +1042,7 @@ static struct bfq_queue *bfqq_close(struct bfq_data *bfqd)
+@@ -1269,7 +1276,7 @@ static struct bfq_queue *bfqq_close(struct bfq_data *bfqd)
return NULL;
__bfqq = rb_entry(node, struct bfq_queue, pos_node);
@@ -315,7 +321,7 @@ index 0a0891b..d1d8e67 100644
return __bfqq;
return NULL;
-@@ -1040,14 +1051,12 @@ static struct bfq_queue *bfqq_close(struct bfq_data *bfqd)
+@@ -1278,14 +1285,12 @@ static struct bfq_queue *bfqq_close(struct bfq_data *bfqd)
/*
* bfqd - obvious
* cur_bfqq - passed in so that we don't decide that the current queue
@@ -334,7 +340,7 @@ index 0a0891b..d1d8e67 100644
{
struct bfq_queue *bfqq;
-@@ -1067,7 +1076,7 @@ static struct bfq_queue *bfq_close_cooperator(struct bfq_data *bfqd,
+@@ -1305,7 +1310,7 @@ static struct bfq_queue *bfq_close_cooperator(struct bfq_data *bfqd,
* working closely on the same area of the disk. In that case,
* we can group them together and don't waste time idling.
*/
@@ -343,7 +349,7 @@ index 0a0891b..d1d8e67 100644
if (bfqq == NULL || bfqq == cur_bfqq)
return NULL;
-@@ -1094,6 +1103,305 @@ static struct bfq_queue *bfq_close_cooperator(struct bfq_data *bfqd,
+@@ -1332,6 +1337,307 @@ static struct bfq_queue *bfq_close_cooperator(struct bfq_data *bfqd,
return bfqq;
}
@@ -508,6 +514,8 @@ index 0a0891b..d1d8e67 100644
+ bfqq->bic->wr_time_left = 0;
+ bfqq->bic->saved_idle_window = bfq_bfqq_idle_window(bfqq);
+ bfqq->bic->saved_IO_bound = bfq_bfqq_IO_bound(bfqq);
++ bfqq->bic->saved_in_large_burst = bfq_bfqq_in_large_burst(bfqq);
++ bfqq->bic->was_in_burst_list = !hlist_unhashed(&bfqq->burst_list_node);
+ bfqq->bic->cooperations++;
+ bfqq->bic->failed_cooperations = 0;
+}
@@ -649,13 +657,11 @@ index 0a0891b..d1d8e67 100644
/*
* If enough samples have been computed, return the current max budget
* stored in bfqd, which is dynamically updated according to the
-@@ -1237,63 +1545,6 @@ static struct request *bfq_check_fifo(struct bfq_queue *bfqq)
+@@ -1475,61 +1781,6 @@ static struct request *bfq_check_fifo(struct bfq_queue *bfqq)
return rq;
}
--/*
-- * Must be called with the queue_lock held.
-- */
+-/* Must be called with the queue_lock held. */
-static int bfqq_process_refs(struct bfq_queue *bfqq)
-{
- int process_refs, io_refs;
@@ -713,7 +719,7 @@ index 0a0891b..d1d8e67 100644
static inline unsigned long bfq_bfqq_budget_left(struct bfq_queue *bfqq)
{
struct bfq_entity *entity = &bfqq->entity;
-@@ -2011,7 +2262,7 @@ static inline bool bfq_bfqq_must_idle(struct bfq_queue *bfqq)
+@@ -2263,7 +2514,7 @@ static inline bool bfq_bfqq_must_idle(struct bfq_queue *bfqq)
*/
static struct bfq_queue *bfq_select_queue(struct bfq_data *bfqd)
{
@@ -722,7 +728,7 @@ index 0a0891b..d1d8e67 100644
struct request *next_rq;
enum bfqq_expiration reason = BFQ_BFQQ_BUDGET_TIMEOUT;
-@@ -2021,17 +2272,6 @@ static struct bfq_queue *bfq_select_queue(struct bfq_data *bfqd)
+@@ -2273,17 +2524,6 @@ static struct bfq_queue *bfq_select_queue(struct bfq_data *bfqd)
bfq_log_bfqq(bfqd, bfqq, "select_queue: already in-service queue");
@@ -740,7 +746,7 @@ index 0a0891b..d1d8e67 100644
if (bfq_may_expire_for_budg_timeout(bfqq) &&
!timer_pending(&bfqd->idle_slice_timer) &&
!bfq_bfqq_must_idle(bfqq))
-@@ -2070,10 +2310,7 @@ static struct bfq_queue *bfq_select_queue(struct bfq_data *bfqd)
+@@ -2322,10 +2562,7 @@ static struct bfq_queue *bfq_select_queue(struct bfq_data *bfqd)
bfq_clear_bfqq_wait_request(bfqq);
del_timer(&bfqd->idle_slice_timer);
}
@@ -752,7 +758,7 @@ index 0a0891b..d1d8e67 100644
}
}
-@@ -2082,40 +2319,30 @@ static struct bfq_queue *bfq_select_queue(struct bfq_data *bfqd)
+@@ -2334,40 +2571,30 @@ static struct bfq_queue *bfq_select_queue(struct bfq_data *bfqd)
* in flight (possibly waiting for a completion) or is idling for a
* new request, then keep it.
*/
@@ -800,25 +806,25 @@ index 0a0891b..d1d8e67 100644
jiffies_to_msecs(bfqq->wr_cur_max_time),
bfqq->wr_coeff,
bfqq->entity.weight, bfqq->entity.orig_weight);
-@@ -2124,11 +2351,15 @@ static void bfq_update_wr_data(struct bfq_data *bfqd,
+@@ -2376,12 +2603,16 @@ static void bfq_update_wr_data(struct bfq_data *bfqd,
entity->orig_weight * bfqq->wr_coeff);
if (entity->ioprio_changed)
bfq_log_bfqq(bfqd, bfqq, "WARN: pending prio change");
+
/*
- * If too much time has elapsed from the beginning
-- * of this weight-raising, stop it.
+ * If the queue was activated in a burst, or
+ * too much time has elapsed from the beginning
+- * of this weight-raising, then end weight raising.
+ * of this weight-raising period, or the queue has
+ * exceeded the acceptable number of cooperations,
-+ * stop it.
++ * then end weight raising.
*/
-- if (time_is_before_jiffies(bfqq->last_wr_start_finish +
-+ if (bfq_bfqq_cooperations(bfqq) >= bfqd->bfq_coop_thresh ||
-+ time_is_before_jiffies(bfqq->last_wr_start_finish +
+ if (bfq_bfqq_in_large_burst(bfqq) ||
++ bfq_bfqq_cooperations(bfqq) >= bfqd->bfq_coop_thresh ||
+ time_is_before_jiffies(bfqq->last_wr_start_finish +
bfqq->wr_cur_max_time)) {
bfqq->last_wr_start_finish = jiffies;
- bfq_log_bfqq(bfqd, bfqq,
-@@ -2136,11 +2367,13 @@ static void bfq_update_wr_data(struct bfq_data *bfqd,
+@@ -2390,11 +2621,13 @@ static void bfq_update_wr_data(struct bfq_data *bfqd,
bfqq->last_wr_start_finish,
jiffies_to_msecs(bfqq->wr_cur_max_time));
bfq_bfqq_end_wr(bfqq);
@@ -835,7 +841,7 @@ index 0a0891b..d1d8e67 100644
}
/*
-@@ -2377,6 +2610,25 @@ static inline void bfq_init_icq(struct io_cq *icq)
+@@ -2642,6 +2875,25 @@ static inline void bfq_init_icq(struct io_cq *icq)
struct bfq_io_cq *bic = icq_to_bic(icq);
bic->ttime.last_end_request = jiffies;
@@ -861,7 +867,7 @@ index 0a0891b..d1d8e67 100644
}
static void bfq_exit_icq(struct io_cq *icq)
-@@ -2390,6 +2642,13 @@ static void bfq_exit_icq(struct io_cq *icq)
+@@ -2655,6 +2907,13 @@ static void bfq_exit_icq(struct io_cq *icq)
}
if (bic->bfqq[BLK_RW_SYNC]) {
@@ -875,7 +881,7 @@ index 0a0891b..d1d8e67 100644
bfq_exit_bfqq(bfqd, bic->bfqq[BLK_RW_SYNC]);
bic->bfqq[BLK_RW_SYNC] = NULL;
}
-@@ -2678,6 +2937,10 @@ static void bfq_update_idle_window(struct bfq_data *bfqd,
+@@ -2944,6 +3203,10 @@ static void bfq_update_idle_window(struct bfq_data *bfqd,
if (!bfq_bfqq_sync(bfqq) || bfq_class_idle(bfqq))
return;
@@ -886,7 +892,7 @@ index 0a0891b..d1d8e67 100644
enable_idle = bfq_bfqq_idle_window(bfqq);
if (atomic_read(&bic->icq.ioc->active_ref) == 0 ||
-@@ -2725,6 +2988,7 @@ static void bfq_rq_enqueued(struct bfq_data *bfqd, struct bfq_queue *bfqq,
+@@ -2991,6 +3254,7 @@ static void bfq_rq_enqueued(struct bfq_data *bfqd, struct bfq_queue *bfqq,
if (bfqq->entity.service > bfq_max_budget(bfqd) / 8 ||
!BFQQ_SEEKY(bfqq))
bfq_update_idle_window(bfqd, bfqq, bic);
@@ -894,7 +900,7 @@ index 0a0891b..d1d8e67 100644
bfq_log_bfqq(bfqd, bfqq,
"rq_enqueued: idle_window=%d (seeky %d, mean %llu)",
-@@ -2785,13 +3049,49 @@ static void bfq_rq_enqueued(struct bfq_data *bfqd, struct bfq_queue *bfqq,
+@@ -3051,13 +3315,49 @@ static void bfq_rq_enqueued(struct bfq_data *bfqd, struct bfq_queue *bfqq,
static void bfq_insert_request(struct request_queue *q, struct request *rq)
{
struct bfq_data *bfqd = q->elevator->elevator_data;
@@ -945,7 +951,7 @@ index 0a0891b..d1d8e67 100644
rq->fifo_time = jiffies + bfqd->bfq_fifo_expire[rq_is_sync(rq)];
list_add_tail(&rq->queuelist, &bfqq->fifo);
-@@ -2956,18 +3256,6 @@ static void bfq_put_request(struct request *rq)
+@@ -3222,18 +3522,6 @@ static void bfq_put_request(struct request *rq)
}
}
@@ -964,7 +970,7 @@ index 0a0891b..d1d8e67 100644
/*
* Returns NULL if a new bfqq should be allocated, or the old bfqq if this
* was the last process referring to said bfqq.
-@@ -2976,6 +3264,9 @@ static struct bfq_queue *
+@@ -3242,6 +3530,9 @@ static struct bfq_queue *
bfq_split_bfqq(struct bfq_io_cq *bic, struct bfq_queue *bfqq)
{
bfq_log_bfqq(bfqq->bfqd, bfqq, "splitting queue");
@@ -974,7 +980,7 @@ index 0a0891b..d1d8e67 100644
if (bfqq_process_refs(bfqq) == 1) {
bfqq->pid = current->pid;
bfq_clear_bfqq_coop(bfqq);
-@@ -3004,6 +3295,7 @@ static int bfq_set_request(struct request_queue *q, struct request *rq,
+@@ -3270,6 +3561,7 @@ static int bfq_set_request(struct request_queue *q, struct request *rq,
struct bfq_queue *bfqq;
struct bfq_group *bfqg;
unsigned long flags;
@@ -982,9 +988,21 @@ index 0a0891b..d1d8e67 100644
might_sleep_if(gfp_mask & __GFP_WAIT);
-@@ -3022,24 +3314,14 @@ new_queue:
+@@ -3287,25 +3579,26 @@ new_queue:
+ if (bfqq == NULL || bfqq == &bfqd->oom_bfqq) {
bfqq = bfq_get_queue(bfqd, bfqg, is_sync, bic, gfp_mask);
bic_set_bfqq(bic, bfqq, is_sync);
++ if (split && is_sync) {
++ if ((bic->was_in_burst_list && bfqd->large_burst) ||
++ bic->saved_in_large_burst)
++ bfq_mark_bfqq_in_large_burst(bfqq);
++ else {
++ bfq_clear_bfqq_in_large_burst(bfqq);
++ if (bic->was_in_burst_list)
++ hlist_add_head(&bfqq->burst_list_node,
++ &bfqd->burst_list);
++ }
++ }
} else {
- /*
- * If the queue was seeky for too long, break it apart.
@@ -1009,7 +1027,7 @@ index 0a0891b..d1d8e67 100644
}
bfqq->allocated[rw]++;
-@@ -3050,6 +3332,26 @@ new_queue:
+@@ -3316,6 +3609,26 @@ new_queue:
rq->elv.priv[0] = bic;
rq->elv.priv[1] = bfqq;
@@ -1076,10 +1094,10 @@ index c4831b7..546a254 100644
{
if (bfqd->in_service_bic != NULL) {
diff --git a/block/bfq.h b/block/bfq.h
-index a83e69d..ebbd040 100644
+index 0378c86..93a2d24 100644
--- a/block/bfq.h
+++ b/block/bfq.h
-@@ -215,18 +215,21 @@ struct bfq_group;
+@@ -216,18 +216,21 @@ struct bfq_group;
* idle @bfq_queue with no outstanding requests, then
* the task associated with the queue it is deemed as
* soft real-time (see the comments to the function
@@ -1107,7 +1125,7 @@ index a83e69d..ebbd040 100644
* All the fields are protected by the queue lock of the containing bfqd.
*/
struct bfq_queue {
-@@ -264,6 +267,7 @@ struct bfq_queue {
+@@ -267,6 +270,7 @@ struct bfq_queue {
unsigned int requests_within_timer;
pid_t pid;
@@ -1115,7 +1133,7 @@ index a83e69d..ebbd040 100644
/* weight-raising fields */
unsigned long wr_cur_max_time;
-@@ -293,12 +297,34 @@ struct bfq_ttime {
+@@ -296,12 +300,42 @@ struct bfq_ttime {
* @icq: associated io_cq structure
* @bfqq: array of two process queues, the sync and the async
* @ttime: associated @bfq_ttime struct
@@ -1130,6 +1148,11 @@ index a83e69d..ebbd040 100644
+ * window
+ * @saved_IO_bound: same purpose as the previous two fields for the I/O
+ * bound classification of a queue
++ * @saved_in_large_burst: same purpose as the previous fields for the
++ * value of the field keeping the queue's belonging
++ * to a large burst
++ * @was_in_burst_list: true if the queue belonged to a burst list
++ * before its merge with another cooperating queue
+ * @cooperations: counter of consecutive successful queue merges underwent
+ * by any of the process' @bfq_queues
+ * @failed_cooperations: counter of consecutive failed queue merges of any
@@ -1142,15 +1165,18 @@ index a83e69d..ebbd040 100644
int ioprio;
+
+ unsigned int wr_time_left;
-+ unsigned int saved_idle_window;
-+ unsigned int saved_IO_bound;
++ bool saved_idle_window;
++ bool saved_IO_bound;
++
++ bool saved_in_large_burst;
++ bool was_in_burst_list;
+
+ unsigned int cooperations;
+ unsigned int failed_cooperations;
};
enum bfq_device_speed {
-@@ -511,7 +537,7 @@ enum bfqq_state_flags {
+@@ -537,7 +571,7 @@ enum bfqq_state_flags {
BFQ_BFQQ_FLAG_prio_changed, /* task priority has changed */
BFQ_BFQQ_FLAG_sync, /* synchronous queue */
BFQ_BFQQ_FLAG_budget_new, /* no completion with this budget */
@@ -1159,7 +1185,7 @@ index a83e69d..ebbd040 100644
* bfqq has timed-out at least once
* having consumed at most 2/10 of
* its budget
-@@ -520,12 +546,13 @@ enum bfqq_state_flags {
+@@ -550,12 +584,13 @@ enum bfqq_state_flags {
* bfqq has proved to be slow and
* seeky until budget timeout
*/
@@ -1175,7 +1201,7 @@ index a83e69d..ebbd040 100644
};
#define BFQ_BFQQ_FNS(name) \
-@@ -554,6 +581,7 @@ BFQ_BFQQ_FNS(IO_bound);
+@@ -585,6 +620,7 @@ BFQ_BFQQ_FNS(in_large_burst);
BFQ_BFQQ_FNS(constantly_seeky);
BFQ_BFQQ_FNS(coop);
BFQ_BFQQ_FNS(split_coop);
@@ -1184,5 +1210,5 @@ index a83e69d..ebbd040 100644
#undef BFQ_BFQQ_FNS
--
-2.0.3
+2.1.2
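For readers tracing the EQM changes above, the new rule that ends weight raising in bfq_update_wr_data() can be condensed into a stand-alone predicate. This is an illustrative sketch only, not kernel code: the helper name and the plain jiffies comparison stand in for bfq_bfqq_in_large_burst(), bfq_bfqq_cooperations() and time_is_before_jiffies().

```c
#include <stdbool.h>

/* Illustrative model of the weight-raising termination rule added by the
 * v7r6 EQM patch: raising ends if the queue was activated in a large
 * burst, has been merged too many times, or its raising period expired.
 * Argument names mirror the bfq_queue/bfq_data fields they stand for. */
static bool bfq_should_end_wr(bool in_large_burst,
                              unsigned int cooperations,
                              unsigned int coop_thresh,
                              unsigned long now,
                              unsigned long last_wr_start_finish,
                              unsigned long wr_cur_max_time)
{
	return in_large_burst ||
	       cooperations >= coop_thresh ||
	       now > last_wr_start_finish + wr_cur_max_time; /* time_is_before_jiffies() analogue */
}
```

A similar combined burst/cooperation test also appears in bfq_add_request() when choosing the next wr_cur_max_time.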
* [gentoo-commits] proj/linux-patches:3.16 commit in: /
@ 2014-11-29 18:05 Mike Pagano
From: Mike Pagano @ 2014-11-29 18:05 UTC
To: gentoo-commits
commit: fece5ecf1633709a681cc9b0bca7897a3ec477e1
Author: Mike Pagano <mpagano <AT> gentoo <DOT> org>
AuthorDate: Sat Nov 29 18:04:53 2014 +0000
Commit: Mike Pagano <mpagano <AT> gentoo <DOT> org>
CommitDate: Sat Nov 29 18:04:53 2014 +0000
URL: http://sources.gentoo.org/gitweb/?p=proj/linux-patches.git;a=commit;h=fece5ecf
Merge branch '3.16' of git+ssh://git.overlays.gentoo.org/proj/linux-patches into 3.16
update readme file
0000_README | 4 +
1006_linux-3.16.7.patch | 6873 +++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 6877 insertions(+)
* [gentoo-commits] proj/linux-patches:3.16 commit in: /
@ 2014-11-29 18:05 Mike Pagano
From: Mike Pagano @ 2014-11-29 18:05 UTC
To: gentoo-commits
commit: 962dfa012d1b748e4df287f9ba85609a57d18345
Author: Mike Pagano <mpagano <AT> gentoo <DOT> org>
AuthorDate: Sat Nov 29 18:04:32 2014 +0000
Commit: Mike Pagano <mpagano <AT> gentoo <DOT> org>
CommitDate: Sat Nov 29 18:04:32 2014 +0000
URL: http://sources.gentoo.org/gitweb/?p=proj/linux-patches.git;a=commit;h=962dfa01
Update readme
---
0000_README | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/0000_README b/0000_README
index a7526a7..b532df4 100644
--- a/0000_README
+++ b/0000_README
@@ -102,17 +102,17 @@ Patch: 5000_enable-additional-cpu-optimizations-for-gcc.patch
From: https://github.com/graysky2/kernel_gcc_patch/
Desc: Kernel patch enables gcc optimizations for additional CPUs.
-Patch: 5001_BFQ-1-block-cgroups-kconfig-build-bits-for-v7r5-3.16.patch
+Patch: 5001_BFQ-1-block-cgroups-kconfig-build-bits-for-v7r6-3.16.patch
From: http://algo.ing.unimo.it/people/paolo/disk_sched/
-Desc: BFQ v7r5 patch 1 for 3.16: Build, cgroups and kconfig bits
+Desc: BFQ v7r6 patch 1 for 3.16: Build, cgroups and kconfig bits
-Patch: 5002_BFQ-2-block-introduce-the-v7r5-I-O-sched-for-3.16.patch1
+Patch: 5002_BFQ-2-block-introduce-the-v7r6-I-O-sched-for-3.16.patch1
From: http://algo.ing.unimo.it/people/paolo/disk_sched/
-Desc: BFQ v7r5 patch 2 for 3.16: BFQ Scheduler
+Desc: BFQ v7r6 patch 2 for 3.16: BFQ Scheduler
-Patch: 5003_BFQ-3-block-add-Early-Queue-Merge-EQM-v7r5-for-3.16.0.patch
+Patch: 5003_BFQ-3-block-add-Early-Queue-Merge-EQM-v7r6-for-3.16.0.patch
From: http://algo.ing.unimo.it/people/paolo/disk_sched/
-Desc: BFQ v7r5 patch 3 for 3.16: Early Queue Merge (EQM)
+Desc: BFQ v7r6 patch 3 for 3.16: Early Queue Merge (EQM)
Patch: 5010_multipath-tcp-v3.16-872d7f6c6f4e.patch
From: http://multipath-tcp.org/
* [gentoo-commits] proj/linux-patches:3.16 commit in: /
@ 2014-11-29 18:11 Mike Pagano
From: Mike Pagano @ 2014-11-29 18:11 UTC
To: gentoo-commits
commit: 3c8127d4ebd36a23547beb8064cbedc12447d782
Author: Mike Pagano <mpagano <AT> gentoo <DOT> org>
AuthorDate: Sat Nov 29 18:11:33 2014 +0000
Commit: Mike Pagano <mpagano <AT> gentoo <DOT> org>
CommitDate: Sat Nov 29 18:11:33 2014 +0000
URL: http://sources.gentoo.org/gitweb/?p=proj/linux-patches.git;a=commit;h=3c8127d4
Update multipath patch
---
0000_README | 2 +-
... => 5010_multipath-tcp-v3.16-075df3a63833.patch | 328 +++++++++++++++++++--
2 files changed, 312 insertions(+), 18 deletions(-)
diff --git a/0000_README b/0000_README
index 0ab3968..8719a11 100644
--- a/0000_README
+++ b/0000_README
@@ -118,7 +118,7 @@ Patch: 5003_BFQ-3-block-add-Early-Queue-Merge-EQM-v7r6-for-3.16.0.patch
From: http://algo.ing.unimo.it/people/paolo/disk_sched/
Desc: BFQ v7r6 patch 3 for 3.16: Early Queue Merge (EQM)
-Patch: 5010_multipath-tcp-v3.16-872d7f6c6f4e.patch
+Patch: 5010_multipath-tcp-v3.16-075df3a63833.patch
From: http://multipath-tcp.org/
Desc: Patch for simultaneous use of several IP-addresses/interfaces in TCP for better resource utilization, better throughput and smoother reaction to failures.
diff --git a/5010_multipath-tcp-v3.16-872d7f6c6f4e.patch b/5010_multipath-tcp-v3.16-075df3a63833.patch
similarity index 98%
rename from 5010_multipath-tcp-v3.16-872d7f6c6f4e.patch
rename to 5010_multipath-tcp-v3.16-075df3a63833.patch
index 3000da3..7520b4a 100644
--- a/5010_multipath-tcp-v3.16-872d7f6c6f4e.patch
+++ b/5010_multipath-tcp-v3.16-075df3a63833.patch
@@ -2572,10 +2572,10 @@ index 4db3c2a1679c..04cb17d4b0ce 100644
goto drop;
diff --git a/net/ipv4/Kconfig b/net/ipv4/Kconfig
-index 05c57f0fcabe..630434db0085 100644
+index 05c57f0fcabe..811286a6aa9c 100644
--- a/net/ipv4/Kconfig
+++ b/net/ipv4/Kconfig
-@@ -556,6 +556,30 @@ config TCP_CONG_ILLINOIS
+@@ -556,6 +556,38 @@ config TCP_CONG_ILLINOIS
For further details see:
http://www.ews.uiuc.edu/~shaoliu/tcpillinois/index.html
@@ -2603,10 +2603,18 @@ index 05c57f0fcabe..630434db0085 100644
+ wVegas congestion control for MPTCP
+ To enable it, just put 'wvegas' in tcp_congestion_control
+
++config TCP_CONG_BALIA
++ tristate "MPTCP BALIA CONGESTION CONTROL"
++ depends on MPTCP
++ default n
++ ---help---
++ Multipath TCP Balanced Linked Adaptation Congestion Control
++ To enable it, just put 'balia' in tcp_congestion_control
++
choice
prompt "Default TCP congestion control"
default DEFAULT_CUBIC
-@@ -584,6 +608,15 @@ choice
+@@ -584,6 +616,18 @@ choice
config DEFAULT_WESTWOOD
bool "Westwood" if TCP_CONG_WESTWOOD=y
@@ -2619,15 +2627,19 @@ index 05c57f0fcabe..630434db0085 100644
+ config DEFAULT_WVEGAS
+ bool "Wvegas" if TCP_CONG_WVEGAS=y
+
++ config DEFAULT_BALIA
++ bool "Balia" if TCP_CONG_BALIA=y
++
config DEFAULT_RENO
bool "Reno"
-@@ -605,6 +638,8 @@ config DEFAULT_TCP_CONG
+@@ -605,6 +649,9 @@ config DEFAULT_TCP_CONG
default "vegas" if DEFAULT_VEGAS
default "westwood" if DEFAULT_WESTWOOD
default "veno" if DEFAULT_VENO
+ default "coupled" if DEFAULT_COUPLED
+ default "wvegas" if DEFAULT_WVEGAS
++ default "balia" if DEFAULT_BALIA
default "reno" if DEFAULT_RENO
default "cubic"
@@ -7087,10 +7099,10 @@ index 000000000000..cdfc03adabf8
+
diff --git a/net/mptcp/Makefile b/net/mptcp/Makefile
new file mode 100644
-index 000000000000..35561a7012e3
+index 000000000000..2feb3e873206
--- /dev/null
+++ b/net/mptcp/Makefile
-@@ -0,0 +1,20 @@
+@@ -0,0 +1,21 @@
+#
+## Makefile for MultiPath TCP support code.
+#
@@ -7104,6 +7116,7 @@ index 000000000000..35561a7012e3
+obj-$(CONFIG_TCP_CONG_COUPLED) += mptcp_coupled.o
+obj-$(CONFIG_TCP_CONG_OLIA) += mptcp_olia.o
+obj-$(CONFIG_TCP_CONG_WVEGAS) += mptcp_wvegas.o
++obj-$(CONFIG_TCP_CONG_BALIA) += mptcp_balia.o
+obj-$(CONFIG_MPTCP_FULLMESH) += mptcp_fullmesh.o
+obj-$(CONFIG_MPTCP_NDIFFPORTS) += mptcp_ndiffports.o
+obj-$(CONFIG_MPTCP_BINDER) += mptcp_binder.o
@@ -7111,6 +7124,279 @@ index 000000000000..35561a7012e3
+
+mptcp-$(subst m,y,$(CONFIG_IPV6)) += mptcp_ipv6.o
+
+diff --git a/net/mptcp/mptcp_balia.c b/net/mptcp/mptcp_balia.c
+new file mode 100644
+index 000000000000..5cc224d80b01
+--- /dev/null
++++ b/net/mptcp/mptcp_balia.c
+@@ -0,0 +1,267 @@
++/*
++ * MPTCP implementation - Balia Congestion Control
++ * (Balanced Linked Adaptation Algorithm)
++ *
++ * Analysis, Design and Implementation:
++ * Qiuyu Peng <qpeng@caltech.edu>
++ * Anwar Walid <anwar@research.bell-labs.com>
++ * Jaehyun Hwang <jh.hwang@alcatel-lucent.com>
++ * Steven H. Low <slow@caltech.edu>
++ *
++ * This program is free software; you can redistribute it and/or
++ * modify it under the terms of the GNU General Public License
++ * as published by the Free Software Foundation; either version
++ * 2 of the License, or (at your option) any later version.
++ */
++
++#include <net/tcp.h>
++#include <net/mptcp.h>
++
++#include <linux/module.h>
++
++/* The variable 'rate' (i.e., x_r) will be scaled down
++ * e.g., from B/s to KB/s, MB/s, or GB/s
++ * if max_rate > 2^rate_scale_limit
++ */
++
++static int rate_scale_limit = 30;
++static int scale_num = 10;
++
++struct mptcp_balia {
++ u64 ai;
++ u64 md;
++ bool forced_update;
++};
++
++static inline int mptcp_balia_sk_can_send(const struct sock *sk)
++{
++ return mptcp_sk_can_send(sk) && tcp_sk(sk)->srtt_us;
++}
++
++static inline u64 mptcp_get_ai(const struct sock *meta_sk)
++{
++ return ((struct mptcp_balia *)inet_csk_ca(meta_sk))->ai;
++}
++
++static inline void mptcp_set_ai(const struct sock *meta_sk, u64 ai)
++{
++ ((struct mptcp_balia *)inet_csk_ca(meta_sk))->ai = ai;
++}
++
++static inline u64 mptcp_get_md(const struct sock *meta_sk)
++{
++ return ((struct mptcp_balia *)inet_csk_ca(meta_sk))->md;
++}
++
++static inline void mptcp_set_md(const struct sock *meta_sk, u64 md)
++{
++ ((struct mptcp_balia *)inet_csk_ca(meta_sk))->md = md;
++}
++
++static inline u64 mptcp_balia_scale(u64 val, int scale)
++{
++ return (u64) val << scale;
++}
++
++static inline bool mptcp_get_forced(const struct sock *meta_sk)
++{
++ return ((struct mptcp_balia *)inet_csk_ca(meta_sk))->forced_update;
++}
++
++static inline void mptcp_set_forced(const struct sock *meta_sk, bool force)
++{
++ ((struct mptcp_balia *)inet_csk_ca(meta_sk))->forced_update = force;
++}
++
++static void mptcp_balia_recalc_ai(const struct sock *sk)
++{
++ const struct tcp_sock *tp = tcp_sk(sk);
++ const struct mptcp_cb *mpcb = tp->mpcb;
++ const struct sock *sub_sk;
++ int can_send = 0;
++ u64 max_rate = 0, rate = 0, sum_rate = 0;
++ u64 alpha = 0, ai = 0, md = 0;
++ int num_scale_down = 0;
++
++ if (!mpcb)
++ return;
++
++ /* Only one subflow left - fall back to normal reno-behavior */
++ if (mpcb->cnt_established <= 1)
++ goto exit;
++
++ /* Find max_rate first */
++ mptcp_for_each_sk(mpcb, sub_sk) {
++ struct tcp_sock *sub_tp = tcp_sk(sub_sk);
++ u64 tmp;
++
++ if (!mptcp_balia_sk_can_send(sub_sk))
++ continue;
++
++ can_send++;
++
++ tmp = div_u64((u64)tp->mss_cache * sub_tp->snd_cwnd
++ * (USEC_PER_SEC << 3), sub_tp->srtt_us);
++ sum_rate += tmp;
++
++ if (tmp >= max_rate)
++ max_rate = tmp;
++ }
++
++ /* No subflow is able to send - we don't care anymore */
++ if (unlikely(!can_send))
++ goto exit;
++
++ rate = div_u64((u64)tp->mss_cache * tp->snd_cwnd *
++ (USEC_PER_SEC << 3), tp->srtt_us);
++ alpha = div64_u64(max_rate, rate);
++
++ /* Scale down max_rate from B/s to KB/s, MB/s, or GB/s
++ * if max_rate is too high (i.e., >2^30)
++ */
++ while (max_rate > mptcp_balia_scale(1, rate_scale_limit)) {
++ max_rate >>= scale_num;
++ num_scale_down++;
++ }
++
++ if (num_scale_down) {
++ sum_rate = 0;
++ mptcp_for_each_sk(mpcb, sub_sk) {
++ struct tcp_sock *sub_tp = tcp_sk(sub_sk);
++ u64 tmp;
++
++ tmp = div_u64((u64)tp->mss_cache * sub_tp->snd_cwnd
++ * (USEC_PER_SEC << 3), sub_tp->srtt_us);
++ tmp >>= (scale_num * num_scale_down);
++
++ sum_rate += tmp;
++ }
++ rate >>= (scale_num * num_scale_down);
++ }
++
++ /* (sum_rate)^2 * 10 * w_r
++ * ai = ------------------------------------
++ * (x_r + max_rate) * (4x_r + max_rate)
++ */
++ sum_rate *= sum_rate;
++
++ ai = div64_u64(sum_rate * 10, rate + max_rate);
++ ai = div64_u64(ai * tp->snd_cwnd, (rate << 2) + max_rate);
++
++ if (unlikely(!ai))
++ ai = tp->snd_cwnd;
++
++ md = ((tp->snd_cwnd >> 1) * min(mptcp_balia_scale(alpha, scale_num),
++ mptcp_balia_scale(3, scale_num) >> 1))
++ >> scale_num;
++
++exit:
++ mptcp_set_ai(sk, ai);
++ mptcp_set_md(sk, md);
++}
++
++static void mptcp_balia_init(struct sock *sk)
++{
++ if (mptcp(tcp_sk(sk))) {
++ mptcp_set_forced(sk, 0);
++ mptcp_set_ai(sk, 0);
++ mptcp_set_md(sk, 0);
++ }
++}
++
++static void mptcp_balia_cwnd_event(struct sock *sk, enum tcp_ca_event event)
++{
++ if (event == CA_EVENT_COMPLETE_CWR || event == CA_EVENT_LOSS)
++ mptcp_balia_recalc_ai(sk);
++}
++
++static void mptcp_balia_set_state(struct sock *sk, u8 ca_state)
++{
++ if (!mptcp(tcp_sk(sk)))
++ return;
++
++ mptcp_set_forced(sk, 1);
++}
++
++static void mptcp_balia_cong_avoid(struct sock *sk, u32 ack, u32 acked)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ const struct mptcp_cb *mpcb = tp->mpcb;
++ int snd_cwnd;
++
++ if (!mptcp(tp)) {
++ tcp_reno_cong_avoid(sk, ack, acked);
++ return;
++ }
++
++ if (!tcp_is_cwnd_limited(sk))
++ return;
++
++ if (tp->snd_cwnd <= tp->snd_ssthresh) {
++ /* In "safe" area, increase. */
++ tcp_slow_start(tp, acked);
++ mptcp_balia_recalc_ai(sk);
++ return;
++ }
++
++ if (mptcp_get_forced(mptcp_meta_sk(sk))) {
++ mptcp_balia_recalc_ai(sk);
++ mptcp_set_forced(sk, 0);
++ }
++
++ if (mpcb->cnt_established > 1)
++ snd_cwnd = (int) mptcp_get_ai(sk);
++ else
++ snd_cwnd = tp->snd_cwnd;
++
++ if (tp->snd_cwnd_cnt >= snd_cwnd) {
++ if (tp->snd_cwnd < tp->snd_cwnd_clamp) {
++ tp->snd_cwnd++;
++ mptcp_balia_recalc_ai(sk);
++ }
++
++ tp->snd_cwnd_cnt = 0;
++ } else {
++ tp->snd_cwnd_cnt++;
++ }
++}
++
++static u32 mptcp_balia_ssthresh(struct sock *sk)
++{
++ const struct tcp_sock *tp = tcp_sk(sk);
++ const struct mptcp_cb *mpcb = tp->mpcb;
++
++ if (unlikely(!mptcp(tp) || mpcb->cnt_established <= 1))
++ return tcp_reno_ssthresh(sk);
++ else
++ return max((u32)(tp->snd_cwnd - mptcp_get_md(sk)), 1U);
++}
++
++static struct tcp_congestion_ops mptcp_balia = {
++ .init = mptcp_balia_init,
++ .ssthresh = mptcp_balia_ssthresh,
++ .cong_avoid = mptcp_balia_cong_avoid,
++ .cwnd_event = mptcp_balia_cwnd_event,
++ .set_state = mptcp_balia_set_state,
++ .owner = THIS_MODULE,
++ .name = "balia",
++};
++
++static int __init mptcp_balia_register(void)
++{
++ BUILD_BUG_ON(sizeof(struct mptcp_balia) > ICSK_CA_PRIV_SIZE);
++ return tcp_register_congestion_control(&mptcp_balia);
++}
++
++static void __exit mptcp_balia_unregister(void)
++{
++ tcp_unregister_congestion_control(&mptcp_balia);
++}
++
++module_init(mptcp_balia_register);
++module_exit(mptcp_balia_unregister);
++
++MODULE_AUTHOR("Jaehyun Hwang, Anwar Walid, Qiuyu Peng, Steven H. Low");
++MODULE_LICENSE("GPL");
++MODULE_DESCRIPTION("MPTCP BALIA CONGESTION CONTROL ALGORITHM");
++MODULE_VERSION("0.1");
diff --git a/net/mptcp/mptcp_binder.c b/net/mptcp/mptcp_binder.c
new file mode 100644
index 000000000000..95d8da560715
@@ -10289,10 +10575,10 @@ index 000000000000..28dfa0479f5e
+}
diff --git a/net/mptcp/mptcp_fullmesh.c b/net/mptcp/mptcp_fullmesh.c
new file mode 100644
-index 000000000000..3a54413ce25b
+index 000000000000..2e4895c9e49c
--- /dev/null
+++ b/net/mptcp/mptcp_fullmesh.c
-@@ -0,0 +1,1722 @@
+@@ -0,0 +1,1730 @@
+#include <linux/module.h>
+
+#include <net/mptcp.h>
@@ -11282,10 +11568,10 @@ index 000000000000..3a54413ce25b
+static int inet6_addr_event(struct notifier_block *this,
+ unsigned long event, void *ptr);
+
-+static int ipv6_is_in_dad_state(const struct inet6_ifaddr *ifa)
++static bool ipv6_dad_finished(const struct inet6_ifaddr *ifa)
+{
-+ return (ifa->flags & IFA_F_TENTATIVE) &&
-+ ifa->state == INET6_IFADDR_STATE_DAD;
++ return !(ifa->flags & IFA_F_TENTATIVE) ||
++ ifa->state > INET6_IFADDR_STATE_DAD;
+}
+
+static void dad_init_timer(struct mptcp_dad_data *data,
@@ -11304,14 +11590,22 @@ index 000000000000..3a54413ce25b
+{
+ struct mptcp_dad_data *data = (struct mptcp_dad_data *)arg;
+
-+ if (ipv6_is_in_dad_state(data->ifa)) {
++ /* DAD failed or IP brought down? */
++ if (data->ifa->state == INET6_IFADDR_STATE_ERRDAD ||
++ data->ifa->state == INET6_IFADDR_STATE_DEAD)
++ goto exit;
++
++ if (!ipv6_dad_finished(data->ifa)) {
+ dad_init_timer(data, data->ifa);
+ add_timer(&data->timer);
-+ } else {
-+ inet6_addr_event(NULL, NETDEV_UP, data->ifa);
-+ in6_ifa_put(data->ifa);
-+ kfree(data);
++ return;
+ }
++
++ inet6_addr_event(NULL, NETDEV_UP, data->ifa);
++
++exit:
++ in6_ifa_put(data->ifa);
++ kfree(data);
+}
+
+static inline void dad_setup_timer(struct inet6_ifaddr *ifa)
@@ -11376,7 +11670,7 @@ index 000000000000..3a54413ce25b
+ event == NETDEV_CHANGE))
+ return NOTIFY_DONE;
+
-+ if (ipv6_is_in_dad_state(ifa6))
++ if (!ipv6_dad_finished(ifa6))
+ dad_setup_timer(ifa6);
+ else
+ addr6_event_handler(ifa6, event, net);
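Since this commit pulls in the new Balia congestion control, the core additive-increase term from mptcp_balia_recalc_ai() is worth restating outside the kernel's fixed-point helpers. The sketch below is illustrative only: it drops the rate-scaling loop, replaces div_u64()/div64_u64() with plain C division in the same order, and the function signature is made up for the example.

```c
/* Illustrative reduction of the Balia increase term from the patch:
 *
 *            (sum_rate)^2 * 10 * w_r
 *  ai = ------------------------------------
 *       (x_r + max_rate) * (4*x_r + max_rate)
 *
 * where x_r is this subflow's sending rate, max_rate and sum_rate are
 * taken over all established subflows, and w_r is the subflow's
 * congestion window.  Divisions happen in the same order as the kernel
 * code, so rounding matches. */
static unsigned long long balia_ai(unsigned long long rate,
                                   unsigned long long max_rate,
                                   unsigned long long sum_rate,
                                   unsigned long long cwnd)
{
	unsigned long long ai;

	ai = (sum_rate * sum_rate * 10ULL) / (rate + max_rate);
	ai = (ai * cwnd) / ((rate << 2) + max_rate);

	return ai ? ai : cwnd; /* kernel falls back to snd_cwnd when ai == 0 */
}
```

With two subflows of equal rate r (so max_rate = r, sum_rate = 2r) this collapses to 4 * cwnd; as the rates diverge, the faster subflow keeps the larger increase.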
* [gentoo-commits] proj/linux-patches:3.16 commit in: /
@ 2014-12-16 17:29 Mike Pagano
From: Mike Pagano @ 2014-12-16 17:29 UTC
To: gentoo-commits
commit: b40e4b7205dd73330cf29bf39590327f973a473b
Author: Mike Pagano <mpagano <AT> gentoo <DOT> org>
AuthorDate: Tue Dec 16 17:29:50 2014 +0000
Commit: Mike Pagano <mpagano <AT> gentoo <DOT> org>
CommitDate: Tue Dec 16 17:29:50 2014 +0000
URL: http://sources.gentoo.org/gitweb/?p=proj/linux-patches.git;a=commit;h=b40e4b72
Updating multipath tcp patch
---
0000_README | 2 +-
... => 5010_multipath-tcp-v3.16-ac0ec67aa8bb.patch | 250 ++++++++++++---------
2 files changed, 139 insertions(+), 113 deletions(-)
diff --git a/0000_README b/0000_README
index 8719a11..7122ab1 100644
--- a/0000_README
+++ b/0000_README
@@ -118,7 +118,7 @@ Patch: 5003_BFQ-3-block-add-Early-Queue-Merge-EQM-v7r6-for-3.16.0.patch
From: http://algo.ing.unimo.it/people/paolo/disk_sched/
Desc: BFQ v7r6 patch 3 for 3.16: Early Queue Merge (EQM)
-Patch: 5010_multipath-tcp-v3.16-075df3a63833.patch
+Patch: 5010_multipath-tcp-v3.16-ac0ec67aa8bb.patch
From: http://multipath-tcp.org/
Desc: Patch for simultaneous use of several IP-addresses/interfaces in TCP for better resource utilization, better throughput and smoother reaction to failures.
diff --git a/5010_multipath-tcp-v3.16-075df3a63833.patch b/5010_multipath-tcp-v3.16-ac0ec67aa8bb.patch
similarity index 98%
rename from 5010_multipath-tcp-v3.16-075df3a63833.patch
rename to 5010_multipath-tcp-v3.16-ac0ec67aa8bb.patch
index 7520b4a..2858f5b 100644
--- a/5010_multipath-tcp-v3.16-075df3a63833.patch
+++ b/5010_multipath-tcp-v3.16-ac0ec67aa8bb.patch
@@ -1991,7 +1991,7 @@ index 156350745700..0e23cae8861f 100644
struct timewait_sock_ops;
struct inet_hashinfo;
diff --git a/include/net/tcp.h b/include/net/tcp.h
-index 7286db80e8b8..ff92e74cd684 100644
+index 7286db80e8b8..2130c1c7fe6e 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -177,6 +177,7 @@ void tcp_time_wait(struct sock *sk, int state, int timeo);
@@ -2030,7 +2030,7 @@ index 7286db80e8b8..ff92e74cd684 100644
extern struct inet_timewait_death_row tcp_death_row;
/* sysctl variables for tcp */
-@@ -344,6 +366,107 @@ extern struct proto tcp_prot;
+@@ -344,6 +366,108 @@ extern struct proto tcp_prot;
#define TCP_ADD_STATS_USER(net, field, val) SNMP_ADD_STATS_USER((net)->mib.tcp_statistics, field, val)
#define TCP_ADD_STATS(net, field, val) SNMP_ADD_STATS((net)->mib.tcp_statistics, field, val)
@@ -2040,6 +2040,7 @@ index 7286db80e8b8..ff92e74cd684 100644
+
+struct mptcp_options_received;
+
++void tcp_cwnd_validate(struct sock *sk, bool is_cwnd_limited);
+void tcp_enter_quickack_mode(struct sock *sk);
+int tcp_close_state(struct sock *sk);
+void tcp_minshall_update(struct tcp_sock *tp, unsigned int mss_now,
@@ -2138,7 +2139,7 @@ index 7286db80e8b8..ff92e74cd684 100644
void tcp_tasklet_init(void);
void tcp_v4_err(struct sk_buff *skb, u32);
-@@ -440,6 +563,7 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
+@@ -440,6 +564,7 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
size_t len, int nonblock, int flags, int *addr_len);
void tcp_parse_options(const struct sk_buff *skb,
struct tcp_options_received *opt_rx,
@@ -2146,7 +2147,7 @@ index 7286db80e8b8..ff92e74cd684 100644
int estab, struct tcp_fastopen_cookie *foc);
const u8 *tcp_parse_md5sig_option(const struct tcphdr *th);
-@@ -493,14 +617,8 @@ static inline u32 tcp_cookie_time(void)
+@@ -493,14 +618,8 @@ static inline u32 tcp_cookie_time(void)
u32 __cookie_v4_init_sequence(const struct iphdr *iph, const struct tcphdr *th,
u16 *mssp);
@@ -2163,7 +2164,7 @@ index 7286db80e8b8..ff92e74cd684 100644
#endif
__u32 cookie_init_timestamp(struct request_sock *req);
-@@ -516,13 +634,6 @@ u32 __cookie_v6_init_sequence(const struct ipv6hdr *iph,
+@@ -516,13 +635,6 @@ u32 __cookie_v6_init_sequence(const struct ipv6hdr *iph,
const struct tcphdr *th, u16 *mssp);
__u32 cookie_v6_init_sequence(struct sock *sk, const struct sk_buff *skb,
__u16 *mss);
@@ -2177,7 +2178,7 @@ index 7286db80e8b8..ff92e74cd684 100644
#endif
/* tcp_output.c */
-@@ -551,10 +662,17 @@ void tcp_send_delayed_ack(struct sock *sk);
+@@ -551,10 +663,17 @@ void tcp_send_delayed_ack(struct sock *sk);
void tcp_send_loss_probe(struct sock *sk);
bool tcp_schedule_loss_probe(struct sock *sk);
@@ -2195,7 +2196,7 @@ index 7286db80e8b8..ff92e74cd684 100644
/* tcp_timer.c */
void tcp_init_xmit_timers(struct sock *);
-@@ -703,14 +821,27 @@ void tcp_send_window_probe(struct sock *sk);
+@@ -703,14 +822,27 @@ void tcp_send_window_probe(struct sock *sk);
*/
struct tcp_skb_cb {
union {
@@ -2226,7 +2227,7 @@ index 7286db80e8b8..ff92e74cd684 100644
__u8 tcp_flags; /* TCP header flags. (tcp[13]) */
__u8 sacked; /* State flags for SACK/FACK. */
-@@ -1075,7 +1206,8 @@ u32 tcp_default_init_rwnd(u32 mss);
+@@ -1075,7 +1207,8 @@ u32 tcp_default_init_rwnd(u32 mss);
/* Determine a window scaling and initial window to offer. */
void tcp_select_initial_window(int __space, __u32 mss, __u32 *rcv_wnd,
__u32 *window_clamp, int wscale_ok,
@@ -2236,7 +2237,7 @@ index 7286db80e8b8..ff92e74cd684 100644
static inline int tcp_win_from_space(int space)
{
-@@ -1084,15 +1216,34 @@ static inline int tcp_win_from_space(int space)
+@@ -1084,6 +1217,19 @@ static inline int tcp_win_from_space(int space)
space - (space>>sysctl_tcp_adv_win_scale);
}
@@ -2256,22 +2257,7 @@ index 7286db80e8b8..ff92e74cd684 100644
/* Note: caller must be prepared to deal with negative returns */
static inline int tcp_space(const struct sock *sk)
{
-+ if (mptcp(tcp_sk(sk)))
-+ sk = tcp_sk(sk)->meta_sk;
-+
- return tcp_win_from_space(sk->sk_rcvbuf -
- atomic_read(&sk->sk_rmem_alloc));
- }
-
- static inline int tcp_full_space(const struct sock *sk)
- {
-+ if (mptcp(tcp_sk(sk)))
-+ sk = tcp_sk(sk)->meta_sk;
-+
- return tcp_win_from_space(sk->sk_rcvbuf);
- }
-
-@@ -1115,6 +1266,8 @@ static inline void tcp_openreq_init(struct request_sock *req,
+@@ -1115,6 +1261,8 @@ static inline void tcp_openreq_init(struct request_sock *req,
ireq->wscale_ok = rx_opt->wscale_ok;
ireq->acked = 0;
ireq->ecn_ok = 0;
@@ -2280,7 +2266,7 @@ index 7286db80e8b8..ff92e74cd684 100644
ireq->ir_rmt_port = tcp_hdr(skb)->source;
ireq->ir_num = ntohs(tcp_hdr(skb)->dest);
}
-@@ -1585,6 +1738,11 @@ int tcp4_proc_init(void);
+@@ -1585,6 +1733,11 @@ int tcp4_proc_init(void);
void tcp4_proc_exit(void);
#endif
@@ -2292,7 +2278,7 @@ index 7286db80e8b8..ff92e74cd684 100644
/* TCP af-specific functions */
struct tcp_sock_af_ops {
#ifdef CONFIG_TCP_MD5SIG
-@@ -1601,7 +1759,32 @@ struct tcp_sock_af_ops {
+@@ -1601,7 +1754,33 @@ struct tcp_sock_af_ops {
#endif
};
@@ -2317,6 +2303,7 @@ index 7286db80e8b8..ff92e74cd684 100644
+ void (*time_wait)(struct sock *sk, int state, int timeo);
+ void (*cleanup_rbuf)(struct sock *sk, int copied);
+ void (*init_congestion_control)(struct sock *sk);
++ void (*cwnd_validate)(struct sock *sk, bool is_cwnd_limited);
+};
+extern const struct tcp_sock_ops tcp_specific;
+
@@ -2325,7 +2312,7 @@ index 7286db80e8b8..ff92e74cd684 100644
#ifdef CONFIG_TCP_MD5SIG
struct tcp_md5sig_key *(*md5_lookup) (struct sock *sk,
struct request_sock *req);
-@@ -1611,8 +1794,39 @@ struct tcp_request_sock_ops {
+@@ -1611,8 +1790,39 @@ struct tcp_request_sock_ops {
const struct request_sock *req,
const struct sk_buff *skb);
#endif
@@ -2572,20 +2559,20 @@ index 4db3c2a1679c..04cb17d4b0ce 100644
goto drop;
diff --git a/net/ipv4/Kconfig b/net/ipv4/Kconfig
-index 05c57f0fcabe..811286a6aa9c 100644
+index 05c57f0fcabe..a1ba825c6acd 100644
--- a/net/ipv4/Kconfig
+++ b/net/ipv4/Kconfig
@@ -556,6 +556,38 @@ config TCP_CONG_ILLINOIS
For further details see:
http://www.ews.uiuc.edu/~shaoliu/tcpillinois/index.html
-+config TCP_CONG_COUPLED
-+ tristate "MPTCP COUPLED CONGESTION CONTROL"
++config TCP_CONG_LIA
++ tristate "MPTCP Linked Increase"
+ depends on MPTCP
+ default n
+ ---help---
-+ MultiPath TCP Coupled Congestion Control
-+ To enable it, just put 'coupled' in tcp_congestion_control
++ MultiPath TCP Linked Increase Congestion Control
++ To enable it, just put 'lia' in tcp_congestion_control
+
+config TCP_CONG_OLIA
+ tristate "MPTCP Opportunistic Linked Increase"
@@ -2618,8 +2605,8 @@ index 05c57f0fcabe..811286a6aa9c 100644
config DEFAULT_WESTWOOD
bool "Westwood" if TCP_CONG_WESTWOOD=y
-+ config DEFAULT_COUPLED
-+ bool "Coupled" if TCP_CONG_COUPLED=y
++ config DEFAULT_LIA
++ bool "Lia" if TCP_CONG_LIA=y
+
+ config DEFAULT_OLIA
+ bool "Olia" if TCP_CONG_OLIA=y
@@ -2637,7 +2624,7 @@ index 05c57f0fcabe..811286a6aa9c 100644
default "vegas" if DEFAULT_VEGAS
default "westwood" if DEFAULT_WESTWOOD
default "veno" if DEFAULT_VENO
-+ default "coupled" if DEFAULT_COUPLED
++ default "lia" if DEFAULT_LIA
+ default "wvegas" if DEFAULT_WVEGAS
+ default "balia" if DEFAULT_BALIA
default "reno" if DEFAULT_RENO
@@ -2815,7 +2802,7 @@ index c86624b36a62..0ff3fe004d62 100644
ireq->rcv_wscale = rcv_wscale;
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
-index 9d2118e5fbc7..2cb89f886d45 100644
+index 9d2118e5fbc7..cb59aef70d26 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -271,6 +271,7 @@
@@ -2826,7 +2813,7 @@ index 9d2118e5fbc7..2cb89f886d45 100644
#include <net/tcp.h>
#include <net/xfrm.h>
#include <net/ip.h>
-@@ -371,6 +372,24 @@ static int retrans_to_secs(u8 retrans, int timeout, int rto_max)
+@@ -371,6 +372,25 @@ static int retrans_to_secs(u8 retrans, int timeout, int rto_max)
return period;
}
@@ -2846,12 +2833,13 @@ index 9d2118e5fbc7..2cb89f886d45 100644
+ .retransmit_timer = tcp_retransmit_timer,
+ .time_wait = tcp_time_wait,
+ .cleanup_rbuf = tcp_cleanup_rbuf,
++ .cwnd_validate = tcp_cwnd_validate,
+};
+
/* Address-family independent initialization for a tcp_sock.
*
* NOTE: A lot of things set to zero explicitly by call to
-@@ -419,6 +438,8 @@ void tcp_init_sock(struct sock *sk)
+@@ -419,6 +439,8 @@ void tcp_init_sock(struct sock *sk)
sk->sk_sndbuf = sysctl_tcp_wmem[1];
sk->sk_rcvbuf = sysctl_tcp_rmem[1];
@@ -2860,7 +2848,7 @@ index 9d2118e5fbc7..2cb89f886d45 100644
local_bh_disable();
sock_update_memcg(sk);
sk_sockets_allocated_inc(sk);
-@@ -726,6 +747,14 @@ ssize_t tcp_splice_read(struct socket *sock, loff_t *ppos,
+@@ -726,6 +748,14 @@ ssize_t tcp_splice_read(struct socket *sock, loff_t *ppos,
int ret;
sock_rps_record_flow(sk);
@@ -2875,7 +2863,7 @@ index 9d2118e5fbc7..2cb89f886d45 100644
/*
* We can't seek on a socket input
*/
-@@ -821,8 +850,7 @@ struct sk_buff *sk_stream_alloc_skb(struct sock *sk, int size, gfp_t gfp)
+@@ -821,8 +851,7 @@ struct sk_buff *sk_stream_alloc_skb(struct sock *sk, int size, gfp_t gfp)
return NULL;
}
@@ -2885,7 +2873,7 @@ index 9d2118e5fbc7..2cb89f886d45 100644
{
struct tcp_sock *tp = tcp_sk(sk);
u32 xmit_size_goal, old_size_goal;
-@@ -872,8 +900,13 @@ static int tcp_send_mss(struct sock *sk, int *size_goal, int flags)
+@@ -872,8 +901,13 @@ static int tcp_send_mss(struct sock *sk, int *size_goal, int flags)
{
int mss_now;
@@ -2901,7 +2889,7 @@ index 9d2118e5fbc7..2cb89f886d45 100644
return mss_now;
}
-@@ -892,11 +925,32 @@ static ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
+@@ -892,11 +926,32 @@ static ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
* is fully established.
*/
if (((1 << sk->sk_state) & ~(TCPF_ESTABLISHED | TCPF_CLOSE_WAIT)) &&
@@ -2935,7 +2923,7 @@ index 9d2118e5fbc7..2cb89f886d45 100644
clear_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags);
mss_now = tcp_send_mss(sk, &size_goal, flags);
-@@ -1001,8 +1055,9 @@ int tcp_sendpage(struct sock *sk, struct page *page, int offset,
+@@ -1001,8 +1056,9 @@ int tcp_sendpage(struct sock *sk, struct page *page, int offset,
{
ssize_t res;
@@ -2947,7 +2935,7 @@ index 9d2118e5fbc7..2cb89f886d45 100644
return sock_no_sendpage(sk->sk_socket, page, offset, size,
flags);
-@@ -1018,6 +1073,9 @@ static inline int select_size(const struct sock *sk, bool sg)
+@@ -1018,6 +1074,9 @@ static inline int select_size(const struct sock *sk, bool sg)
const struct tcp_sock *tp = tcp_sk(sk);
int tmp = tp->mss_cache;
@@ -2957,7 +2945,7 @@ index 9d2118e5fbc7..2cb89f886d45 100644
if (sg) {
if (sk_can_gso(sk)) {
/* Small frames wont use a full page:
-@@ -1100,11 +1158,18 @@ int tcp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
+@@ -1100,11 +1159,18 @@ int tcp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
* is fully established.
*/
if (((1 << sk->sk_state) & ~(TCPF_ESTABLISHED | TCPF_CLOSE_WAIT)) &&
@@ -2977,7 +2965,7 @@ index 9d2118e5fbc7..2cb89f886d45 100644
if (unlikely(tp->repair)) {
if (tp->repair_queue == TCP_RECV_QUEUE) {
copied = tcp_send_rcvq(sk, msg, size);
-@@ -1132,7 +1197,10 @@ int tcp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
+@@ -1132,7 +1198,10 @@ int tcp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
if (sk->sk_err || (sk->sk_shutdown & SEND_SHUTDOWN))
goto out_err;
@@ -2989,7 +2977,7 @@ index 9d2118e5fbc7..2cb89f886d45 100644
while (--iovlen >= 0) {
size_t seglen = iov->iov_len;
-@@ -1183,8 +1251,15 @@ new_segment:
+@@ -1183,8 +1252,15 @@ new_segment:
/*
* Check whether we can use HW checksum.
@@ -3006,7 +2994,7 @@ index 9d2118e5fbc7..2cb89f886d45 100644
skb->ip_summed = CHECKSUM_PARTIAL;
skb_entail(sk, skb);
-@@ -1422,7 +1497,7 @@ void tcp_cleanup_rbuf(struct sock *sk, int copied)
+@@ -1422,7 +1498,7 @@ void tcp_cleanup_rbuf(struct sock *sk, int copied)
/* Optimize, __tcp_select_window() is not cheap. */
if (2*rcv_window_now <= tp->window_clamp) {
@@ -3015,7 +3003,7 @@ index 9d2118e5fbc7..2cb89f886d45 100644
/* Send ACK now, if this read freed lots of space
* in our buffer. Certainly, new_window is new window.
-@@ -1587,7 +1662,7 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
+@@ -1587,7 +1663,7 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
/* Clean up data we have read: This will do ACK frames. */
if (copied > 0) {
tcp_recv_skb(sk, seq, &offset);
@@ -3024,7 +3012,7 @@ index 9d2118e5fbc7..2cb89f886d45 100644
}
return copied;
}
-@@ -1623,6 +1698,14 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
+@@ -1623,6 +1699,14 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
lock_sock(sk);
@@ -3039,7 +3027,7 @@ index 9d2118e5fbc7..2cb89f886d45 100644
err = -ENOTCONN;
if (sk->sk_state == TCP_LISTEN)
goto out;
-@@ -1761,7 +1844,7 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
+@@ -1761,7 +1845,7 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
}
}
@@ -3048,7 +3036,7 @@ index 9d2118e5fbc7..2cb89f886d45 100644
if (!sysctl_tcp_low_latency && tp->ucopy.task == user_recv) {
/* Install new reader */
-@@ -1813,7 +1896,7 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
+@@ -1813,7 +1897,7 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
if (tp->rcv_wnd == 0 &&
!skb_queue_empty(&sk->sk_async_wait_queue)) {
tcp_service_net_dma(sk, true);
@@ -3057,7 +3045,7 @@ index 9d2118e5fbc7..2cb89f886d45 100644
} else
dma_async_issue_pending(tp->ucopy.dma_chan);
}
-@@ -1993,7 +2076,7 @@ skip_copy:
+@@ -1993,7 +2077,7 @@ skip_copy:
*/
/* Clean up data we have read: This will do ACK frames. */
@@ -3066,7 +3054,7 @@ index 9d2118e5fbc7..2cb89f886d45 100644
release_sock(sk);
return copied;
-@@ -2070,7 +2153,7 @@ static const unsigned char new_state[16] = {
+@@ -2070,7 +2154,7 @@ static const unsigned char new_state[16] = {
/* TCP_CLOSING */ TCP_CLOSING,
};
@@ -3075,7 +3063,7 @@ index 9d2118e5fbc7..2cb89f886d45 100644
{
int next = (int)new_state[sk->sk_state];
int ns = next & TCP_STATE_MASK;
-@@ -2100,7 +2183,7 @@ void tcp_shutdown(struct sock *sk, int how)
+@@ -2100,7 +2184,7 @@ void tcp_shutdown(struct sock *sk, int how)
TCPF_SYN_RECV | TCPF_CLOSE_WAIT)) {
/* Clear out any half completed packets. FIN if needed. */
if (tcp_close_state(sk))
@@ -3084,7 +3072,7 @@ index 9d2118e5fbc7..2cb89f886d45 100644
}
}
EXPORT_SYMBOL(tcp_shutdown);
-@@ -2125,6 +2208,11 @@ void tcp_close(struct sock *sk, long timeout)
+@@ -2125,6 +2209,11 @@ void tcp_close(struct sock *sk, long timeout)
int data_was_unread = 0;
int state;
@@ -3096,7 +3084,7 @@ index 9d2118e5fbc7..2cb89f886d45 100644
lock_sock(sk);
sk->sk_shutdown = SHUTDOWN_MASK;
-@@ -2167,7 +2255,7 @@ void tcp_close(struct sock *sk, long timeout)
+@@ -2167,7 +2256,7 @@ void tcp_close(struct sock *sk, long timeout)
/* Unread data was tossed, zap the connection. */
NET_INC_STATS_USER(sock_net(sk), LINUX_MIB_TCPABORTONCLOSE);
tcp_set_state(sk, TCP_CLOSE);
@@ -3105,7 +3093,7 @@ index 9d2118e5fbc7..2cb89f886d45 100644
} else if (sock_flag(sk, SOCK_LINGER) && !sk->sk_lingertime) {
/* Check zero linger _after_ checking for unread data. */
sk->sk_prot->disconnect(sk, 0);
-@@ -2247,7 +2335,7 @@ adjudge_to_death:
+@@ -2247,7 +2336,7 @@ adjudge_to_death:
struct tcp_sock *tp = tcp_sk(sk);
if (tp->linger2 < 0) {
tcp_set_state(sk, TCP_CLOSE);
@@ -3114,7 +3102,7 @@ index 9d2118e5fbc7..2cb89f886d45 100644
NET_INC_STATS_BH(sock_net(sk),
LINUX_MIB_TCPABORTONLINGER);
} else {
-@@ -2257,7 +2345,8 @@ adjudge_to_death:
+@@ -2257,7 +2346,8 @@ adjudge_to_death:
inet_csk_reset_keepalive_timer(sk,
tmo - TCP_TIMEWAIT_LEN);
} else {
@@ -3124,7 +3112,7 @@ index 9d2118e5fbc7..2cb89f886d45 100644
goto out;
}
}
-@@ -2266,7 +2355,7 @@ adjudge_to_death:
+@@ -2266,7 +2356,7 @@ adjudge_to_death:
sk_mem_reclaim(sk);
if (tcp_check_oom(sk, 0)) {
tcp_set_state(sk, TCP_CLOSE);
@@ -3133,7 +3121,7 @@ index 9d2118e5fbc7..2cb89f886d45 100644
NET_INC_STATS_BH(sock_net(sk),
LINUX_MIB_TCPABORTONMEMORY);
}
-@@ -2291,15 +2380,6 @@ out:
+@@ -2291,15 +2381,6 @@ out:
}
EXPORT_SYMBOL(tcp_close);
@@ -3149,7 +3137,7 @@ index 9d2118e5fbc7..2cb89f886d45 100644
int tcp_disconnect(struct sock *sk, int flags)
{
struct inet_sock *inet = inet_sk(sk);
-@@ -2322,7 +2402,7 @@ int tcp_disconnect(struct sock *sk, int flags)
+@@ -2322,7 +2403,7 @@ int tcp_disconnect(struct sock *sk, int flags)
/* The last check adjusts for discrepancy of Linux wrt. RFC
* states
*/
@@ -3158,7 +3146,7 @@ index 9d2118e5fbc7..2cb89f886d45 100644
sk->sk_err = ECONNRESET;
} else if (old_state == TCP_SYN_SENT)
sk->sk_err = ECONNRESET;
-@@ -2340,6 +2420,13 @@ int tcp_disconnect(struct sock *sk, int flags)
+@@ -2340,6 +2421,13 @@ int tcp_disconnect(struct sock *sk, int flags)
if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK))
inet_reset_saddr(sk);
@@ -3172,7 +3160,7 @@ index 9d2118e5fbc7..2cb89f886d45 100644
sk->sk_shutdown = 0;
sock_reset_flag(sk, SOCK_DONE);
tp->srtt_us = 0;
-@@ -2632,6 +2719,12 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
+@@ -2632,6 +2720,12 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
break;
case TCP_DEFER_ACCEPT:
@@ -3185,7 +3173,7 @@ index 9d2118e5fbc7..2cb89f886d45 100644
/* Translate value in seconds to number of retransmits */
icsk->icsk_accept_queue.rskq_defer_accept =
secs_to_retrans(val, TCP_TIMEOUT_INIT / HZ,
-@@ -2659,7 +2752,7 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
+@@ -2659,7 +2753,7 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
(TCPF_ESTABLISHED | TCPF_CLOSE_WAIT) &&
inet_csk_ack_scheduled(sk)) {
icsk->icsk_ack.pending |= ICSK_ACK_PUSHED;
@@ -3194,7 +3182,7 @@ index 9d2118e5fbc7..2cb89f886d45 100644
if (!(val & 1))
icsk->icsk_ack.pingpong = 1;
}
-@@ -2699,6 +2792,18 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
+@@ -2699,6 +2793,18 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
tp->notsent_lowat = val;
sk->sk_write_space(sk);
break;
@@ -3213,7 +3201,7 @@ index 9d2118e5fbc7..2cb89f886d45 100644
default:
err = -ENOPROTOOPT;
break;
-@@ -2931,6 +3036,11 @@ static int do_tcp_getsockopt(struct sock *sk, int level,
+@@ -2931,6 +3037,11 @@ static int do_tcp_getsockopt(struct sock *sk, int level,
case TCP_NOTSENT_LOWAT:
val = tp->notsent_lowat;
break;
@@ -3225,7 +3213,7 @@ index 9d2118e5fbc7..2cb89f886d45 100644
default:
return -ENOPROTOOPT;
}
-@@ -3120,8 +3230,11 @@ void tcp_done(struct sock *sk)
+@@ -3120,8 +3231,11 @@ void tcp_done(struct sock *sk)
if (sk->sk_state == TCP_SYN_SENT || sk->sk_state == TCP_SYN_RECV)
TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_ATTEMPTFAILS);
@@ -3299,7 +3287,7 @@ index 9771563ab564..5c230d96c4c1 100644
WARN_ON(req->sk == NULL);
return true;
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
-index 40639c288dc2..3273bb69f387 100644
+index 40639c288dc2..71033189797d 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -74,6 +74,9 @@
@@ -3391,7 +3379,7 @@ index 40639c288dc2..3273bb69f387 100644
- if (tp->rcv_ssthresh < tp->window_clamp &&
- (int)tp->rcv_ssthresh < tcp_space(sk) &&
+ if (meta_tp->rcv_ssthresh < meta_tp->window_clamp &&
-+ (int)meta_tp->rcv_ssthresh < tcp_space(sk) &&
++ (int)meta_tp->rcv_ssthresh < tcp_space(meta_sk) &&
!sk_under_memory_pressure(sk)) {
int incr;
@@ -5203,7 +5191,7 @@ index e68e0d4af6c9..ae6946857dff 100644
return ret;
}
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
-index 179b51e6bda3..efd31b6c5784 100644
+index 179b51e6bda3..267d5f7eb303 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -36,6 +36,12 @@
@@ -5559,6 +5547,15 @@ index 179b51e6bda3..efd31b6c5784 100644
/* RFC2861, slow part. Adjust cwnd, after it was not full during one rto.
* As additional protections, we do not touch cwnd in retransmission phases,
+@@ -1402,7 +1448,7 @@ static void tcp_cwnd_application_limited(struct sock *sk)
+ tp->snd_cwnd_stamp = tcp_time_stamp;
+ }
+
+-static void tcp_cwnd_validate(struct sock *sk, bool is_cwnd_limited)
++void tcp_cwnd_validate(struct sock *sk, bool is_cwnd_limited)
+ {
+ struct tcp_sock *tp = tcp_sk(sk);
+
@@ -1446,8 +1492,8 @@ static bool tcp_minshall_check(const struct tcp_sock *tp)
* But we can avoid doing the divide again given we already have
* skb_pcount = skb->len / mss_now
@@ -5680,7 +5677,17 @@ index 179b51e6bda3..efd31b6c5784 100644
/* Do MTU probing. */
result = tcp_mtu_probe(sk);
if (!result) {
-@@ -2099,7 +2150,8 @@ void tcp_send_loss_probe(struct sock *sk)
+@@ -2004,7 +2055,8 @@ repair:
+ /* Send one loss probe per tail loss episode. */
+ if (push_one != 2)
+ tcp_schedule_loss_probe(sk);
+- tcp_cwnd_validate(sk, is_cwnd_limited);
++ if (tp->ops->cwnd_validate)
++ tp->ops->cwnd_validate(sk, is_cwnd_limited);
+ return false;
+ }
+ return (push_one == 2) || (!tp->packets_out && tcp_send_head(sk));
+@@ -2099,7 +2151,8 @@ void tcp_send_loss_probe(struct sock *sk)
int err = -1;
if (tcp_send_head(sk) != NULL) {
@@ -5690,7 +5697,7 @@ index 179b51e6bda3..efd31b6c5784 100644
goto rearm_timer;
}
-@@ -2159,8 +2211,8 @@ void __tcp_push_pending_frames(struct sock *sk, unsigned int cur_mss,
+@@ -2159,8 +2212,8 @@ void __tcp_push_pending_frames(struct sock *sk, unsigned int cur_mss,
if (unlikely(sk->sk_state == TCP_CLOSE))
return;
@@ -5701,7 +5708,7 @@ index 179b51e6bda3..efd31b6c5784 100644
tcp_check_probe_timer(sk);
}
-@@ -2173,7 +2225,8 @@ void tcp_push_one(struct sock *sk, unsigned int mss_now)
+@@ -2173,7 +2226,8 @@ void tcp_push_one(struct sock *sk, unsigned int mss_now)
BUG_ON(!skb || skb->len < mss_now);
@@ -5711,7 +5718,7 @@ index 179b51e6bda3..efd31b6c5784 100644
}
/* This function returns the amount that we can raise the
-@@ -2386,6 +2439,10 @@ static void tcp_retrans_try_collapse(struct sock *sk, struct sk_buff *to,
+@@ -2386,6 +2440,10 @@ static void tcp_retrans_try_collapse(struct sock *sk, struct sk_buff *to,
if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_SYN)
return;
@@ -5722,7 +5729,7 @@ index 179b51e6bda3..efd31b6c5784 100644
tcp_for_write_queue_from_safe(skb, tmp, sk) {
if (!tcp_can_collapse(sk, skb))
break;
-@@ -2843,7 +2900,7 @@ struct sk_buff *tcp_make_synack(struct sock *sk, struct dst_entry *dst,
+@@ -2843,7 +2901,7 @@ struct sk_buff *tcp_make_synack(struct sock *sk, struct dst_entry *dst,
/* RFC1323: The window in SYN & SYN/ACK segments is never scaled. */
th->window = htons(min(req->rcv_wnd, 65535U));
@@ -5731,7 +5738,7 @@ index 179b51e6bda3..efd31b6c5784 100644
th->doff = (tcp_header_size >> 2);
TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_OUTSEGS);
-@@ -2897,13 +2954,13 @@ static void tcp_connect_init(struct sock *sk)
+@@ -2897,13 +2955,13 @@ static void tcp_connect_init(struct sock *sk)
(tp->window_clamp > tcp_full_space(sk) || tp->window_clamp == 0))
tp->window_clamp = tcp_full_space(sk);
@@ -5752,7 +5759,7 @@ index 179b51e6bda3..efd31b6c5784 100644
tp->rx_opt.rcv_wscale = rcv_wscale;
tp->rcv_ssthresh = tp->rcv_wnd;
-@@ -2927,6 +2984,36 @@ static void tcp_connect_init(struct sock *sk)
+@@ -2927,6 +2985,36 @@ static void tcp_connect_init(struct sock *sk)
inet_csk(sk)->icsk_rto = TCP_TIMEOUT_INIT;
inet_csk(sk)->icsk_retransmits = 0;
tcp_clear_retrans(tp);
@@ -5789,7 +5796,7 @@ index 179b51e6bda3..efd31b6c5784 100644
}
static void tcp_connect_queue_skb(struct sock *sk, struct sk_buff *skb)
-@@ -3176,6 +3263,7 @@ void tcp_send_ack(struct sock *sk)
+@@ -3176,6 +3264,7 @@ void tcp_send_ack(struct sock *sk)
TCP_SKB_CB(buff)->when = tcp_time_stamp;
tcp_transmit_skb(sk, buff, 0, sk_gfp_atomic(sk, GFP_ATOMIC));
}
@@ -5797,7 +5804,7 @@ index 179b51e6bda3..efd31b6c5784 100644
/* This routine sends a packet with an out of date sequence
* number. It assumes the other end will try to ack it.
-@@ -3188,7 +3276,7 @@ void tcp_send_ack(struct sock *sk)
+@@ -3188,7 +3277,7 @@ void tcp_send_ack(struct sock *sk)
* one is with SEG.SEQ=SND.UNA to deliver urgent pointer, another is
* out-of-date with SND.UNA-1 to probe window.
*/
@@ -5806,7 +5813,7 @@ index 179b51e6bda3..efd31b6c5784 100644
{
struct tcp_sock *tp = tcp_sk(sk);
struct sk_buff *skb;
-@@ -3270,7 +3358,7 @@ void tcp_send_probe0(struct sock *sk)
+@@ -3270,7 +3359,7 @@ void tcp_send_probe0(struct sock *sk)
struct tcp_sock *tp = tcp_sk(sk);
int err;
@@ -5815,7 +5822,7 @@ index 179b51e6bda3..efd31b6c5784 100644
if (tp->packets_out || !tcp_send_head(sk)) {
/* Cancel probe timer, if it is not required. */
-@@ -3301,3 +3389,18 @@ void tcp_send_probe0(struct sock *sk)
+@@ -3301,3 +3390,18 @@ void tcp_send_probe0(struct sock *sk)
TCP_RTO_MAX);
}
}
@@ -7099,7 +7106,7 @@ index 000000000000..cdfc03adabf8
+
diff --git a/net/mptcp/Makefile b/net/mptcp/Makefile
new file mode 100644
-index 000000000000..2feb3e873206
+index 000000000000..5c70e7cca3b3
--- /dev/null
+++ b/net/mptcp/Makefile
@@ -0,0 +1,21 @@
@@ -7113,7 +7120,7 @@ index 000000000000..2feb3e873206
+mptcp-y := mptcp_ctrl.o mptcp_ipv4.o mptcp_ofo_queue.o mptcp_pm.o \
+ mptcp_output.o mptcp_input.o mptcp_sched.o
+
-+obj-$(CONFIG_TCP_CONG_COUPLED) += mptcp_coupled.o
++obj-$(CONFIG_TCP_CONG_LIA) += mptcp_coupled.o
+obj-$(CONFIG_TCP_CONG_OLIA) += mptcp_olia.o
+obj-$(CONFIG_TCP_CONG_WVEGAS) += mptcp_wvegas.o
+obj-$(CONFIG_TCP_CONG_BALIA) += mptcp_balia.o
@@ -7126,7 +7133,7 @@ index 000000000000..2feb3e873206
+
diff --git a/net/mptcp/mptcp_balia.c b/net/mptcp/mptcp_balia.c
new file mode 100644
-index 000000000000..5cc224d80b01
+index 000000000000..565cb75e2cea
--- /dev/null
+++ b/net/mptcp/mptcp_balia.c
@@ -0,0 +1,267 @@
@@ -7156,8 +7163,9 @@ index 000000000000..5cc224d80b01
+ * if max_rate > 2^rate_scale_limit
+ */
+
-+static int rate_scale_limit = 30;
-+static int scale_num = 10;
++static int rate_scale_limit = 25;
++static int alpha_scale = 10;
++static int scale_num = 5;
+
+struct mptcp_balia {
+ u64 ai;
@@ -7210,7 +7218,6 @@ index 000000000000..5cc224d80b01
+ const struct tcp_sock *tp = tcp_sk(sk);
+ const struct mptcp_cb *mpcb = tp->mpcb;
+ const struct sock *sub_sk;
-+ int can_send = 0;
+ u64 max_rate = 0, rate = 0, sum_rate = 0;
+ u64 alpha = 0, ai = 0, md = 0;
+ int num_scale_down = 0;
@@ -7230,27 +7237,24 @@ index 000000000000..5cc224d80b01
+ if (!mptcp_balia_sk_can_send(sub_sk))
+ continue;
+
-+ can_send++;
-+
+ tmp = div_u64((u64)tp->mss_cache * sub_tp->snd_cwnd
+ * (USEC_PER_SEC << 3), sub_tp->srtt_us);
+ sum_rate += tmp;
+
++ if (tp == sub_tp)
++ rate = tmp;
++
+ if (tmp >= max_rate)
+ max_rate = tmp;
+ }
+
-+ /* No subflow is able to send - we don't care anymore */
-+ if (unlikely(!can_send))
++ /* At least, the current subflow should be able to send */
++ if (unlikely(!rate))
+ goto exit;
+
-+ rate = div_u64((u64)tp->mss_cache * tp->snd_cwnd *
-+ (USEC_PER_SEC << 3), tp->srtt_us);
+ alpha = div64_u64(max_rate, rate);
+
-+ /* Scale down max_rate from B/s to KB/s, MB/s, or GB/s
-+ * if max_rate is too high (i.e., >2^30)
-+ */
++ /* Scale down max_rate if it is too high (e.g., >2^25) */
+ while (max_rate > mptcp_balia_scale(1, rate_scale_limit)) {
+ max_rate >>= scale_num;
+ num_scale_down++;
@@ -7262,6 +7266,9 @@ index 000000000000..5cc224d80b01
+ struct tcp_sock *sub_tp = tcp_sk(sub_sk);
+ u64 tmp;
+
++ if (!mptcp_balia_sk_can_send(sub_sk))
++ continue;
++
+ tmp = div_u64((u64)tp->mss_cache * sub_tp->snd_cwnd
+ * (USEC_PER_SEC << 3), sub_tp->srtt_us);
+ tmp >>= (scale_num * num_scale_down);
@@ -7283,9 +7290,9 @@ index 000000000000..5cc224d80b01
+ if (unlikely(!ai))
+ ai = tp->snd_cwnd;
+
-+ md = ((tp->snd_cwnd >> 1) * min(mptcp_balia_scale(alpha, scale_num),
-+ mptcp_balia_scale(3, scale_num) >> 1))
-+ >> scale_num;
++ md = ((tp->snd_cwnd >> 1) * min(mptcp_balia_scale(alpha, alpha_scale),
++ mptcp_balia_scale(3, alpha_scale) >> 1))
++ >> alpha_scale;
+
+exit:
+ mptcp_set_ai(sk, ai);
@@ -16520,10 +16527,10 @@ index 000000000000..53f5c43bb488
+MODULE_VERSION("0.1");
diff --git a/net/mptcp/mptcp_output.c b/net/mptcp/mptcp_output.c
new file mode 100644
-index 000000000000..400ea254c078
+index 000000000000..e2a6a6d6522d
--- /dev/null
+++ b/net/mptcp/mptcp_output.c
-@@ -0,0 +1,1743 @@
+@@ -0,0 +1,1758 @@
+/*
+ * MPTCP implementation - Sending side
+ *
@@ -17181,11 +17188,9 @@ index 000000000000..400ea254c078
+ struct sock *subsk = NULL;
+ struct mptcp_cb *mpcb = meta_tp->mpcb;
+ struct sk_buff *skb;
-+ unsigned int sent_pkts;
+ int reinject = 0;
+ unsigned int sublimit;
-+
-+ sent_pkts = 0;
++ __u32 path_mask = 0;
+
+ while ((skb = mpcb->sched_ops->next_segment(meta_sk, &reinject, &subsk,
+ &sublimit))) {
@@ -17266,6 +17271,7 @@ index 000000000000..400ea254c078
+ * always push on the subflow
+ */
+ __tcp_push_pending_frames(subsk, mss_now, TCP_NAGLE_PUSH);
++ path_mask |= mptcp_pi_to_flag(subtp->mptcp->path_index);
+ TCP_SKB_CB(skb)->when = tcp_time_stamp;
+
+ if (!reinject) {
@@ -17276,7 +17282,6 @@ index 000000000000..400ea254c078
+ }
+
+ tcp_minshall_update(meta_tp, mss_now, skb);
-+ sent_pkts += tcp_skb_pcount(skb);
+
+ if (reinject > 0) {
+ __skb_unlink(skb, &mpcb->reinject_queue);
@@ -17287,6 +17292,22 @@ index 000000000000..400ea254c078
+ break;
+ }
+
++ mptcp_for_each_sk(mpcb, subsk) {
++ subtp = tcp_sk(subsk);
++
++ if (!(path_mask & mptcp_pi_to_flag(subtp->mptcp->path_index)))
++ continue;
++
++ /* We have pushed data on this subflow. We ignore the call to
++ * cwnd_validate in tcp_write_xmit as is_cwnd_limited will never
++ * be true (we never push more than what the cwnd can accept).
++ * We need to ensure that we call tcp_cwnd_validate with
++ * is_cwnd_limited set to true if we have filled the cwnd.
++ */
++ tcp_cwnd_validate(subsk, tcp_packets_in_flight(subtp) >=
++ subtp->snd_cwnd);
++ }
++
+ return !meta_tp->packets_out && tcp_send_head(meta_sk);
+}
+
@@ -17299,6 +17320,7 @@ index 000000000000..400ea254c078
+{
+ struct inet_connection_sock *icsk = inet_csk(sk);
+ struct tcp_sock *tp = tcp_sk(sk), *meta_tp = mptcp_meta_tp(tp);
++ struct sock *meta_sk = mptcp_meta_sk(sk);
+ int mss, free_space, full_space, window;
+
+ /* MSS for the peer's data. Previous versions used mss_clamp
@@ -17308,9 +17330,9 @@ index 000000000000..400ea254c078
+ * fluctuations. --SAW 1998/11/1
+ */
+ mss = icsk->icsk_ack.rcv_mss;
-+ free_space = tcp_space(sk);
++ free_space = tcp_space(meta_sk);
+ full_space = min_t(int, meta_tp->window_clamp,
-+ tcp_full_space(sk));
++ tcp_full_space(meta_sk));
+
+ if (mss > full_space)
+ mss = full_space;
@@ -18751,10 +18773,10 @@ index 000000000000..93278f684069
+MODULE_VERSION("0.89");
diff --git a/net/mptcp/mptcp_sched.c b/net/mptcp/mptcp_sched.c
new file mode 100644
-index 000000000000..6c7ff4eceac1
+index 000000000000..4a578821f50e
--- /dev/null
+++ b/net/mptcp/mptcp_sched.c
-@@ -0,0 +1,493 @@
+@@ -0,0 +1,497 @@
+/* MPTCP Scheduler module selector. Highly inspired by tcp_cong.c */
+
+#include <linux/module.h>
@@ -18979,8 +19001,12 @@ index 000000000000..6c7ff4eceac1
+ if (tp_it != tp &&
+ TCP_SKB_CB(skb_head)->path_mask & mptcp_pi_to_flag(tp_it->mptcp->path_index)) {
+ if (tp->srtt_us < tp_it->srtt_us && inet_csk((struct sock *)tp_it)->icsk_ca_state == TCP_CA_Open) {
++ u32 prior_cwnd = tp_it->snd_cwnd;
++
+ tp_it->snd_cwnd = max(tp_it->snd_cwnd >> 1U, 1U);
-+ if (tp_it->snd_ssthresh != TCP_INFINITE_SSTHRESH)
++
++ /* If in slow start, do not reduce the ssthresh */
++ if (prior_cwnd >= tp_it->snd_ssthresh)
+ tp_it->snd_ssthresh = max(tp_it->snd_ssthresh >> 1U, 2U);
+
+ dsp->last_rbuf_opti = tcp_time_stamp;
2014-09-17 22:19 [gentoo-commits] proj/linux-patches:3.16 commit in: / Anthony G. Basile
-- strict thread matches above, loose matches on Subject: below --
2014-12-16 17:29 Mike Pagano
2014-11-29 18:11 Mike Pagano
2014-11-29 18:05 Mike Pagano
2014-11-29 18:05 Mike Pagano
2014-11-29 18:05 Mike Pagano
2014-10-30 19:29 Mike Pagano
2014-10-15 12:42 Mike Pagano
2014-10-09 19:54 Mike Pagano
2014-10-07 1:34 Anthony G. Basile
2014-10-07 1:28 Anthony G. Basile
2014-10-06 11:39 Mike Pagano
2014-10-06 11:38 Mike Pagano
2014-10-06 11:16 Anthony G. Basile
2014-10-06 11:16 Anthony G. Basile
2014-09-27 13:37 Mike Pagano
2014-09-26 19:40 Mike Pagano
2014-09-22 23:37 Mike Pagano
2014-09-09 21:38 Vlastimil Babka
2014-08-26 12:16 Mike Pagano
2014-08-19 11:44 Mike Pagano
2014-08-14 11:51 ` Mike Pagano
2014-08-08 19:48 Mike Pagano
2014-08-19 11:44 ` Mike Pagano
2014-07-15 12:23 Mike Pagano
2014-07-15 12:18 Mike Pagano