memcpy

Are you tired of hacking?, take some rest here. Just help me out with my small experiment regarding memcpy performance. after that, flag is yours.

ssh memcpy@pwnable.kr -p2222 (pw:guest)


The source:


/* memcpy.c */
// compiled with : gcc -o memcpy memcpy.c -m32 -lm
#include <math.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

unsigned long long rdtsc() { asm("rdtsc"); }

char* slow_memcpy(char* dest, const char* src, size_t len) {
  int i;
  for (i = 0; i < len; i++) {
    dest[i] = src[i];
  }
  return dest;
}

char* fast_memcpy(char* dest, const char* src, size_t len) {
  size_t i;
  // 64-byte block fast copy
  if (len >= 64) {
    i = len / 64;
    len &= (64 - 1);
    while (i-- > 0) {
      __asm__ __volatile__(
          "movdqa (%0), %%xmm0\n"
          "movdqa 16(%0), %%xmm1\n"
          "movdqa 32(%0), %%xmm2\n"
          "movdqa 48(%0), %%xmm3\n"
          "movntps %%xmm0, (%1)\n"
          "movntps %%xmm1, 16(%1)\n"
          "movntps %%xmm2, 32(%1)\n"
          "movntps %%xmm3, 48(%1)\n" ::"r"(src),
          "r"(dest)
          : "memory");
      dest += 64;
      src += 64;
    }
  }

  // byte-to-byte slow copy
  if (len) slow_memcpy(dest, src, len);
  return dest;
}

int main(void) {
  setvbuf(stdout, 0, _IONBF, 0);
  setvbuf(stdin, 0, _IOLBF, 0);

  printf("Hey, I have a boring assignment for CS class.. :(\n");
  printf("The assignment is simple.\n");

  printf("-----------------------------------------------------\n");
  printf("- What is the best implementation of memcpy?        -\n");
  printf("- 1. implement your own slow/fast version of memcpy -\n");
  printf("- 2. compare them with various size of data         -\n");
  printf("- 3. conclude your experiment and submit report     -\n");
  printf("-----------------------------------------------------\n");

  printf("This time, just help me out with my experiment and get flag\n");
  printf("No fancy hacking, I promise :D\n");

  unsigned long long t1, t2;
  int e;
  char* src;
  char* dest;
  unsigned int low, high;
  unsigned int size;
  // allocate memory
  char* cache1 = mmap(0, 0x4000, 7, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  char* cache2 = mmap(0, 0x4000, 7, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  src = mmap(0, 0x2000, 7, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

  size_t sizes[10];
  int i = 0;

  // setup experiment parameters
  for (e = 4; e < 14; e++) {  // 2^13 = 8K
    low = pow(2, e - 1);
    high = pow(2, e);
    printf("specify the memcpy amount between %d ~ %d : ", low, high);
    scanf("%d", &size);
    if (size < low || size > high) {
      printf("don't mess with the experiment.\n");
      exit(0);
    }
    sizes[i++] = size;
  }

  sleep(1);
  printf("ok, lets run the experiment with your configuration\n");
  sleep(1);

  // run experiment
  for (i = 0; i < 10; i++) {
    size = sizes[i];
    printf("experiment %d : memcpy with buffer size %d\n", i + 1, size);
    dest = malloc(size);

    memcpy(cache1, cache2, 0x4000);  // to eliminate cache effect
    t1 = rdtsc();
    slow_memcpy(dest, src, size);  // byte-to-byte memcpy
    t2 = rdtsc();
    printf("ellapsed CPU cycles for slow_memcpy : %llu\n", t2 - t1);

    memcpy(cache1, cache2, 0x4000);  // to eliminate cache effect
    t1 = rdtsc();
    fast_memcpy(dest, src, size);  // block-to-block memcpy
    t2 = rdtsc();
    printf("ellapsed CPU cycles for fast_memcpy : %llu\n", t2 - t1);
    printf("\n");
  }

  printf("thanks for helping my experiment!\n");
  printf("flag : ----- erased in this source code -----\n");
  return 0;
}


Let’s first compile it and run on the server with random input:


memcpy@pwnable:/tmp$ gcc -m32 memcpy.c -g -lm # -g for easier debug
memcpy@pwnable:/tmp$ ./a.out
Hey, I have a boring assignment for CS class.. :(
The assignment is simple.
-----------------------------------------------------
- What is the best implementation of memcpy?        -
- 1. implement your own slow/fast version of memcpy -
- 2. compare them with various size of data         -
- 3. conclude your experiment and submit report     -
-----------------------------------------------------
This time, just help me out with my experiment and get flag
No fancy hacking, I promise :D
specify the memcpy amount between 8 ~ 16 : 8
specify the memcpy amount between 16 ~ 32 : 16
specify the memcpy amount between 32 ~ 64 : 32
specify the memcpy amount between 64 ~ 128 : 64
specify the memcpy amount between 128 ~ 256 : 128
specify the memcpy amount between 256 ~ 512 : 256
specify the memcpy amount between 512 ~ 1024 : 512
specify the memcpy amount between 1024 ~ 2048 : 1024
specify the memcpy amount between 2048 ~ 4096 : 2048
specify the memcpy amount between 4096 ~ 8192 : 4096
ok, lets run the experiment with your configuration
experiment 1 : memcpy with buffer size 8
ellapsed CPU cycles for slow_memcpy : 2456
ellapsed CPU cycles for fast_memcpy : 288

experiment 2 : memcpy with buffer size 16
ellapsed CPU cycles for slow_memcpy : 334
ellapsed CPU cycles for fast_memcpy : 522

experiment 3 : memcpy with buffer size 32
ellapsed CPU cycles for slow_memcpy : 646
ellapsed CPU cycles for fast_memcpy : 642

experiment 4 : memcpy with buffer size 64
ellapsed CPU cycles for slow_memcpy : 1064
ellapsed CPU cycles for fast_memcpy : 192

experiment 5 : memcpy with buffer size 128
ellapsed CPU cycles for slow_memcpy : 1998
Segmentation fault (core dumped)


To find the cause of SEGFAULT, let’s launch gdb (it’s the only debugger on the server).


memcpy@pwnable:/tmp$ gdb -q a.out # -q for silent
Reading symbols from a.out...done.
(gdb) start < in # in is a file with the input
# program runs...
Program received signal SIGSEGV, Segmentation fault.
0x080487cc in fast_memcpy (dest=0x804d0a8 "", src=0xf7fca000 "", len=0)
29                              __asm__ __volatile__ (
(gdb) x/i 0x80487cc # this is where segfault occured
=> 0x80487cc <fast_memcpy+52>:  movntps %xmm0,(%edx)


What’s “movntps”?

Moves the packed single-precision floating-point values in the source operand (second operand) to the destination operand (first operand) using a non-temporal hint to prevent caching of the data during the write to memory. The source operand is an XMM register, YMM register or ZMM register, which is assumed to contain packed single-precision, floating-pointing. The destination operand is a 128-bit, 256-bit or 512-bit memory location. The memory operand must be aligned on a 16-byte (128-bit version), 32-byte (VEX.256 encoded version) or 64-byte (EVEX.512 encoded version) boundary otherwise a general-protection exception (#GP) will be generated.

It says the memory operand must be aligned on a 16-byte boundary otherwise there will be an exception. That seems like the issue we are having. The “memcpy amount” we are giving decides which addresses and whether they happen to be aligned 16-bytes or not.


Let’s see addresses the memory operand happens to be, when we always input the least number (8, 16, 32, and so on) as in the file “in”.


(gdb) disass main
Dump of assembler code for function main:
  0x0804880c <+0>:     lea    0x4(%esp),%ecx
  0x08048810 <+4>:     and    $0xfffffff0,%esp
  # truncated
  0x08048ac7 <+699>:   push   %eax
  0x08048ac8 <+700>:   call   0x80485d0 <malloc@plt> # this gives the address
  # truncated
(gdb) break *0x8048ac8 
(gdb) start < in # in is a file with all inputs
Starting program: /tmp/a.out < in
Temporary breakpoint 3, main () at memcpy.c:49
49      int main(void){
(gdb) c
Continuing.
# program runs...
Breakpoint 2, 0x08048ac8 in main () at memcpy.c:102
102                     dest = malloc( size );
(gdb) layout regs # this will open a window with registers
(gdb) s # step over to next instruction


This is how it should look:



and we’ll keep repeating “c” (continue) and “s” (step over) to see how the eax has changed. Note that right after the call to malloc, eax stores the pointer to the allocated space.


For size 8, 16, 32, 64, 128, we have 0x804d010, 0x804d020, 0x804d038, 0x804d060, 0x804d0a8, respectively. We can see that perhaps each time it allocates (size + 8) bytes. Under this assumption, all we need to do to keep it 16-bytes-aligned is to ask for 8 bytes more each query other than the first one.

Payload

python2 -c 'print "\n".join(str(2**(i+3) + (8 if i > 0 else 0)) for i in range(10))'\
  | nc 0 9022