This function works without any problem, but it doesn't give me the speed that I expected!
issue in rewriting strlen from C to Assembly
You could handle that with AVX masking (which suppresses faults), or by doing an aligned load and then shuffling to discard bytes from before the start of the string. Or, if you never need to use this on strings that might end within 64 bytes of the end of a page, you can skip that overhead. And return a pointer to the end instead of a length, if you want.
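One way to act on the page-boundary point is to test, before a wide load, whether that load would spill into the next page. The following is only a sketch in C under stated assumptions: it assumes 4 KiB pages, and the names PAGE_SIZE_ASSUMED and load_crosses_page are hypothetical helpers, not something taken from the answer.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 4 KiB pages, the usual base page size on x86. */
#define PAGE_SIZE_ASSUMED 4096u

/* Hypothetical helper: returns true if reading `width` bytes starting at
   address `addr` would touch the following page. */
bool load_crosses_page(uintptr_t addr, size_t width)
{
    uintptr_t offset_in_page = addr & (PAGE_SIZE_ASSUMED - 1);
    return offset_in_page + width > PAGE_SIZE_ASSUMED;
}
```

A strlen built on 64-byte vectors could take a slower, safe path whenever this check is true for its first load, and use the plain unaligned load otherwise.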
What size strings did you test with? Obviously if you bottleneck on memory or L3-cache bandwidth, doubling the vector width won't help much.
Or if your strings are short (like most strings in most programs are), the terminator will be in the first 32 bytes, or the loop will only run 1 iteration whether it does 4x32 or 4x64 bytes.
If you need a strlen optimized for long strings, if possible use explicit-length strings that track their own length and don't need scanning. Also, 512-bit uops reduce max turbo, and shut down port 1 on Skylake-avx512.
Didn't see your reply because you didn't @ me. Re: handling that initial startup. Or since it's asm, you can even disassemble closed source ones for ideas. They all have to solve this problem, which is part of why implicit-length strings suck for SIMD when you have to be compatible with code that can't guarantee padding after the terminator.
Optimizing this can be a good learning exercise, but really try to use strlen less.
Disclaimer: The strlen function rarely lies on the critical path of the program, and if it does, you should store the string length in an integer variable (Pascal-style strings).
This article should be viewed as an exercise in code optimization, not as a recommendation for everyday programming. A usual implementation of the strlen function scans the string byte by byte, looking for the terminating zero. A faster version examines four bytes at once: the program reads a double word from memory, extracts each of its bytes by ANDing with a mask, and compares the bytes with zero.
That is what Agner Fog calls "vector operations in general purpose registers". Warning: this function will crash if a non-readable memory page is located right after the end of the string. The simplest way to prevent this is to allocate 3 additional bytes at the end of the string. The dwords may be unaligned, but the x86 architecture allows access to unaligned data.
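The four-bytes-at-once approach described above might look like the following in C. This is a sketch, not the article's original listing: the name strlen_dword_masks is made up here, the order of the conditions assumes a little-endian machine such as x86, and the caller must guarantee that up to 3 bytes after the terminator are readable.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sketch: examine four bytes per iteration by masking a 32-bit word.
   Assumes little-endian byte order and readable padding after the NUL. */
size_t strlen_dword_masks(const char *s)
{
    const char *p = s;
    for (;;) {
        uint32_t w;
        memcpy(&w, p, sizeof w);  /* unaligned 4-byte load */

        /* On little-endian, the lowest byte of w is the first character. */
        if ((w & 0x000000FFu) == 0) return (size_t)(p - s);
        if ((w & 0x0000FF00u) == 0) return (size_t)(p - s) + 1;
        if ((w & 0x00FF0000u) == 0) return (size_t)(p - s) + 2;
        if ((w & 0xFF000000u) == 0) return (size_t)(p - s) + 3;
        p += 4;
    }
}
```

Using memcpy for the load keeps the C well defined for unaligned addresses; compilers typically turn it into a single mov on x86.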
For small strings, the alignment will take more time than the penalty of unaligned reads. The code is not portable: you will have to add another 4 conditions if you use a 64-bit processor. For big-endian architectures, the order of conditions should be reversed. The idea is the same, but he uses clever math tricks to avoid branches inside the loop.
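One well-known trick of this kind detects a zero byte anywhere in a 32-bit word with a single test, so the loop needs only one branch per four bytes. Whether it is exactly the trick the article refers to is an assumption; the function names below are hypothetical, and the same readable-padding requirement applies.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Nonzero if any of the four bytes of w is zero.
   Subtracting 1 from each byte borrows into bit 7 only when the byte was 0
   (or when bit 7 was already set, which the "& ~w" term filters out). */
int has_zero_byte(uint32_t w)
{
    return ((w - 0x01010101u) & ~w & 0x80808080u) != 0;
}

/* Sketch: skip whole words without a zero byte, then finish byte by byte. */
size_t strlen_dword_bittrick(const char *s)
{
    const char *p = s;
    for (;;) {
        uint32_t w;
        memcpy(&w, p, sizeof w);  /* unaligned 4-byte load */
        if (has_zero_byte(w))
            break;
        p += 4;
    }
    while (*p != '\0')            /* locate the NUL inside the last word */
        p++;
    return (size_t)(p - s);
}
```

Unlike the mask-per-byte version, the word test does not depend on byte order; only the final byte-wise scan decides which position held the terminator.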
The functions were tested on several short strings (words in the Gettysburg Address) and on a long string (Lewis Carroll's Jabberwocky). Implementing strcmp, strlen, and strstr using SSE 4.2
Peter is the developer of Aba Search and Replace, a tool for replacing text in multiple files.
Brad, thank you very much for noticing. Your version has one instruction less in the loop. As far as I remember, both versions have the same speed for this reason.
I've found that standard strlen works faster than your implementation with Ubuntu Linux. For the short string your implementation is the fastest 1.
For the long string your implementation is the slowest 7. Ubuntu
The Netwide Assembler is an x86 and x86-64 assembler that uses syntax similar to Intel.
It supports a variety of object file formats.
For those using gdb with nasm, you can set gdb to use Intel-style disassembly by issuing the command set disassembly-flavor intel.
To pass the kernel a simple input command on Linux, you would pass values to the following registers and then send the kernel an interrupt signal.
To read in a single character from standard input (such as from a user at their keyboard), do the following:
After the int 0x80, eax will contain the number of bytes read.
While on Linux you pass system call arguments in different registers, on BSD systems they are pushed onto the stack (except the system call number, which is put into eax, the same way as in Linux).
BSD version of the code above:
In this example we are going to rewrite the hello world example using Win32 system calls. There are several major differences:
In order to assemble, link and run the program we need to do the following. This example was run under cygwin; in a Windows command prompt the link step would be different. In this example we use the -e command line option when invoking ld to specify the entry point for program execution.
One last note: WriteConsole does not behave well within a cygwin console, so in order to see output the final exe should be run within a Windows command prompt.
In this example we will rewrite Hello World to use printf(3) from the C library and link using gcc. This has the advantage that going from Linux to Windows requires minimal source code changes and only slightly different assemble and link steps.
In the Windows world this has the additional benefit that the linking step will be the same in the Windows command prompt and cygwin. There are several major changes:
From Wikibooks, open books for an open world.
I wrote my own implementation of strlen and strcmp from C in x86 FASM, and I would like to know whether there is anything that should be changed or improved.
Correctness
Return Value: You are violating the convention for the strlen function, which is documented as returning the number of characters between the beginning of the string and the terminating null character, without including the terminating NUL character.
Your code includes the terminating NUL, given the position of the inc ebx instruction. This may be fine if you control both the function's implementation and its usage, but it is confusing because it defies programmers' expectations and will be a recurring source of bugs.
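The off-by-one is easy to see when both behaviors are written side by side. The C sketch below is only an illustration with hypothetical function names; it is not the reviewed FASM code, and the second function merely mimics the effect of incrementing the counter before testing for the terminator.

```c
#include <stddef.h>

/* Conventional strlen behavior: the terminating NUL is not counted. */
size_t length_excluding_nul(const char *s)
{
    size_t n = 0;
    while (s[n] != '\0')
        n++;
    return n;
}

/* Counter is bumped before the test, so the NUL itself gets counted. */
size_t length_including_nul(const char *s)
{
    size_t n = 0;
    do {
        n++;
    } while (s[n - 1] != '\0');
    return n;
}
```

For the string "abc" the first function returns 3 and the second returns 4, which is the difference callers of a function named strlen will not expect.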
If you're going to return a length that includes the terminating NUL, you should consider calling your function something different than strlen.
Interface
ABI: All x86 calling conventions return a function's result in the eax register.
Although you have documented your function as returning the result in ebx, this is utterly bizarre and is guaranteed to trip up every programmer who ever uses your code. When writing everything in assembly, you are of course free to define your own custom calling conventions, but you should only do so when there is a good reason, like an optimization possibility. I can't see a good reason here. It would be just as easy for you to arrange for your code to produce the result in eax, right where programmers will expect it to be.
It is also somewhat unusual to pass an argument in the eax register, but calling conventions vary in which registers they use to pass arguments, so this isn't flying in the face of every convention ever and is therefore more excusable. However, when you're writing in assembly and you have the opportunity to make these types of decisions, you should consider your choices carefully: what makes the most sense?
Have a good reason for your choice! In this case, passing a pointer in eax makes little sense, since eax is almost universally used for return values, and pointers are almost never going to be the return value of a function.
By choosing eax as the input register, you've virtually guaranteed that every caller will need an extra mov instruction to shuffle the input parameter into the appropriate register.
Why create this situation when you don't have to?
Style
Indentation: The way you've indented the code, with the labels at the same level as the instructions, makes it difficult to read because all of the instructions aren't lined up. Instead, consider outdenting the internal labels (branch targets) so that they match the function name (external symbols).
That will allow all instructions in the function to be lined up at the same vertical column, and thus allow anyone reading the code to skim it easily. The only drawback of this is that it makes it a bit harder to determine what is a function label and what is an internal label.
Judicious use of whitespace is the most effective way to combat this. I also use a naming convention that allows me to recognize the difference at a glance. Also, use variable numbers of spaces between the opcode and the operands to ensure that all operands line up in vertical columns.
The Unix programmer often works with C-style text strings, which consist of the text followed by a NUL (i.e., a zero byte).
The C library provides a strlen function to find the length of the string. In assembly language finding the length of a C-style string is a snap. The x86 family of microprocessors comes with the scasb instruction, which searches for the first occurrence of a byte whose value is equal to that of the AL register.
The address of the start of the string itself has to be in the EDI register. Technically, it is supposed to be in the extra segment, but we do not need to worry about that in the flat 32-bit memory mode anymore. When used along with the repne prefix, the scasb instruction scans up or down the memory, depending on the direction flag, looking for the match. It quits when it either finds a match, or ECX becomes equal to 0. To find the length of a string, we need to initialize ECX to the highest value possible, which is 4,294,967,295 in the 32-bit mode.
When viewed as a signed value, it is the same as -1. The easiest way to achieve that is by first setting ECX to 0, then reversing all its bits (e.g. with not ecx). Once the scan has found the NUL, we can figure out the length of the string. We could take several approaches. The most obvious would be to subtract the start of the string from the position of the NUL.
This is how the typical C library does it. But in assembly language we have a faster way. Note that while we initialized ECX to 4,294,967,295, it was the same as -1. But the microprocessor decreased its value with every scan, including when it found the NUL.
After the scan, ECX therefore holds -(length + 2). We could negate it and then subtract the 2. But this is still not the most efficient way. We know from digital electronics that reversing all bits of a negative number results in its absolute value - 1.
We can, therefore, replace the negation and one of the decrements with not ecx. To put it all together, to find the length of the string whose starting address is in EDI, we only need a few lines of assembly language code. If we were to write a C library, naturally, we would write the strlen function in assembly language for greater speed.
SSE is essentially the floating-point equivalent of the MMX instructions.
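Returning to the repne scasb length calculation, the ECX arithmetic can be checked with a small C model. This is a simulation written for illustration, not the tutorial's assembly listing; the function name is hypothetical and the variables are named after the registers they stand in for.

```c
#include <stdint.h>

/* Models: ECX = 0, not ECX, AL = 0, repne scasb, not ECX, dec ECX.
   Assumes the direction flag is clear, so EDI moves upward. */
uint32_t strlen_scasb_model(const char *s)
{
    const unsigned char *edi = (const unsigned char *)s;
    const unsigned char al = 0;      /* the byte we search for */
    uint32_t ecx = 0;

    ecx = ~ecx;                      /* 0xFFFFFFFF, i.e. -1 as a signed value */

    /* repne scasb: compare [edi] with al, advance edi, decrement ecx,
       and stop on a match or when ecx reaches 0. */
    while (ecx != 0) {
        unsigned char byte = *edi++;
        ecx--;
        if (byte == al)
            break;
    }

    ecx = ~ecx;                      /* number of bytes scanned, NUL included */
    ecx--;                           /* drop the NUL from the count */
    return ecx;
}
```

For "abc" the scan consumes four bytes, leaving ECX at 0xFFFFFFFB; inverting gives 4 and the final decrement gives 3.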
The SSE registers are 128 bits, and can be used to perform operations on a variety of data sizes and types. Originally, an SSE register could only be used as four 32-bit single precision floating point numbers (the equivalent of a float in C). The SHUFPS instruction takes three parameters: arg1, an xmm register; arg2, an xmm or a 128-bit memory location; and IMM8, an 8-bit immediate control byte. The lower two elements will come from arg1 and the higher two elements from arg2.
The IMM8 control byte is split into four groups of bit fields that control the output into arg2. The 2-bit values are used to determine which elements are copied to arg2. One group of bits holds "indexes" into arg2, and the other holds "indexes" into arg1. Note that since the first and second arguments are equal in the following example, the mask 0x1B will effectively reverse the order of the floats in the XMM register, since the 2-bit integers are 0, 1, 2, 3.
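The selection rule can be modeled in plain C. This sketch follows the operand naming of the Intel manuals, where the destination also serves as the first source, so it may not line up with the arg1/arg2 numbering used in the text; shufps_model is a hypothetical name and no real intrinsic is involved.

```c
#include <stdint.h>

/* Model of SHUFPS dst, src, imm8 (Intel operand order).
   The two low result elements are picked from dst, the two high ones from src;
   each 2-bit field of imm8 is an element index 0..3. */
void shufps_model(float dst[4], const float src[4], uint8_t imm8)
{
    /* Copy both inputs first so the model also works when dst and src alias. */
    float a[4] = { dst[0], dst[1], dst[2], dst[3] };
    float b[4] = { src[0], src[1], src[2], src[3] };

    dst[0] = a[imm8 & 3];
    dst[1] = a[(imm8 >> 2) & 3];
    dst[2] = b[(imm8 >> 4) & 3];
    dst[3] = b[(imm8 >> 6) & 3];
}
```

With both operands set to the same register contents, 0x1B selects elements 3, 2, 1, 0 and therefore reverses the four floats.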
Had the 2-bit integers been 3, 2, 1, 0 (0xE4), the shuffle would be a no-op. Had they been 0, 0, 0, 0 (0x00), it would be a broadcast of the least significant 32 bits.
The SSE 4.2 string instructions take three parameters: arg1, an xmm register; arg2, an xmm or a 128-bit memory location; and IMM8, an 8-bit immediate control byte.
These instructions will perform arithmetic comparison between the packed contents of arg1 and arg2. The results of stage 1 and stage 2 of intermediate processing will be referred to as IntRes1 and IntRes2 respectively.
PCMPISTRI: Compares strings of implicit length and generates index in ECX.
PCMPISTRM: Compares strings of implicit length and generates a mask stored in XMM0.
PCMPESTRI: Compares strings of explicit length and generates index in ECX.
PCMPESTRM: Compares strings of explicit length and generates a mask stored in XMM0.
For more in-depth references take a look at the resources chapter of this book.
These suffixes differentiate between different versions of the operation. The first letter describes whether the instruction should be Packed or Scalar. Packed operations are applied to every member of the register, while scalar operations are applied to only the first value.
The second letter refers to the data size: either Single or Double. This simply tells the processor whether to use the register as four 32-bit floats or two 64-bit doubles, respectively.
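The packed versus scalar distinction can be illustrated with a C model of the single-precision add in both forms. These are plain loops written for illustration, not intrinsics, and the function names are hypothetical.

```c
/* Packed single (as in ADDPS): every one of the four floats is added. */
void addps_model(float dst[4], const float src[4])
{
    for (int i = 0; i < 4; i++)
        dst[i] += src[i];
}

/* Scalar single (as in ADDSS): only the lowest float is added;
   the other three elements of dst are left unchanged. */
void addss_model(float dst[4], const float src[4])
{
    dst[0] += src[0];
}
```

The double-precision forms follow the same pattern with two 64-bit elements instead of four 32-bit ones.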
This admittedly looks odd, but we can now use edx to index into s1 and s2. As we adjust edx to move forward into s2, we can then add edx to eax and this will give us the comparable offset into s1.
Fast strlen function
The input string is terminated with "0h". The output length does not include the "0" character. The input is always terminated with "0h". I have something like this, but it doesn't work:
The argument is a pointer to a string. With push dword [str] you don't pass a pointer, but the first 4 bytes of the string. Change it to push str. You correctly pushed the argument but don't clear the stack. Add an add esp, 4 after call strlen. This is the maximal count of the repetitions. So ECX can be used to determine the count of the repne run, with a little adjustment. To define a string you can use the. Other than that, your code works fine.
The value in R8 at the time your program crashes is the file descriptor returned by the open syscall. Its value is probably 3, which isn't a valid address. You'll need to store these values in a range of memory you've properly allocated. You can create a buffer in your…
There simply is no such form of je. What you can do is put a relative conditional jump based on the opposite condition, followed by an unconditional register-indirect jump: jne skip, then jmp eax, then the label skip:. You could make a macro out of this to save you from writing the same thing…
Looking at your code I was able to find the infinite loop. WriteFile expects a value, not an address pointer. Change it to…
Opcode: 0000 11. Remaining 26 bits: bits of the address of the label. Explanation: The machine language equivalent that you know so far is 0000 11xx xxxx xxxx xxxx xxxx xxxx xxxx (x represents not-known-at-this-point). To find the 32-bit machine language representation of jal func, the first thing you'd need is the…
There is no simple answer, so I won't delve into it here.
The immediate problem is that your prints destroys bx because it sets bl and bh, so your printmem loop, which requires bx to be preserved, blows up. However, it also destroys al, so your input loop won't be storing the correct value in memory to start with, either. Furthermore, while…