
COMP9319 week1 notes

The difference between online and physical lectures is that in person you can talk with your peers after the lecture.

Raymond Wong also doesn’t like the course name. The course was missing a lecturer and he became the one to teach it.
compression - bad
search - ok

Compression and search means that after making the data small, you can still search it.

The lecturer of COMP6714 also resigned, and Raymond Wong became the lecturer of that course as well.

Even if you did badly in other CSE courses, if you are interested in this one, you have the potential to learn it well.

Compression

There are two types of compression: lossless and lossy.
Lossy does not mean bad; sometimes we want information loss, such as blurring the background of a photo.
But in this course we treat all information as valuable and focus on lossless compression.

What is compression?
Firstly, we encode the data into a special format, which reduces the total number of bits needed to represent the data.

Input data -> Encode -> Store / transmit over networks -> Decode -> Output data
Later, we decode it and get the original data back.

Compression ratio = uncompressed size / compressed size

Space savings = 1 - 1/compression ratio = 1 - compressed size / uncompressed size

Example:
There is a file of 30 bits; this is the uncompressed size.
After compression it becomes 10 bits; this is the compressed size.
Then the compression ratio is 30/10 = 3:1.
Space savings is 1 - 1/3 = 2/3.
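
A quick sketch of the two formulas above in code, using the numbers from this example:

    # Compression ratio and space savings for the 30-bit -> 10-bit example.
    uncompressed = 30
    compressed = 10

    ratio = uncompressed / compressed      # 3.0, i.e. 3:1
    savings = 1 - 1 / ratio                # 2/3, i.e. about 67% of the space saved
    print(ratio, savings)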

first assignment

raaabbccccdabbbbeee$

The encoder part of the teacher’s solution is 150 lines and the decoder part is 160 lines.

$ represents the end of the file; you normally cannot see it in a file. The teacher likes $ and uses it to mark the end.

So it is r1a3b2c4d1a1b4e3
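
A minimal sketch of this naive character+count encoding (the $ end marker is left out here):

    from itertools import groupby

    # Naive run-length encoding: each run becomes character + count.
    def naive_rle(s: str) -> str:
        return "".join(ch + str(len(list(g))) for ch, g in groupby(s))

    print(naive_rle("raaabbccccdabbbbeee"))   # r1a3b2c4d1a1b4e3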

But this is still not good enough: the original has a single r which takes up 1 byte, and after compression the result is r1, which takes up 2 bytes.
That is not compression but expansion.

A count of 2 is also pointless: rr becomes r2. You save no space and only make the text unreadable.

The rule is that you don’t change anything unless doing so actually gains you something.

So you do nothing unless there are at least 3 repeated characters in a row.

But here is another problem. What if the original text is rrccr2?
It can’t be encoded as r2c2r2, because the decoder could no longer tell whether a digit is a run length or part of the original text.

You have to define a rule that marks where a run of duplicate symbols starts and where it ends.

Solution:
Every symbol is stored in a byte.
If the first bit is 0, the remaining 7 bits represent a character. Otherwise, the remaining 7 bits give the run length of the previous character.

So the encoded text is just a sequence of bytes: ByteByteByteByteByte.

Here is another problem: we only have 7 bits to represent the length, so the length value ranges from 0 to 127. Then what if the length is 100000000000000000?

We can use two or more bytes to represent length.
But why not just say 100 billion instead?

Scheme 2:
AAA -> A=3, i.e. stored lengths start counting from 3.
Because A and AA are never changed, the smallest encoded run is 3, so we can store (length - 3) in the 7 length bits.
Then if the length is 130, you are still safe, because 130 maps to 127.
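
A minimal sketch of this byte-level scheme (flag bit plus length-minus-3), assuming plain ASCII input so every literal fits in 7 bits; runs longer than 130 are simply split into several runs:

    def rle_encode(data: bytes) -> bytes:
        # A byte with the top bit clear is a literal character; a byte with the
        # top bit set stores (run length - 3) of the previous character.
        # Runs shorter than 3 are left unchanged.
        out = bytearray()
        i = 0
        while i < len(data):
            ch = data[i]
            run = 1
            while i + run < len(data) and data[i + run] == ch:
                run += 1
            i += run
            while run > 0:
                if run < 3:                    # too short to be worth encoding
                    out.extend([ch] * run)
                    run = 0
                else:
                    chunk = min(run, 130)      # 3..130 fits in 7 bits after -3
                    out.append(ch)
                    out.append(0x80 | (chunk - 3))
                    run -= chunk
        return bytes(out)

    def rle_decode(data: bytes) -> bytes:
        out = bytearray()
        for b in data:
            if b & 0x80:                       # length byte: total run is (b & 0x7f) + 3
                out.extend(out[-1:] * ((b & 0x7f) + 2))
            else:                              # literal character
                out.append(b)
        return bytes(out)

    text = b"raaabbccccdabbbbeee"
    assert rle_decode(rle_encode(text)) == text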

Sometimes, when you try to improve things further, you actually go backwards, because the extra complexity hurts the efficiency of the algorithm.

problem: runs are usually small

In ordinary text, repeated characters seldom appear; there are few words with 3 identical characters in a row.

a glimpse of BWT+RLE

It will be introduced in week 3.
It may only take you 5 minutes to implement BWT, but it will take you a week to understand why it works.
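
As a teaser, a naive quadratic BWT sketch (the real thing in week 3 uses suffix arrays); it only shows how the transform groups identical characters together:

    # Naive BWT: sort all rotations of the string and take the last column.
    # '$' is the end-of-file marker the lecturer likes.
    def bwt(s: str) -> str:
        s += "$"
        rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
        return "".join(rot[-1] for rot in rotations)

    print(bwt("banana"))   # annb$aa: identical characters cluster, which helps RLE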

HTTP compression

When the browser requests a page, the server compresses the HTML page using gzip and sends the compressed data to the browser.
HTTP/1.1 200 OK
Content-Type: text/html; charset=UTF-8
Content-Encoding: gzip

Then the browser uses gzip to decode it, gets the content, and renders the page.
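
A rough sketch of both ends of this exchange using Python's gzip module (the HTML string is made up):

    import gzip

    html = b"<html><body>" + b"<p>hello</p>" * 1000 + b"</body></html>"

    compressed = gzip.compress(html)           # server side: Content-Encoding: gzip
    print(len(html), "->", len(compressed))    # far fewer bytes go over the network

    restored = gzip.decompress(compressed)     # browser side: decode, then render
    assert restored == html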

technology

Technology has a very short lifetime: you either make the money or you get out.

If you work at Amazon, you will use compression techniques because there are message queues there. If you can make the messages 10 times smaller, then the queue can hold 10 times more messages.

If you’re an application programmer, you may never use it in your career. But if you’re a platform programmer, you do care about it.

Storwize

The big companies make efforts to develop green energy not because they love the earth, but because electricity is expensive. Labour costs you can cut by firing people, but the storage equipment has to be kept running and maintained every month, because the clients’ data is stored on it.

Less to store, less power used.

But if you ask a student who isn’t studying 9319, they will probably say: who cares about compression? I can just buy another portable hard drive.

Server -> switches -> IBM real-time compression equipment -> storage

A hard drive is considered good if it fails only about once in every 50,000 hours of testing.
But for Google, which must own well over 50,000 hard drives, some drive may fail at any moment even when they are brand new.
And this is the scale of problem you face when you have to store big data.

Anti-virus

There is a company called cement tech.
It stores its anti-virus definitions in the Amazon cloud.
Every time a user downloads from the server, the server checks whether the content contains a virus. So the data has to be compressed; otherwise the bandwidth would be huge and cost too much money.

You may wonder: why don’t they build their own cloud? Because that would be more expensive. With a storage service you only pay for the time you actually use.

So this is why, when you download software or an operating-system update, you still have to wait a long time after the download finishes: the software is shipped compressed to save money, and expanding the data during decompression takes a long time.

SD card

Now that a mobile phone can have 256 GB of storage, you may need to search through up to 256 GB of information, and the computation cost is huge.

Microsoft

STAC was the company that invented the compression product. If you had been born 30 years earlier, you could easily have founded a company like that.
Their product was less than 5000 lines of code, including the interfaces.
Microsoft dispatched technical staff to check whether the company really had something, and they were shocked that the product was only about 5000 lines. So Microsoft did not pay for it; they copied it after they came back.
Although the staff at STAC were just undergraduates, they found a good lawyer and eventually won the case.

gzip

Sometimes, you add extra info during compression.

After the first run of gzip, the file gets smaller.
But if we rename the compressed file and run gzip on it again, the file gets bigger.

This is because when gzip cannot make the data any smaller, it only adds overhead.
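
A quick way to see this effect with Python's gzip module (the input data is made up):

    import gzip

    data = b"hello world " * 10_000        # highly repetitive input
    once = gzip.compress(data)             # first pass: much smaller
    twice = gzip.compress(once)            # second pass: the data already looks random,
                                           # so gzip only adds its own overhead
    print(len(data), len(once), len(twice))   # expect len(twice) > len(once)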

similarity measure

If things are more repetitive or redundant, they compress better.
If compressing two images together gives a much better result than compressing them separately, it means there is a lot of overlap between them, just like repeated characters.
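
A small sketch of this similarity idea with gzip; the two strings are my own example:

    import gzip

    a = b"the quick brown fox jumps over the lazy dog " * 200
    b = b"the quick brown fox jumps over the lazy cat " * 200

    together = len(gzip.compress(a + b))
    separate = len(gzip.compress(a)) + len(gzip.compress(b))
    print(together, separate)   # the more a and b overlap, the smaller 'together'
                                # is compared to 'separate'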

Some people came out these ideas 20 years ago. They built a company but failed because they couldn’t make a theory into a good product.

NOTE

Change your perspective on information; this is vital for this course.
You can go to ~cs9319/wk1 on CSE.
There are two text files, eg1.txt and eg2.txt, both with the same size of 80 B.
eg1.txt has more repetitive characters than eg2.txt.
So eg1.txt’s compressed size is smaller.

sunny day and rainy day

We can use bit 1 to represent a rainy day and bit 0 for a sunny day.
This is what eg1.bin does, and you can use xxd -b eg1.bin to look at it.

you can make it more efficient using run-length coding

Run-length coding is the only compression algorithm you have learnt so far.
If bit 0 represents a sunny day, we only need the first bit to say whether the sequence starts sunny or rainy, followed by the run lengths.
So 00000000 00000111 (13 sunny days followed by 3 rainy days) can be written as 0[13] 1[3].
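
A tiny check of the run lengths in that bit string:

    from itertools import groupby

    bits = "0000000000000111"                    # 13 sunny (0) then 3 rainy (1)
    runs = [(b, len(list(g))) for b, g in groupby(bits)]
    print(runs)                                  # [('0', 13), ('1', 3)]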

even even even more efficient by losing info

You can just say S80, because in the normal case it is sunny; you only record the main information and lose the rest.

sunny day, cloudy day and rainy day

Sunny day is marked as 00.
Rainy day is marked as 01.
Cloudy day is marked as 10.

probability

more sunny days than rainy and cloudy

Then we should use fewer bits to represent sunny and more bits to represent rainy and cloudy.

So they can be marked as:
S: 0
R: 10
C: 11
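
A rough comparison of the average bits per day under the fixed 2-bit code and this variable-length code, assuming made-up probabilities (0.8 sunny, 0.1 rainy, 0.1 cloudy):

    # Expected code length = sum over symbols of probability * code length.
    p = {"S": 0.8, "R": 0.1, "C": 0.1}           # assumed probabilities
    fixed = {"S": 2, "R": 2, "C": 2}             # 2-bit code from the previous scheme
    var = {"S": 1, "R": 2, "C": 2}               # S: 0, R: 10, C: 11

    print(sum(p[s] * fixed[s] for s in p))       # 2.0 bits per day
    print(sum(p[s] * var[s] for s in p))         # 1.2 bits per day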

ASCII Code

Since there are more A’s than Z’s in an English book, why are they encoded with the same bit length?
A: 65
Z: 90

UTF-8

It uses between 1 and 4 bytes per character.
If you use xxd to look at a file with Chinese characters, the lecturer said it would use fewer bits to represent common Chinese characters.

But when I tested it, every Chinese character took 6 hex digits = 24 bits (3 bytes) in UTF-8.
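
A quick way to check UTF-8 byte lengths yourself (the characters below are my own picks):

    # UTF-8 length depends on the code point: ASCII is 1 byte, most Chinese
    # characters are 3 bytes (24 bits), emoji are 4 bytes.
    for ch in ["A", "é", "中", "😀"]:
        b = ch.encode("utf-8")
        print(ch, len(b), "bytes:", b.hex())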

use bits to mark the end of file

If you don’t use an EOF symbol, the last byte may be padded out to 00000000. The decoder would read this byte as 8 sunny days, but it is only padding and does not represent anything.

So there is the solution:
S: 0
R: 10
C: 110
EOF: 111
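
A small sketch of this code with the EOF symbol and zero padding; the helper functions and the example day string are my own:

    CODE = {"S": "0", "R": "10", "C": "110", "EOF": "111"}   # the prefix-free code above

    def encode(days: str) -> bytes:
        bits = "".join(CODE[d] for d in days) + CODE["EOF"]
        bits += "0" * (-len(bits) % 8)           # pad the final byte with zeros
        return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

    def decode(data: bytes) -> str:
        bits = "".join(f"{b:08b}" for b in data)
        rev = {v: k for k, v in CODE.items()}
        out, cur = [], ""
        for bit in bits:
            cur += bit
            if cur in rev:
                if rev[cur] == "EOF":            # everything after this is padding
                    break
                out.append(rev[cur])
                cur = ""
        return "".join(out)

    assert decode(encode("SSSRSCS")) == "SSSRSCS"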

problem of RLE

RLE stands for run-length encoding.
Most documents don’t have many runs.

BWT

If you use a BWT-based compressor to compress a file, the file gets smaller.
If you then use gzip to compress it again, the file gets bigger, because the redundancy has already been removed and gzip only adds overhead.

BWT can not only be used for compression; you can also search the data after BWT compression.
And its search costs essentially constant time, which means the search is independent of file size.
If a search on a 1 MB BWT-compressed file costs 1 second,
then a search on a 1 GB BWT-compressed file still costs only 1 second.

what is COMP9319?

  • how different compression algorithms work

    We only learn the classic algorithms, because the new algorithms are derived from them. After you learn the classic ones, you can easily learn the new ones by yourself.

    We also care about small devices, not only the big companies’ machines, because the final endpoint is the user: our own computers. If you can make your program faster on your device, then you can make millions of computers run faster.

    Algorithms are money. If you figure out an algorithm that makes files smaller, then what you are doing is saving money.

  • how to search without index?

    If search is slow, you can have the database create an index using a B-tree or a hash table; it only takes one command.
    But what is the cost?
    You have to maintain the index: whenever you update the data, you need to update the index as well. Indexes are not cheap; they work well only when you are rich and can buy many servers.
    Don’t build an index unless it is necessary, because you need to load more from disk into memory and the CPU needs to do more work to use the index.
    So how do you use the minimum amount of index and get the maximum search efficiency? That’s what this course will discuss.

course info

  • Friday: 2 hours live

  • 3 hours pre-recorded

Raymond Wong

Areas: databases, information retrieval, big data SYSTEMS, TEXT mining

He is a systems person, not a theory person.

live lectures

He will interact with students, except in week 1, because in week 1 students have no idea what to ask.
Possible questions:

  • How do you implement this?

  • exercises
    Exams are similar to exercises.

consultations

You can see the schedule in webcms.

Online:
Tuesday 13:30 and Wednesday 20:00 for weeks 2-5 and 7-11
via Moodle (Blackboard Collaborate)
In person:
Friday 14:00 for weeks 3-5 and 7-11
K-E12-114-BUS114

two assignments

If you think the assignments are not difficult enough, you can ask for an advanced project with 2 sub-assignments.