- Continuing from last time, we learnt how files are compressed into a smaller file and then decompressed back to the original content
- A compression model that builds on this is the idea of
*self-extracting bits*:- From an original input
B , let us use the compression algorithm from last time to create a smaller, compressed bitstream - Let us create a program
`SelfExtractExample.java`

that contains this compressed bitstream, along with the decompression trie needed to uncompress it - Let the bits that comprise this entire program (including the compressed bitstream and decompression trie) be
CB - Running the program in an interpreter will output the original file
B . Thus, running the bits ofCB createsB . In other words, theCB bits self-extract themselves to generateB

- From an original input
- Question: given a target input bitstream
B , what is the shortest possible compressed bitstreamCB that outputsB ?- We want some compression algorithm that generates the shortest possible self-extracting set of bits
CB from an inputB , such that when we runCB , we get backB - Answer: see rest of lecture, we will find that such an algorithm is an
NP class problem, which has unknown difficulty

- We want some compression algorithm that generates the shortest possible self-extracting set of bits

- We define Java-Kolmogorov complexity,
KJ(B) , as the length of the shortest Java program possible (in bytes) thatB - An answer definitively exists, it may just be very difficult to find
- Note: Kolmogorov complexity is independent of language
- For example, the Python-Kolmogorov complexity of
B ,KP(B) is no more than a constant factor larger thanKJ(B) - This is because it takes a fixed, constant amount of bytes to write a Java interpreter in Python, so
KP(B) would be no more thanKJ(B) plus the additional constant number of bytes to write the interpreter (this constant term for the interpreter allows us to generalizeKJ(B) toKP(B) ) - We can generalize this to any language, thus Kolmogorov complexity for any input bitstream is language independent

- For example, the Python-Kolmogorov complexity of
- Also note: it is impossible to create a program that calculates the Kolmogorov complexity of any bitstream

- We know the following runtimes for the following problems:
- Finding the median of an array takes
Θ(N) - Comparison based sorting:
Θ(NlogN) - Checking if an array contains a pair of elements that satisfy some criteria:
Θ(N2) - Calculating Kolmogorov complexity,
K(B) of bitstreamB : impossible

- Finding the median of an array takes
- Goal of this lecture is to figure out the complexity of a good compression algorithm, but first some detours

- For a graph consisting of nodes and edges, the independent set problem determines if there exists a set of verticies of size
k that are independent - no two vertices in the set ofk elements are adjacent on the graph - We can think of having elements in our set of
k vertices as colored, and the other elements uncolored - For a graph of
N vertices, there are2N total coloring posibilites, and for each possibility:- It takes
O(k) runtime to check if the number of colored vertices is equal tok - For each colored vertex, we need to check that all neighbors are uncolored. This takes
O(kN) - If for any of the
2N possibilities, both these conditions are met, we have solved the independent set problem for this graph

- It takes
- Thus, combining this all gives a worst-case runtime of
O(kN∗2N) for the independent set problem - Since
k≤N , this simplifies to a worst case runtime ofO(N2∗2N) - Intuition tells us that the best-case runtime of the independent set problem is
Ω(N) , since we must do something with at least each vertex once - Thus we have a complexity of
O(N2∗2N) andΩ(N) . The following sidenote sections are to tighten these bounds

- We say that problem
X reduces to problemY ifY can be used to solveX - For example, if
Y is sorting andX is finding the median element ignoring all even elements of an array,X can be solved usingY . Thus,X reduces toY . - Another example: the 3SAT problem
- Let us have a boolean formula, that is a set of three-variable disjunctive constraints (something like
`(x1 or x2 or x3) and (! x1 or ! x2 or x4) and (! x1 or x3 or ! x4) and (x1 or x3 or x4)`

, where each parenthesis group represents a single three-variable disjunctive constraint) - 3SUM problem: given such a formula, assign values to the variables that satisfies the formula
- We can use the independent set problem to solve the 3SUM problem, by constructing a graph as follows:
- For each disjunctive constraint clause, add three vertices to the graph in a triangle (each vertex contains a variable of the clause, such as
`x3`

or`!x1`

) - Between each occurence of a variable and its negation (for example, all
`x1`

,`!x1`

pairs) in the graph, add an additional edge (thus all`x1`

,`!x1`

pairs are connected)

- For each disjunctive constraint clause, add three vertices to the graph in a triangle (each vertex contains a variable of the clause, such as
- We solve 3SUM by finding an independent set in this graph of size
k , wherek is the number of variables in the boolean formula (in the above example,4 )

- Let us have a boolean formula, that is a set of three-variable disjunctive constraints (something like
- Summary: 3SUM reduces to the independent set problem. Thus, a worst case lower bound on 3SAT also applies to the optimal algorithm for solving the independent set problem

- A
**decision problem**is one that has a yes or no answer- Example: determine if an array has a duplicate pair of elements
- Not example: count number of duplicate pairs in an array

- A problem is said to be
**in complexity class**if:P - It is a decision problem
- It can be
*solved*inO(Nk) time, for somek - Note:
P stands for polynomial, referring to polynomial time - Minor note:
N is the number of bits needed to specify the problem's input

- For example, the problem of determining if an array of numbers has a pair of elements that sum to zero has runtime
O(N2) (sort, then use two pointers), and is in complexity classP - A problem is said to be
**in complexity class**if:NP - It is a decision problem
- A 'yes' answer to the problem can be
*verified*inO(Nk) time for somek - More precicely, a specific 'yes' example for the problem can be verified in
O(Nk) time

- For example, the independent set problem is in complexity class
NP , as it takesO(kN)≤O(N2) time to verify that there indeed exists a set of independent vertices of some sizek≤N - About the complexity classes:
- Problems in
P are considered "easy" - The runtime,
O(Nk) is closed under addition (run aP algorithmN times, and the runtime is still inP - still in thP complexity class) - For practical problems, the
k exponent is usually small - Most practical problems are in
NP - Example 1: given an integer
Z , does there exist two primesX andY whose product isZ ? - Example 2: is there a way to route an airline's airplanes such that the total cost is lower than my operating expense's annual budget?

- Example 1: given an integer

- Problems in

- Let a problem
π be defined asif:NP -completeπ is inNP π solves every other problem inNP (all problems inNP reduce toπ )

- A 1971 proof by Stephen Cook proved that the 3SAT problem is
NP -complete- 3SAT (described above) is in
NP , and if we can efficiently solve 3SAT, we can efficiently solve*any*decision problem whose answer can be efficiently verified (decision problem inNP )

- 3SAT (described above) is in
- High level understanding of Cook's proof:
- Create an enormous boolean logic expression that represents the entire state of a computer at every timestep
- For example, we could have a boolean variable in this expression that is true if the
kth bit of memory is true, and we are on thejth line of code and on thelth execution cycle - If the solution to this giant logic expression takes polynomial time, the boolean logic circuit is polynomial in size

- In 1972, Dick Karp showed that 21 famous
NP problems solve 3SAT, and are thus alsoNP -complete- Since then, thousands more problems have been proven to be
NP -complete - Some
NP -complete problems: 3SAT, independent set, hamiltonian path finding, hamiltonian cycle finding, finding longest paths in graph, and even determining if a Super Mario Bros. level has a solution

- Since then, thousands more problems have been proven to be
- We prove problem
X isNP -complete by solving anyNP -complete problem withX - The worst and best case runtime bound for any
NP -complete problem applies to*all*otherNP -complete problems (including the independent set problem) - There is no known difficulty/complexity for
NP problems - Connection to last lecture:
- Finding a good compression algorithm is an
NP class problem, and for allNP problems there is no definite known complexity.NP problems are conjectured to be quite hard (complexity somewhere between a runtime likeΘ(N2) and impossible (like Kolmogorov complexity))

- Finding a good compression algorithm is an

- Review:
P is the set of decision problems that can be solved inO(Nk) timeNP is the set for which a yes answer can be verified inO(Nk) timeNP -complete problems areNP problems that solve all otherNP problems

P problems andNP -complete problems are a subset of allNP problems- This is intuitively because if you can solve a problem, you can verify it

- Question posed by Cook in 1971: are all
NP problems also inP ?- are efficiently verifiable problems also efficiently solvable (
P=NP=NP -complete)?

- are efficiently verifiable problems also efficiently solvable (
- It is unknown if
P=NP - to date, nobody has provided a valid proof - If in fact
P=NP , then we would be able to determine statements like "does there exist a program of less thanX bytes that compressesB in less thanC cycles?"